RunPod•11mo ago
ssssteven

RuntimeError: The NVIDIA driver on your system is too old (found version 11080). Please update your

I deployed a new version today but keep running into this error. Did something change on RunPod? Thanks!
23 Replies
justin
justin•11mo ago
https://discord.com/channels/912829806415085598/1023588055174611027/1173632165423104051 Maybe filter by the CUDA version, if you are expecting a 12.0+ version of CUDA? That's my guess.
ashleyk
ashleyk•11mo ago
You can't filter by CUDA version in serverless, only in GPU Cloud. It would be awesome to get all machines onto the latest CUDA version though.
justin
justin•11mo ago
😮
ssssteven
sssstevenOP•11mo ago
When that happens, the workers get stuck in the running state and keep costing money 😦 since the failure is in the caching code that runs before the handler is called. Is there any improvement coming?
ashleyk
ashleyk•11mo ago
@flash-singh ?
justin
justin•11mo ago
@Alpay Ariyak: I saw you say: https://discord.com/channels/912829806415085598/1194109966349500498/1194731299898933348 Do you have any ideas on what to do when you need a certain CUDA version for serverless but get handed a worker with a lower version, which then leads to a crash?
ssssteven
sssstevenOP•11mo ago
It would be great to filter out the old CUDA versions for serverless. However, I still think there should be a timeout on setting up the worker (a max time allowed before the handler is called).
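(A minimal sketch of a user-side fail-fast guard for the situation described above, assuming a standard `handler.py` built on the `runpod` Python SDK and PyTorch: the CUDA check runs before any model caching, so a worker with an incompatible driver exits immediately instead of hanging in the running state. The check itself is an assumption, not RunPod behavior.)

```python
# handler.py -- illustrative sketch, not RunPod's implementation
import sys

import runpod
import torch


def cuda_ok() -> bool:
    """Try a trivial CUDA op; the 'NVIDIA driver ... is too old' case raises RuntimeError here."""
    try:
        torch.zeros(1, device="cuda")
        return True
    except RuntimeError as exc:
        print(f"CUDA check failed: {exc}", file=sys.stderr)
        return False


# Fail fast *before* any expensive caching so an incompatible worker exits
# instead of sitting in the running state and billing for hang time.
if not cuda_ok():
    sys.exit(1)


def handler(job):
    # ... normal inference code ...
    return {"status": "ok"}


runpod.serverless.start({"handler": handler})
```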
Alpay Ariyak
Alpay Ariyak•11mo ago
You got a worker with a CUDA version lower than 11.8?
justin
justin•11mo ago
I think @ssssteven got a worker with 11.8, but I'm guessing he needs a worker with 12.0+, and it caused a crash that left the worker hanging, so he's just paying for hang time.
Alpay Ariyak
Alpay Ariyak•11mo ago
I see. The feature to specify the worker CUDA version is in the works to my knowledge, but not currently out, so the easiest route would be to make everything work with 11.8; both 11.8 and 12.0+ workers should be compatible that way.
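(As an aside on diagnosing this: the "found version 11080" in the thread title is the driver's maximum supported CUDA runtime encoded as major*1000 + minor*10, i.e. 11.8. The hedged sketch below logs that value next to the CUDA version the installed PyTorch wheel was built against, assuming `pynvml` (nvidia-ml-py) is present in the image; a 12.x-built torch landing on an 11.8 driver then shows up immediately in the worker logs.)

```python
# cuda_report.py -- diagnostic sketch; assumes pynvml (nvidia-ml-py) is installed
import pynvml
import torch

pynvml.nvmlInit()
driver_cuda = pynvml.nvmlSystemGetCudaDriverVersion()  # e.g. 11080 -> CUDA 11.8
pynvml.nvmlShutdown()

print(f"Driver supports up to CUDA {driver_cuda // 1000}.{(driver_cuda % 1000) // 10}")
print(f"Installed torch wheel was built for CUDA {torch.version.cuda}")
```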
ashleyk
ashleyk•11mo ago
Not possible with things like Oobabooga, and the latest xformers requires CUDA 12 as well, so it would be better if all machines were on CUDA 12, which has been out for several months already.
justin
justin•11mo ago
Is the jump from v11 to v12 breaking for CUDA? Just wondering, never tried. Or is v12 always backwards compatible?
ashleyk
ashleyk•11mo ago
12 is backwards compatible, so code built for 11.8 runs on machines with CUDA 12 drivers
justin
justin•11mo ago
Interesting... I guess the answer, till CUDA filtering for serverless is out, is 11.8... 😦
ashleyk
ashleyk•11mo ago
Not really an acceptable answer/solution since you can't use Torch 2.1.2 with xformers 0.0.23.post1 on CUDA lower than 12
ssssteven
sssstevenOP•11mo ago
can we at least implement the timeout?
justin
justin•11mo ago
@flash-singh / @Alpay Ariyak Yeah. I do think you guys at least need to catch failures in handler.py, and then refresh the worker or kill it if it fails to initialize before the handler is called.
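(A sketch of the user-side half of that suggestion, assuming the expensive caching step is an ordinary function in `handler.py`; `load_model` here is a hypothetical stand-in. Wrapping the initialization in a try/except and exiting with a non-zero status lets a worker that cannot initialize die quickly instead of hanging and accruing cost. Whether the platform then recycles the worker is up to RunPod.)

```python
# handler.py -- sketch; load_model() is a hypothetical stand-in for the caching code
import sys

import runpod


def load_model():
    # Expensive download / warm-up that currently crashes on workers
    # whose driver is too old for the CUDA build in the image.
    ...


try:
    model = load_model()
except Exception as exc:  # includes the "NVIDIA driver ... is too old" RuntimeError
    print(f"Worker initialization failed, exiting: {exc}", file=sys.stderr)
    sys.exit(1)


def handler(job):
    # Use the cached model for inference.
    return {"output": repr(model)}


runpod.serverless.start({"handler": handler})
```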
flash-singh
flash-singh•11mo ago
what's the worker id?
ssssteven
sssstevenOP•11mo ago
@flash-singh I can't find it anymore. It's not in the logs. The endpoint is d7n1ceeuq4swlp and it happened a few mins before I posted this question. Oops... it just happened again: A100 80GB - iuot3yjoez7 bjo
flash-singh
flash-singh•11mo ago
@JM @Justin can we track this down
Justin Merrell
Justin Merrell•11mo ago
Tracking down hosts with outdated CUDA?
flash-singh
flash-singh•11mo ago
yep
ssssteven
sssstevenOP•11mo ago
thank you