RuntimeError: The NVIDIA driver on your system is too old (found version 11080). Please update your
I deployed a new version today but keep running into this error. Did something change on RunPod? Thanks!
https://discord.com/channels/912829806415085598/1023588055174611027/1173632165423104051
Maybe filter by the CUDA version, if you are expecting a 12.0+ version of CUDA?
That's my guess.
You can't filter by CUDA version in serverless, only in GPU Cloud. It would be awesome to get all machines onto the latest CUDA version though.
😮
When that happens, the workers get stuck in the running state and keep costing money 😦, since that's part of the caching code that runs before the handler is called. Is there any improvement coming?
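A rough sketch of the kind of guard that would avoid that hang (assuming a Python worker where the heavy caching happens at import time in handler.py; `torch.cuda.is_available()` returns False with a warning, rather than raising, when the driver is too old for the build):

```python
import sys

import torch

# Run this before the expensive download / caching step. If the host driver
# is too old for the CUDA build baked into the image, fail fast instead of
# letting a later CUDA call crash and leave the worker stuck in "running".
if not torch.cuda.is_available():
    print(
        f"CUDA unusable: torch was built for CUDA {torch.version.cuda}, "
        "but the host driver does not support it (or no GPU is visible).",
        file=sys.stderr,
    )
    sys.exit(1)
```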
@flash-singh ?
@Alpay Ariyak: I saw you say:
https://discord.com/channels/912829806415085598/1194109966349500498/1194731299898933348
Do you have any ideas on what to do when you need a certain CUDA version for serverless, but get handed a worker with a lower CUDA version, which then leads to a crash?
It would be great to be able to filter out the old CUDA versions for serverless.
However, I still think there should be a timeout on setting up the worker (the max time allowed before the handler is called).
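For what it's worth, you can approximate that budget on the worker side today. A minimal sketch, Linux-only since it relies on SIGALRM (which only interrupts the main thread, and some native calls may not notice it until they return), with a made-up 300-second limit and a placeholder setup function:

```python
import signal
import sys


def load_and_cache_models():
    # Placeholder for whatever runs before the handler (downloading weights,
    # warming up the model, etc.).
    pass


def _setup_timed_out(signum, frame):
    print("Worker setup exceeded its time budget, exiting.", file=sys.stderr)
    sys.exit(1)


SETUP_TIMEOUT_SECONDS = 300  # made-up value, tune to your model size

signal.signal(signal.SIGALRM, _setup_timed_out)
signal.alarm(SETUP_TIMEOUT_SECONDS)
try:
    load_and_cache_models()
finally:
    signal.alarm(0)  # cancel the alarm once setup is done
```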
You got a worker with a CUDA version lower than 11.8?
I think @ssssteven got a worker with 11.8, but I'm guessing he needs a worker with 12.0+, and it caused a crash that left the worker hanging, with him just paying for the hang time.
I see, the feature to specify worker CUDA version is in the works to my knowledge, but not currently out, so the easiest route would be to try to make everything work with 11.8, as workers with both 11.8 and 12.0+ should be compatible that way.
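If you do pin to 11.8, a quick sanity check that the image actually ships a cu118 build might help (sketch only; the exact version strings are examples):

```python
import torch

# A wheel from the cu118 index reports something like "2.1.2+cu118" and "11.8";
# a cu121 build will trip the old-driver error on hosts that only expose 11.8.
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
assert torch.version.cuda and torch.version.cuda.startswith("11.8"), \
    "this image is not built against CUDA 11.8"
```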
Not possible with things like Oobabooga, and the latest xformers requires CUDA 12 as well, so it would be better if all machines were on CUDA 12, which has been out for several months already.
Is v12 a breaking change from v11 for CUDA?
Just wondering, never tried.
Or is v12 always backwards compatible?
12 is backwards compatible
Interesting.. I guess the answer, till CUDA filtering for serverless is out, is 11.8... 😦
Not really an acceptable answer/solution since you can't use Torch 2.1.2 with xformers 0.0.23.post1 on CUDA lower than 12
can we at least implement the timeout?
@flash-singh / @Alpay Ariyak Yeah. I do think you guys need to at least catch failures before handler.py, and either refresh the worker or kill it if it fails to initialize before handler.py runs.
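Until the platform does that, the worker can at least do it to itself: run the init before registering the handler and exit hard if it fails. A rough sketch (`init_models` is a made-up placeholder; `runpod.serverless.start` is the usual entry point for Python workers):

```python
import sys

import runpod


def init_models():
    # Made-up placeholder for the caching / model-loading step that currently
    # hangs when the host driver is older than the image's CUDA build expects.
    pass


try:
    init_models()
except Exception as exc:
    # Exit instead of hanging, so the worker doesn't sit there billing for nothing.
    print(f"Worker init failed: {exc}", file=sys.stderr)
    sys.exit(1)


def handler(job):
    # Only reached if init succeeded.
    return {"ok": True}


runpod.serverless.start({"handler": handler})
```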
What's the worker ID?
@flash-singh I can't find it anymore. It's not in the logs. The endpoint is d7n1ceeuq4swlp and it happened a few mins before I posted this question.
Oops.. it just happened again: A100 80GB - iuot3yjoez7 bjo
@JM @Justin can we track this down
Tracking down hosts with outdated CUDA?
yep
thank you