Urgent: All new GPU pods are broken
Hi, both our existing pods and the new pods we create have the same issue: they cannot find any CUDA devices, and all of them give this error:
Warning: caught exception 'CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.', memory monitor disabled
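Since the warning points at `CUDA_VISIBLE_DEVICES` and the environment setup, a quick first check inside the pod is to see what that variable is set to and whether any NVIDIA device nodes are exposed to the container. This is only a rough sketch: the helper name `describe_gpu_environment` is made up for illustration, and it assumes a Linux container where GPUs show up under `/dev/nvidia*`.

```python
import glob
import os


def describe_gpu_environment(environ=None):
    """Collect basic facts about how GPUs are exposed to this process.

    `environ` defaults to os.environ; passing a dict makes it easy to test.
    """
    env = os.environ if environ is None else environ
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        visibility = "unset (CUDA may use every GPU it can see)"
    elif visible.strip() in ("", "-1"):
        # An empty value or an invalid index such as -1 hides all GPUs.
        visibility = "set to hide all GPUs"
    else:
        visibility = f"restricted to: {visible}"
    return {
        "CUDA_VISIBLE_DEVICES": visibility,
        # Assumption: on Linux the driver exposes GPUs as /dev/nvidia* nodes.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
    }


if __name__ == "__main__":
    for key, value in describe_gpu_environment().items():
        print(f"{key}: {value}")
```

If the variable is unset and no device nodes are listed, the problem is more likely on the host or driver side than in the template.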
11 Replies
@ashleyk Tagging you since you might know whether something has changed in the SD web UI repo, or whether there is a general issue right now.
All of our prod is down.
You can use @Papa Madiator 's GPU analyser tool in #📚|resources
What GPU type and region? And is it secure cloud or community cloud?
It is all GPUs in community cloud.
You can try one as well; it reproduces on all 4090 pods.
I don't have issues with community cloud or 4090. Which template are you using?
I don't use this template so can't help unfortunately, maybe @TheLastBen can advise why it's unable to find the GPU.
Just launched a new pod on secure cloud, and it seems to be working.
But community cloud has problems right now, either with the CUDA version or something else.
just started a fresh 4090 pod, it works fine now
@xPaghkman next time it happens, provide the pod ID so RunPod can check it out and perhaps delist the machine if it's broken.
one of them is:
g9tct1kr323mh0
stopped others.
Can't find the other IDs. Anyway, we just moved all our processes to secure cloud.
Have you tried to run cuda is_available? I just generated an image.
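For anyone hitting this later, the check suggested above can be sketched roughly as follows, assuming "cuda is_available" refers to PyTorch's `torch.cuda.is_available()`. The function name `check_cuda` is hypothetical, and reading the pod ID from a `RUNPOD_POD_ID` environment variable is an assumption about the pod environment; if it is not set, the ID is reported as "unknown". The script also runs `nvidia-smi -L` when the tool is on the PATH, so the output can be pasted into a support thread together with the pod ID.

```python
import os
import shutil
import subprocess


def check_cuda():
    """Gather a small report on whether this machine can see a CUDA GPU."""
    # Assumption: the pod exposes its ID through RUNPOD_POD_ID.
    report = {"pod_id": os.environ.get("RUNPOD_POD_ID", "unknown")}

    # Driver-level view: list GPUs with nvidia-smi if it is installed.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "-L"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi"] = f"nvidia-smi failed: {exc}"
    else:
        report["nvidia_smi"] = "nvidia-smi not found"

    # Framework-level view: ask PyTorch, if it is installed.
    try:
        import torch
    except ImportError:
        report["torch_cuda_available"] = None  # PyTorch not installed
        return report

    report["torch_cuda_available"] = torch.cuda.is_available()
    if report["torch_cuda_available"]:
        report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_cuda().items():
        print(f"{key}: {value}")
```

If `nvidia-smi` lists a GPU but PyTorch reports `False`, that would point toward a CUDA or driver version mismatch rather than a missing device.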