Not all workers being utilized
In the attached image you can see 11/12 workers spun up, but only 7 are being utilized, yet we're being charged for all 12 GPUs. @girishkd
what do you mean 7 are being utilized?
that seems like 11 running
If you look at the "Jobs" section, it shows 7 in progress. So it's not utilizing all the GPUs to serve requests; only 7 are serving them.
Hm what app are you running there?
maybe check the logs for each worker
and see if anything stinks
it's just an SDXL model
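For anyone following along, an SDXL serverless worker is typically shaped roughly like this (a minimal sketch, assuming the runpod Python SDK and diffusers; the model ID and parameters are illustrative, not our exact code):

```python
# Minimal sketch of an SDXL serverless worker (illustrative, not our exact handler).
# Assumes the runpod Python SDK and diffusers are installed in the worker image.
import base64
import io

import runpod
import torch
from diffusers import StableDiffusionXLPipeline

# Load the model once at startup so every job reuses the same pipeline.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")


def handler(event):
    """Handle one job: generate an image from the prompt in the job input."""
    prompt = event["input"].get("prompt", "a photo of an astronaut")
    image = pipe(prompt, num_inference_steps=30).images[0]

    # Return the image as base64 so it fits in the JSON response.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image_base64": base64.b64encode(buf.getvalue()).decode("utf-8")}


runpod.serverless.start({"handler": handler})
```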
A CUDA failure was seen on some of the GPUs, and when we remove those GPUs from the list of workers, they are not spinning up
Oh any logs?
are you with him? yeah, then that's probably why it fails.
try limiting the CUDA versions from the settings
we limited the CUDA versions to 12.1
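A quick way to confirm what the worker actually sees at startup, so a broken GPU shows up in the logs right away (a rough sketch, assuming a PyTorch-based image; the exit-on-failure behavior is just one possible choice):

```python
# Rough sketch: check the GPU/CUDA runtime at worker startup so failures
# show up in the logs immediately instead of only when a job arrives.
import sys

import torch


def check_cuda() -> None:
    if not torch.cuda.is_available():
        print("CUDA not available on this worker, exiting", file=sys.stderr)
        sys.exit(1)
    # torch.version.cuda is the CUDA version this PyTorch build was compiled against.
    print(f"torch CUDA build: {torch.version.cuda}")
    print(f"device: {torch.cuda.get_device_name(0)}")
    # A tiny allocation surfaces 'CUDA error' style failures early.
    torch.zeros(1, device="cuda")


if __name__ == "__main__":
    check_cuda()
```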
@girishkd and I are colleagues
ohh i see, great
so is it good now after setting that?
nope nope
oh so what happened now?
What kind of CUDA failure? Did it OOM from running out of VRAM?
I've seen that happen on 24GB GPUs when you add upscaling.
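If it does turn out to be VRAM, the usual diffusers memory savers are worth a try on 24GB cards (a sketch, assuming a diffusers SDXL pipeline; the model ID and prompt are illustrative):

```python
# Sketch: trim SDXL VRAM usage on 24GB cards and surface OOMs clearly.
# Assumes a diffusers SDXL pipeline; tune for your own workload.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")

# Memory savers: slice attention and tile the VAE decode.
pipe.enable_attention_slicing()
pipe.enable_vae_tiling()

try:
    image = pipe("a castle on a hill", num_inference_steps=30).images[0]
except torch.cuda.OutOfMemoryError:
    # Free cached blocks and fail the job with a clear message instead of
    # leaving the worker wedged with a half-allocated GPU.
    torch.cuda.empty_cache()
    raise RuntimeError("SDXL generation ran out of VRAM on this worker")
```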
The attached screenshot shows the CUDA failure we are experiencing
We are using 24GB ones (4090s) only
Oh yeah that error seems to be due to a broken worker.
Okay. These broken workers are not getting respawned on their own. What should we do in that case?
Contact RunPod support via web chat or email
Broken worker? Wow, there's such a thing
Yeah it happens sometimes just like broken pods. I had to terminate workers a few times.