harishp (3mo ago)

Not all workers being utilized

In the attached image you can see that 11/12 workers spun up, but only 7 are being utilized, yet we're being charged for all 12 GPUs. @girishkd
nerdylive (3mo ago)
What do you mean only 7 are being utilized? That looks like 11 running to me.
harishp (3mo ago)
If you look at the "Jobs" section, it shows 7 in progress. So it is not utilizing all the GPUs to serve requests; only 7 are serving them.
nerdylive (3mo ago)
Hm, what app are you running there? Maybe check the logs for each worker and see if anything looks off.
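(For reference, one quick way to make a bad GPU visible in a worker's logs is to probe CUDA once at startup. A minimal sketch, assuming a PyTorch-based worker; the function and logger names are illustrative, not part of the original setup.)
```python
import logging

import torch

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")


def log_cuda_state() -> None:
    """Log whether CUDA is usable so a broken GPU shows up in the worker logs."""
    if not torch.cuda.is_available():
        log.error("CUDA is not available in this worker")
        return
    idx = torch.cuda.current_device()
    log.info("GPU: %s", torch.cuda.get_device_name(idx))
    log.info("CUDA version PyTorch was built with: %s", torch.version.cuda)
    # A tiny allocation forces driver initialization, so a bad GPU fails here
    # (and gets logged) instead of failing mid-request.
    torch.zeros(1, device="cuda")


log_cuda_state()
```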
harishp (3mo ago)
It's just an SDXL model.
girishkd (3mo ago)
A CUDA failure was seen on some of the GPUs, and when we remove those GPUs from the list of workers, they are not spinning up again.
nerdylive (3mo ago)
Oh, any logs? Are you with him? Yeah, then that's probably why it fails. Try limiting the CUDA versions in the endpoint settings.
harishp (3mo ago)
We limited the CUDA versions to 12.1. @girishkd and I are colleagues.
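(As an aside, after restricting CUDA versions it can help to confirm, from inside a running worker, which CUDA version the host actually exposes versus what the image's PyTorch was built against. A rough sketch, assuming a PyTorch image with nvidia-smi available in the container.)
```python
import subprocess

import torch

# Host driver / CUDA version as seen from inside the container.
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
for line in smi.splitlines():
    if "CUDA Version" in line:
        print(line.strip())

# CUDA toolkit the installed PyTorch wheel was built against (e.g. "12.1").
print("torch built with CUDA:", torch.version.cuda)
```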
nerdylive (3mo ago)
Ohh, I see. Great, so is it good now after setting that?
harishp (3mo ago)
Nope, nope.
nerdylive (3mo ago)
Oh, so what's happening now?
digigoblin (3mo ago)
What kind of CUDA failure? Did it OOM from running out of VRAM? I've seen that happen on 24GB GPUs when you add upscaling.
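(If VRAM does turn out to be the cause, one common pattern is to catch the CUDA OOM explicitly so the worker recovers instead of getting stuck. A minimal sketch, assuming a diffusers SDXL pipeline; the pipe object and the fallback resolution are illustrative assumptions, not the original handler code.)
```python
import torch


def generate(pipe, prompt: str):
    """Run the SDXL pipeline, retrying once at a smaller size on CUDA OOM."""
    try:
        return pipe(prompt).images[0]
    except torch.cuda.OutOfMemoryError:
        # Free cached blocks and retry at a lower resolution rather than
        # leaving the worker wedged with a failed CUDA context.
        torch.cuda.empty_cache()
        return pipe(prompt, height=768, width=768).images[0]
```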
girishkd (3mo ago)
The attached screenshot shows the CUDA failure we are experiencing.
girishkd (3mo ago)
We are only using the 24GB ones (4090s).
digigoblin (3mo ago)
Oh yeah, that error looks like it's due to a broken worker.
girishkd (3mo ago)
Okay. These broken workers are not getting respawned on their own. What should we do in that case?
digigoblin (3mo ago)
Contact RunPod support via web chat or email.
nerdylive (3mo ago)
Broken worker? Wow, there's such a thing?
digigoblin (3mo ago)
Yeah, it happens sometimes, just like broken pods. I've had to terminate workers a few times.