harishp (3mo ago)

Not all workers being utilized

In the attached image you can see that 11/12 workers spun up, but only 7 are being utilized, yet we're being charged for all 12 GPUs. @girishkd
nerdylive (3mo ago)
What do you mean only 7 are being utilized? That looks like 11 running to me.
harishp (3mo ago)
If you look at the "Jobs" section, it shows 7 in progress. So it is not utilizing all the GPUs to serve requests; only 7 are serving them.
nerdylive (3mo ago)
Hm, what app are you running there? Maybe check the logs for each worker and see if anything looks off.
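(For reference, one quick way to make a bad GPU visible in a worker's logs is to probe CUDA once at startup. A minimal sketch, assuming a PyTorch-based worker; the function and logger names are illustrative, not part of the original setup.)
```python
import logging

import torch

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")


def log_cuda_state() -> None:
    """Log whether CUDA is usable so a broken GPU shows up in the worker logs."""
    if not torch.cuda.is_available():
        log.error("CUDA is not available in this worker")
        return
    idx = torch.cuda.current_device()
    log.info("GPU: %s", torch.cuda.get_device_name(idx))
    log.info("CUDA version PyTorch was built with: %s", torch.version.cuda)
    # A tiny allocation forces driver initialization, so a bad GPU fails here
    # (and gets logged) instead of failing mid-request.
    torch.zeros(1, device="cuda")


log_cuda_state()
```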
harishp (3mo ago)
It's just an SDXL model.
girishkd (3mo ago)
A CUDA failure was seen on some of the GPUs, and when we remove those GPUs from the list of workers, they are not spinning up again.
nerdylive (3mo ago)
Oh, any logs? Are you with him? Yeah, then that's probably why it fails. Try limiting the CUDA versions in the endpoint settings.
harishp (3mo ago)
We limited the CUDA versions to 12.1. @girishkd and I are colleagues.
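(As an aside, after restricting CUDA versions it can help to confirm, from inside a running worker, which CUDA version the host actually exposes versus what the image's PyTorch was built against. A rough sketch, assuming a PyTorch image with nvidia-smi available in the container.)
```python
import subprocess

import torch

# Host driver / CUDA version as seen from inside the container.
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
for line in smi.splitlines():
    if "CUDA Version" in line:
        print(line.strip())

# CUDA toolkit the installed PyTorch wheel was built against (e.g. "12.1").
print("torch built with CUDA:", torch.version.cuda)
```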
nerdylive (3mo ago)
Ohh, I see. Great, so is it good now after setting that?
harishp (3mo ago)
Nope, nope.
nerdylive (3mo ago)
Oh, so what's happening now?
digigoblin (3mo ago)
What kind of CUDA failure? Did it OOM from running out of VRAM? I've seen that happen on 24GB GPUs when you add upscaling.
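(If VRAM does turn out to be the cause, one common pattern is to catch the CUDA OOM explicitly so the worker recovers instead of getting stuck. A minimal sketch, assuming a diffusers SDXL pipeline; the pipe object and the fallback resolution are illustrative assumptions, not the original handler code.)
```python
import torch


def generate(pipe, prompt: str):
    """Run the SDXL pipeline, retrying once at a smaller size on CUDA OOM."""
    try:
        return pipe(prompt).images[0]
    except torch.cuda.OutOfMemoryError:
        # Free cached blocks and retry at a lower resolution rather than
        # leaving the worker wedged with a failed CUDA context.
        torch.cuda.empty_cache()
        return pipe(prompt, height=768, width=768).images[0]
```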
girishkd (3mo ago)
The attached screenshot shows the CUDA failure we are experiencing.
girishkd (3mo ago)
We are only using the 24GB ones (4090s).
digigoblin (3mo ago)
Oh yeah, that error looks like it's due to a broken worker.
girishkd (3mo ago)
Okay. These broken workers are not getting respawned on their own. What should we do in that case?
digigoblin (3mo ago)
Contact RunPod support via web chat or email.
nerdylive (3mo ago)
Broken worker? Wow, there's such a thing?
digigoblin (3mo ago)
Yeah, it happens sometimes, just like broken pods. I've had to terminate workers a few times.