🆘 We've encountered a serious issue with the machines running in our production environment
🆘 We've encountered a serious issue with the machines running in our production environment on RunPod: the GPU utilization fluctuates wildly, sometimes even dropping to zero, which significantly slows down task execution. Who should I contact?
15 Replies
Tip: rather than starting with an SOS sign, which makes the title hard to read, make it clearer by stating the problem and then describing it
So what you're saying is you're not using the gpu at all, no model inference but the gpu usage is still up and down?
If so, try reporting via the website
The reason we're using SOS is because we've encountered this issue in a production environment, which directly affects the user experience, but I don't know who to turn to for help.
All good!
During the inference process, we received feedback from users that the inference speed was particularly slow. Upon checking, we confirmed that the issue was indeed related to the inference, but the GPU utilization was either zero or very low.
Did it just happen without any production changes?
can you replicate it onto another pod?
Despite all other conditions remaining unchanged, sometimes the inference speed is fast, and at other times it is very slow, even though the model has already been loaded into the GPU memory.
it seems to me that the nvidia-smi output looks normal
yeah
Same config?
but gpu kernels are not running at all
inference speed is extremely low
yes
Can you replicate it onto another pod?
maybe for now switch to another pod while reporting it to RunPod
GPU utilization fluctuates wildly, sometimes even dropping to zero, and we have changed nothing!
This is going to take up more of our time, and we are short-staffed. I just want to know if Runpod has technical personnel who can help us troubleshoot this issue. We have checked the code logic and found no issues.
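If it helps when filing the report, here is a minimal sketch of a logger that samples `nvidia-smi` once per second and flags moments where a GPU is nearly idle even though memory is still allocated, which is the symptom described above. The helper names (`parse_sample`, `idle_while_loaded`, `monitor`) and the thresholds (5% utilization, 1024 MiB) are assumptions for illustration, not anything RunPod provides; it also assumes `nvidia-smi` is on the PATH and reports numeric values for every queried field.

```python
import subprocess
import time

# Fields requested from nvidia-smi; all are standard --query-gpu properties.
QUERY = "timestamp,index,utilization.gpu,memory.used"


def parse_sample(line):
    """Parse one CSV line of nvidia-smi output into a dict.

    Assumes every field is numeric (no "[N/A]" values).
    """
    ts, index, gpu_util, mem_used = [field.strip() for field in line.split(",")]
    return {
        "timestamp": ts,
        "index": int(index),
        "gpu_util": int(gpu_util),         # percent
        "mem_used_mib": int(mem_used),     # MiB
    }


def sample_gpus():
    """Query nvidia-smi once and return one parsed sample per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_sample(line) for line in result.stdout.splitlines() if line.strip()]


def idle_while_loaded(sample, util_threshold=5, mem_threshold_mib=1024):
    """True if the GPU is (almost) idle although memory is allocated.

    Both thresholds are hypothetical defaults; tune them for your workload.
    """
    return (
        sample["gpu_util"] <= util_threshold
        and sample["mem_used_mib"] >= mem_threshold_mib
    )


def monitor(interval_s=1.0, duration_s=60.0):
    """Print one line per GPU per interval and return the count of idle samples."""
    idle_count = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for s in sample_gpus():
            flag = "IDLE-WHILE-LOADED" if idle_while_loaded(s) else ""
            if flag:
                idle_count += 1
            print(s["timestamp"], f"gpu{s['index']}",
                  f"util={s['gpu_util']}%", f"mem={s['mem_used_mib']}MiB", flag)
        time.sleep(interval_s)
    return idle_count
```

Running something like `monitor(interval_s=1.0, duration_s=300)` on the affected pod during a slow request, and again on a healthy pod with the same config, would give a timestamped log to attach to the support request.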
how to report to runpod?
Well maybe yes, but not here
Contact button from the website
then you'll be redirected into another page
OK
thx
Np, hope you can resolve this soon!