Intermittent Slow Performance Issue with GPU Workers
I am running into an intermittent issue where some GPU workers are significantly slower than others. I measured the time taken for a specific task on one designated worker type (4090 24GB). Normally, when I send the exact same payload to the endpoint, the execution time is around 1 minute. Occasionally, however, a worker becomes exceptionally slow: with the same payload, Docker image, tag, and GPU type, the execution time stretches to a few hours. Notably, during these episodes the GPU utilization stays at a constant 0%.
The output log confirms that inference is unusually slow while the affected worker is running. Has anyone experienced a similar problem, and if so, how did you resolve it?
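For reference, this is roughly how I measure it: timing the task and sampling GPU utilization from inside the worker. A minimal sketch, assuming a Python handler with pynvml available; the `run_inference` stub and the handler shape are just placeholders, not my actual code:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
_gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def run_inference(payload):
    # Placeholder for the real model call on this endpoint.
    time.sleep(1)
    return {"echo": payload}

def handler(job):
    start = time.time()
    result = run_inference(job["input"])
    elapsed = time.time() - start

    # Utilization sampled right after the run; on the slow workers it
    # reads 0% even while a job is supposedly executing.
    util = pynvml.nvmlDeviceGetUtilizationRates(_gpu)
    print(f"execution took {elapsed:.1f}s, GPU util {util.gpu}%")
    return result

if __name__ == "__main__":
    handler({"input": {"prompt": "test"}})
```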
Your insights and assistance in addressing this issue would be greatly appreciated. Thank you.
4 Replies
If you can share your endpoint ID + template + what workload you are running, it would be helpful if the staff sees this later. Also share the worker ID if you end up finding out which one is causing it.
https://discord.com/channels/912829806415085598/1194813255391125595
Could this be what I’m seeing as well?
We have exactly the same problem here. 4090, sometimes fast ~10it/s, sometimes slow ~5it/s (doubling the costs!) – for the exact same request/workload.
Has the problem been solved for you in the meantime @n8tzto?
Send some request IDs
Or maybe worker IDs too
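In case it helps, a rough sketch of how those IDs could be collected when submitting jobs, assuming the standard serverless /run and /status REST calls; the endpoint/API-key placeholders and response field names such as workerId are assumptions, so check them against what your endpoint actually returns:

```python
import os
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # placeholder env var
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit a job and keep the request ID it returns.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=HEADERS,
    json={"input": {"prompt": "test"}},
)
request_id = resp.json()["id"]
print("request id:", request_id)

# Poll the status; field names like "workerId" and "executionTime"
# are assumptions and may differ, but this is the kind of info to post.
status = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{request_id}",
    headers=HEADERS,
).json()
print("status:", status.get("status"),
      "worker:", status.get("workerId"),
      "executionTime:", status.get("executionTime"))
```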