Getting slow workers randomly
We’re running a custom ComfyUI workflow on RTX 4090 instances with a volume attached. Around 70% of the time we get normal workers, where the delay time is around 8-10 seconds.
But sometimes we get random slow workers. I attached a screenshot of the request logs: you can see that the worker with id hv5rbk09kzckc9 takes around 11-12 seconds to execute the exact same ComfyUI workflow on the same GPU, whereas the other worker, with id lgmvs58602xe61, takes 2-3 seconds.
When we get a slow worker, it's just slower in every respect: GPU inference takes 5x longer, and ComfyUI import times take 7-8x longer than on a normal worker.
Hey, after a quick check, both workers are running on servers in the exact same location with the same specs. You might want to run a few more tests with similar input. Also, keep in mind that the cold start time could be adding more delay than the actual processing time, so when comparing performance, make sure both workers are warmed up and already running.
@yhlong00000 thank you for the response! That makes it even weirder, though 😅 ComfyUI is initialized when the worker boots up, and the prompt sent to both of these workers is exactly the same. You can see the slow worker's cold start execution time was 134 seconds and its warm requests were taking around 12 seconds, whereas the other took 41 seconds for cold start and 3 seconds for warm execution.
I'm not sure how vCPUs work, but my guess is that these slow workers are getting bottlenecked by their CPUs somehow, since ComfyUI's custom node imports are taking way longer than usual on them.
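One way I could test that theory is to time the CPU, volume, and GPU pieces separately inside a throwaway handler, something like this (just a rough sketch, not our real workflow; the benchmark sizes and the volume path are placeholders):

```python
# Rough per-request benchmark sketch (not our real handler): times a CPU-only
# task, a read from the attached volume, and a GPU matmul separately, so a
# "slow" worker shows WHICH resource is actually slow.
import os
import socket
import time

import numpy as np
import torch
import runpod


def bench():
    timings = {}

    # CPU-only work (rough stand-in for ComfyUI's custom node import cost).
    t0 = time.perf_counter()
    a = np.random.rand(2000, 2000)
    _ = a @ a
    timings["cpu_matmul_s"] = round(time.perf_counter() - t0, 3)

    # Read from the attached volume (placeholder path, adjust to the real mount).
    model_path = "/runpod-volume/some_model.safetensors"
    if os.path.exists(model_path):
        t0 = time.perf_counter()
        with open(model_path, "rb") as f:
            f.read(256 * 1024 * 1024)  # read up to 256 MB
        timings["volume_read_s"] = round(time.perf_counter() - t0, 3)

    # GPU work (rough stand-in for the actual workflow inference).
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        _ = x @ x
    torch.cuda.synchronize()
    timings["gpu_matmul_s"] = round(time.perf_counter() - t0, 3)

    return timings


def handler(job):
    # Return timings plus enough host info to tell workers apart later.
    return {
        "hostname": socket.gethostname(),
        "cpu_count": os.cpu_count(),
        "gpu": torch.cuda.get_device_name(0),
        "timings": bench(),
    }


runpod.serverless.start({"handler": handler})
```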
I agree that the example you provided highlights big differences. However, my point is that it's typically necessary to run a larger set of tests to demonstrate that this worker is consistently slower on average, which would indicate a real issue. It's difficult to confirm a performance problem from a small sample of data.
Yes, for sure, but the thing is I've had this issue many times over the past 2-3 months. What's the best way for me to gather a large sample for you to easily check and spot the issue? This screenshot is a rare scenario where I have the request logs for both a slow and a fast worker. AFAIK there's no way for users to gather logs from RunPod serverless endpoints, so this is all I could gather as evidence.
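If it helps, I could script something like this to build a bigger sample myself and paste the summary here (rough sketch; the endpoint ID, the API key env var, and the exact response field names like workerId / delayTime / executionTime are assumptions based on what my request logs show):

```python
# Sketch: fire N identical requests at the endpoint and group the reported
# timings by worker ID, so slow workers stand out over a larger sample.
import os
import time
from collections import defaultdict
from statistics import mean

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"            # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]      # assumes the key is in this env var
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
PROMPT = {"workflow": "..."}                # placeholder: the same ComfyUI prompt every time

per_worker = defaultdict(list)

for _ in range(50):
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": PROMPT},
        timeout=600,
    ).json()
    worker = resp.get("workerId", "unknown")
    per_worker[worker].append(
        (resp.get("delayTime", 0), resp.get("executionTime", 0))
    )
    time.sleep(1)  # short gap so the worker stays warm between requests

for worker, samples in sorted(per_worker.items()):
    delays, execs = zip(*samples)
    print(
        f"{worker}: n={len(samples)}, "
        f"avg delay={mean(delays):.0f}, avg exec={mean(execs):.0f} "
        f"(units as reported by the API)"
    )
```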
I too seem to get this issue on 4090s. Is anything being done to narrow down the issue?