Job stuck in queue and workers are sitting idle

This happens very often: jobs are stuck in the queue while workers sit idle. How can this be improved? There was nothing else going on with any other worker (or endpoint, for that matter).
hotsnr
hotsnr7d ago
+1
kironkeyz
kironkeyz7d ago
I am having problems with setting things up as well, we need help
nerdylive
nerdylive7d ago
Check your logs.
3WaD
3WaD7d ago
It happens to me from time to time too. There are no logs to check because nothing is running on the workers. I think it might be something with the orchestrator, as the only solution is to cancel the request and send a new one. If you don't, the request is usually executed only after a long time. This can randomly happen with a previously perfectly working worker.
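A minimal sketch of the cancel-and-resubmit workaround described above, assuming the standard RunPod serverless REST routes (/run, /status, /cancel) and a hypothetical endpoint ID and API key read from the environment; the exact response fields may differ from what your endpoint returns.
```python
import os
import time
import requests

# Assumptions: standard RunPod serverless REST routes and hypothetical
# endpoint ID / API key taken from environment variables.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(payload: dict) -> str:
    """Queue a job and return its ID."""
    r = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS)
    r.raise_for_status()
    return r.json()["id"]

def run_with_requeue(payload: dict, stuck_after_s: int = 120) -> dict:
    """Cancel and resubmit a job if it sits IN_QUEUE for too long."""
    job_id = submit(payload)
    started = time.time()
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status.get("status") in ("COMPLETED", "FAILED"):
            return status
        if status.get("status") == "IN_QUEUE" and time.time() - started > stuck_after_s:
            # Job never got picked up: cancel it and queue a fresh one.
            requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS)
            job_id = submit(payload)
            started = time.time()
        time.sleep(5)
```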
nerdylive
nerdylive6d ago
Ohh... sorry, what did you do to the endpoint? You can also open a support ticket to ask staff to check, but what I think might have happened is that you used the same tag for a release? Or not.
Justin
Justin6d ago
+1 My worker booted up and then sat idle, and the job is sometimes stuck for minutes.
jim
jim4d ago
Same issue. Workers are "running" but they're not working on any requests, and requests just sit there queued for 10+ minutes without anything happening @Justin Merrell @flash-singh
flash-singh
flash-singh3d ago
This usually means the worker isn't picking up the job. Do you have an endpoint ID or anything else we can look at on our end?
jim
jim3d ago
TristenHarr
TristenHarr3d ago
Same issue! We can't move into production because of this. https://discord.com/channels/912829806415085598/1340773964397674709
jim
jim3d ago
What is runpod doing???
nerdylive
nerdylive3d ago
Shipping some other features. That bug shouldn't happen though, maybe staff should check your endpoints.
TristenHarr
TristenHarr3d ago
For me, there are no misconfigurations as far as I can tell. What happens is: a request comes in, a worker becomes active, then the job sits in the queue for 10+ minutes before getting picked up. It's not FlashBoot (that's enabled), and the logs say everything is ready from the worker's perspective. (I've checked the logs extensively; it's not stuck loading a model or anything of that nature. The worker should be ready.) This happens even with multiple workers set up, where only one becomes active and then everything sits in the queue for 10 minutes.

Once things spin up they seem to work fine, but every time there's a new spin-up there's a risk it'll take 10+ minutes. What I've been doing is increasing the time before a worker spins down, then trying to find a "good" worker and keep it alive as long as I can, even sending redundant requests just to avoid getting a "bad" or "stuck" worker.

It's also intermittent/flaky: sometimes it spins up quickly and works fine, sometimes it gets stuck like this. It doesn't happen every time, maybe 10-30% of the time.
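A hedged sketch of the redundant-request workaround mentioned above: periodically sending a tiny request so a known-good worker never idles out. It assumes a handler that answers a hypothetical {"warmup": True} input quickly, and uses the standard /runsync route with placeholder endpoint ID and API key.
```python
import os
import time
import requests

# Assumptions: a hypothetical {"warmup": True} input the handler answers
# quickly; endpoint ID and API key are placeholders from the environment.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def keep_warm(interval_s: int = 60) -> None:
    """Send a lightweight request on a timer so the worker stays busy enough not to spin down."""
    while True:
        try:
            r = requests.post(URL, json={"input": {"warmup": True}},
                              headers=HEADERS, timeout=30)
            print("warmup status:", r.json().get("status"))
        except requests.RequestException as exc:
            print("warmup ping failed:", exc)
        time.sleep(interval_s)

if __name__ == "__main__":
    keep_warm()
```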
nerdylive
nerdylive3d ago
Yes, I think there's some edge case happening there, which is why it's useful for them to check your endpoint. Also, can you debug more of what happens before you call runpod.serverless.start in your handler.py? Print at every step, or debug it somehow, to figure out what might be going wrong. Is it maybe your model that takes long to load? Or do you download the model on every worker startup to container disk (outside /runpod-volume)? I'm just listing possibilities that might happen, so if it's none of those, let me know.
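A minimal handler.py sketch of that kind of startup logging: timestamped prints around model loading and before runpod.serverless.start, so the worker logs show exactly where startup time goes. The load_model call and model path are hypothetical placeholders.
```python
import time
import runpod

def log(msg: str) -> None:
    # Timestamped, flushed prints so they appear in the worker logs immediately.
    print(f"[{time.strftime('%H:%M:%S')}] {msg}", flush=True)

log("container started, imports done")

# Hypothetical model load; if this is the slow step, it will show up here
# in the worker logs rather than as a job stuck in the queue.
t0 = time.time()
model = None  # e.g. model = load_model("/runpod-volume/model")  # network volume, not container disk
log(f"model loaded in {time.time() - t0:.1f}s")

def handler(job):
    log(f"handler received job {job.get('id')}")
    # ... run inference with `model` on job["input"] ...
    return {"echo": job["input"]}

log("calling runpod.serverless.start, worker should report ready now")
runpod.serverless.start({"handler": handler})
```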
TristenHarr
TristenHarr3d ago
For sure, I think this is a known issue they are looking into! Will give them some time to dig in and if I still have problems later this week I'll follow up. 🙂
blue whale
blue whaleOP2d ago
I can share, but this has been intermittent. We are planning to roll out a production-ready application, and this has been infuriating.
deanQ
deanQ2d ago
Is anyone still experiencing this today? Please report and indicate an endpoint or worker ID. Thanks.
getsomedata
getsomedata2d ago
Yep, I had 58 workers idle (the logs say ready) and only 10 workers running. My queue had 58 jobs for over 30 minutes. I killed it and tried many things before joining Discord and seeing others have the same problem. I am recreating a new endpoint and will share the endpoint ID.
getsomedata
getsomedata2d ago
I have set 78 max workers and 78 active workers but still have only 7 workers running. Idle node log:
2/20/2025, 3:15:41 PM loading container image from cache
Loaded image: xxxxxxx
xxx Pulling from xxxx
Digest: xxxxxx
Status: Image is up to date for xxxxxx
worker is ready
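A small sketch for watching the gap between idle and running workers described above, assuming the endpoint's /health route returns worker and job counts; the exact response keys below are a best guess and may differ.
```python
import os
import time
import requests

# Assumptions: the /health route reports worker/job counts; the key names
# used here are a best guess at the response shape.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

while True:
    health = requests.get(URL, headers={"Authorization": f"Bearer {API_KEY}"}).json()
    workers = health.get("workers", {})
    jobs = health.get("jobs", {})
    # Many idle workers alongside a large inQueue count is the symptom
    # being reported in this thread.
    print(f"idle={workers.get('idle')} running={workers.get('running')} "
          f"inQueue={jobs.get('inQueue')} inProgress={jobs.get('inProgress')}")
    time.sleep(30)
```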
Dj
Dj2d ago
@getsomedata Can you give me your endpoint id? We're looking into this, thank you for waiting!
blue whale
blue whaleOP2d ago
Would love to know the findings. Don't want to give up on RunPod.
Dj
Dj2d ago
@blue whale I'm told this incident should be resolved for most users, can you share an endpoint ID if you're still seeing this problem?
