Job stuck in queue and workers are sitting idle
This has been happening very often: jobs get stuck in the queue while workers sit idle. How can this be improved? Nothing else was running on any other worker (or endpoint, for that matter).

22 Replies
+1
I am having problems setting things up as well, we need help
check your logs..
It happens to me from time to time too. There are no logs to check because nothing is running on the workers. I think it might be something with the orchestrator, since the only solution is to cancel the request and send a new one. If you don't, the request usually only gets executed after a long time. This can happen randomly with a previously perfectly working worker.
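For what it's worth, a rough sketch of that cancel-and-resubmit workaround. It assumes the usual serverless endpoint routes (/run, /status/{id}, /cancel/{id}) with placeholder API_KEY and ENDPOINT_ID values, so treat it as a sketch and check the current docs rather than dropping it in as-is:
```python
# Sketch: resubmit a job if it sits IN_QUEUE too long.
# Routes and auth header are assumed from the standard v2 endpoint API;
# API_KEY / ENDPOINT_ID are placeholders.
import time
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(payload):
    r = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()["id"]

def run_with_requeue(payload, stuck_after=120, max_attempts=3):
    """Cancel and resubmit if the job stays IN_QUEUE longer than stuck_after seconds."""
    for _ in range(max_attempts):
        job_id = submit(payload)
        deadline = time.time() + stuck_after
        while True:
            status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
            if status["status"] not in ("IN_QUEUE", "IN_PROGRESS"):
                return status  # COMPLETED, FAILED, etc.
            if status["status"] == "IN_QUEUE" and time.time() > deadline:
                # Looks stuck: cancel and try again with a fresh request.
                requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS, timeout=30)
                break
            time.sleep(5)
    raise RuntimeError("job kept getting stuck in queue")
```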
Ohh... sorry, what did you do to the endpoint?
You can also open a support ticket to ask staff to check, but what I think might have happened is that you reused the same tag for a release. Or not?
+1
My worker booted up and then sat idle, and the job is sometimes stuck for minutes
Same issue. Workers are "running" but they're not working on any requests, and requests just sit there queued for 10+ minutes without anything happening
@Justin Merrell @flash-singh
this usually means the worker isn't picking up the job. Do you have an endpoint ID or anything else we can look at on our end?
Yes zvhg9gcnqmkugx
Also tracking here: https://discord.com/channels/912829806415085598/1341839170787741786
Same issue! We can't move into production because of this. https://discord.com/channels/912829806415085598/1340773964397674709
What is RunPod doing???
Shipping some other features, but that bug shouldn't happen; maybe staff should check your endpoints
For me, there are no misconfigurations as far as I can tell. What happens is RunPod gets a request, a worker becomes active, then the job sits in the queue for 10+ minutes before getting picked up. It's not FlashBoot (that's enabled), and the logs say everything is ready from the worker's perspective. (I've checked the logs extensively; it's not stuck loading a model or anything of that nature. The worker should be ready.)
This happens even with a multi-worker setup, where only one worker becomes active and then everything sits in the queue for 10 minutes.
Once things spin up they seem to work fine, but every time there's a new spin-up there's a risk it'll take 10+ minutes.
What I've been doing is increasing the time before it spins down, then trying to find a "good" worker and keep it open as long as I can, even sending redundant requests just to avoid getting a "bad" or "stuck" worker (rough sketch of that keep-warm ping below).
It's also intermittent/flaky: sometimes it spins up quickly and works fine, sometimes it gets stuck like this. It's not happening every time; I'd say maybe 10-30% of the time.
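The keep-warm ping mentioned above looks roughly like this. It assumes your handler treats an input like {"warmup": true} as a cheap no-op (that part is hypothetical), and uses the standard /run route with placeholder ENDPOINT_ID / API_KEY values:
```python
# Sketch: periodically send a lightweight job so the endpoint keeps a warm worker.
# The {"warmup": true} input is a hypothetical no-op your handler would need to support.
import time
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def keep_warm(interval=60):
    """Ping the endpoint with a cheap job at a fixed interval."""
    while True:
        try:
            requests.post(URL, json={"input": {"warmup": True}}, headers=HEADERS, timeout=30)
        except requests.RequestException as exc:
            print(f"warmup ping failed: {exc}")
        time.sleep(interval)

if __name__ == "__main__":
    keep_warm()
```
Combined with a longer idle timeout in the endpoint settings, the pings arrive before the worker spins down, so the same "good" worker keeps getting reused.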
Yes, maybe there's some edge case happening there; that's why it's worth having them check your endpoint
Oh, can you add more debugging before you call the serverless.start function in your handler.py?
Like print every line or otherwise debug it
to figure out what might be going wrong
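Something like this minimal handler.py sketch is what I mean; the setup and inference bits are placeholders, only runpod.serverless.start is the actual SDK call:
```python
# handler.py -- minimal sketch of the "print every line" debugging suggested above.
import time
import runpod

print("handler.py: module import started", flush=True)

t0 = time.time()
# placeholder for the real (slow) setup: model load, weight downloads, etc.
print(f"handler.py: setup finished in {time.time() - t0:.1f}s", flush=True)

def handler(event):
    # If the worker log stops before this line prints, the job never reached the handler.
    print(f"handler: received job {event.get('id')}", flush=True)
    result = {"echo": event["input"]}  # placeholder for the real inference call
    print("handler: job finished", flush=True)
    return result

print("handler.py: calling runpod.serverless.start", flush=True)
runpod.serverless.start({"handler": handler})
```
If "worker is ready" shows up but the "received job" line never prints, the job isn't reaching the handler at all, which points at the queue/orchestrator rather than your code.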
Could it be that your model takes a long time to load in that case?
Or that you download the model to the container disk (outside /runpod-volume) on every worker startup?
I'm just listing possibilities, so if it's not any of that, let me know
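If it is the model download, caching the weights on the network volume avoids re-downloading on every cold start. A minimal sketch, assuming huggingface_hub and a placeholder model ID:
```python
# Sketch: cache model weights on the network volume (/runpod-volume, as mentioned above)
# instead of re-downloading to container disk on every worker start.
# MODEL_ID is a placeholder; assumes huggingface_hub is installed in the image.
import os
from huggingface_hub import snapshot_download

MODEL_ID = "your-org/your-model"        # placeholder
CACHE_DIR = "/runpod-volume/models"     # persists across workers attached to the same volume

os.makedirs(CACHE_DIR, exist_ok=True)

# The first worker to run this downloads the weights; later cold starts reuse the cached copy.
model_path = snapshot_download(repo_id=MODEL_ID, cache_dir=CACHE_DIR)
print(f"model weights available at {model_path}")
```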
For sure, I think this is a known issue they are looking into! Will give them some time to dig in and if I still have problems later this week I'll follow up. 🙂
I can share, but this has been intermittent. We are planning to roll out a production-ready application, and this has been infuriating for us
Is anyone still experiencing this today? Please report and indicate an endpoint or worker ID. Thanks.
Yep
I had 58 workers idle (the logs say ready) and only 10 workers running. My queue had 58 jobs for over 30 minutes. I killed it and tried many things before joining Discord and seeing others have the same problem.
I am recreating a new endpoint and will share endpoint ID.
I have set 78 max workers and 78 active workers, but still only have 7 workers running.
Idle node log:
2/20/2025, 3:15:41 PM
loading container image from cache
Loaded image: xxxxxxx
xxx Pulling from xxxx
Digest: xxxxxx
Status: Image is up to date for xxxxxx
worker is ready


@getsomedata Can you give me your endpoint id?
We're looking into this, thank you for waiting!
Would love to know the findings. Don't want to give up on RunPod
@blue whale I'm told this incident should be resolved for most users, can you share an endpoint ID if you're still seeing this problem?