Job stuck in queue and workers are sitting idle

This happens very often: jobs are stuck in the queue while workers sit idle. How can this be improved? There was nothing else going on with any other worker (or endpoint, for that matter).
hotsnr
hotsnr7d ago
+1
kironkeyz
kironkeyz7d ago
I am having problems with setting things up as well, we need help
nerdylive
nerdylive7d ago
Check your logs.
3WaD
3WaD7d ago
It happens to me from time to time too. There are no logs to check because nothing is running on the workers. I think it might be something with the orchestrator, as the only solution is to cancel the request and send a new one. If you don't, the request is usually executed only after a long time. This can randomly happen with a previously perfectly working worker.
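A minimal sketch of the cancel-and-resubmit workaround described above, assuming the standard RunPod serverless REST routes (/run, /status, /cancel) and a hypothetical endpoint ID and API key read from the environment; the exact response fields may differ from what your endpoint returns.
```python
import os
import time
import requests

# Assumptions: standard RunPod serverless REST routes and hypothetical
# endpoint ID / API key taken from environment variables.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(payload: dict) -> str:
    """Queue a job and return its ID."""
    r = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS)
    r.raise_for_status()
    return r.json()["id"]

def run_with_requeue(payload: dict, stuck_after_s: int = 120) -> dict:
    """Cancel and resubmit a job if it sits IN_QUEUE for too long."""
    job_id = submit(payload)
    started = time.time()
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status.get("status") in ("COMPLETED", "FAILED"):
            return status
        if status.get("status") == "IN_QUEUE" and time.time() - started > stuck_after_s:
            # Job never got picked up: cancel it and queue a fresh one.
            requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS)
            job_id = submit(payload)
            started = time.time()
        time.sleep(5)
```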
nerdylive
nerdylive6d ago
Ohh... sorry, what did you do to the endpoint? You can also open a support ticket to ask staff to check, but what I think might have happened is that you used the same tag for a release? Or not.
Justin
Justin6d ago
+1 My worker booted up and then sat idle, and the job is sometimes stuck for minutes.
jim
jim4d ago
Same issue. Workers are "running" but they're not working on any requests, and requests just sit there queued for 10+ minutes without anything happening @Justin Merrell @flash-singh
flash-singh
flash-singh3d ago
This usually means the worker isn't picking up the job. Do you have an endpoint ID or anything else we can look at on our end?
jim
jim3d ago
TristenHarr
TristenHarr3d ago
Same issue! We can't move into production because of this. https://discord.com/channels/912829806415085598/1340773964397674709
jim
jim3d ago
What is runpod doing???
nerdylive
nerdylive3d ago
Shipping some other features. That bug shouldn't happen though, maybe staff should check your endpoints.
TristenHarr
TristenHarr3d ago
For me, there are no misconfigurations as far as I can tell. What happens is: a request comes in, a worker becomes active, then the job sits in the queue for 10+ minutes before getting picked up. It's not FlashBoot (that's enabled), and the logs say everything is ready from the worker's perspective. (I've checked the logs extensively; it's not stuck loading a model or anything of that nature. The worker should be ready.) This happens even with multiple workers set up, where only one becomes active and then everything sits in the queue for 10 minutes.

Once things spin up they seem to work fine, but every time there's a new spin-up there's a risk it'll take 10+ minutes. What I've been doing is increasing the time before a worker spins down, then trying to find a "good" worker and keep it alive as long as I can, even sending redundant requests just to avoid getting a "bad" or "stuck" worker.

It's also intermittent/flaky: sometimes it spins up quickly and works fine, sometimes it gets stuck like this. It doesn't happen every time, maybe 10-30% of the time.
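A hedged sketch of the redundant-request workaround mentioned above: periodically sending a tiny request so a known-good worker never idles out. It assumes a handler that answers a hypothetical {"warmup": True} input quickly, and uses the standard /runsync route with placeholder endpoint ID and API key.
```python
import os
import time
import requests

# Assumptions: a hypothetical {"warmup": True} input the handler answers
# quickly; endpoint ID and API key are placeholders from the environment.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def keep_warm(interval_s: int = 60) -> None:
    """Send a lightweight request on a timer so the worker stays busy enough not to spin down."""
    while True:
        try:
            r = requests.post(URL, json={"input": {"warmup": True}},
                              headers=HEADERS, timeout=30)
            print("warmup status:", r.json().get("status"))
        except requests.RequestException as exc:
            print("warmup ping failed:", exc)
        time.sleep(interval_s)

if __name__ == "__main__":
    keep_warm()
```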
nerdylive
nerdylive3d ago
Yes, I think there's some edge case happening there, which is why it's useful for them to check your endpoint. Also, can you debug more of what happens before you call runpod.serverless.start in your handler.py? Print at every step, or debug it somehow, to figure out what might be going wrong. Is it maybe your model that takes long to load? Or do you download the model on every worker startup to container disk (outside /runpod-volume)? I'm just listing possibilities that might happen, so if it's none of those, let me know.
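A minimal handler.py sketch of that kind of startup logging: timestamped prints around model loading and before runpod.serverless.start, so the worker logs show exactly where startup time goes. The load_model call and model path are hypothetical placeholders.
```python
import time
import runpod

def log(msg: str) -> None:
    # Timestamped, flushed prints so they appear in the worker logs immediately.
    print(f"[{time.strftime('%H:%M:%S')}] {msg}", flush=True)

log("container started, imports done")

# Hypothetical model load; if this is the slow step, it will show up here
# in the worker logs rather than as a job stuck in the queue.
t0 = time.time()
model = None  # e.g. model = load_model("/runpod-volume/model")  # network volume, not container disk
log(f"model loaded in {time.time() - t0:.1f}s")

def handler(job):
    log(f"handler received job {job.get('id')}")
    # ... run inference with `model` on job["input"] ...
    return {"echo": job["input"]}

log("calling runpod.serverless.start, worker should report ready now")
runpod.serverless.start({"handler": handler})
```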
TristenHarr
TristenHarr3d ago
For sure, I think this is a known issue they are looking into! Will give them some time to dig in and if I still have problems later this week I'll follow up. 🙂
blue whale
blue whaleOP2d ago
I can share, but this has been intermittent. We are planning to roll out a production-ready application, and this has been infuriating.
deanQ
deanQ2d ago
Is anyone still experiencing this today? Please report and indicate an endpoint or worker ID. Thanks.
getsomedata
getsomedata2d ago
Yep, I had 58 workers idle (the logs say ready) and only 10 workers running. My queue had 58 jobs for over 30 minutes. I killed it and tried many things before joining Discord and seeing others have the same problem. I am recreating a new endpoint and will share the endpoint ID.
getsomedata
getsomedata2d ago
I have set 78 max workers and 78 active workers but still have only 7 workers running. Idle node log:
2/20/2025, 3:15:41 PM loading container image from cache
Loaded image: xxxxxxx
xxx Pulling from xxxx
Digest: xxxxxx
Status: Image is up to date for xxxxxx
worker is ready
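A small sketch for watching the gap between idle and running workers described above, assuming the endpoint's /health route returns worker and job counts; the exact response keys below are a best guess and may differ.
```python
import os
import time
import requests

# Assumptions: the /health route reports worker/job counts; the key names
# used here are a best guess at the response shape.
API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

while True:
    health = requests.get(URL, headers={"Authorization": f"Bearer {API_KEY}"}).json()
    workers = health.get("workers", {})
    jobs = health.get("jobs", {})
    # Many idle workers alongside a large inQueue count is the symptom
    # being reported in this thread.
    print(f"idle={workers.get('idle')} running={workers.get('running')} "
          f"inQueue={jobs.get('inQueue')} inProgress={jobs.get('inProgress')}")
    time.sleep(30)
```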
Dj
Dj2d ago
@getsomedata Can you give me your endpoint id? We're looking into this, thank you for waiting!
blue whale
blue whaleOP2d ago
Would love to know the findings. Don't want to give up on RunPod.
Dj
Dj2d ago
@blue whale I'm told this incident should be resolved for most users, can you share an endpoint ID if you're still seeing this problem?
