Jobs in queue for a long time, even when there is a worker available
Hello,
Recently I've seen a lot of jobs getting stuck in the queue for a long time, even though my serverless endpoint has free workers left and the queue delay is set to 4 seconds.
Does anyone have any experience with this? Any ideas why this happens?
The first screenshot depicts two jobs, submitted at the same time. One is picked up by a worker, and the other sits in queue.
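For reference, this is roughly how the jobs get submitted (the endpoint ID, API key, and payload below are placeholders, not my real values):

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"          # placeholder
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")  # placeholder

# Submit two async jobs back to back.
jobs = [endpoint.run({"input": {"task": f"job-{i}"}}) for i in range(2)]

# Both go in within the same second, yet one can stay IN_QUEUE long after
# the other has moved to IN_PROGRESS.
for job in jobs:
    print(job.status())
```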
Some extra settings for context:
Can you share the endpoint id here?
Sure! It was either noxh y2en 39n3 y3 or k5hi ftra iqq8 dw
Hey, I checked the logs and didn’t notice any unusually long wait times. There were maybe 2-3 requests that took a bit longer to start because both workers were occupied, and each request took a couple of minutes to complete. It might just be that the UI didn’t refresh. If you have a specific request ID that you think had an exceptionally long wait despite having an available worker, feel free to share it, and I can take another look.
Also, since each of your requests takes a bit of time to complete, I’d recommend configuring a higher number for max workers. It won’t cost you any extra money, but it will ensure you can scale smoothly when multiple requests come in at the same time.
Thank you, will do!
Hello @yhlong00000
I was able to find a bigger event:
Here are the delay times over the last 24 hours.
These happen even though there are constantly free workers available.
Sometimes, when the delay times get high, I also get a failed job.
Example request IDs:
186c4d2a-31ea-4d82-8dfa-411a2bc5c83b-e1
5dd6a934-5293-4796-a5d7-8d0ddd9eef60-e1
I'm also baffled by the error message "job timed out after 1 retries", as it's not coming from my container o.O
Any idea what this could be?
And a more recent view:
Let me take a look
The delay is high because you’re getting more requests than the current max number of workers can handle, so the requests are piling up in the queue. You can try increasing the max number of workers and lowering the queue delay so the workers can scale up faster.
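To make the scaling point concrete, here's a rough back-of-envelope sketch; the worker counts, the 2-minute job duration, and the burst of four simultaneous requests are just illustrative assumptions, not your actual numbers:

```python
import heapq

def queue_delays(num_workers, job_durations):
    """Per-job queue delay (seconds) when every job arrives at t=0."""
    workers = [0.0] * num_workers      # time at which each worker becomes free
    heapq.heapify(workers)
    delays = []
    for duration in job_durations:
        free_at = heapq.heappop(workers)   # earliest available worker
        delays.append(free_at)             # the job waits until that worker frees up
        heapq.heappush(workers, free_at + duration)
    return delays

burst = [120.0] * 4  # four ~2-minute jobs submitted at the same time

print(queue_delays(2, burst))  # [0.0, 0.0, 120.0, 120.0] -> two jobs wait 2 minutes
print(queue_delays(4, burst))  # [0.0, 0.0, 0.0, 0.0]     -> no queueing at all
```

With a higher max worker count, a burst like that gets absorbed immediately instead of waiting a full job duration in the queue.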
"Job timed out after 1 retry” happens when your worker finishes/failed the task, but there’s an error in the output field, and worker send a message notifies our system.