RunPod · 3mo ago
Mihály

Jobs stuck in queue for a long time, even when a worker is available

Hello, recently I've seen a lot of jobs getting stuck in the queue for a long time, even though my serverless endpoint has free workers left and the queue delay is set to 4 seconds. Does anyone have any experience with this? Any idea why this happens? The first screenshot shows two jobs submitted at the same time: one is picked up by a worker, and the other sits in the queue.
[Screenshot 1]
[Screenshot 2]
8 Replies
Mihály (OP) · 3mo ago
Some extra settings for context:
[Screenshot: endpoint settings]
yhlong00000 · 3mo ago
Can you share the endpoint id here?
Mihály (OP) · 3mo ago
Sure! It was either noxh y2en 39n3 y3 or k5hi ftra iqq8 dw
yhlong00000 · 3mo ago
Hey, I checked the logs and didn’t notice any unusually long wait times. There were maybe 2-3 requests that took a bit longer to start because both workers were occupied, and each request took a couple of minutes to complete. It might just be that the UI didn’t refresh. If you have a specific request ID that you think had an exceptionally long wait despite having an available worker, feel free to share it, and I can take another look. Also, since each of your requests takes a bit of time to complete, I’d recommend configuring a higher number for max workers. It won’t cost you any extra money, but it will ensure you can scale smoothly when multiple requests come in at the same time.
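To make the capacity argument concrete, here is a toy FIFO queue model. It is a simplification, not RunPod's scheduler: every worker is assumed to be already warm (no cold start, no queue-delay scaling), and `simulate_queue` is a hypothetical helper. With two workers and requests that take about two minutes each, a third simultaneous request waits a full job duration; raising the worker cap removes that wait.

```python
import heapq


def simulate_queue(arrivals, durations, max_workers):
    """Return the queue wait (seconds) of each job in a FIFO queue.

    Simplified model: at most max_workers identical, always-warm workers;
    arrivals must be sorted in ascending order.
    """
    if list(arrivals) != sorted(arrivals):
        raise ValueError("arrivals must be sorted")
    # Min-heap holding the time at which each worker next becomes free.
    free_at = [0.0] * max_workers
    heapq.heapify(free_at)
    waits = []
    for arrival, duration in zip(arrivals, durations):
        # The job goes to whichever worker frees up first.
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return waits
```

For example, `simulate_queue([0, 0, 0], [120, 120, 120], max_workers=2)` gives `[0, 0, 120]`, while `max_workers=4` gives `[0, 0, 0]`.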
Mihály (OP) · 3mo ago
Thank you, will do!
Mihály (OP) · 3mo ago
Hello @yhlong00000, I was able to find a bigger event. Here are the delay times for the last 24 hours. These happen even though there are constantly free workers available. Also, sometimes when the delay times get high, I also get a failed job. Example request IDs: 186c4d2a-31ea-4d82-8dfa-411a2bc5c83b-e1, 5dd6a934-5293-4796-a5d7-8d0ddd9eef60-e1. I'm also baffled by the error message "job timed out after 1 retries", as it's not coming from my container o.O Any idea what this could be?
[4 screenshots]
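To gather request IDs like these in one pass, a rough sketch that fetches each job's status and flags long queue waits or failures. Assumptions to verify against RunPod's API reference: the status URL pattern `https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}` with a Bearer API key, and response fields `id`, `status`, `delayTime` (milliseconds) and `error`. The helper names and the 10-second threshold are hypothetical.

```python
import json
import urllib.request


def fetch_status(endpoint_id, job_id, api_key):
    """Fetch one job's status (assumed URL pattern and auth header)."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def flag_slow_or_failed(statuses, max_delay_ms=10_000):
    """Return (job id, reason) pairs worth sending to support.

    Assumes each status dict has "id", "status", "delayTime" (ms) and,
    for failed jobs, "error".
    """
    flagged = []
    for s in statuses:
        if s.get("status") == "FAILED":
            flagged.append((s.get("id"), "failed: " + str(s.get("error", ""))))
        elif s.get("delayTime", 0) > max_delay_ms:
            flagged.append((s.get("id"), f"queued {s['delayTime'] / 1000:.1f} s"))
    return flagged
```

Usage would look like `flag_slow_or_failed([fetch_status(endpoint_id, rid, api_key) for rid in request_ids])`, with your own endpoint ID, request IDs and API key.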
Mihály (OP) · 3mo ago
And a more recent view:
[2 screenshots]
yhlong00000 · 3mo ago
Let me take a look.
The delay is high because you're getting more requests than the current max number of workers can handle, so the requests pile up in the queue. You can try increasing the max number of workers and lowering the queue delay so the workers can scale up faster. "Job timed out after 1 retry" happens when your worker finishes or fails the task but there's an error in the output field, and the worker sends a message notifying our system.
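Since that message is tied to an error reported through the job's output, one way to make failures easier to trace is to report them explicitly from the handler. A minimal sketch, assuming the RunPod Python SDK convention that a handler returning a dict with an `error` key marks the job as failed (the handler would be registered with `runpod.serverless.start({"handler": handler})`); `process` and its `text` input are hypothetical stand-ins for the real workload.

```python
def process(job_input):
    """Hypothetical stand-in for the real model call."""
    return {"length": len(job_input["text"])}


def handler(job):
    """RunPod-style handler sketch: job["input"] holds the request payload."""
    try:
        return process(job.get("input") or {})
    except Exception as exc:
        # Report the exception type and message through the "error" key
        # instead of letting the worker crash without context.
        return {"error": f"{type(exc).__name__}: {exc}"}
```

For example, `handler({"input": {}})` returns `{"error": "KeyError: 'text'"}`, which names the actual cause rather than leaving only a generic failure.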