Jobs in queue for a long time, even when there is a worker available
Hello,
Recently I've seen a lot of jobs getting stuck in the queue for a long time, even though my serverless endpoint has free workers left and the queue delay is set to 4 seconds.
Does anyone have any experience with this? Any ideas why this happens?
The first screenshot depicts two jobs, submitted at the same time. One is picked up by a worker, and the other sits in queue.
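For reference, this is roughly how the jobs get submitted (the endpoint ID, API key, and payload below are placeholders, not my real values):

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"          # placeholder
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")  # placeholder

# Submit two async jobs back to back.
jobs = [endpoint.run({"input": {"task": f"job-{i}"}}) for i in range(2)]

# Both go in within the same second, yet one can stay IN_QUEUE long after
# the other has moved to IN_PROGRESS.
for job in jobs:
    print(job.status())
```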
Some extra settings for context:
Can you share the endpoint id here?
Sure! It was either noxh y2en 39n3 y3 or k5hi ftra iqq8 dw
Hey, I checked the logs and didn’t notice any unusually long wait times. There were maybe 2-3 requests that took a bit longer to start because both workers were occupied, and each request took a couple of minutes to complete. It might just be that the UI didn’t refresh. If you have a specific request ID that you think had an exceptionally long wait despite having an available worker, feel free to share it, and I can take another look.
Also, since each of your requests takes a bit of time to complete, I’d recommend configuring a higher number for max workers. It won’t cost you any extra money, but it will ensure you can scale smoothly when multiple requests come in at the same time.
Thank you, will do!
Hello @yhlong00000
I was able to find a bigger event:
Here are the delay times over the last 24 hours.
These happen even though there are constantly free workers available.
Sometimes, when the delay times get high, I also get a failed job.
Example request IDs:
186c4d2a-31ea-4d82-8dfa-411a2bc5c83b-e1
5dd6a934-5293-4796-a5d7-8d0ddd9eef60-e1
I'm also baffled by the error message "job timed out after 1 retries", as it's not coming from my container o.O
Any idea what this could be?
And a more recent view:
Let me take a look
The delay is high because you’re getting more requests than the current max number of workers can handle, so the requests are piling up in the queue. You can try increasing the max number of workers and lowering the queue delay so the workers can scale up faster.
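To make the scaling point concrete, here's a rough back-of-envelope sketch; the worker counts, the 2-minute job duration, and the burst of four simultaneous requests are just illustrative assumptions, not your actual numbers:

```python
import heapq

def queue_delays(num_workers, job_durations):
    """Per-job queue delay (seconds) when every job arrives at t=0."""
    workers = [0.0] * num_workers      # time at which each worker becomes free
    heapq.heapify(workers)
    delays = []
    for duration in job_durations:
        free_at = heapq.heappop(workers)   # earliest available worker
        delays.append(free_at)             # the job waits until that worker frees up
        heapq.heappush(workers, free_at + duration)
    return delays

burst = [120.0] * 4  # four ~2-minute jobs submitted at the same time

print(queue_delays(2, burst))  # [0.0, 0.0, 120.0, 120.0] -> two jobs wait 2 minutes
print(queue_delays(4, burst))  # [0.0, 0.0, 0.0, 0.0]     -> no queueing at all
```

With a higher max worker count, a burst like that gets absorbed immediately instead of waiting a full job duration in the queue.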
"Job timed out after 1 retry” happens when your worker finishes/failed the task, but there’s an error in the output field, and worker send a message notifies our system.