jobs queued for minutes despite lots of available idle workers
for the past couple of days my jobs keep getting queued for a long time despite lots of available "idle" workers - nowhere near my max workers. sometimes there are 9 available workers but concurrent jobs still get queued... anyone have any insight on this?
my queue delay is 1s
what is your endpoint id?
abmok2vq31zy61
also tried request count 1. still getting requests queued for minutes despite lots of available "idle" workers
Hmm, I don’t see any issues at the moment. Your endpoint is scaling up and down. Is there a specific time when you noticed a problem?
it's happening pretty consistently. for example, right now I have 19 idle workers but still have a job that's been in queue for ~7 mins and counting...
thanks for looking into this @yhlong00000, please let me know how to fix it
I monitored this queued job and it didn't start until the one running container finished, then it ran on that - it never scaled out to one of the idle workers. This is happening pretty consistently.
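For reference, a minimal sketch of how to capture this kind of evidence: it polls the endpoint's /health route and logs queued jobs against idle workers over time. The jobs/workers field names are assumed from the serverless docs, the endpoint ID is the one quoted in this thread, and the polling interval is arbitrary.

```python
import os
import time

import requests

# Endpoint ID from this thread; RUNPOD_API_KEY must be set in the environment.
ENDPOINT_ID = "abmok2vq31zy61"
API_KEY = os.environ["RUNPOD_API_KEY"]
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"


def log_queue_vs_workers(interval_s: int = 30) -> None:
    """Poll the endpoint health route and print queued jobs vs. idle workers."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    while True:
        data = requests.get(HEALTH_URL, headers=headers, timeout=10).json()
        jobs = data.get("jobs", {})        # e.g. inQueue, inProgress, ... (assumed keys)
        workers = data.get("workers", {})  # e.g. idle, running, ... (assumed keys)
        print(
            f"{time.strftime('%H:%M:%S')} "
            f"inQueue={jobs.get('inQueue')} inProgress={jobs.get('inProgress')} "
            f"idle={workers.get('idle')} running={workers.get('running')}"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    log_queue_vs_workers()
```

A log like this (queued jobs staying nonzero while idle workers stay high) is exactly the pattern described above and makes the report easy to trace against a specific time window.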
Yeah, I can see that job 7a1e5539** arrived, but for some reason, it didn’t trigger the addition of a new worker. Does this also happen when using the queue delay?
yeah, I've tried both a queue delay of 0 and a request count of 1. it happens with both
thanks for looking into it @yhlong00000
what's odd is that sometimes it does scale...
Well, it should scale, but I’ll need to investigate what might be preventing it from doing so. It seems like your requests are taking a long time to complete, and I’m not sure if that could be causing some unexpected behavior on our end
@yhlong00000 yes, they are longer jobs. There's a separate issue: I'm experiencing wildly different performance across workers. Some take ~40 min to complete a job that's done in ~15 min on a different worker with the exact same input args (and both are 4090s). The jobs should take 1-15 mins max (time varies depending on the inputs). Are some of the 4090s power-limited or something? Is there any way to avoid datacenters whose 4090s are 3x slower?
i have the same issue as well
Same in 1.7.4. No problem with 1.7.0.
Rather than just the endpoint id, can you also tell us the request id? That would give us a better way to trace. Thanks
Hi @deanQ, fortunately, for whatever reason, the issue of long queue times without horizontal scaling seems to have disappeared.
However, I am experiencing a new issue: some of my requests are failing to return. Interestingly, my logs show the inference completes successfully; it's just that the result is never returned. The last status I get from polling is IN_PROGRESS, even though my logs show the job completed successfully. What typically happens is that a subsequent poll returns a COMPLETED status with the output. Instead, I'm seeing it hang on IN_PROGRESS, and then my endpoint.status requests start failing. This is happening maybe 5% of the time.
My result payload is ~300 KB. Is that too large? Should I be saving it to storage and returning a URL? That's the only thing I can think of. I'd appreciate some help here as it's a big issue for my application.
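For context on what the client side looks like here, a minimal polling sketch against the /status route. The status values, endpoint ID (the one listed below), and timeout/interval numbers are assumptions for illustration, not the application's actual client code.

```python
import os
import time

import requests

# Endpoint ID from this thread; RUNPOD_API_KEY must be set in the environment.
ENDPOINT_ID = "mmumv0n4k99461"
API_KEY = os.environ["RUNPOD_API_KEY"]


def wait_for_result(request_id: str, timeout_s: int = 3600, interval_s: int = 5):
    """Poll the status route until the job completes, tolerating transient poll failures."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{request_id}"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            data = resp.json()
        except requests.RequestException as exc:
            # A failed poll is not a failed job -- log it and keep polling.
            print(f"status poll failed: {exc}")
            time.sleep(interval_s)
            continue
        status = data.get("status")
        if status == "COMPLETED":
            return data.get("output")
        if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"job {request_id} ended with status {status}: {data}")
        time.sleep(interval_s)
    raise TimeoutError(f"job {request_id} still not COMPLETED after {timeout_s}s")
```

The symptom described above would show up here as the loop never seeing COMPLETED even though the worker-side logs say the job finished.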
Here are some requests that hit this issue:
Endpoint Id: mmumv0n4k99461
Id: ac74d68b-ec22-48b8-aaf1-9023d2600e97-u1
workerId: 4dxsfu0y6ylg9v
Id: 0234e98a-71a4-4ec8-a2a6-24ef9f5bc7a1-u1
workerId: gqqcsuxbczbnct
Id: 59ccf6c2-7981-4247-9691-b9de3fb3ff2a-u1
workerId: 1d6pswp366osik
Id: 80156eba-28fd-467e-9277-2e18a49a24b2-u1
workerId: o8nhl6j0fdcubz
Id: 150747b2-4271-4b5b-b806-76b8f007adb6-u1
workerId: 1d6pswp366osik
@yhlong00000
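On the storage question above, a minimal sketch of the "save the result to storage and return a URL" approach inside a RunPod handler. The bucket name, S3 credentials, and run_inference stub are hypothetical stand-ins rather than the endpoint's actual code; the point is that the handler's return payload stays tiny regardless of result size.

```python
import json
import uuid

import boto3
import runpod

# Hypothetical bucket; credentials are assumed to come from the worker's environment.
s3 = boto3.client("s3")
BUCKET = "my-results-bucket"


def run_inference(job_input: dict) -> dict:
    # Placeholder for the real model call that produces the ~300 KB result.
    return {"echo": job_input}


def handler(job):
    result = run_inference(job["input"])

    # Upload the full result to object storage instead of returning it inline.
    key = f"results/{job.get('id', uuid.uuid4().hex)}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(result).encode("utf-8"))

    # Return only a small payload: a presigned URL the client can fetch.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=3600,
    )
    return {"result_url": url}


runpod.serverless.start({"handler": handler})
```

Whether the ~300 KB payload is actually the cause isn't established in this thread; this just shows what the suggested workaround could look like if it is.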
I saw that you opened a support ticket. Could you share more details about your setup there? Thanks!
thanks @yhlong00000 just sent