RunPod · 2mo ago
spooky

jobs queued for minutes despite lots of available idle workers

for the past couple of days my jobs keep getting queued for a long time despite lots of available "idle" workers - nowhere near my max workers. sometimes there are 9 available workers but concurrent jobs still get queued... anyone have any insight on this?
spooky (OP) · 2mo ago
my queue delay is 1s
yhlong00000 · 2mo ago
what is your endpoint id?
spooky (OP) · 2mo ago
abmok2vq31zy61. also tried request count 1. still getting requests queued for minutes despite lots of available "idle" workers
yhlong00000 · 2mo ago
Hmm, I don’t see any issues at the moment. Your endpoint is scaling up and down. Is there a specific time when you noticed a problem?
spooky (OP) · 2mo ago
it's happening pretty consistently. for example, right now I have 19 idle workers but still have a job that's been in queue for ~7 mins and counting...
spooky (OP) · 2mo ago
thanks for looking into this @yhlong00000, please let me know how to fix it. I monitored this queued job and it didn't start until the one running container finished, then it ran on that - it never scaled to one of the idle workers. this is happening pretty consistently.
yhlong00000 · 2mo ago
Yeah, I can see that job 7a1e5539** arrived, but for some reason, it didn’t trigger the addition of a new worker. Does this also happen when using the queue delay?
spooky (OP) · 2mo ago
yeah, I've tried both a queue delay of 0 and a request count of 1. happens with both. thanks for looking into it @yhlong00000. what's odd is sometimes it does scale...
yhlong00000 · 2mo ago
Well, it should scale, but I’ll need to investigate what might be preventing it from doing so. It seems like your requests are taking a long time to complete, and I’m not sure if that could be causing some unexpected behavior on our end
spooky (OP) · 2mo ago
@yhlong00000 yes, they are longer jobs. there's a separate issue, which is that I'm experiencing wildly different performance across workers. some take ~40m to complete a job that's done in ~15m on a different worker with the exact same input args (and both 4090s). the jobs should take 1-15 mins max (time varies depending on the inputs). Are some of the 4090s power limited or something? Is there any way to avoid datacenters that offer 4090s that are 3x slower?
aksay_23298 · 2mo ago
i have the same issue as well
inc3pt.io · 2mo ago
Same in 1.7.4, no problem with 1.7.0.
deanQ · 2mo ago
Rather than just the endpoint id, can you also tell us the request id? That would give us a better way to trace. Thanks
spooky (OP) · 2mo ago
Hi @deanQ, fortunately, for whatever reason, the long queue times without horizontal scaling have seemed to disappear. However, I am experiencing a new issue: some of my requests are failing to return. Interestingly, from my logs, the inference completes successfully; it's just that the result is never returned. The last status I get (from polling) is IN_PROGRESS, and my logs show the job completed successfully. What typically happens is that a subsequent poll returns a COMPLETED status with the output. Instead, I'm seeing it hang on IN_PROGRESS and then my endpoint.status requests start failing. This is happening maybe 5% of the time. My result payload is ~300 KB. Is that too large? Should I be saving it to storage and returning a URL? That's the only thing I can think of. I'd appreciate some help here as it's a big issue for my application. Here are some requests that hit this issue:

Endpoint Id: mmumv0n4k99461

 Id: ac74d68b-ec22-48b8-aaf1-9023d2600e97-u1
 workerId: 4dxsfu0y6ylg9v

 Id: 0234e98a-71a4-4ec8-a2a6-24ef9f5bc7a1-u1
 workerId: gqqcsuxbczbnct

 Id: 59ccf6c2-7981-4247-9691-b9de3fb3ff2a-u1
 workerId: 1d6pswp366osik

 Id: 80156eba-28fd-467e-9277-2e18a49a24b2-u1
 workerId: o8nhl6j0fdcubz

 Id: 150747b2-4271-4b5b-b806-76b8f007adb6-u1
 workerId: 1d6pswp366osik
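
For reference, here is a minimal sketch of the polling pattern described in the message above, assuming the documented RunPod serverless REST API (GET https://api.runpod.ai/v2/{endpoint_id}/status/{job_id} with a Bearer API key). The API key, timing values, and retry-on-error behavior are placeholders/assumptions, not details from this thread:

```python
# Minimal polling sketch, assuming the public RunPod serverless /status endpoint.
# API_KEY and the interval/timeout values are placeholders.
import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"          # placeholder
ENDPOINT_ID = "mmumv0n4k99461"           # endpoint id mentioned above
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def poll_job(job_id: str, interval: float = 5.0, timeout: float = 3600.0) -> dict:
    """Poll /status until the job reaches a terminal state or we give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            resp = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS, timeout=30)
            resp.raise_for_status()
            payload = resp.json()
        except requests.RequestException:
            # A failed status call is retried instead of being treated as a
            # failed job, since the job may still be running server-side.
            time.sleep(interval)
            continue

        if payload.get("status") in TERMINAL_STATES:
            return payload  # COMPLETED responses carry the job output
        time.sleep(interval)

    raise TimeoutError(f"job {job_id} did not reach a terminal state within {timeout}s")
```

If the ~300 KB inline output does turn out to be the problem, one common workaround is to have the handler upload the result to object storage and return only a (presigned) URL, which keeps the status response small.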
Mihály · 2mo ago
@yhlong00000
yhlong00000 · 2mo ago
I saw that you opened a support ticket. Could you share more details about your setup there? Thanks!
spooky (OP) · 2mo ago
thanks @yhlong00000 just sent