jobs queued for minutes despite lots of available idle workers
for the past couple of days my jobs keep getting queued for a long time despite lots of available "idle" workers - nowhere near my max workers. sometimes there are 9 available workers but concurrent jobs still get queued... anyone have any insight on this?
my queue delay is 1s
what is your endpoint id?
abmok2vq31zy61
also tried request count 1. still getting requests queued for minutes despite lots of available "idle" workers
Hmm, I don’t see any issues at the moment. Your endpoint is scaling up and down. Is there a specific time when you noticed a problem?
it's happening pretty consistently. for example, right now I have 19 idle workers but still have a job that's been in queue for ~7 mins and counting...
thanks for looking into this @yhlong00000, please let me know how to fix it
I monitored this queued job and it didn't start until the one running container finished, then it ran on that - it never scaled out to one of the idle workers. This is happening pretty consistently.
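For reference, a minimal sketch of how to capture this kind of evidence: it polls the endpoint's /health route and logs queued jobs against idle workers over time. The jobs/workers field names are assumed from the serverless docs, the endpoint ID is the one quoted in this thread, and the polling interval is arbitrary.

```python
import os
import time

import requests

# Endpoint ID from this thread; RUNPOD_API_KEY must be set in the environment.
ENDPOINT_ID = "abmok2vq31zy61"
API_KEY = os.environ["RUNPOD_API_KEY"]
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"


def log_queue_vs_workers(interval_s: int = 30) -> None:
    """Poll the endpoint health route and print queued jobs vs. idle workers."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    while True:
        data = requests.get(HEALTH_URL, headers=headers, timeout=10).json()
        jobs = data.get("jobs", {})        # e.g. inQueue, inProgress, ... (assumed keys)
        workers = data.get("workers", {})  # e.g. idle, running, ... (assumed keys)
        print(
            f"{time.strftime('%H:%M:%S')} "
            f"inQueue={jobs.get('inQueue')} inProgress={jobs.get('inProgress')} "
            f"idle={workers.get('idle')} running={workers.get('running')}"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    log_queue_vs_workers()
```

A log like this (queued jobs staying nonzero while idle workers stay high) is exactly the pattern described above and makes the report easy to trace against a specific time window.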
Yeah, I can see that job 7a1e5539** arrived, but for some reason, it didn’t trigger the addition of a new worker. Does this also happen when using the queue delay?
yeah, I've tried both a queue delay of 0 and a request count of 1. it happens with both
thanks for looking into it @yhlong00000
what's odd is that sometimes it does scale...
Well, it should scale, but I’ll need to investigate what might be preventing it from doing so. It seems like your requests are taking a long time to complete, and I’m not sure if that could be causing some unexpected behavior on our end
@yhlong00000 yes, they are longer jobs. There's a separate issue: I'm experiencing wildly different performance across workers. Some take ~40 min to complete a job that's done in ~15 min on a different worker with the exact same input args (and both are 4090s). The jobs should take 1-15 mins max (time varies depending on the inputs). Are some of the 4090s power-limited or something? Is there any way to avoid datacenters whose 4090s are 3x slower?
i have the same issue as well
Same in 1.7.4. No problem with 1.7.0.
Rather than just the endpoint id, can you also tell us the request id? That would give us a better way to trace. Thanks
Hi @deanQ, fortunately, for whatever reason, the issue of long queue times without horizontal scaling seems to have disappeared.
However, I am experiencing a new issue: some of my requests are failing to return. Interestingly, my logs show the inference completes successfully; it's just that the result is never returned. The last status I get from polling is IN_PROGRESS, even though my logs show the job completed successfully. What typically happens is that a subsequent poll returns a COMPLETED status with the output. Instead, I'm seeing it hang on IN_PROGRESS, and then my endpoint.status requests start failing. This is happening maybe 5% of the time.
My result payload is ~300 KB. Is that too large? Should I be saving it to storage and returning a URL? That's the only thing I can think of. I'd appreciate some help here as it's a big issue for my application.
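For context on what the client side looks like here, a minimal polling sketch against the /status route. The status values, endpoint ID (the one listed below), and timeout/interval numbers are assumptions for illustration, not the application's actual client code.

```python
import os
import time

import requests

# Endpoint ID from this thread; RUNPOD_API_KEY must be set in the environment.
ENDPOINT_ID = "mmumv0n4k99461"
API_KEY = os.environ["RUNPOD_API_KEY"]


def wait_for_result(request_id: str, timeout_s: int = 3600, interval_s: int = 5):
    """Poll the status route until the job completes, tolerating transient poll failures."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{request_id}"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            data = resp.json()
        except requests.RequestException as exc:
            # A failed poll is not a failed job -- log it and keep polling.
            print(f"status poll failed: {exc}")
            time.sleep(interval_s)
            continue
        status = data.get("status")
        if status == "COMPLETED":
            return data.get("output")
        if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"job {request_id} ended with status {status}: {data}")
        time.sleep(interval_s)
    raise TimeoutError(f"job {request_id} still not COMPLETED after {timeout_s}s")
```

The symptom described above would show up here as the loop never seeing COMPLETED even though the worker-side logs say the job finished.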
Here are some requests that hit this issue:
Endpoint Id: mmumv0n4k99461
Id: ac74d68b-ec22-48b8-aaf1-9023d2600e97-u1
workerId: 4dxsfu0y6ylg9v
Id: 0234e98a-71a4-4ec8-a2a6-24ef9f5bc7a1-u1
workerId: gqqcsuxbczbnct
Id: 59ccf6c2-7981-4247-9691-b9de3fb3ff2a-u1
workerId: 1d6pswp366osik
Id: 80156eba-28fd-467e-9277-2e18a49a24b2-u1
workerId: o8nhl6j0fdcubz
Id: 150747b2-4271-4b5b-b806-76b8f007adb6-u1
workerId: 1d6pswp366osik
@yhlong00000
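On the storage question above, a minimal sketch of the "save the result to storage and return a URL" approach inside a RunPod handler. The bucket name, S3 credentials, and run_inference stub are hypothetical stand-ins rather than the endpoint's actual code; the point is that the handler's return payload stays tiny regardless of result size.

```python
import json
import uuid

import boto3
import runpod

# Hypothetical bucket; credentials are assumed to come from the worker's environment.
s3 = boto3.client("s3")
BUCKET = "my-results-bucket"


def run_inference(job_input: dict) -> dict:
    # Placeholder for the real model call that produces the ~300 KB result.
    return {"echo": job_input}


def handler(job):
    result = run_inference(job["input"])

    # Upload the full result to object storage instead of returning it inline.
    key = f"results/{job.get('id', uuid.uuid4().hex)}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(result).encode("utf-8"))

    # Return only a small payload: a presigned URL the client can fetch.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=3600,
    )
    return {"result_url": url}


runpod.serverless.start({"handler": handler})
```

Whether the ~300 KB payload is actually the cause isn't established in this thread; this just shows what the suggested workaround could look like if it is.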
I saw that you opened a support ticket. Could you share more details about your setup there? Thanks!
thanks @yhlong00000 just sent