RunPod
•Created by spooky on 10/30/2024 in #⚡|serverless
jobs queued for minutes despite lots of available idle workers
thanks @yhlong00000 just sent
Hi @deanQ, fortunately, for whatever reason, the issue of jobs sitting in queue without horizontal scaling seems to have disappeared.
However, I am experiencing a new issue: some of my requests are failing to return. Interestingly, my logs show the inference completes successfully; it's just that the result is never returned. The last status I get from polling is IN_PROGRESS, even though my logs show the job completed successfully. What typically happens is that a subsequent poll returns a COMPLETED status with the output. Instead, I'm seeing it hang on IN_PROGRESS, and then my endpoint.status requests start failing. This is happening maybe 5% of the time.
My result payload is ~300 KB. Is that too large? Should I be saving it to storage and returning a URL instead? That's the only thing I can think of. I'd appreciate some help here, as it's a big issue for my application.
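For what it's worth, a minimal sketch of the store-and-return-a-URL workaround mentioned above, assuming an S3-compatible bucket and the standard runpod Python handler pattern (the bucket name, environment variables, and placeholder inference result are hypothetical):
```python
import json
import os
import uuid

import boto3
import runpod

# Hypothetical S3-compatible client; endpoint and credentials come from env vars.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)
BUCKET = os.environ.get("RESULT_BUCKET", "my-results")  # hypothetical bucket name


def handler(job):
    # Stand-in for the real inference; this is where the ~300 KB result comes from.
    result = {"echo": job["input"]}

    # Upload the payload instead of returning it inline in the job output.
    key = f"results/{job['id']}-{uuid.uuid4().hex}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(result).encode())

    # Return only a small payload: a presigned URL the client fetches separately.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=3600,
    )
    return {"result_url": url}


runpod.serverless.start({"handler": handler})
```
This keeps the /status response tiny no matter how large the actual result grows, which also makes it easier to tell whether the hang is related to payload size at all.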
Here are some requests that hit this issue:
Endpoint Id: mmumv0n4k99461
Id: ac74d68b-ec22-48b8-aaf1-9023d2600e97-u1
workerId: 4dxsfu0y6ylg9v
Id: 0234e98a-71a4-4ec8-a2a6-24ef9f5bc7a1-u1
workerId: gqqcsuxbczbnct
Id: 59ccf6c2-7981-4247-9691-b9de3fb3ff2a-u1
workerId: 1d6pswp366osik
Id: 80156eba-28fd-467e-9277-2e18a49a24b2-u1
workerId: o8nhl6j0fdcubz
Id: 150747b2-4271-4b5b-b806-76b8f007adb6-u1
workerId: 1d6pswp366osik
@yhlong00000 yes, they are longer jobs. There's a separate issue, which is that I'm seeing wildly different performance across workers: some take ~40 min to complete a job that finishes in ~15 min on a different worker with the exact same input args (and both are 4090s). The jobs should take 1-15 minutes max (the time varies depending on the inputs). Are some of the 4090s power-limited or something? Is there any way to avoid data centers that offer 4090s that are 3x slower?
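One way to check the power-limit theory is to log each worker's reported GPU clocks and power cap at the start of a job and compare the fast and slow workers. A minimal sketch, assuming nvidia-smi is available inside the worker image (the log prefix is arbitrary):
```python
import subprocess


def log_gpu_info() -> None:
    """Print this worker's GPU name, SM clocks, and power limits to the job logs."""
    query = "name,clocks.sm,clocks.max.sm,power.limit,power.max_limit"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Output is one comma-separated line per GPU: name, current SM clock,
    # max SM clock, current power limit, max power limit.
    print(f"[gpu-info] {out.stdout.strip()}", flush=True)
```
Calling this at the top of the handler and correlating it with per-job wall-clock time in the worker logs would show whether the ~40 min workers report lower clocks or a lower power cap than the ~15 min ones.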
What's odd is that sometimes it does scale.
thanks for looking into it @yhlong00000
Yeah, I've tried both a queue delay of 0 and a request count of 1; it happens with both.
I monitored this queued job: it didn't start until the one running container finished, and then it ran on that container. It never scaled to one of the idle workers. This is happening pretty consistently.
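To document the "queued despite idle workers" pattern for support, one option is to poll the endpoint's health route while a job sits in queue and log queue depth next to worker states. A rough sketch, assuming the serverless REST API's per-endpoint /health route and a RUNPOD_API_KEY environment variable (the endpoint ID is a placeholder):
```python
import os
import time

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]   # assumes your RunPod API key is exported
ENDPOINT_ID = "YOUR_ENDPOINT_ID"         # placeholder; e.g. the endpoint from this thread
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"


def watch_health(interval_s: float = 5.0, duration_s: float = 300.0) -> None:
    """Print queued-job counts alongside worker states every few seconds."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    deadline = time.time() + duration_s
    while time.time() < deadline:
        health = requests.get(HEALTH_URL, headers=headers, timeout=10).json()
        print(
            f"{time.strftime('%H:%M:%S')} "
            f"jobs={health.get('jobs')} workers={health.get('workers')}",
            flush=True,
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    watch_health()
```
A log that shows a nonzero queued-job count alongside several idle workers over a stretch of minutes is exactly the evidence that helps narrow this down.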
Thanks for looking into this @yhlong00000, please let me know how to fix it.
abmok2vq31zy61
I also tried a request count of 1, and I'm still getting requests queued for minutes despite lots of available "idle" workers.
my queue delay is 1s
RunPod
•Created by vitalik on 10/10/2024 in #⚡|serverless
Job retry after successful run
that seems to have resolved it
Trying an upgrade of the SDK from 1.7.1 to 1.7.2, will see.
Same issue. The job returns successfully, but my status always says IN_PROGRESS; it eventually retries and then I get "job timed out after 1 retries".
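Since the fix in this thread turned out to be an SDK bump, a cheap sanity check is to log the installed runpod version at worker startup so the logs show which build handled each job. A minimal sketch, assuming the standard runpod Python handler pattern (the handler body is a placeholder):
```python
from importlib.metadata import version

import runpod

# Log the SDK version once at startup; it shows up in the worker logs.
print(f"[startup] runpod SDK version: {version('runpod')}", flush=True)


def handler(job):
    # Placeholder work; the handler's return value is what should come back
    # once the job status flips from IN_PROGRESS to COMPLETED.
    return {"echo": job["input"]}


runpod.serverless.start({"handler": handler})
```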