RunPod
•Created by spooky on 10/30/2024 in #⚡|serverless
jobs queued for minutes despite lots of available idle workers
thanks @yhlong00000 just sent
Hi @deanQ, fortunately, for whatever reason, the issue of jobs sitting in queue without horizontal scaling seems to have disappeared.
However, I am experiencing a new issue: some of my requests are failing to return. Interestingly, my logs show the inference completes successfully; it's just that the result is never returned. The last status I get from polling is IN_PROGRESS, even though my logs show the job completed successfully. What typically happens is that a subsequent poll returns a COMPLETED status with the output. Instead, I'm seeing it hang on IN_PROGRESS, and then my endpoint.status requests start failing. This is happening maybe 5% of the time.
My result payload is ~300 KB. Is that too large? Should I be saving it to storage and returning a URL instead? That's the only thing I can think of. I'd appreciate some help here, as it's a big issue for my application.
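For what it's worth, a minimal sketch of the store-and-return-a-URL workaround mentioned above, assuming an S3-compatible bucket and the standard runpod Python handler pattern (the bucket name, environment variables, and placeholder inference result are hypothetical):
```python
import json
import os
import uuid

import boto3
import runpod

# Hypothetical S3-compatible client; endpoint and credentials come from env vars.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)
BUCKET = os.environ.get("RESULT_BUCKET", "my-results")  # hypothetical bucket name


def handler(job):
    # Stand-in for the real inference; this is where the ~300 KB result comes from.
    result = {"echo": job["input"]}

    # Upload the payload instead of returning it inline in the job output.
    key = f"results/{job['id']}-{uuid.uuid4().hex}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(result).encode())

    # Return only a small payload: a presigned URL the client fetches separately.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=3600,
    )
    return {"result_url": url}


runpod.serverless.start({"handler": handler})
```
This keeps the /status response tiny no matter how large the actual result grows, which also makes it easier to tell whether the hang is related to payload size at all.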
Here are some requests that hit this issue:
Endpoint Id: mmumv0n4k99461
Id: ac74d68b-ec22-48b8-aaf1-9023d2600e97-u1
workerId: 4dxsfu0y6ylg9v
Id: 0234e98a-71a4-4ec8-a2a6-24ef9f5bc7a1-u1
workerId: gqqcsuxbczbnct
Id: 59ccf6c2-7981-4247-9691-b9de3fb3ff2a-u1
workerId: 1d6pswp366osik
Id: 80156eba-28fd-467e-9277-2e18a49a24b2-u1
workerId: o8nhl6j0fdcubz
Id: 150747b2-4271-4b5b-b806-76b8f007adb6-u1
workerId: 1d6pswp366osik
@yhlong00000 yes, they are longer jobs. There's a separate issue, which is that I'm seeing wildly different performance across workers: some take ~40 min to complete a job that finishes in ~15 min on a different worker with the exact same input args (and both are 4090s). The jobs should take 1-15 minutes max (the time varies depending on the inputs). Are some of the 4090s power-limited or something? Is there any way to avoid data centers that offer 4090s that are 3x slower?
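One way to check the power-limit theory is to log each worker's reported GPU clocks and power cap at the start of a job and compare the fast and slow workers. A minimal sketch, assuming nvidia-smi is available inside the worker image (the log prefix is arbitrary):
```python
import subprocess


def log_gpu_info() -> None:
    """Print this worker's GPU name, SM clocks, and power limits to the job logs."""
    query = "name,clocks.sm,clocks.max.sm,power.limit,power.max_limit"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Output is one comma-separated line per GPU: name, current SM clock,
    # max SM clock, current power limit, max power limit.
    print(f"[gpu-info] {out.stdout.strip()}", flush=True)
```
Calling this at the top of the handler and correlating it with per-job wall-clock time in the worker logs would show whether the ~40 min workers report lower clocks or a lower power cap than the ~15 min ones.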
What's odd is that sometimes it does scale.
thanks for looking into it @yhlong00000
Yeah, I've tried both a queue delay of 0 and a request count of 1; it happens with both.
I monitored this queued job: it didn't start until the one running container finished, and then it ran on that container. It never scaled to one of the idle workers. This is happening pretty consistently.
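To document the "queued despite idle workers" pattern for support, one option is to poll the endpoint's health route while a job sits in queue and log queue depth next to worker states. A rough sketch, assuming the serverless REST API's per-endpoint /health route and a RUNPOD_API_KEY environment variable (the endpoint ID is a placeholder):
```python
import os
import time

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]   # assumes your RunPod API key is exported
ENDPOINT_ID = "YOUR_ENDPOINT_ID"         # placeholder; e.g. the endpoint from this thread
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"


def watch_health(interval_s: float = 5.0, duration_s: float = 300.0) -> None:
    """Print queued-job counts alongside worker states every few seconds."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    deadline = time.time() + duration_s
    while time.time() < deadline:
        health = requests.get(HEALTH_URL, headers=headers, timeout=10).json()
        print(
            f"{time.strftime('%H:%M:%S')} "
            f"jobs={health.get('jobs')} workers={health.get('workers')}",
            flush=True,
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    watch_health()
```
A log that shows a nonzero queued-job count alongside several idle workers over a stretch of minutes is exactly the evidence that helps narrow this down.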
Thanks for looking into this @yhlong00000, please let me know how to fix it.
abmok2vq31zy61
I also tried a request count of 1, and I'm still getting requests queued for minutes despite lots of available "idle" workers.
my queue delay is 1s
RunPod
•Created by vitalik on 10/10/2024 in #⚡|serverless
Job retry after successful run
that seems to have resolved it
Trying an upgrade of the SDK from 1.7.1 to 1.7.2, will see.
Same issue. The job returns successfully, but my status always says IN_PROGRESS; it eventually retries and then I get "job timed out after 1 retries".
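Since the fix in this thread turned out to be an SDK bump, a cheap sanity check is to log the installed runpod version at worker startup so the logs show which build handled each job. A minimal sketch, assuming the standard runpod Python handler pattern (the handler body is a placeholder):
```python
from importlib.metadata import version

import runpod

# Log the SDK version once at startup; it shows up in the worker logs.
print(f"[startup] runpod SDK version: {version('runpod')}", flush=True)


def handler(job):
    # Placeholder work; the handler's return value is what should come back
    # once the job status flips from IN_PROGRESS to COMPLETED.
    return {"echo": job["input"]}


runpod.serverless.start({"handler": handler})
```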