Issue with unresponsive workers
We launched our model to production a few days ago... and this problem has already happened to us twice.
Problem: Unresponsive workers. Most of them show as "ready" but stay "idle" while requests queue up for MINUTES.
Expected Behavior: Idle workers should pick up requests as soon as they are waiting in the queue.
Actual Behavior: Workers stay idle, the queue does not get processed, and requests are delayed for minutes.
New / Existing Problem: In our two days of experience so far, this has happened twice.
Steps to Reproduce: It seems to come down to chance, happening when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
Other Workers:
- RTX A5000 - 217s1y508zuj48
- RTX A5000 - vj8i7gy9eujei6
- RTX A5000 - 1ij40acwnngaxc
- RTX A5000 - 3ysqauzbfjwd7h
Attempted Solutions:
- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as the RTX A5000s
- Booting off some unresponsive workers (did nothing)
:/
a request just got processed, but this failed job is still stuck... this seems like an issue with RunPod's job distribution system
are you using request count or queue delay? I had a similar issue when using request count, and so did a few others. It was advised to use queue delay
We're using request count. But in that case I'll try queue delay... What settings were recommended?
that depends on your use case. I use LLMs so I just set it to 10s.
never mind, it was already set to queue delay
Ah. so I guess the issue is indeed with their job system
Have you tried with >= 1 active worker, and then sending several requests to test scaling?
Haven't stress-tested it that much, but here are our current settings:
I might just write some code to automatically cancel jobs that take more than 1 minute
it's a necessary fail-safe anyways
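For reference, a minimal sketch of that fail-safe, assuming the documented serverless REST routes (/status/{request_id} and /cancel/{request_id} under api.runpod.ai/v2/{endpoint_id}) and a hypothetical RUNPOD_API_KEY environment variable; it only tracks jobs we submitted ourselves:
```python
import os
import time

import requests

# Assumed setup: the serverless REST routes
#   GET  https://api.runpod.ai/v2/{endpoint_id}/status/{request_id}
#   POST https://api.runpod.ai/v2/{endpoint_id}/cancel/{request_id}
# and an API key in the (hypothetical) RUNPOD_API_KEY environment variable.
ENDPOINT_ID = "isme01qeaw1yd4"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

MAX_AGE_SECONDS = 60  # fail-safe: cancel anything still unfinished after 1 minute

# request_id -> time we submitted it (recorded when we call /run)
tracked_jobs: dict[str, float] = {}


def cancel_stale_jobs() -> None:
    """Cancel every tracked job still queued or running past MAX_AGE_SECONDS."""
    now = time.time()
    for request_id, submitted_at in list(tracked_jobs.items()):
        status = requests.get(f"{BASE_URL}/status/{request_id}", headers=HEADERS).json()
        if status.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
            # finished one way or another -- nothing left to do
            tracked_jobs.pop(request_id, None)
        elif now - submitted_at > MAX_AGE_SECONDS:
            requests.post(f"{BASE_URL}/cancel/{request_id}", headers=HEADERS)
            tracked_jobs.pop(request_id, None)


if __name__ == "__main__":
    while True:
        cancel_stale_jobs()
        time.sleep(5)
```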
I cancelled the job via API (curl) and it magically finished? wuttt
It had a result and everything
what's the endpoint ID?
Endpoint ID: isme01qeaw1yd4
Another case:
Endpoint ID: isme01qeaw1yd4
Request ID: dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1
Worker ID: vj8i7gy9eujei6
Worker Logs:
Job Results (STATUS):
(automatically cancelled after 1 minute)
did you cancel it?
uh, yeah our systems now cancel jobs that take more than 1 minute (as a fail-safe)
got it, thanks
random:
https://docs.runpod.io/docs/serverless-usage#--execution-policy
If you're not already using it, there's an execution policy (it seems) that I added to my request payloads, because this support ticket made me aware I should do it haha
that's actually great to know, we might try that xd
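For anyone finding this thread later, a minimal sketch of attaching an execution policy to a /run request, per the docs page linked above; the exact keys ("policy", "executionTimeout" in milliseconds) and the RUNPOD_API_KEY environment variable are assumptions to verify against those docs:
```python
import os

import requests

ENDPOINT_ID = "isme01qeaw1yd4"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

payload = {
    "input": {"prompt": "example request"},  # hypothetical handler input
    # Execution policy per the linked docs: ask RunPod to kill the job
    # server-side after 60 seconds instead of relying only on client-side cancels.
    "policy": {"executionTimeout": 60_000},  # milliseconds (assumed unit)
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=HEADERS,
    json=payload,
)
print(resp.json())  # should include the request ID and its initial status
```
With a server-side executionTimeout in place, the client-side 1-minute cancel above becomes more of a backup than the primary guard.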