RunPod•12mo ago
marshall

Issue with unresponsive workers

We just launched our model to production a few days ago, and this problem has already happened to us twice.
Problem: Unresponsive workers. Most of them are "ready" but stay "idle" while requests queue up for MINUTES.
Expected Behavior: Idle workers should pick up requests as soon as they are waiting in the queue.
Actual Behavior: Workers stay idle and the queue goes unprocessed, delaying requests for minutes.
New / Existing Problem: In our two days of experience, this has happened twice.
Steps to Reproduce: It comes down to chance; it seems to happen when most RunPod GPUs are under heavy load, e.g. when all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}
Other Workers: RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6
2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network
2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network
RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h
2023-12-24T21:20:21Z worker is ready
2023-12-24T21:20:21Z worker is ready
Attempted Solutions:
- Maxing out the worker limit at 5 (as suggested by support staff)
- Using less in-demand GPUs such as RTX A5000s
- Booting off some unresponsive workers (did nothing)
14 Replies
marshall
marshallOP•12mo ago
:/ a request just got processed, but this failed job is still stuck... this seems like an issue on RunPod's job distribution system
🐧
πŸ§β€’12mo ago
Are you using request count or queue delay? I had a similar issue when using request count, and so did a few others. It was advised to use queue delay.
marshall
marshallOP•12mo ago
We're using request count. But in that case I'll try queue delay... What settings were recommended?
🐧
πŸ§β€’12mo ago
that depends on your use case. I use LLMs so I just set it to 10s.
marshall
marshallOP•12mo ago
never mind, it was already set to queue delay
🐧
πŸ§β€’12mo ago
Ah, so I guess the issue is indeed with their job system. Have you tried with >= 1 active worker? And then sending several requests to test scaling?
marshall
marshallOP•12mo ago
Haven't stress-tested it that much, but here are our current settings. I might just write some code to automatically cancel jobs that take more than 1 minute; it's a necessary fail-safe anyway. I cancelled the job via the API (curl) and it magically finished? wuttt. It had a result and everything.
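A minimal sketch of that fail-safe, assuming the standard RunPod serverless REST routes (/status/{id} and /cancel/{id}); the API key is a placeholder and the helper name and polling interval are my own, not RunPod's:
```python
import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"     # placeholder, use your own key
ENDPOINT_ID = "isme01qeaw1yd4"      # endpoint ID from this thread
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def cancel_if_stuck(job_id: str, timeout_s: int = 60, poll_s: int = 5) -> dict:
    """Poll /status and cancel the job if it has not finished within timeout_s."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS).json()
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        time.sleep(poll_s)
    # Still IN_QUEUE / IN_PROGRESS after the deadline -> cancel it.
    return requests.post(f"{BASE_URL}/cancel/{job_id}", headers=HEADERS).json()
```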
flash-singh
flash-singh•12mo ago
what's the endpoint ID?
marshall
marshallOP•12mo ago
Endpoint ID: isme01qeaw1yd4
Another case:
Endpoint ID: isme01qeaw1yd4
Request ID: dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1
Worker ID: vj8i7gy9eujei6
Worker Logs:
2023-12-27T04:56:23.704924482Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-stream/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:23.704980406Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished", "level": "INFO"}
2023-12-27T04:56:25.707261692Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/isme01qeaw1yd4/job-done/vj8i7gy9eujei6/dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1?gpu=NVIDIA+RTX+A5000", "level": "ERROR"}
2023-12-27T04:56:25.707312002Z {"requestId": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1", "message": "Finished.", "level": "INFO"}
Job Results (STATUS):
{
"delayTime": 1774,
"executionTime": 58332,
"id": "dc59efbb-a0b1-485b-947e-0a39c62d9bcc-u1",
"status": "CANCELLED"
}
(automatically cancelled after 1 minute)
flash-singh
flash-singh•12mo ago
did you cancel it?
marshall
marshallOP•12mo ago
uh, yeah our systems now cancel jobs that take more than 1 minute (as a fail-safe)
flash-singh
flash-singh•12mo ago
got it, thanks
justin
justin•12mo ago
random: https://docs.runpod.io/docs/serverless-usage#--execution-policy
If you're not already using it, there's an execution policy you can add to your request payloads. I added it to mine because this support ticket made me aware I should haha
RunPod
🖇️ | Using Your Endpoint
The method in which jobs are submitted and returned.
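For reference, a rough sketch of attaching an execution policy to a request, going by the linked docs; the policy field names (executionTimeout, ttl) come from those docs, while the input payload and API key are placeholders:
```python
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder, use your own key
ENDPOINT_ID = "isme01qeaw1yd4"    # endpoint ID from this thread

payload = {
    "input": {"prompt": "hello"},       # placeholder, model-specific input
    "policy": {
        "executionTimeout": 60_000,     # give up on the job after 60 s (milliseconds)
        "ttl": 600_000,                 # drop the job if it sits unprocessed for 10 min
    },
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(resp.json())   # e.g. {"id": "...", "status": "IN_QUEUE"}
```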
marshall
marshallOP•12mo ago
that's actually great to know, we might try that xd