RunPod
• Created by marshall on 11/18/2024 in #⚡|serverless
Failed to get job. - 404 Not Found
The endpoint is receiving the jobs, but the worker errors out (worker logs below):
4 replies
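For context, the "Failed to get job" line appears to come from the runpod SDK's internal job-fetch loop rather than from handler code, so a 404 there usually points at job routing instead of the worker image. A minimal worker, for reference (the handler and input fields below are placeholders, not code from this thread):

    # Minimal RunPod serverless worker sketch (hypothetical handler, not this thread's code).
    # "Failed to get job. - 404 Not Found" appears to be logged by the SDK's job-fetch loop
    # before the handler is ever invoked, so handler bugs are unlikely to be the cause.
    import runpod

    def handler(job):
        # job["input"] holds the JSON payload submitted to the endpoint's /run route.
        prompt = job["input"].get("prompt", "")
        return {"echo": prompt}

    runpod.serverless.start({"handler": handler})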
RunPod
• Created by marshall on 10/12/2024 in #⚡|serverless
Jobs randomly dropping - {'error': 'request does not exist'}
RunPod worker errors (from lbeoz75vjlfck0):
The request ID does not show up on the Requests tab. The error also does not get logged to the daily statistics, as it seems to be a RunPod job routing issue, not a worker image runtime error.
What we receive from the endpoint:
Worker runpod SDK version: 1.6.2 (might update once https://discord.com/channels/912829806415085598/1293773578738864158 is fixed)
4 replies
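Client-side, a dropped job of this kind surfaces when polling the job's status. A rough sketch, assuming the standard serverless REST routes (the endpoint ID and API key below are placeholders):

    # Hypothetical reproduction sketch; ENDPOINT_ID and API_KEY are placeholders.
    # A job is submitted via /run and then polled via /status; a routing-dropped job is the
    # case where /status returns {'error': 'request does not exist'} instead of a job state.
    import time
    import requests

    ENDPOINT_ID = "your-endpoint-id"   # placeholder
    API_KEY = "your-api-key"           # placeholder
    BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}

    job = requests.post(f"{BASE}/run", json={"input": {"prompt": "hello"}}, headers=HEADERS).json()
    job_id = job["id"]

    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status.get("error") == "request does not exist":
            print(f"Job {job_id} was dropped by routing: {status}")
            break
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            print("Final status:", status)
            break
        time.sleep(2)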
RunPod
• Created by marshall on 2/3/2024 in #⚡|serverless
vllm + Ray issue: Stuck on "Started a local Ray instance."
Trying to run TheBloke/goliath-120b-AWQ on vLLM + RunPod with 2x 48GB GPUs. It gets stuck on "Started a local Ray instance." I've tried both with and without RunPod's FlashBoot. Has anyone encountered this issue before?
requirements.txt:
build script:
initialization code:
10 replies
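The requirements.txt, build script, and initialization code attached to the thread are not reproduced above. As a rough sketch of the setup being described: vLLM in the versions of that period starts a local Ray instance whenever tensor parallelism spans more than one GPU, which is the step the worker hangs on. The model name below is from the thread; every other argument is an assumption:

    # Hypothetical initialization sketch, not the thread's actual code; engine arguments are assumptions.
    # With tensor_parallel_size > 1, vLLM brings up a local Ray instance to coordinate the GPU
    # workers, which produces the "Started a local Ray instance." line the worker is stuck on.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/goliath-120b-AWQ",  # model from the thread
        quantization="awq",                 # assumed, since the checkpoint is AWQ-quantized
        tensor_parallel_size=2,             # split across the 2x 48GB GPUs
    )

    print(llm.generate(["Hello"], SamplingParams(max_tokens=64))[0].outputs[0].text)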
RunPod
• Created by marshall on 12/24/2023 in #⚡|serverless
Issue with unresponsive workers
We just launched our model to production a few days ago, and this problem has already happened to us twice.
Problem: Unresponsive workers; most of them are "ready" but sit "idle" despite requests queuing up for MINUTES.
Expected Behavior: Idle workers should pick up requests as soon as they are waiting in the queue.
Actual Behavior: Workers stay idle, and the queue does not get processed, delaying requests for minutes.
New / Existing Problem: In our two days of experience, this has happened twice.
Steps to Reproduce: It comes down to chance; it tends to happen when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
Other Workers: RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6, RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h
Attempted Solutions:
- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as RTX A5000s
- Booting off some unresponsive workers (did nothing)
24 replies
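The stall described above (workers "ready"/"idle" while the queue grows) can also be watched from outside the worker image. A rough monitoring sketch, assuming the serverless /health route and its usual field names (the endpoint ID and API key are placeholders):

    # Hypothetical monitoring sketch; ENDPOINT_ID, API_KEY, and the response field names are assumptions.
    # If the "inQueue" job count keeps growing while "idle" workers stay above zero, the queue is
    # stalling on the platform side rather than inside the worker image.
    import time
    import requests

    ENDPOINT_ID = "your-endpoint-id"   # placeholder
    API_KEY = "your-api-key"           # placeholder
    URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}

    for _ in range(30):                # poll for roughly five minutes
        health = requests.get(URL, headers=HEADERS).json()
        print(health.get("jobs", {}), health.get("workers", {}))
        time.sleep(10)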