marshall
RunPod
Created by marshall on 11/18/2024 in #⚡|serverless
Failed to get job. - 404 Not Found
the endpoint is receiving the jobs but errors out (worker logs below):
2024-11-18T13:50:42.510726100Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientResponseError | Error Message: 404, message='Not Found', url='https://api.runpod.ai/v2/ihv956xmtmq9t3/job-take/etbm9mpkgsl6hd?gpu=NVIDIA+GeForce+RTX+3090&job_in_progress=0'", "level": "ERROR"}
2024-11-18T13:50:42.848129909Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientResponseError | Error Message: 404, message='Not Found', url='https://api.runpod.ai/v2/ihv956xmtmq9t3/job-take/etbm9mpkgsl6hd?gpu=NVIDIA+GeForce+RTX+3090&job_in_progress=0'", "level": "ERROR"}
2024-11-18T13:50:43.167233569Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientResponseError | Error Message: 404, message='Not Found', url='https://api.runpod.ai/v2/ihv956xmtmq9t3/job-take/etbm9mpkgsl6hd?gpu=NVIDIA+GeForce+RTX+3090&job_in_progress=0'", "level": "ERROR"}
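To isolate the problem, a stripped-down worker like the sketch below can help (this is a placeholder echo handler, not our production image; runpod.serverless.start is the SDK's documented worker entrypoint). If this handler completes the jobs it does receive while the "Failed to get job ... 404" lines keep appearing between jobs, the 404s are coming from the SDK's job-take polling rather than from the handler code:

# minimal_handler.py - minimal sketch of a RunPod serverless worker.
# The echo handler is a placeholder; it only exists to confirm that jobs
# which do reach the worker are taken and completed normally.
import runpod


def handler(job):
    # job["input"] carries whatever payload was sent to /run or /runsync.
    return {"echo": job.get("input", {})}


runpod.serverless.start({"handler": handler})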
4 replies
RunPod
Created by marshall on 10/12/2024 in #⚡|serverless
Jobs randomly dropping - {'error': 'request does not exist'}
RunPod worker errors:
2024-10-12T18:25:21.522075786Z {"requestId": "51124010-27f8-4cfa-b737-a50e6d436623-u1", "message": "Started.", "level": "INFO"}
2024-10-12T18:25:22.723756821Z {"requestId": "51124010-27f8-4cfa-b737-a50e6d436623-u1", "message": "Finished.", "level": "INFO"}
2024-10-12T18:27:09.433322101Z {"requestId": null, "message": "Failed to get job, status code: 404", "level": "ERROR"}
2024-10-12T18:27:09.602268203Z {"requestId": "b88fe3f1-1212-4eee-acda-e5c58626b69a-u1", "message": "Started.", "level": "INFO"}
2024-10-12T18:27:11.082924318Z {"requestId": "b88fe3f1-1212-4eee-acda-e5c58626b69a-u1", "message": "Finished.", "level": "INFO"}
2024-10-12T18:29:43.434273977Z {"requestId": null, "message": "Failed to get job, status code: 404", "level": "ERROR"}
2024-10-12T18:29:43.613420319Z {"requestId": "d964329c-2abc-4931-bd8e-53f7d5089d59-u1", "message": "Started.", "level": "INFO"}
2024-10-12T18:29:44.956554990Z {"requestId": "d964329c-2abc-4931-bd8e-53f7d5089d59-u1", "message": "Finished.", "level": "INFO"}
2024-10-12T18:29:49.734447718Z {"requestId": "4cc76d9e-7e65-4b3f-afaf-5382d0bd8dd6-u1", "message": "Started.", "level": "INFO"}
2024-10-12T18:29:50.975923513Z {"requestId": "4cc76d9e-7e65-4b3f-afaf-5382d0bd8dd6-u1", "message": "Finished.", "level": "INFO"}
From worker lbeoz75vjlfck0: the request ID does not show up on the Requests tab. The error also does not get logged to the daily statistics, since it appears to be a RunPod job-routing issue rather than a runtime error in the worker image. What we receive from the endpoint:
https://api.runpod.ai/v2/***/status/5c7ae484-b1df-4efd-a06d-a283b6d42e3a-u1 {'error': 'request does not exist'}
Worker runpod SDK version: 1.6.2; we might update once https://discord.com/channels/912829806415085598/1293773578738864158 is fixed.
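For context, the status check above comes from a polling loop roughly like the sketch below (RUNPOD_API_KEY and ENDPOINT_ID are placeholder environment variables, and the retry count is arbitrary). It retries briefly instead of treating 'request does not exist' as terminal, since /run returned the request ID only moments earlier:

# status_poll.py - sketch of the client-side polling that produced the
# {'error': 'request does not exist'} responses above.
import os
import time

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def poll_status(request_id: str, attempts: int = 5, delay: float = 2.0) -> dict:
    """Poll /status a few times; 'request does not exist' right after /run
    returned the ID suggests the job was dropped by routing, not by the worker."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{request_id}"
    last = {}
    for _ in range(attempts):
        last = requests.get(url, headers=HEADERS, timeout=30).json()
        if last.get("error") != "request does not exist":
            return last
        time.sleep(delay)
    return last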
4 replies
RunPod
Created by marshall on 2/3/2024 in #⚡|serverless
vllm + Ray issue: Stuck on "Started a local Ray instance."
Trying to run TheBloke/goliath-120b-AWQ on vllm + runpod with 2x48GB GPUs:
2024-02-03T12:36:44.148649796Z The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
2024-02-03T12:36:44.149745508Z
0it [00:00, ?it/s]
0it [00:00, ?it/s]
2024-02-03T12:36:44.406220237Z WARNING 02-03 12:36:44 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-02-03T12:36:46.465465797Z 2024-02-03 12:36:46,465 INFO worker.py:1724 -- Started a local Ray instance.
It's stuck on "Started a local Ray instance." I've tried both with and without RunPod's FlashBoot. Has anyone encountered this issue before? requirements.txt:
vllm==0.2.7
runpod==1.4.0
ray==2.9.1
build script:
from huggingface_hub import snapshot_download

snapshot_download(
    "TheBloke/goliath-120b-AWQ",
    local_dir="model",
    local_dir_use_symlinks=False,
)
initialization code:
import os

from vllm import AsyncLLMEngine, AsyncEngineArgs

llm = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="./model",
        quantization="awq",
        tensor_parallel_size=int(os.getenv("tensor_parallel_size", 1)),
    )
)
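For context, the engine is driven by an async handler roughly like the sketch below (the prompt/sampling fields are illustrative, not our exact production handler; it assumes the runpod SDK's async handler support and vllm 0.2.x's generate(prompt, sampling_params, request_id) signature). With tensor_parallel_size=2 on the 2x48GB workers, engine construction is where vllm starts Ray and hangs:

# handler.py - illustrative sketch of how the engine above is driven on a
# RunPod serverless worker (prompt/sampling fields are placeholders).
import os
import uuid

import runpod
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

llm = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="./model",
        quantization="awq",
        # 2 on the 2x48GB configuration; this is the setting that triggers Ray.
        tensor_parallel_size=int(os.getenv("tensor_parallel_size", 1)),
    )
)


async def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=job["input"].get("max_tokens", 256))
    request_id = str(uuid.uuid4())
    final = None
    # vllm 0.2.x: generate() yields partial RequestOutput objects; keep the last one.
    async for output in llm.generate(prompt, params, request_id):
        final = output
    return {"text": final.outputs[0].text}


runpod.serverless.start({"handler": handler})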
10 replies
RunPod
Created by marshall on 12/24/2023 in #⚡|serverless
Issue with unresponsive workers
We launched our model to production just a few days ago, and this problem has already happened to us twice.
Problem: Unresponsive workers; most of them are "ready" but sit "idle" despite requests queuing up for minutes.
Expected Behavior: Idle workers should pick up a request as soon as one is waiting in the queue.
Actual Behavior: Workers stay idle; the queue does not get processed and is delayed for minutes.
New / Existing Problem: In our two days of experience, this has happened twice.
Steps to Reproduce: It seems to come down to chance, when most RunPod GPUs are under heavy load and all 3090s are "throttled".
Relevant Logs:
Request ID: 1c90bd6a-0716-4b3c-8465-144d0b49d8be-u1
Worker: RTX A5000 - p5y3srv0gsjtjk
Latest Worker Log:
2023-12-24T21:16:46.461288541Z {"requestId": null, "message": "Failed to get job, status code: 500", "level": "ERROR"}
Other Workers: RTX A5000 - 217s1y508zuj48, RTX A5000 - vj8i7gy9eujei6
2023-12-24T04:39:48Z worker is ready
2023-12-24T04:39:48Z start container
2023-12-24T07:00:19Z stop container
2023-12-24T07:00:21Z remove container
2023-12-24T07:00:21Z remove network
RTX A5000 - 1ij40acwnngaxc, RTX A5000 - 3ysqauzbfjwd7h
2023-12-24T21:20:21Z worker is ready
2023-12-24T21:20:21Z worker is ready
Attempted Solutions:
- Maxing out the worker limit to 5 (as suggested by support staff)
- Using less in-demand GPUs such as RTX A5000s
- Booting off some unresponsive workers (did nothing)
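A simple probe for this condition is roughly the sketch below (assumptions: RUNPOD_API_KEY and ENDPOINT_ID environment variables, the /run and /status routes, the IN_QUEUE status value, and an arbitrary 120-second threshold). It submits a tiny job and flags the endpoint if the job is still queued after the threshold, i.e. workers show as idle but nothing gets picked up:

# queue_watchdog.py - sketch of a probe for the "idle workers, queued requests"
# condition described above. RUNPOD_API_KEY and ENDPOINT_ID are placeholders.
import os
import time

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["ENDPOINT_ID"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def probe(max_queue_seconds: int = 120) -> bool:
    """Return True if the probe job leaves the queue in time, False if it stalls."""
    run = requests.post(
        f"{BASE}/run", headers=HEADERS, json={"input": {"probe": True}}, timeout=30
    ).json()
    request_id = run["id"]
    deadline = time.time() + max_queue_seconds
    while time.time() < deadline:
        status = requests.get(
            f"{BASE}/status/{request_id}", headers=HEADERS, timeout=30
        ).json()
        if status.get("status") not in (None, "IN_QUEUE"):
            return True  # IN_PROGRESS / COMPLETED / FAILED: a worker picked it up
        time.sleep(5)
    return False  # still IN_QUEUE past the threshold: the unresponsive-worker symptom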
24 replies