RunPod · 10mo ago
Concept

vLLM worker error that doesn't time out

2024-02-01T18:08:19.928745487Z {"requestId": null, "message": "Traceback: Traceback (most recent call last):\n File \"/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_job.py\", line 55, in get_job\n async with session.get(_job_get_url()) as response:\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 1187, in __aenter__\n self._resp = await self._coro\n ^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 601, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client_reqrep.py\", line 965, in start\n message, payload = await protocol.read() # type: ignore[union-attr]\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py\", line 622, in read\n await self._waiter\naiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer\n", "level": "ERROR"}
2024-02-01T18:08:19.929440753Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer", "level": "ERROR"}
The worker ran for 20 hours stuck on this error. I had to kill both the worker and the job. What causes this?
6 Replies
Concept (OP) · 10mo ago
Is there a way to kill workers when they error?
Solution
ashleyk · 10mo ago
refresh_worker does that, but I don't think it works for the RunPod internal stuff. It's more for when your handler raises an Exception, but @Justin Merrell will have to confirm. I assume this is the latest version of the SDK?
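As a rough illustration of the handler-level case ashleyk describes, here is a minimal sketch of a wrapper that catches an exception in the handler and asks for the worker to be refreshed. The wrapper name with_refresh_on_error and the example handler are hypothetical, and the exact return keys ("error", "refresh_worker") are an assumption about how the SDK reads handler output, so check them against the SDK version you run. As noted above, this likely does nothing for failures inside RunPod's own job-fetch loop, such as the ClientOSError in the logs.

```python
from typing import Any, Callable, Dict

Job = Dict[str, Any]
Handler = Callable[[Job], Dict[str, Any]]


def with_refresh_on_error(handler: Handler) -> Handler:
    """Wrap a handler so an exception becomes an error result plus a refresh request."""

    def wrapped(job: Job) -> Dict[str, Any]:
        try:
            return handler(job)
        except Exception as exc:
            # Assumption: the SDK treats "error" as a failed job and
            # "refresh_worker": True as a request to restart the worker.
            return {
                "error": f"{type(exc).__name__}: {exc}",
                "refresh_worker": True,
            }

    return wrapped


def my_handler(job: Job) -> Dict[str, Any]:
    """Hypothetical handler: raises KeyError if the prompt is missing."""
    prompt = job["input"]["prompt"]
    return {"output": prompt.upper()}


safe_handler = with_refresh_on_error(my_handler)

# In a real worker this would be passed to the SDK, for example:
#   runpod.serverless.start({"handler": safe_handler})
```

Calling safe_handler({"input": {}}) returns a dict with an "error" string and "refresh_worker": True instead of letting the exception escape.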
Justin Merrell · 10mo ago
@Concept Are you using an existing worker, or did you launch your own custom endpoint?
Concept (OP) · 10mo ago
Existing worker on the newest SDK. I believe it was a JSON serialization error, which would be an error on my side, but the worker shouldn't keep running like that after erroring. I'm using the RunPod vLLM worker.
fredericp5433 · 10mo ago
I have the same problem. I think the problem is here:
File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_logger.py", line 81, in log
print(json.dumps(log_json), flush=True)
When log_json is not serializable, the logger itself raises, so the original error is never reported and the worker keeps running.
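To make the failure mode described above concrete: json.dumps raises TypeError when a log payload contains something it cannot encode, such as an exception object. The sketch below shows one defensive pattern using only the standard library. The function name safe_dumps is hypothetical and this is not the SDK's actual logger, just an illustration of how a log call can be kept from raising.

```python
import json
from typing import Any


def safe_dumps(log_json: Any) -> str:
    """Serialize a log payload to JSON without ever raising."""
    try:
        # Normal path: payload is already JSON-serializable.
        return json.dumps(log_json)
    except (TypeError, ValueError):
        pass
    try:
        # Fallback: replace unsupported values with their repr().
        return json.dumps(log_json, default=repr)
    except (TypeError, ValueError):
        # Last resort (e.g. unsupported dict keys or circular references):
        # log the repr of the whole payload as a single message.
        return json.dumps({"message": repr(log_json)})


# A payload that plain json.dumps would reject with TypeError.
bad_payload = {"message": ValueError("boom"), "level": "ERROR"}
print(safe_dumps(bad_payload), flush=True)
```

With a fallback like this, a non-serializable value degrades to a string in the log line rather than an exception inside the logging call.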