RunPod•15mo ago

VLLM Worker Error that doesn't time out.

2024-02-01T18:08:19.928745487Z {"requestId": null, "message": "Traceback: Traceback (most recent call last):\n  File \"/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_job.py\", line 55, in get_job\n    async with session.get(_job_get_url()) as response:\n  File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 1187, in __aenter__\n    self._resp = await self._coro\n                 ^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client.py\", line 601, in _request\n    await resp.start(conn)\n  File \"/usr/local/lib/python3.11/dist-packages/aiohttp/client_reqrep.py\", line 965, in start\n    message, payload = await protocol.read()  # type: ignore[union-attr]\n                       ^^^^^^^^^^^^^^^^^^^^^\n  File \"/usr/local/lib/python3.11/dist-packages/aiohttp/streams.py\", line 622, in read\n    await self._waiter\naiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer\n", "level": "ERROR"}
2024-02-01T18:08:19.929440753Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer", "level": "ERROR"}

The worker ran for 20 hours stuck on this error. I had to kill the worker and the job. What causes this?

Solution:

refresh_worker does it, but I don't think it works for the RunPod internal stuff. It's more for when your handler raises an exception, but @Justin Merrell will have to confirm. I assume this is the latest version of the SDK?
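For reference, a minimal sketch of the handler-side use of refresh_worker mentioned above. It assumes the SDK honours a "refresh_worker" key in the handler's return value (worth checking against the docs for your SDK version), and `run_inference` is a hypothetical placeholder rather than anything in the RunPod SDK. As noted, this would not help when the failure happens inside the SDK's own job-polling loop.

```python
# Sketch only: ask the platform to recycle the worker when the handler fails.
# `run_inference` is a hypothetical stand-in for the real model call.

def run_inference(job_input):
    if "prompt" not in job_input:
        raise ValueError("missing 'prompt'")
    return {"text": job_input["prompt"].upper()}


def handler(job):
    try:
        return run_inference(job["input"])
    except Exception as exc:
        # Assumption: a "refresh_worker" flag in the return value tells the
        # SDK to stop this worker once the job result has been reported.
        return {"refresh_worker": True, "job_results": {"error": str(exc)}}


# On a real worker the handler would be registered with:
#   import runpod
#   runpod.serverless.start({"handler": handler})
```

Catching the exception inside the handler keeps the error in the job result instead of relying on the SDK to report it.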

6 Replies

ConceptOP•15mo ago

Is there a way to kill workers when they error?

Solution

ashleyk•15mo ago

Justin Merrell•15mo ago

@Concept Are you using an existing worker, or did you launch your own custom endpoint?

ConceptOP•15mo ago

Existing worker on the newest SDK. I believe it was a JSON serialization error, which would be an error on my side, but it shouldn't keep running like that after erroring. I'm using the RunPod vLLM worker.

fredericp5433•15mo ago

I have the same problem. I think the problem is here:

File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_logger.py", line 81, in log
    print(json.dumps(log_json), flush=True)

When log_json is not serializable, it fails to report the error and keeps the worker running.
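To illustrate the failure mode described above: json.dumps raises TypeError when a log record contains something it cannot encode (an exception object, for example), so the log call itself blows up. Below is a rough sketch of a more tolerant serializer using only the standard library. `to_log_line` is a hypothetical helper name, not part of the RunPod SDK, and this is not how rp_logger.py is actually written.

```python
import json


def to_log_line(log_json):
    """Serialize a log record without letting bad values raise."""
    try:
        return json.dumps(log_json)
    except (TypeError, ValueError):
        pass
    try:
        # default=str turns unknown objects into their string form;
        # skipkeys drops dict keys that are not basic JSON key types.
        return json.dumps(log_json, default=str, skipkeys=True)
    except (TypeError, ValueError):
        # Last resort (e.g. circular references): log the repr instead.
        return json.dumps({"message": repr(log_json), "level": "ERROR"})


record = {"requestId": None, "message": ValueError("bad input"), "level": "ERROR"}
print(to_log_line(record), flush=True)
```

In your own handler, converting values to plain strings before they reach the logger should avoid the problem without touching the SDK.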
