vLLM Worker Error that doesn't time out.
Worker ran for 20 hours stuck on this error. Had to kill the worker and job. What causes this?
Is there a way to kill workers when they error?
Solution
refresh_worker does it, but I don't think it works for errors inside RunPod's internal code; it's more for when your handler raises an Exception. @Justin Merrell will have to confirm. I assume this is the latest version of the SDK?
@Concept Are you using an existing worker, or did you launch your own custom endpoint?
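For reference, here is a minimal sketch of how the refresh_worker suggestion above might look in a handler. It assumes the SDK honors a refresh_worker key in the dict the handler returns, as its documentation describes; run_inference is a hypothetical stand-in for the real model call, and the pre-check with json.dumps is an illustration, not something the SDK requires.

```python
import json


def run_inference(job_input):
    # Hypothetical stand-in for the real model call.
    return {"echo": job_input}


def handler(job):
    """Return the job output, or an error plus a worker-refresh request."""
    try:
        output = run_inference(job["input"])
        # Check serializability here, inside the handler, so a bad value
        # surfaces as a handler error instead of failing later in the SDK.
        json.dumps(output)
        return output
    except Exception as exc:
        # Assumption: the SDK reads "refresh_worker" from the returned dict
        # and restarts the worker once this job is finished.
        return {"error": str(exc), "refresh_worker": True}


# With the SDK installed, the handler would be registered like this:
# import runpod
# runpod.serverless.start({"handler": handler})
```

This only covers failures raised inside the handler; as noted above, it may not help when the failure happens in the SDK's own code.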
Existing worker on the newest SDK. I believe it was a JSON serialization error, which would be an error on my side, but the worker shouldn't keep running like that after erroring.
I'm using the RunPod vLLM worker.
I have the same problem. I think the problem is here:
File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_logger.py", line 81, in log
print(json.dumps(log_json), flush=True)
When log_json is not serializable, json.dumps raises, so the error is never reported and the worker keeps running.
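To illustrate the failure mode described above, here is a rough sketch of a logging helper that falls back instead of raising when the record is not JSON serializable. This is not the SDK's code; to_log_line and safe_log are hypothetical names, and the fallback strategy (stringify unknown values, then fall back to repr) is just one possible choice.

```python
import json


def to_log_line(record):
    """Serialize a log record to JSON, degrading gracefully on odd values."""
    try:
        return json.dumps(record)
    except (TypeError, ValueError):
        pass
    try:
        # default=str stringifies values json cannot encode;
        # skipkeys=True drops dict keys that are not basic types.
        return json.dumps(record, default=str, skipkeys=True)
    except (TypeError, ValueError):
        # Last resort, e.g. for circular references: log the repr.
        return json.dumps({"message": repr(record)})


def safe_log(record):
    # Mirrors the print call in the traceback, but with the safe serializer.
    print(to_log_line(record), flush=True)
```

For example, safe_log({"level": "ERROR", "message": ValueError("boom")}) prints a JSON line with the exception converted to its string form, rather than raising TypeError inside the logger.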