How to retire a worker and retry its job?
We're noticing that every so often, a worker gets corrupted, and doesn't produce correct output. It's easy enough for us to detect it inside the handler when it happens. Is there a built-in way to tell runpod the job failed, the worker is bad, and it should be refreshed and requeued? Or should I do this manually with "refresh_worker" and use the API to requeue?
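For reference, a minimal sketch of the "tell RunPod the job failed" part from inside a handler, with placeholder work and corruption-check functions standing in for your real ones; as far as I know, returning a dict with a top-level `error` key is what marks the job as failed:

```python
import runpod


def run_inference(job_input):
    # Placeholder for the real work that occasionally misbehaves
    return {"prediction": str(job_input.get("prompt", ""))[::-1]}


def output_looks_corrupted(result):
    # Placeholder for the check you already run inside the handler
    return result.get("prediction") == ""


def handler(job):
    result = run_inference(job["input"])
    if output_looks_corrupted(result):
        # A top-level "error" key marks this job as failed on RunPod's side
        return {"error": "worker produced corrupted output"}
    return {"output": result}


runpod.serverless.start({"handler": handler})
```

The refresh and requeue side is what the replies below get into.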
I don't think this solves your issue, but I wonder if adding a refresh_worker flag would help stop the corruption?
Maybe you have multiple jobs coming in and a previous job is causing issues for future jobs? No clue.
https://github.com/runpod/runpod-python/blob/5645bb1758c9725d7dd914f127df1047293b9d7c/docs/serverless/worker.md?plain=1#L29
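If I'm reading that worker.md section right, the blanket version of this is configured when the worker starts rather than in the return value; a sketch under that assumption, where `refresh_worker` in the start config makes the worker refresh after every job:

```python
import runpod


def handler(job):
    # Process job["input"] here; output is returned as usual
    return {"output": "Job completed successfully"}


# Assumption based on the linked worker.md: with refresh_worker set in the
# start config, the worker is refreshed after every job it completes.
runpod.serverless.start({"handler": handler, "refresh_worker": True})
```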
We used to have the refresh_worker flag, but that seemed to break autoscaling for us (we'd have 3000 jobs in the queue and one worker plugging away on all of them, even though we had several idle workers)
It would definitely fix the issue, though.
It's because our jobs are super bursty, and requests are scheduled when they come in, not dynamically as jobs complete (I don't know if this has changed).
So we schedule 1000 jobs all at once, the scheduler spawns workers to take them, and then once those workers refresh, they can't pick up additional jobs since no new jobs have come in.
Assuming your advanced settings for the endpoint workers are correct, it's weird that autoscaling breaks like that; it sounds like a RunPod staff issue.
https://github.com/runpod/runpod-python/blob/5645bb1758c9725d7dd914f127df1047293b9d7c/runpod/api/mutations/endpoints.py
But just giving it a shot and looking at their GraphQL mutations, it doesn't look like it would fix your issue. Hm 😢
Is this being done through the RunPod queue system?
flash mentioned it was a limitation of the current scheduler implementation, but I'm unclear if it's changed recently.
yeah, just hitting the serverless endpoint
Ah, I see. I'll leave it to flash to answer when they're online then, haha. But interesting 😮
It would be nice to be able to refresh an individual worker programmatically through the GraphQL mutations; at least from my reading, it doesn't seem possible.
Yes, you need to use `refresh_worker`, but you can do it automatically when an error is detected; you don't need to do anything manually.
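A minimal sketch of that automatic route, assuming placeholder work and corruption-check functions in place of your real ones; it combines the `error` return (fail the job) with the `refresh_worker` return key described in the accepted answer below (retire the worker):

```python
import runpod


def do_work(job_input):
    # Placeholder for the real processing
    return {"prediction": "ok"}


def looks_corrupted(result):
    # Placeholder for the check that already detects the bad output
    return result.get("prediction") is None


def handler(job):
    try:
        result = do_work(job["input"])
        if looks_corrupted(result):
            # Fail the job and ask RunPod to replace this worker afterwards
            return {"error": "corrupted output detected", "refresh_worker": True}
        return {"output": result}
    except Exception as exc:
        # Unexpected exceptions also fail the job and retire the worker
        return {"error": str(exc), "refresh_worker": True}


runpod.serverless.start({"handler": handler})
```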
Solution
Ohhh, ashelyk is right, you can just return it to do this selectively. I had only read how to do it on start, not in the return:
Return `refresh_worker=True` as a top-level dictionary key in the handler return. This can be used to selectively refresh the worker based on the job return.
Example:
```python
import runpod


def handler_with_selective_refresh(job):
    if job["input"].get("refresh", False):
        # Handle the job and return the output with the refresh_worker flag set
        return {"output": "Job completed successfully", "refresh_worker": True}
    else:
        # Handle the job and return the output as usual
        return {"output": "Job completed successfully"}


runpod.serverless.start({"handler": handler_with_selective_refresh})
```