RunPod · 12mo ago
r1

How to retire a worker and retry its job?

We're noticing that every so often, a worker gets corrupted, and doesn't produce correct output. It's easy enough for us to detect it inside the handler when it happens. Is there a built-in way to tell runpod the job failed, the worker is bad, and it should be refreshed and requeued? Or should I do this manually with "refresh_worker" and use the API to requeue?
justin · 12mo ago
I don't think this solves your issue, but I wonder: if you add a refresh_worker flag, would it help stop the corruption? Maybe you have multiple jobs coming in and a previous job is causing an issue for future jobs? No clue. https://github.com/runpod/runpod-python/blob/5645bb1758c9725d7dd914f127df1047293b9d7c/docs/serverless/worker.md?plain=1#L29
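For context, a minimal sketch of the start-time option being referenced here, based on that worker.md doc (the handler body is just a placeholder):

```python
import runpod


def handler(job):
    # Placeholder job logic; replace with your actual work
    return {"output": "Job completed successfully"}


# Per the linked worker.md, passing refresh_worker in the start config
# stops and replaces the worker after every job it finishes.
runpod.serverless.start({"handler": handler, "refresh_worker": True})
```

This is the always-refresh-after-every-job behavior; the selective per-job return flag comes up later in the thread.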
r1 (OP) · 12mo ago
We used to have the refresh_worker flag, but that seemed to break autoscaling for us (we'd have 3000 jobs in the queue and one worker plugging away at all of them, even though we had several idle workers). It would definitely fix the issue, though. It's because our jobs are super bursty, and requests are scheduled when they come in, not dynamically as jobs complete (I don't know if this has changed). So we schedule 1000 jobs all at once, the scheduler spawns workers to take them, and then once those workers refresh, they can't pick up additional jobs since no new jobs have come in.
justin · 12mo ago
Assuming your advanced settings for the endpoint worker are correct, I find it weird that autoscaling breaks; that sounds like an issue for RunPod staff. https://github.com/runpod/runpod-python/blob/5645bb1758c9725d7dd914f127df1047293b9d7c/runpod/api/mutations/endpoints.py But just giving it a shot and looking at their GraphQL mutations, it doesn't look like they would fix your issue. Hm 😢 Is this being done through the RunPod queue system?
r1 (OP) · 12mo ago
flash mentioned it was a limitation of the current scheduler implementation, but I'm unclear whether it's changed recently. Yeah, we're just hitting the serverless endpoint.
justin · 12mo ago
Ah, I see. I'll leave it to flash to answer when they're online then, haha. But interesting 😮 It does seem like it would be nice to be able to refresh an individual worker programmatically through the GraphQL mutations; at least from reading them, that doesn't seem possible.
ashleyk · 12mo ago
Yes, you need to use refresh_worker, but you can trigger it automatically when an error is detected; you don't have to do anything manually.
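A rough sketch of the pattern ashleyk is describing, assuming your corruption check can run inside the handler. Here `run_job` and `looks_corrupted` are hypothetical stand-ins for your own logic, and the `error` / `refresh_worker` return keys are the ones discussed in this thread (the `error` key is, as far as I recall, how the SDK reports a failed job):

```python
import runpod


def run_job(job_input):
    # Hypothetical stand-in for your actual job logic
    return {"result": job_input}


def looks_corrupted(result):
    # Hypothetical stand-in for your existing corruption check
    return result is None


def handler(job):
    result = run_job(job["input"])
    if looks_corrupted(result):
        # Report the job as failed and ask RunPod to retire this worker;
        # a fresh worker should pick up subsequent jobs.
        return {"error": "corrupted output detected", "refresh_worker": True}
    return {"output": result}


runpod.serverless.start({"handler": handler})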
Solution
justin · 12mo ago
Ohhh, ashleyk is right, you can just return a true value to do it selectively. I had only read how to do it on start, not on return: return refresh_worker=True as a top-level dictionary key in the handler return. This can be used to selectively refresh the worker based on the job return. Example:

```python
def handler_with_selective_refresh(job):
    if job["input"].get("refresh", False):
        # Handle the job and return the output with the refresh_worker flag
        return {"output": "Job completed successfully", "refresh_worker": True}
    else:
        # Handle the job and return the output
        return {"output": "Job completed successfully"}
```
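One caveat on the original question: the thread covers refreshing the worker, but as far as I can tell a job that returns an error isn't automatically retried, so the requeue half would still be a manual resubmit against the endpoint. A hedged sketch, assuming the standard serverless /run route and placeholder endpoint ID / API key:

```python
import requests

# Placeholders; substitute your real endpoint ID and RunPod API key
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"


def requeue(job_input):
    # Resubmit the original input as a new job on the same endpoint
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": job_input},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains the new job's id and status
```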