How to retire a worker and retry its job?
We're noticing that every so often, a worker gets corrupted, and doesn't produce correct output. It's easy enough for us to detect it inside the handler when it happens. Is there a built-in way to tell runpod the job failed, the worker is bad, and it should be refreshed and requeued? Or should I do this manually with "refresh_worker" and use the API to requeue?
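For reference, a minimal sketch of the "tell RunPod the job failed" part from inside a handler, with placeholder work and corruption-check functions standing in for your real ones; as far as I know, returning a dict with a top-level `error` key is what marks the job as failed:

```python
import runpod


def run_inference(job_input):
    # Placeholder for the real work that occasionally misbehaves
    return {"prediction": str(job_input.get("prompt", ""))[::-1]}


def output_looks_corrupted(result):
    # Placeholder for the check you already run inside the handler
    return result.get("prediction") == ""


def handler(job):
    result = run_inference(job["input"])
    if output_looks_corrupted(result):
        # A top-level "error" key marks this job as failed on RunPod's side
        return {"error": "worker produced corrupted output"}
    return {"output": result}


runpod.serverless.start({"handler": handler})
```

The refresh and requeue side is what the replies below get into.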
I don't think this solves your issue, but I wonder if adding a refresh_worker flag would help stop the corruption?
Maybe you have multiple jobs coming in and a previous job is causing issues for future jobs? No clue.
https://github.com/runpod/runpod-python/blob/5645bb1758c9725d7dd914f127df1047293b9d7c/docs/serverless/worker.md?plain=1#L29
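If I'm reading that worker.md section right, the blanket version of this is configured when the worker starts rather than in the return value; a sketch under that assumption, where `refresh_worker` in the start config makes the worker refresh after every job:

```python
import runpod


def handler(job):
    # Process job["input"] here; output is returned as usual
    return {"output": "Job completed successfully"}


# Assumption based on the linked worker.md: with refresh_worker set in the
# start config, the worker is refreshed after every job it completes.
runpod.serverless.start({"handler": handler, "refresh_worker": True})
```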
We used to have the refresh_worker flag, but that seemed to break autoscaling for us (we'd have 3000 jobs in the queue and one worker plugging away on all of them, even though we had several idle workers)
It would definitely fix the issue, though.
It's because our jobs are super bursty, and requests are scheduled when they come in, not dynamically as jobs complete (I don't know if this has changed).
So we schedule 1000 jobs all at once, the scheduler spawns workers to take them, and then once those workers refresh, they can't pick up additional jobs since no new jobs have come in.
Assuming your advanced settings for the endpoint workers are correct, it's weird that autoscaling breaks like that; it sounds like a RunPod staff issue.
https://github.com/runpod/runpod-python/blob/5645bb1758c9725d7dd914f127df1047293b9d7c/runpod/api/mutations/endpoints.py
But just giving it a shot and looking at their GraphQL mutations, it doesn't look like it would fix your issue. Hm 😢
Is this being done through the RunPod queue system?
flash mentioned it was a limitation of the current scheduler implementation, but I'm unclear if it's changed recently.
yeah, just hitting the serverless endpoint
Ah, I see. I'll leave it to flash to answer when they're online then, haha. But interesting 😮
It would be nice to be able to refresh an individual worker programmatically through the GraphQL mutations; at least from my reading, it doesn't seem possible.
Yes, you need to use `refresh_worker`, but you can do it automatically when an error is detected; you don't need to do anything manually.
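A minimal sketch of that automatic route, assuming placeholder work and corruption-check functions in place of your real ones; it combines the `error` return (fail the job) with the `refresh_worker` return key described in the accepted answer below (retire the worker):

```python
import runpod


def do_work(job_input):
    # Placeholder for the real processing
    return {"prediction": "ok"}


def looks_corrupted(result):
    # Placeholder for the check that already detects the bad output
    return result.get("prediction") is None


def handler(job):
    try:
        result = do_work(job["input"])
        if looks_corrupted(result):
            # Fail the job and ask RunPod to replace this worker afterwards
            return {"error": "corrupted output detected", "refresh_worker": True}
        return {"output": result}
    except Exception as exc:
        # Unexpected exceptions also fail the job and retire the worker
        return {"error": str(exc), "refresh_worker": True}


runpod.serverless.start({"handler": handler})
```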
Solution
Ohhh, ashelyk is right, you can just return it to do this selectively. I had only read how to do it on start, not in the return:
Return `refresh_worker=True` as a top-level dictionary key in the handler return. This can be used to selectively refresh the worker based on the job return.
Example:
```python
import runpod


def handler_with_selective_refresh(job):
    if job["input"].get("refresh", False):
        # Handle the job and return the output with the refresh_worker flag set
        return {"output": "Job completed successfully", "refresh_worker": True}
    else:
        # Handle the job and return the output as usual
        return {"output": "Job completed successfully"}


runpod.serverless.start({"handler": handler_with_selective_refresh})
```