RunPod
Simon (4w ago)

Serverless worker keeps failing

We run several serverless workers in parallel for inference. Sometimes a worker starts failing with OOM, and every subsequent run on that worker fails until the worker is terminated. We have noticed that the retries initiated by our backend always end up on the same worker: say we have 10 prompts and run one prompt per worker, the retries with the same prompt always land on the same worker. If a random worker were chosen every time this wouldn't be a problem, because retrying would eventually succeed, but since it's always the same worker, all the retries fail. How is the target worker selected? Is there a hash on the input? Or on the webhook? Can we add some random data to the input to always have a different worker selected?
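For what it's worth, the change we have in mind on our side looks roughly like this (a sketch only, not our real code: submit_job, the endpoint ID, and the nonce field name are made up, and it only helps if worker selection really does hash the serialized input):

import uuid
import requests

RUNPOD_API_KEY = "..."            # placeholder, read from env in practice
ENDPOINT_ID = "your-endpoint-id"  # placeholder endpoint ID

def submit_job(prompt: str) -> dict:
    # Add a fresh random nonce on every attempt so the serialized input
    # differs between retries. This only changes anything if the scheduler
    # hashes the input payload, which is exactly what we're asking about.
    payload = {"input": {"prompt": prompt, "nonce": str(uuid.uuid4())}}
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()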
4 Replies
profondob1ue (4w ago)
+1 on this. We are scaling like crazy and this is currently blocking us. @yhlong00000
vesper (4w ago)
I am not sure if it will work in production but I added this to my code:
import torch

def handler(job):
    try:
        return run_inference(job["input"])  # stand-in for the actual model call
    except RuntimeError as e:
        # If the GPU is holding more than ~12 GB, flag the worker for refresh
        # so RunPod tears it down after this job instead of reusing it.
        if torch.cuda.memory_allocated() / 1024**3 > 12:
            return {
                "refresh_worker": True,
                "job_results": {
                    "status": "FAILED",
                    "error": f"CUDA out of memory: {str(e)}"
                }
            }
        raise
I got this from here: https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls
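For completeness, the handler above is registered with the worker like this (standard runpod SDK boilerplate, nothing special beyond the refresh_worker flag):

import runpod

# Register the handler with the serverless worker. When the handler returns
# "refresh_worker": True, RunPod should terminate this worker after the job
# finishes instead of reusing it for the next request.
runpod.serverless.start({"handler": handler})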
Skoo Wu (3w ago)
I have the same issue. Right now I have to remove the worker manually; I want to call an API to remove the worker when it fails, but it doesn't work for me. 😢
vesper (3w ago)
So it failed and set refresh_worker to true, but the next request still failed with an out-of-memory error?
