Simon · 3mo ago

Serverless worker keeps failing

We run several serverless workers in parallel for inference. Sometimes a worker starts failing with OOM, and every following run on that same worker fails until the worker is terminated. We have noticed that the retries initiated by our backend always end up on the same worker: say we have 10 prompts and run one prompt per worker, then the retries for a given prompt always land on the same worker. If a random worker were picked each time, this wouldn't be a problem because retrying would eventually succeed, but since it's always the same worker, all the retries fail.

How is the target worker selected? Is there a hash on the input, or on the webhook? Can we add some random data to the input so that a different worker is selected each time?
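For reference, a minimal sketch of the workaround we have in mind, submitting to the endpoint's /run route with an extra random field on each retry. The retry_nonce field, endpoint ID and API key here are placeholders; whether input like this actually changes which worker is picked is exactly the open question.

import os
import uuid
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"          # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]    # placeholder

def submit_job(prompt: str) -> dict:
    # Attach a random nonce to the input on each retry, in the hope that it
    # changes which worker the job is routed to.
    payload = {
        "input": {
            "prompt": prompt,
            "retry_nonce": uuid.uuid4().hex,  # hypothetical extra field
        }
    }
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()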
5 Replies
profondob1ue · 3mo ago
+1 on this. We are scaling like crazy and this is currently blocking us. @yhlong00000
vesper · 3mo ago
I am not sure if it will work in production, but I added this to my code:

# Inside the handler: if allocated GPU memory is above 12 GB, fail the job
# and set refresh_worker so RunPod replaces this worker before the next job.
if torch.cuda.memory_allocated() / 1024**3 > 12:
    return {
        "refresh_worker": True,
        "job_results": {
            "status": "FAILED",
            "error": "CUDA out of memory: allocated GPU memory exceeded 12 GB"
        }
    }
I got this from here: https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls
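For context, here's roughly how that check could sit inside a full handler following those docs. This is just a sketch, not something I've tested in production; handler and run_inference are placeholder names, and the 12 GB threshold above is simply what fits my GPU.

import runpod
import torch

def handler(job):
    try:
        # Placeholder for the actual model call.
        return run_inference(job["input"])
    except torch.cuda.OutOfMemoryError as e:
        # Fail this job and ask RunPod to tear the worker down so the
        # retry doesn't land on a worker with a poisoned CUDA context.
        return {
            "refresh_worker": True,
            "job_results": {
                "status": "FAILED",
                "error": f"CUDA out of memory: {e}",
            },
        }

runpod.serverless.start({"handler": handler})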
Skoo Wu · 3mo ago
I have the same issue. Right now I have to remove the worker manually; I want to call an API to remove the worker when it fails, but it doesn't work for me. 😢
vesper · 3mo ago
So it failed and set refresh_worker to true, but the next request still failed with an out-of-memory error?
juergengunz · 2mo ago
I have the same issue, is there any way to fix this?
