RunPod
Simon (4w ago)

Serverless worker keeps failing

We run several serverless workers in parallel for inference. Sometimes a worker starts failing with OOM, and every subsequent run on that worker fails until the worker is terminated. We have noticed that the retries initiated by our backend always end up on the same worker: say we have 10 prompts and run one prompt per worker, the retries with the same prompt always land on the same worker. If a random worker were chosen every time this wouldn't be a problem, because retrying would eventually succeed, but since it's always the same worker, all the retries fail. How is the target worker selected? Is there a hash on the input? Or on the webhook? Can we add some random data to the input to always have a different worker selected?
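For what it's worth, the change we have in mind on our side looks roughly like this (a sketch only, not our real code: submit_job, the endpoint ID, and the nonce field name are made up, and it only helps if worker selection really does hash the serialized input):

import uuid
import requests

RUNPOD_API_KEY = "..."            # placeholder, read from env in practice
ENDPOINT_ID = "your-endpoint-id"  # placeholder endpoint ID

def submit_job(prompt: str) -> dict:
    # Add a fresh random nonce on every attempt so the serialized input
    # differs between retries. This only changes anything if the scheduler
    # hashes the input payload, which is exactly what we're asking about.
    payload = {"input": {"prompt": prompt, "nonce": str(uuid.uuid4())}}
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()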
4 Replies
profondob1ue (4w ago)
+1 on this. We are scaling like crazy and this is currently blocking us. @yhlong00000
vesper (4w ago)
I am not sure if it will work in production but I added this to my code:
import torch

def handler(job):
    try:
        return run_inference(job["input"])  # stand-in for the actual model call
    except RuntimeError as e:
        # If the GPU is holding more than ~12 GB, flag the worker for refresh
        # so RunPod tears it down after this job instead of reusing it.
        if torch.cuda.memory_allocated() / 1024**3 > 12:
            return {
                "refresh_worker": True,
                "job_results": {
                    "status": "FAILED",
                    "error": f"CUDA out of memory: {str(e)}"
                }
            }
        raise
I got this from here: https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls
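For completeness, the handler above is registered with the worker like this (standard runpod SDK boilerplate, nothing special beyond the refresh_worker flag):

import runpod

# Register the handler with the serverless worker. When the handler returns
# "refresh_worker": True, RunPod should terminate this worker after the job
# finishes instead of reusing it for the next request.
runpod.serverless.start({"handler": handler})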
Skoo Wu (3w ago)
I have the same issue. Right now I have to remove the worker manually; I want to call an API to remove the worker when it fails, but it doesn't work for me. 😢
vesper (3w ago)
So it failed and set refresh_worker to true, but the next request still failed with an out-of-memory error?
