Serverless worker keeps failing
We run several serverless workers in parallel for inference.
Sometimes a worker starts failing with OOM, and every subsequent run on that same worker fails until the worker is terminated.
We have noticed that the retries initiated by our backend always land on the same worker. Say we have 10 prompts and run one prompt per worker; the retries for a given prompt always end up on the same worker.
If a random worker were picked each time, this wouldn't be a problem because retrying would eventually succeed, but since it's always the same worker, every retry fails.
How is the target worker selected? Is there a hash on the input? Or on the webhook? Can we add some random data to the input to always have a different worker selected?
+1 on this. We are scaling like crazy and this is currently blocking us @yhlong00000
I am not sure if it will work in production, but I added this to my code.
I got it from here:
https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls
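The snippet itself wasn't preserved in the post, but the linked docs page describes returning a `refresh_worker` flag from the handler, which matches what the later reply mentions. A minimal sketch of that pattern (here `run_inference` is a hypothetical stand-in for the actual model call):

```python
# Sketch of a RunPod serverless handler using the documented
# refresh_worker flag. run_inference is a hypothetical placeholder
# for the real inference call.

def run_inference(prompt):
    # Placeholder: the real worker would call the model here.
    if prompt == "boom":
        raise MemoryError("CUDA out of memory")
    return f"completion for: {prompt}"

def handler(job):
    try:
        return {"output": run_inference(job["input"]["prompt"])}
    except Exception as exc:
        # Returning refresh_worker=True asks RunPod to terminate this
        # worker once the job finishes, so a retry should land on a
        # fresh worker instead of the poisoned one.
        return {"error": str(exc), "refresh_worker": True}

# In the actual worker this is wired up with:
# import runpod
# runpod.serverless.start({"handler": handler})
```

The idea is that a job which hits OOM flags its own worker for replacement, so the worker is recycled instead of receiving the retry.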
I have the same issue. Right now I have to remove the worker manually; I'd like to call an API to remove a worker when it fails.
It doesn't work for me. 😢
So the job failed, it set refresh_worker to true, but the next request still failed with an out-of-memory error?