Serverless instances die when run concurrently
I have a limit of 5 workers. But when I run 3 or so in parallel, often 1-2 of them will die randomly. Doesn't always happen though, not easily reproducible. Is this due to resource constraints?
Anyone else see it? What's the workaround?
6 Replies
Do you access any shared DB or files across the workers?
If there are no incoming requests, the worker stops automatically. If you configure an idle timeout of x seconds, the worker keeps running for that duration after completing its tasks and only then stops.
When you say the worker randomly stops, do you mean it’s still processing requests and then stops in the middle of one? Have you seen any error messages?
There is a shared network volume but nothing is written to it.
I understand that it stops after completing its task, but this is premature stopping. I haven't been able to capture error messages because I was away. I'm trying to catch it as it happens. RunPod doesn't persist logs either, so that doesn't help.
The serverless instance has a side effect: it updates state in a database when it completes, and in these cases that update never happens. It only happens occasionally. I catch exception conditions and notify myself, but no notification arrives in these failing cases. So I suspect the container is just shut down somehow.
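One way to make a silent death visible is to record a "started" state before the work begins, not only a state on completion. Any job that stays in "started" for too long then points to a worker that went away without raising anything. The following is only a rough sketch under assumptions: the `jobs` table, the state names, and the helpers `make_db`, `run_with_state`, and `stuck_jobs` are hypothetical, and an in-memory SQLite database stands in for whatever database the worker really updates.

```python
import sqlite3


def make_db():
    """Create a stand-in database (hypothetical schema, in-memory SQLite)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, state TEXT)")
    return db


def run_with_state(db, job_id, work):
    """Mark the job 'started', run it, then mark it 'completed' or 'failed'."""
    db.execute(
        "INSERT OR REPLACE INTO jobs (id, state) VALUES (?, 'started')", (job_id,)
    )
    db.commit()
    try:
        result = work()
    except Exception:
        # The worker is still alive here, so the failure can be recorded.
        db.execute("UPDATE jobs SET state = 'failed' WHERE id = ?", (job_id,))
        db.commit()
        raise
    db.execute("UPDATE jobs SET state = 'completed' WHERE id = ?", (job_id,))
    db.commit()
    return result


def stuck_jobs(db):
    """Jobs that started but never reported an outcome."""
    rows = db.execute(
        "SELECT id FROM jobs WHERE state = 'started' ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]
```

A periodic check of `stuck_jobs` from outside the worker would then list the jobs whose container disappeared mid-run, even when no exception handler ever fired.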
https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls#update-progress
You can try adding progress_update to track the job's status. Also, your system should have logic to handle failed requests, network issues, and so on.
Additional controls | RunPod Documentation
Send progress updates during job execution using the runpod.serverless.progress_update function, and refresh workers for long-running or complex jobs by returning a dictionary with a 'refresh_worker' flag in your handler.
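As a minimal sketch of what that could look like, the handler below reports progress after each unit of work. The progress function is passed in as a parameter so the example runs without the runpod package installed; `make_handler`, the `steps` input field, and the message text are made up for illustration.

```python
def make_handler(report_progress):
    """Build a handler that reports progress through report_progress(job, message)."""

    def handler(job):
        # "steps" is a hypothetical input field used only for this sketch.
        steps = job["input"].get("steps", 3)
        for i in range(1, steps + 1):
            # ... one unit of the real work would go here ...
            report_progress(job, f"step {i}/{steps} done")
        return {"steps_done": steps}

    return handler


# In an actual RunPod worker the wiring would presumably look like this:
#
#   import runpod
#   handler = make_handler(runpod.serverless.progress_update)
#   runpod.serverless.start({"handler": handler})
```

If a worker dies mid-job, the last progress message received shows roughly how far it got before it stopped.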
OOM is possible, but yeah still not sure why it dies
Thanks all. Will try these things. I think if CUDA OOM or other exceptions happen, my exception handler will catch them now and email/notify me with a stack trace. So let's see. If it silently dies, it may be something else. Will report back.
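A catch-and-notify wrapper along those lines might look like the sketch below. It is an assumption-laden illustration, not the poster's actual code: `notify_on_error` and the `notify` callable (which could send an email or a chat message) are hypothetical names. Note what it cannot catch: a process killed with SIGKILL, which is what the Linux kernel's OOM killer sends when host memory runs out, never reaches any `except` block, so a silent death is still possible with this in place.

```python
import functools
import traceback


def notify_on_error(notify):
    """Decorator factory: send the stack trace to notify(), then re-raise."""

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(job):
            try:
                return handler(job)
            except Exception:
                # format_exc() returns the full traceback of the active exception.
                notify(f"job {job.get('id')} failed:\n{traceback.format_exc()}")
                raise

        return wrapper

    return decorator
```

Re-raising keeps the job's failure visible to whatever calls the handler, while the notification carries the stack trace out of the container before its logs are lost.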