Serverless instances die when run concurrently

I have a limit of 5 workers, but when I run about 3 in parallel, 1-2 of them often die randomly. It doesn't always happen and isn't easily reproducible. Is this due to resource constraints? Has anyone else seen it? Is there a workaround?
6 Replies
P4jMepR · 3mo ago
Do you access any shared DB or files across the workers?
yhlong00000 · 3mo ago
If there are no incoming requests, the worker automatically stops. If you configure an idle timeout of x seconds, the worker will continue running for that duration after completing its tasks. When you say the worker randomly stops, do you mean it's still processing a request and stops in the middle of it? Have you seen any error messages?
superuser (OP) · 3mo ago
There is a shared network volume, but nothing is written to it. I understand that it stops after completing its task, but this is premature stopping. I haven't been able to capture error messages because I was away; I'm trying to catch it as it happens. RunPod doesn't persist logs either, so that doesn't help. The serverless instance updates state in a database as a side effect when it completes, which it does not do in these cases. It happens only occasionally. I catch exception conditions and notify myself, but that doesn't fire in these failing cases, so I suspect the container is just shut down somehow.
yhlong00000 · 3mo ago
https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls#update-progress you can try adding progress_update to track the job's status. Also, for your system, you might want logic to handle failed requests, network issues, etc.
Additional controls | RunPod Documentation
Send progress updates during job execution using the runpod.serverless.progress_update function, and refresh workers for long-running or complex jobs by returning a dictionary with a 'refresh_worker' flag in your handler.
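A minimal handler sketch along those lines; only `runpod.serverless.progress_update` and `runpod.serverless.start` come from the linked docs, while the guarded import, the `report` helper, and the inline "inference" placeholder are assumptions so the sketch runs even without the SDK installed:

```python
# Sketch: report progress at each stage of the job, so that if the worker
# is shut down prematurely, the last recorded stage narrows down where it died.
try:
    import runpod
except ImportError:  # lets the sketch run without the RunPod SDK
    runpod = None


def report(job, message):
    """Send a progress update via the SDK if available, else just log."""
    if runpod is not None:
        runpod.serverless.progress_update(job, message)
    else:
        print(f"[progress] {message}")


def handler(job):
    report(job, "started")
    result = {"echo": job["input"]}  # placeholder for the real workload
    report(job, "inference done")
    # commit the DB state update here, then report completion
    report(job, "state committed")
    return result


if __name__ == "__main__" and runpod is not None:
    runpod.serverless.start({"handler": handler})
```

If the dashboard shows "started" but never "state committed" for the failing jobs, that pins the death to mid-execution rather than idle shutdown.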
nerdylive · 3mo ago
OOM is possible, but yeah still not sure why it dies
superuser (OP) · 3mo ago
Thanks all, will try these things. I think if CUDA OOM or other exceptions happen, my exception handler will now catch them and email/notify me with a stack trace, so let's see. If it silently dies, it may be something else. Will report back.
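One thing worth noting about silent deaths: if the platform stops the container gracefully, the process typically receives SIGTERM first, and a signal trap can log that even though no Python exception is ever raised. A hard OOM kill sends SIGKILL, which nothing can catch, so total silence still points at the kernel OOM killer or a forced teardown. A minimal sketch, with the notify hook as a placeholder:

```python
# Trap SIGTERM so a graceful platform-initiated shutdown leaves a trace,
# even though it never raises a Python exception in the handler.
# SIGKILL (e.g. from the OOM killer) cannot be trapped at all.
import signal

shutdown_log = []


def on_sigterm(signum, frame):
    # Replace print with your real email/notify hook.
    shutdown_log.append(signum)
    print(f"worker received signal {signum}; shutting down mid-job")


signal.signal(signal.SIGTERM, on_sigterm)
```

If the trap fires, the container was told to stop; if even this stays silent, suspect SIGKILL.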