RunPod•5mo ago

Serverless instances die when concurrent

I have a limit of 5 workers, but when I run 3 or so in parallel, often 1-2 of them die randomly. It doesn't always happen and isn't easily reproducible. Is this due to resource constraints? Has anyone else seen it? What's the workaround?

6 Replies

P4jMepR•5mo ago

Do you access a shared DB or shared files across the workers?

yhlong00000•5mo ago

If there are no incoming requests, the worker automatically stops running. If you configure an idle timeout of x seconds, the worker keeps running for that duration after completing its tasks. When you say the worker randomly stops, do you mean it’s still processing a request and stops in the middle of it? Have you seen any error messages?

superuserOP•5mo ago

There is a shared network volume, but nothing is written to it. I understand that a worker stops after completing its task, but this is premature stopping. I haven't been able to capture error messages because I was away; I'm trying to catch it as it happens. RunPod doesn't persist logs either, so that doesn't help. The serverless instance has a side effect: it updates state in a database when it completes, and it does not do so in these cases. This happens only occasionally. I catch exception conditions and notify myself, but that doesn't happen in these failing cases either. So I suspect the container is simply being shut down somehow.
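One way to see which jobs died without a trace is to record a "started" state before the work begins, in addition to the completion update described above. A job that stays "started" for too long then points to a worker that stopped mid-request. The following is a minimal sketch, assuming a SQLite table; the `jobs` table, the `mark` and `find_stale` helpers, and the status names are all hypothetical and not part of RunPod.

```python
import sqlite3
import time


def open_db(path=":memory:"):
    # Hypothetical schema: one row per job with its last known status.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT, updated_at REAL)"
    )
    return conn


def mark(conn, job_id, status, now=None):
    # Insert or overwrite the job's status along with a timestamp.
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO jobs (job_id, status, updated_at) VALUES (?, ?, ?)",
        (job_id, status, now),
    )
    conn.commit()


def find_stale(conn, max_age_s, now=None):
    # Jobs still marked 'started' after max_age_s seconds likely died mid-run.
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT job_id FROM jobs "
        "WHERE status = 'started' AND updated_at < ? ORDER BY job_id",
        (now - max_age_s,),
    ).fetchall()
    return [row[0] for row in rows]
```

The handler would call `mark(conn, job_id, "started")` first and `mark(conn, job_id, "done")` last, and a separate periodic check could call `find_stale` to list the jobs that never finished.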

yhlong00000•5mo ago

https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls#update-progress You can try adding progress_update calls to track the job's status. Also, on your side, you might want logic to handle failed requests, network issues, etc.

Additional controls | RunPod Documentation

Send progress updates during job execution using the runpod.serverless.progress_update function, and refresh workers for long-running or complex jobs by returning a dictionary with a 'refresh_worker' flag in your handler.
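As a rough illustration of the suggestion above, the handler can report progress after each stage, so the last update received shows how far a job got before the worker disappeared. This is only a sketch: the stage names and the `report` parameter are hypothetical, and the reporter is injectable so the handler can run without the RunPod SDK installed.

```python
def print_progress(job, message):
    # Fallback reporter used when running outside RunPod.
    print(f"[{job.get('id')}] {message}")


def handler(job, report=print_progress):
    # Hypothetical stages; replace with the real steps of the workload.
    steps = ["load", "infer", "save"]
    for i, step in enumerate(steps, start=1):
        # ... do the actual work for this step here ...
        report(job, f"{i}/{len(steps)} {step}")
    return {"status": "done", "steps": len(steps)}


# In a real worker (assumes the runpod package is installed):
#   import runpod
#   runpod.serverless.start({
#       "handler": lambda job: handler(
#           job, report=runpod.serverless.progress_update
#       )
#   })
```

If a job's final recorded update is, say, "2/3 infer", that narrows the failure to the last stage rather than leaving it as a silent stop.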

nerdylive•5mo ago

OOM is possible, but yeah still not sure why it dies

superuserOP•5mo ago

Thanks all. Will try these things. I think if CUDA OOM or other exceptions happen, my exception handler will now catch them and email/notify me with a stack trace. So let's see. If it dies silently, it may be something else. Will report back.
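A catch-all wrapper along the lines described here might look like the sketch below. The `notify_on_error` decorator and the `notify` callable are hypothetical names (the real notifier could send an email), and returning a dict with an "error" key is an assumption about how failures should be surfaced.

```python
import functools
import traceback


def notify_on_error(notify):
    """Wrap a handler so any Python exception is reported with its stack trace."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(job):
            try:
                return fn(job)
            except Exception:
                # CUDA OOM in PyTorch is raised as a RuntimeError subclass,
                # so it is caught here like any other exception.
                notify(f"job {job.get('id')} failed:\n{traceback.format_exc()}")
                return {"error": "handler raised; see notification"}

        return wrapper

    return decorator
```

This only covers failures that surface as Python exceptions. If the host kills the process for exceeding system memory, the process receives SIGKILL and no handler runs, which would be consistent with the silent deaths described earlier in the thread.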
