2 active workers on serverless endpoint keep rebooting
We have 2 active workers on a serverless endpoint, sometimes the workers reboot at the same time for some reason, which causes major problems in our system.
```
2024-04-03T14:37:16Z create pod network
2024-04-03T14:37:16Z create container endpoint-image:1.2
2024-04-03T14:37:17Z start container
2024-04-03T15:27:23Z stop container
2024-04-03T15:27:24Z remove container
2024-04-03T15:27:24Z remove network
2024-04-03T15:27:30Z create pod network
2024-04-03T15:27:30Z create container endpoint-image:1.2
2024-04-03T15:27:30Z start container
2024-04-03T17:34:51Z stop container
2024-04-03T17:34:51Z remove container
2024-04-03T17:34:51Z remove network
```
Has anyone ever had this problem? How can we fix it?
RunPod version: 1.3.0
Base Docker image: python:3.11-slim
Our image version: 1.2
8 Replies
Your serverless worker needs a startup command; right now you're just running a plain Python Docker image.
Our Docker image already has a start command. Should I add one anyway in our RunPod template?
Not sure if you're saying you had this API working before and these two workers suddenly started doing this, or if you're trying to deploy serverless for the first time.
If the latter, and you're running into this issue while deploying, then as madiator said, make sure you're specifically calling handler.py, which needs a runpod.serverless.start() call in the file to be triggered.
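The usual shape is something like this (a generic sketch using the standard runpod SDK, not your actual file):

```python
import runpod

def handler(job):
    # job["input"] carries whatever payload was sent to the endpoint
    job_input = job["input"]
    # ... do the actual work here ...
    return {"output": job_input}

# Register the handler and start the serverless worker loop
runpod.serverless.start({"handler": handler})
```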
GitHub: runpodWhisperx/Dockerfile at master · justinwlin/runpodWhisperx
Are you doing that?
https://blog.runpod.io/serverless-create-a-basic-api/
An example of a RunPod blog post walking through the setup.
RunPod Blog: Serverless | Create a Custom Basic API
Thanks for the answer
Yes, I have a handler.py file with that call in it.
And in my Dockerfile, I have a start command.
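Roughly this shape (a sketch of the typical setup; the actual file wasn't captured here, so the names are assumed):

```dockerfile
FROM python:3.11-slim

WORKDIR /app
RUN pip install --no-cache-dir runpod
COPY handler.py .

# -u runs Python unbuffered so logs show up immediately in the worker console
CMD ["python", "-u", "handler.py"]
```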
Everything normally works fine, but now every X hours the active worker reboots for no reason at all.
Active workers can shuffle; that's normal. There is no single worker dedicated to being the active worker: it's a last-man-standing algorithm, meant to optimize for cost.