GPU Pod was down all night
Hi, we just woke up to a production issue: all our APIs were down because our pod shut down and apparently restarted for some reason. When we looked at it, we saw a "Maintenance Scheduled" notice for next week.
Can someone help us understand what the issue was and why the pod went down on its own?
Pod ID: clxu7lem3ph9xu
@Madiator2011 Could you help us with this issue?
Usually, even when a pod restarts, it should start the last running app automatically. Make sure to check the pod logs.
We have 3 different services running in the pod, and when it restarted, all of them had to be restarted.
The thing is, we also cannot see what happened in the pod or why it restarted. The only thing we see now is "Maintenance Scheduled".
Maintenance means the pod is going to be down for upgrades or fixes
What do you suggest we do in cases where the pod restarts for some reason or the machine has problems? When it restarts, how could we automate bringing all the services back up?
Is there any API available on RunPod where we can see whether a pod is down, active, etc.?
or can we trigger something
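One generic way to see whether the services are reachable, independent of any RunPod API, is to probe their HTTP ports from another machine. A minimal sketch, assuming each service answers HTTP on its own port; the URLs and ports below are placeholders, not values from this thread, and `is_up` / `check_all` are hypothetical helper names:

```python
import http.client
import urllib.error
import urllib.request

# Placeholder endpoints (assumptions): replace with the real URLs/ports of your pod.
SERVICES = {
    "sd-web-ui": "http://127.0.0.1:7860/",
    "text-generation-webui": "http://127.0.0.1:5000/",
    "custom-fastapi": "http://127.0.0.1:8000/",
}


def is_up(url, timeout=5.0):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (e.g. 401/404),
        # so the process behind the port is alive.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, or a broken reply.
        return False


def check_all(services, probe=is_up):
    """Probe every service and return a {name: bool} status map."""
    return {name: probe(url) for name, url in services.items()}


# Example usage (run from cron or a monitoring box):
#   for name, ok in check_all(SERVICES).items():
#       print(f"{name}: {'up' if ok else 'DOWN'}")
```

A check like this only tells you that a port stopped answering; it cannot tell you why the pod restarted, so the pod logs are still the place to look for the cause.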
Make a bash script to run all services on pod start.
I'm not sure what you are running, so I can't tell.
We are running sd-web-ui (for the API), text generation web UI for the LLM, and our custom FastAPI service on another port, etc.
All in a single pod?
2x4090
I mean single pod with 2x4090
or two pods with single 4090 each
single pod with 2x4090
You will probably need to make your own custom startup script, like this:
https://github.com/runpod/containers/blob/main/container-template/start.sh
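The linked start.sh is a bash script. The same idea, starting every service on boot and bringing it back if it dies, can also be sketched in Python. This is a minimal sketch and not RunPod's own script: the wrapper script paths under `/workspace` are hypothetical placeholders for whatever commands launch your three services, and `supervise` is an illustrative name.

```python
import subprocess
import time

# Hypothetical launch commands (assumptions): point these at your real start scripts.
SERVICES = {
    "sd-web-ui": ["bash", "/workspace/start_sd_webui.sh"],
    "text-generation-webui": ["bash", "/workspace/start_textgen.sh"],
    "custom-fastapi": ["bash", "/workspace/start_fastapi.sh"],
}


def supervise(services, poll_seconds=5.0, max_cycles=None):
    """Start every service, then restart any that exits.

    max_cycles=None loops forever; a number stops after that many polls
    (useful for trying the script out).
    """
    procs = {name: subprocess.Popen(cmd) for name, cmd in services.items()}
    restarts = {name: 0 for name in services}
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        time.sleep(poll_seconds)
        for name, proc in list(procs.items()):
            # poll() returns None while the process is still running.
            if proc.poll() is not None:
                print(f"{name} exited with code {proc.returncode}, restarting")
                procs[name] = subprocess.Popen(services[name])
                restarts[name] += 1
        cycles += 1
    return procs, restarts


# To run on pod start, call this at the end of the file:
#   supervise(SERVICES)
```

You would still need to have the container run this file when it starts, for example from your own copy of start.sh. A process manager such as supervisord would do the same job with less custom code.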
I see, makes sense, I will have a look