RunPod10mo ago
Ercan

Urgent Prod Issue

Pod is stuck, and not restarting
32 Replies
Ercan
ErcanOP10mo ago
Hi @Madiator2011 [EU], we have a prod issue: our pod is suddenly stuck and cannot be restarted either. Nothing is working.
Madiator2011
Madiator201110mo ago
logs?
Ercan
ErcanOP10mo ago
2024-02-16T16:32:15Z start container
2024-02-16T16:32:23Z restart container
2024-02-16T16:34:13Z restart container
2024-02-20T15:44:47Z stop container
2024-02-20T15:45:07Z stop container
2024-02-20T15:45:14Z remove container
2024-02-20T15:45:14Z create container runpod/stable-diffusion:fast-stable-diffusion-2.4.0
2024-02-20T15:45:14Z fast-stable-diffusion-2.4.0 Pulling from runpod/stable-diffusion
2024-02-20T15:45:14Z Digest: sha256:c3f7815767b580ac7d2994e46e461ba3cf85b1610eff1301fb341d27451c7033
2024-02-20T15:45:14Z Status: Image is up to date for runpod/stable-diffusion:fast-stable-diffusion-2.4.0
2024-02-20T15:45:16Z start container
And pod id: g9tct1kr323mh0
Madiator2011
Madiator201110mo ago
what are other logs
Ercan
ErcanOP10mo ago
Okay, now this log is also gone. It only says "No Container logs yet", and the same for system logs. There is nothing I can see right now, and I have no idea what is happening with the pod.
Madiator2011
Madiator201110mo ago
0 GPU?
Ercan
ErcanOP10mo ago
No, I don't think so. There are GPUs available; at least it gives me the option to restart with them, but restart does not work either.
@Madiator2011 [EU] Okay, it just started working and gave me a warning that this machine has been reported for some reason and that the root cause is being investigated.
justin
justin10mo ago
You can ping Flash if it cost you money, and you can get a refund in credits if necessary. Sounds like a technical issue.
Ercan
ErcanOP10mo ago
Money is not the problem; we are already running more than 3 GPUs 24/7. We just need a way to get quick help in cases like this. I generally write to the website support, but it is offline all the time.
justin
justin10mo ago
Yeah, I think Discord works quite well for these sorts of issues: ping flash or a staff member like madiator / Justin Merrell when they are online during PST hours. Usually they just need the pod ID plus whether it is secure or community cloud.
Ercan
ErcanOP10mo ago
Yes, so far so good, they are helping out. And fast response in discord.
ashleyk
ashleyk10mo ago
Problem is RunPod support is only available during US hours most of the time and the rest of us in the rest of the world can't get our issues attended to when we have a critical problem and have to wait for the US to come online. Community members can only assist so far, and staff like @Madiator2011 [EU] only have limited access and can't help us with a lot of our issues. We really need people to be available to help resolve critical issues 24/7. I have to be on standby/callout 24/7 and people call me at 3am etc when there is a production issue. RunPod need to offer the same. I have mentioned this several times before but it has still not happened, and this makes RunPod very much less appealing for using as a production service when we can't get immediate support when we have a production issue.
Madiator2011
Madiator201110mo ago
I'm usually up most of the time
ashleyk
ashleyk10mo ago
Yeah but you don't have access to everything when we have an issue 😢
Ercan
ErcanOP10mo ago
Yes, def agree. We would be really OK with paying for a subscription-style plan like "Professional" etc., and it could include these support packages.
ashleyk
ashleyk10mo ago
Yeah good idea to offer premium support for paying extra 👍 cc: @JM
Ercan
ErcanOP10mo ago
Justin, do you know anyone we can tag right now? All our production pods suddenly went down and are showing no status. I have no idea what's happening these days.
@justin [Not Staff] Okay, the pods are suddenly back in their previous states as if nothing happened. I believe there was a network issue on RunPod's system.
justin
justin10mo ago
what do u mean suddenly went down? want to share the endpoint so staff can look into it? can u share time frame too? @flash-singh / @Justin Merrell prob ur best bet
flash-singh
flash-singh10mo ago
i see there was a network blip about 30 mins ago but it's been back online for the past 20 mins
Ercan
ErcanOP10mo ago
Yes exactly that time!
flash-singh
flash-singh10mo ago
the ui doesn't show anything? is it working now?
Ercan
ErcanOP10mo ago
Pods disappeared for 5-6 minutes from dashboard. Yes they are working fine now. But would be great to have some visibility in these cases, so our product team don't panic : )
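One way to get this kind of visibility on the user side, independent of the dashboard, is to probe each pod on a schedule and collapse the results into outage windows that can be shared with staff. The following is only a rough sketch: the `Outage` type and `find_outages` function are hypothetical names, and how the probe samples are collected (HTTP check, TCP check, interval) is left as an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Outage:
    start: float          # timestamp of the first failed probe
    end: Optional[float]  # timestamp of the first successful probe afterwards, None if still down


def find_outages(samples: Iterable[Tuple[float, bool]]) -> List[Outage]:
    """Collapse time-ordered (timestamp, is_up) probe samples into outage windows."""
    outages: List[Outage] = []
    current: Optional[Outage] = None
    for ts, is_up in samples:
        if not is_up and current is None:
            # First failed probe opens a new outage window.
            current = Outage(start=ts, end=None)
        elif is_up and current is not None:
            # First successful probe closes the open window.
            current.end = ts
            outages.append(current)
            current = None
    if current is not None:
        # Still down at the end of the samples: report an open-ended outage.
        outages.append(current)
    return outages
```

With one probe per minute, a 5-6 minute disappearance like the one described would show up as a single `Outage` whose `start` and `end` give the time frame staff usually ask for.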
flash-singh
flash-singh10mo ago
i see the blip lasted about 5 mins
Ercan
ErcanOP10mo ago
I also checked https://uptime.runpod.io/ directly, but it looks like that network blip is not reflected there.
flash-singh
flash-singh10mo ago
those are global services, single gpu servers don't impact the reliability of those. we have plans to implement dc level reliability metrics there, that will help if a whole dc gets impacted
Ercan
ErcanOP10mo ago
Great to hear!
flash-singh
flash-singh10mo ago
also you're using community cloud. given that this caused a prod issue for you, i would highly encourage you to use secure cloud or improve the HA of your stack
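As a very small illustration of what "improve HA" could mean on the client side, the sketch below tries several pod endpoints in order and returns the first successful result, so that one unreachable machine does not take the product down. It is not RunPod's API: `call_with_failover` is a hypothetical helper, the endpoint URLs are placeholders, and the `request` callable stands in for whatever client code actually talks to the pod.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    errors: List[Tuple[str, Exception]] = []
    for url in endpoints:
        try:
            return request(url)
        except Exception as exc:  # broad on purpose: any failure moves on to the next endpoint
            errors.append((url, exc))
    # Every endpoint failed (or none were given): surface all collected errors.
    raise RuntimeError(f"all {len(endpoints)} endpoints failed: {errors}")
```

This only helps if the endpoints live on different machines, ideally in different data centers, which is the part the stack itself has to provide.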
komet98
komet9810mo ago
2024-02-23T05:18:21Z error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp 54.236.113.205:443: connect: no route to host
I cannot restart my pod. @flash-singh any incident currently going on?
pod_id: fvz9rf9l3lme2a
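The "no route to host" part of that error means the TCP connection to the registry could not be made at all, before any image data was requested. From a machine you control, a quick check like the sketch below can tell whether a given host and port is reachable. The helper name `can_reach` is hypothetical, and running it from your own machine only tests your network, not the pod host's.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and "no route to host".
        return False
```

For the registry in the log above, the call would be `can_reach("registry-1.docker.io", 443)`.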
ashleyk
ashleyk10mo ago
I also had this issue yesterday, very annoying because we are charged while this goes into an infinite loop but the pod never comes up.
Ercan
ErcanOP10mo ago
Hi again, some weird things are happening with the pods again. Seems like network issues again.
ashleyk
ashleyk10mo ago
What things? Which region?
Ercan
ErcanOP10mo ago
US.
So weird, the same things are happening: pod statuses are disappearing from the dashboard again, and during that time we cannot connect to them.
@ashleyk Do you know anyone who might be awake and able to help right now? I think this is a general network issue again.
Things are getting back to normal.
@flash-singh, when you are available, could you confirm whether there was a network blip again at 3:00am - 3:20am EST?
Update: Network issues seem to be gone for the last 2 days. Everything has been working stably.