RunPod•15mo ago

Urgent Prod Issue

Pod is stuck, and not restarting

32 Replies

ErcanOP•15mo ago

Hi @Madiator2011 [EU], we have a prod issue: our pod is suddenly stuck and won't restart either. Nothing is working.

Madiator2011•15mo ago

logs?

ErcanOP•15mo ago

2024-02-16T16:32:15Z start container
2024-02-16T16:32:23Z restart container
2024-02-16T16:34:13Z restart container
2024-02-20T15:44:47Z stop container
2024-02-20T15:45:07Z stop container
2024-02-20T15:45:14Z remove container
2024-02-20T15:45:14Z create container runpod/stable-diffusion:fast-stable-diffusion-2.4.0
2024-02-20T15:45:14Z fast-stable-diffusion-2.4.0 Pulling from runpod/stable-diffusion
2024-02-20T15:45:14Z Digest: sha256:c3f7815767b580ac7d2994e46e461ba3cf85b1610eff1301fb341d27451c7033
2024-02-20T15:45:14Z Status: Image is up to date for runpod/stable-diffusion:fast-stable-diffusion-2.4.0
2024-02-20T15:45:16Z start container

And pod id: g9tct1kr323mh0

Madiator2011•15mo ago

what are other logs

ErcanOP•15mo ago

Okay, now this log is gone too. It only says "No Container logs yet", and there are no system logs either. There is nothing I can see right now, so I have no idea what is happening with the pod.

Madiator2011•15mo ago

0 GPU?

ErcanOP•15mo ago

No, I don't think so. There are GPUs available; at least it gives me the option to restart with them, but restart does not work either.

@Madiator2011 [EU] Okay, it just started working and gave me a warning that this machine has been reported for some reason and the root cause is being investigated.

J.•15mo ago

You can ping Flash if it cost you money and get a refund in credits if necessary. Sounds like a technical issue.

ErcanOP•15mo ago

Money is not the problem; we are already running more than 3 GPUs 24/7. We just need a way to get quick help in cases like these. I generally write to the website support, but it is offline all the time.

J.•15mo ago

Yeah, I think Discord, and pinging flash or a staff member like madiator / Justin Merrell for these sorts of issues when they are online in PST, is quite decent. Usually they just need the pod ID + whether it's secure or community cloud.

ErcanOP•15mo ago

Yes, so far so good, they are helping out. And the response on Discord is fast.

ashleyk•15mo ago

Problem is RunPod support is only available during US hours most of the time and the rest of us in the rest of the world can't get our issues attended to when we have a critical problem and have to wait for the US to come online. Community members can only assist so far, and staff like @Madiator2011 [EU] only have limited access and can't help us with a lot of our issues. We really need people to be available to help resolve critical issues 24/7. I have to be on standby/callout 24/7 and people call me at 3am etc when there is a production issue. RunPod need to offer the same. I have mentioned this several times before but it has still not happened, and this makes RunPod very much less appealing for using as a production service when we can't get immediate support when we have a production issue.

Madiator2011•15mo ago

I'm usually up most of the time

ashleyk•15mo ago

Yeah but you don't have access to everything when we have an issue 😢

ErcanOP•15mo ago

Yes, definitely agree. We would be happy to pay for a subscription-style plan like "Professional" etc., and it could include these support packages.

ashleyk•15mo ago

Yeah good idea to offer premium support for paying extra 👍 cc: @JM

ErcanOP•15mo ago

Justin, do you know anyone we can tag right now? All our production pods suddenly went down and are showing no status. I have no idea what's happening these days.

@justin [Not Staff] Okay, the pods are suddenly back in their existing states as if nothing happened. I believe there was a network issue on RunPod's system.

J.•15mo ago

what do u mean suddenly went down? want to share the endpoint so staff can look into it? can u share time frame too? @flash-singh / @Justin Merrell prob ur best bet

flash-singh•15mo ago

i see there was a network blip about 30 mins ago, but it's been back online for the past 20 mins

ErcanOP•15mo ago

Yes exactly that time!

flash-singh•15mo ago

the ui doesn't show anything? is it working now?

ErcanOP•15mo ago

Pods disappeared for 5-6 minutes from dashboard. Yes they are working fine now. But would be great to have some visibility in these cases, so our product team don't panic : )

flash-singh•15mo ago

i see the blip lasted about 5 mins

ErcanOP•15mo ago

I also checked https://uptime.runpod.io/ directly, but it looks like that network blip is not reflected there.

flash-singh•15mo ago

those are global services; single gpu servers don't impact the reliability of those. we have plans to implement dc-level reliability metrics there, which will help if a whole dc gets impacted

ErcanOP•15mo ago

Great to hear!

flash-singh•15mo ago

also you're using community cloud. given that this caused a prod issue for you, i would highly encourage you to use secure cloud or improve the HA of your stack
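
One way to read the "improve HA" suggestion is client-side failover across more than one pod. The following is only a minimal sketch under stated assumptions: the endpoint URLs and the `/health` route are hypothetical, nothing here is a RunPod API, and `http_probe` and `pick_healthy` are illustrative names.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, no route to host, timeout, HTTP error, bad URL.
        return False


def pick_healthy(
    endpoints: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail."""
    for url in endpoints:
        if probe(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical health URLs for two pods serving the same workload.
    pods = [
        "https://pod-a.example.invalid/health",
        "https://pod-b.example.invalid/health",
    ]
    target = pick_healthy(pods)
    print(target or "no healthy pod, alert the on-call")
```

Listing pods from different machines or regions means a single-host or single-network blip leaves at least one candidate, and a `None` result is a natural trigger for an alert.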

komet98•15mo ago

2024-02-23T05:18:21Z error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp 54.236.113.205:443: connect: no route to host

I cannot restart my pod. @flash-singh any incident currently going on? pod_id: fvz9rf9l3lme2a

ashleyk•15mo ago

I also had this issue yesterday, very annoying because we are charged while this goes into an infinite loop but the pod never comes up.

ErcanOP•15mo ago

Hi again, some weird things are happening with the pods again; seems like network issues again.

ashleyk•15mo ago

What things? Which region?

ErcanOP•14mo ago

US

So weird, the same things are happening: pod statuses are disappearing from the dashboard again, and during that time I cannot connect to them.

@ashleyk Do you know anyone who might be awake and able to help right now? I think this is a general network issue again.

Things are getting back to normal.

Maybe when you are available, @flash-singh, could you confirm whether there was a network blip again at 3:00am - 3:20am EST?

Update: Network issues seem to be gone for the last 2 days. Everything has been working stably.
