Hi @Madiator2011 [EU]
We have a prod issue where our pod is suddenly stuck and cannot be restarted either. Nothing is working.
logs?
2024-02-16T16:32:15Z start container
2024-02-16T16:32:23Z restart container
2024-02-16T16:34:13Z restart container
2024-02-20T15:44:47Z stop container
2024-02-20T15:45:07Z stop container
2024-02-20T15:45:14Z remove container
2024-02-20T15:45:14Z create container runpod/stable-diffusion:fast-stable-diffusion-2.4.0
2024-02-20T15:45:14Z fast-stable-diffusion-2.4.0 Pulling from runpod/stable-diffusion
2024-02-20T15:45:14Z Digest: sha256:c3f7815767b580ac7d2994e46e461ba3cf85b1610eff1301fb341d27451c7033
2024-02-20T15:45:14Z Status: Image is up to date for runpod/stable-diffusion:fast-stable-diffusion-2.4.0
2024-02-20T15:45:16Z start container
And pod id: g9tct1kr323mh0
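For anyone who wants to spot a restart loop in an event log like the one pasted above, here is a minimal Python sketch. It assumes the log is available as plain text with one `<UTC timestamp> <message>` event per line, as in the paste. The function names `parse_events` and `count_restarts` and the 5-minute window are illustrative choices, not part of any RunPod tooling.

```python
from datetime import datetime, timezone

def parse_events(log_text):
    """Return (timestamp, message) pairs from lines like
    '2024-02-16T16:32:15Z start container'."""
    events = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, _, message = line.partition(" ")
        try:
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        events.append((when.replace(tzinfo=timezone.utc), message))
    return events

def count_restarts(events, window_seconds=300):
    """Largest number of 'restart container' events in any sliding window."""
    times = [when for when, msg in events if msg == "restart container"]
    best = 0
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it fits in window_seconds.
        while (times[end] - times[start]).total_seconds() > window_seconds:
            start += 1
        best = max(best, end - start + 1)
    return best

# First lines of the log pasted in this thread.
SAMPLE = """\
2024-02-16T16:32:15Z start container
2024-02-16T16:32:23Z restart container
2024-02-16T16:34:13Z restart container
2024-02-20T15:44:47Z stop container
"""

events = parse_events(SAMPLE)
print(len(events), "events,", count_restarts(events), "restarts within 5 minutes")
```

A high count in a short window suggests the container keeps restarting rather than coming up cleanly, which is worth including when reporting the pod ID to staff.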
what do the other logs show?
Okay now this log is also gone, it only says
No Container logs yet, or system logs
There is nothing I can see right now, no idea what is happening with the pod.
0 GPU?
No, I think there are GPUs available.
At least it gives me the option to restart with them, but restart does not work either.
@Madiator2011 [EU] Okay, it just started working and gave me a warning that this machine has been reported for some reason and that the root cause is being investigated.
You can ping Flash if it cost you money, and you can get a credit refund if necessary. Sounds like a technical issue.
Money is not the problem, we are already running more than 3 GPUs 24/7.
We just need a way to get quick help in cases like these. I generally write to the website support,
but it is offline all the time.
Yeah, I think pinging Flash or a staff member like Madiator or Justin Merrell on Discord while they are online during PST hours works quite well for these sorts of issues. Usually they just need the pod ID and whether it is on secure or community cloud.
Yes, so far so good, they are helping out, and the response on Discord is fast.
The problem is that RunPod support is only available during US hours most of the time, so those of us in the rest of the world can't get our issues attended to when we have a critical problem and have to wait for the US to come online. Community members can only assist so far, and staff like @Madiator2011 [EU] only have limited access and can't help us with a lot of our issues. We really need people to be available to help resolve critical issues 24/7. I have to be on standby/callout 24/7 and people call me at 3am etc. when there is a production issue. RunPod needs to offer the same.
I have mentioned this several times before but it has still not happened, and it makes RunPod much less appealing as a production service when we can't get immediate support for a production issue.
I'm usually up most of the time
Yeah but you don't have access to everything when we have an issue 😢
Yes, definitely agree. We would be happy to pay for a subscription-style plan like "Professional" etc. that includes these support packages.
Yeah good idea to offer premium support for paying extra 👍 cc: @JM
Justin, do you know anyone we can tag right now? All our production pods suddenly went down and are showing no status.
I have no idea what's happening these days.
@justin [Not Staff] Okay, the pods are suddenly back in their existing states as if nothing happened. I believe there was a network issue on RunPod's side.
what do u mean suddenly went down? want to share the endpoint so staff can look into it? can u share time frame too?
@flash-singh / @Justin Merrell prob ur best bet
i see there was a network blip about 30 mins ago but its been back online for the past 20 mins
Yes exactly that time!
the ui doesn't show anything?
is it working now?
Pods disappeared from the dashboard for 5-6 minutes.
Yes, they are working fine now.
But it would be great to have some visibility in these cases, so our product team doesn't panic :)
i see the blip lasted about 5 mins
I also checked directly here: https://uptime.runpod.io/
But it looks like that network blip is not reflected there.
those are global services, single gpu servers dont impact reliability of those
we have plans to implement dc level reliability metrics there, that will help if whole dc gets impacted
Great to hear!
also you're using community cloud. given that this caused a prod issue for you, i would highly encourage you to use secure cloud or improve the HA of your stack
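One simple way to read "improve HA of your stack" is to run more than one pod and have the client fall back to the next one when the first is unreachable. The sketch below is only an illustration under that assumption: the backend addresses are hypothetical placeholders, and `tcp_alive` and `pick_backend` are made-up helper names, not a RunPod API.

```python
import socket

def tcp_alive(host, port, timeout=3.0):
    """True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, no route to host.
        return False

def pick_backend(backends, probe=tcp_alive):
    """Return the first (host, port) that passes the probe, or None if all fail."""
    for host, port in backends:
        if probe(host, port):
            return (host, port)
    return None

# Hypothetical addresses: replace with your own pods or servers.
BACKENDS = [
    ("pod-a.example.invalid", 8000),
    ("pod-b.example.invalid", 8000),
]
```

Calling `pick_backend(BACKENDS)` before each request, or on a timer, lets traffic move to a second pod during a short network blip. A TCP connect only shows that the port is open, so a real setup would likely probe an application-level health endpoint instead.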
2024-02-23T05:18:21Z error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp 54.236.113.205:443: connect: no route to host
I cannot restart my pod.
@flash-singh any incident currently going on?
pod_id: fvz9rf9l3lme2a
I also had this issue yesterday. It is very annoying because we are charged while the pod goes into an infinite loop and never comes up.
Hi again, some weird things are happening with the pods again, it seems like network issues again.
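To avoid waiting indefinitely on a pod that cannot reach the registry, one option is an external watchdog that retries a reachability check a bounded number of times and then gives up, so the pod can be stopped instead of left looping. The sketch below is an assumption-laden illustration: it checks the registry host from the error message above from wherever the script runs (not from inside the affected machine), and `registry_reachable` and `wait_until` are hypothetical helper names.

```python
import socket
import time

def registry_reachable(host="registry-1.docker.io", port=443, timeout=5.0):
    """True if a TCP connection to the registry opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_until(check, attempts=5, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Call check() up to `attempts` times with capped exponential backoff.

    Returns the 1-based attempt number that succeeded, or 0 if every attempt failed.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)  # no sleep after the final failed attempt
            delay = min(delay * 2, max_delay)
    return 0
```

For example, `wait_until(registry_reachable)` returning `0` could be the trigger to stop the pod and report the pod ID, rather than paying for repeated failed pulls.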
What things? Which region?
US
So weird, the same thing is happening: pod statuses are disappearing from the dashboard again, and during that time we cannot connect to them.
@ashleyk Do you know anyone who might be awake and be able to help right now?
I think this is a general network issue again.
Things are getting back to normal.
@flash-singh, when you are available, could you confirm whether there was a network blip again at 3:00am - 3:20am EST?
Update: The network issues seem to have been gone for the last 2 days. Everything has been stable.