RunPod 12mo ago
Nautilus

"We have detected a critical error on this machine..." failing pods

I have been getting a lot of these errors lately: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." I lost pods (H100 in the secure cloud) and don't know why; today the 6th pod in 2 weeks failed. RunPod support is not helping either. Can someone help me? I'm not going to use RunPod's service anymore until this issue is addressed, thanks. Current pod failing: ID: jfktfsgsvw19i1
8 Replies
ashleyk 12mo ago
@flash-singh any idea whats going on here?
flash-singh 12mo ago
These errors appear when we detect a hardware failure and need to service the whole server. Were the other pods that failed also H100s?
Nautilus (OP) 12mo ago
Yes, but there were also different locations (CA/US).
Current pods affected:
H100, ID: jfktfsgsvw19i1, secure cloud, CA, still running but no access
H100, ID: 9ewem9xe6u8oy4, community cloud, US, still running
I can provide more instances that I stopped, but don't have the screenshot at hand right now...
Nautilus (OP) 12mo ago
Example: No access to this one, but process on pod is still running. Thank you for your help.
flash-singh 12mo ago
@JM do we know any further details on these H100s?
JM 12mo ago
Hello. We are investigating this!
Nautilus (OP) 12mo ago
Thanks! Update: ID: jfktfsgsvw19i1 is working again. Thanks!
JM 12mo ago
❤️