"We have detected a critical error on this machine..." failing pods
I've been getting a lot of these errors lately: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." I lost pods (H100 in the secure cloud) and don't know why. Today the 6th pod in 2 weeks failed. Runpod support is not helping either. Can someone help me? I'm not going to use Runpod's service anymore until this issue is addressed, thanks.
Current pod failing: ID: jfktfsgsvw19i1
8 Replies
@flash-singh any idea what's going on here?
These errors appear when we detect a hardware failure and need to service the whole server. Were the others that failed also H100s?
Yes, but they were in different locations (CA/US).
Pods currently affected:
H100, ID: jfktfsgsvw19i1, secure cloud, CA, still running but no access
H100, ID: 9ewem9xe6u8oy4, community cloud, US, still running
I can provide more instances that I stopped, but don't have the screenshot at hand right now...
Example: No access to this one, but the process on the pod is still running. Thank you for your help.
@JM do we know any further details on these H100s?
Hello. We are investigating this!
Thanks!
Update: ID: jfktfsgsvw19i1 is working again. Thanks!
❤️