RunPod•12mo ago

We have detected a critical error on this machine which may affect some pods.

Hey all. We're renting a number of H100s as a trial run of Runpod, as we are looking for another compute provider. We paid for 24 hours of compute in order to transfer terabytes of data onto the machine, alongside paying for bandwidth and additional storage. We additionally paid our cloud provider's egress costs, which were more than we paid for the H100 machine, and rented a disk- and network-optimized machine in order to transfer the data quickly to the Runpod machine.

After 24 hours, we are getting this error in the Runpod GUI: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." Running nvidia-smi gives an ERR! for the 3rd GPU.

What are our options here? Is this an error that will be fixed by Runpod, or have I paid for a faulty machine? Similarly, is there any way to take the persistent volume disk we are currently paying for and attach it to a different H100 machine, so we do not have to spend another 24 hours transferring data and paying additional fees? Please advise.
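The nvidia-smi check mentioned in the post can be scripted so that a faulty GPU is caught before a long transfer or training run starts. The following is a minimal sketch, not RunPod's own diagnostic: the queried fields and the helper names (`find_faulty_gpus`, `check_local_gpus`) are illustrative choices, and the assumption that a failing GPU shows up as `ERR!` or `Unknown Error` in CSV output is unverified beyond what the thread reports.

```python
import subprocess

# Markers treated as "this field could not be read". "ERR!" is what the
# thread reports; "Unknown Error" is an assumption about CSV-mode output.
ERROR_MARKERS = ("ERR!", "Unknown Error")

# Real nvidia-smi flags; the particular field list is an illustrative choice.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,power.draw",
    "--format=csv,noheader",
]


def find_faulty_gpus(csv_text):
    """Return the GPU indices whose row contains any error marker."""
    faulty = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if any(marker in field for field in fields[1:] for marker in ERROR_MARKERS):
            faulty.append(fields[0])
    return faulty


def check_local_gpus():
    """Run nvidia-smi on this machine and return the faulty GPU indices.

    If nvidia-smi itself fails, stdout may be empty and this returns [],
    so a caller may also want to inspect the return code.
    """
    result = subprocess.run(QUERY, capture_output=True, text=True)
    return find_faulty_gpus(result.stdout)
```

Calling `check_local_gpus()` on a pod right after it starts, and again periodically during a long job, would flag a GPU in the state described above (index 2 for "the 3rd GPU").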

18 Replies

ashleyk•12mo ago

You can use network storage to attach your persistent storage to different pods, but it restricts your availability. Can you provide the pod ID so that someone can check it out? @flash-singh @Papa Madiator, more issues with H100.

Daniel T.OP•12mo ago

We did not use network storage as we were unable to find any availability for H100s. Pod ID: brpigbe2fzkzrh

Madiator2011•12mo ago

Forwarded to the team. @Daniel T., do you have any data in there?

flash-singh•12mo ago

looking into it, this is a hardware issue and hopefully doesn't require RMA

Daniel T.OP•12mo ago

Yes -- as mentioned above, we have terabytes of data we paid to transfer to the machine on a persistent volume disk.

Got it. I've seen this happen with cables not being plugged in fully or a GPU overheating significantly; hopefully not an RMA 🙏

Lmk when you have an update.

JM•12mo ago

Hey @Daniel T.! As others mentioned, a request has already been sent to a tech in the datacenter. That being said, I have reconciled the fees for the pod ID you provided and credited your account for the entire duration, plus a sizable buffer. We will reach out soon with any relevant updates! Let me know if you have any other questions in the meantime 🙂

Daniel T.OP•12mo ago

Thank you for the update and for crediting the account. Right now it seems like two GPUs are in the error state. Do you have any clue regarding the timing for the tech to fix the issue? Would you recommend spinning up a new instance, or waiting for the issue to be fixed by the tech? For reference, it takes ~24 hours to transfer data and egress costs are substantial.

JM•12mo ago

Hey! Even though I don't have a crystal ball, in my experience that type of error is most often associated with Nvidia GPU failure. How much data do you have on that server, by the way? Those servers are on a 10 Gbps backbone, so it could be quite fast to transfer from one to another, given that you rent from the same location. I would definitely recommend that so you can get started very soon! You can transfer data between pods using our CTL tool here: https://docs.runpod.io/references/runpodctl/#options

Daniel T.OP•12mo ago

~3 TB; we have > 16 TB we want to transfer for when we fully start training. Transferring between pods should work - ty
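To put the figures in this exchange side by side, here is a back-of-the-envelope sketch of transfer time over the 10 Gbps backbone versus the ~24-hour transfer reported earlier. It assumes decimal terabytes (10**12 bytes) and an ideal, fully utilized link; real throughput will be lower because of protocol overhead, disk speed, and contention. The function names and the `efficiency` parameter are illustrative, not part of any RunPod tooling.

```python
def transfer_hours(size_tb, link_gbps, efficiency=1.0):
    """Hours needed to move size_tb terabytes (10**12 bytes) over a link.

    efficiency is the assumed fraction of the nominal link rate achieved.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


def effective_gbps(size_tb, hours):
    """Average throughput implied by moving size_tb terabytes in `hours`."""
    return size_tb * 1e12 * 8 / (hours * 3600) / 1e9


for size in (3, 16):
    print(f"{size} TB at 10 Gbps: {transfer_hours(size, 10):.2f} h (ideal)")

print(f"3 TB in 24 h implies about {effective_gbps(3, 24):.2f} Gbps")
```

Under these assumptions the ideal pod-to-pod times are roughly 0.67 h for 3 TB and 3.56 h for 16 TB, while the original 24-hour upload corresponds to an average of about 0.28 Gbps.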

Daniel T.OP•12mo ago

I'm assuming CA is Canada, correct?

Daniel T.OP•12mo ago

Do you have datacenters not listed on the UI? When I use the dropdown to select a particular region, it says GPUs are not available in any region. When I select "any", GPUs are available.

Daniel T.OP•12mo ago

Could you increase my spending limit when you have a chance?

JM•12mo ago

Good point, let me check where your pod is hosted.

It's in Canada. Indeed, most DC tags like these are for locations with network storage, or with plans for network storage. @flash-singh, could we get a DC tag for this location? It is quite sizable.

Daniel T.OP•12mo ago

Could you increase my spending limit? Trying to rent another machine to transfer the data over, but running into spending limits

flash-singh•12mo ago

PM me the details and I can increase it for you.

Daniel T.OP•12mo ago

runpodctl send times out or fails for large files when transferring data. This is a relatively common problem for approaches that do not retry enough or that run over unstable networks. rsync and wget work due to robust retries. Given the inability to obtain a pod in the given geographical region, I'm giving up on the approach of transferring data via two connected pods and will pay the cloud provider fees.
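The retry-based approach described above can be sketched as a small wrapper that re-runs a transfer command with exponential backoff until it succeeds. This is an illustration under stated assumptions, not how runpodctl or rsync behave internally: here the retry loop is supplied by the wrapper, and rsync's `--partial` flag keeps partially transferred files so a later attempt can pick up from them. `run_with_retries`, `rsync_command`, and the example paths are hypothetical names.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Run cmd until it exits with status 0; return the attempts used.

    Waits base_delay * 2**(attempt - 1) seconds after each failed attempt
    and raises RuntimeError once max_attempts have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command failed after {max_attempts} attempts")


def rsync_command(src, dest):
    """Build an rsync invocation that keeps partial files between attempts."""
    # -a: archive mode; --partial: keep partially transferred files;
    # --timeout=60: give up on a stalled connection after 60 s of no I/O.
    return ["rsync", "-a", "--partial", "--timeout=60", src, dest]


# Example (hypothetical host and paths, not taken from the thread):
# run_with_retries(rsync_command("/data/", "user@other-pod:/workspace/data/"))
```

Passing `sleep` as a parameter keeps the backoff schedule easy to test without waiting; in normal use the default `time.sleep` applies.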

ashleyk•12mo ago

Yeah, lots of people have issues with transferring large amounts of data using runpodctl. I suggested adding more relays in this post: https://discord.com/channels/912829806415085598/1207427186974396556

Madiator2011•12mo ago

@Daniel T. if you are going to use an H100 next time, feel free to use my experimental tool #RunPod GPU Tester (recommended for H100 users). I plan to smooth it out, but it should let you test a machine before running something longer.
