We have detected a critical error on this machine which may affect some pods.
Hey all. We're renting a number of H100s as a trial run of Runpod, since we're looking for another compute provider. We paid for 24 hours of compute in order to transfer terabytes of data onto the machine, along with paying for bandwidth and additional storage. We also paid our cloud provider's egress costs, which came to more than we paid for the H100 machine itself, and rented a disk- and network-optimized machine in order to transfer the data to the Runpod machine quickly.
After 24 hours, we are getting this error on the Runpod GUI:
We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime.
Running nvidia-smi gives an ERR! for the 3rd GPU.
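For reference, this is roughly how we're checking per-GPU status (standard nvidia-smi query flags; the specific fields are just the ones we happen to look at):
```bash
# List each GPU's index, name, temperature, utilization and memory use;
# the faulty GPU shows ERR! in place of its readings.
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used --format=csv
```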
What are our options here? Is this an error that will be fixed by Runpod, or have I paid for a faulty machine?
Similarly, is there any way to use the persistent volume disk we are currently paying for and have it attached to a different H100 machine, so we do not have to spend another 24 hours transferring data & paying additional fees? Please advise.
18 Replies
You can use network storage to be able to attach your persistent storage to different pods, but it restricts your availability.
Can you provide the pod id so that someone can check it out.
@flash-singh @Papa Madiator more issues with H100.
We did not use network storage as we were unable to find any availability for H100s.
Pod ID: brpigbe2fzkzrh
Forwarded to team
@Daniel T. do you have any data in there?
looking into it, this is a hardware issue and hopefully doesn't require RMA
Yes -- as mentioned above, we have terabytes of data we paid to transfer to the machine, on a persistent volume disk.
Got it. I've seen this happen with cables not being plugged in fully or a GPU getting overheated significantly, hopefully not RMA 🙏
Lmk when you have an update
Hey @Daniel T.!
Request has already been sent to a tech in the datacenter, as others mentioned. That said, I have reconciled the fees for the pod ID you provided and credited your account for the entire duration, plus a sizable buffer. We will reach out soon with any relevant updates!
Let me know if you have any other questions in the meantime 🙂
Thank you for the update and for crediting the account. Right now it seems like two GPUs are in the error state. Do you have any sense of how long it will take the tech to fix the issue? Would you recommend spinning up a new instance, or waiting for the tech's fix? For reference, it takes ~24 hours to transfer the data and the egress costs are substantial.
Hey! Even though I don't have a crystal ball, in my experience that type of error is most often associated with an Nvidia GPU failure. How much data do you have on that server, by the way?
Those servers are on a 10 Gbps backbone, so it could be quite fast to transfer from one to another, provided you rent in the same location. I would definitely recommend that so you can get started very soon! You can transfer data between pods using our CTL tool here:
https://docs.runpod.io/references/runpodctl/#options
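In case it helps, here's a rough sketch of what a pod-to-pod transfer looks like with runpodctl (the archive name is just a placeholder):
```bash
# On the source pod: pack the dataset and send it; this prints a one-time receive code
tar czf dataset.tar.gz /workspace/dataset
runpodctl send dataset.tar.gz

# On the destination pod: receive using the code printed by the send command
runpodctl receive <one-time-code>
```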
~3 TB; we have >16 TB we want to transfer for when we fully start training.
Transferring between pods should work - ty
I'm assuming CA is Canada, correct?
Do you have datacenters not listed on the UI? When I use the dropdown to select a particular region, it says GPUs are not available in any region. When I select "any", GPUs are available.
Could you increase my spending limit when you have a chance?
Good point, let me check where is your pod hosted
It's in Canada. Indeed, most DCs with tags like these are ones with network storage, or plans for network storage. @flash-singh Could we get a DC tag for this location? It is quite sizable.
Could you increase my spending limit?
Trying to rent another machine to transfer the data over, but running into spending limits
pm me details and can increase it for you
`runpodctl send` times out/fails for large files when transferring data. This is a relatively common problem for tools that don't retry enough in the face of network instability. `rsync` and `wget` work thanks to their robust retries. Given the inability to obtain a pod in the given geographical region, I'm giving up on the approach of transferring data via two connected pods and will pay the cloud provider fees.
Yeah, lots of people have issues with transferring large amounts of data using runpodctl, I suggested adding more relays in this post: https://discord.com/channels/912829806415085598/1207427186974396556
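If you do end up pulling straight from your cloud provider again, a simple retry loop around rsync is what I'd reach for; the host and paths below are placeholders:
```bash
# Retry until rsync exits cleanly; --partial keeps partially-transferred files
# so each retry resumes instead of starting from scratch.
until rsync -avz --partial --progress -e ssh \
    user@source-host:/data/dataset/ /workspace/dataset/; do
  echo "rsync interrupted, retrying in 30s..." >&2
  sleep 30
done
```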
@Daniel T. if you're going to use H100s next time, feel free to use my experimental tool #RunPod GPU Tester (recommended for H100 users)
I plan to smooth it out, but it should let you test a machine before running something longer.