CUDA error: uncorrectable ECC error encountered

I just provisioned an 8xH100 NVL machine and had it load a very large model. The container then got stuck in a restart loop, failing while loading the model with this error:

2024-08-04T16:43:13.809833249Z RuntimeError: CUDA error: uncorrectable ECC error encountered

This looks like a hardware defect. Is there a way to get my credits back for that run?
1 Reply
Marcus, 4mo ago
Log a support ticket on the website.