MarioHachemer
CUDA error: uncorrectable ECC error encountered
I just provisioned an 8xH100 NVL machine and had it load a very large model. The container then got stuck in a restart loop, failing with this error each time it tried to load the model:
2024-08-04T16:43:13.809833249Z RuntimeError: CUDA error: uncorrectable ECC error encountered
This looks like a hardware defect. Is there a way to get my credits back for that run?