RunPod
Created by MarioHachemer on 8/4/2024 in #⛅|pods
CUDA error: uncorrectable ECC error encountered
I just provisioned an 8xH100 NVL machine and had it load a very large model. The container then got stuck in a restart loop, failing each time it tried to load the model with this error:
2024-08-04T16:43:13.809833249Z RuntimeError: CUDA error: uncorrectable ECC error encountered
This looks like a hardware defect. Is there a way to get my credits back for that run?
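For anyone hitting the same error, here is a rough sketch of how one might check which GPU is reporting uncorrectable ECC errors before filing a support request. It assumes `nvidia-smi` is available inside the pod and uses its `ecc.errors.uncorrected.volatile.total` query field; the helper names (`parse_ecc_csv`, `gpus_with_uncorrectable_errors`, `query_gpus`) are hypothetical and not part of any RunPod tooling. Volatile counters reset when the driver is reloaded, so a zero count does not rule out an earlier fault.

```python
import subprocess

# nvidia-smi query fields; the helper functions below are illustrative only.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into (index, name, count) tuples.

    count is None when the driver reports a non-numeric value such as [N/A].
    """
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split off the first and last fields so a comma in the name cannot break parsing.
        index, rest = line.split(",", 1)
        name, count = rest.rsplit(",", 1)
        count = count.strip()
        rows.append((int(index), name.strip(), int(count) if count.isdigit() else None))
    return rows


def gpus_with_uncorrectable_errors(rows):
    """Return the indices of GPUs whose uncorrectable ECC count is above zero."""
    return [index for index, _name, count in rows if count]


def query_gpus():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ecc_csv(out)
```

Calling `gpus_with_uncorrectable_errors(query_gpus())` on the affected machine would list the GPU indices to mention in a support ticket, assuming the counters have not been reset since the failure.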