Runpod•9mo ago

CUDA device uncorrectable ECC error

I'm using a 5xH100 pod and got uncorrectable ECC errors on devices 1, 2, and 3. Devices 0 and 4 work without a problem. It seems the devices or the system need a reboot. Any help with this? I've already submitted a ticket on the website with the pod ID.

Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
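
The manual probing in the session above can be scripted. This is a minimal sketch, assuming PyTorch is installed on the pod; `probe_devices`, `torch_probe`, `visible_devices_value`, and `report` are hypothetical helper names, not part of PyTorch. It tries a tiny allocation on each device, records the first line of any `RuntimeError`, and builds a `CUDA_VISIBLE_DEVICES` value listing only the devices that responded.

```python
from typing import Callable, Dict, Optional


def probe_devices(count: int, probe: Callable[[int], None]) -> Dict[int, Optional[str]]:
    """Run `probe` on each device index; map index -> error text (None if OK)."""
    results: Dict[int, Optional[str]] = {}
    for idx in range(count):
        try:
            probe(idx)
            results[idx] = None
        except RuntimeError as exc:
            # Keep only the first line; PyTorch appends generic debugging hints.
            results[idx] = (str(exc).splitlines() or [""])[0]
    return results


def torch_probe(idx: int) -> None:
    """Allocate a tiny tensor on cuda:<idx> and wait for the device to finish."""
    import torch  # imported lazily so the other helpers work without a GPU

    torch.tensor([1], device=f"cuda:{idx}")
    torch.cuda.synchronize(idx)


def visible_devices_value(results: Dict[int, Optional[str]]) -> str:
    """Comma-separated indices of devices that passed the probe."""
    return ",".join(str(idx) for idx, err in results.items() if err is None)


def report() -> None:
    """Probe every device PyTorch can see and print a summary."""
    import torch

    results = probe_devices(torch.cuda.device_count(), torch_probe)
    for idx, err in results.items():
        print(f"cuda:{idx}: {'ok' if err is None else err}")
    print("CUDA_VISIBLE_DEVICES=" + visible_devices_value(results))
```

Calling `report()` on the pod prints one line per device. Exporting the printed `CUDA_VISIBLE_DEVICES` value before launching a job only hides the failing GPUs from that process; it is a stopgap while support looks at the host, not a fix. The variable has to be set before CUDA is initialized in the process.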

Dj•4/11/25, 4:52 PM

Hey, can you share your pod id or dm me your account email?

Replying to 41see: I'm using a 5xH100 pod and got uncorrectable ECC error for device 1,2,3. Device ...

Jason•4/11/25, 5:12 PM

What's your container's CUDA version, and what CUDA version is your pod on?

Jason•4/11/25, 5:12 PM

Run nvidia-smi and nvcc --version to check, I think
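
For anyone who wants to collect both versions in one go, here is a minimal sketch. It assumes `nvidia-smi` prints a `CUDA Version: X.Y` field in its header and `nvcc --version` prints `release X.Y`, which is the usual layout but may differ between driver or toolkit releases. All function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Dict, Optional


def parse_driver_cuda(smi_output: str) -> Optional[str]:
    """Extract 'X.Y' from the 'CUDA Version: X.Y' field of nvidia-smi's header."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def parse_nvcc_cuda(nvcc_output: str) -> Optional[str]:
    """Extract 'X.Y' from the 'release X.Y' text printed by nvcc --version."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def run_tool(*cmd: str) -> Optional[str]:
    """Return the tool's stdout, or None if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def cuda_versions() -> Dict[str, Optional[str]]:
    """Collect the driver-side and toolkit-side CUDA versions, if available."""
    smi = run_tool("nvidia-smi")
    nvcc = run_tool("nvcc", "--version")
    return {
        "driver_max_cuda": parse_driver_cuda(smi) if smi else None,
        "toolkit_cuda": parse_nvcc_cuda(nvcc) if nvcc else None,
    }
```

The `nvidia-smi` value is the highest CUDA version the installed driver supports, while `nvcc` reports the toolkit inside the container. If PyTorch is installed, `torch.version.cuda` additionally shows which CUDA version that PyTorch build was compiled against.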

bghira•4/11/25, 5:31 PM

actually the same error is occurring here on H100 pods. (secure cloud)

bghira•4/11/25, 5:32 PM

@Dj pod id is hs6vtqnu343wwj

Replying to Jason: Your container cuda version, and what cuda version is your pod?

Jason•4/11/25, 5:34 PM

@bghira just wondering

bghira•4/11/25, 5:36 PM

Cu124

bghira•4/11/25, 5:36 PM

it's H100s. it's not going to be 11.8.

Jason•4/11/25, 5:41 PM

All 12.4?

Jason•4/11/25, 5:41 PM

Try 12.8

bghira•4/11/25, 5:45 PM

it's not the problem..

bghira•4/11/25, 5:46 PM

also, 12.8 will need an entirely different version of pytorch. that isn't something someone can do in production

Jason•4/11/25, 5:48 PM

Oh I see. What's the problem?

bghira•4/11/25, 5:49 PM

see the original post in this thread

bghira•4/11/25, 5:52 PM

support is looking at the ticket with the pod ID. i'll update here if they give any more insight into the reason

bghira•4/11/25, 5:53 PM

at my org we have something like 200-300 H100s and i've seen CUDA permissions errors or NVML init errors, but never this kind of ongoing ECC error. it's new to me, so i'm curious too
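
One way to see whether the ECC errors are ongoing is to read the driver's uncorrected-error counters per GPU. This is a minimal sketch, assuming the `nvidia-smi` on the host accepts the `ecc.errors.uncorrected.volatile.total` and `ecc.errors.uncorrected.aggregate.total` query fields (listed by `nvidia-smi --help-query-gpu`) and that the counters are readable from inside the pod. `EccRow`, `parse_ecc_csv`, `query_ecc`, and `gpus_with_uncorrected_errors` are hypothetical names.

```python
import csv
import io
import subprocess
from typing import List, NamedTuple, Optional

# Query fields passed to nvidia-smi; availability depends on the driver.
QUERY = (
    "index,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


class EccRow(NamedTuple):
    index: int
    volatile: Optional[int]   # uncorrected errors since the driver last loaded
    aggregate: Optional[int]  # uncorrected errors over the GPU's lifetime


def _to_int(field: str) -> Optional[int]:
    """Convert a CSV field to int; non-numeric values such as N/A become None."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_ecc_csv(text: str) -> List[EccRow]:
    """Parse 'index, volatile, aggregate' lines from csv,noheader output."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if len(rec) != 3:
            continue  # skip blank or unexpected lines
        rows.append(EccRow(int(rec[0].strip()), _to_int(rec[1]), _to_int(rec[2])))
    return rows


def query_ecc() -> List[EccRow]:
    """Ask nvidia-smi for the counters and parse its output."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ecc_csv(out)


def gpus_with_uncorrected_errors(rows: List[EccRow]) -> List[int]:
    """Indices of GPUs whose volatile or aggregate counter is above zero."""
    return [r.index for r in rows if (r.volatile or 0) > 0 or (r.aggregate or 0) > 0]
```

Running `gpus_with_uncorrected_errors(query_ecc())` on the affected pod gives a list of GPU indices to include in a support ticket. A volatile count that keeps rising between runs suggests the fault is still active rather than a one-off.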

Dj•4/11/25, 5:54 PM

I'm also taking a look. We had a small outage yesterday which may be related, but I'm working on going through the relevant logs (similar to what support would be doing)

bghira•4/11/25, 5:55 PM

been there. i understand

bghira•4/11/25, 5:55 PM

thanks for taking a look

Replying to bghira: also, 12.8 will need an entirely different version of pytorch. that isn't someth...

riverfog7•4/11/25, 5:55 PM

can confirm im dying waiting for cuda kernels to compile

Replying to riverfog7: can confirm im dying waiting for cuda kernels to compile

Dj•4/11/25, 5:57 PM

Are you on the same org or just also seeing the same error?

bghira•4/11/25, 5:58 PM

different case

riverfog7•4/11/25, 5:58 PM

my experience with H100s was fine (cuz i dont use them a lot)

bghira•4/11/25, 5:59 PM

this is a pretty rare issue for SXM5 systems but i'm on H100 PCIe which are a lot more "meh" from past experience (not a RunPod problem, it's an NVIDIA problem)

Replying to riverfog7: can confirm im dying waiting for cuda kernels to compile

riverfog7•4/11/25, 5:59 PM

that was about getting vLLM to work with Blackwell

Dj•4/11/25, 5:59 PM

Just got freed from my meeting, hunting down the error now

riverfog7•4/11/25, 5:59 PM

i think someone got nvlink errors on H100s

riverfog7•4/11/25, 5:59 PM

in other thread

bghira•4/11/25, 6:00 PM

that happens if the fabric manager crashes on the host system

Replying to riverfog7: i think someone got nvlink errors on H100s