Runpod · 9mo ago
41see

CUDA device uncorrectable ECC error

I'm using a 5xH100 pod and got an uncorrectable ECC error on devices 1, 2, and 3. Devices 0 and 4 work without a problem. It seems the affected devices or the whole system need a reboot. Can anyone help with this? I've already submitted a ticket on the website with the pod ID.

Python 3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 18:27:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
5
>>> torch.tensor([1], device='cuda:0')
tensor([1], device='cuda:0')
>>> torch.tensor([1], device='cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:3')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
>>> torch.tensor([1], device='cuda:4')
tensor([1], device='cuda:4')
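
The manual check above can be scripted so the same probe runs on every device in one go. This is only a minimal sketch, not a fix for the ECC errors themselves: `probe_devices`, `healthy_device_list`, and `torch_allocate` are hypothetical helper names, and it assumes a failed allocation surfaces as a `RuntimeError`, as it does in the session above.

```python
def probe_devices(device_count, allocate):
    """Try a tiny allocation on each device.

    Returns {index: None} for devices that worked and
    {index: first line of the error message} for devices that failed.
    """
    results = {}
    for index in range(device_count):
        try:
            allocate(index)
            results[index] = None
        except RuntimeError as exc:
            # Keep only the first line of the (multi-line) CUDA error text.
            results[index] = str(exc).partition("\n")[0]
    return results


def healthy_device_list(results):
    """Join the indices of working devices, in the form CUDA_VISIBLE_DEVICES expects."""
    return ",".join(str(i) for i, err in sorted(results.items()) if err is None)


def torch_allocate(index):
    """Hypothetical allocator: the same one-element tensor used in the session above."""
    import torch  # imported lazily so the helpers above stay usable without torch

    torch.tensor([1], device=f"cuda:{index}")
    torch.cuda.synchronize(index)  # surface asynchronously reported errors here
```

Calling `probe_devices(torch.cuda.device_count(), torch_allocate)` on the pod described here should report errors for devices 1, 2, and 3. Passing the result to `healthy_device_list` and exporting it as `CUDA_VISIBLE_DEVICES` before launching a job is one way to keep working on the healthy GPUs until the ticket is resolved.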