RunPod•14mo ago

Broken CUDA / PyTorch on H100

/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

Tried reinstalling PyTorch, did not help.
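
For anyone hitting the same warning, here is a minimal diagnostic sketch that gathers the facts asked about later in this thread (PyTorch build, CUDA availability, GPU and driver as reported by nvidia-smi) in one go. The function name `collect_cuda_diagnostics` is made up for illustration, and the sketch assumes `nvidia-smi` is on the PATH inside the pod. It only reports what it sees and does not fix anything.

```python
import shutil
import subprocess
import warnings


def collect_cuda_diagnostics():
    """Gather basic facts about the PyTorch build and the visible GPU driver.

    Hypothetical helper for illustration; not part of PyTorch or RunPod.
    """
    info = {
        "torch_version": None,
        "torch_built_for_cuda": None,
        "cuda_available": None,
        "init_warnings": [],
        "nvidia_smi": None,
    }

    # Ask the driver which GPUs it sees (assumes nvidia-smi is installed).
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=30,
            )
            info["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            info["nvidia_smi"] = f"nvidia-smi failed: {exc}"

    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch_version"] = torch.__version__
    info["torch_built_for_cuda"] = torch.version.cuda  # None on CPU-only builds

    # Record any "CUDA initialization" UserWarning instead of letting it scroll by.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        info["cuda_available"] = torch.cuda.is_available()
    info["init_warnings"] = [str(w.message) for w in caught]

    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Pasting this output together with the pod ID and region should give support more to work with than the warning alone.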

19 Replies

DreamGenOP•14mo ago

nvidia-smi

DreamGenOP•14mo ago

message.txt

DreamGenOP•14mo ago

nvcc --version

root@2583eec93fb6:/workspace/axolotl# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Madiator2011•14mo ago

Is it a normal H100 or SXM?

DreamGenOP•14mo ago

It was SXM. I tried re-creating the pod and reinstalling several times; it did not work, so I gave up.

Madiator2011•14mo ago

Are you using nightly PyTorch? If that won't work, send me the pod ID and region.

DreamGenOP•14mo ago

I already removed the pod. It's >$15/hour and I did not want to just waste money, sorry. I will try again some other time.

Madiator2011•14mo ago

Though now I do not have much info about which machine might be broken.

Dhruv Mullick•14mo ago

Had the same issue with H100 machines, and someone pointed out that this might be because of CUDA 12.3: https://discord.com/channels/912829806415085598/1210557591483318282/1210557591483318282

DreamGenOP•14mo ago

Thanks for sharing! I don't think I can do much about the installed drivers on the machine, and there were no machines with other drivers.

Dhruv Mullick•14mo ago

I don't think this is a CUDA 12.3 issue anymore, since I realised I'm now getting the same error with 12.2 too.

Dhruv Mullick•14mo ago

RunPod team, in case you investigate this, the pod ID is plmr6hilhh382m

Madiator2011•14mo ago

Is it SXM? @Dhruv Mullick

Dhruv Mullick•14mo ago

PCIe

Madiator2011•14mo ago

Try switching to PyTorch nightly.

Dhruv Mullick•14mo ago

I've released the pod now (too costly), but I'll try this when I encounter the problem again and update here. Thanks!

kopyl•14mo ago

I was also having severe issues with H100, to the point I was unable to train anything. I don’t understand why someone from RunPod can’t just switch all hosts to CUDA 11.8 or 12.2. You can’t reinstall CUDA on a host.

Dhruv Mullick•14mo ago

I'm not downgrading CUDA. I provisioned a new VM with 12.2 this time. @kopyl, how did you get around this whole mess?

kopyl•14mo ago

I switched to A100.
