RunPod, 11mo ago
DreamGen

Broken CUDA / PyTorch on H100

/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Tried reinstalling PyTorch, did not help.
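For anyone hitting the same warning, here is a minimal diagnostic sketch to collect the basics before asking for help. In the CUDA runtime, error 802 is `cudaErrorSystemNotReady`, which the CUDA documentation describes as the system not being ready to start CUDA work, for example because a required driver daemon is not running. On SXM machines with NVSwitch this can include NVIDIA Fabric Manager, though nothing in this thread confirms that was the cause here. The helper name `collect_cuda_diagnostics` is hypothetical, and the script only gathers information; it does not fix anything.

```python
import shutil
import subprocess
import warnings


def collect_cuda_diagnostics():
    """Gather basic facts about the PyTorch / CUDA setup without raising."""
    info = {
        "torch_version": None,     # installed PyTorch version, if any
        "torch_cuda_build": None,  # CUDA version the PyTorch build targets
        "cuda_available": None,    # result of torch.cuda.is_available()
        "nvidia_smi": None,        # raw nvidia-smi output, if the tool exists
    }

    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None:
        info["torch_version"] = torch.__version__
        info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        with warnings.catch_warnings():
            # Suppress the "CUDA initialization" UserWarning; we only want the result.
            warnings.simplefilter("ignore")
            info["cuda_available"] = torch.cuda.is_available()

    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            result = subprocess.run(
                [smi], capture_output=True, text=True, check=False, timeout=30
            )
            info["nvidia_smi"] = result.stdout or result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            info["nvidia_smi"] = f"failed to run nvidia-smi: {exc}"

    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False while `nvidia-smi` lists the GPUs, the problem is more likely on the host or driver side than in the PyTorch install, which would be consistent with reinstalling PyTorch not helping.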
19 Replies
DreamGen (OP), 11mo ago
nvidia-smi
DreamGen (OP), 11mo ago
nvcc --version
root@2583eec93fb6:/workspace/axolotl# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
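Note that `nvcc` reports the installed CUDA toolkit, which is not necessarily the CUDA version the PyTorch binary was built against (`torch.version.cuda`) or the highest version the driver supports (shown in the `nvidia-smi` header). As a rough sketch, the release can be pulled out of the output above for comparison; `parse_nvcc_release` is a hypothetical helper name, not part of any library.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the toolkit release as (major, minor), or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Line taken from the nvcc output posted in this thread.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(parse_nvcc_release(sample))  # (11, 8)
```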
Madiator2011, 11mo ago
Is it a normal (PCIe) H100 or SXM?
DreamGen (OP), 11mo ago
It was SXM. I tried re-creating the pod and reinstalling several times; it did not work, so I gave up.
Madiator2011, 11mo ago
Are you using nightly PyTorch? If that won't work, send me the pod ID and region.
DreamGen (OP), 11mo ago
I already removed the pod. It's >$15/hour and I did not want to just waste money, sorry. I will try again some other time.
Madiator2011, 11mo ago
Though now I do not have much info about which machine might be broken.
Dhruv Mullick, 11mo ago
I had the same issue with H100 machines, and someone pointed out that this might be because of CUDA 12.3: https://discord.com/channels/912829806415085598/1210557591483318282/1210557591483318282
DreamGen (OP), 10mo ago
Thanks for sharing! I don't think I can do much about the installed drivers on the machine, and there were no machines with other drivers.
Dhruv Mullick, 10mo ago
I don't think this is a CUDA 12.3 issue anymore, since I realised I'm now getting the same error with 12.2 too.
Dhruv Mullick, 10mo ago
RunPod team, in case you investigate this, the pod ID is: plmr6hilhh382m
Madiator2011, 10mo ago
Is it SXM? @Dhruv Mullick
Dhruv Mullick, 10mo ago
PCIe
Madiator2011, 10mo ago
Try switching to PyTorch nightly.
Dhruv Mullick, 10mo ago
I've released the pod now (too costly), but I'll try this when I encounter the problem again and update here. Thanks!
kopyl, 10mo ago
I was also having severe issues with H100, to the point that I was unable to train anything. I don’t understand why someone from RunPod can’t just switch all hosts to CUDA 11.8 or 12.2. You can’t reinstall CUDA on a host.
Dhruv Mullick, 10mo ago
I'm not downgrading CUDA. I provisioned a new VM with 12.2 this time. @kopyl, how did you get around this whole mess?
kopyl, 10mo ago
I switched to an A100.