nvidia-smi
nvcc --version
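To see what PyTorch itself reports (a rough check, assuming PyTorch is already installed in the pod image):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi shows the driver and the highest CUDA version it supports, nvcc shows the toolkit in the image; those two can differ, but the PyTorch wheel's CUDA build has to be one the driver can actually run.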
Is it a normal H100 or SXM?
It was SXM
Tried re-creating and reinstalling several times; it did not work, so I gave up
Are you using nightly PyTorch? If that won't work, send me the pod ID and region
I already removed the pod; it's >$15/hour and I did not want to just waste money, sorry -- I'll try again some other time
Though now I don't have much info on which machine might be broken
Had the same issue with H100 machines, and someone pointed out that this might be because of CUDA 12.3
https://discord.com/channels/912829806415085598/1210557591483318282/1210557591483318282
Thanks for sharing! I don't think I can do much about the installed drivers on the machine, and there were no machines with other drivers.
I don't think this is a CUDA 12.3 issue anymore, since I realised I'm now getting the same error with 12.2 as well.
RunPod team, in case you investigate this: plmr6hilhh382m
Is it SXM?
@Dhruv Mullick
PCIe
Try switching to PyTorch nightly
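For reference, installing nightly is usually something like this (assuming the CUDA 12.1 build; pick the exact index URL from the pytorch.org install selector):
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121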
I've released the pod now (too costly), but I'll try this when I encounter the problem again and update here
Thanks!
I was also having severe issues with H100 to the point I was unable to train anything.
I don’t understand why someone from RunPod can’t just switch all hosts to CUDA 11.8 or 12.2.
You can’t reinstall CUDA on a host
I'm not downgrading CUDA. I provisioned a new VM with 12.2 this time.
@kopyl, how did you get around this whole mess?
I switched to an A100