Frequent GPU problem with H100
Hello,
I've seen that 9 times out of 10, I get an H100 (PCIe) machine where CUDA won't work with torch.
For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken.
@JM or someone from the RunPod team, can you please look into this? It's happening extremely frequently now.
ID: b6d3hcqct79d7o, runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Please find the image attached. Note that this is a freshly provisioned VM, with NO commands executed other than the ones shown in the screenshot.
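For reference, the usual way to reproduce this kind of check (a sketch on my part; the exact commands aren't quoted in the thread, only shown in the screenshot):

```python
# Minimal torch/CUDA sanity check on a fresh pod (an assumed reproduction,
# not the literal commands from the screenshot).
try:
    import torch
    print("torch version:", torch.__version__)            # e.g. 2.2.0+cu121
    print("built for CUDA:", torch.version.cuda)          # e.g. 12.1
    print("cuda available:", torch.cuda.is_available())   # False on broken machines
    if torch.cuda.is_available():
        print("device 0:", torch.cuda.get_device_name(0))
except ImportError:
    print("torch is not installed in this environment")
```

On the affected machines, `torch.cuda.is_available()` returns False or CUDA initialization raises the "unknown error" shown below.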
@Dhruv Mullick could you give my tool a try?
#RunPod GPU Tester (recommended for H100 users)
Thanks @Papa Madiator. I've released the H100 for now (to save costs) and provisioned an A100 cluster instead. But I'll post here once I run into the problem again.
Provisioned another one where CUDA doesn't work; here are the results:
{
"PyTorch Version": "2.2.0+cu121",
"Environment Info": {
"RUNPOD_POD_ID": "7zb8qedy1qzr0v",
"Template CUDA_VERSION": "Not Available",
"NVIDIA_DRIVER_CAPABILITIES": "Not Available",
"NVIDIA_VISIBLE_DEVICES": "Not Available",
"NVIDIA_PRODUCT_NAME": "Not Available",
"RUNPOD_GPU_COUNT": "4",
"machineId": "krn533olhyna"
},
"Host Machine Info": {
"CUDA Version": "12.2",
"Driver Version": "535.154.05",
"GPU Name": "NVIDIA H100 PCIe"
},
"CUDA Test Result": {
"GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
"GPU 1": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
"GPU 2": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.",
"GPU 3": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
}
}
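For anyone curious, a collector producing output in the same spirit as the report above can be sketched like this (a hypothetical reconstruction; the field names mirror the JSON, but this is not Papa Madiator's actual tool):

```python
# Hypothetical sketch of a GPU diagnostic collector, modeled on the JSON
# report above. Field names are taken from that output; the logic is assumed.
import json
import os

def collect_env_info():
    """Read the RunPod/NVIDIA environment variables shown in the report."""
    keys = [
        "RUNPOD_POD_ID",
        "CUDA_VERSION",
        "NVIDIA_DRIVER_CAPABILITIES",
        "NVIDIA_VISIBLE_DEVICES",
        "NVIDIA_PRODUCT_NAME",
        "RUNPOD_GPU_COUNT",
    ]
    return {k: os.environ.get(k, "Not Available") for k in keys}

def cuda_test():
    """Try a tiny allocation on each visible GPU to force CUDA context init."""
    try:
        import torch
        results = {}
        for i in range(torch.cuda.device_count()):
            try:
                torch.ones(1, device=f"cuda:{i}")
                results[f"GPU {i}"] = "OK"
            except RuntimeError as e:
                results[f"GPU {i}"] = f"Error: {e}"
        return results
    except Exception as e:  # torch missing, or CUDA init failed outright
        return {"error": str(e)}

if __name__ == "__main__":
    report = {"Environment Info": collect_env_info(),
              "CUDA Test Result": cuda_test()}
    print(json.dumps(report, indent=2))
```

The per-device allocation matters: on these broken hosts, CUDA context creation itself fails with the "unknown error" above, so each GPU entry captures the exception individually.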
@Papa Madiator
Next time, try using my tool and share the errors, as they help with debugging.
By the way, what template do you use?
I'm using: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
I don't think there's a specific one for 12.2?
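For what it's worth, the template/host mismatch alone shouldn't break anything: a torch wheel built for CUDA 12.1 should run under a driver reporting CUDA 12.2, since the driver is backwards compatible with older runtimes. A quick sanity check (a sketch; the function names are mine, not RunPod's):

```python
# Sketch: verify the host driver's CUDA version covers the runtime the
# torch wheel targets. Function names are illustrative assumptions.
import re

def parse_driver_cuda_version(smi_output: str):
    """Pull the 'CUDA Version' field out of nvidia-smi header output."""
    m = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return m.group(1) if m else None

def driver_supports(runtime_version: str, driver_version: str) -> bool:
    """The driver's CUDA version must be >= the runtime the wheel was built for."""
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return as_tuple(driver_version) >= as_tuple(runtime_version)

# The machine above: host reports 12.2, template targets 12.1.1
print(driver_supports("12.1.1", "12.2"))  # → True, so versions aren't the issue
```

So the errors here point at the host itself (driver/kernel state), not at the cuda12.1.1 template.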
By the way, next time you can just upload the JSON file.
Also, after you get the file you can remove the pod, as I saved the machine ID.
Perfect, thanks!
I made that script to help gather info on broken H100s. Trust me, they are problematic.
In the meantime, please enjoy a woman crying over a broken GPU.
By the way, feel free to give feedback about my tool.
Solution
@Dhruv Mullick H100 PCIe machines have caused us lots of headaches lately. We are soon releasing a very powerful detection tool covering all RunPod servers, which will help us fix these non-trivial issues.
It always seems to involve a specific kernel version that turns out not to be compatible even though it's supposed to be. That said, expect a strong resolution in the near term!
Thank you!!
It would be great if an update could be posted here once that happens.
@Dhruv Mullick I remembered you sir! 😉
So, we have a very good detection tool in place now, but it's manual.
I believe the problem is largely solved for H100s. We will now look to automate the script and expand it to all servers on RunPod. In the meantime, don't hesitate to reach out if you have any questions 🙂