torch.cuda.is_available() is False
I'm spinning up several H100s (burning money 😅), and no matter which official Docker image I use, torch.cuda.is_available() is always False, which prevents me from actually using these GPUs.
I've tried the following Docker images:
pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04
Output of various commands:
If the official images don't work, what am I supposed to do? Thanks for any help!
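For anyone hitting the same symptom, a quick first check is to print what PyTorch and the container environment report before digging further. The following is only a minimal sketch, not an official RunPod or PyTorch diagnostic: the helper names `collect_env` and `cuda_report` and the choice of environment variables are assumptions made for illustration, and the script tolerates PyTorch being absent so it can run anywhere.

```python
import json
import os

# Environment variables commonly consulted by the NVIDIA container runtime and CUDA.
# Which ones matter for a given template is an assumption, not a confirmed list.
ENV_KEYS = [
    "CUDA_VERSION",
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "NVIDIA_DRIVER_CAPABILITIES",
]


def collect_env(environ=os.environ):
    """Return the selected variables, marking unset ones as 'Not Available'."""
    return {key: environ.get(key, "Not Available") for key in ENV_KEYS}


def cuda_report():
    """Summarize what PyTorch sees; works even if PyTorch is not installed."""
    try:
        import torch
    except (ImportError, OSError):
        return {"torch": "not installed"}

    report = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_count"] = torch.cuda.device_count()
        report["device_0"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    print(json.dumps({"env": collect_env(), "cuda": cuda_report()}, indent=2))
```

If `built_for_cuda` is `None`, the installed wheel is CPU-only; if it names a CUDA version but `cuda_available` is still false, the problem more likely sits in the driver or container setup than in PyTorch itself.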
@jherrm Run my tool and send me the output: #RunPod GPU Tester (recommended for H100 users)
Thanks @Papa Madiator for the quick response. I just spun up the instance again and ran your tool (no other runtime changes were made and no other programs were run). Here are the contents of the gpu_diagnostics.json file:
I'm having the same issue on an RTX 4000 after upgrading to A1111 1.8 (I did not try running the old version today).
{
"PyTorch Version": "2.1.2+cu121",
"Environment Info": {
"RUNPOD_POD_ID": "s4tupctvnyggri",
"Template CUDA_VERSION": "Not Available",
"NVIDIA_DRIVER_CAPABILITIES": "Not Available",
"NVIDIA_VISIBLE_DEVICES": "Not Available",
"NVIDIA_PRODUCT_NAME": "Not Available",
"RUNPOD_GPU_COUNT": "1",
"machineId": "xb8r2j839zjl"
},
"Host Machine Info": {
"CUDA Version": "12.2",
"Driver Version": "535.104.12",
"GPU Name": "NVIDIA RTX 4000 Ada Gene..."
},
"CUDA Test Result": {
"GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
}
}
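To make a report like the one above easier to scan, the fields that came back as "Not Available" can be listed programmatically. This is a rough sketch only: `missing_fields` is a hypothetical helper, not part of the GPU Tester tool, and the sample below copies just the "Environment Info" section from the posted output. Whether an unset variable is the actual cause of the CUDA error is not something this check can confirm.

```python
import json


def missing_fields(report, marker="Not Available"):
    """Return the 'Environment Info' keys whose value equals the marker."""
    env = report.get("Environment Info", {})
    return sorted(key for key, value in env.items() if value == marker)


# Environment section copied from the diagnostics posted in this thread.
sample = json.loads("""
{
  "Environment Info": {
    "RUNPOD_POD_ID": "s4tupctvnyggri",
    "Template CUDA_VERSION": "Not Available",
    "NVIDIA_DRIVER_CAPABILITIES": "Not Available",
    "NVIDIA_VISIBLE_DEVICES": "Not Available",
    "NVIDIA_PRODUCT_NAME": "Not Available",
    "RUNPOD_GPU_COUNT": "1",
    "machineId": "xb8r2j839zjl"
  }
}
""")

for name in missing_fields(sample):
    print(name)
```

In this report every NVIDIA-related variable is missing while the host still reports a driver and GPU, which is at least consistent with the template not passing the usual GPU settings into the container.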
Interesting, what template?