RunPod•10mo ago
jherrm

torch.cuda.is_available() is False

Spinning up several H100s (burning money 😅) and no matter which official docker image I use, torch.cuda.is_available() is always False, which prevents me from actually using these GPUs. I've tried the following docker images:
pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04

Output of various commands:
print(torch.version.cuda)
12.1
print(torch.cuda.device_count())
7
print(torch.cuda.is_available())
False

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
If the official images don't work, what am I supposed to do? Thanks for any help!
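For anyone hitting the same thing, here is a minimal sketch of a check that gathers the values above in one go and also forces CUDA initialization, so the underlying error message gets printed instead of just False. The helper name cuda_report is made up for illustration. It only assumes torch is installed, and it is not what the official images or any RunPod tool actually run.

```python
import importlib


def cuda_report():
    """Collect basic PyTorch/CUDA facts plus the error raised on CUDA init, if any."""
    report = {}
    try:
        torch = importlib.import_module("torch")
    except ImportError as exc:
        # torch is missing entirely; nothing else can be checked.
        report["torch"] = f"not importable: {exc}"
        return report

    report["torch"] = torch.__version__
    report["torch.version.cuda"] = torch.version.cuda
    report["device_count"] = torch.cuda.device_count()
    report["is_available"] = torch.cuda.is_available()

    try:
        # Force initialization and a tiny allocation so a failure raises here.
        torch.cuda.init()
        torch.zeros(1, device="cuda")
        report["init"] = "ok"
    except Exception as exc:
        report["init"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If is_available is False but device_count is non-zero, the "init" line is the interesting one, since it should carry the actual CUDA error text.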
4 Replies
Madiator2011•10mo ago
@jherrm Run my tool and send me the output: #RunPod GPU Tester (recommended for H100 users)
jherrmOP•10mo ago
Thanks @Papa Madiator for the quick response. I just spun up the instance again and ran your tool (no other runtime changes were made and no other programs were run). Here are the contents of the gpu_diagnostics.json file.
Brolios•10mo ago
I'm having the same issue. I'm using an RTX 4000, and it started after upgrading to A1111 1.8 (did not try to run the old version today).
{
  "PyTorch Version": "2.1.2+cu121",
  "Environment Info": {
    "RUNPOD_POD_ID": "s4tupctvnyggri",
    "Template CUDA_VERSION": "Not Available",
    "NVIDIA_DRIVER_CAPABILITIES": "Not Available",
    "NVIDIA_VISIBLE_DEVICES": "Not Available",
    "NVIDIA_PRODUCT_NAME": "Not Available",
    "RUNPOD_GPU_COUNT": "1",
    "machineId": "xb8r2j839zjl"
  },
  "Host Machine Info": {
    "CUDA Version": "12.2",
    "Driver Version": "535.104.12",
    "GPU Name": "NVIDIA RTX 4000 Ada Gene..."
  },
  "CUDA Test Result": {
    "GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
  }
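Since several of the environment fields in that output come back as "Not Available", a quick way to see what the container itself exposes is to read those variables directly. This is only a rough sketch: the function name gpu_env and the choice of keys are my own, taken from the fields shown in the diagnostics plus CUDA_VISIBLE_DEVICES from the error text, and it is not how the GPU Tester tool collects its data.

```python
import os

# Variables named in the diagnostics output and in the CUDA error message.
KEYS = (
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "NVIDIA_DRIVER_CAPABILITIES",
    "RUNPOD_GPU_COUNT",
)


def gpu_env(environ=os.environ):
    """Return the GPU-related variables, marking unset ones as 'Not Available'."""
    return {key: environ.get(key, "Not Available") for key in KEYS}


if __name__ == "__main__":
    for key, value in gpu_env().items():
        print(f"{key}={value}")
```

Run it inside the pod before starting Python work, so the values reflect what the process will actually inherit.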
Madiator2011•10mo ago
Interesting, what template?