RunPod•14mo ago

torch.cuda.is_available() is False

Spinning up several H100s (burning money 😅) and no matter which official docker image I use, torch.cuda.is_available() is always False, which prevents me from actually using these GPUs. I've tried the following docker images:

pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04

Output of various commands:

print(torch.version.cuda)
12.1
print(torch.cuda.device_count())
7
print(torch.cuda.is_available())
False

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

If the official images don't work, what am I supposed to do? Thanks for any help!
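
One way to narrow this down: torch.cuda.is_available() only returns a boolean, so it hides the reason CUDA failed to initialise. Below is a minimal sketch, assuming PyTorch is installed in the image, that tries a tiny allocation on each GPU and records the exception text instead. The name probe_cuda is made up for illustration; it is not part of PyTorch or of any RunPod tool.

```python
import json


def probe_cuda():
    """Return a dict mapping each visible GPU to "OK" or to the error it raised."""
    try:
        import torch
    except Exception as exc:  # torch missing or its native libraries failed to load
        return {"torch": f"Error: {exc}"}

    # In the post above, device_count() reported GPUs even though
    # is_available() was False, so iterate over device_count() directly.
    count = torch.cuda.device_count()
    if count == 0:
        return {"cuda": "Error: no CUDA devices visible to PyTorch"}

    results = {}
    for index in range(count):
        try:
            # Allocating a tensor forces CUDA initialisation on that device.
            x = torch.ones(2, device=f"cuda:{index}")
            torch.cuda.synchronize(index)
            ok = float(x.sum()) == 2.0
            results[f"GPU {index}"] = "OK" if ok else "Error: unexpected result"
        except Exception as exc:
            # Keep the message: it usually says more than a bare False.
            results[f"GPU {index}"] = f"Error: {exc}"
    return results


if __name__ == "__main__":
    print(json.dumps(probe_cuda(), indent=4))
```

Running this inside the pod and posting the printed messages gives others something more specific to work with than the False result alone.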

4 Replies

Madiator2011•14mo ago

@jherrm Run my tool and send me the output: #RunPod GPU Tester (recomended for H100 users)

jherrmOP•14mo ago

Thanks @Papa Madiator for the quick response. I just spun up the instance again and ran your tool (no other runtime changes were made and no other programs were run). Here are the contents of the gpu_diagnostics.json file:

gpu_diagnostics.json

Brolios•14mo ago

I'm having the same issue. I'm using an RTX 4000, and it started after upgrading to A1111 1.8 (I did not try to run the old version today).

{
    "PyTorch Version": "2.1.2+cu121",
    "Environment Info": {
        "RUNPOD_POD_ID": "s4tupctvnyggri",
        "Template CUDA_VERSION": "Not Available",
        "NVIDIA_DRIVER_CAPABILITIES": "Not Available",
        "NVIDIA_VISIBLE_DEVICES": "Not Available",
        "NVIDIA_PRODUCT_NAME": "Not Available",
        "RUNPOD_GPU_COUNT": "1",
        "machineId": "xb8r2j839zjl"
    },
    "Host Machine Info": {
        "CUDA Version": "12.2",
        "Driver Version": "535.104.12",
        "GPU Name": "NVIDIA RTX 4000 Ada Gene..."
    },
    "CUDA Test Result": {
        "GPU 0": "Error: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero."
    }
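
Several NVIDIA_* entries in the report above read "Not Available". To check the same variables by hand from inside a pod, here is a minimal sketch. It assumes the report's labels correspond to environment variables of the same name, which this thread does not confirm. "Template CUDA_VERSION" is guessed to be CUDA_VERSION, and CUDA_VISIBLE_DEVICES is added because the error message mentions it. collect_env is a hypothetical helper, not the tester's actual code.

```python
import json
import os

# Variable names taken from the "Environment Info" labels in the report above.
# Assumption: each label is an environment variable of the same name.
ENV_KEYS = [
    "RUNPOD_POD_ID",
    "CUDA_VERSION",            # guessed from the "Template CUDA_VERSION" label
    "NVIDIA_DRIVER_CAPABILITIES",
    "NVIDIA_VISIBLE_DEVICES",
    "NVIDIA_PRODUCT_NAME",
    "RUNPOD_GPU_COUNT",
    "CUDA_VISIBLE_DEVICES",    # mentioned in the CUDA error message
]


def collect_env(keys=ENV_KEYS, environ=None):
    """Return {name: value}, using "Not Available" for variables that are unset."""
    if environ is None:
        environ = os.environ
    return {key: environ.get(key, "Not Available") for key in keys}


if __name__ == "__main__":
    print(json.dumps({"Environment Info": collect_env()}, indent=4))
```

Passing a plain dict as environ makes it easy to compare a working pod with a failing one without touching the real environment.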

Madiator2011•14mo ago

Interesting, what template?
