CUDA not connecting to image provisioned for GPU
Started a community pod with 1 GPU (4090) using the RunPod PyTorch image/template (runpod/pytorch:2.4.0-py3.11-cuda12.4). Immediately after the pod starts, the GPU is unavailable to PyTorch even though nvidia-smi appears to see it. This happens on roughly 20% of the pods I start from this official container. No errors are thrown in the system or container logs.
root@5c367a0d4ea2:/# python -c "import torch; print(torch.cuda.is_available())"
/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
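In case it helps others reproduce this, here is a small probe I use that wraps the same check so a driver-level failure prints the actual exception instead of just `False` (a minimal sketch; `probe_cuda` is my own helper name, not RunPod or PyTorch tooling):

```python
def probe_cuda():
    """Return a one-line summary of PyTorch's CUDA state, or the failure reason."""
    try:
        import torch
        # device_count() forces CUDA initialization, so driver problems surface here
        return f"available={torch.cuda.is_available()} devices={torch.cuda.device_count()}"
    except Exception as exc:  # covers a missing torch install and driver init errors
        return f"probe failed: {exc}"

print(probe_cuda())
```

On a healthy 4090 pod this should print `available=True devices=1`; on the broken pods it prints `available=False devices=0` together with the UserWarning above.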
root@5c367a0d4ea2:/# nvidia-smi
Mon Mar 24 15:59:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 26C P8 11W / 450W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
(abridged due to message length)
2 Replies
Another pod, same issue: immediately after starting, the GPU is not available even though the pod is configured with 1x 4090. ssh [email protected] -i ~/.ssh/id_ed25519
-- RUNPOD.IO --
Enjoy your Pod #93qymj5jda8e60 ^^
For detailed documentation and guides, please visit:
https://docs.runpod.io/ and https://blog.runpod.io/
root@773fb48759c7:/# python -c "import torch; print(torch.cuda.is_available())"
/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
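Since the warning specifically mentions CUDA_VISIBLE_DEVICES, this stdlib-only snippet dumps the env vars and /dev nodes that commonly explain container-side CUDA init failures (a sketch with an illustrative variable list; `gpu_env_report` is my own helper name):

```python
import glob
import os

def gpu_env_report():
    """Collect env vars and device nodes relevant to CUDA init inside a container."""
    report = {var: os.environ.get(var, "<unset>")
              for var in ("CUDA_VISIBLE_DEVICES",
                          "NVIDIA_VISIBLE_DEVICES",
                          "LD_LIBRARY_PATH")}
    # Missing /dev/nvidia* nodes usually mean the container never received the GPU,
    # which would match nvidia-smi-on-host working while torch-in-container fails
    report["device_nodes"] = sorted(glob.glob("/dev/nvidia*"))
    return report

for key, value in gpu_env_report().items():
    print(f"{key}: {value}")
```

Comparing this output between a working pod and a broken one might show whether the 20% failures correlate with missing device nodes or a mangled CUDA_VISIBLE_DEVICES.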
Can you try running nvcc --version?
Which template are you using, specifically?
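To gather everything the replies ask for in one go, something like this shell sketch works; it guards each tool so a missing nvcc (which some runtime images omit) is reported rather than erroring out:

```shell
report=""
for tool in "nvcc --version" "nvidia-smi"; do
  cmd=${tool%% *}
  if command -v "$cmd" >/dev/null 2>&1; then
    # Capture the tool's output; missing nvcc is itself a useful data point
    report="$report\n== $tool ==\n$($tool 2>&1)"
  else
    report="$report\n$cmd: not found on PATH"
  fi
done
printf '%b\n' "$report"
```

Pasting the result from both a working and a failing pod would make the comparison easier.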