Pod is unable to find/use GPU in python
Hi,
I'm trying to connect to this pod:
RunPod Pytorch 2.2.0
ID: zgel6p985mjmmn
1 x A30
8 vCPU 31 GB RAM
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
On-Demand - Community Cloud
Running
40 GB Disk
20 GB Pod Volume
Volume Path: /workspace
I can see that it has a GPU with nvidia-smi, and the CUDA and PyTorch versions seem correct, but I cannot use the GPU with torch...
Can anyone help?
Best
```
root@54be7382bee1:~# python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.__version__
'2.2.0+cu121'
>>> exit()
root@54be7382bee1:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```
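For context on the version mix above: the wheel is built against CUDA 12.1 (`'2.2.0+cu121'`) while the driver reports 12.3, and that pairing is normally fine because the driver only needs to support a CUDA version at least as new as the toolkit. Here is a minimal sketch of that check; the helper names are made up for illustration, not a PyTorch or RunPod API:

```python
# Sketch: the driver's reported CUDA version (nvidia-smi header) must be
# >= the toolkit version the PyTorch wheel was built against.
def parse_version(v: str) -> tuple[int, int]:
    """Parse the major.minor part of a CUDA version string, e.g. '12.1.105'."""
    major, minor = v.split(".")[:2]
    return int(major), int(minor)

def driver_supports_toolkit(driver_cuda: str, toolkit_cuda: str) -> bool:
    """True if the driver's CUDA version is at least the toolkit's."""
    return parse_version(driver_cuda) >= parse_version(toolkit_cuda)

# Values from this pod: driver reports 12.3, wheel is cu121.
print(driver_supports_toolkit("12.3", "12.1"))  # → True
```

So version-wise the pod looks compatible, which points at the driver itself (the beta 12.3 stack discussed below) rather than a toolkit mismatch.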
```
root@54be7382bee1:~# nvidia-smi
Fri Feb 23 11:56:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:00:06.0 Off |                   On |
| N/A   45C    P0              31W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
```
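If you'd rather check the driver-side CUDA version from a script than eyeball the table, it can be pulled out of nvidia-smi's header line. A small sketch (the helper name is hypothetical):

```python
import re

def driver_cuda_version(smi_output: str) -> str:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    if match is None:
        raise RuntimeError("could not find CUDA version in nvidia-smi output")
    return match.group(1)

# On a pod you would capture the real output, e.g.:
#   out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
# Here, the header line from this thread:
header = "| NVIDIA-SMI 545.23.08    Driver Version: 545.23.08    CUDA Version: 12.3 |"
print(driver_cuda_version(header))  # → 12.3
```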
Maybe because the machine is running CUDA 12.3 which is not production ready.
most machines use CUDA 12.3 and with the 48GB GPU it works
@JM said they should all be on 12.2 because 12.3 is not production ready.
I haven't seen any machines on 12.3 personally.
hm just double checked and you are right. my 48GB GPU is actually on 12.2...
will keep an eye open for this in the future...
@ashleyk how do we use 12.2? I spawned an H100 SXM5 pod with the image runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04, but nvidia-smi still shows CUDA 12.3.
ID: axwx9s1edwts9x
Facing the same issue as @annah_do
This happens even if I change my template to: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
Solution
@Dhruv Mullick I don't think it has to do with the image... If you select the pod from the RunPod website, there is a filter button at the top and then a drop-down menu where you can select 12.2 under "Allowed CUDA Versions".
as @ashleyk pointed out earlier, 'the machine is running CUDA 12.3 which is not production ready'. If I select 12.2, it works.
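For anyone creating pods programmatically rather than through the web UI, the same filter would be expressed as an allowed-CUDA-versions list that simply excludes 12.3. This is a hypothetical sketch of the request body; the exact field name is an assumption, so verify it against your RunPod API/SDK version:

```python
# Hypothetical pod-create request body. `allowed_cuda_versions` mirrors the
# "Allowed CUDA Versions" filter in the RunPod web UI (the field name is an
# assumption -- check your SDK/API docs before relying on it).
pod_request = {
    "name": "pytorch-a30",
    "image_name": "runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04",
    "gpu_type_id": "NVIDIA A30",
    "cloud_type": "COMMUNITY",
    # Pin to production-ready driver stacks; 12.3 is deliberately absent.
    "allowed_cuda_versions": ["12.0", "12.1", "12.2"],
}
print("12.3" in pod_request["allowed_cuda_versions"])  # → False
```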
Awesome, thank you @annah_do ! I thought it was the image that was controlling this.
Even with CUDA 12.2 I'm seeing the same error now.
How did you install torch?
Probably conda breaking stuff, conda sucks
I just used the torch from the latest torch + CUDA template (I think it was runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04, but I've now deleted the pod).
RunPod templates don't use conda though, as far as I'm aware. Your application probably installed it.
This is a clean VM, with no other commands executed but the ones shown above 😅
That's not true, it does not say `(torch_env)` in front of my prompt like yours does with a clean pod. That only happens when that crap conda gets installed.
And it shows that CUDA is available on A100.
So I don't know what you are doing, but you are clearly doing something wrong.
Hey guys!
Yep, thanks @ashleyk
Indeed, it's possible that some machines slip through with 12.3, but the bulk are on 12.2. As already mentioned, 12.3 is beta and we recommend production-ready drivers 🙂