RunPod•10mo ago
annah_do

Pod is unable to find/use GPU in python

Hi, I'm trying to connect to this pod:

RunPod Pytorch 2.2.10
ID: zgel6p985mjmmn
1 x A30, 8 vCPU, 31 GB RAM
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
On-Demand - Community Cloud
Running
40 GB Disk, 20 GB Pod Volume
Volume Path: /workspace

I can see that it has a GPU with nvidia-smi, and the cuda and pytorch version seem correct, but I cannot use the GPU with torch... Can anyone help?

Best

```
root@54be7382bee1:~# python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.__version__
'2.2.0+cu121'
>>> exit()
root@54be7382bee1:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```
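For anyone reproducing this, the interactive session above can be condensed into one small script. This is only a sketch: `cuda_report` is a made-up helper name, and it assumes nothing beyond the standard `torch` attributes already used in the session plus `torch.version.cuda` and `torch.cuda.device_count()`. It degrades to a short report if torch is not installed at all.

```python
try:
    import torch
except ImportError:
    # Keep the script usable on machines without torch installed.
    torch = None


def cuda_report():
    """Collect the facts that matter when torch cannot see the GPU.

    Hypothetical helper, not part of torch or of any RunPod tooling.
    """
    if torch is None:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,     # e.g. '2.2.0+cu121'
        "built_for_cuda": torch.version.cuda,   # CUDA version torch was built against
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False while `nvidia-smi` still lists the GPU, the problem sits between the driver and the CUDA runtime rather than in the Python code.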
17 Replies
annah_do
annah_doOP•10mo ago
root@54be7382bee1:~# nvidia-smi
Fri Feb 23 11:56:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:00:06.0 Off |                   On |
| N/A   45C    P0              31W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
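The "CUDA Version" in the `nvidia-smi` header is reported by the host driver, while `nvcc --version` reports the toolkit shipped in the container image, which is why the two can disagree (12.3 vs 12.1 here). A minimal sketch for pulling both header values out of the `nvidia-smi` text is shown below; `parse_nvidia_smi_header` is a hypothetical helper name, and the regular expression assumes the header layout seen in this thread.

```python
import re

# Matches the header line printed by nvidia-smi, as seen in this thread.
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version), or None if the header is missing."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


sample = "| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |"
print(parse_nvidia_smi_header(sample))  # ('545.23.08', '12.3')
```

On a live pod the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the pasted sample.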
ashleyk
ashleyk•10mo ago
Maybe because the machine is running CUDA 12.3 which is not production ready.
annah_do
annah_doOP•10mo ago
most machines use CUDA 12.3 and with the 48GB GPU it works
ashleyk
ashleyk•10mo ago
@JM said they should all be on 12.2 because 12.3 is not production ready. I haven't seen any machines on 12.3 personally.
annah_do
annah_doOP•10mo ago
hm, just double checked and you are right. my 48GB GPU is actually on 12.2... will keep an eye open for this in the future...
Dhruv Mullick
Dhruv Mullick•10mo ago
@ashleyk how do we use 12.2? I spawned an H100 SXM5 pod with the image runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04, but nvidia-smi still shows that CUDA is 12.3 (ID: axwx9s1edwts9x). I'm facing the same issue as @annah_do. This happens even if I change my template to runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04.
Solution
annah_do
annah_do•10mo ago
@Dhruv Mullick I don't think it has to do with the image... If you select the pod from the RunPod website, there is a filter button at the top and then a drop-down menu where you can select 12.2 under "Allowed CUDA Versions". As @ashleyk pointed out earlier, 'the machine is running CUDA 12.3 which is not production ready'. If I select 12.2, it works.
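The same idea can be expressed as a check you run once a pod is up. This is a rough sketch, not a RunPod API: `host_problems` and the allow-list are hypothetical and simply mirror the "Allowed CUDA Versions" filter described above. The first check rests on the general rule that the driver-reported CUDA version should not be older than the toolkit in the image.

```python
def parse_version(text):
    """Turn '12.1.1' into (12, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def host_problems(host_cuda, image_cuda, allowed):
    """List reasons a host looks unsuitable for an image.

    host_cuda:  CUDA version reported by nvidia-smi on the host, e.g. '12.3'
    image_cuda: CUDA toolkit version baked into the image, e.g. '12.1.1'
    allowed:    host CUDA versions you accept (an assumption of this sketch)
    """
    problems = []
    if parse_version(host_cuda)[:2] < parse_version(image_cuda)[:2]:
        problems.append(f"host CUDA {host_cuda} is older than image CUDA {image_cuda}")
    host_major_minor = ".".join(host_cuda.split(".")[:2])
    if host_major_minor not in allowed:
        problems.append(f"host CUDA {host_major_minor} is not in the allowed set")
    return problems


# The case from this thread: a 12.3 host with a cuda12.1.1 image.
print(host_problems("12.3", "12.1.1", {"12.2"}))
print(host_problems("12.2", "12.1.1", {"12.2"}))
```

The first call reports one problem (12.3 is not allowed) and the second reports none, which matches what was observed when filtering for 12.2.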
Dhruv Mullick
Dhruv Mullick•10mo ago
Awesome, thank you @annah_do ! I thought it was the image that was controlling this.
Dhruv Mullick
Dhruv Mullick•10mo ago
Even with Cuda 12.2 I'm seeing the same error now
ashleyk
ashleyk•10mo ago
How did you install torch? Probably conda breaking stuff, conda sucks
Dhruv Mullick
Dhruv Mullick•10mo ago
I just used the torch from the latest torch + CUDA template (I think it was runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04, but I've now deleted the pod)
ashleyk
ashleyk•10mo ago
RunPod templates don't use conda though, as far as I'm aware. Your application probably installed it
Dhruv Mullick
Dhruv Mullick•10mo ago
This is a clean VM, with no other commands executed but the ones shown above 😅
ashleyk
ashleyk•10mo ago
That's not true. On a clean pod, my prompt does not say (torch_env) in front of it like yours does.
ashleyk
ashleyk•10mo ago
That only happens when that crap conda gets installed. And it shows that CUDA is available on A100.
>>> torch.cuda.is_available()
True
So I don't know what you are doing, but you are clearly doing something wrong.
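A quick way to settle the "is conda involved?" question is to look at the environment the interpreter is running in. The sketch below relies on `CONDA_DEFAULT_ENV`, which `conda activate` sets (and which produces prompt prefixes such as `(torch_env)`); `describe_python_env` is a hypothetical helper name.

```python
import os
import sys


def describe_python_env(environ=None):
    """Report which interpreter is running and whether a conda env is active."""
    environ = os.environ if environ is None else environ
    conda_env = environ.get("CONDA_DEFAULT_ENV")  # set by `conda activate`
    return {
        "executable": sys.executable,  # path of the running python binary
        "prefix": sys.prefix,          # root of the active installation or env
        "conda_env": conda_env,
        "in_conda_env": conda_env is not None,
    }


if __name__ == "__main__":
    for key, value in describe_python_env().items():
        print(f"{key}: {value}")
```

If `in_conda_env` is True, the `torch` being imported may come from that environment rather than from the image's `/usr/local/lib/python3.10/dist-packages`, so it is worth checking `torch.__file__` as well.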
JM
JM•9mo ago
Hey guys! Yep, thanks @ashleyk. Indeed, it is possible that a few machines slip through with 12.3, but the bulk are on 12.2. As already mentioned, 12.3 is beta and we recommend production-ready drivers 🙂