CUDA 12.3 support

I created a template with a custom image (based on runpod/containers) to run CUDA 12.3, but when I use PyTorch 2.1.2 + Python 3.10, it tells me that CUDA is not available:
python3 -c "import torch; print(torch.cuda.is_available())"

CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
The same Docker image works locally on my machine, so I assume this is something on your side, or am I wrong?
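For reference, a minimal diagnostic sketch (assuming torch is installed and nvidia-smi is on the PATH inside the container) that prints the CUDA version PyTorch was built against next to the host driver version, and surfaces the underlying init error instead of a bare False:
# cuda_check.py -- minimal sketch, not the original setup
import subprocess

import torch

print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)

try:
    driver = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip()
    print("host driver:", driver)
except (OSError, subprocess.CalledProcessError) as exc:
    print("could not query nvidia-smi:", exc)

# torch.cuda.is_available() swallows the underlying failure and just returns False;
# torch.cuda.init() re-raises it, so the real cause (e.g. error 804) is visible.
try:
    torch.cuda.init()
    print("CUDA devices:", torch.cuda.device_count())
except RuntimeError as exc:
    print("CUDA init failed:", exc)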
ashleyk
ashleyk6mo ago
Does your pod actually have CUDA 12.3 when you run nvidia-smi ? Probably 12.1 or 12.2 and not 12.3
NERDDISCO
NERDDISCO6mo ago
Yes, it does
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
ashleyk
ashleyk6mo ago
And RunPod still hasn't added CUDA 12.3 to the CUDA filter dropdown 😱 Oh, that's weird
ashleyk
ashleyk6mo ago
I wonder whether the hosts that were upgraded to 12.3 were actually rebooted when they were upgraded. Which region is your pod in?
NERDDISCO
NERDDISCO6mo ago
EU-RO-1 on a 4090
ashleyk
ashleyk6mo ago
We were also getting this issue on the 4090's in EU-RO-1 in the other thread. https://discord.com/channels/912829806415085598/1195065705486360576
NERDDISCO
NERDDISCO6mo ago
yeah exactly, I was just searching Discord for this problem, which is how I found the other thread 😄 I couldn't explain why this setup wouldn't work
ashleyk
ashleyk6mo ago
Have you tried another region to see if you still get the error? Seems there are a few broken 4090s in RO
NERDDISCO
NERDDISCO6mo ago
oh no, I haven't done that yet
ashleyk
ashleyk6mo ago
Some 4090s in RO are fine; it's basically like flipping a coin to determine whether you get a good one or a bad one
NERDDISCO
NERDDISCO6mo ago
oh nice 😄 RunPod Roulette
Madiator2011
Madiator20116mo ago
driver issues, most likely
NERDDISCO
NERDDISCO6mo ago
so something that I can't fix myself, right?
ashleyk
ashleyk6mo ago
Yeah @Madiator2011 seems to be driver issues, @JM is looking into it for us. Unfortunately not.
NERDDISCO
NERDDISCO6mo ago
ok perfect, then I'll sit back and wait
ashleyk
ashleyk6mo ago
You can try US-OR-1 if you don't need network storage. Oh my bad, US-OR-1 only has CUDA 12.2, not 12.3:
root@7ed16fef8b00:~# nvidia-smi
Tue Jan 16 16:07:30 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:81:00.0 Off | Off |
| 0% 33C P8 23W / 450W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
@JM we really need CUDA 12.3 to be added to the CUDA filter dropdown. Shouldn't be hard or take a lot of time to add it.
JM
JM6mo ago
Good morning both! @ashleyk @NERDDISCO CUDA 12.3 is a beta driver on Linux; we use production drivers (12.2 atm). CUDA 12.3 means 545+ drivers, which I wouldn't trust for production at the moment.
[Screenshot attached]
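As a rough illustration of the relationship JM is describing, the sketch below maps a driver branch to the newest CUDA runtime it can initialize; the minimum-driver numbers are taken from NVIDIA's CUDA release notes and are worth double-checking there (patch levels are ignored for simplicity):
# Illustrative only: minimum Linux driver (major, minor) per CUDA runtime,
# per NVIDIA's CUDA release notes; patch level is ignored for simplicity.
MIN_LINUX_DRIVER_FOR_CUDA = {
    "12.3": (545, 23),  # 545.23.06+
    "12.2": (535, 54),  # 535.54.03+
    "12.1": (530, 30),  # 530.30.02+
    "12.0": (525, 60),  # 525.60.13+
    "11.8": (520, 61),  # 520.61.05+
}

def max_cuda_for_driver(driver_version: str) -> str:
    """Return the newest CUDA runtime the given driver branch can run."""
    major, minor, *_ = (int(part) for part in driver_version.split("."))
    supported = [
        cuda
        for cuda, minimum in MIN_LINUX_DRIVER_FOR_CUDA.items()
        if (major, minor) >= minimum
    ]
    return max(supported, default="< 11.8")

# Driver 535.129.03 (what the EU-RO-1 pod reported) tops out at CUDA 12.2,
# which is why a 12.3 container fails with error 804.
print(max_cuda_for_driver("535.129.03"))  # -> 12.2
print(max_cuda_for_driver("545.23.08"))   # -> 12.3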
JM
JM6mo ago
I am curious to know how we can better support you though. Why did you specifically need 12.3?
NERDDISCO
NERDDISCO6mo ago
I wanted to test whether my pipeline is faster with the latest CUDA, as I don't have a 4090 locally. For my real-time app, every ms I can shave off counts. And if it's actually that good, I need a provider that can actually give me a 4090 with CUDA 12.3. But I can totally understand that this is not a use case for you. I will try to find another way for this.
ashleyk
ashleyk6mo ago
By the way, if it's the case that you don't use beta drivers for production, then why did @NERDDISCO get this when he ran nvidia-smi on his pod in EU-RO-1?
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
JM
JM6mo ago
By leveraging Docker containers, we are able to achieve those lightning-fast cold start times and have that amount of flexibility for deployment. The downside is that the Docker images rely on the BareMetal infrastructure below them. If that has 12.2, containers won't be able to run anything newer than 12.2. When running nvidia-smi inside a container, you get the CUDA installed by your container, not the BareMetal one 🙂
ashleyk
ashleyk6mo ago
I always get the baremetal one
JM
JM6mo ago
And that's why even though it says 12.3, it doesn't work. Oh really? That's weird
ashleyk
ashleyk6mo ago
Yep
JM
JM6mo ago
@NERDDISCO Could you provide me with your pod ID? I can tell you which CUDA is installed. I am surprised, because you can run an 11.8 template for example, and 11.8 shows
ashleyk
ashleyk6mo ago
So for example, my pod with ID s53c7bzefygmvt is running this Docker image:
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
So based on what you are saying, nvidia-smi should say CUDA 11.8, but it shows CUDA 12.2:
root@a5733c0056a2:~# nvidia-smi
Tue Jan 16 16:56:24 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:01:00.0 Off | Off |
| 95% 26C P8 19W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Which I am 99.999999999999999999% confident is the CUDA version of the host machine. This is what shows me the version of the container:
root@a5733c0056a2:~# /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
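A small sketch (assuming nvidia-smi is on the PATH and the toolkit sits at the usual /usr/local/cuda path) that prints the two numbers being compared here side by side:
# Contrast the CUDA version the host driver advertises (nvidia-smi) with the
# toolkit the container was built with (nvcc). Paths and tools are assumptions.
import re
import subprocess

smi = subprocess.check_output(["nvidia-smi"], text=True)
driver_cuda = re.search(r"CUDA Version:\s*([\d.]+)", smi)
print("driver reports CUDA :", driver_cuda.group(1) if driver_cuda else "unknown")

nvcc = subprocess.check_output(["/usr/local/cuda/bin/nvcc", "--version"], text=True)
toolkit_cuda = re.search(r"release\s+([\d.]+)", nvcc)
print("container toolkit   :", toolkit_cuda.group(1) if toolkit_cuda else "unknown")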
NERDDISCO
NERDDISCO6mo ago
ah sorry, it is already terminated. I will start one up later
JM
JM6mo ago
Ah, you're right @ashleyk. Reminder that I have the utmost dedication to improving things, but you're still more knowledgeable than me, considering that's not my area of expertise. That's helpful to know though! No worries, thanks 🙂 @ashleyk. For NERDDISCO, I am confused as to why they are getting 12.3 haha
ashleyk
ashleyk6mo ago
Maybe a new deployment of 4090's was done in RO and it installed the latest version instead of capping it at 12.2?
JM
JM6mo ago
Let me validate that @ashleyk. Of all the ones there, there are a total of 2 servers in RO with CUDA 12.3. Those were probably for a PoC. That's like less than 1% of servers.
ashleyk
ashleyk6mo ago
Ah yeah I noticed, I tried to get one with 12.3 but failed 😆 Seems to be mostly a mix of 12.0 and 12.2
jxxg2jijy0fpcu: CUDA Version: 12.0
bbaaa1xs99m7s1: CUDA Version: 12.2
jas98z02y6cqu2: CUDA Version: 12.2
zngyrqx0zejzrc: CUDA Version: 12.0
bumt2dhr5fbna4: CUDA Version: 12.0
eihobh5zw0hyd4: CUDA Version: 12.2
73s5ao5j3yc9dl: CUDA Version: 12.2
hc0nfnhj1tlizt: CUDA Version: 12.0
nyc6ziaphuxbpt: CUDA Version: 12.2
j8hq6ba8knfrua: CUDA Version: 12.2
6r9cxermo7lo68: CUDA Version: 12.0
3xkss00fasldtb: CUDA Version: 12.0
jg6pdbwlc34hfy: CUDA Version: 12.2
0ffqkqpk0o111r: CUDA Version: 12.0
co8rm1lrtm5jqz: CUDA Version: 12.0
bye4vydo740epb: CUDA Version: 12.2
NERDDISCO
NERDDISCO6mo ago
ok I'm back and will try to reproduce the issue. @JM so I created a new pod with my CUDA 12.3 image and this is the result:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
@ashleyk @JM which command should I use to find out the underlying version? As I understand it, you said it shouldn't be possible to have a CUDA 12.3 machine, right? I guess I see this in the CLI because I'm using nvidia/cuda:12.3.1-devel-ubuntu22.04 as the base image? The ID is pxaoil4kxl6j9k. So inside my container I can of course have any version of CUDA I want, or not? But the question is whether the actual server supports it, and that's why I get the error with PyTorch, or?
JM
JM6mo ago
Checked this pod. Confirming that this is CUDA 12.2. Your nvidia-smi probably doesn't give the underlying version because it's not compatible with the one you are attempting to run. That would be my hypothesis. Try this command:
/usr/local/cuda/bin/nvcc --version
NERDDISCO
NERDDISCO6mo ago
/usr/local/cuda/bin/nvcc --version is what I used. I guess so too. Would it be an option for you to give users the information that CUDA 12.3 (or any other version in the future) is still in beta, so people use it at their own risk? Because in the end it should be the user's decision if they want to use the latest shit, or? Or is it just not possible to have this fine-grained control, as the server itself has to use the latest version of CUDA to be able to support all the versions in the Docker container? Meaning the server itself would need to run CUDA 12.3, even when the user chooses CUDA 12.1 in their pod?
JM
JM6mo ago
Well, I am fine with users attempting to use any version they want. The problem I see though is that even if they attempt to, they won't find much, unfortunately. Exactly. The BareMetal driver version is the limiter. E.g. if the server has CUDA 12.0 installed, that's the latest version that will be able to run.
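A fail-fast guard along those lines could look roughly like this (a hypothetical startup check, assuming nvidia-smi and nvcc are available on the PATH inside the container); it stops with a readable message instead of letting PyTorch fail later with error 804:
# Hypothetical startup guard: abort if the container's CUDA toolkit is newer
# than what the host driver supports. Assumes nvidia-smi and nvcc are on PATH.
import re
import subprocess
import sys

def cuda_version(cmd, pattern):
    out = subprocess.check_output(cmd, text=True)
    match = re.search(pattern, out)
    return tuple(int(x) for x in match.group(1).split(".")) if match else None

driver_cap = cuda_version(["nvidia-smi"], r"CUDA Version:\s*([\d.]+)")
toolkit = cuda_version(["nvcc", "--version"], r"release\s+([\d.]+)")

if driver_cap and toolkit and toolkit > driver_cap:
    sys.exit(
        f"Container toolkit is CUDA {'.'.join(map(str, toolkit))} but the host driver "
        f"only supports up to {'.'.join(map(str, driver_cap))}; pick a pod whose driver "
        "matches or rebuild on an older CUDA base image."
    )
print("CUDA toolkit/driver combination looks compatible.")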
NERDDISCO
NERDDISCO6mo ago
yeah I see, makes sense that you don't want to have the BareMetal driver on something that is BETA. Where do you have this screenshot from? I would love to monitor when CUDA 12.3 becomes stable
JM
JM6mo ago
Oh, that is the NVIDIA website haha. Of course, let me provide the link to you @NERDDISCO: https://www.nvidia.com/download/find.aspx
JM
JM6mo ago
[Screenshot attached]
NERDDISCO
NERDDISCO6mo ago
Awesome thanks! And you say that CUDA 12.3 runs only with beta drivers right? And that is the concern here correct?
JM
JM6mo ago
Indeed, 535.154.05 is 12.2
NERDDISCO
NERDDISCO6mo ago
Thanks for the support 🙏❤️
JM
JM6mo ago
Of course!