CUDA 12.3 support

I created a template with a custom image (based on runpod/containers) to run CUDA 12.3, but when I use PyTorch 2.1.2 + Python 3.10, it tells me that CUDA is not available:
python3 -c "import torch; print(torch.cuda.is_available())"

CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
The same Docker image works locally on my machine, so I assume this is something on your side, or am I wrong?
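For reference, a minimal diagnostic sketch (assuming torch is installed and nvidia-smi is on the PATH inside the container) that prints the CUDA version PyTorch was built against next to the host driver version, and surfaces the underlying init error instead of a bare False:
# cuda_check.py -- minimal sketch, not the original setup
import subprocess

import torch

print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)

try:
    driver = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip()
    print("host driver:", driver)
except (OSError, subprocess.CalledProcessError) as exc:
    print("could not query nvidia-smi:", exc)

# torch.cuda.is_available() swallows the underlying failure and just returns False;
# torch.cuda.init() re-raises it, so the real cause (e.g. error 804) is visible.
try:
    torch.cuda.init()
    print("CUDA devices:", torch.cuda.device_count())
except RuntimeError as exc:
    print("CUDA init failed:", exc)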
ashleyk
ashleyk6mo ago
Does your pod actually have CUDA 12.3 when you run nvidia-smi ? Probably 12.1 or 12.2 and not 12.3
NERDDISCO
NERDDISCO6mo ago
Yes, it does
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
ashleyk
ashleyk6mo ago
And RunPod still hasn't added CUDA 12.3 to the CUDA filter dropdown 😱 Oh, that's weird
ashleyk
ashleyk6mo ago
I wonder whether the hosts that were upgraded to 12.3 were actually rebooted when they were upgraded. Which region is your pod in?
NERDDISCO
NERDDISCO6mo ago
EU-RO-1 on a 4090
ashleyk
ashleyk6mo ago
We were also getting this issue on the 4090's in EU-RO-1 in the other thread. https://discord.com/channels/912829806415085598/1195065705486360576
NERDDISCO
NERDDISCO6mo ago
yeah exactly, I was just searching Discord for this problem, which is how I found the other thread 😄 I couldn't explain why this setup wouldn't work
ashleyk
ashleyk6mo ago
Have you tried another region to see if you still get the error? Seems there are a few broken 4090s in RO
NERDDISCO
NERDDISCO6mo ago
oh no, I haven't done that yet
ashleyk
ashleyk6mo ago
Some 4090s in RO are fine; it's basically like flipping a coin to determine whether you get a good one or a bad one
NERDDISCO
NERDDISCO6mo ago
oh nice 😄 RunPod Roulette
Madiator2011
Madiator20116mo ago
driver issues, most likely
NERDDISCO
NERDDISCO6mo ago
so something that I can't fix myself, right?
ashleyk
ashleyk6mo ago
Yeah @Madiator2011 seems to be driver issues, @JM is looking into it for us. Unfortunately not.
NERDDISCO
NERDDISCO6mo ago
ok perfect, then I'll sit back and wait
ashleyk
ashleyk6mo ago
You can try US-OR-1 if you don't need network storage. Oh my bad, US-OR-1 only has CUDA 12.2, not 12.3:
root@7ed16fef8b00:~# nvidia-smi
Tue Jan 16 16:07:30 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:81:00.0 Off | Off |
| 0% 33C P8 23W / 450W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
@JM we really need CUDA 12.3 to be added to the CUDA filter dropdown. Shouldn't be hard or take a lot of time to add it.
JM
JM6mo ago
Good morning both! @ashleyk @NERDDISCO CUDA 12.3 is a beta driver on Linux; we use production drivers (12.2 atm). CUDA 12.3 means 545+ drivers, which I wouldn't trust for production at the moment.
[Screenshot attached]
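As a rough illustration of the relationship JM is describing, the sketch below maps a driver branch to the newest CUDA runtime it can initialize; the minimum-driver numbers are taken from NVIDIA's CUDA release notes and are worth double-checking there (patch levels are ignored for simplicity):
# Illustrative only: minimum Linux driver (major, minor) per CUDA runtime,
# per NVIDIA's CUDA release notes; patch level is ignored for simplicity.
MIN_LINUX_DRIVER_FOR_CUDA = {
    "12.3": (545, 23),  # 545.23.06+
    "12.2": (535, 54),  # 535.54.03+
    "12.1": (530, 30),  # 530.30.02+
    "12.0": (525, 60),  # 525.60.13+
    "11.8": (520, 61),  # 520.61.05+
}

def max_cuda_for_driver(driver_version: str) -> str:
    """Return the newest CUDA runtime the given driver branch can run."""
    major, minor, *_ = (int(part) for part in driver_version.split("."))
    supported = [
        cuda
        for cuda, minimum in MIN_LINUX_DRIVER_FOR_CUDA.items()
        if (major, minor) >= minimum
    ]
    return max(supported, default="< 11.8")

# Driver 535.129.03 (what the EU-RO-1 pod reported) tops out at CUDA 12.2,
# which is why a 12.3 container fails with error 804.
print(max_cuda_for_driver("535.129.03"))  # -> 12.2
print(max_cuda_for_driver("545.23.08"))   # -> 12.3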
JM
JM6mo ago
I am curious to know how we can better support you though. Why did you specifically need 12.3?
NERDDISCO
NERDDISCO6mo ago
I wanted to test whether my pipeline is faster with the latest CUDA, as I don't have a 4090 locally. For my real-time app, every ms I can shave off counts. And if it's actually that good, I need a provider that can actually give me a 4090 with CUDA 12.3. But I can totally understand that this is not a use case for you. I will try to find another way for this.
ashleyk
ashleyk6mo ago
By the way, if it's the case that you don't use beta drivers for production, then why did @NERDDISCO get this when he ran nvidia-smi on his pod in EU-RO-1?
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
JM
JM6mo ago
By leveraging Docker containers, we are able to achieve those lightning-fast cold start times and have that amount of flexibility for deployment. The downside is that the Docker images rely on the BareMetal infrastructure below them. If that has 12.2, containers won't be able to run anything newer than 12.2. When running nvidia-smi inside a container, you get the CUDA installed by your container, not the BareMetal one 🙂
ashleyk
ashleyk6mo ago
I always get the baremetal one
JM
JM6mo ago
And that's why even though it says 12.3, it doesn't work. Oh really? That's weird
ashleyk
ashleyk6mo ago
Yep
JM
JM6mo ago
@NERDDISCO Could you provide me with your pod ID? I can tell you which CUDA is installed. I am surprised, because you can run an 11.8 template for example, and 11.8 shows
ashleyk
ashleyk6mo ago
So for example, my pod with ID s53c7bzefygmvt is running this Docker image:
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
So based on what you are saying, nvidia-smi should say CUDA 11.8, but it shows CUDA 12.2:
root@a5733c0056a2:~# nvidia-smi
Tue Jan 16 16:56:24 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:01:00.0 Off | Off |
| 95% 26C P8 19W / 230W | 2MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Which I am 99.999999999999999999% confident is the CUDA version of the host machine. This is what shows me the version of the container:
root@a5733c0056a2:~# /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
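A small sketch (assuming nvidia-smi is on the PATH and the toolkit sits at the usual /usr/local/cuda path) that prints the two numbers being compared here side by side:
# Contrast the CUDA version the host driver advertises (nvidia-smi) with the
# toolkit the container was built with (nvcc). Paths and tools are assumptions.
import re
import subprocess

smi = subprocess.check_output(["nvidia-smi"], text=True)
driver_cuda = re.search(r"CUDA Version:\s*([\d.]+)", smi)
print("driver reports CUDA :", driver_cuda.group(1) if driver_cuda else "unknown")

nvcc = subprocess.check_output(["/usr/local/cuda/bin/nvcc", "--version"], text=True)
toolkit_cuda = re.search(r"release\s+([\d.]+)", nvcc)
print("container toolkit   :", toolkit_cuda.group(1) if toolkit_cuda else "unknown")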
NERDDISCO
NERDDISCO6mo ago
ah sorry, it is already terminated. I will start one up later
JM
JM6mo ago
Ah, you're right @ashleyk. Reminder that I have the utmost dedication to improving things, but you're still more knowledgeable than me, considering that's not my area of expertise. That's helpful to know though! No worries, thanks 🙂 @ashleyk. For NERDDISCO, I am confused as to why they are getting 12.3 haha
ashleyk
ashleyk6mo ago
Maybe a new deployment of 4090's was done in RO and it installed the latest version instead of capping it at 12.2?
JM
JM6mo ago
Let me validate that @ashleyk. Of all the ones there, there are a total of 2 servers in RO with CUDA 12.3. Those were probably for a PoC. That's like less than 1% of servers.
ashleyk
ashleyk6mo ago
Ah yeah I noticed, I tried to get one with 12.3 but failed 😆 Seems to be mostly a mix of 12.0 and 12.2
jxxg2jijy0fpcu: CUDA Version: 12.0
bbaaa1xs99m7s1: CUDA Version: 12.2
jas98z02y6cqu2: CUDA Version: 12.2
zngyrqx0zejzrc: CUDA Version: 12.0
bumt2dhr5fbna4: CUDA Version: 12.0
eihobh5zw0hyd4: CUDA Version: 12.2
73s5ao5j3yc9dl: CUDA Version: 12.2
hc0nfnhj1tlizt: CUDA Version: 12.0
nyc6ziaphuxbpt: CUDA Version: 12.2
j8hq6ba8knfrua: CUDA Version: 12.2
6r9cxermo7lo68: CUDA Version: 12.0
3xkss00fasldtb: CUDA Version: 12.0
jg6pdbwlc34hfy: CUDA Version: 12.2
0ffqkqpk0o111r: CUDA Version: 12.0
co8rm1lrtm5jqz: CUDA Version: 12.0
bye4vydo740epb: CUDA Version: 12.2
NERDDISCO
NERDDISCO6mo ago
ok I'm back and will try to reproduce the issue. @JM so I created a new pod with my CUDA 12.3 image and this is the result:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
@ashleyk @JM which command should I use to find out the underlying version? As I understand it, you said it shouldn't be possible to have a CUDA 12.3 machine, right? I guess I see this in the CLI because I'm using nvidia/cuda:12.3.1-devel-ubuntu22.04 as the base image? The ID is pxaoil4kxl6j9k. So inside my container I can of course have any version of CUDA I want, or not? But the question is whether the actual server supports it, and that's why I get the error with PyTorch, or?
JM
JM6mo ago
Checked this pod. Confirming that this is CUDA 12.2. Your nvidia-smi probably doesn't give the underlying version because it's not compatible with the one you are attempting to run. That would be my hypothesis. Try this command:
/usr/local/cuda/bin/nvcc --version
NERDDISCO
NERDDISCO6mo ago
/usr/local/cuda/bin/nvcc --version is what I used. I guess so too. Would it be an option for you to give users the information that CUDA 12.3 (or any other version in the future) is still in beta, so people use it at their own risk? Because in the end it should be the user's decision if they want to use the latest shit, or? Or is it just not possible to have this fine-grained control, as the server itself has to use the latest version of CUDA to be able to support all the versions in the Docker container? Meaning the server itself would need to run CUDA 12.3, even when the user chooses CUDA 12.1 in their pod?
JM
JM6mo ago
Well, I am fine with users attempting to use any version they want. The problem I see though is that even if they attempt to, they won't find much, unfortunately. Exactly. The BareMetal driver version is the limiter. E.g. if the server has CUDA 12.0 installed, that's the latest version that will be able to run.
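A fail-fast guard along those lines could look roughly like this (a hypothetical startup check, assuming nvidia-smi and nvcc are available on the PATH inside the container); it stops with a readable message instead of letting PyTorch fail later with error 804:
# Hypothetical startup guard: abort if the container's CUDA toolkit is newer
# than what the host driver supports. Assumes nvidia-smi and nvcc are on PATH.
import re
import subprocess
import sys

def cuda_version(cmd, pattern):
    out = subprocess.check_output(cmd, text=True)
    match = re.search(pattern, out)
    return tuple(int(x) for x in match.group(1).split(".")) if match else None

driver_cap = cuda_version(["nvidia-smi"], r"CUDA Version:\s*([\d.]+)")
toolkit = cuda_version(["nvcc", "--version"], r"release\s+([\d.]+)")

if driver_cap and toolkit and toolkit > driver_cap:
    sys.exit(
        f"Container toolkit is CUDA {'.'.join(map(str, toolkit))} but the host driver "
        f"only supports up to {'.'.join(map(str, driver_cap))}; pick a pod whose driver "
        "matches or rebuild on an older CUDA base image."
    )
print("CUDA toolkit/driver combination looks compatible.")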
NERDDISCO
NERDDISCO6mo ago
yeah I see, makes sense that you don't want to have the BareMetal driver on something that is BETA. Where do you have this screenshot from? I would love to monitor when CUDA 12.3 becomes stable
JM
JM6mo ago
Oh, that is the NVIDIA website haha. Of course, let me provide the link to you @NERDDISCO: https://www.nvidia.com/download/find.aspx
JM
JM6mo ago
[Screenshot attached]
NERDDISCO
NERDDISCO6mo ago
Awesome thanks! And you say that CUDA 12.3 runs only with beta drivers right? And that is the concern here correct?
JM
JM6mo ago
Indeed, 535.154.05 is 12.2
NERDDISCO
NERDDISCO6mo ago
Thanks for the support 🙏❤️
JM
JM6mo ago
Of course!