CUDA 12.3 support
I created a template with a custom image (based on runpod/containers) to run CUDA 12.3, but when I use PyTorch 2.1.2 + Python 3.10, it tells me that it's not working.
The same docker image works locally on my machine, so I assume this is something on your side or am I wrong?
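For context, a minimal sketch of the kind of check that would surface this inside the pod (assuming torch is importable in the image; the availability result being False is my guess at the symptom):
```python
import torch

print(torch.__version__)          # e.g. 2.1.2
print(torch.version.cuda)         # CUDA toolkit this torch build targets
print(torch.cuda.is_available())  # False would be the "not working" symptom
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```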
Does your pod actually have CUDA 12.3 when you run nvidia-smi?
Probably 12.1 or 12.2 and not 12.3
Yes, it does
And RunPod still haven't added CUDA 12.3 to the CUDA filter dropdown 😱
Oh that's weird
I wonder whether the hosts that were upgraded to 12.3 were actually rebooted when they were upgraded.
Which region is your pod in?
EU-RO-1
on a 4090
We were also getting this issue on the 4090s in EU-RO-1 in the other thread.
https://discord.com/channels/912829806415085598/1195065705486360576
yeah exactly, I was just searching on Discord for this problem, this is why I found the other thread 😄
as I couldn't explain why this setup wouldn't work
Have you tried another region to see if you still get the error? Seems there are a few broken 4090s in RO
oh no, I haven't done that yet
Some 4090s in RO are fine, it's basically like flipping a coin to determine whether you get a good one or a bad one
oh nice 😄
RunPod Roulette
drivers issues most likely
so something that I can't fix myself right?
Yeah @Madiator2011 seems to be driver issues, @JM is looking into it for us.
Unfortunately not.
ok perfect, then I'll sit back and wait
You can try US-OR-1 if you don't need network storage.
Oh my bad, US-OR-1 only has CUDA 12.2, not 12.3
@JM we really need CUDA 12.3 to be added to the CUDA filter dropdown.
Shouldn't be hard or take a lot of time to add it.
Good morning both! @ashleyk @NERDDISCO CUDA 12.3 requires a beta driver on Linux; we use production drivers (12.2 atm). CUDA 12.3 = 545+ drivers, which I wouldn't trust for production at the moment.
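A rough sketch of the driver-to-CUDA mapping being described here; the minimum Linux driver versions are taken from NVIDIA's CUDA release notes, so treat them as approximate:
```python
# Approximate minimum Linux driver per CUDA release (NVIDIA release notes).
MIN_DRIVER_FOR_CUDA = {
    (12, 0): (525, 60),
    (12, 1): (530, 30),
    (12, 2): (535, 54),
    (12, 3): (545, 23),  # the 545+ branch mentioned above
}

def max_cuda_for_driver(major: int, minor: int):
    """Newest CUDA release a given driver branch can run."""
    supported = [cuda for cuda, drv in MIN_DRIVER_FOR_CUDA.items()
                 if (major, minor) >= drv]
    return max(supported) if supported else None

print(max_cuda_for_driver(535, 154))  # -> (12, 2), the production driver here
print(max_cuda_for_driver(545, 29))   # -> (12, 3)
```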
I am curious to know how we can better support you though. Why did you specifically need 12.3?
I wanted to test whether my pipeline is faster when using the latest CUDA, as I don't have a 4090 locally. And for my real-time app, every ms I can shave off counts. And if it's actually that good, I need a provider to actually give me the 4090 with CUDA 12.3.
but I can totally understand that this is not a use case for you. I will try to find another way for this
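A minimal sketch of how such a per-millisecond comparison could be timed with CUDA events; `run_pipeline` is a hypothetical stand-in for the actual real-time workload:
```python
import torch

def benchmark(run_pipeline, iters: int = 100) -> float:
    """Average milliseconds per iteration, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):           # warm-up: prime kernels and caches
        run_pipeline()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        run_pipeline()
    end.record()
    torch.cuda.synchronize()      # let the GPU finish before reading timers
    return start.elapsed_time(end) / iters
```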
By the way, if it's the case that you don't use beta drivers for production, then why did @NERDDISCO get this when he ran nvidia-smi on his pod in EU-RO-1?
By leveraging Docker containers, we are able to achieve those lightning-fast cold start times and have that amount of flexibility for deployment.
The downside is that the Docker images rely on the bare-metal infrastructure below them. If the host has 12.2, containers won't be able to run anything newer than 12.2.
When running nvidia-smi inside a Docker container, you get the CUDA installed by your container, not the bare-metal one 🙂
I always get the bare-metal one
And that's why, even though it says 12.3, it doesn't work
Oh really
that's weird
Yep
@NERDDISCO Could you provide me with your pod ID?
I can tell you which Cuda is installed
I am surprised, because you can run an 11.8 template, for example, and 11.8 shows
So for example, my pod ID s53c7bzefygmvt is running this Docker image:
So based on what you are saying, nvidia-smi should say CUDA 11.8, but it shows CUDA 12.2:
Which I am 99.999999999999999999% confident is the CUDA version of the host machine.
This is what shows me the version of the container:
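A scripted version of that comparison, as a sketch assuming torch and nvidia-smi are both available inside the container:
```python
import subprocess

import torch

# Toolkit baked into the container / torch build:
print("container CUDA toolkit:", torch.version.cuda)

# Driver version, which comes from the host kernel driver and therefore
# does not change with the Docker image:
smi = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("host driver:", smi.stdout.strip())
```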
ah sorry, it is already terminated. I will start one up later
Ah, you're right @ashleyk. Reminder that I have the utmost dedication to improve things, but you're still more knowledgeable than me, considering that's not my area of expertise. That's helpful to know though!
No worries thanks 🙂
@ashleyk for NERDDISCO I am confused as to why they are getting 12.3 haha
Maybe a new deployment of 4090s was done in RO and it installed the latest version instead of capping it at 12.2?
Let me validate that
@ashleyk Of all the ones there, there are a total of 2 servers in RO with CUDA 12.3. Those were probably for a PoC. That's like less than 1% of servers.
Ah yeah I noticed, I tried to get one with 12.3 but failed 😆
Seems to be mostly a mix of 12.0 and 12.2
ok I'm back and will try to reproduce the issue @JM
so I created a new pod with my CUDA 12.3 image and this is the result
@ashleyk @JM which command should I use to find out the underlying version? As I understand it, you said it shouldn't be possible to have a CUDA 12.3 machine, right?
I guess I see this in the CLI because I'm using nvidia/cuda:12.3.1-devel-ubuntu22.04 as the base image?
ID is pxaoil4kxl6j9k
so inside my container I can of course have any version of CUDA I want, right? But whether this is supported by the actual server is the question, and this is why I get the error with PyTorch, or?
Checked this pod. Confirming that this is CUDA 12.2.
Your nvidia-smi probably doesn't give the underlying version because it's not compatible with the one you are attempting to run. That would be my hypothesis.
Try this command:
/usr/local/cuda/bin/nvcc --version
/usr/local/cuda/bin/nvcc --version is what I used
I guess so too
Would it be an option to provide users with the information that CUDA 12.3 (or any other version in the future) is still in beta and that they use it at their own risk? Because in the end it should be the user's decision if they want to use the latest shit, right?
or is it just not possible to have this fine-grained control, as the server itself has to use the latest version of CUDA to be able to support all the versions in the Docker containers? Meaning that the server itself would need to run CUDA 12.3, even if the user chose CUDA 12.1 in their pod?
Well, I am fine with users attempting to use any version they want. The problem I see though is that even if they attempt to, they won't find much, unfortunately
Exactly
The bare-metal driver version is the limiter. E.g., if the server has CUDA 12.0 installed, that's the latest version that will be able to run
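A sketch of querying that ceiling from inside a pod via NVML, assuming the nvidia-ml-py bindings are installed:
```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()    # e.g. "535.154.05"
cuda = pynvml.nvmlSystemGetCudaDriverVersion()  # e.g. 12020 for CUDA 12.2
print(f"driver {driver} supports up to CUDA {cuda // 1000}.{cuda % 1000 // 10}")
pynvml.nvmlShutdown()
```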
yeah I see, makes sense that you don't want to have the bare-metal driver on something that is BETA
where did you get this screenshot from?
I would love to monitor when CUDA 12.3 becomes stable
Oh, that is the nvidia website haha
Of course, let me provide the link to you
@NERDDISCO
https://www.nvidia.com/download/find.aspx
Awesome, thanks! And you're saying that CUDA 12.3 runs only with beta drivers, right?
And that is the concern here correct?
Indeed, 535.154.05 is 12.2
Thanks for the support 🙏❤️
Of course!