Error 804: forward compatibility was attempted on non supported HW
Writing to the online chat bounces my messages, even though I'm clearly connected.
What I wrote in those messages is actually the main issue I wanted to solve, but since I ran into the bouncing as well, I'm also submitting that.
It looks like a PyTorch issue
The chat messages bouncing or the issue written in the chat?
I mean your error message
It seems that way, but the usual cause is a version mismatch and the usual fix is a restart (which I obviously can't do): https://github.com/pytorch/pytorch/issues/40671, https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch/45319156#45319156
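For reference, a rough way to check for that kind of mismatch from inside the pod (just a sketch, assuming the standard Linux paths; the kernel module version should match what the userspace tools report):
# sketch: compare the loaded kernel module's driver version with what nvidia-smi / NVML reports;
# a mismatch here is the classic "Driver/library version mismatch" case from the links above
import re, subprocess
kernel_side = re.search(r"\d+\.\d+(\.\d+)?", open("/proc/driver/nvidia/version").read()).group(0)
user_side = subprocess.run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                           capture_output=True, text=True).stdout.strip()
print("kernel module:", kernel_side)
print("nvidia-smi:   ", user_side)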
You can use a PyTorch template from RunPod,
or if you know the CUDA version, you can filter GPU pods by CUDA version.
I did that; the problem is that the machines have different drivers. All of them report CUDA 12.2, but only some of them actually work.
Looking at the PyTorch issue, it really seems to be because some of the machines have outdated drivers (525.x vs. 535.x).
I can provide more information when I run into the issue again, but it's extremely weird that only the machines with older drivers exhibit this error; it suggests the problem isn't with the image I'm using.
Yeah, definitely something to flag to staff.
That is certainly weird/strange.
In that case I'll update the issue when I run into it again. Which information should I provide? I presume some way to identify the machine?
Yeah, I think a pod identifier, and you can stop the pod so you aren't burning money
and just @ one of the active staff
they are generally on US time
Thanks, will do!
Just ran into the same issue.
The pod ID is efm8o6l8qebm1y, and the nvidia-smi output is the following:
Pinging @Madiator2011 as you suggested. Can't pause the Pod (I think because I have a volume mounted).
The full message is the following:
MRE: PyTorch is installed with pip install torch torchaudio torchvision. I am using Python 3.10.13 installed with pyenv.
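The reproducer itself is essentially just importing torch and touching the GPU, roughly along these lines (a sketch, not the exact script):
# minimal sketch: any attempt to initialize CUDA is enough to surface the error on the affected machines
import torch
print(torch.__version__, torch.version.cuda)  # wheel version and the CUDA it was built against
print(torch.cuda.is_available())              # False on the affected machines, with a warning quoting Error 804
x = torch.zeros(1, device="cuda")             # then fails outright instead of allocating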
Probably an outdated PyTorch version.
@Madiator2011 this only seems to happen on certain machines though (more specifically certain 4090s).
The image has stayed the same.
New models had the same issues with the H100.
@TomS what output do you get from nvcc --version?
I don't seem to have that command available
what docker image are you using?
I am using Nvidia's nvidia/cuda:12.2.0-devel-ubuntu22.04 image.
Weird, nvcc usually comes with CUDA.
Probably just not on the PATH; you probably have to run something like /usr/local/cuda/bin/nvcc --version
You're right, my bad!
mind trying
Getting the same error.
The package versions are attached (just for completeness).
Also tried this for cu118 but no luck.
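(By that I mean reinstalling with the cu118 wheels, along the lines of pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118; still the same result.)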
Terminating the pod.
Was it community cloud or secure cloud?
Secure cloud.
(all of the ones I tried - both the non-affected 4090s and the affected ones)
Is it possible that the issue is with outdated drivers on certain machines, like the PyTorch GitHub issue suggests?
(some are 525.x and some 535.x, like I mentioned?)
@flash-singh / @Justin / @JM this is an issue for people using my templates as well. Can you do something to fix these broken drivers, please?
This issue is specific to the 4090s. They are more expensive than the 3090, A5000, etc., but their drivers are broken, making them completely unusable.
@TomS Could you provide me the pod ID of one of those machines that you are facing this problem?
The person having issues with my template was using a 4090 in RO.
@Finn do you have the pod id?
Also, to clarify, do you have a hard requirement with CUDA 12.2+?
My requirement is CUDA 12.1+
9gi3jqiqlts2ou
jvkvnd5uu2crj2
oonjzmqb2rw7qj
qswdrg5ltpr0v1
I tried with 4 different 4090s
It has the correct CUDA version, but also a 525.x driver like Tom's, not 535.x.
That's a CUDA 12.0 machine. Make sure to filter using the UI.
Tom also gave his pod id above: efm8o6l8qebm1y
Sorry, was gone for a bit. Yes, the specified ID was filtered to be 12.2.
How do I do that?
nvidia-smi correctly showed CUDA 12.2.
Yeah, this issue is happening on machines that have the correct CUDA version, as shown in the screenshot above.
It's in the filters at the top of the page.
- 9gi3jqiqlts2ou: 12.0
- jvkvnd5uu2crj2: 12.2
- oonjzmqb2rw7qj: 12.2
- qswdrg5ltpr0v1: 12.0
So 2 of these should work, but they don't 🤷‍♂️
Correct. But that's a start right?
CUDA is very good for backward compatibility but horrible for forward. The versions I provided are the CUDA versions installed on the bare-metal machines.
That being said, I see some are running on Ubuntu 20.04 and Ubuntu 22.04. Do you know if your image also has some kernel requirements? I know some require 5.15+ for example.
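If it helps, here is a quick sketch you can run inside the pod to see which kernel and distro the container actually reports, to compare against any such requirement:
# sketch: print the kernel and distro visible inside the container,
# to compare against any kernel requirement of the image (e.g. 5.15+)
import platform
print("kernel:", platform.release())  # e.g. 6.2.0-36-generic
with open("/etc/os-release") as f:
    print(next(line for line in f if line.startswith("PRETTY_NAME")).strip())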
This must be jvkvnd5uu2crj2, and it's 12.1, but it throws the "Error 804: forward compatibility was attempted on non supported HW" error.
If that's the case, that would be extremely valuable information.
My issue occurred on efm8o6l8qebm1y, which is flagged as 12.2 but whose drivers are older than those of other 12.2 machines where this issue didn't arise.
What info should I provide that will help debug this?
Yeah, I think you were onto something with the 525.x and 535.x. If I remember right, the 525.x branch only supports up to CUDA 12.0 while 535.x supports 12.2, which would explain the pattern.
- jvkvnd5uu2crj2: Ubuntu 22.04.3 LTS, 6.2.0-36-generic, CUDA 12.2
Oh, but you said it's 12.1 in your list of pod IDs?
Edited. Alright, let me investigate, that's weird
Yeah, it's also weird that when @Finn ran nvidia-smi it showed CUDA 12.1, and none of the pod IDs in your list has 12.1 🤷‍♂️
@JM let me know if you need anything that will help you debug; I can e.g. share the Dockerfile.
@TomS which region was your pod in?
@ashleyk EU-RO-1
That's most likely a driver + library mismatch
Finn's pods were also in EU-RO-1.
I will sort this out. Thanks a lot for uncovering this
We're unable to access any ports from the 5 GPUs we've now spun up
What can we do? Our real-time service is currently down.
Did you try 3090?
@Finn In your case, make sure to filter for your required CUDA version too! You have several not meeting your requirements.
Are you using network storage?
Log looks to be in a loop
Is this in CZ region?
yes
Idk what you mean
We have two and neither of them are working
Only the 4090 is working in CZ; the others are broken. I mentioned this to flash-singh but didn't get a response. @JM can you look into why the A5000 and 3090 are broken in CZ as well?
Do you need secure cloud specifically? If not, I suggest using an A5000 in community cloud in SK region.
I always use those and never have issues.
Sure, I can sort this out as well. What do you mean by "broken"?
See screenshot above from @Finn , gets into a loop and container doesn't start.
A few people had this issue today including me.
Pod ID, please?
I could start 4090 in CZ but not A5000 or 3090.
Host machine out of disk space or something probably
gmue2eh0wj8ybu
Have you tried US-OR and EU-IS for 4090s as well?
No, I was trying to reproduce the issue other people reported and ran into the same issue they did.
OK, thanks. Most likely a driver issue would be my guess, but those are being tested as we speak.
Looks like all the GPUs are broken
All? If that's the case there might be something bigger
I can't get a single one to work, even after filtering for 12.2
this is a mess
Pod id: y7yvgvzcaoeld1 (A5000)
Pod id: 9bcyhnm2hqpbme (3090)
Did you try A5000 in SK region in Community Cloud?
And is it working with other images, or do all images fail?
I can try
@Justin Could you give me a hand please? Let me know if you are available.
I haven't tried with other images
Isn't Community Cloud less reliable?
@TomS That image will only work on CUDA 12.2+ AND on a specific kernel of Ubuntu 22.04
It's not super compatible
@ashleyk in your case, which template were you using?
Trying now...
Supposedly, but Secure Cloud has been less reliable these days, with outages in CZ, SE, etc.
RunPod PyTorch 2.1
I am currently testing 4090 in US-OR-1 region as well.
That's not normal
What's not?
@Finn 4090 in US-OR-1 is fine.
All ports are up, even on the 1.9.3 image.
I believe @Finn's issue is different from yours.
He was not using the same image earlier, unless he's using a different one now.
I am helping @Finn , I found a solution for him
what's the solution?
use 4090 in US-OR-1 region
@Finn ^^ all ports working
Look good @ashleyk ?
Trying with OR
@TomS maybe you can try 4090 in US-OR-1 as well and see if it solves the issue for you too.
- Update: CZ cannot pull any image. Will sort this out.
Thanks, it's a different issue from the main one in this thread, but we ran into it when trying to use a different GPU type while trying to solve the main issue here.
This solved it!
RO is trash
That wasted us a few hours
can you guys please add some quality control? This is not the first time I've had issues with RO
It's really detrimental to our end service
Yep! Here are a few things:
- There might be a driver mismatch in RO (waiting on confirmation).
- Second, I previously saw an attempt to use a newer CUDA build with an older CUDA installation. Remember to use the filter if you have requirements!
- We will update everything on the platform to be 12.0+ in the next 2 months.
- Last thing: if you use Nvidia images, there can be a lot of requirements to make them compatible, including the kernel version. They are not plug-and-play everywhere.
Would be better to update to 12.1+ rather than 12.0+ because oobabooga now requires 12.1 minimum
Not all the pods you provided were broken. RO has been incredible so far, both in terms of deployment and speed of service. Average uptime is above 99.9%. Please take note of the above, as it's important for making sure your deployments are as smooth as possible.
As for CZ, DM me and I can provide some credits; this networking redesign has been quite challenging. We will sort this out asap.
We do those in batches to maintain availability, but we will be working toward 12.0+, then 12.1+, then even 12.2+.
By the way, CUDA 12.3 still hasn't been added to the filter.
There were 4 RO 4090s I tested
they had 12.2
none of these worked
not to mention the ones running 12.1
Something wrong here then
It only lists 12.0 and 12.2 and not 12.1
I got a 4090 in RO with a 535.x driver and CUDA 12.2 and it's fine:
Why is it wrong?
I am pulling that info from the DB.
Also, an update: we have uncovered the culprit. The Docker caching at the new location was the problem. We are fixing it.
Yep, same; the 12.2 ones on RO appear to be fine from what I have tested. I believe it might be an isolated issue with one or two servers.
Were those the pod IDs you provided earlier? Because only 2 had 12.2
@ashleyk @Finn Should be solved in CZ.
The goal for Q1-Q2 of this year is to have pristine, state-of-the-art standards. Keep us updated on anything you find, and we can knock it out.
So if I understand this thread correctly, we can only use CUDA <= 12.2 right now, or?
No, that's not correct; it depends on the template you are using. oobabooga requires 12.1 or higher.
But the main issue in this thread is that there are 4090s in EU-RO-1 with broken drivers.
oh ok sorry, then I will open a new one
And if we uncheck EU-RO, the 4090s are unavailable on Serverless. @JM, when you have a solution, can you post it in general information whether we need to adapt the Docker image or add a serverless parameter to check the driver?
@Pierre Nicolas hey! Actually, the CUDA filter is out now!
Did you guys notice that yet?
Depending on what Docker image you use, it might be good practice to select 12.0+, or even 12.1+.
OMG YASSS THANK U!!!
OK, thank you, we'll try it tomorrow.
I don't see the ability to filter CUDA versions under advanced settings in Serverless.
Mine still shows this
Ah damn, didn't realize that it was a private beta release. Expect this feature very, very soon; that means it's being tested!
OK, coming soon!
Thanks for doing this! Looking forward to trying it out!