Error 804: forward compatibility was attempted on non supported HW
Writing to the online chat bounces my messages, even though I'm clearly connected.
What I wrote in those messages is actually the main issue I wanted to solve, but since I ran into the bouncing as well, I'm also submitting that.
It looks like a PyTorch issue
The chat messages bouncing or the issue written in the chat?
I mean your error message
It seems that way, but the usual cause is a version mismatch and the usual fix is a restart (which I obviously can't do): https://github.com/pytorch/pytorch/issues/40671, https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch/45319156#45319156
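For reference, a rough way to check for that kind of mismatch from inside the pod (just a sketch, assuming the standard Linux paths; the kernel module version should match what the userspace tools report):
# sketch: compare the loaded kernel module's driver version with what nvidia-smi / NVML reports;
# a mismatch here is the classic "Driver/library version mismatch" case from the links above
import re, subprocess
kernel_side = re.search(r"\d+\.\d+(\.\d+)?", open("/proc/driver/nvidia/version").read()).group(0)
user_side = subprocess.run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                           capture_output=True, text=True).stdout.strip()
print("kernel module:", kernel_side)
print("nvidia-smi:   ", user_side)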
You can use a PyTorch template from RunPod,
or if you know the CUDA version, you can filter GPU pods by CUDA version.
I did that; the problem is that the machines have different drivers. All of them report CUDA 12.2, but only some of them actually work.
Looking at the PyTorch issue, it really seems to be because some of the machines have outdated drivers (525.x vs. 535.x).
I can provide more information when I run into the issue again, but it's extremely weird that only the machines with older drivers exhibit this error; it suggests the problem isn't with the image I'm using.
Yeah, definitely something to flag to staff.
That is certainly weird/strange.
In that case I'll update the issue when I run into it again. Which information should I provide? I presume some way to identify the machine?
Yeah, I think a pod identifier, and you can stop the pod so you aren't burning money
and just @ one of the active staff
they are generally on US time
Thanks, will do!
Just ran into the same issue.
The pod ID is efm8o6l8qebm1y, and the nvidia-smi output is the following:
Pinging @Madiator2011 as you suggested. Can't pause the Pod (I think because I have a volume mounted).
The full message is the following:
MRE: PyTorch is installed with pip install torch torchaudio torchvision. I am using Python 3.10.13 installed with pyenv.
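The reproducer itself is essentially just importing torch and touching the GPU, roughly along these lines (a sketch, not the exact script):
# minimal sketch: any attempt to initialize CUDA is enough to surface the error on the affected machines
import torch
print(torch.__version__, torch.version.cuda)  # wheel version and the CUDA it was built against
print(torch.cuda.is_available())              # False on the affected machines, with a warning quoting Error 804
x = torch.zeros(1, device="cuda")             # then fails outright instead of allocating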
Probably an outdated PyTorch version.
@Madiator2011 this only seems to happen on certain machines though (more specifically certain 4090s).
The image has stayed the same.
New models had the same issues with the H100.
@TomS what output do you get from nvcc --version?
I don't seem to have that command available
what docker image are you using?
I am using Nvidia's nvidia/cuda:12.2.0-devel-ubuntu22.04 image.
Weird, nvcc usually comes with CUDA.
Probably just not on the PATH; you probably have to run something like /usr/local/cuda/bin/nvcc --version
You're right, my bad!
mind trying
Getting the same error.
The package versions are attached (just for completeness).
Also tried this for cu118 but no luck.
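(By that I mean reinstalling with the cu118 wheels, along the lines of pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118; still the same result.)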
Terminating the pod.
Was it community cloud or secure cloud?
Secure cloud.
(all of the ones I tried - both the non-affected 4090s and the affected ones)
Is it possible that the issue is with outdated drivers on certain machines, like the PyTorch GitHub issue suggests?
(some are 525.x and some 535.x, like I mentioned?)
@flash-singh / @Justin / @JM this is an issue for people using my templates as well. Can you do something to fix these broken drivers, please?
This issue is specific to the 4090s. They are more expensive than the 3090, A5000, etc., but their drivers are broken, making them completely unusable.
@TomS Could you provide me the pod ID of one of those machines that you are facing this problem?
The person having issues with my template was using a 4090 in RO.
@Finn do you have the pod id?
Also, to clarify, do you have a hard requirement with CUDA 12.2+?
My requirement is CUDA 12.1+
9gi3jqiqlts2ou
jvkvnd5uu2crj2
oonjzmqb2rw7qj
qswdrg5ltpr0v1
I tried with 4 different 4090s
It has the correct CUDA version, but also a 525.x driver like Tom's, not 535.x.
That's a CUDA 12.0 machine. Make sure to filter using the UI.
Tom also gave his pod id above: efm8o6l8qebm1y
Sorry, was gone for a bit. Yes, the specified ID was filtered to be 12.2.
How do I do that?
nvidia-smi correctly showed CUDA 12.2.
Yeah, this issue is happening on machines that have the correct CUDA version, as shown in the screenshot above.
It's in the filters at the top of the page.
- 9gi3jqiqlts2ou: 12.0
- jvkvnd5uu2crj2: 12.2
- oonjzmqb2rw7qj: 12.2
- qswdrg5ltpr0v1: 12.0
So 2 of these should work, but they don't 🤷‍♂️
Correct. But that's a start right?
CUDA is very good for backward compatibility but horrible for forward. The versions I provided are the CUDA versions installed on the bare-metal machines.
That being said, I see some are running on Ubuntu 20.04 and Ubuntu 22.04. Do you know if your image also has some kernel requirements? I know some require 5.15+ for example.
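If it helps, here is a quick sketch you can run inside the pod to see which kernel and distro the container actually reports, to compare against any such requirement:
# sketch: print the kernel and distro visible inside the container,
# to compare against any kernel requirement of the image (e.g. 5.15+)
import platform
print("kernel:", platform.release())  # e.g. 6.2.0-36-generic
with open("/etc/os-release") as f:
    print(next(line for line in f if line.startswith("PRETTY_NAME")).strip())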
This must be jvkvnd5uu2crj2, and it's 12.1, but it throws the "Error 804: forward compatibility was attempted on non supported HW" error.
If that's the case, that would be extremely valuable information.
My issue occurred on efm8o6l8qebm1y, which is flagged as 12.2 but whose drivers are older than those of other 12.2 machines where this issue didn't arise.
What info should I provide that will help debug this?
Yeah, I think you were onto something with the 525.x and 535.x. If I remember right, the 525.x branch only supports up to CUDA 12.0 while 535.x supports 12.2, which would explain the pattern.
- jvkvnd5uu2crj2: Ubuntu 22.04.3 LTS, 6.2.0-36-generic, CUDA 12.2
Oh, but you said it's 12.1 in your list of pod IDs?
Edited. Alright, let me investigate, that's weird
Yeah, it's also weird that when @Finn ran nvidia-smi it showed CUDA 12.1, and none of the pod IDs in your list has 12.1 🤷‍♂️
@JM let me know if you need anything that will help you debug; I can e.g. share the Dockerfile.
@TomS which region was your pod in?
@ashleyk EU-RO-1
That's most likely a driver + library mismatch
Finn's pods were also in EU-RO-1.
I will sort this out. Thanks a lot for uncovering this
We're unable to access any ports from the 5 GPUs we've now spun up
What can we do? Our real-time service is currently down.
Did you try 3090?
@Finn In your case, make sure to filter for your required CUDA version too! You have several not meeting your requirements.
Are you using network storage?
Log looks to be in a loop
Is this in CZ region?
yes
Idk what you mean
We have two and neither of them are working
Only the 4090 is working in CZ; the others are broken. I mentioned this to flash-singh but didn't get a response. @JM can you look into why the A5000 and 3090 are broken in CZ as well?
Do you need secure cloud specifically? If not, I suggest using an A5000 in community cloud in SK region.
I always use those and never have issues.
Sure, I can sort this out as well. What do you mean by "broken"?
See screenshot above from @Finn , gets into a loop and container doesn't start.
A few people had this issue today including me.
Pod ID, please?
I could start 4090 in CZ but not A5000 or 3090.
Host machine out of disk space or something probably
gmue2eh0wj8ybu
Have you tried US-OR and EU-IS for 4090s as well?
No, I was trying to reproduce the issue other people reported and ran into the same issue they did.
OK, thanks. Most likely a driver issue would be my guess, but those are being tested as we speak.
Looks like all the GPUs are broken
All? If that's the case there might be something bigger
I can't get a single one to work, even after filtering for 12.2
this is a mess
Pod id: y7yvgvzcaoeld1 (A5000)
Pod id: 9bcyhnm2hqpbme (3090)
Did you try A5000 in SK region in Community Cloud?
And is it working with other images, or do all images fail?
I can try
@Justin Could you give me a hand please? Let me know if you are available.
I haven't tried with other images
Isn't Community Cloud less reliable?
@TomS That image will only work on CUDA 12.2+ AND on a specific kernel of Ubuntu 22.04
It's not super compatible
@ashleyk in your case, which template were you using?
Trying now...
Supposedly, but Secure Cloud has been less reliable these days, with outages in CZ, SE, etc.
RunPod PyTorch 2.1
I am currently testing 4090 in US-OR-1 region as well.
That's not normal
What's not?
@Finn 4090 in US-OR-1 is fine.
All ports are up, even on the 1.9.3 image.
I believe @Finn's issue is different from yours.
He was not using the same image earlier, unless he's using a different one now.
I am helping @Finn , I found a solution for him
what's the solution?
use 4090 in US-OR-1 region
@Finn ^^ all ports working
Look good @ashleyk ?
Trying with OR
@TomS maybe you can try 4090 in US-OR-1 as well and see if it solves the issue for you too.
- Update: CZ cannot pull any image. Will sort this out.
Thanks, it's a different issue from the main one in this thread, but we ran into it when trying to use a different GPU type while trying to solve the main issue here.
This solved it!
RO is trash
That wasted us a few hours
can you guys please add some quality control? This is not the first time I've had issues with RO
It's really detrimental to our end service
Yep! Here are a few things:
- There might be a driver mismatch in RO (waiting on confirmation).
- Second, I previously saw an attempt to use a newer CUDA build with an older CUDA installation. Remember to use the filter if you have requirements!
- We will update everything on the platform to be 12.0+ in the next 2 months.
- Last thing: if you use Nvidia images, there can be a lot of requirements to make them compatible, including the kernel version. They are not plug-and-play everywhere.
Would be better to update to 12.1+ rather than 12.0+ because oobabooga now requires 12.1 minimum
Not all the pods you provided were broken. RO has been incredible so far, both in terms of deployment and speed of service. Average uptime is above 99.9%. Please take note of the above, as it's important for making sure your deployments are as smooth as possible.
As for CZ, DM me and I can provide some credits; this networking redesign has been quite challenging. We will sort this out asap.
We do those in batches to maintain availability, but we will be working toward 12.0+, then 12.1+, then even 12.2+.
By the way, CUDA 12.3 still hasn't been added to the filter.
There were 4 RO 4090s I tested
they had 12.2
none of these worked
not to mention the ones running 12.1
Something wrong here then
It only lists 12.0 and 12.2 and not 12.1
I got a 4090 in RO with a 535.x driver and CUDA 12.2 and it's fine:
Why is it wrong?
I am pulling that info from the DB.
Also, an update: we have uncovered the culprit. The Docker caching at the new location was the problem. We are fixing it.
Yep, same; the 12.2 ones on RO appear to be fine from what I have tested. I believe it might be an isolated issue with one or two servers.
Were those the pod IDs you provided earlier? Because only 2 had 12.2
@ashleyk @Finn Should be solved in CZ.
The goal for Q1-Q2 of this year is to have pristine, state-of-the-art standards. Keep us updated on anything you find, and we can knock it out.
So if I understand this thread correctly, we can only use CUDA <= 12.2 right now, or?
No, that's not correct; it depends on the template you are using. oobabooga requires 12.1 or higher.
But the main issue in this thread is that there are 4090s in EU-RO-1 with broken drivers.
oh ok sorry, then I will open a new one
And if we uncheck EU-RO, the 4090s are unavailable on Serverless. @JM, when you have a solution, can you post it in general information whether we need to adapt the Docker image or add a serverless parameter to check the driver?
@Pierre Nicolas hey! Actually, the CUDA filter is out now!
Did you guys notice that yet?
Depending on what Docker image you use, it might be good practice to select 12.0+, or even 12.1+.
OMG YASSS THANK U!!!
OK, thank you, we'll try it tomorrow.
I don't see the ability to filter CUDA versions under advanced settings in Serverless.
Mine still shows this
Ah damn, didn't realize that it was a private beta release. Expect this feature very, very soon; that means it's being tested!
OK, coming soon!
Thanks for doing this! Looking forward to trying it out!