SOS: pod GPU errors

pod_id=usg9djjhmpjfpd
# nvidia-smi
Sat Sep 21 14:48:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:A1:00.0 Off | Off |
|ERR! 36C P0 48W / 450W | 24205MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
As you can see, the GPU is completely dead. We are facing multiple errors like this and have hit it several times; it started on 19 September.
yhlong00000 (2mo ago)
I don’t see any errors on our end, but your memory usage is nearly full (24,205 MiB / 24,564 MiB). Could this be the issue? You might want to try a 48GB GPU
nevermind (2mo ago)
What's the "ERR!" then? We actually run multiple instances of this configuration, and sometimes pods just "die" during their lifetime (it affects our flow control).
yhlong00000 (2mo ago)
I checked the error logs on the server and didn’t find anything significant. The ERR! in nvidia-smi usually indicates it’s unable to monitor or report certain metrics, which could be due to a hardware issue or a temporary glitch. However, if you’re encountering this frequently, it’s unlikely that all of our GPUs have the same issue. It might be worth trying a higher-end GPU to see if the problem persists. I suspect the high memory usage could be a contributing factor.
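For anyone hitting the same thing later: nvidia-smi sits on top of NVML, and a field that prints ERR! typically corresponds to an NVML sensor query that fails. Below is a minimal probe sketch using the nvidia-ml-py bindings (pip install nvidia-ml-py); the set of checks is just an illustration, not RunPod tooling.

```python
# Minimal NVML probe (pip install nvidia-ml-py). Each query below maps to a column
# in nvidia-smi; an ERR!/N/A field there generally shows up here as an NVMLError.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, as in the paste above

checks = {
    "fan.speed": lambda: pynvml.nvmlDeviceGetFanSpeed(handle),
    "temperature.gpu": lambda: pynvml.nvmlDeviceGetTemperature(
        handle, pynvml.NVML_TEMPERATURE_GPU
    ),
    "utilization.gpu": lambda: pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
    "memory.used (MiB)": lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used // (1024 * 1024),
}

for name, probe in checks.items():
    try:
        print(f"{name}: {probe()}")
    except pynvml.NVMLError as err:
        # This is the programmatic equivalent of the ERR! field in nvidia-smi.
        print(f"{name}: query failed ({err})")

pynvml.nvmlShutdown()
```

If the fan/temperature queries fail while the memory query still answers, that matches the paste above and would point at a sensor or driver fault rather than memory pressure.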
nevermind (2mo ago)
Alright, if we face the same issue again, I'll report it in this thread and try my best to lock the pod.
yhlong00000 (2mo ago)
Sure thing, feel free to share the podId and timestamp here.
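In case it helps with reporting: here is a hedged sketch of how the podId and timestamp could be captured automatically the moment the sensors stop responding. RUNPOD_POD_ID is an assumption about the pod's environment; adjust it to however your pods expose their ID.

```python
# Hedged watchdog sketch: poll NVML and record pod ID + UTC timestamp the moment a
# basic sensor query starts failing. RUNPOD_POD_ID is assumed to be set in the pod's
# environment; substitute however your pods expose their ID.
import datetime
import os
import time

import pynvml

POLL_SECONDS = 60  # illustrative polling interval


def gpu_healthy() -> bool:
    """Return False when the fan/temperature sensors can no longer be read."""
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        pynvml.nvmlDeviceGetFanSpeed(handle)
        pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        return True
    except pynvml.NVMLError:
        return False
    finally:
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass


if __name__ == "__main__":
    while gpu_healthy():
        time.sleep(POLL_SECONDS)
    pod_id = os.environ.get("RUNPOD_POD_ID", "unknown")
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    print(f"GPU sensors unreadable on pod {pod_id} at {stamp}")  # the two values to report
```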