SOS: pod GPU errors
pod_id=usg9djjhmpjfpd
As you can see, the GPU is completely dead. We are hitting multiple errors like this, and it has happened several times. It started on 19 September.
I don’t see any errors on our end, but your memory usage is nearly full (24,205 MiB / 24,564 MiB). Could this be the issue? You might want to try a 48 GB GPU.
What's an "ERR!" then?
We actually run multiple instances of this configuration, and sometimes pods just "die" during their lifetime (it affects our flow control).
I checked the error logs on the server and didn’t find anything significant. The ERR! in nvidia-smi usually indicates it’s unable to monitor or report certain metrics, which could be due to a hardware issue or a temporary glitch. However, if you’re encountering this frequently, it’s unlikely that all of our GPUs have the same issue. It might be worth trying a higher-end GPU to see if the problem persists. I suspect the high memory usage could be a contributing factor.
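For anyone who wants to capture evidence the next time this happens, here is a minimal sketch of a check that could run inside the pod. It reads a few metrics through `nvidia-smi --query-gpu`, treats any field that does not parse as a number as an unreadable metric, and flags memory usage above a threshold. The function names (`parse_gpu_line`, `snapshot`) and the 95% threshold are assumptions made for illustration, and the sketch deliberately does not assume what exact text nvidia-smi prints for a failed metric in query mode.

```python
import subprocess
from datetime import datetime, timezone

# Fields requested from nvidia-smi; the first one (index) identifies the GPU.
QUERY = "index,memory.used,memory.total,temperature.gpu"


def parse_gpu_line(line, mem_threshold=0.95):
    """Parse one CSV row of nvidia-smi query output (hypothetical helper).

    mem_threshold is an assumed cutoff for "memory nearly full".
    """
    fields = [f.strip() for f in line.split(",")]
    index, raw = fields[0], fields[1:]

    values = []
    for field in raw:
        try:
            values.append(float(field))
        except ValueError:
            # Anything non-numeric is treated as a metric the driver
            # could not report.
            values.append(None)

    used = values[0] if len(values) > 0 else None
    total = values[1] if len(values) > 1 else None
    mem_ratio = used / total if used is not None and total else None

    return {
        "index": index,
        "unreadable": len(values) < 2 or any(v is None for v in values),
        "mem_ratio": mem_ratio,
        "mem_high": mem_ratio is not None and mem_ratio >= mem_threshold,
    }


def snapshot():
    """Run nvidia-smi once and return one timestamped dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=30,
    ).stdout
    stamp = datetime.now(timezone.utc).isoformat()
    return [
        dict(parse_gpu_line(line), timestamp=stamp)
        for line in out.splitlines() if line.strip()
    ]


if __name__ == "__main__":
    for row in snapshot():
        print(row)
```

Running this on a schedule and keeping the rows where `unreadable` or `mem_high` is true would give a UTC timestamp to post here alongside the pod ID.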
Alright, if we face the same issue again, I'll report it in this thread and do my best to lock the pod.
Sure thing, feel free to share the podId and timestamp here.