GPU errored, machine dead

2024-09-04T11:12:09Z stop container
2024-09-04T11:12:44Z remove container
2024-09-04T11:12:51Z create container runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
2024-09-04T11:12:52Z 2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 Pulling from runpod/pytorch
2024-09-04T11:12:52Z Digest: sha256:a931abe272a5156aab1b4fd52a6d3c599a5bf283b6e6d11d1765336e22b1037c
2024-09-04T11:12:52Z Status: Image is up to date for runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
2024-09-04T11:12:52Z error creating container: nvidia-smi: exit status 255
---------stdout------
Unable to determine the device handle for GPU0000:04:00.0: Unknown Error
---------stderr------
7 Replies
pseudoterminalx
ID: 3m0ljx07puspok
nerdylive · 3w ago
Did you just start that container? Try redeploying it, it might be a bad GPU pod
Poddy · 3w ago
@pseudoterminalx
Escalated To Zendesk
The thread has been escalated to Zendesk!
nevermind · 3w ago
Why are these pods exposed to users? 🤯 It would be such an easy task for RunPod to detect a broken GPU, but they've just ignored this issue for like 3 months
nerdylive · 3w ago
Maybe it isn't, I'm not sure. Open a ticket if you want to hear from them
nevermind · 3w ago
Our practice is to run a short CUDA test (like querying device statistics) before using a machine. I think it would enhance DX if they did this on their side. Maybe I should bring it up in the feedback channel
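For context, a minimal sketch of that kind of pre-flight check, assuming Python with PyTorch available in the pod image (the function name and the exact nvidia-smi query are illustrative, not RunPod's actual check):

```python
import subprocess
import sys

import torch


def gpu_health_check() -> bool:
    """Return True if the GPU looks usable, False otherwise (illustrative sketch)."""
    # 1. nvidia-smi should exit 0; the broken pod above returned exit status 255.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"nvidia-smi failed ({result.returncode}): {result.stderr.strip()}")
        return False

    # 2. CUDA must be visible to the framework.
    if not torch.cuda.is_available():
        print("torch.cuda.is_available() returned False")
        return False

    # 3. Run one tiny computation to catch GPUs that enumerate but cannot execute kernels.
    try:
        x = torch.ones(1024, device="cuda")
        assert float(x.sum()) == 1024.0
    except Exception as exc:
        print(f"CUDA smoke test failed: {exc}")
        return False

    return True


if __name__ == "__main__":
    sys.exit(0 if gpu_health_check() else 1)
```

A failing pod like the one in the log above would be caught at step 1, since nvidia-smi already exits non-zero before any container is scheduled.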
nerdylive · 3w ago
Yeah, if they could run a short, quick poll every x minutes it'd be better. I'm not sure if that's the case for this problem, but feel free to write it on #🧐|feedback
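A hedged sketch of what such a periodic poll could look like on the host side, reusing the gpu_health_check sketch above; the interval and the cordon_node() hook are assumptions for illustration, not RunPod's actual mechanism:

```python
import time

from gpu_check import gpu_health_check  # the sketch above, saved as gpu_check.py


def cordon_node() -> None:
    """Hypothetical hook: mark this machine unschedulable so no new pods land on it."""
    print("GPU unhealthy: node should be pulled from the pool")


def poll_forever(interval_seconds: int = 300) -> None:
    """Run the health check every interval_seconds and cordon the node on failure."""
    while True:
        if not gpu_health_check():
            cordon_node()
            break
        time.sleep(interval_seconds)


if __name__ == "__main__":
    poll_forever()
```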