RunPod12mo ago
Ercan

[Urgent] One GPU suddenly went away

Hi, we have a prod issue right now: one of the GPUs in our pod suddenly disappeared
41 Replies
Superintendent
Superintendent12mo ago
fall off the bus?
Ercan
ErcanOP12mo ago
@Justin Can someone help and check our pod?
Superintendent
Superintendent12mo ago
lspci | grep VGA should spit out something about the gpu. also can u run nvidia-smi and show the output?
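Editor's sketch: the nvidia-smi half of this check can be scripted so a missing GPU is flagged without reading the table by eye. It assumes the pod is supposed to have 2 GPUs (the thread later mentions being charged for 2). EXPECTED_GPUS, count_gpus and check_gpus are hypothetical names, not RunPod tooling. It relies on nvidia-smi -L, which prints one "GPU N: ..." line per device the driver can see.

```shell
#!/bin/sh
# Assumption: the pod is supposed to have this many GPUs.
EXPECTED_GPUS="${EXPECTED_GPUS:-2}"

# Count "GPU N: ..." lines (the `nvidia-smi -L` format) read from stdin.
count_gpus() {
  grep -cE '^GPU [0-9]+:' || true
}

# Compare what the driver reports against the expected count.
check_gpus() {
  seen="$(nvidia-smi -L 2>/dev/null | count_gpus)"
  if [ "$seen" -lt "$EXPECTED_GPUS" ]; then
    echo "WARNING: driver sees $seen of $EXPECTED_GPUS GPUs"
    return 1
  fi
  echo "OK: driver sees $seen GPUs"
}

# Usage:
#   check_gpus || echo "a GPU is missing"
```

On the PCI side, lspci | grep -i nvidia may be a slightly wider net than grep VGA, since some NVIDIA cards enumerate as "3D controller" rather than "VGA compatible controller".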
Ercan
ErcanOP12mo ago
yes, here, the first gpu went missing
[image attached]
Superintendent
Superintendent12mo ago
so what does lspci | grep VGA spit out?
Ercan
ErcanOP12mo ago
can't run it, command not found. not sure what to install?
Superintendent
Superintendent12mo ago
wdym, lspci isn't there?
Ercan
ErcanOP12mo ago
No
Superintendent
Superintendent12mo ago
run lspci without grep
Ercan
ErcanOP12mo ago
[image attached]
Superintendent
Superintendent12mo ago
lspci
you're missing an i
lspci
Ercan
ErcanOP12mo ago
[image attached]
Superintendent
Superintendent12mo ago
whar.
Ercan
ErcanOP12mo ago
isn't this for amd?
Superintendent
Superintendent12mo ago
wdym, you're on nvidia gpus so yea it should work
Ercan
ErcanOP12mo ago
yea, that is what I am double checking, whether this should work with nvidia
Superintendent
Superintendent12mo ago
sudo apt-get update
sudo apt-get install pciutils
try that, i think it's something to do with pciutils
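Editor's sketch: on a Debian/Ubuntu-based image, lspci comes from the pciutils package. Since sudo turns out not to be usable inside the pod (see further down the thread), the version below only uses sudo when the shell is not already root. as_root and install_lspci are hypothetical helper names, and the "already root" assumption may not hold for every image.

```shell
#!/bin/sh
# Run a command directly when already root, or through sudo when it exists.
# Assumption: inside a pod the shell is usually root and sudo is absent.
as_root() {
  if [ "$(id -u)" -eq 0 ]; then
    "$@"
  elif command -v sudo >/dev/null 2>&1; then
    sudo "$@"
  else
    echo "need root to run: $*" >&2
    return 1
  fi
}

# Install pciutils (which provides lspci) only if lspci is not there yet.
install_lspci() {
  command -v lspci >/dev/null 2>&1 && return 0
  as_root apt-get update && as_root apt-get install -y pciutils
}

# Usage:
#   install_lspci && lspci -d 10de:    # 10de is NVIDIA's PCI vendor ID
```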
Ercan
ErcanOP12mo ago
[image attached]
Ercan
ErcanOP12mo ago
worked now
Superintendent
Superintendent12mo ago
what does dmesg spit out?
Ercan
ErcanOP12mo ago
dmesg: read kernel buffer failed: Operation not permitted
Superintendent
Superintendent12mo ago
sudo !! (Wtf?)
Ercan
ErcanOP12mo ago
can't run sudo on Pods
Superintendent
Superintendent12mo ago
oh right, docker container. uhh.
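Editor's note: "Operation not permitted" is what dmesg typically returns inside an unprivileged container, where reading the kernel log is restricted. If someone with host access can read the log, NVIDIA driver faults show up as "NVRM: Xid" lines, and Xid 79 is the code the driver logs when a GPU has fallen off the bus. A minimal filter, assuming kernel-log text is piped in; find_xid is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only NVIDIA driver Xid reports from kernel-log text on stdin.
# Xid 79 is the code logged when a GPU has fallen off the bus.
find_xid() {
  grep -E 'NVRM: Xid' || true
}

# Usage, on a machine where the kernel log is readable
# (typically the host, not the pod):
#   dmesg | find_xid
```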
Ercan
ErcanOP12mo ago
yea
Superintendent
Superintendent12mo ago
i have no friggin clue, can u try to restart it?
Ercan
ErcanOP12mo ago
will try, but I moved all production processes to the running gpu, so if I restart I need to run a bunch of things again. I am just waiting, maybe it comes back
Superintendent
Superintendent12mo ago
oof. is the 4090 ok with the load?
Ercan
ErcanOP12mo ago
we have an internal queue, set it to 1 right now. also just realized this is community cloud
Superintendent
Superintendent12mo ago
yes
Ercan
ErcanOP12mo ago
thought this was on secure cloud
Superintendent
Superintendent12mo ago
i personally haven't had any issues on community tbh. is it sd that ur running?
Ercan
ErcanOP12mo ago
bunch of things including sd, llm, and more models
Superintendent
Superintendent12mo ago
oh nice, u can cram all that into 24gb of vram? or really 48
Ercan
ErcanOP12mo ago
Yes, it is really enough to handle e.g. 100 active users at a time
Superintendent
Superintendent12mo ago
did the restart fix anything?
Ercan
ErcanOP12mo ago
I am working on it. Before that I need to set up a few scripts to get production running after restarts
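Editor's sketch of what such a restart script might start with: wait until the driver reports the expected number of GPUs before launching anything. The thread does not show Ercan's actual scripts, so this is purely illustrative. gpu_count, wait_for_gpus and start_workers.sh are hypothetical names, and the retry defaults are arbitrary.

```shell
#!/bin/sh
# Default probe: number of GPUs the driver lists right now.
gpu_count() {
  nvidia-smi -L 2>/dev/null | grep -cE '^GPU [0-9]+:' || true
}

# wait_for_gpus EXPECTED [TRIES] [SLEEP_SECONDS] [PROBE_COMMAND]
# Polls the probe until it reports at least EXPECTED GPUs or TRIES runs out.
wait_for_gpus() {
  expected="$1"; tries="${2:-30}"; pause="${3:-10}"; probe="${4:-gpu_count}"
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    seen="$("$probe")"
    if [ "${seen:-0}" -ge "$expected" ]; then
      echo "attempt $attempt: $seen GPU(s) visible"
      return 0
    fi
    echo "attempt $attempt: ${seen:-0} of $expected GPU(s) visible, retrying in ${pause}s"
    sleep "$pause"
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (start_workers.sh stands in for the real launch script):
#   wait_for_gpus 2 && ./start_workers.sh
```

Passing the probe as an argument keeps the loop testable without a GPU, and lets the same loop wait on a different check if nvidia-smi is not the right signal.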
Superintendent
Superintendent12mo ago
understood
Ercan
ErcanOP12mo ago
And also so weird, runpod keeps charging for 2 gpus even though one is not even running
Superintendent
Superintendent12mo ago
yes
Ercan
ErcanOP12mo ago
@Superintendent So weird, gpu came back up while I was preparing things