[Urgent] One GPU suddenly went away
Hi, we have a prod issue right now: one of the GPUs in our pod suddenly disappeared
fall off the bus?
@Justin Can someone help and check our pod?
lspci | grep VGA
should spit out something about the gpu
also can u run nvidia-smi and show it
yes, here: the first gpu is missing

so what does
lspci | grep VGA
spit out?
cant run it, command not found
not sure what to install
what to install ?
wdym
lspci isnt there?
No
run lspci
without grep

lspci
you're missing an i
lspci

whar.
isn't this for amd?
wdym
you're on nvidia gpus
so yea it should work
yea, that is what I am double checking, whether this should work with nvidia
sudo apt-get update
sudo apt-get install pciutils
try that
i think its something to do with pciutils

worked now
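For anyone who lands here with the same problem, a rough sketch of the check above. It assumes a Debian/Ubuntu based image where `lspci` comes from `pciutils`. The helper name `count_nvidia_pci` is made up for illustration. It also matches "3D controller", since `grep VGA` alone can miss data center cards that are listed that way.

```shell
#!/bin/sh
# Hypothetical helper: count NVIDIA GPU functions in lspci output read from stdin.
# Consumer cards usually appear as "VGA compatible controller";
# data center cards often appear as "3D controller".
count_nvidia_pci() {
    grep -i 'nvidia' | grep -c -E 'VGA compatible controller|3D controller'
}

if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA GPUs on the PCI bus: $(lspci | count_nvidia_pci)"
else
    # Same fix as in the thread: apt-get update && apt-get install pciutils
    echo "lspci not found; on Debian/Ubuntu images it ships in the pciutils package"
fi
```

If the count here is lower than what the pod is supposed to have, the card is probably not visible on the bus at all, which points at the host rather than the driver inside the container.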
what does
dmesg
spit out
dmesg: read kernel buffer failed: Operation not permitted
sudo !!
(Wtf?)
cant run sudo on Pods
oh right
docker container
uhh.
yea
i have no friggin clue, can u try to restart it?
will try, but I moved all production processes to the gpu that is still running
so if I restart I need to run a bunch of things again
I am just waiting, maybe it comes back
oof is the 4090 ok with the load?
we have an internal queue, set it to 1 right now
also just realized this is community cloud
yes
thought this was on secure cloud
i personally havent had any issues on community tbh
is it sd that ur running?
bunch of things including sd, llm, and more models
oh nice u can cram all that into 24gb of vram?
or really 48
Yes, it is really enough to handle e.g. 100 active users at a time
did the restart fix anything?
I am working on it. Before that I need to set up a few scripts to get production running after restarts
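A minimal sketch of what such a startup script could check first, not the actual scripts used here. It counts the GPUs reported by `nvidia-smi -L` and compares against the number the pod should have. `EXPECTED_GPUS`, `count_gpus` and `check_gpus` are made-up names for illustration, and the default of 2 is just the GPU count mentioned in this thread.

```shell
#!/bin/sh
# Assumption: the pod is supposed to expose this many GPUs (2 in this thread).
EXPECTED_GPUS="${EXPECTED_GPUS:-2}"

# Count lines like "GPU 0: <name> (UUID: ...)" from `nvidia-smi -L` on stdin.
count_gpus() {
    grep -c '^GPU [0-9]'
}

# Print a status line; return non-zero when fewer GPUs are visible than expected.
check_gpus() {
    found="$1"
    expected="$2"
    if [ "$found" -lt "$expected" ]; then
        echo "WARN: only $found of $expected GPUs visible"
        return 1
    fi
    echo "OK: $found GPUs visible"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    found=$(nvidia-smi -L | count_gpus)
    # A real script would start workers here, with a smaller queue when degraded.
    check_gpus "$found" "$EXPECTED_GPUS" || echo "starting in degraded mode"
fi
```

Running the same check from a loop or cron job would also catch a GPU dropping out (or coming back) without anyone having to notice it by hand.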
understood
And also so weird, runpod keeps charging for 2 gpus even though one is not even running
yes
@Superintendent So weird, gpu came back up
while I was preparing things