GPU Pods in EU-SE-1 unexpectedly die after approximately 30 hours

We are seeing many instances of GPU pods (mainly A6000) that stop working after roughly 30 hours, also losing the VRAM contents. We have reported these issues repeatedly, but there is still no solution and it keeps happening. We have left a pod running (ID: cxquttq3m3kqvl) for you to debug, can you please help? Thanks
Madiator2011 (Work)
You need to give more details; the machine does not look to have any issues. Is your app crashing? Do you have any error messages?
runpoduser (OP) · 6mo ago
Hi, thank you for the help. Looking at the container logs, everything looks fine.
runpoduser (OP) · 6mo ago
From Wandb, I notice a reduction in wattage after approximately 30 hours.
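As a cross-check, the power draw could also be logged locally on the pod with a small script like the sketch below, so the exact time of the drop is captured even if the Wandb stream is interrupted around the failure; the poll interval and log path are arbitrary placeholders.

```python
import datetime
import subprocess
import time

LOG_PATH = "gpu_power.log"  # arbitrary output path on the pod's disk

while True:
    # Query power draw and utilization for each GPU via nvidia-smi.
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,power.draw,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    timestamp = datetime.datetime.now().isoformat()
    with open(LOG_PATH, "a") as f:
        f.write(f"{timestamp} {result.stdout.strip()}\n")
    time.sleep(60)  # poll once a minute
```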
Madiator2011 · 6mo ago
Though I think the issue is with the app you are running. Does it crash, maybe?
runpoduser (OP) · 6mo ago
My app is just a Jupyter notebook running a training script for an LLM. Unfortunately, the cell output does not update after I exit and reconnect to the pod's Jupyter, so if there is an error I don't think it will be visible. Is there a way to see the updated cell output even if I reconnect to the Jupyter notebook after a while? (The training script takes days; I can't keep my PC on the whole time.)
Madiator2011 · 6mo ago
That is kind of a bad thing to do. Usually you want to run long-term training in something like tmux/screen, so you can easily reconnect and still be able to check the logs.
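For example, here is a minimal sketch of writing progress to a log file on the pod's filesystem; the file name and the dummy loop are placeholders standing in for the real training script. You can then follow the file with tail -f from any new session.

```python
import logging
import time

# Write progress to a file on the pod's disk so it survives
# Jupyter disconnects; "training.log" and the loop below are placeholders.
logging.basicConfig(
    filename="training.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

for step in range(10_000):   # stand-in for the real training loop
    time.sleep(0.01)         # pretend work
    if step % 100 == 0:
        logging.info("step=%d", step)
```

The log survives Jupyter disconnects because it lives on the pod's filesystem rather than in the notebook's cell output.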
Madiator2011 · 6mo ago
[YouTube link: NetworkChuck, "you need to learn tmux RIGHT NOW!!"]
runpoduser (OP) · 6mo ago
Yeah I know, it was just for prototyping. Thanks man, I will have a look and try something different.
Madiator2011 · 6mo ago
@runpoduser I mean, try running the training script in tmux instead of Jupyter; that will give you an idea of what could be causing the issue. Something must be broken and it could be killing the process.
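A minimal wrapper along these lines (the main() stub and the crash.log name are placeholders) would at least record any Python-level exception to disk, so after ~30 hours it is visible whether the process raised an error or was killed from outside:

```python
import datetime
import traceback

def main():
    # Placeholder for the actual training entry point.
    raise RuntimeError("simulated crash")

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Append the full traceback with a timestamp, so a failure that
        # happens ~30 hours in is still visible after reconnecting.
        with open("crash.log", "a") as f:
            f.write(datetime.datetime.now().isoformat() + "\n")
            f.write(traceback.format_exc())
        raise
```

If there is no traceback at all, the process was most likely killed by something external (for example an out-of-memory kill), which is itself a useful clue.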