GPU Pods in EU-SE-1 unexpectedly die after approximately 30 hours
We are seeing many GPU pods (mainly A6000) stop working after about 30 hours, and the VRAM contents are lost as well.
We have reported this repeatedly, but there is still no solution and it keeps happening.
We have left a pod running (ID: cxquttq3m3kqvl) for you to debug. Can you please help?
Thanks
9 Replies
You need to give more details.
The machine does not appear to have any issues. Is your app crashing? Do you have any error message?
Hi, thank you for the help.
Looking at the container logs, everything looks fine.
From Wandb, I notice a drop in power draw (watts) after approximately 30 hours.
I think the issue is with the app you are running, though. Does it crash, maybe?
My app is just a Jupyter notebook running a training script for an LLM. Unfortunately, the cell output is not updated after I exit and reconnect to the pod's Jupyter, so if there is an error I don't think it will be visible. Is there a way to see the updated cell output even if I reconnect to the Jupyter notebook after a while? (The training script takes days, and I can't keep my PC on all the time.)
That is kind of a bad thing to do.
Usually you want to run long-running training in something like tmux/screen, so you can easily reconnect and still be able to check the logs.
Here is a good video:
https://youtu.be/nTqu6w2wc68?si=6VE08K5MyKEmG4CT
NetworkChuck (YouTube): you need to learn tmux RIGHT NOW!!
Yeah I know, it was just for prototyping. Thanks man, I will have a look and try something different.
@runpoduser I mean try running the training script in tmux instead of Jupyter; it will give you an idea of what could be causing the issue. Something must be broken, and it could be killing the process.
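For anyone following along, here is a minimal sketch of what "run it as a script so you can check logs" could look like. It assumes the notebook's training code can be wrapped in a single function; `run_with_logging`, `demo_train`, and the `train.log` path are hypothetical names for illustration, not anything from the original poster's setup. The wrapper writes progress and any Python traceback to a file, so the error is still visible after reconnecting.

```python
import logging
import traceback


def run_with_logging(train_fn, log_path="train.log"):
    """Run train_fn(logger); write progress and any crash traceback to log_path.

    Returns 0 on success and 1 if train_fn raised an exception.
    """
    logger = logging.getLogger("train")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output in the file only

    # FileHandler appends by default and flushes after every record,
    # so the file stays current even if the process dies later.
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)

    try:
        logger.info("training started")
        train_fn(logger)
        logger.info("training finished")
        return 0
    except Exception:
        # Record the full traceback so the failure survives a disconnect.
        logger.error("training crashed:\n%s", traceback.format_exc())
        return 1
    finally:
        handler.close()
        logger.removeHandler(handler)


def demo_train(logger):
    """Hypothetical stand-in for the real training loop."""
    for step in range(3):
        logger.info("step %d", step)
```

Saved as a script that calls `run_with_logging(demo_train)` (with the real training function swapped in), this can be started inside a tmux session (`tmux new -s train`, detach with Ctrl-b d, come back with `tmux attach -t train`) and followed with `tail -f train.log`. One caveat: a hard kill from outside the process (for example SIGKILL) cannot be caught in Python, so in that case the log simply stops without a traceback, which is itself a useful hint.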