GPU Pods in EU-SE-1 unexpectedly die after approximately 30 hours

We are seeing many instances of GPU pods (mainly A6000) that stop working after roughly 30 hours, also losing the VRAM contents. We have reported these issues repeatedly, but there is still no solution and it keeps happening. We have left a pod running (ID: cxquttq3m3kqvl) for you to debug, can you please help? Thanks
Madiator2011 (Work)
You need to give more details; the machine does not look to have any issues. Is your app crashing? Do you have any error messages?
runpoduser (OP) · 6mo ago
Hi, thank you for the help. Looking at the container logs, everything looks fine.
runpoduser (OP) · 6mo ago
From Wandb, I notice a reduction in wattage after approximately 30 hours.
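As a cross-check, the power draw could also be logged locally on the pod with a small script like the sketch below, so the exact time of the drop is captured even if the Wandb stream is interrupted around the failure; the poll interval and log path are arbitrary placeholders.

```python
import datetime
import subprocess
import time

LOG_PATH = "gpu_power.log"  # arbitrary output path on the pod's disk

while True:
    # Query power draw and utilization for each GPU via nvidia-smi.
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,power.draw,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    timestamp = datetime.datetime.now().isoformat()
    with open(LOG_PATH, "a") as f:
        f.write(f"{timestamp} {result.stdout.strip()}\n")
    time.sleep(60)  # poll once a minute
```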
Madiator2011 · 6mo ago
Though I think the issue is with the app you are running. Does it crash, maybe?
runpoduser (OP) · 6mo ago
My app is just a Jupyter notebook running a training script for an LLM. Unfortunately, the cell output does not update after I exit and reconnect to the pod's Jupyter, so if there is an error I don't think it will be visible. Is there a way to see the updated cell output even if I reconnect to the Jupyter notebook after a while? (The training script takes days; I can't keep my PC on the whole time.)
Madiator2011 · 6mo ago
That is kind of a bad thing to do. Usually you want to run long-term training in something like tmux/screen, so you can easily reconnect and still be able to check the logs.
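For example, here is a minimal sketch of writing progress to a log file on the pod's filesystem; the file name and the dummy loop are placeholders standing in for the real training script. You can then follow the file with tail -f from any new session.

```python
import logging
import time

# Write progress to a file on the pod's disk so it survives
# Jupyter disconnects; "training.log" and the loop below are placeholders.
logging.basicConfig(
    filename="training.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

for step in range(10_000):   # stand-in for the real training loop
    time.sleep(0.01)         # pretend work
    if step % 100 == 0:
        logging.info("step=%d", step)
```

The log survives Jupyter disconnects because it lives on the pod's filesystem rather than in the notebook's cell output.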
Madiator2011 · 6mo ago
[YouTube link: NetworkChuck, "you need to learn tmux RIGHT NOW!!"]
runpoduser (OP) · 6mo ago
Yeah I know, it was just for prototyping. Thanks man, I will have a look and try something different.
Madiator2011 · 6mo ago
@runpoduser I mean, try running the training script in tmux instead of Jupyter; that will give you an idea of what could be causing the issue. Something must be broken and it could be killing the process.
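A minimal wrapper along these lines (the main() stub and the crash.log name are placeholders) would at least record any Python-level exception to disk, so after ~30 hours it is visible whether the process raised an error or was killed from outside:

```python
import datetime
import traceback

def main():
    # Placeholder for the actual training entry point.
    raise RuntimeError("simulated crash")

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Append the full traceback with a timestamp, so a failure that
        # happens ~30 hours in is still visible after reconnecting.
        with open("crash.log", "a") as f:
            f.write(datetime.datetime.now().isoformat() + "\n")
            f.write(traceback.format_exc())
        raise
```

If there is no traceback at all, the process was most likely killed by something external (for example an out-of-memory kill), which is itself a useful clue.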