Jupyter notebook (in Chrome tab) consistently crashing after 20 hours
My JupyterLab notebook Chrome tab crashed partway through 22 hours of training a model. How do I know if it's still training, if it has stopped, or if it's just running without doing anything? This has happened to me 3 times in a row, and this time I would like to know what is happening. The GPU usage is going up and down, which suggests it is training and simply not showing in the notebook, but I would like to make sure.
13 Replies
any update? 48 hours running and still nothing.
@Justin / @Madiator2011 - Just tagging staff who may be able to give your pod a look. My guess is that in the future you can click Logs, and you can always use Connect > Web Terminal, or SSH directly into your pod.
Hard to know why your chrome tab is crashing though.
Try running the command from the terminal using screen or tmux.
I would say, though, that since your GPU utilization is there (especially since it seemed to go up?), it's probably still working.
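One way to go beyond watching the utilization graph is to ask `nvidia-smi` which processes currently hold GPU memory. The sketch below is only an illustration: it assumes `nvidia-smi` is on the PATH in the pod's terminal, and the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example. Inside some containers the process list can come back empty even while a job is running, so an empty result is inconclusive rather than proof that training stopped.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text):
    """Parse the CSV printed by nvidia-smi's compute-apps query (no header row)."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, memory = (field.strip() for field in fields)
        # Values are kept as strings because nvidia-smi may print placeholders.
        rows.append({"pid": pid, "name": name, "used_memory": memory})
    return rows


def list_gpu_processes():
    """Return the processes that currently hold GPU memory."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_processes()` from a web terminal or SSH session and seeing a Python process in the result would suggest the training job is still alive, even though the notebook tab no longer shows its output.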
Jupyter is not advised for running long jobs.
NetworkChuck (YouTube): "you need to learn tmux RIGHT NOW!!"
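For anyone who wants the same "keeps running after the tab dies" behaviour without tmux or screen, here is a rough Python alternative. It is not what the video shows: it starts the job in its own session with `subprocess.Popen` and writes all output to a log file. The function name `launch_detached` and the file names used in the usage note (`train.py`, `train.log`) are hypothetical, and `start_new_session` only works on POSIX systems.

```python
import subprocess


def launch_detached(command, log_path):
    """Start `command` in its own session, appending stdout and stderr to log_path."""
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdout=log_file,
        stderr=subprocess.STDOUT,   # send errors to the same log file
        stdin=subprocess.DEVNULL,   # the job must not wait for keyboard input
        start_new_session=True,     # POSIX only: detach from the launching terminal
    )
    log_file.close()  # the child process keeps its own copy of the file handle
    return process
```

A call such as `launch_detached(["python", "train.py"], "train.log")` returns right away, and progress can then be followed from any terminal by reading `train.log`. tmux or screen is still the more common choice because it also lets you reattach to an interactive session.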
I also made a template for an alternative notebook system:
https://runpod.io/gsc?template=9ehepsqiw2&ref=vfker49t
It has some cool things but also cons:
Pros:
- Background jobs: you can close the website and the job will still run, output included
- If you switch to next-gen Zeppelin you get a much more modern UI
Cons:
- Jupyter notebooks are not directly compatible
- No uploading files via drag and drop (maybe I should move it to pros)
Lol, I think having no drag-and-drop upload is a pro. It forces people to use runpodctl, or to scp a zip over direct SSH. (I think I remember flash saying that runpodctl still goes through a middle server, so direct SSH seems to always be the best?)
Uploading via drag and drop:
- slow upload speed
- easy to corrupt files
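Whichever transfer method is used, a checksum comparison is a simple way to catch a corrupted upload. This is a minimal sketch using Python's standard `hashlib`; the helper name `sha256_of_file` is made up for this example. The idea is to run it on the file before uploading and again on the pod, then compare the two hex strings.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest computed on the pod differs from the one computed locally, the file was altered in transit and should be uploaded again.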
do u happen to have a repo for this? just curious.
It's the RunPod PyTorch template with Apache Zeppelin (https://zeppelin.apache.org/) installed.
@justin @Madiator2011 Thank you so much for all the advice and tips. In the future I will definitely be using tmux to train models or for any long-running jobs. The model finally finished training and I'm currently downloading it to see if it works ❤️
For training, always use the console. I know how many times training has failed because Colab or Jupyter stopped working.