RunPod disconnecting/resetting during model training
Hi everyone,
I've encountered an issue several times over the past week and have yet to successfully complete a model because of it.
I've triple-checked to ensure I'm using an On-Demand instance. However, after a few hours of running my model, the web server or Jupyter notebook loses its connection. When I reconnect, the session appears to have reset:
• If I use the web server, when I reconnect, the terminal is blank.
• If I use Jupyter Notebook, the kernel is idle.
Despite this, I can see from the pod information that something is still running, and the GPU usage indicates activity. However, I'm unable to access or resume whatever process is ongoing.
As far as I know, I should be able to disconnect my internet, shut down my machine, and later log back in to find the model either completed or still running. This behavior suggests the interruption is happening on the server side rather than my end (I have funds in my account).
Does anyone know why this might be happening or how to resolve it?
Thanks in advance!
7 Replies
Yes, Jupyter closes the process when you close it or get disconnected.
Use the terminal and run your training inside tmux. It keeps the terminal session open after you disconnect.
Ah, I have connected through SSH, currently running. Is that ok?
Or screen, that's an alternative to tmux.
no...
Use a terminal multiplexer like the two apps I've mentioned.
Ah, ok, have not heard of tmux. I will check it out now 🙂
A plain SSH session also closes when you close the connection, which is why it isn't enough on its own.
Ah, thank you so much nerdylive 🙂
You're welcome bro!
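For anyone landing here later, the tmux advice above can be sketched roughly as follows. This is only an illustration, not an official RunPod recipe: the session name `train`, the script `train.py`, and the log file `train.log` are hypothetical placeholders, and it assumes tmux is already installed on the pod (it may need to be installed first, depending on the image). The helper names are made up for this example; the tmux subcommands and flags (`new-session -d -s`, `attach-session -t`) are standard tmux.

```python
import shlex
import shutil
import subprocess


def tmux_start_cmd(session, command):
    """Build the argv that starts `command` inside a detached tmux session."""
    # -d: do not attach to the new session; -s: give the session a name.
    # The last argument is the shell command tmux runs inside the session.
    return ["tmux", "new-session", "-d", "-s", session, command]


def tmux_attach_cmd(session):
    """Build the argv that reattaches to an existing tmux session."""
    return ["tmux", "attach-session", "-t", session]


def start_detached(session, command):
    """Start `command` in tmux so it keeps running after you disconnect."""
    if shutil.which("tmux") is None:
        # Assumption: tmux may not be present on every pod image.
        raise RuntimeError("tmux not found; install it on the pod first")
    subprocess.run(tmux_start_cmd(session, command), check=True)


if __name__ == "__main__":
    # Hypothetical training command; tee keeps a copy of the output on disk.
    train = "python train.py 2>&1 | tee train.log"
    # Print the equivalent shell commands instead of running them.
    print(shlex.join(tmux_start_cmd("train", train)))
    print(shlex.join(tmux_attach_cmd("train")))
```

Running the script prints the two shell commands you could also type by hand in the pod terminal: one to start the job in a detached session, one to reattach after reconnecting. While attached, pressing Ctrl-b and then d detaches again without stopping the job. Writing the output to a log file is optional, but it means you can still check progress even if you never reattach.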