RunPod · 3w ago
Brawl

RunPod disconnecting/resetting during model training

Hi everyone, I've encountered an issue several times over the past week and have yet to successfully complete a model because of it. I've triple-checked to ensure I'm using an On-Demand instance. However, after a few hours of running my model, the web server or Jupyter notebook loses its connection. When I reconnect, the session appears to have reset:

• If I use the web server, when I reconnect, the terminal is blank.
• If I use Jupyter Notebook, the kernel is idle.

Despite this, I can see from the pod information that something is still running, and the GPU usage indicates activity. However, I'm unable to access or resume whatever process is ongoing.

As far as I know, I should be able to disconnect my internet, shut down my machine, and later log back in to find the model either completed or still running. This behavior suggests the interruption is happening on the server side rather than my end (I have funds in my account).

Does anyone know why this might be happening or how to resolve it? Thanks in advance!
7 Replies
nerdylive · 3w ago
Yes, Jupyter closes the process when you close it or disconnect. Use the terminal instead, and run your training inside tmux. It keeps the terminal session open.
Brawl (OP) · 3w ago
Ah, I have connected through SSH, currently running. Is that ok?
nerdylive · 3w ago
Or screen, that's an alternative to tmux.
No... use a terminal multiplexer like one of those two apps I've mentioned.
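A minimal sketch of the tmux suggestion above, written in Python and assuming tmux is installed on the pod. The session name `train`, the script `train.py`, and the log file `train.log` are placeholders for illustration and do not come from the thread. The helper functions are hypothetical; only the `tmux new-session` and `tmux attach-session` subcommands are real tmux commands.

```python
import shlex
import shutil
import subprocess


def tmux_launch_command(session: str, train_cmd: str) -> list[str]:
    """Build the tmux invocation that starts train_cmd in a detached session."""
    # new-session -d creates the session without attaching to it;
    # -s names it so it can be found again after reconnecting.
    return ["tmux", "new-session", "-d", "-s", session, train_cmd]


def tmux_attach_command(session: str) -> list[str]:
    """Build the tmux invocation that reattaches to an existing session."""
    return ["tmux", "attach-session", "-t", session]


def launch(session: str, train_cmd: str) -> None:
    """Start train_cmd inside a detached tmux session on this machine."""
    if shutil.which("tmux") is None:
        raise RuntimeError("tmux was not found on PATH")
    subprocess.run(tmux_launch_command(session, train_cmd), check=True)


if __name__ == "__main__":
    # Placeholder command: run a training script and keep a copy of its output.
    cmd = tmux_launch_command("train", "python train.py 2>&1 | tee train.log")
    # Print the equivalent shell command instead of running it.
    print(shlex.join(cmd))
```

After reconnecting to the pod, the session can be reopened from a terminal with `tmux attach-session -t train`. With tmux's default key bindings, `Ctrl-b` followed by `d` detaches again and leaves the job running.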
Brawl (OP) · 3w ago
Ah, ok, have not heard of tmux. I will check it out now 🙂
nerdylive · 3w ago
SSH also closes the process when you close the connection, that's why you shouldn't rely on it alone.
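For context, a job started from a plain SSH shell stays attached to that terminal and is typically sent a hangup signal when the connection drops. The sketch below shows a different technique from the tmux/screen advice in the thread: starting the job in its own session with Python's `subprocess` module, with output redirected to a log file. It assumes a Linux pod; `start_detached` is a hypothetical helper, and unlike tmux it gives you no terminal to reattach to, only the log.

```python
import subprocess


def start_detached(cmd: list[str], log_path: str) -> subprocess.Popen:
    """Start cmd in its own session and append its output to log_path."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # do not read from the terminal
            stdout=log,                 # send stdout to the log file
            stderr=subprocess.STDOUT,   # merge stderr into the same log
            start_new_session=True,     # POSIX only: child leaves the terminal's session
        )
    # The child holds its own copy of the log file descriptor,
    # so closing ours here does not interrupt its output.
    return proc
```

For example, `start_detached(["python", "train.py"], "train.log")` would start a placeholder training script, and progress could then be followed with `tail -f train.log` after logging back in.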
Brawl (OP) · 3w ago
Ah, thank you so much nerdylive 🙂
nerdylive · 3w ago
You're welcome bro!