Services Stopped
Hi team,
Could somebody help me with this issue?
I have a pod running the runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu image with 4 RTX 4090 GPUs.
To start my AI training program, I enter commands on the command line and the process starts. But after 1-4 hours the process stops, and I have to retype all the commands to start it again.
What could be stopping the process? Why do I need to restart everything 3-4 times per day?
8 Replies
Is it stopping or is it the ssh connection that is resetting?
No, the service itself is stopping.
The connection is not resetting
Actually, I have the following issue.
Once connected to the server via SSH, I run commands to start the AI services. They start and run well. But when I close my laptop or my terminal, the SSH connection drops, which seems to be OK, but the AI services stop as well.
You can use screen or tmux for this.
What is it? How can I resolve the issue?
An SSH connection can't stay open once you close your laptop, and a process started from that SSH session is normally sent a hangup signal (SIGHUP) and terminated when the session ends. screen and tmux start a session on the server that keeps running in the background, so you can close your laptop or terminal and reattach to the session later. I would highly recommend using them for training in any case.
Screen is typically easier to use than tmux.
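A minimal sketch of that workflow, in case it helps. It assumes tmux or screen is installed on the pod; the helper name `run_detached`, the session name `train`, and the `python train.py` command are placeholders for illustration, not something from your setup.

```shell
# Start a command in a detached session so it keeps running after SSH drops.
# run_detached is a hypothetical helper: first argument is the session name,
# the remaining arguments are the command line to run.
run_detached() {
    session="$1"; shift
    if command -v tmux >/dev/null 2>&1; then
        # -d: do not attach to the new session, -s: session name
        tmux new-session -d -s "$session" "$*"
    elif command -v screen >/dev/null 2>&1; then
        # -d -m: start detached, -S: session name
        screen -dmS "$session" sh -c "$*"
    else
        echo "neither tmux nor screen is installed" >&2
        return 1
    fi
}

# Example with a placeholder training command:
#   run_detached train "python train.py 2>&1 | tee train.log"
# After reconnecting over SSH, reattach to watch the output:
#   tmux attach -t train     (detach again with Ctrl-b, then d)
#   screen -r train          (detach again with Ctrl-a, then d)
```

You can also do the same thing by hand: run `screen -S train` (or `tmux new -s train`), type your usual commands inside it, then detach with the key combination above before closing the laptop.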
Is screen also a terminal?
No, it's basically a session manager.
You use it within the terminal.
OK, I'll read about this. Thanks for the advice. I'll give it a try :)