Pod execution stopping without errors
I've been having issues since I started using a Pod yesterday. The fine-tuning script running inside the pod stops abruptly at random, every few hours, with no errors and nothing in the logs. Every time this happens I waste money, and I can't watch it 24/7 to make sure it's still running. What could be happening?
10 Replies
It just happened again. I keep the web terminal open and after a few hours I see "Connection closed" and the execution has also stopped
it could be https://discord.com/channels/912829806415085598/1321544831125950524 if you are close to your RAM limit
Thanks for the reply, but RAM is sitting at 20%, and it's definitely not VRAM because there's no OOM error, it just closes the web terminal and the process with it
ah, you mean only the web terminal closes and the pod continues? I have seen this happen regularly, for no apparent reason, ever since I first used Runpod
even if RunPod improves this, I wouldn't have a long-running task rely on your terminal connection being stable the entire time. Look into the Linux nohup command.
No, the terminal closes and the process inside the pod stops. I checked through wandb and it's stopped
Next time I'll try with nohup, thanks
Linux processes started from a terminal are normally killed when that terminal closes (they receive SIGHUP), unless you start them with nohup
has to be that then, thank you
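For anyone landing here later, a minimal sketch of what nohup changes. The loop is a hypothetical stand-in for the real training command, and train.log is just an example file name. Sending SIGHUP by hand is only a way to imitate the terminal closing; it is an assumption that the web terminal disconnect behaves the same way.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real training command: one line per second.
job='for i in 1 2 3; do echo "step $i"; sleep 1; done'

# nohup makes the job ignore SIGHUP; output goes to a log file, not the terminal.
nohup sh -c "$job" > train.log 2>&1 < /dev/null &
pid=$!

# Give nohup a moment to start before sending any signal.
sleep 1

# Imitate the terminal closing by sending the hangup signal ourselves.
kill -HUP "$pid"

# The job should still be alive afterwards.
survived=no
if kill -0 "$pid" 2>/dev/null; then
  survived=yes
  echo "still running after SIGHUP (pid $pid)"
fi
```

Progress can then be followed from any terminal with tail -f train.log.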
@Runpod it's still annoying though. Web terminals close for no reason, even while actively working with them
I couldn't fix it using nohup because the output wouldn't show up anywhere, not even in the nohup.out file. It would only appear once I stopped the process (?)
I could do it in the end using the screen command
I closed the terminal and it's still running
export PYTHONUNBUFFERED=1 && nohup yourcmd & tail -F nohup.out
assuming your app is Python
without PYTHONUNBUFFERED the output only shows up in batches
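A rough sketch of the buffering effect, in case it helps. demo_train.py is a made-up script that prints one line per second, and it assumes python3 is on the PATH. When stdout is a file rather than a terminal, Python buffers output in blocks, so the log usually stays empty until the process exits unless PYTHONUNBUFFERED is set (python3 -u does the same thing).

```shell
#!/usr/bin/env bash
# Made-up training script: prints one line per second for three seconds.
cat > demo_train.py <<'EOF'
import time
for step in range(3):
    print(f"step {step}")
    time.sleep(1)
EOF

# Same script twice: default buffering vs. unbuffered output.
nohup python3 demo_train.py > buffered.log 2>&1 < /dev/null &
PYTHONUNBUFFERED=1 nohup python3 demo_train.py > unbuffered.log 2>&1 < /dev/null &

# Look at both logs while the jobs are still running.
sleep 2
echo "buffered.log lines so far:   $(wc -l < buffered.log)"
echo "unbuffered.log lines so far: $(wc -l < unbuffered.log)"

# After both jobs exit, the two logs contain the same lines.
wait
echo "buffered.log lines at exit:   $(wc -l < buffered.log)"
echo "unbuffered.log lines at exit: $(wc -l < unbuffered.log)"
```

Midway through, the buffered log is typically still empty while the unbuffered one already has lines in it, which matches the "only appears when the process ends" symptom.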
screen works too
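For the screen route, a small sketch under a few assumptions: screen is installed in the pod image, "finetune" is just an example session name, and the loop stands in for the real training command. The -dmS flags start a named session that is already detached, so nothing depends on the web terminal staying open.

```shell
#!/usr/bin/env bash
session=finetune   # example session name, pick anything
# Hypothetical stand-in for the real training command, logging to a file.
job='for i in 1 2 3; do echo "step $i"; sleep 1; done > screen_train.log 2>&1'

# -d -m: start detached; -S: give the session a name to reattach to later.
if command -v screen >/dev/null 2>&1 && screen -dmS "$session" sh -c "$job"; then
  started=yes
  echo "started detached session '$session'"
  echo "reattach with: screen -r $session   (detach again with Ctrl-A then D)"
else
  started=no
  echo "screen is not available here; install it first"
fi
```

The session ends on its own when the command finishes; screen -ls shows which sessions are still alive.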