Pod execution stopping without errors

I've been having issues since I started using a Pod yesterday: the fine-tuning script running inside the pod stops abruptly and at random, with no errors and nothing in the logs. It happens every few hours. Every time it does, I waste money, and I can't afford to watch it 24/7 to make sure it's still running. What could be happening?
10 Replies
Juampab12 (OP), 2mo ago
It just happened again. I keep the web terminal open and after a few hours I see "Connection closed" and the execution has also stopped
dxqbYD, 2mo ago
it could be https://discord.com/channels/912829806415085598/1321544831125950524 if you are close to your RAM limit
Juampab12 (OP), 2mo ago
Thanks for the reply, but RAM is sitting at 20%, and it's definitely not VRAM because there's no OOM error. It just closes the web terminal and the process with it.
dxqbYD, 2mo ago
Ah, you mean only the web terminal closes and the pod continues? I've seen this happen for no reason ever since I first started using Runpod regularly. Even if Runpod improves this, I wouldn't have a long-running task rely on your terminal connection being stable the entire time. Look into the Linux nohup command.
Juampab12 (OP), 2mo ago
No, the terminal closes and the process inside the pod stops. I checked through wandb and it has stopped. Next time I'll try with nohup, thanks.
dxqbYD, 2mo ago
Linux processes started from a terminal normally get stopped when that terminal closes (they receive SIGHUP), unless you run them with nohup.
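
For readers following along, here is a minimal sketch of the nohup approach described above. It is only an illustration: `sleep 5` stands in for the real training command, and the file names `train.log` and `train.pid` are hypothetical choices, not anything Runpod requires.

```shell
# "sleep 5" is a stand-in for the real fine-tuning command (assumption).
# nohup makes the command ignore SIGHUP, the signal sent when the terminal goes away.
# Redirecting stdout and stderr sends the output to a log file of our choosing
# instead of the default nohup.out.
nohup sleep 5 > train.log 2>&1 &

# Remember the PID so the job can be checked from a new terminal later.
echo "$!" > train.pid

# kill -0 sends no signal; it only reports whether the process still exists.
if kill -0 "$(cat train.pid)" 2>/dev/null; then
    echo "still running"
else
    echo "not running"
fi
```

Saving the PID is a design choice for convenience: after reconnecting, the last `if` block can be rerun on its own to see whether the job survived.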
Juampab12 (OP), 2mo ago
has to be that then, thank you
dxqbYD, 2mo ago
@Runpod it's still annoying though. Web terminals close for no reason, even while you are actively working in them.
Juampab12 (OP), 2mo ago
I couldn't fix it using nohup because the output wouldn't show up anywhere, not even in the nohup.out file. It would only appear when I closed the process (?). I got it working in the end using the screen command: I closed the terminal and it's still running.
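
The thread does not show the exact screen commands that were used, so the following is only a rough sketch of one common way to do it. The session name `finetune` is hypothetical and `sleep 5` stands in for the training command; it also assumes GNU screen is installed in the pod image, which may not be the case.

```shell
SESSION="finetune"   # hypothetical session name

if command -v screen >/dev/null 2>&1; then
    # -d -m starts the session already detached; -S gives it a name.
    # "sleep 5" is a stand-in for the real training command (assumption).
    screen -dmS "$SESSION" sh -c 'sleep 5'
    # List sessions to confirm it was created.
    screen -ls | grep "$SESSION"
else
    echo "screen is not installed"
fi

# To watch the job later, reattach with:  screen -r finetune
# Detach again without stopping it by pressing Ctrl-A, then D.
```

Because the job runs inside the screen session rather than the web terminal, a dropped connection only detaches the viewer and leaves the process alone.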
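dxqbYD, 2mo ago
Run these one after another:
export PYTHONUNBUFFERED=1
nohup yourcmd &
tail -f nohup.out
The PYTHONUNBUFFERED part assumes your app is Python; without it, the output only shows up in batches because Python buffers stdout when it is written to a file. screen works too.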
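
To see the buffering effect described above in isolation, a small experiment along these lines may help. It is a sketch under assumptions: `python3` is on the PATH, and `slow_print.py` and the log file names are made up for the demonstration.

```shell
# A tiny stand-in for a training script: print, wait, print again.
cat > slow_print.py <<'EOF'
import time
print("step 1")
time.sleep(4)
print("step 2")
EOF

# Same script twice, output redirected to a file both times.
python3 slow_print.py > buffered.log 2>&1 &
PYTHONUNBUFFERED=1 python3 slow_print.py > unbuffered.log 2>&1 &

# Look at the file sizes while both scripts are still sleeping.
sleep 2
wc -c buffered.log unbuffered.log > mid_run_sizes.txt
cat mid_run_sizes.txt

# Let both finish; afterwards both logs contain both lines.
wait
```

Midway through, the buffered log should typically still be empty while the unbuffered one already holds the first line, which matches the symptom of nohup.out staying blank until the process exits.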
