Created by cw on 3/21/2025 in #⛅|pods
LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log
I have been trying, unsuccessfully, to keep my LLM finetuning process alive. I am using 4 V200 GPUs with PyTorch FSDP. The process tends to crash when saving checkpoints, BUT not always. I removed the checkpoints and now it's crashing in the middle of the training loop, somewhat randomly. This is what's in my nohup.out:

{'loss': 0.1151, 'grad_norm': 2.4503021240234375, 'learning_rate': 5.616492701703402e-07, 'mean_token_accuracy': 0.9721812009811401, 'epoch': 3.42}
{'loss': 0.0988, 'grad_norm': 0.8195383548736572, 'learning_rate': 5.590474292278636e-07, 'mean_token_accuracy': 0.9391982555389404, 'epoch': 3.42}
86%|████████▌ | 2502/2924 [1:04:27<09:00, 1.28s/it]
W0321 22:43:59.549000 138453348574336 torch/distributed/elastic/agent/server/api.py:688] Received 1 death signal, shutting down workers
W0321 22:43:59.552000 138453348574336 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 11096 closing signal SIGHUP
...

I don't think the system was running out of memory, because I have plenty, but with no access to syslog or dmesg on my pod I can't really tell. (I think the "1 death signal" is SIGHUP, i.e. signal 1, which would line up with the SSH disconnects, even though I launched with nohup.) Has anyone experienced the same? Thanks.
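In case it is actually the host OOM killer, the next thing I'm going to try is logging host and GPU memory from inside the training loop, so I can see whether free RAM is trending toward zero right before the kill. A minimal sketch of what I have in mind, assuming the Hugging Face Trainer callback API (which looks like what's producing those log dicts) and psutil as an extra dependency:

import psutil
import torch
from transformers import TrainerCallback

class MemoryLoggingCallback(TrainerCallback):
    # Print host RSS, free host RAM, and per-GPU allocated memory
    # alongside the regular training logs, so a host-side OOM kill
    # would show up as free RAM shrinking before the process dies.
    def on_log(self, args, state, control, logs=None, **kwargs):
        vm = psutil.virtual_memory()
        rss_gb = psutil.Process().memory_info().rss / 1e9
        line = (f"step {state.global_step}: "
                f"rss={rss_gb:.1f}GB free_ram={vm.available / 1e9:.1f}GB")
        if torch.cuda.is_available():
            line += " " + " ".join(
                f"cuda:{i}={torch.cuda.memory_allocated(i) / 1e9:.1f}GB"
                for i in range(torch.cuda.device_count())
            )
        print(line, flush=True)

Then trainer.add_callback(MemoryLoggingCallback()) before calling train(). If free RAM stays flat right up to the kill, that would point more at the SIGHUP/disconnect theory than at memory.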