LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log
I have been trying, unsuccessfully, to keep my LLM finetuning process alive. I am using 4 H200 GPUs with PyTorch FSDP. The process tends to crash when saving checkpoints, but not always: I disabled checkpoint saving and now it crashes in the middle of the training loop, somewhat randomly.
This is what's in my nohup.out:
{'loss': 0.1151, 'grad_norm': 2.4503021240234375, 'learning_rate': 5.616492701703402e-07, 'mean_token_accuracy': 0.9721812009811401, 'epoch': 3.42}
{'loss': 0.0988, 'grad_norm': 0.8195383548736572, 'learning_rate': 5.590474292278636e-07, 'mean_token_accuracy': 0.9391982555389404, 'epoch': 3.42}
86%|████████▌ | 2502/2924 [1:04:27<09:00, 1.28s/it]W0321 22:43:59.549000 138453348574336 torch/distributed/elastic/agent/server/api.py:688] Received 1 death signal, shutting down workers
W0321 22:43:59.552000 138453348574336 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 11096 closing signal SIGHUP
...
I don't think the system was running out of memory, because I have plenty. But with no access to syslog or dmesg on my pod, I can't really tell.
Has anyone experienced the same? Thanks.
4 Replies
I am documenting the issue and the workaround in case somebody finds it useful.
The crash (more precisely, a kill after a timeout) tends to happen when the GPUs synchronize to do something, such as saving a checkpoint. However, it can also happen sporadically in the middle of the training loop. You may not encounter this problem with a small workload, but I ran into it all the time when doing full finetuning of a 14B model on 4 GPUs (H100/H200, NVL/SXM).
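To make it concrete why checkpoint saving is a synchronization point: under FSDP, assembling the full (unsharded) state dict is a collective operation, so every rank has to reach it together; if one rank stalls, the others sit in an all-gather until the NCCL watchdog kills them. A minimal sketch of that pattern, assuming a raw FSDP-wrapped model and an already-initialized process group (a trainer does the equivalent internally; the function name and path are placeholders):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

def save_full_checkpoint(model: FSDP, path: str = "ckpt.pt") -> None:
    # Gathering the full state dict is a collective: every rank must enter
    # this context and call state_dict(), otherwise the remaining ranks
    # block in ALLGATHER until the NCCL watchdog timeout (30 min by default).
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
    dist.barrier()  # keep ranks aligned before training resumes
```

If any rank is delayed here (slow CPU offload, host memory pressure), the others hit exactly the ALLGATHER timeout shown in the error below.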
The key error message from PyTorch:
100%|█████████▉| 2187/2193 [56:14<00:07, 1.29s/it][rank1]:[E322 01:11:03.777831019 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=435362, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800016 milliseconds before timing out.
You will also see the GPUs and CPUs running at high load when it crashes.
Googling turns up a lot of suggestions to increase the timeout. I don't think that's the right fix, because saving a checkpoint should not take 30 minutes in the first place.
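For reference, this is what that commonly suggested workaround looks like; a sketch assuming a plain torch.distributed setup (with the HF Trainer the equivalent knob is `ddp_timeout` in `TrainingArguments`). I consider it a band-aid rather than the fix:

```python
from datetime import timedelta
import torch.distributed as dist

# Commonly suggested workaround: raise the collective timeout from the
# 30-minute default so the NCCL watchdog doesn't kill the job during a
# long synchronization. It hides the symptom rather than fixing the stall.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```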
I experimented with some NVIDIA and FSDP settings related to synchronization. Here's the combination that got me out of the crash loop (a code sketch follows the list):
FSDP config:
* limit_all_gathers = true
* sync_module_states = true
NVIDIA environment variables:
* export TORCH_NCCL_HIGH_PRIORITY=1
* export NCCL_P2P_LEVEL=NVL
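Where exactly these are set depends on your launcher; I set the FSDP options through my training config, but on the raw PyTorch API they map to constructor arguments, roughly as in this sketch (the model is a placeholder, and the environment variables really belong in the launch script, as in the export lines above):

```python
import os
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# NCCL-related environment variables. They must be set before the process
# group / NCCL communicators are created, so in practice export them in
# the launch script rather than mid-training.
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"  # run NCCL on a high-priority CUDA stream
os.environ["NCCL_P2P_LEVEL"] = "NVL"          # restrict P2P to NVLink-connected GPUs

dist.init_process_group(backend="nccl")       # assumes launch via torchrun

# The two FSDP settings correspond to constructor arguments on the raw API
# (with accelerate/TRL they are passed through the FSDP plugin/config instead).
base_model = nn.Linear(4096, 4096)            # placeholder for the real model
model = FSDP(
    base_model.cuda(),
    limit_all_gathers=True,    # rate-limit all-gathers so ranks don't drift apart
    sync_module_states=True,   # broadcast rank 0's module states at wrap time
)
```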
@cw
The thread has been escalated to Zendesk!
Please open a support ticket.
I filed a support ticket already