RunPod · 3w ago
cw

LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log

I have been trying, unsuccessfully, to keep my LLM finetuning process alive. I am using 4 H200 GPUs with PyTorch FSDP. The process tends to crash when saving checkpoints, but not always. I removed the checkpoints and now it's crashing in the middle of the training loop, somewhat randomly. This is what's in my nohup.out:

{'loss': 0.1151, 'grad_norm': 2.4503021240234375, 'learning_rate': 5.616492701703402e-07, 'mean_token_accuracy': 0.9721812009811401, 'epoch': 3.42}
{'loss': 0.0988, 'grad_norm': 0.8195383548736572, 'learning_rate': 5.590474292278636e-07, 'mean_token_accuracy': 0.9391982555389404, 'epoch': 3.42}
86%|████████▌ | 2502/2924 [1:04:27<09:00, 1.28s/it]
W0321 22:43:59.549000 138453348574336 torch/distributed/elastic/agent/server/api.py:688] Received 1 death signal, shutting down workers
W0321 22:43:59.552000 138453348574336 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 11096 closing signal SIGHUP
...

I don't think the system was running out of memory, because I have plenty. But with no access to syslog or dmesg on my pod, I can't really tell. Has anyone experienced the same? Thanks.
4 Replies
cw (OP) · 3w ago
I am documenting the issue and the workaround in case somebody finds it useful. The crash (really a kill after a timeout) tends to happen when the GPUs synchronize to do something, like saving a checkpoint. However, it can also happen sporadically during the training loop. You may not encounter this problem with a small workload, but I ran into it all the time while doing full finetuning of a 14B model on 4 GPUs (H100/H200, NVL/SXM).

The key error message from PyTorch:

100%|█████████▉| 2187/2193 [56:14<00:07, 1.29s/it][rank1]:[E322 01:11:03.777831019 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=435362, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800016 milliseconds before timing out.

You will also see the GPUs and CPUs running at high load when it crashes. Googling turns up many suggestions to increase the timeout. I don't think that's the right fix, because saving a checkpoint is not supposed to take that long. I experimented with some NVIDIA and FSDP settings related to synchronization. Here's the combination that got me out of the crash loop (a sketch of how to apply it follows below):

FSDP config:
* limit_all_gathers = true
* sync_module_states = true

NVIDIA environment:
* export TORCH_NCCL_HIGH_PRIORITY=1
* export NCCL_P2P_LEVEL=NVL
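For reference, here is a minimal sketch of where these settings go when wrapping a model directly with torch.distributed.fsdp. The model and training loop are stand-ins rather than my actual script, so adapt it to your own setup:

```python
import os

# NCCL/PyTorch environment settings from the workaround. They must be visible
# before the process group is created (exporting them in the shell that runs
# torchrun works just as well).
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"  # put NCCL collectives on high-priority CUDA streams
os.environ["NCCL_P2P_LEVEL"] = "NVL"          # use GPU peer-to-peer only over NVLink

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Stand-in module; in the real run this is the 14B model being finetuned.
    model = torch.nn.Linear(1024, 1024).cuda()

    model = FSDP(
        model,
        limit_all_gathers=True,   # rate-limit all-gathers so ranks don't drift far apart
        sync_module_states=True,  # broadcast rank 0's weights/buffers so all ranks start identical
        device_id=torch.cuda.current_device(),
    )

    # ... training loop and checkpoint saving as before ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launch it the usual way with torchrun --nproc_per_node=4 so the elastic agent starts one worker per GPU, as in the logs above.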
Poddy · 3w ago
@cw
Escalated To Zendesk
The thread has been escalated to Zendesk!
Jason · 3w ago
Please open a support ticket
cw (OP) · 3w ago
I filed a support ticket already