RunPod
Created by cw on 3/21/2025 in #⛅|pods-clusters
LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log
I filed a support ticket already
5 replies
I am documenting the issue and the workaround in case somebody finds it useful.

The crash (more precisely, a kill after a timeout) tends to happen when the GPUs synchronize to do something, like saving a checkpoint, but it can also happen sporadically during the training loop. You may not hit this with a small workload; I ran into it constantly when doing full finetuning of a 14B model on 4 GPUs (H100/H200, NVL/SXM).

The key error message from PyTorch:

100%|█████████▉| 2187/2193 [56:14<00:07, 1.29s/it]
[rank1]:[E322 01:11:03.777831019 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=435362, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800016 milliseconds before timing out.

You would also see the GPUs and CPUs running at high load when it crashes. Googling turns up many suggestions to increase the timeout. I don't think that's the right fix, because saving a checkpoint is not supposed to take that long.

I experimented with some Nvidia and FSDP settings related to synchronization. Here's the combination that got me out of the crash loop (see the sketch after the list for where these go):

FSDP config:
* limit_all_gathers = true
* sync_module_states = true

Nvidia environment:
* export TORCH_NCCL_HIGH_PRIORITY=1
* export NCCL_P2P_LEVEL=NVL
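For anyone wondering where those knobs actually go, here is a minimal sketch, assuming a plain PyTorch FSDP script launched with torchrun. build_model() and the rank setup are placeholders; if you train through a framework (accelerate, axolotl, etc.), the same two flags usually live in its FSDP config instead.

```python
# Minimal sketch of the workaround settings, assuming a plain PyTorch FSDP
# script launched with torchrun. build_model() is a placeholder.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# NCCL environment -- must be in place before the NCCL process group is
# initialized; exporting these in the launch shell (as above) also works.
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"
os.environ["NCCL_P2P_LEVEL"] = "NVL"

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = build_model()  # placeholder: construct your 14B model here

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,    # throttle all-gathers so ranks don't drift apart
    sync_module_states=True,   # broadcast rank-0 module states at wrap time
)
```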