cw
RunPod
•Created by cw on 3/21/2025 in #⛅|pods-clusters
LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log
I filed a support ticket already
5 replies
I am documenting the issue and the workaround in case somebody finds it useful.
The crash (really a kill after a timeout) tends to happen when the GPUs synchronize to do something, like saving a checkpoint, but it can also happen sporadically during the training loop. You may not encounter this problem with a small workload; I ran into it all the time when doing full finetuning of a 14B model on 4 GPUs (H100/H200, NVL/SXM).
The key error message from PyTorch:
100%|█████████▉| 2187/2193 [56:14<00:07, 1.29s/it][rank1]:[E322 01:11:03.777831019 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=435362, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800016 milliseconds before timing out.
You would also see the GPUs and CPUs running at high load when it crashes.
Googling turns up many suggestions to increase the timeout. I don't think that's the right fix, because saving a checkpoint is not supposed to take that long.
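For reference, the change people usually mean is the timeout argument to init_process_group. This is a sketch only, not something I did; raising it just masks the stall rather than fixing it.
```python
from datetime import timedelta
import torch.distributed as dist

# The commonly suggested workaround: raise the NCCL collective timeout
# (the default is 30 minutes, matching the 1800000 ms in the log above).
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```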
Instead, I experimented with some NVIDIA and FSDP settings related to synchronization. Here's the combination that got me out of the crash loop (a sketch of how these plug into a training script follows the lists below):
FSDP config:
* limit_all_gathers = true
* sync_module_states = true
NVIDIA/NCCL environment variables:
* export TORCH_NCCL_HIGH_PRIORITY=1
* export NCCL_P2P_LEVEL=NVL
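For anyone using plain PyTorch FSDP (rather than a YAML-based trainer config), here is a minimal sketch of where those settings go. Assumptions on my part: you launch with torchrun (so LOCAL_RANK is set), and build_model() is a placeholder for your own model construction. The environment variables are normally exported in the shell before launch; setting them in-process only works if it happens before the process group is created.
```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# NCCL/NVIDIA environment -- must be in place before init_process_group()
# (normally exported in the shell before torchrun; shown here for completeness)
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"
os.environ["NCCL_P2P_LEVEL"] = "NVL"

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = build_model()  # placeholder for your own (e.g. 14B) model constructor

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,    # rate-limit all-gathers so ranks don't race ahead of each other
    sync_module_states=True,   # broadcast rank-0 module states to all ranks at wrap time
)
```
If you train through a higher-level wrapper instead, the same two options are usually exposed in its FSDP config section, and the environment variables work the same way regardless.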