RunPod, 8mo ago
DreamGen

4xH100 pod is stuck -- can't restart or stop

I am still connected over SSH, but the pod can't be used because of network issues. The RunPod UI can't reach it either (it shows "waiting for logs"). Overnight the pod failed with:
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9442948, OpType=_ALLGATHER_BASE, NumelIn=196608, NumelOut=786432, Timeout(ms)=1800000) ran for 1800631 milliseconds before timing out.
Wasting a lot of money.
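For anyone reading the watchdog line above, here is a minimal sketch that pulls out its fields with a regular expression. The function name `parse_watchdog_line` and the pattern are illustrative assumptions written against this one log line, not part of PyTorch or RunPod tooling. It only shows that the configured timeout was 30 minutes and that the output size is 4x the input size, which is consistent with an all-gather across the 4 GPUs.

```python
import re

# The log line from the post above, copied verbatim.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=9442948, OpType=_ALLGATHER_BASE, NumelIn=196608, NumelOut=786432, "
    "Timeout(ms)=1800000) ran for 1800631 milliseconds before timing out."
)

# Hypothetical pattern, written only against the format seen in this one line.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the watchdog fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every field except the operation type is an integer.
    return {k: (v if k == "op" else int(v)) for k, v in match.groupdict().items()}


info = parse_watchdog_line(LOG_LINE)
print(info["rank"], info["op"])                  # which rank and collective timed out
print(info["timeout_ms"] / 60_000, "minutes")    # configured timeout in minutes
print(info["numel_out"] // info["numel_in"])     # output/input element ratio
```

This only decodes the message; it does not say why the collective hung, and it does nothing to unstick the pod itself.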
3 Replies
Bones, 6mo ago
How did you solve this?
nerdylive, 6mo ago
Oh, create a support ticket, or maybe just try stopping it first.
haris, 6mo ago
@Bones are you facing a similar issue? This one seems to have slipped through the cracks when it was first posted, but we would love to make sure everything is running smoothly for you.