bitcurrent
RunPod
Created by bitcurrent on 2/27/2024 in #⛅|pods
[Urgent] failed : Software caused connection abort
Can someone help with this error, please? It's causing a huge problem for our next release. We're trying to connect two different computers with PyTorch and Lightning over TCP ports. I have followed the directions RunPod gives for opening these ports (>70000) at https://docs.runpod.io/pods/configuration/expose-ports and PyTorch and NCCL appear to start opening the connection just fine, but then we get an exception:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Can anyone give some insight into what may be happening here please?
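For anyone debugging the same thing, here is a minimal sketch of two checks that might narrow it down. It is only an assumption about where to look, not a confirmed fix. It turns on the verbose NCCL logging that the error message itself suggests, and it defines a hypothetical helper, `can_connect`, that tests whether a plain TCP connection to the reported address and port works at all, outside of NCCL.

```python
import os
import socket

# The error message suggests running with NCCL_DEBUG=INFO.
# Set it before torch.distributed creates the process group.
os.environ.setdefault("NCCL_DEBUG", "INFO")


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper for illustration; it only checks reachability
    and says nothing about NCCL itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, aborted, unreachable, and timed-out connections.
        return False
```

Calling `can_connect("192.168.240.2", 39817)` from the other pod, while the first process is still waiting, would show whether that address is reachable at all. Note that 192.168.240.2 is a private address and the port NCCL picks changes on each run. If the verbose log shows NCCL selecting a network interface the other machine cannot reach, NCCL's `NCCL_SOCKET_IFNAME` environment variable is the documented way to pin the interface, though whether that applies here is a guess.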
45 replies