[Urgent] failed : Software caused connection abort
Can someone help with this error please? It's causing us a huge problem with our next release.
Trying to connect two different computers with PyTorch and Lightning via TCP ports. I have followed the directions that RunPod advises for opening these ports (>70000):
https://docs.runpod.io/pods/configuration/expose-ports
PyTorch and NCCL appear to start opening the connection just fine, and then we get an exception:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. Last error: socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Can anyone give some insight into what may be happening here please?
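The exception text itself suggests rerunning with NCCL_DEBUG=INFO. A minimal sketch of switching that on from Python, assuming it runs before the process group is created on each node; the helper name enable_nccl_debug is hypothetical and not part of PyTorch or NCCL:

```python
import os
from typing import MutableMapping


def enable_nccl_debug(env: MutableMapping[str, str] = os.environ) -> str:
    """Turn on verbose NCCL logging unless a level is already configured.

    NCCL reads NCCL_DEBUG from the process environment, so this must run
    before torch.distributed / Lightning initialise the NCCL backend.
    """
    # setdefault keeps any level the user already exported (e.g. WARN).
    env.setdefault("NCCL_DEBUG", "INFO")
    return env["NCCL_DEBUG"]
```

Exporting the variable in the shell before launching the script should have the same effect.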
27 Replies
Are you sure you are connecting to the correct port? Note that the external port is always random.
and port 31421 is set as follows:
that is the machine designated as the master
we run the above script on the non-master node
on the master node, we run the similar script :
I do not have much experience with it
Can you confirm that your master node is not locked to localhost and listens on 0.0.0.0?
@Wayne from what I see in nmap, you do not expose that port, i.e. nothing is listening
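The nmap check above can also be approximated from the client node with a few lines of Python. This is only a sketch: can_connect is a hypothetical helper, and it only tells you whether a TCP handshake to the given address and port succeeds, not whether NCCL will work end to end.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, calling can_connect with the master's public IP and its external port (31421 in this thread) from the non-master node should return True once the master is listening on 0.0.0.0.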
this is what I see with
netstat -plnt
on the master node:
@Papa Madiator it's like your master node app is not running
oh sorry, i just restarted
1 sec
this is master
master:
so the master is listening on 31421
and now the client node has failed:
though for some reason only on ipv6
Are you using symmetrical mapping or the normal one?
i entered 70000+ for both nodes' tcp ports, so they should be symmetrical
that is my understanding anyway
Yes, but you will still get a random port number
yes
So you need to make sure your master node is listening on the generated external port
I do not know your master node script, so I do not know how it works
sorry. pls explain "generate external port"?
do you mean just not enter 70000 for the port? enter something specific that is not random?
70000 is not a valid port number (TCP ports stop at 65535), so if you set it, RunPod will generate a random port and only make sure it is symmetric
so what should i specify for the tcp port numbers then?
If you then print the RUNPOD_TCP_PORT_70000 env variable, you should see the correct port
1. Set up the TCP ports, using 70000+ values if you want them to be symmetric
2. Print the values of the env variables; for example, for 70000 it will be RUNPOD_TCP_PORT_70000
3. Use that port to run the master node, and make sure your master node is listening on 0.0.0.0
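The steps above can be sketched in Python as follows. The RUNPOD_TCP_PORT_<port> variable name comes from this thread; resolve_external_port and configure_master are hypothetical helper names, and the sketch assumes the job is initialised through torch.distributed's env:// method, which reads MASTER_ADDR and MASTER_PORT from the environment.

```python
import os
from typing import MutableMapping


def resolve_external_port(
    internal_port: int, env: MutableMapping[str, str] = os.environ
) -> int:
    """Look up the external port RunPod generated for a requested TCP port."""
    name = f"RUNPOD_TCP_PORT_{internal_port}"
    value = env.get(name)
    if value is None:
        raise KeyError(f"{name} is not set; is that TCP port exposed on this pod?")
    return int(value)


def configure_master(
    master_addr: str, internal_port: int, env: MutableMapping[str, str] = os.environ
) -> None:
    """Export the rendezvous address and port before the process group starts."""
    env["MASTER_ADDR"] = master_addr
    env["MASTER_PORT"] = str(resolve_external_port(internal_port, env))
```

On the pods in this thread, resolve_external_port(70003) would be expected to return 31421, since that is the value reported for $RUNPOD_TCP_PORT_70003. The address passed to configure_master is whatever address the other node can actually reach.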
Anyway, it's late here so I will go to sleep, though if you can't figure it out I can help tomorrow
so the port number 31421 that i used in the script above is the value for the env var $RUNPOD_TCP_PORT_70003
so the master node should be listening on that port, yes?
Yes. Btw, from what I have read in the docs, the other nodes also need an exposed TCP port
@Wayne I did not get a response from you, but if you still want to try to get it to work, we can get in touch and debug it together.