[Urgent] failed : Software caused connection abort

Can someone help with this error, please? It's causing us a huge problem with our next release. We are trying to connect two different computers with PyTorch and Lightning via TCP ports. I have followed the directions that RunPod advises for opening these ports (>70000):
https://docs.runpod.io/pods/configuration/expose-ports
PyTorch and NCCL appear to start opening the connection just fine, and then we get an exception:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Can anyone give some insight into what may be happening here, please?
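The exception text itself suggests rerunning with NCCL_DEBUG=INFO. A minimal sketch of doing that from Python, assuming the variables are set before PyTorch or Lightning creates the process group. `enable_nccl_debug` is a hypothetical helper, and NCCL_DEBUG_SUBSYS is an optional NCCL setting that is not mentioned in the thread:

```python
import os


def enable_nccl_debug(env=os.environ, subsystems="INIT,NET"):
    """Turn on verbose NCCL logging in the given environment mapping.

    Assumption: this runs before torch.distributed / Lightning initializes NCCL,
    otherwise the settings are read too late to have any effect.
    """
    env["NCCL_DEBUG"] = "INFO"               # named in the error message itself
    env["NCCL_DEBUG_SUBSYS"] = subsystems    # optional: limit output to init and network logs
    return {key: env[key] for key in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS")}


# Usage (at the very top of the training script on both nodes):
# enable_nccl_debug()
```

The extra log lines should show which network interface and address NCCL picks on each node, which is the detail the one-line error hides.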
27 Replies
Madiator2011 · 9mo ago
Are you sure you are connecting to the correct port? Note that the external port is always random.
Wayne · 9mo ago
here is the script for running:
Wayne · 9mo ago
and port 31421 is set as follows:
Wayne · 9mo ago
[image attachment]
Wayne · 9mo ago
That is the compute designated as the master. We run the above script on the non-master node; on the master node, we run a similar script:
Wayne · 9mo ago
Madiator2011 · 9mo ago
I do not have much experience with it. Can you confirm that your master node is not locked to localhost and listens on 0.0.0.0?
@Wayne from what I see in nmap, you do not expose that port, i.e. nothing is listening.
Wayne · 9mo ago
This is what I see with netstat -plnt on the master node:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:7861 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 57/sshd: /usr/sbin/
tcp 0 0 0.0.0.0:8888 0.0.0.0:* LISTEN 64/python
tcp 0 0 0.0.0.0:3001 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:8001 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:9091 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 127.0.0.11:44521 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN 57/sshd: /usr/sbin/
tcp6 0 0 :::8888 :::* LISTEN 64/python
@Papa Madiator
Madiator2011 · 9mo ago
It looks like your master node app is not running.
Wayne · 9mo ago
Oh sorry, I just restarted. One sec, this is the master:
Wayne · 9mo ago
[image attachment]
Wayne · 9mo ago
master:
Wayne · 9mo ago
[image attachment]
Wayne · 9mo ago
root@5a24f0a5a0ec:~/ptl# netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:7861 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 57/sshd: /usr/sbin/
tcp 0 0 0.0.0.0:8888 0.0.0.0:* LISTEN 64/python
tcp 0 0 0.0.0.0:3001 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 192.168.240.2:55297 0.0.0.0:* LISTEN 270957/python
tcp 0 0 0.0.0.0:8001 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 0.0.0.0:9091 0.0.0.0:* LISTEN 39/nginx: master pr
tcp 0 0 192.168.240.2:38407 0.0.0.0:* LISTEN 270957/python
tcp 0 0 192.168.240.2:60103 0.0.0.0:* LISTEN 270957/python
tcp 0 0 127.0.0.11:44521 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN 57/sshd: /usr/sbin/
tcp6 0 0 :::8888 :::* LISTEN 64/python
tcp6 0 0 :::31421 :::* LISTEN 270957/python
so the master is listening on 31421 and now the client node has failed:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.240.2<60103> failed : Software caused connection abort
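The netstat output above shows the rendezvous port 31421 listening on all interfaces, while port 60103 from the error is bound by the same python process to 192.168.240.2 only. A minimal sketch for checking, from the non-master node, which of those endpoints actually accepts TCP connections. `can_connect` is a hypothetical helper, and whether the 192.168.240.2 address is routable between the two pods is not established in the thread:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Usage on the non-master node (the public IP is a placeholder, not from the thread):
# can_connect("<master public IP>", 31421)   # rendezvous port the master listens on
# can_connect("192.168.240.2", 60103)        # endpoint NCCL tried to reach
```

If the first check succeeds and the second fails, the rendezvous works but the address NCCL advertises afterwards is not reachable from the other node.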
Madiator2011 · 9mo ago
Though for some reason it is listening only on IPv6. Are you using symmetrical mapping or the normal one?
Wayne · 9mo ago
I entered 70000+ for both nodes' TCP ports, so they should be symmetrical. That is my understanding, anyway.
Madiator2011 · 9mo ago
Yes, but you will still get a random port number.
Wayne · 9mo ago
yes
Madiator2011 · 9mo ago
So you need to make sure your master node is listening on the generated external port. I do not know your master node script, so I do not know how it works.
Wayne · 9mo ago
Sorry, please explain "generated external port". Do you mean just not entering 70000 for the port, and entering something specific that is not random instead?
Madiator2011 · 9mo ago
70000 is not a valid port number by itself (TCP ports stop at 65535). If you set it, RunPod will generate a random port and only make sure it is symmetric.
Wayne · 9mo ago
So what should I specify for the TCP port numbers then?
Madiator2011 · 9mo ago
If you then print the RUNPOD_TCP_PORT_70000 env variable, you should see the correct port.
1. Set up the TCP ports if you want them to be symmetric.
2. Print the values of the env variables; for example, for 70000 it will be RUNPOD_TCP_PORT_70000.
3. Use that port and run the master node. Make sure your master node is listening on 0.0.0.0.
Anyway, it is late here, so I will go to sleep, but if you can't figure it out I can help tomorrow.
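The steps above can be sketched in Python as follows. The RUNPOD_TCP_PORT_<n> variable name comes from the thread; `resolve_external_port` and `rendezvous_env` are hypothetical helpers, and the sketch assumes the training script reads MASTER_ADDR and MASTER_PORT the way torch.distributed's environment-variable initialization does:

```python
import os


def resolve_external_port(internal_port: int, env=os.environ) -> int:
    """Look up the port RunPod generated for a requested port such as 70000."""
    name = f"RUNPOD_TCP_PORT_{internal_port}"
    value = env.get(name)
    if value is None:
        raise KeyError(f"{name} is not set; is that TCP port exposed on this pod?")
    return int(value)


def rendezvous_env(master_addr: str, internal_port: int, env=os.environ) -> dict:
    """Build MASTER_ADDR / MASTER_PORT values using the generated port."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(resolve_external_port(internal_port, env)),
    }


# Usage (the address is a placeholder; 70003 is the port requested in this thread):
# os.environ.update(rendezvous_env("<master public IP>", 70003))
```

Reading the port from the environment rather than hard-coding it keeps the script valid if the pod is recreated and RunPod assigns a different number.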
Wayne · 9mo ago
root@5a24f0a5a0ec:~/ptl# echo $RUNPOD_TCP_PORT_70003
31421
So the port number 31421 that I used in the script above is the value of the env var $RUNPOD_TCP_PORT_70003, and the master node should be listening on that port, yes?
Madiator2011 · 9mo ago
Yes. By the way, from what I have read in the docs, the other nodes also need an exposed TCP port.
Wayne · 9mo ago
socketStartConnect: Connect to 192.168.240.2<60103> failed : Software caused connection abort
Madiator2011 · 9mo ago
@Wayne I did not get a response from you, but if you still want to try to get it working, we can get in touch and debug it together.