Wayne
RunPod
Created by bitcurrent on 2/27/2024 in #⛅|pods
[Urgent] failed : Software caused connection abort
socketStartConnect: Connect to 192.168.240.2<60103> failed : Software caused connection abort
45 replies
so the master node should be listening on that port, yes?
so the port number 31421 that i used in the script above is the value for the env var $RUNPOD_TCP_PORT_70003
root@5a24f0a5a0ec:~/ptl# echo $RUNPOD_TCP_PORT_70003
31421
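The lookup above can be sketched in Python. This is a minimal sketch, assuming the external port for an internal TCP port is exposed in a variable named `RUNPOD_TCP_PORT_<internal port>`; only `RUNPOD_TCP_PORT_70003` appears in this thread, so the general naming pattern and the helper name `external_port` are assumptions.

```python
import os


def external_port(internal_port, env=None):
    """Return the externally mapped port for an internal TCP port, or None.

    Assumption: the mapping is exposed as RUNPOD_TCP_PORT_<internal_port>,
    generalized from the single variable shown in the thread.
    """
    env = os.environ if env is None else env
    value = env.get(f"RUNPOD_TCP_PORT_{internal_port}")
    return int(value) if value is not None else None


if __name__ == "__main__":
    # On the pod from the thread this would be expected to print 31421;
    # elsewhere it prints None because the variable is not set.
    print(external_port(70003))
```

Passing an explicit `env` mapping makes the helper easy to check without being on a pod.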
so what should i specify for the tcp port numbers then?
do you mean just not enter 70000 for the port? enter something specific that is not random?
sorry. pls explain "generate external port"?
yes
that is my understanding anyway
i entered 70000+ for both nodes' tcp ports, so they should be symmetrical
and now the client node has failed:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.240.2<60103> failed : Software caused connection abort
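One way to read the last error line is to pull out the address and port NCCL tried to reach and check whether that address is private. This is a rough sketch, not part of the original thread: the helper names are hypothetical, and the regular expression only assumes the `Connect to <ip><<port>>` shape seen in the log line above.

```python
import ipaddress
import re

# Matches the "Connect to 192.168.240.2<60103>" fragment of the log line.
_CONNECT_RE = re.compile(r"Connect to (\d{1,3}(?:\.\d{1,3}){3})<(\d+)>")


def parse_nccl_connect_target(line):
    """Return (host, port) from an NCCL connect-failure line, or None."""
    match = _CONNECT_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_private_address(host):
    """True if host is in a private range such as 192.168.0.0/16."""
    return ipaddress.ip_address(host).is_private


if __name__ == "__main__":
    line = ("socketStartConnect: Connect to 192.168.240.2<60103> "
            "failed : Software caused connection abort")
    target = parse_nccl_connect_target(line)
    if target is not None:
        host, port = target
        print(host, port, is_private_address(host))
```

For the line in this thread the address is private, which may indicate that the client was handed a container-internal address rather than one it can route to. That is an inference from the log, not something confirmed in the thread.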
so the master is listening on 31421
root@5a24f0a5a0ec:~/ptl# netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      39/nginx: master pr
tcp        0      0 0.0.0.0:7861            0.0.0.0:*               LISTEN      39/nginx: master pr
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      57/sshd: /usr/sbin/
tcp        0      0 0.0.0.0:8888            0.0.0.0:*               LISTEN      64/python
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN      39/nginx: master pr
tcp        0      0 192.168.240.2:55297     0.0.0.0:*               LISTEN      270957/python
tcp        0      0 0.0.0.0:8001            0.0.0.0:*               LISTEN      39/nginx: master pr
tcp        0      0 0.0.0.0:9091            0.0.0.0:*               LISTEN      39/nginx: master pr
tcp        0      0 192.168.240.2:38407     0.0.0.0:*               LISTEN      270957/python
tcp        0      0 192.168.240.2:60103     0.0.0.0:*               LISTEN      270957/python
tcp        0      0 127.0.0.11:44521        0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      57/sshd: /usr/sbin/
tcp6       0      0 :::8888                 :::*                    LISTEN      64/python
tcp6       0      0 :::31421                :::*                    LISTEN      270957/python
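The listing shows what the master is listening on locally; whether the other node can actually reach a given port is a separate question. A minimal sketch of a reachability check that could be run from the client node follows. The helper name `can_connect` is hypothetical, and the host and port to probe are whatever the client is expected to use.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (placeholders, not values confirmed by the thread):
#   can_connect("<master address as seen from the client>", 31421)
```

A `False` result for a port that the listing shows in `LISTEN` state would suggest the port is bound only on an internal interface or is not mapped externally, though a firewall could produce the same result.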
No description
master:
No description
this is master
1 sec
oh sorry, i just restarted
@Papa Madiator