Global Networking
I am trying to use Global Networking. I have 1 master and 2 worker GPUs, all on different Pods but in the same data centre. It seems that the ports between the Pods are not open and only port 22 is. I also tried specifying a TCP port to expose when starting up the Pods, but that does not work either. I need to allow communication between the Pods for torch.distributed.
the code snippet here does not work for me: https://blog.runpod.io/runpod-launches-global-networking-to-enable-cross-data-center-communication/
RunPod Blog
Announcing Global Networking For Cross-Data Center Communication
RunPod is pleased to announce its launch of our Global Networking feature, which allows for cross-data center communication between pods. When a pod with the feature is deployed, your pods can communicate with each other over a virtual internal network facilitated by RunPod. This means that you can have pods
What could be a solution? Do I need to set up SSH between the Pods?
You can't use your .runpod.internal subdomains at all? Just trying to understand the issue, sorry.
No worries. It seems the Pods cannot communicate with each other on port 29400.
I run nc -vv {}.runpod.internal 29400 on Pod B, with the Global Networking hostname for Pod A, and it says "port 29400 (tcp) failed: Connection refused".
Running ping {}.runpod.internal 29400 on Pod B against Pod A is fine.
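For anyone hitting the same symptom: "Connection refused" usually means the host was reached but nothing was listening on that port, while a blocked port more often shows up as a timeout. Here is a minimal Python sketch of the same check as the nc command, using only the standard library. The function name check_tcp and the timeout value are made up for illustration and are not part of any RunPod tooling.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and report 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection resolves the hostname and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all: the port may be filtered or the host unreachable.
        return "timeout"
    except OSError as exc:
        # Covers DNS failures and other network errors.
        return f"error: {exc}"
```

On Pod B you would call check_tcp with Pod A's Global Networking hostname and port 29400. A "refused" result before the master process has started is expected, since nothing is listening yet.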
Ah OK, I figured it out. It was an issue in my script.
I thought I had set MASTER_ADDR and MASTER_PORT before launching it, and was using os.environ to access them. But I hadn't, so the torch.distributed init was not pointing at the right hostname and port. Sorry!
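A small guard like the one below can catch this kind of mistake early. It is only a sketch: the helper name rendezvous_from_env is hypothetical, and the hostname in the usage note is a placeholder. It fails fast when MASTER_ADDR or MASTER_PORT is missing instead of letting the init quietly use the wrong target.

```python
import os


def rendezvous_from_env(env=os.environ):
    """Return (MASTER_ADDR, MASTER_PORT) from the environment, or raise if unset."""
    required = ("MASTER_ADDR", "MASTER_PORT")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # MASTER_PORT arrives as a string; convert it so a bad value fails here.
    return env["MASTER_ADDR"], int(env["MASTER_PORT"])
```

For example, with MASTER_ADDR set to a placeholder such as pod-a.runpod.internal and MASTER_PORT set to 29400, the call returns that hostname and the integer 29400. Calling it right before torch.distributed.init_process_group (which reads these variables when using the env:// init method) and printing the result makes it obvious which address each Pod is about to use.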