RunPod
Created by Will on 6/5/2024 in #⛅|pods
Networking Multiple Pods Together
I'm looking to train a distributed model on RunPod. When configuring torch.distributed or jax.distributed, you provide a coordinator_address of the form ip:port. Right now I'm unable to confirm that two pods can communicate with one another. I start one pod, expose a 70000-level port, SSH into it, run ip route to get the local IP, then start a simple Python server with python -m http.server 70000. Then I SSH into the other pod and run curl <pod_1_local_ip>:<pod_1_70000_port>. This consistently fails. My intuition is that the Docker containers don't belong to the same network. To my knowledge, we users don't have the privilege to set up such a network on the datacenter's machines; we can only modify containers on a one-off basis. Any guidance on enabling communication between pods would be greatly appreciated!
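For reference, the curl check described above can also be done from Python, which is handy on images that ship without curl. This is only a minimal sketch: the helper name can_connect is made up for illustration, and the host and port in the usage comment are placeholders, not real pod values. Note that TCP port numbers cannot exceed 65535, so the placeholder uses 8000.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests reachability and
    sends no data.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage from pod 2 (placeholder IP and port, replace with pod 1's values):
#   can_connect("10.0.0.2", 8000)
```

If this returns False while the server on pod 1 is running, it points the same way as the failing curl: the second pod cannot reach that address and port, whatever the underlying cause turns out to be.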
12 replies