Will
Networking Multiple Pods Together
I'm looking to train a distributed model on RunPod. When configuring torch.distributed or jax.distributed, you provide a coordinator_address of the form ip:port. Right now I'm unable to confirm that two pods can communicate with one another. I start one pod and expose a 70000-level port, SSH into it, run ip route to get the local IP, then start a simple Python server with python -m http.server 70000. Then I SSH into the other pod and run curl <pod_1_local_ip>:<pod_1_70000_port>.
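For anyone reproducing this, the same check can be sketched in plain Python instead of http.server plus curl, which separates "nothing is listening" from "the network path is blocked". This is only a minimal sketch: can_connect and open_listener are hypothetical helper names, not part of any RunPod, torch, or jax API. Valid TCP ports top out at 65535, so the sketch lets the OS pick a free port rather than assuming a specific one.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def open_listener(host: str = "0.0.0.0", port: int = 0) -> socket.socket:
    """Bind a listening TCP socket; port 0 asks the OS for any free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    return server


if __name__ == "__main__":
    # Local self-check on one machine. Across pods, run open_listener() on
    # the first pod and can_connect(<pod_1_ip>, <port>) on the second.
    server = open_listener("127.0.0.1")
    port = server.getsockname()[1]
    print("listener port:", port)
    print("reachable:", can_connect("127.0.0.1", port))
    server.close()
```

If the self-check passes on each pod but can_connect fails between them, the listener itself is fine and the problem is in the path between the two containers.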
This consistently fails. My intuition is that the Docker containers don't belong to the same network. To my knowledge, we users don't have the privileges to set up such a network on the datacenter's machines; we can only modify containers on a one-off basis.
Any guidance on enabling communication between pods would be greatly appreciated!