Networking Multiple Pods Together
I'm looking to train a distributed model on runpod. When configuring torch.distributed or jax.distributed, you provide a coordinator_address of the form ip:port. Right now I'm unable to confirm that two pods can communicate with one another. I start one pod and expose a 70000-level port, SSH into it, run ip route to get the local IP, then start a simple Python server with python -m http.server 70000. Then I SSH into the other pod and run curl <pod_1_local_ip>:<pod_1_70000_port>.
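The check described above can also be done from Python rather than with curl. What follows is only a minimal sketch using the standard library: the function names parse_coordinator_address and can_connect are hypothetical helpers, not part of torch, jax, or any runpod tooling. Note that valid TCP ports run from 1 to 65535, so this sketch rejects a value such as 70000 before attempting a connection.

```python
import socket


def parse_coordinator_address(address):
    """Split an 'ip:port' string into (host, port) and validate the port.

    Hypothetical helper for illustration; assumes an IPv4 address or hostname.
    """
    host, sep, port_text = address.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'ip:port', got {address!r}")
    port = int(port_text)  # raises ValueError if the port is not a number
    if not 0 < port <= 65535:
        raise ValueError(f"port {port} is outside the valid range 1-65535")
    return host, port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

On the second pod, calling can_connect(*parse_coordinator_address("<pod_1_local_ip>:<port>")) with the first pod's values would return False if the two containers cannot reach each other, which matches the failing curl.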
This consistently fails. My intuition is that the Docker containers don't belong to the same network. To my knowledge, we users don't have the privileges to set up such a network on the datacenter's machines; we can only modify containers on a one-off basis.
Any guidance on enabling communication between pods would be greatly appreciated!
8 Replies
It would be most helpful if someone could explain how to find the host IP address. Given that we only have access to the containers, there doesn't seem to be any way to get the host IP.
You can request pods with a public IP in community cloud
It'll be there under the connect button if you expose a TCP port
I'm looking for a local IP: the IP of the host machine that the container is running inside of
Oh, networking between pods. I think containers aren't connected together in a private network even though they are in the same secure cloud DC (not sure)
But it's best to open a support ticket to ask about private networking between pods
Yeah private networking between pods is not supported.
@Will which GPUs are you using?
Any plan to support that soon?
Yes, we are planning to release a global networking feature very soon, which will allow your pods to communicate with each other seamlessly. This feature should be ready within the next couple of weeks.