Not 1:1 port mappings for multinode training
Hi, I am trying to run multinode distributed training across multiple machines, but it isn't working. I think this is because the port mappings aren't 1:1: when I launch with torchrun (torch.distributed) and specify a port, if I choose the internal port, the other machine sends data to the wrong external port; and if I choose the external port, my master node doesn't listen on the correct internal port.
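For reference, here's roughly how I'm launching it - a minimal sketch with placeholder IPs and ports, not my real values:

```bash
# Minimal sketch of the launch (placeholder IPs/ports, 1 GPU per node).
# Every node normally passes the SAME --master_addr / --master_port:

# Node 0 (master)
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
         --master_addr=203.0.113.10 --master_port=29500 train.py

# Node 1 (worker)
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
         --master_addr=203.0.113.10 --master_port=29500 train.py

# Say the master pod maps external port 12345 -> internal port 29500 (made-up numbers):
# - with --master_port=29500 the master binds the right internal port, but the worker
#   dials external port 29500, which isn't mapped;
# - with --master_port=12345 the worker reaches the pod, but the rendezvous store binds
#   12345 inside the pod, so the traffic forwarded to 29500 finds nothing listening.
```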
Is this a problem other people have had, and is there a solution?
Thanks in advance!
@Bob
Escalated To Zendesk
The thread has been escalated to Zendesk!
I'm not too familiar with this, but from what I gather, the external port of the "child worker" pods would have to be the same as the master's internal port? How would that work?
Btw, I think it's best not to do cluster training like this unless there is a private networking connection between the machines.
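If you do want to experiment anyway, one generic workaround - just a sketch, not anything specific to our platform - is to tunnel the rendezvous port over SSH so both nodes can point at the same address:port. The `<...>` values are placeholders:

```bash
# On the worker: forward local port 29500 to the master pod's internal port 29500.
# <external-ssh-port>, <master-external-ip>, and the username are placeholders.
ssh -N -f -L 29500:localhost:29500 -p <external-ssh-port> user@<master-external-ip>

# Then both nodes can agree on one endpoint:
#   master: torchrun ... --master_addr=localhost --master_port=29500 ...
#   worker: torchrun ... --master_addr=localhost --master_port=29500 ...
# Caveat: this only covers the rendezvous store. NCCL opens additional connections
# between the nodes on other ports, so without a VPN/overlay or 1:1 forwarding the
# collectives themselves may still fail - hence the private-networking recommendation.
```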
We are currently developing a cluster feature that should help with your case in the future. 🙏🏻
Sounds cool - what's the ETA, and is there any way to get involved early? 🙂
I don’t know the ETA yet, we will announce it when it is ready~