Not 1:1 port mappings for multinode training
Hi, I am trying to run multinode distributed training across multiple machines, but it isn't working. I think this is because the port mappings aren't 1:1: when I launch with torchrun (torch.distributed) and specify a port, if I choose the internal port, the other machine sends data to the wrong external port; and if I choose the external port, my master node doesn't listen on the correct internal port.
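For reference, here's roughly how I'm launching it - a minimal sketch with placeholder IPs and ports, not my real values:

```bash
# Minimal sketch of the launch (placeholder IPs/ports, 1 GPU per node).
# Every node normally passes the SAME --master_addr / --master_port:

# Node 0 (master)
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
         --master_addr=203.0.113.10 --master_port=29500 train.py

# Node 1 (worker)
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
         --master_addr=203.0.113.10 --master_port=29500 train.py

# Say the master pod maps external port 12345 -> internal port 29500 (made-up numbers):
# - with --master_port=29500 the master binds the right internal port, but the worker
#   dials external port 29500, which isn't mapped;
# - with --master_port=12345 the worker reaches the pod, but the rendezvous store binds
#   12345 inside the pod, so the traffic forwarded to 29500 finds nothing listening.
```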
Is this a problem other people have had, and is there a solution?
Thanks in advance!
@Bob
Escalated To Zendesk
The thread has been escalated to Zendesk!
I'm not too familiar with this, but from what I gather, the external port of the "child worker" pods would have to be the same as the master's internal port? How would that work?
Btw, I think it's best not to do cluster training like this unless there is a private networking connection between the machines.
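If you do want to experiment anyway, one generic workaround - just a sketch, not anything specific to our platform - is to tunnel the rendezvous port over SSH so both nodes can point at the same address:port. The `<...>` values are placeholders:

```bash
# On the worker: forward local port 29500 to the master pod's internal port 29500.
# <external-ssh-port>, <master-external-ip>, and the username are placeholders.
ssh -N -f -L 29500:localhost:29500 -p <external-ssh-port> user@<master-external-ip>

# Then both nodes can agree on one endpoint:
#   master: torchrun ... --master_addr=localhost --master_port=29500 ...
#   worker: torchrun ... --master_addr=localhost --master_port=29500 ...
# Caveat: this only covers the rendezvous store. NCCL opens additional connections
# between the nodes on other ports, so without a VPN/overlay or 1:1 forwarding the
# collectives themselves may still fail - hence the private-networking recommendation.
```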
We are currently developing a cluster feature that should help with your case in the future. 🙏🏻
Sounds cool - what's the ETA, and is there any way to get involved early? 🙂
I don’t know the ETA yet, we will announce it when it is ready~