RunPod · 4mo ago
Bob

Not 1:1 port mappings for multinode training

Hi, I am trying to run multinode distributed training across multiple machines, but it isn't working. I think this is because the port mappings are not 1:1: when I launch with the torchrun command and specify the master port, choosing the internal port means the other machines send data to the wrong external port, while choosing the external port means my master node doesn't listen on the correct internal port. Is this a problem other people have had, and is there a solution? Thanks in advance!
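To make the mismatch concrete, here is a minimal sketch (not a confirmed fix) of giving each node a different `--master_port`: the rank 0 node binds the internal port, and the other nodes connect to the externally mapped one. It assumes torchrun's default static rendezvous, where the node with rank 0 hosts the store. The address `203.0.113.10`, the external port `40123`, the script name `train.py`, and the helper `build_torchrun_args` are all hypothetical. Even if the rendezvous succeeds this way, backends such as NCCL may open further connections between nodes that a single mapped port does not cover.

```python
from typing import List


def build_torchrun_args(
    node_rank: int,
    nnodes: int,
    nproc_per_node: int,
    master_addr: str,
    internal_port: int,
    external_port: int,
    script: str,
) -> List[str]:
    """Build a torchrun command line for one node (hypothetical helper)."""
    # Assumption: the rank 0 node hosts the rendezvous store, so it must
    # listen on the port as seen from inside its own container.
    # Every other node connects from outside, so it needs the external port
    # that the platform maps onto that internal port.
    port = internal_port if node_rank == 0 else external_port
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={port}",
        script,
    ]


if __name__ == "__main__":
    # Placeholder values: replace with the master pod's public IP and the
    # external port shown for its exposed internal port.
    for rank in range(2):
        args = build_torchrun_args(
            node_rank=rank,
            nnodes=2,
            nproc_per_node=1,
            master_addr="203.0.113.10",
            internal_port=29500,
            external_port=40123,
            script="train.py",
        )
        print(" ".join(args))
```

Running the printed command for rank 0 on the master pod and the one for rank 1 on the worker pod would be the experiment to try; whether training then proceeds depends on the PyTorch version and the communication backend.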
5 Replies
Poddy · 4mo ago
@Bob
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive · 4mo ago
I'm not very familiar with this, but from what I understand, the external port of the "child worker" pods should be the same as the internal port of the master? How does that work? By the way, I think it's best not to do cluster training like this unless there is some private networking connection.
yhlong00000 · 4mo ago
We are currently developing a cluster feature that will help your case in the future.🙏🏻
Bob (OP) · 4mo ago
Sounds cool. What's the ETA, and is there any way to get involved early? 🙂
yhlong00000 · 4mo ago
I don't know the ETA yet; we will announce it when it is ready~