Multinode training Runpod ports

I'm trying to train a distributed model across multiple nodes: 2 pods with 8x 4090 GPUs each. We can't train with torchrun, because it needs the same TCP port to be reachable on each machine, and Runpod assigned me a random external port. Example commands:

NODE A:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"

NODE B:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
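A quick way to confirm whether the port is the problem is to test, from node B, whether the address and port passed as --master_addr and --master_port can be reached at all. The following is a minimal sketch using only the Python standard library; the helper name can_connect is hypothetical and not part of torchrun or Runpod. It assumes torchrun is already running on node A, since nothing listens on the port before that.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only checks basic TCP
    reachability and says nothing about the rest of the rendezvous.
    """
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, calling can_connect("207.189.112.61", 52616) on node B while node A is waiting for workers should return True if the external port is actually forwarded to the master process. If it returns False, the workers cannot reach the rendezvous endpoint, whatever the training script does.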
11 Replies
Madiator2011 (11mo ago)
The external port is always randomized and not symmetric.
flash-singh (11mo ago)
We plan to explore multi-node training in the future by enabling internal port communication.
_manuelcerezo (OP, 11mo ago)
Thanks, this is very important for my team and our development. It is currently one of the most important ways companies train models.
gotcha (9mo ago)
Hello, any updates on this? I'm trying to set up multi-node training using DeepSpeed and have found little to no information about this online. Thanks!
flash-singh (9mo ago)
It's still in progress. Right now the estimate is sometime in May. It will be a whole new feature, "Training Cluster".
zkreutzjanz (5mo ago)
@flash-singh any update on this?
flash-singh (5mo ago)
We are currently doing alpha testing. If anyone wants to be part of that, let me know. Open beta will likely be sometime in August.
zkreutzjanz (5mo ago)
Happy to join testing.
upr1ce (5mo ago)
Happy to join as well.
Kaushik - streamtune.io
@flash-singh sent you a DM
flash-singh (5mo ago)
Send me a DM with more details on how you plan to use the multi-node cluster feature.