Multi-node training on Runpod: ports
I'm trying to train a distributed model across multiple nodes: 2 pods with 8x 4090 GPUs each. We can't train using torchrun because it needs the same TCP port on every machine, but Runpod assigns each pod a random external port.
Example commands:
NODE A:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
NODE B:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
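For what it's worth, only rank 0 actually binds --master_port; every other process just connects to it. So one workaround worth testing (a sketch, not officially supported, and assuming the pod exposes internal port 29500 as 207.189.112.61:52616) is an asymmetric launch where node A binds the internal port and node B dials the external mapping:
NODE A (rank 0 binds the pod-internal port):
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="127.0.0.1" --master_port=29500 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
NODE B (connects through the external mapping):
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
Even if the rendezvous connects, note that NCCL then opens its own node-to-node connections on additional ports, so a single mapped port may still not be enough without fuller inter-node connectivity.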
The external port is always randomized and not symmetric.
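A quick way to sanity-check whether a mapped port actually forwards into the pod (hypothetical numbers: internal 29500 exposed as 207.189.112.61:52616):
NODE A: python3 -m http.server 29500
NODE B: curl http://207.189.112.61:52616
If the curl returns a directory listing, plain TCP traffic is reaching node A through the mapping, so at least the rendezvous port is workable.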
We have plans to explore multi-node training in the future by enabling internal port communication.
Thanks, this is very important for my team and our development. Multi-node training is actually one of the main ways companies train models.
Hello, any updates on this? I'm trying to set up multi-node training using DeepSpeed and found little to no information about this online. Thanks!
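In case it helps while you wait: the ssh-based deepspeed launcher is hard to use across pods with randomized ports, but DeepSpeed can also be launched per node with torchrun and picks up the same MASTER_ADDR/MASTER_PORT environment variables, so the asymmetric launch sketched above should carry over (script name and ports here are placeholders, again assuming internal 29500 maps to external 52616):
NODE A: torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="127.0.0.1" --master_port=29500 your_deepspeed_script.py
NODE B: torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 your_deepspeed_script.py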
It's still in progress; the current estimate is sometime in May. It will be a whole new feature, "Training Cluster".
@flash-singh any update on this?
We are currently doing alpha testing; if anyone wants to be part of that, let me know. Open beta will likely be sometime in August.
Happy to join testing
happy to join as well
@flash-singh sent you a DM
Send me a DM with more details on how you plan to use the multi-node cluster feature.