_manuelcerezo
Multinode training Runpod ports
I'm trying to train a distributed model across multiple nodes: 2 pods with 8x RTX 4090 GPUs each. We can't train using torchrun, because it needs the same TCP port on each machine, but RunPod assigned me a random external port.
Command example:
NODE A:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
NODE B:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
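One way to narrow this down is to check from NODE B whether the master address and port in the commands above accept a TCP connection at all. Below is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, not part of torchrun or RunPod, and it only tests raw TCP reachability (something must already be listening on NODE A for it to succeed).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical diagnostic helper: it says nothing about whether torchrun's
    rendezvous will work, only whether the port is reachable over TCP.
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call from NODE B, using the values from the commands above:
#   can_connect("207.189.112.61", 52616)
```

If this returns False while torchrun is already running on NODE A, the port given as --master_port is probably not the one exposed externally, which would match the random external port assignment described above.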