Multi-node training on Runpod: ports
I'm trying to train a distributed model across multiple nodes: 2 pods with 8x 4090 GPUs each. We can't train using torchrun because it needs the same TCP port on every machine, but Runpod assigns each pod a random external port.
Example commands:
NODE A:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
NODE B:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
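For what it's worth, only rank 0 actually binds --master_port; every other process just connects to it. So one workaround worth testing (a sketch, not officially supported, and assuming the pod exposes internal port 29500 as 207.189.112.61:52616) is an asymmetric launch where node A binds the internal port and node B dials the external mapping:
NODE A (rank 0 binds the pod-internal port):
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="127.0.0.1" --master_port=29500 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
NODE B (connects through the external mapping):
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
Even if the rendezvous connects, note that NCCL then opens its own node-to-node connections on additional ports, so a single mapped port may still not be enough without fuller inter-node connectivity.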
The external port is always randomized and not symmetric.
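A quick way to sanity-check whether a mapped port actually forwards into the pod (hypothetical numbers: internal 29500 exposed as 207.189.112.61:52616):
NODE A: python3 -m http.server 29500
NODE B: curl http://207.189.112.61:52616
If the curl returns a directory listing, plain TCP traffic is reaching node A through the mapping, so at least the rendezvous port is workable.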
We have plans to explore multi-node training in the future by enabling internal port communication.
Thanks, this is very important for my team and our development. Multi-node training is actually one of the main ways companies train models.
Hello, any updates on this? I'm trying to set up multi-node training using DeepSpeed and found little to no information about this online. Thanks!
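In case it helps while you wait: the ssh-based deepspeed launcher is hard to use across pods with randomized ports, but DeepSpeed can also be launched per node with torchrun and picks up the same MASTER_ADDR/MASTER_PORT environment variables, so the asymmetric launch sketched above should carry over (script name and ports here are placeholders, again assuming internal 29500 maps to external 52616):
NODE A: torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="127.0.0.1" --master_port=29500 your_deepspeed_script.py
NODE B: torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 your_deepspeed_script.py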
It's still in progress; the current estimate is sometime in May. It will be a whole new feature, "Training Cluster".
@flash-singh any update on this?
We are currently doing alpha testing; if anyone wants to be part of that, let me know. Open beta will likely be sometime in August.
Happy to join testing
happy to join as well
@flash-singh sent you a DM
Send me a DM with more details on how you plan to use the multi-node cluster feature.