RunPod · 11mo ago
echozhou

How to use runpod for multi-machine distributed training?

We requested symmetric ports for both machines and configured them so they can reach each other over SSH, but distributed training still does not work. The commands we are using:

On node 0:
torchrun --nproc_per_node=1 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="216.249.100.66" \
    --master_port=12619 \
    test.py

On node 1:
torchrun --nproc_per_node=1 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="216.249.100.66" \
    --master_port=12619 \
    test.py

We are using RunPod PyTorch 2.1.
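(The thread never shows test.py. As a minimal sketch, not the original script, something like the following is a common way to verify that the two nodes can actually rendezvous and exchange data, since torchrun already exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT to each process:)

```python
# minimal stand-in for test.py: verifies that all ranks can rendezvous
# and communicate. torchrun sets RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR and MASTER_PORT in the environment.
import os
import torch
import torch.distributed as dist

def main():
    # NCCL needs a GPU per process; fall back to gloo for a CPU-only check.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    if backend == "nccl":
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    # Each rank contributes its rank number; after all_reduce every rank
    # should hold the sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

(If this all_reduce hangs or times out, the problem is likely network reachability between the pods rather than the training code itself.)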
2 Replies
ashleyk · 11mo ago
I assume the IP and port are the public IP and public port mapping for SSH?
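(If that is the setup, a quick way to check whether the master endpoint is actually reachable, independently of torchrun, is a plain TCP connect from the second node while something listens on the first. A sketch; the IP and port are the ones from the question and should be replaced by whatever the pod really exposes:)

```python
# check_port.py -- run on node 1 (the non-master pod).
# Before running it, start a throwaway listener on node 0 so there is
# something to connect to, e.g.:  python -m http.server 12619
import socket

MASTER_ADDR = "216.249.100.66"  # public IP from the question
MASTER_PORT = 12619             # port from the question

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    try:
        s.connect((MASTER_ADDR, MASTER_PORT))
        print(f"TCP connection to {MASTER_ADDR}:{MASTER_PORT} succeeded")
    except OSError as exc:
        print(f"TCP connection to {MASTER_ADDR}:{MASTER_PORT} failed: {exc}")
```

(If the connect fails from node 1, the public IP / port mapping does not reach the master pod and torchrun has no chance of rendezvousing; if it succeeds, the issue is more likely in the torchrun arguments or in which network interface the backend binds to.)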
luna · 10mo ago
Hi @echozhou, @ashleyk, is it possible to do distributed training across multiple RunPod machines?