_manuelcerezo
RunPod
Created by _manuelcerezo on 6/20/2024 in #⛅|pods
Snapshot from Pod
We need to create a snapshot of a pod instance so that we can run it in another cloud location supported by Runpod.
3 replies
RunPod
Created by _manuelcerezo on 1/14/2024 in #⛅|pods
Multinode training Runpod ports
I'm trying to train a model with distributed multi-node training on 2 pods with 8x 4090 GPUs each. We can't train using torchrun because it needs the same TCP port on every machine, but Runpod assigns each pod a random external port. Example commands:

NODE A:
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"

NODE B:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 --master_addr="207.189.112.61" --master_port=52616 scripts/train.py run --config_file "['configs/hyper_parameters.yaml','configs/network.yaml','configs/transforms_train.yaml','configs/transforms_validate.yaml','configs/transforms_infer.yaml']"
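Before launching torchrun on the second node, it may help to confirm that the master address and port are reachable from it at all. The following is a minimal sketch using only the Python standard library; the helper name `can_reach` is hypothetical, and the check assumes something is already listening on the master port (for example, torchrun already started on node A). It does not solve the port-mapping problem, it only shows whether a plain TCP connection gets through.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests plain TCP
    connectivity, not whether torchrun's rendezvous will succeed.
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (run on node B, with the values from the torchrun commands):
#   can_reach("207.189.112.61", 52616)
```

If this returns False while node A is waiting for workers, the external port assigned to the pod most likely does not map to the port torchrun is bound to inside the container.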
7 replies