Bob
Very inconsistent performance
I recently started using Runpod and am a fan of the setup simplicity and pricing. However, I have noticed large inconsistencies in performance: identical training runs can take up to 3x longer to finish. I am on the Secure Cloud. Do you know what might cause this?
9 replies
Not 1:1 port mappings for multinode training
Hi, I am trying to run multinode distributed training across multiple machines, but it isn't working. I think this is because of how the port is handled when I launch with the torchrun command: if I specify the internal port, the other machines send data to the wrong external port; if I specify the external port, my master node doesn't listen on the correct internal port.
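To make the mismatch concrete, here is a minimal sketch of what I mean. It is only a model of the problem, not a fix. It assumes (as I understand it) that the launcher takes a single port value that the master binds to and the workers connect to. The `PortMapping` class, the helper names, the IP address, and both port numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortMapping:
    """One exposed TCP port on the master machine (all values hypothetical)."""
    public_ip: str
    internal_port: int  # port the process binds to inside the container
    external_port: int  # port the provider exposes on the public IP


def listen_address(mapping: PortMapping) -> tuple[str, int]:
    """Address the master node has to bind to."""
    return ("0.0.0.0", mapping.internal_port)


def connect_address(mapping: PortMapping) -> tuple[str, int]:
    """Address the other machines have to connect to."""
    return (mapping.public_ip, mapping.external_port)


def single_port_works(mapping: PortMapping) -> bool:
    """True only if one port number serves both binding and connecting."""
    return mapping.internal_port == mapping.external_port


# Example values: 203.0.113.10 is a documentation-only address.
mapping = PortMapping("203.0.113.10", internal_port=29500, external_port=10437)
print(listen_address(mapping))
print(connect_address(mapping))
print(single_port_works(mapping))
```

With a mapping like this, no single port value satisfies both sides, which is the situation I seem to be in.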
Is this a problem other people have had and is there a solution?
Thanks in advance!
7 replies