Multi-node training with multiple pods sharing the same region
I am trying to run multi-node training across multiple pods.
When I launch multiple pods in the same region, they share the same public IP; only the port differs.
How should I specify the correct IP and port for multi-node training?
Does Secure Cloud offer multi-node training?
19 Replies
You shouldn't launch multiple pods for multi-GPU training. Launch a single pod and use the dropdown to select how many GPUs you want to attach to it.
Thank you for your quick reply!
I am trying to test training on 32 GPUs, so I thought I should run 4 pods (each node would have 8 GPUs, which is the maximum number of GPUs available for a single pod).
Can a single pod have 32 GPUs?
If multi-node training on Secure Cloud is impossible, is there any other way to test multi-node training?
I need to measure multi-node training speed before deciding on a long-term contract.
If you need something custom with 32 GPUs, you may want to chat with @JM about arranging something for you.
@ashleyk I got it. Thank you 😄
@JM Hi, could you please confirm whether there is an option for testing multi-node training?
It's for a network bandwidth test, so I need at least two pods in the same region.
2 GPUs per pod are enough to test multi-node training.
Also, about 3 to 4 hours is enough for the test.
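For reference, a pod-to-pod bandwidth test of the kind described here could be sketched roughly as follows. This is only a minimal stand-in using Python's standard `socket` module, not a RunPod tool; the function names `serve_once` and `send_and_measure` are made up for illustration, and it assumes the receiving pod exposes a TCP port that the sending pod can actually reach. It reports a rough sender-side figure, so a dedicated benchmarking tool would give more reliable numbers.

```python
import socket
import time

CHUNK = 1 << 20  # send and receive in 1 MiB pieces


def serve_once(listen_sock: socket.socket) -> int:
    """Accept one connection and count received bytes until the peer closes."""
    conn, _ = listen_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
    return total


def send_and_measure(host: str, port: int, total_bytes: int) -> float:
    """Send total_bytes to (host, port); return rough sender-side Mbit/s."""
    payload = b"\0" * CHUNK
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    elapsed = time.perf_counter() - start
    return sent * 8 / elapsed / 1e6
```

One pod would bind and listen on its exposed port and call `serve_once`, while the other calls `send_and_measure` with the address and externally mapped port of the first pod.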
Oh, that's different; I thought you wanted 32 GPUs.
You can do this yourself without involving @JM. @JM can assist with custom things, but you don't need to involve him in things you can do yourself.
Then how can I test multi-node training (not multi-GPU training) with RunPod? The problem is the same as in my first question.
You probably need to log a GitHub issue for whatever application you're using and ask there.
I've already run multi-node training on my own server without any issues. My question is about the network settings of the pods.
I'm wondering whether multiple pods launched on Secure Cloud can communicate with each other using the same port number.
When I checked, they use the same public IP and cannot communicate over their private IPs.
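A check like the one described here can be reproduced with a small TCP probe. This is a minimal sketch using only Python's standard library; `can_connect` is a hypothetical helper name, and the host and port you pass in (a pod's private IP, or the shared public IP plus that pod's mapped port) are whatever your own pods report, not values taken from this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Running it from pod A against pod B's private address and then against the public IP with B's mapped port shows which path, if any, is reachable. A process must be listening on the target port in pod B for the probe to succeed.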
No, Secure Cloud works the same way as Community Cloud, so it's best to ask the developer of the software you're using how to implement it, as I said.
Ah, I got it.
Thank you for your reply!
Multi-node training is a gap we have; we plan to enable some type of internal networking early this year.
In the meantime, if you need multiple nodes as full servers for 1 month or more (8 for A100/H100 or Ada 6000/L40, and 10 for A6000/A5000/A4000), let us know! We can do bare-metal rentals as well.
@ART01
I'm hoping for 32 A100 GPUs for at least 1 month. Before deciding to rent, I want to test the efficiency of multi-node training on your servers. Could we arrange a brief rental period, perhaps a few hours, to make sure it meets my requirements?
Any updates on this one?
Global networking is planned for launch sometime in early December; multi networking is likely Q1.
What is global networking going to do, or what will it affect?
Pods can talk to each other over an encrypted private connection without location limitations; location will impact throughput but won't hinder communication.
Oh, like private networking, right?
Is the speed basically the same as over the internet?
Speed is the same as over the internet if the pods are in different regions, but if the two pods are within the same region it will try to use local networks, and the max speed you will get is around 500 Mbps.
It's close to using WireGuard; the tunnels are private.
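To put the roughly 500 Mbps figure in perspective for multi-node training, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not a measurement: it assumes gradients are synchronized with a ring all-reduce (where each node sends about 2(N-1)/N times the gradient size), ignores latency and protocol overhead, and uses a hypothetical model of 1 billion parameters with 2-byte gradients.

```python
def ring_allreduce_bytes_per_node(grad_bytes: float, num_nodes: int) -> float:
    """Bytes each node sends in one ring all-reduce of grad_bytes."""
    return 2 * (num_nodes - 1) / num_nodes * grad_bytes


def transfer_seconds(num_bytes: float, link_mbps: float) -> float:
    """Time to push num_bytes through a link of link_mbps (10^6 bits/s)."""
    return num_bytes * 8 / (link_mbps * 1e6)


# Assumed example values (not from the thread, except the 500 Mbps cap).
params = 1_000_000_000      # hypothetical model size
bytes_per_param = 2         # e.g. 16-bit gradients
grad_bytes = params * bytes_per_param

per_node = ring_allreduce_bytes_per_node(grad_bytes, num_nodes=2)
secs = transfer_seconds(per_node, link_mbps=500)
print(f"{secs:.1f} s per gradient sync")  # 32.0 s per gradient sync
```

Under these assumptions a single gradient synchronization between two pods would take about half a minute, so a link at this speed would likely dominate step time for models of that size.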