Multi-node training with multiple pods sharing the same region
I'd like to rent 32 A100 GPUs for at least one month. Before committing, I want to test the efficiency of multi-node training on your servers. Could we arrange a brief rental period, perhaps a few hours, so I can confirm it meets my requirements?
26 replies
I've already run multi-node training on my own server without any issues. My question is about the network settings of pods.
I'm wondering whether multiple pods launched from Secure Cloud can communicate with each other using the same port number.
When I checked, they share the same public IP and cannot reach each other over their private IPs.
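A reachability check like the one described above can be sketched with plain TCP sockets. This is only an illustration, not the platform's tooling: the function names are made up for this example, and port 29500 is used in the usage note only because it is the default master port for PyTorch distributed launches.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def listen_once(port: int, host: str = "0.0.0.0") -> None:
    """Accept a single TCP connection on host:port, report the peer, and exit."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, addr = srv.accept()
        conn.close()
        print(f"accepted connection from {addr[0]}:{addr[1]}")
```

To use it, run `listen_once(29500)` on the first pod, then call `can_connect("<private IP of the first pod>", 29500)` on the second. If that returns `False` for the private IP, the pods cannot see each other directly on that port.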
@ashleyk I got it. Thank you 😄
@JM Hi, could you please confirm whether there is an option for testing multi-node training?
It's for a network bandwidth test, so I need at least two pods in the same region.
2 GPUs per pod are enough for the test.
About 3 to 4 hours should also be enough.
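For a rough idea of what such a bandwidth test between two pods could look like, here is a minimal sketch using a single TCP stream. It is an assumption-laden illustration: the function names and chunk size are invented for this example, and a single TCP stream is only a proxy for the collective traffic that real multi-node training generates.

```python
import socket
import time
from typing import Tuple

CHUNK = 1 << 20  # 1 MiB per send/recv call (arbitrary choice for this sketch)


def receive_and_time(port: int, host: str = "0.0.0.0") -> Tuple[int, float]:
    """Accept one connection; return (bytes received, seconds until sender closed)."""
    total = 0
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            start = time.perf_counter()
            while True:
                data = conn.recv(CHUNK)
                if not data:  # empty read means the sender closed the connection
                    break
                total += len(data)
            elapsed = time.perf_counter() - start
    return total, elapsed


def send_bytes(host: str, port: int, total_mib: int = 1024) -> int:
    """Send total_mib MiB of zeros to host:port; return the number of bytes sent."""
    payload = b"\0" * CHUNK
    with socket.create_connection((host, port)) as conn:
        for _ in range(total_mib):
            conn.sendall(payload)
    return total_mib * CHUNK


def gbit_per_s(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and a duration into gigabits per second."""
    return num_bytes * 8 / seconds / 1e9
```

Run `receive_and_time(port)` on one pod and `send_bytes(host, port)` on the other, then pass the receiver's result to `gbit_per_s`. Timing on the receiving side avoids counting data that is still sitting in the sender's socket buffer.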
If multi-node training with Secure Cloud is impossible, is there any other way to test multi-node training?
I need to measure the speed of multi-node training before deciding on a long-term contract.
Thank you for your quick reply!
I am trying to test training on 32 GPUs, so I thought I should run 4 pods (each with 8 GPUs, which is the maximum number of GPUs available for a single pod).
Can a single pod have 32 GPUs?
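For reference, the 4 pods x 8 GPUs layout mentioned above maps onto process ranks roughly as follows. This is a sketch assuming a static launch with one process per GPU, where the global rank is the node rank times the GPUs per node plus the local rank; `train.py` and the placeholder values in the comment are hypothetical.

```python
NODES = 4          # pods, one per node
GPUS_PER_NODE = 8  # maximum GPUs per pod, per the post above

WORLD_SIZE = NODES * GPUS_PER_NODE  # total number of training processes


def global_rank(node_rank: int, local_rank: int,
                gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Global rank of the process driving GPU `local_rank` on node `node_rank`."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


# Assumed launch on each pod (node_rank differs per pod, 0..3):
#   torchrun --nnodes=4 --nproc_per_node=8 --node_rank=<0..3> \
#            --master_addr=<IP of pod 0> --master_port=29500 train.py
if __name__ == "__main__":
    for node in range(NODES):
        ranks = [global_rank(node, gpu) for gpu in range(GPUS_PER_NODE)]
        print(f"node {node}: global ranks {ranks[0]}..{ranks[-1]}")
```

This only works if every pod can reach the master address and port, which is why the connectivity question earlier in the thread matters.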