AutoK
Compatibility of RTX A6000 for Multi-GPU Training
I would like to ask which types of GPUs support multi-GPU training. For instance, is it possible to run multi-GPU training on 10 RTX A6000 cards from the previous generation?
I understand that the H100 PCIe does not support multi-GPU training, while the H100 SXM5 does. Among the GPUs offered by RunPod, which other types are capable of multi-GPU training?
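One way to check what a given Pod actually supports is to inspect the visible devices and whether peer-to-peer (P2P) access works between them, since multi-GPU training performance depends heavily on the interconnect. Below is a minimal sketch assuming PyTorch is available; it degrades gracefully when it is not, and the function name `describe_gpus` is illustrative:

```python
# Hedged sketch: list visible CUDA devices and test pairwise peer-to-peer
# access (assumes PyTorch; falls back gracefully if it is missing).
try:
    import torch
except ImportError:
    torch = None

def describe_gpus():
    """Return a short report of visible GPUs and P2P reachability."""
    if torch is None or not torch.cuda.is_available():
        return "no CUDA devices visible (or PyTorch not installed)"
    n = torch.cuda.device_count()
    lines = [f"{n} CUDA device(s)"]
    for i in range(n):
        for j in range(n):
            if i != j and torch.cuda.can_device_access_peer(i, j):
                lines.append(f"P2P {i}->{j}: yes")
    return "\n".join(lines)

print(describe_gpus())
```

On a multi-GPU Pod this would show whether devices can reach each other directly; `nvidia-smi topo -m` gives a similar view of the interconnect topology from the shell.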
2 replies
H100 multi-gpus settings
When I try to load weights from a checkpoint into my custom model using multiple GPUs, the weights are not loaded and the progress bar stalls.
I am using 7x H100 on RunPod; when I ran the same trial on my local server (6x A6000), it worked fine.
Do you have any idea what might be wrong?
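A common cause of this kind of stall is every rank loading a checkpoint that was saved from a GPU, so the tensors get mapped back onto a device that may not exist or may be contended on the new machine. A minimal sketch of the usual workaround, assuming PyTorch (the path and function name here are illustrative):

```python
# Hedged sketch: load a checkpoint onto CPU first so each rank avoids
# mapping tensors to the GPU the checkpoint was saved from (a common
# cause of multi-GPU loading stalls). Assumes PyTorch.
import os

# Assumption: NCCL backend -- surface hangs as log output instead of silence.
os.environ.setdefault("NCCL_DEBUG", "INFO")

try:
    import torch
except ImportError:
    torch = None

def load_checkpoint_cpu(path):
    """Load weights onto CPU; move to the local device afterwards."""
    if torch is None:
        return None
    return torch.load(path, map_location="cpu")
```

After loading on CPU, each rank can call `model.load_state_dict(state)` and move the model to its own device; setting `NCCL_DEBUG=INFO` often reveals whether the stall is actually a collective-communication hang rather than a loading problem.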
3 replies
s3 slow upload
I'm currently working on uploading a dataset from S3 to my Pod using cloud sync. The dataset I'm uploading from S3 is about 1TB in size, so I set the volume size to 2TB. However, when I check the progress bar, it shows as follows:
314.897 GiB / 316.270 GiB 100% 13.045 MiB/s ETA 1m47s
The upload is progressing extremely slowly, and I can't estimate how long the full 1TB will take. Could I be doing something wrong? I'm using 7 H100 GPUs, and the billing is adding up even though I haven't started working on the project yet. I would appreciate any quick help.
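For a rough time estimate from the observed rate, a back-of-envelope calculation is enough. This is pure arithmetic with no RunPod specifics assumed; the 13.045 MiB/s figure comes from the progress line above:

```python
# Hedged back-of-envelope: at the observed ~13 MiB/s, how long would
# a full 1 TiB sync take?
def eta_hours(total_gib, rate_mib_per_s):
    seconds = total_gib * 1024 / rate_mib_per_s  # GiB -> MiB, then seconds
    return seconds / 3600

print(f"{eta_hours(1024, 13.045):.1f} hours")  # ≈ 22.3 hours
```

So at this rate the full dataset would take close to a day. If the sync is rclone-based, raising the number of parallel transfers (rclone's `--transfers` flag) sometimes helps for many-file datasets, though single-stream S3 throughput may also be the bottleneck.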
8 replies