How to use multiple GPUs for Kohya Training?
I am training DreamBooth with Kohya on 2 GPUs, but only one of them is being utilized.
I have tried the following:
1. Added "set CUDA_VISIBLE_DEVICES=1" to the gui.bat file of kohya_ss.
2. Hardcoded the parameters "accelerate launch --num_processes=[NUM_YOUR_GPUS_PER_MACHINE] --num_machines=[NUM_YOUR_INDEPENDENT_MACHINES] --multi_gpu --gpu_ids=[GPU_IDS] args..." into run_cmd in dreambooth_gui.py.
3. Changed the config files to set num_processes = 2.
Still, when I restart the training, only 1 GPU is utilized. How can I get training to use multiple GPUs?
P.S. The Docker image I am using: ashleykza/stable-diffusion-webui:3.12.0
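One thing worth checking in the attempts above: CUDA_VISIBLE_DEVICES=1 exposes only the GPU with index 1, so every process sees a single device. Below is a minimal sketch in Python of how the visible devices and the launch command relate. It assumes the standard accelerate launch flags --multi_gpu, --num_processes, --num_machines and --gpu_ids; the helper names and the script name train_db.py are placeholders for illustration, not taken from kohya_ss.

```python
import os
import shlex


def visible_gpu_ids(env=None):
    """Return the GPU indices exposed by CUDA_VISIBLE_DEVICES, or None if unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: CUDA exposes every GPU on the machine
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_launch_cmd(script, gpu_ids, script_args=()):
    """Build an `accelerate launch` command with one process per listed GPU."""
    if len(gpu_ids) < 2:
        raise ValueError("multi-GPU launch needs at least two GPU ids")
    return [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines=1",
        f"--num_processes={len(gpu_ids)}",
        f"--gpu_ids={','.join(gpu_ids)}",
        script,
        *script_args,
    ]


# The setting from attempt 1 exposes a single device.
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "1"}))
# Listing both indices exposes both GPUs.
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))

# "train_db.py" is a placeholder script name (assumption for this sketch).
print(shlex.join(build_launch_cmd("train_db.py", ["0", "1"])))
```

If the first call is what the training process actually sees, no launch flag can make a second GPU appear; the variable would need to list both indices or be left unset.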
2 Replies
RunPod uses Linux, not Windows, so you have to edit the .sh files, NOT the .bat files.
For everything else, I suggest logging a GitHub issue in the kohya_ss repo:
https://github.com/bmaltais/kohya_ss
Where is the nccl.conf file located on RunPod?
I want to edit some environment variables and found this in the NVIDIA docs:
"NCCL has an extensive set of environment variables to tune for specific usage.
They can also be set statically in /etc/nccl.conf (for an administrator to set system-wide values) or in ~/.nccl.conf (for users)."
But I couldn't find it in either place.
Ref: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
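The quoted passage suggests these files are optional overrides rather than files shipped with the image, so they likely have to be created by hand. Below is a minimal sketch in Python that writes KEY=value lines to a user-level file, assuming that is the format NCCL expects. NCCL_DEBUG=INFO is only an example setting, and write_nccl_conf is a made-up helper name.

```python
import tempfile
from pathlib import Path


def write_nccl_conf(settings, path=None):
    """Write NCCL settings as KEY=value lines and return the file path."""
    # Default to the per-user location named in the NVIDIA docs.
    path = Path(path) if path is not None else Path.home() / ".nccl.conf"
    lines = [f"{key}={value}" for key, value in settings.items()]
    path.write_text("\n".join(lines) + "\n")
    return path


# Demo against a temporary directory so nothing in the home directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = write_nccl_conf({"NCCL_DEBUG": "INFO"}, Path(tmp) / ".nccl.conf")
    print(demo.read_text())
```

Calling write_nccl_conf without a path writes to ~/.nccl.conf. Exporting the same variables in the shell that starts the training should have an equivalent effect.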