CUDA out of memory (80GB GPU)
Hi there, I am trying to run a Dreambooth training job through a serverless endpoint, using an A100 80 GB GPU. Is this perhaps not a good GPU for this type of training?
I'm using this template as a base, but I modified it a bit and also changed the accelerate command with some other params. Even so, I wouldn't expect it to run out of memory:
https://github.com/runpod-workers/worker-lora_trainer
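To be clear, this isn't the full worker code, just roughly the shape of the command I end up launching (a simplified sketch, not my exact paths or values; the flags are standard kohya sd-scripts options, assuming train_db.py as the Dreambooth entry point):

import subprocess

# Rough sketch of the training launch; paths, steps and learning rate are placeholders.
cmd = [
    "accelerate", "launch", "--num_cpu_threads_per_process", "4",
    "train_db.py",
    "--pretrained_model_name_or_path", "/models/base.safetensors",
    "--train_data_dir", "/inputs/instance_images",
    "--output_dir", "/outputs",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--max_train_steps", "1500",
    "--learning_rate", "1e-6",
    "--mixed_precision", "fp16",    # fp16 training
    "--gradient_checkpointing",     # trade compute for memory
    "--xformers",                   # memory-efficient attention
    "--cache_latents",              # precompute VAE latents up front
]
subprocess.run(cmd, check=True)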
That's ridiculous for training a LoRA; you can do full Dreambooth training with 24 GB.
I am not training a LoRA; I am using that template as a base and modified the accelerate command under the hood to perform a Dreambooth training.
That's why I don't understand why it's running out of memory
Must be something wrong with your implementation. Nobody can help you unless you share the code.
I understand. I’ll share a repository in a sec.
https://github.com/flamed0g/runpod-kohya-worker
The actual command that starts the training is here: https://github.com/flamed0g/runpod-kohya-worker/blob/d7e189465766890e21e941835db153de7813b369/src/handler.py#L67
The serverless script uses sd-scripts under the hood and is a customized script based on runpod-workers/worker-lora_trainer.
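For anyone not clicking through: the handler is essentially the usual RunPod serverless wrapper that takes the job input and kicks off that accelerate command, roughly this shape (a simplified sketch with placeholder input keys, not the real schema; the actual logic is in handler.py at the link above):

import subprocess
import runpod

def handler(job):
    # job["input"] carries the training parameters sent to the endpoint (placeholder keys).
    params = job["input"]
    cmd = [
        "accelerate", "launch", "train_db.py",
        "--train_data_dir", params["instance_data_dir"],
        "--output_dir", params["output_dir"],
        "--mixed_precision", "fp16",
    ]
    # Run the training synchronously and report where the model was written.
    subprocess.run(cmd, check=True)
    return {"output_dir": params["output_dir"]}

runpod.serverless.start({"handler": handler})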