Could not find CUDA drivers
I am experiencing issues with the Stable Diffusion Kohya_ss ComfyUI Ultimate template. I have set up an RTX 3090 pod, transferred the training images and set up Kohya.
I am really new to RunPod, so I apologise if I'm misunderstanding something or missed something obvious.
When I begin training, the Kohya log file displays the following message:
Is this normal? Also, the training process is reporting
2.14s/it
and with Epoch set to 10 and 7000 steps it will take about 42 hours. Is that right?
Thanks,
James
5 Replies
Solution
Check your GPU memory and GPU utilization and you will see that the GPU is being used. This is just some weird TensorFlow error.
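One way to check this from a terminal on the pod is to poll `nvidia-smi`. Below is a minimal Python sketch, assuming `nvidia-smi` is on the PATH and reports plain integers for these fields; the helper names `parse_gpu_stats` and `read_gpu_stats` are made up for illustration.

```python
import subprocess

# Query fields accepted by `nvidia-smi --query-gpu`.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        # Each line looks like "<util %>, <used MiB>, <total MiB>".
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` a few times while training runs should show non-zero utilization and a large share of memory in use if the GPU is really doing the work.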
Ahh yeah, they are going up and down. So activity on the GPU utilization suggests it's working as it should?
This is the first time I've managed to get a model training. Can I ask a n00b question?
The log began with this:
But I've seen in the model folder that there are now 3 model files in there. Does this mean the entire training process will be complete in 7000 steps?
Yeah, the number of steps depends on the number of training images, the number of repeats and the number of epochs.
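As a rough sketch of that relationship: the code below assumes the common convention that total steps = images × repeats × epochs ÷ batch size. The batch size term, the rounding and the function names are assumptions for illustration, not something confirmed in this thread. It also plugs in the 2.14 s/it figure from the first post.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size=1):
    """Estimated step count; assumes steps = images * repeats * epochs / batch size."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


def estimated_hours(steps, seconds_per_it):
    """Wall-clock estimate, assuming a constant time per iteration."""
    return steps * seconds_per_it / 3600


# With the 2.14 s/it reported in the first post:
print(round(estimated_hours(7000, 2.14), 1))       # 4.2  (7000 steps in total)
print(round(estimated_hours(7000 * 10, 2.14), 1))  # 41.6 (7000 steps per epoch, 10 epochs)
```

So an estimate of about 42 hours corresponds to roughly 70,000 iterations at that speed, while 7000 iterations in total would take closer to 4 hours.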
Thank you so much for that. That's great to know.
Thank you also for such a quick reply! 🙂
I'll check whether reverting TensorFlow to an older version fixes those errors.
By the way, I have noticed a significant difference between the same GPU types in different regions.
With an A5000 in the BG region, I was getting more than 1 s/it, but with an A5000 in the ES region, I am getting 3.37 it/s.
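Note that the two figures are in different units (seconds per iteration versus iterations per second). Here is a small sketch for putting them on the same scale; the function name is just for illustration.

```python
def to_seconds_per_it(value, unit):
    """Convert a progress-bar rate to seconds per iteration."""
    if unit == "s/it":
        return value
    if unit == "it/s":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


bg = to_seconds_per_it(1.0, "s/it")   # "more than 1 s/it", so this is a lower bound
es = to_seconds_per_it(3.37, "it/s")
print(round(es, 3))       # 0.297 seconds per iteration
print(round(bg / es, 2))  # 3.37, i.e. ES is at least this many times faster
```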