RunPod•8mo ago
mozthefox

Could not find CUDA drivers

I am experiencing issues with the Stable Diffusion Kohya_ss ComfyUI Ultimate template. I have set up an RTX 3090 pod, transferred the training images and set up Kohya. I am really new to RunPod, so I apologise if I have misunderstood something or missed something obvious. When I begin training, the Kohya log file displays the following messages:
2024-03-11 20:04:56.909562: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.518054: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-11 20:04:57.518308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-11 20:04:57.643386: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-11 20:04:57.881246: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.883751: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 20:05:02.761582: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Is this normal? Also, the training process is reporting 2.14s/it and with Epoch set to 10 and 7000 steps it will take about 42 hours. Is that right? Thanks, James
Solution
ashleyk•8mo ago
Check your GPU memory and GPU utilization and you will see that the GPU is being used. Those log lines are just some weird TensorFlow messages.
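One way to run that check from a terminal on the pod is to poll `nvidia-smi`, which ships with the NVIDIA driver. The sketch below is not part of the Kohya template: the helper names `parse_gpu_stats` and `query_gpu_stats` are made up for illustration, and it assumes `nvidia-smi` is on the PATH and reports plain numbers for every field.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return one dict per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)
```

Calling `query_gpu_stats()` a few times while training is running should show non-zero utilization and several GiB of memory in use if the GPU is doing the work, regardless of what the TensorFlow messages say.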
mozthefox (OP)•8mo ago
Ahh yeah, they are going up and down. So activity on the GPU utilization suggests it's working as it should? This is the first time I've managed to get a model training. Can I ask a n00b question? The log began with this:
steps: 0%| | 0/7000 [00:00<?, ?it/s]
epoch 1/10
But I've seen in the model folder that there are now 3 model files in there. Does this mean the entire training process will be complete in 7000 steps?
ashleyk•8mo ago
Yeah, the number of steps depends on the number of training images, the number of repeats and the number of epochs.
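As a rough sketch of how those three numbers combine: a common convention in Kohya-style trainers is that one epoch covers every image `repeats` times, split into batches. The formula below is that assumption, not something confirmed in this thread, and it ignores regularisation images and gradient accumulation. The function names and the example dataset figures are hypothetical; only the 7000 steps and 2.14s/it come from the posts above.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size=1):
    """Estimated optimiser steps for the whole run (assumed formula)."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


def eta_hours(steps, seconds_per_it):
    """Wall-clock estimate in hours at a constant seconds-per-iteration rate."""
    return steps * seconds_per_it / 3600


# Hypothetical dataset: 70 images x 10 repeats x 10 epochs at batch size 1.
print(total_steps(70, 10, 10))            # 7000
# Figures from the thread: 7000 total steps at 2.14 s/it.
print(round(eta_hours(7000, 2.14), 1))    # 4.2
```

If the 7000 in the progress bar is the total for all 10 epochs, the run takes roughly 4.2 hours at 2.14s/it. The 42-hour figure would only apply if each epoch had 7000 steps, for 70,000 in total.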
mozthefox (OP)•8mo ago
Thank you so much for that. That's great to know. Thank you also for such a quick reply! 🙂
ashleyk•8mo ago
I'll check whether reverting TensorFlow to an older version fixes those errors. By the way, I have noticed a significant speed difference between the same GPU type in different regions. With an A5000 in the BG region I was getting more than 1s/it, but with an A5000 in the ES region I am getting 3.37it/s.