RunPod•14mo ago

Could not find CUDA drivers

I am experiencing issues with the Stable Diffusion Kohya_ss ComfyUI Ultimate template. I have set up an RTX 3090 pod, transferred the training images and set up Kohya. I am really new to RunPod, so I apologise if I'm misunderstanding something or have missed something obvious. When I begin training, the Kohya log file displays the following messages:

2024-03-11 20:04:56.909562: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.518054: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-11 20:04:57.518308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-11 20:04:57.643386: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-11 20:04:57.881246: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.883751: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 20:05:02.761582: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Is this normal? Also, the training process is reporting 2.14s/it and with Epoch set to 10 and 7000 steps it will take about 42 hours. Is that right? Thanks, James

5 Replies

Solution

ashleyk•14mo ago

Check your GPU memory and GPU utilization and you will see that the GPU is being used. This is just some weird TensorFlow error.
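
One way to run the check described above from a terminal on the pod is to ask `nvidia-smi` for utilization and memory figures. The following is a minimal sketch that is not part of the original thread: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_stats` and `query_gpus` are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_stats(csv_text):
    """Parse the CSV output of nvidia-smi (one line per GPU, no header, no units)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem_used, mem_total = (float(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": mem_used, "mem_total_mib": mem_total}
        )
    return stats


def query_gpus():
    """Return per-GPU stats, or an empty list if nvidia-smi is not available."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("nvidia-smi not found; cannot query the GPU from here")
    for index, gpu in enumerate(gpus):
        print(
            f"GPU {index}: {gpu['util_pct']:.0f}% busy, "
            f"{gpu['mem_used_mib']:.0f}/{gpu['mem_total_mib']:.0f} MiB used"
        )
```

If utilization and used memory are well above zero while training runs, the GPU is doing the work, whatever the TensorFlow messages in the log say.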

mozthefoxOP•14mo ago

Ahh yeah, they are going up and down. So activity on the GPU utilization suggests it's working as it should? This is the first time I've managed to get a model training. Can I ask a n00b question? The log began with this:

steps:   0%|          | 0/7000 [00:00<?, ?it/s]
epoch 1/10

But I've seen in the model folder that there are now 3 model files in there. Does this mean the entire training process will be complete in 7000 steps?

ashleyk•14mo ago

Yeah, the number of steps depends on the number of training images, the number of repeats and the number of epochs.
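
As a rough illustration of that relationship, here is a small sketch that is not from the thread. It assumes the commonly quoted rule of thumb that total steps equal images × repeats × epochs divided by batch size; the image count, repeat count and batch size below are hypothetical values chosen only so that the total comes to the 7000 shown in the progress bar. It also estimates the run time from the 2.14s/it reported in the original post.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size=1):
    """Estimate total optimizer steps (assumed rule of thumb, not read from Kohya)."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


def eta_hours(steps, seconds_per_step):
    """Convert a step count and a seconds-per-iteration rate into hours."""
    return steps * seconds_per_step / 3600


# Hypothetical dataset: 70 images, 10 repeats, 10 epochs, batch size 1.
steps = total_steps(num_images=70, repeats=10, epochs=10, batch_size=1)
print(steps)                             # 7000
print(round(eta_hours(steps, 2.14), 2))  # 4.16
print(round(eta_hours(70000, 2.14), 1))  # 41.6
```

If the 7000 in the progress bar is the total across all 10 epochs, as the `0/7000` next to `epoch 1/10` suggests, then 2.14s/it works out to a little over 4 hours. The 42-hour figure would only apply if there were 7000 steps in every epoch.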

mozthefoxOP•14mo ago

Thank you so much for that. That's great to know. Thank you also for such a quick reply! 🙂

ashleyk•14mo ago

I'll check if reverting TensorFlow to an older version fixes those errors. By the way, I have noticed a significant difference between the same GPU type in different regions. With an A5000 in the BG region I was getting more than 1s/it, but with an A5000 in the ES region I am getting 3.37it/s.
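
The two figures above use different units: the progress bar reports s/it (seconds per iteration) when a run is slower than one iteration per second, and it/s otherwise. A small sketch, with a made-up helper name `to_it_per_s`, puts both on the same scale for comparison:

```python
def to_it_per_s(value, unit):
    """Normalise a progress-bar rate to iterations per second."""
    if value <= 0:
        raise ValueError("rate must be positive")
    if unit == "it/s":
        return value
    if unit == "s/it":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


bg_rate = to_it_per_s(1.0, "s/it")   # BG region, taking "more than 1s/it" as exactly 1s/it
es_rate = to_it_per_s(3.37, "it/s")  # ES region
print(f"ES is at least {es_rate / bg_rate:.2f}x faster than BG")
```

Because the BG figure was "more than 1s/it", the ratio printed here is a lower bound on the difference between the two regions.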
