#Could not find CUDA drivers

eager oak
#

I am experiencing issues with the Stable Diffusion Kohya_ss ComfyUI Ultimate template. I have set up an RTX 3090 pod, transferred the training images, and set up Kohya.

I am really new to RunPod, so I apologise if I'm misunderstanding something or missed something obvious.

When I begin training, the Kohya log file displays the following message:

2024-03-11 20:04:56.909562: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.518054: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-11 20:04:57.518308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-11 20:04:57.643386: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-11 20:04:57.881246: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-11 20:04:57.883751: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 20:05:02.761582: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Is this normal? Also, the training process is reporting 2.14s/it and with Epoch set to 10 and 7000 steps it will take about 42 hours. Is that right?

Thanks,
James

teal jungle
#

Check your GPU memory and GPU utilization and you will see that the GPU is being used. Those messages are just a weird TensorFlow error and don't affect the training.
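
If you'd rather check from a terminal on the pod than from the dashboard graphs, here is a minimal sketch that asks `nvidia-smi` for utilization and memory. The helper names (`parse_gpu_stats`, `read_gpu_stats`) are made up for this example, and it assumes `nvidia-smi` is on the PATH and reports plain integers for these fields.

```python
import shutil
import subprocess

# nvidia-smi query flags: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn rows like '37, 10240, 24576' into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Return parsed stats, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        return parse_gpu_stats(result.stdout)
    except (subprocess.CalledProcessError, OSError, ValueError):
        # ValueError covers fields such as "[N/A]" that are not integers.
        return None


if __name__ == "__main__":
    print(read_gpu_stats())
```

Non-zero utilization and several GiB of memory in use while training runs would suggest the GPU is doing the work, whatever the TensorFlow log says.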

eager oak
#

Ahh yeah, they are going up and down. So activity on the GPU utilization suggests it's working as it should?

This is the first time I've managed to get a model training. Can I ask a n00b question?

The log began with this:

steps:   0%|          | 0/7000 [00:00<?, ?it/s]
epoch 1/10

But I've seen in the model folder that there are now 3 model files in there. Does this mean the entire training process will be complete in 7000 steps?

teal jungle
#

Yeah, the 7000 is the total. The step count depends on the number of training images, the number of repeats, and the number of epochs.
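
To put rough numbers on that, here is a small sketch of the arithmetic. It assumes the commonly described convention of images × repeats ÷ batch size per epoch, which is not confirmed in this thread, and the 70 images / 10 repeats figures are purely hypothetical values chosen to land on 7000. Only the 7000 steps, 10 epochs and 2.14 s/it come from the log above.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size=1):
    """Estimated total optimizer steps across all epochs (assumed formula)."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


def eta_hours(steps, seconds_per_step):
    """Wall-clock estimate in hours at a constant step time."""
    return steps * seconds_per_step / 3600


# Hypothetical dataset: 70 images, 10 repeats, 10 epochs, batch size 1.
print(total_steps(70, 10, 10))            # 7000

# If 7000 is the total across all epochs, at 2.14 s/it:
print(round(eta_hours(7000, 2.14), 2))    # 4.16

# If 7000 were per epoch (70000 steps in total):
print(round(eta_hours(70000, 2.14), 2))   # 41.61
```

So if the 7000 in the progress bar covers all 10 epochs, the run would take a bit over 4 hours at that speed. The 42 hour figure only applies if each epoch were 7000 steps on its own.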

eager oak
#

Thank you so much for that. That's great to know.

Thank you also for such a quick reply! 🙂

teal jungle
#

I'll check whether reverting TensorFlow to an older version fixes those errors.