RuntimeError: Found no NVIDIA driver on your system
I've been pulling my hair out over this for 2 weeks now. I'm building a container with ComfyUI and the IDM-VTON custom_node, and whenever I run it (serverless and on a pod) it gives me the "Found no driver" message. This container runs without a problem on my home 4090, and I'm using a 4090 on RunPod too.
When I run the same container without the IDM-VTON custom node, it runs fine.
The container is the following: vazazon/comfyuivenv:dev
What am I missing?
What are you using as a base image?
Maybe you don't have the cuda libraries/driver in your image?
My base image is madiator2011/better-comfyui:light, which is one of the images RunPod offers. It has all the drivers required inside the container image. I'm running this container on my personal machine, which has a 4090, and it runs properly.
Do you know if runpod passes --gpus=all to the running pod?
ic
I've also tried official base images offered by runpod
yeah i think runpod does proper handling to pass the gpu access
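For local testing, the usual sanity check that GPU passthrough works (the CUDA image tag here is just an example) is:
docker run --rm --gpus=all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
If that prints the GPU table, the host side is fine.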
It's very strange that ComfyUI works properly up until I install the IDM-VTON custom node, and from there on ComfyUI refuses to load due to the driver issue
it gives me the "Found no driver" message.
can you give a screenshot of this message
where is it?
This is the custom node I'm installing: https://github.com/TemryL/ComfyUI-IDM-VTON
Hmm try setting 0 workers and back
and report this endpoint to runpod
from the contact button
sorry, what do you mean "and back"?
How many max workers do you have now?
Set it to 0 (that deletes all workers), then back to your current amount.
ok, I'm building the EP again and will provide it to the support once it is up
What is EP?
Endpoint
I've spent almost $20 on these experiments so far without success...
Yeah, I have no idea why that is
Hopefully, if they confirm the problem is on RunPod's side, you can get some credits back
Don't use CUDA 12.5.1 image
oh ya... they're not yet supported 🤣
the machines are like 12.4 (max), right?
Most are 12.1
yeah
i thought they already released some on 12.5
You have to set the filters for CUDA version on the endpoint if you want to use anything higher than 12.1
That's why it can't find the GPU
hmm, but if that's the case, it usually errors with something like a CUDA compatibility message, right?
I've tried all kinds of images. Everything works when loading ComfyUI until I add the IDM-VTON custom node. The diffusers_load.py import chain accesses CUDA and fails:
File "/workspace/ComfyUI/nodes.py", line 21, in <module>
2024-07-30T07:35:28.884890696Z import comfy.diffusers_load
2024-07-30T07:35:28.884909706Z File "/workspace/ComfyUI/comfy/diffusers_load.py", line 3, in <module>
2024-07-30T07:35:28.885133662Z import comfy.sd
2024-07-30T07:35:28.885148463Z File "/workspace/ComfyUI/comfy/sd.py", line 5, in <module>
2024-07-30T07:35:28.885478892Z from comfy import model_management
2024-07-30T07:35:28.885491072Z File "/workspace/ComfyUI/comfy/model_management.py", line 119, in <module>
2024-07-30T07:35:28.885740719Z total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
2024-07-30T07:35:28.885750849Z File "/workspace/ComfyUI/comfy/model_management.py", line 88, in get_torch_device
2024-07-30T07:35:28.885848452Z return torch.device(torch.cuda.current_device())
2024-07-30T07:35:28.885876893Z File "/venv/lib/python3.10/site-packages/torch/cuda/init.py", line 778, in current_device
2024-07-30T07:35:28.886137860Z _lazy_init()
2024-07-30T07:35:28.886148180Z File "/venv/lib/python3.10/site-packages/torch/cuda/init.py", line 293, in _lazy_init
2024-07-30T07:35:28.886281774Z torch._C._cuda_init()
2024-07-30T07:35:28.886360026Z RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
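For context, the crash happens during an import-time CUDA init. A guard like this (a sketch of the general pattern, not ComfyUI's actual code) only touches CUDA when a driver is visible:
import torch

# is_available() returns False when no driver is visible,
# instead of raising the RuntimeError that _lazy_init() throws
if torch.cuda.is_available():
    device = torch.device(torch.cuda.current_device())
else:
    device = torch.device("cpu")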
I've opened support case with all the information
This is 100% because you are using the wrong CUDA version base image
Fix that
Otherwise, filter to CUDA 12.5 if it's even available
I see CUDA 12.5 is available in the dropdown
Edit your endpoint, go to advanced and check CUDA 12.5
CUDA is not forwards compatible, you can't use a CUDA 12.5 docker image on a 12.1 host
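A quick way to compare the two sides from inside the container (assuming nvidia-smi and torch are present):
nvidia-smi    # header shows the host driver's maximum supported CUDA version
python -c "import torch; print(torch.version.cuda)"    # CUDA version torch was built against
As a rule of thumb, the second number should not be higher than the first.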
ah yea
I've tried defining CUDA 12.5 and the issue still exists. Also, the log clearly states I'm running CUDA 12.1 (not that the host is on 12.1)
I've chosen 12.5 in the dropdown of the endpoint.
If you run nvidia-smi in the command line, what happens?
This log does, your previous log said 12.5.1
Here
Is the docker image code on Github?
Looks like there may be an issue with torch installation
The container is on Docker Hub: vazazon/comfyuivenv:dev - the base image is madiator2011/better-comfyui:light, and the only change I made that is related to the error is opening the ComfyUI GUI and adding the IDM-VTON custom node.
yeah could be your torch version is messed up
How come the container works on my machine smoothly?
How can it work on your machine? You don't have the serverless infrastructure on your machine
RunPod provides a way to run it locally: if there is a local file named test_input.json, it will be used as if you had uploaded it through the RunPod API
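A minimal sketch of that local flow with the RunPod Python SDK (the handler body and input fields here are made up for illustration):
# handler.py
import runpod

def handler(job):
    # job["input"] is the "input" object from test_input.json (or from the API request)
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}

runpod.serverless.start({"handler": handler})
With a test_input.json like {"input": {"prompt": "hello"}} next to it, running python handler.py executes that job once locally.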
I run it using:
docker run -it --rm --gpus=all -p3000:3000 comfyuivenv:dev
madiator2011/better-comfyui:light is based on CUDA 12.1
And this is why I don't understand why it fails to run on runpod
You know my image is not made for serverless 🙂
I also tried during the last 2 weeks to use runpod's suggested images with the same results
Better Comfy UI is made for Pods not serverless
I tried to run it on pod too - same results
this is why I said I'm pulling my hair out over this issue
just testing it out
ComfyUI-IDM-VTON?
@mikmak5595 do you have a workflow?
kinda need more info if I want to test
so I installed without issues
@mikmak5595 I see TripleDES might be causing issues
For now you need to install this:
cryptography<43.0.0
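In the image that's just:
pip install "cryptography<43.0.0"
(the quotes keep the shell from treating < as a redirect)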
Looks like it's just a deprecation warning though, not an actual error
I know it caused issues when I worked on the whisperx worker
@Madiator2011 are you an AI? You are here 24/7
No he is a person
hmm
It was a joke
No errors on my side
TemryL/ComfyUI-IDM-VTON installed fine without issues
This is the workflow
If you run my container without any command it should process this workflow automatically
The problem is not with installing IDM-VTON, it's with running the container after IDM-VTON has been installed. My container already has this module installed along with all the required models. When I run it without any CMD, it fails during ComfyUI startup, as it calls the custom nodes' init functions, which try to use CUDA
@mikmak5595 want me to help you with setup?
Yes please. I have the docker image vazazon/comfyuivenv:dev that, when executed both in pod and serverless mode, gives the "RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx" error. The same container runs fine on my machine using:
docker run -it --rm --gpus=all vazazon/comfyuivenv:dev
what is your api input?
also noticed you have 2 venvs in the container
@mikmak5595 would it be an issue to share the Dockerfile and the code of the worker you use?
This is the workflow. I don't have a Dockerfile, as I took RunPod's base image, installed all the custom_nodes on it manually, and then ran docker commit
Since the issue is with the image itself, it's most likely a venv mess-up
I'm not sure why you have taken such a hard way of doing it
I also tried the non-venv way. Once I installed the IDM-VTON custom_node, things started to mess up
on serverless you want to avoid venvs
you do not plan to use network storage?
no, I placed all models and other data on the container image
btw, it fails to run both on pod and on serverless
Yeah, because you totally messed up the docker image
It wasn't built correctly, so the NVIDIA drivers did not get injected
Can you explain what you mean by "it wasn't built correctly"?
Build it with a Dockerfile and then the docker build command
I did build the base image using docker build, and then I opened bash inside it and ran pip install on all the custom_nodes' requirements.txt files
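For reference, the reproducible way is to put those manual steps into the Dockerfile so they run at build time; a sketch (the base image and the single node here are assumptions, the path matches your traceback):
FROM madiator2011/better-comfyui:light
WORKDIR /workspace/ComfyUI/custom_nodes
# clone the custom node and install its Python dependencies at build time
RUN git clone https://github.com/TemryL/ComfyUI-IDM-VTON.git && \
    pip install -r ComfyUI-IDM-VTON/requirements.txt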
It's late for me, but tomorrow I can try to make you a template
that would be awesome!
Here is the Dockerfile and some additional minor changes to the https://github.com/blib-la/runpod-worker-comfy code.
To build it, use the following command:
docker build -t vazazon/runpod-worker-comfy:dev-full2 --target base --platform linux/amd64 .
I'm uploading the resulting docker image to Docker Hub and will try it both as a pod and in serverless mode on RunPod, then update here
It's better to fork the repo and push your changes to GitHub than to post zip files of changes.
Let's first wait to see if the image runs properly on runpod.
Update: the image I created from the dockerfile above works!
The next challenge with serverless mode: running the 2nd workflow causes out-of-memory on the GPU. I will have to check how I can get ComfyUI to free up GPU memory at the beginning/end of each workflow
Nice
Sounds like you need to pick a GPU with more VRAM.
It happens only on the 2nd API invocation, which means ComfyUI does not free up GPU VRAM after running the workflow. Any suggestions on how to achieve that? Maybe the RunPod API provides a way to do it?
Here is a link on unloading models: https://www.reddit.com/r/comfyui/comments/194sehe/is_there_a_unload_model_node_im_sure_i_used_one/
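Recent ComfyUI builds also expose a /free endpoint that unloads models on request; something like this between jobs may be worth trying (assuming the default 127.0.0.1:8188 listener):
curl -X POST http://127.0.0.1:8188/free \
  -H "Content-Type: application/json" \
  -d '{"unload_models": true, "free_memory": true}'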
Why not keep the model in memory unless you're switching models?
It's the only way I know of to free up VRAM.
I was able to resolve the issue by adding --lowvram to the ComfyUI command line
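For reference, the full launch line then looks something like this (assuming the stock ComfyUI entrypoint):
python main.py --listen --lowvram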
@mikmak5595 just checking up on you. So the issue is solved?