Issue with Multiple instances of ComfyUI running simultaneously on Serverless

Hello, I am using RunPod Serverless and deploying ComfyUI with this repo: https://github.com/blib-la/runpod-worker-comfy?tab=readme-ov-file#bring-your-own-models The server itself is this repo: https://github.com/comfyanonymous/ComfyUI I deploy via a Docker image, and both repos are baked into the image. When I run 2-3 workers via the API, the ComfyUI server starts and responds as usual. The problem arises when more requests come in, for example when more than 5 requests arrive and more than 5 workers spin up; in that case the ComfyUI server fails to start on some workers. I understand that starting the ComfyUI server is handled by the ComfyUI code itself, but if that were the problem then even a single worker shouldn't work, and that is not the case. With few workers everything works fine; as soon as the number of workers increases, the ComfyUI server does not start. I would appreciate it if anyone could take a look. Thank you.
Encyrption
Encyrption2mo ago
I am running a custom blib-la/runpod-worker-comfy image. I'm not sure what you mean by the server. The basic config of blib-la/runpod-worker-comfy is to run the ComfyUI API server on the worker, and the RunPod handler reaches out to it locally. If you have modified this behavior, can you provide more details on those modifications?
SyedAliii
SyedAliii2mo ago
@Encyrption I am doing the same thing. The blib-la repo starts a local ComfyUI server and then sends requests to it. The problem is that when there are fewer than 3-4 workers, everything works fine: the API becomes reachable after 15-20 retries (retries happen every few milliseconds, the default behavior). But when there are more than 5 workers, the API is not reachable and the server does not start. For example, out of 5 workers only 2-3 manage to start the server; the others keep retrying until max retries are reached.
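For context, the retry behavior looks roughly like this (a simplified sketch, not the actual runpod-worker-comfy code; the port and timing values are assumptions):
```python
# Simplified sketch of the readiness loop being described: the handler polls
# the local ComfyUI HTTP API until it answers or the retry budget runs out.
# Not the actual runpod-worker-comfy code; port and timings are assumed.
import time
import requests

COMFY_URL = "http://127.0.0.1:8188"  # assumed default ComfyUI port
MAX_RETRIES = 500                    # retry budget before giving up
RETRY_DELAY_MS = 50                  # "a few milliseconds" between attempts

def wait_for_comfyui() -> bool:
    """Return True once the local ComfyUI server answers, False otherwise."""
    for _ in range(MAX_RETRIES):
        try:
            if requests.get(COMFY_URL, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet, keep retrying
        time.sleep(RETRY_DELAY_MS / 1000)
    return False
```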
Encyrption
Encyrption2mo ago
That's odd. Each worker should be an island unto itself. I'm not sure how having more workers would impact the function of any single worker. How are you handling models?
SyedAliii
SyedAliii2mo ago
@Encyrption That is my thought as well, that each worker is independent. I had a silly idea that perhaps the port was occupied, so I generated a random port for the server on each worker, but the issue remains the same. I have set up the models, LoRAs, and custom nodes all inside the Docker image; no network volume is attached.
Encyrption
Encyrption2mo ago
Have you checked, when this happens, which hosts the workers are running on? The port should be wholly contained inside the Docker container; it doesn't even touch the host ports. Maybe the host was completely out of ports?
SyedAliii
SyedAliii2mo ago
I have a Python script for testing. I just send 5 requests at once to my endpoint, and the RunPod endpoint assigns each request to a worker. Can you please explain what you mean by checking the workers?
Encyrption
Encyrption2mo ago
If you go into Serverless you can select Workers and see the status of all the workers assigned to your endpoint.
Encyrption
Encyrption2mo ago
What do you have set for max workers?
SyedAliii
SyedAliii2mo ago
They are all in the running state; there is no throttling or anything else happening to the workers. The worker keeps retrying and after 500 retries sends me a failure response.
Encyrption
Encyrption2mo ago
So, are all your requests in the IN_PROGRESS state? Or are you using /runsync?
SyedAliii
SyedAliii2mo ago
I have set max workers to 20 and the issue remains. (I know 30 is normally the max, but I have asked RunPod to give me more workers, so my limit is 50.) My requests use /runsync; I wait for each request to complete.
Encyrption
Encyrption2mo ago
50 would be nice, all I could get from them was 35.
SyedAliii
SyedAliii2mo ago
I am running some other endpoints too, which is why I need those.
Encyrption
Encyrption2mo ago
So, you can see from the logs that the ComfyUI API is timing out?
SyedAliii
SyedAliii2mo ago
Yes
Encyrption
Encyrption2mo ago
As long as you are paying RP enough I'm sure they will continue to give you more... I am not currently spending anything; I'm in development. I do everything async and have no such issues... although I currently only have Flux Schnell, Flux Dev, SD3, and SDXL. I don't have anything custom.
SyedAliii
SyedAliii2mo ago
Yes, I can see everything in the logs. The server retries time out and then a failure response is sent back. There are no unusual errors in the logs; I can see it trying to reach the server API.
Encyrption
Encyrption2mo ago
I would expect some of that while it syncs up. Are you running in a specific region?
SyedAliii
SyedAliii2mo ago
I have custom nodes and models, but I believe these are unrelated to the issue.
Encyrption
Encyrption2mo ago
Yeah, don't see how that would change anything.
SyedAliii
SyedAliii2mo ago
No specific region; I have selected the global region because no network volume is attached, so there is no region restriction.
Encyrption
Encyrption2mo ago
Do you block out any regions?
SyedAliii
SyedAliii2mo ago
No
Encyrption
Encyrption2mo ago
I am currently blocking EU-* and US-OR as they have had issues reported; I still haven't seen any update about them being fixed.
SyedAliii
SyedAliii2mo ago
By the way, you might say there is some internal bug specific to my endpoint, but I tested on a separate test endpoint and the issue remains the same.
Encyrption
Encyrption2mo ago
Do you have a local GPU you can test with?
SyedAliii
SyedAliii2mo ago
I have, and I am able to run the ComfyUI GUI locally. Do you get these updates about which regions are causing issues from another channel? If so, please let me know.
Encyrption
Encyrption2mo ago
If you have a local GPU you can use the Docker Compose file from the repo to run it in local API mode. As for the regions, it's just people talking about it on this server.
SyedAliii
SyedAliii2mo ago
But with a single worker everything works fine; the issue appears when multiple workers receive requests. I don't think I can simulate this behavior on my local machine. Can you please explain more about this?
Encyrption
Encyrption2mo ago
I would try blocking those regions I mentioned and testing again, and open a ticket with RunPod. It shouldn't matter how many workers are running; each worker should be an isolated entity.
SyedAliii
SyedAliii2mo ago
Yes, each worker is a separate GPU. @Encyrption I tried blocking the EU and US-OR regions but the issue persists. However, I have noticed that the error rate is lower today: if I send 10 requests, sometimes all 10 complete and sometimes 2-3 fail. So the issue appears to be internal to RunPod. Thank you for your time taking a look at this.
gnarley_farley.
Hey, I don't know if you have figured this out yet, but you cannot use /runsync like this.
gnarley_farley.
Use the async endpoint (/run) instead. Look below.
gnarley_farley.
The issue is that your requests are somehow exceeding the limit.
gnarley_farley.
Here is everything you need to get your problem solved: https://docs.runpod.io/serverless/endpoints/job-operations
gnarley_farley.
Use a polling mechanism and check every few seconds to see if your requests are ready. I am using the exact same serverless ComfyUI API to power my application. I do bulk processing of images and ran into the exact same issue, and this is how I fixed it. I can now run hundreds of images in one shot effortlessly, even with just 3 active GPUs. It's much more performant to use the polling system anyway, and if you build an app on serverless in the future it won't affect your serverless function limits. Using /runsync and waiting for a bunch of requests to return is not very efficient: modern frameworks like Next.js have a 10s limit on function timeouts, and serverless function time is one of the biggest cost factors. And it does not affect the speed at all; I feel it is faster now. I have not done tests to verify that, but it's definitely not slower. I'm using 3x 4090s and ripping hard. I saw no or very little performance increase from using bigger GPUs.
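Roughly, the flow looks like this (a simplified sketch; the endpoint ID, API key, and payload shape are placeholders you would swap for your own):
```python
# Sketch of submit-then-poll against a RunPod serverless endpoint:
# queue the job with /run, then poll /status until it reaches a terminal state.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"       # placeholder
API_KEY = "your-runpod-api-key"        # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def submit(payload: dict) -> str:
    """Queue a job asynchronously with /run and return its job id."""
    r = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS)
    r.raise_for_status()
    return r.json()["id"]

def poll(job_id: str, interval: float = 1.0) -> dict:
    """Poll /status every `interval` seconds until the job finishes."""
    while True:
        r = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS)
        r.raise_for_status()
        status = r.json()
        if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return status
        time.sleep(interval)

# job_id = submit({"workflow": {...}})  # input shape depends on your worker
# result = poll(job_id)
```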
SyedAliii
SyedAliii5w ago
@gnarley_farley. Hello, thank you so much; your point totally makes sense. Two queries though: 1) Even if async performs better, they said the limit for sync is 2000 requests per 10 seconds, and I believe I haven't even crossed 100. 2) Using async, what time interval do you use between status checks? It depends on the task being performed, but what is your suggestion?
gnarley_farley.
Because it's one request at a time and ComfyUI doesn't use batching, you don't really gain much from the extra RAM. Set the polling to a 1 second interval; it is very performant. I don't see a difference at all, my requests come back in the same time. I know those limits are fudged; I had to figure it out through trial and error.
SyedAliii
SyedAliii5w ago
Yes, I have seen that too; using better GPUs doesn't change much. But I have noticed an improvement if I run Comfy with --gpu-only: for example, a job that takes 18 seconds takes 13-14 seconds with the flag.
gnarley_farley.
Okay, cool, thanks. I will test that out. Where are you adding the flag?
SyedAliii
SyedAliii5w ago
When you run python main.py in the ComfyUI directory, use python main.py --gpu-only instead. You can also see the other flags, like --highvram, with python main.py --help.
gnarley_farley.
Thanks, will check that out; I don't know much about ComfyUI. If anyone can make an inpainting version of this worker that uses Flux it would be AMAZING! I want to be able to just pass in a masked photo with a prompt and get something back. My Starlink + Docker Hub combo is super slow for some reason; I can't effectively push these big images.
SyedAliii
SyedAliii5w ago
Why not make a pod and run your Comfy experiments there? Install the models into a network volume using wget so you can use them later.
gnarley_farley.
Sorry, I don't want to hijack your thread, but the issue is that you need to put everything in the Docker image, otherwise it takes too long to initialize on serverless. I need it to scale from zero on serverless because our SaaS infra demands that. It's quite complicated; I have tried. It might also be a bit premature, as I don't see any official or popular Flux-based inpainting workflows up yet.
Example.Bot
Example.Bot5w ago
Yeah, I've been experimenting with my own serverless ComfyUI setup today and that was my experience as well. A Flux Docker image without extra quantization and whatnot takes forever to build and ends up at 40+ GB, but once it's set up the returned images typically go from zero to loaded and generated within 20 seconds.
SyedAliii
SyedAliii5w ago
@gnarley_farley. I was facing the same issue: if you use a network volume, put everything in it, and then access the network volume from the Docker image, it is extremely slow. If you put everything directly in the Docker image it's very fast, but the image size is very large. I don't see a solution to that right now, though there are techniques to reduce Docker image size.
flash-singh
flash-singh5w ago
It's best to use webhooks where possible; polling is inefficient in general. We have a model cache coming soon so you can pull models from Hugging Face and not embed them in your container image; we will automatically inject the model into your worker.
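For reference, a webhook URL can be passed in the request body when the job is submitted, roughly like this (a sketch; the endpoint ID, API key, webhook URL, and input payload are placeholders):
```python
# Sketch: include a webhook URL with the /run request so RunPod POSTs the job
# result to your server when it completes, instead of you polling for it.
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {"workflow": {}},                      # worker-specific payload
        "webhook": "https://example.com/runpod-hook",   # called with the job result
    },
)
print(resp.json()["id"])  # job id; the final output arrives at the webhook
```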
Encyrption
Encyrption5w ago
This sounds hopeful. How does 'inject the model into your worker' actually happen? Is it done through a volume? I'm wondering how fast it will be.
gnarley_farley.
Is there a serverless endpoint for CogVideoX-5B yet? I see camenduru has one with a GUI. I'm looking for just an API service: send a request, receive back a video. If anyone finds one, please holla.
flash-singh
flash-singh4w ago
Yes, through a read-only volume or folder. Both the container image and the model are downloaded and stored on our NVMe disk for local compute, but also in network storage for caching and to avoid internet traffic in the future.
Encyrption
Encyrption4w ago
That sounds awesome! If it is on the NVMe disk it should be just as fast as baking the model in, without having to have large Docker images. As always, ❤️ your work!
gnarley_farley.
What does the latest version/implementation of webhooks for RunPod look like? I managed to find this link: https://www.answeroverflow.com/m/1206251618022981694
webhooks custom updates - RunPod
Does the job webhook get invoked with runpod.serverless.progress_update calls?
Encyrption
Encyrption4w ago
I use runpod.serverless.progress_update to send updates in real time, but it will NOT update the final status. For the final update you have to include all the data in what you return from the handler. You can then either fetch that data with a /status call or get it from a webhook. <-- all of this assumes you are using the async /run method.
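Roughly what that looks like in a handler (a sketch, not a drop-in implementation; the actual work and the return payload depend on your worker):
```python
# Sketch of an async-style RunPod handler: progress_update sends interim
# updates, and the final data must come from the handler's return value.
import runpod

def handler(job):
    job_input = job["input"]

    runpod.serverless.progress_update(job, "loading model")     # interim update
    # ... load models / build the ComfyUI prompt here ...

    runpod.serverless.progress_update(job, "generating image")  # interim update
    # ... run the workflow here ...

    # Whatever is returned here is what /status (or the webhook) reports
    # as the final COMPLETED payload.
    return {"images": ["<base64 or URL>"], "detail": "done"}

runpod.serverless.start({"handler": handler})
```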
gnarley_farley.
Yeah, I am just using polling with /run currently. Works flawlessly. Not really fazed about a few extra requests popping off to check if it's ready; hardly consequential in my current pipelines.
yhlong00000
yhlong000004w ago
Send a request | RunPod Documentation
Learn how to construct a JSON request body to send to your custom endpoint, including optional inputs for webhooks, execution policies, and S3-compatible storage, to optimize job execution and resource management.
alka_99
alka_992w ago
Is this already released? I've struggled with a Docker image that is 49 GB because I need to put everything inside the image itself and make sure it is all installed.
flash-singh
flash-singh2w ago
not yet
gnarley_farley.
If you get the $5 per month Docker subscription (Pro, I think), you can automatically build the Docker images when you push to GitHub. Makes a huge difference.
nerdylive
nerdylive3d ago
You can do that with GitHub Actions too, or another CI/CD pipeline provider. But yeah, those need some setup and some tech skills; it's a one-time setup unless something related to the image name/repo etc. needs to change.