Running worker automatically once docker image has been pulled
Hello
I'm using a serverless runpod worker to run a comfyui workflow. This is the GitHub repo: https://github.com/blib-la/runpod-worker-comfy
What I'd like to implement is a warmup workflow to be initiated once the image has been successfully pulled. This will ensure my comfyui models will be cached for future runs and save on cold start times.
Do you have any guidance on how to do this? Been messing around for a couple of days and can't get anything to work.
Thanks
Tom
Serverless automatically loads your image onto up to max workers. When you add a job, it automatically starts a worker that is available. What are you trying to do beyond that?
Thanks @flash-singh - I'm trying to run a job as soon as the image is loaded. When a job runs on a new worker for the first time it has to load all the models etc which adds around 30 seconds. If I do this as soon as the image is loaded then the user won't need to experience this wait time as the models will be cached
The model won't be loaded onto the GPU unless you add jobs first. Or you can use active workers: then the workers start first and you can add jobs afterwards.
got it - I just wondered if there was a simple way to queue a job automatically once the container has been setup.
Active workers? But I don't think there's such a thing.
What's your goal? Workers only start if there is a job or active workers are set up. You can queue fake jobs and have your workers warm up early.
Exactly, I want to queue a job that loads all the models I need. I'm just unclear on the best way to initiate the job once the worker is ready.
The goal is to reduce the runtime of the first job requested through my web app. The first job takes about twice as long as subsequent runs.
Just queue a job, or use active workers if you can't predict when the first request from your web app will arrive.
The queued job will wait until at least one worker is idle (ready to take a request). But this won't be effective if the next request from your web server comes a long time after this first one; exactly how long is too long can't be specified.
Ideally I was looking for an automated solution. One of my workers is pretty much always being used but when additional workers are called they're sometimes slow to output a result due to not being used since the docker image was pulled.
The Docker image is pulled only once, either when you create the endpoint or when you update it.
So like running up workers first before they're going to be used?
You can always queue up extra jobs whenever you want extra workers to run (when one worker isn't enough). Or, more intuitively, when your application sees the load increasing, increase active workers using the GraphQL API.
Thanks. But is there a way to auto run a job for that worker once the image has been pulled? I use it for comfyui image generation. The first run takes about 40 secs as the models need to be loaded into the GPU memory, subsequent runs are 15 seconds.
After deploying, just queue a job with a request that'll do that
yeah I just thought there might be a way of doing it automatically. Either that, or I'll have to make them permanent... but then cost goes up
There is.. How would you deploy automatically?
Just use any program or cli to send a proper request to the endpoint with your api keys, I think that's "automatic" already but what kind of automatically were you thinking?
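For anyone following along, here is a minimal sketch of what "send a proper request to the endpoint" could look like from a script. It assumes the usual serverless URL shape `https://api.runpod.ai/v2/<endpoint_id>/run` with a bearer token; the endpoint ID, API key, and the contents of `input` are placeholders, and what goes inside `input` depends entirely on your own handler.

```python
import json
import urllib.request

# Assumed base URL for RunPod serverless endpoints; check the docs for your account.
API_BASE = "https://api.runpod.ai/v2"


def build_run_request(endpoint_id: str, api_key: str, job_input: dict) -> urllib.request.Request:
    """Build (but do not send) an async /run request for a serverless endpoint."""
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint_id}/run",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def queue_job(endpoint_id: str, api_key: str, job_input: dict, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_run_request(endpoint_id, api_key, job_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

You could call `queue_job("<endpoint_id>", "<api_key>", {...})` from a deploy script or a cron job right after updating the endpoint, which is about as "automatic" as it gets without platform support.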
I think you want to remove cold start times and get better startup time. We try to do that with Flashboot, but due to the nature of scale here, it's not possible to do so without someone taking on the cost; otherwise the cost is put on us. Cold starts are charged.
ok - I'm just going to use dedicated active workers for now and try to optimise my docker image. Thanks and appreciate everyone chipping in with ideas.
I am now fighting with cold starts too, and automatically pre-warming the workers with a sample job as soon as they spawn is not ideal, but a good idea. But I guess there's no tool to programmatically get new workers or even their count, nor can you send a job to a specific one.
There's the /health endpoint. Check the docs; you call it just like /run on the endpoint
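A rough sketch of reading worker counts from /health, in case it helps. The response field names used here (`workers.idle`, `workers.running`, `jobs.inQueue`) are my assumption about the payload shape, so verify them against the docs or a real response before relying on them.

```python
import json
import urllib.request


def fetch_health(endpoint_id: str, api_key: str, timeout: float = 10.0) -> dict:
    """GET the endpoint's /health document and return it as a dict."""
    request = urllib.request.Request(
        url=f"https://api.runpod.ai/v2/{endpoint_id}/health",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_health(health: dict) -> dict:
    """Pull out the few counters a warm-up script cares about.

    Field names are assumed; missing fields fall back to 0.
    """
    workers = health.get("workers", {})
    jobs = health.get("jobs", {})
    return {
        "idle": int(workers.get("idle", 0)),
        "running": int(workers.get("running", 0)),
        "queued": int(jobs.get("inQueue", 0)),
    }
```

Note this only gives counts, not worker IDs, so it can tell you that something changed but not which worker is new.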
Also check graphql api
Yeah I guess sending a job to a specific one would be ✨ because we can use one worker specifically for one model that is loaded if we have multiple models
Why is that not ideal but a good idea?
The ideal would be something that doesn't require additional execution cost and is more time-effective, like sharing loaded states between workers via network volume or something similar. Model loading can be optimized with special formats, but things like engine initializations are still a problem.
I am talking about this also here: https://discord.com/channels/912829806415085598/1326321926469189754 Any ideas would be much appreciated
Having a constant flow of traffic, and a new optimization ( for speeding up model loading from ns)
Are you currently loading the model directly from inside the Docker image?
The serverless idea is made around on-demand computation that scales from 0. If I have a constant and predictable flow of traffic I can use a dedicated pod/server. Sure, I am baking all models to my images as long as I can remember. But as you see in the VLLM log, the model loading is not the problem.
Yeah agreed without the autoscaling
That being said, I'm starting to wonder if automatic worker warming is not currently the only possible solution for our cold starts problem. I am already thinking about the code in my head. It would require just adding a bit to the handler and your app server OR making a small utility service for PC, that would periodically fetch the workers to check changes and send the prewarm requests to them. Wanna cooperate on this @testymctestface ?
meaning that it will make your worker always running?
It would ensure that all idle workers would be always pre-warmed with Flashboot. And that's what we need.
" It would require just adding a bit to the handler and your app server OR making a small utility service for PC"
how will this work?
theoretically it seems impossible in my mind unless you've got some flow of requests that is distributed among the workers
A high-level overview of how this would work is adding a warm-up routine to your handler that would not generate anything (or just blank/small data if the AI framework doesn't have manually callable init and generate methods), but just initialize the framework and load the models. It could be set to react to something like . The magic here is, that since Flashboot cached workers are already initialized if you send such a prewarm request to this worker, it would execute in milliseconds and just return something like , while the uninitialized workers would first load and then return it. You would then send the prewarm request as many times as needed to occupy all workers since you can't unfortunately target just the new specific one. The app code or pc utility would then periodically fetch either the /health endpoint or would web-scrape the GUI (if it's not against the TOS and depending on how complex we want to make it) for changes in workers (again, it would be ideal to have some endpoint we can subscribe to that would push data about newly assigned worker to our app but we have to work with what we have). As the changes in workers would be detected, you send the prewarm requests. Just a draft. What do you think?
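To make the handler half of that draft concrete, here is a minimal sketch. The `prewarm` input key, the `{"status": "warm"}` reply, and the `load_models` / `run_inference` helpers are all hypothetical names chosen for illustration, not the values from the original message and not part of any SDK. The model loading is a stand-in for whatever your framework's init looks like.

```python
import time
from typing import Any, Dict, Optional

# Module-level cache: survives between jobs on the same worker process.
_MODELS: Optional[Dict[str, Any]] = None


def load_models() -> Dict[str, Any]:
    """Load models once per worker (placeholder for real framework init)."""
    global _MODELS
    if _MODELS is None:
        _MODELS = {"loaded_at": time.time()}
    return _MODELS


def run_inference(models: Dict[str, Any], job_input: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for the real generation step."""
    return {"output": f"generated for {job_input.get('prompt', '')}"}


def handler(job: Dict[str, Any]) -> Dict[str, Any]:
    """Warm-up aware handler: a prewarm job initializes and returns immediately."""
    job_input = job.get("input", {})
    was_warm = _MODELS is not None
    models = load_models()
    if job_input.get("prewarm"):  # hypothetical flag
        return {"status": "warm", "already_warm": was_warm}
    return run_inference(models, job_input)


# In a real worker you would register it with the RunPod SDK, e.g.:
#   import runpod
#   runpod.serverless.start({"handler": handler})
```

On an already initialized worker the prewarm branch skips straight to the return, which is the "executes in milliseconds" case described above; on a fresh worker it pays the load cost once.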
Well explained, about the same as my suggestion of just sending a request to the endpoint.. Might work, Same as having a constant flow of request but a limitation is you wouldn't know how long flashboot would keep the model warm after.
But yeah, sure, try it. And I guess don't scrape the web UI for data, just use GraphQL or the /health endpoint if it's simpler
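The polling side could be sketched roughly like this. Everything here is an assumption on my part: the function names are made up, `fetch_counts` is whatever returns `{"idle": n, "running": m}` for your endpoint, and `send_prewarm` is whatever queues one warm-up job. It only fires when the total worker count goes up, so it would miss a worker being swapped for another one at the same count.

```python
import time
from typing import Callable, Dict, Optional


def requests_needed(previous_total: Optional[int], current: Dict[str, int]) -> int:
    """Return how many prewarm jobs to queue for this snapshot.

    Workers cannot be targeted individually, so when the total grows we
    send one job per idle worker and let the queue spread them out.
    """
    total = current.get("idle", 0) + current.get("running", 0)
    if previous_total is None or total > previous_total:
        return current.get("idle", 0)
    return 0


def watch(
    fetch_counts: Callable[[], Dict[str, int]],
    send_prewarm: Callable[[], None],
    interval_s: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll worker counts and queue prewarm jobs on growth; returns jobs sent."""
    previous_total: Optional[int] = None
    sent = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch_counts()
        for _ in range(requests_needed(previous_total, current)):
            send_prewarm()
            sent += 1
        previous_total = current.get("idle", 0) + current.get("running", 0)
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
    return sent
```

Comparing totals rather than idle counts avoids a feedback loop where the prewarm jobs themselves flip workers from idle to running and retrigger the watcher.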
Set the scale type to 1 job max per worker, the second one I think, not the delay time so you can spread out the jobs evenly
Well, the Flashboot 'goes away' only with the worker availability, correct? Or does it expire after some time even if the worker is still ready as idle for your endpoint? I am not sure because I am always losing the whole worker and different ones are taking its place and I don't think one ever stayed for so long to test this. But I know the cached workers are prioritised for a new job even when they are idle in the "Extra workers" group and not the "latest".
I'm not sure about that. Maybe they use some kind of formula, maybe an ML or AI model, to decide
I don't think it goes away with "worker availability", not sure of what you mean
@3WaD you using it for text generation? If so your approach sounds ok. I'm using it for image generation and Comfyui which I don't think would work as a warm worker would need to generate an image, which takes about 15 secs.
Their method can probably work by using a smaller resolution, a small model, whatever makes the generation faster
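For ComfyUI specifically, one way to try the "smaller resolution" idea is to derive a cheap warm-up copy of your normal workflow. This sketch assumes the workflow is in ComfyUI's API (JSON) format and uses the stock `EmptyLatentImage` and `KSampler` nodes; custom nodes would need their own handling, and `make_warmup_workflow` is just a name I made up. The checkpoint loader nodes are left untouched, so the same models still get loaded.

```python
import copy
from typing import Any, Dict


def make_warmup_workflow(workflow: Dict[str, Any], size: int = 64, steps: int = 1) -> Dict[str, Any]:
    """Return a copy of an API-format workflow with a tiny latent and few steps."""
    warm = copy.deepcopy(workflow)  # never mutate the production workflow
    for node in warm.values():
        inputs = node.get("inputs", {})
        if node.get("class_type") == "EmptyLatentImage":
            inputs["width"] = size
            inputs["height"] = size
            inputs["batch_size"] = 1
        elif node.get("class_type") == "KSampler":
            inputs["steps"] = steps
    return warm
```

Whether this actually brings the warm-up well under the 15 seconds mentioned above depends on how much of that time is sampling versus everything else, so it is worth timing on a real worker.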