RunPod•5mo ago

Running worker automatically once docker image has been pulled

Hello I'm using a serverless runpod worker to run a comfyui workflow. This is the GitHub repo: https://github.com/blib-la/runpod-worker-comfy What I'd like to implement is a warmup workflow to be initiated once the image has been successfully pulled. This will ensure my comfyui models will be cached for future runs and save on cold start times. Do you have any guidance on how to do this? Been messing around for a couple of days and can't get anything to work. Thanks Tom

GitHub

GitHub - blib-la/runpod-worker-comfy: ComfyUI as a serverless API o...

ComfyUI as a serverless API on RunPod. Contribute to blib-la/runpod-worker-comfy development by creating an account on GitHub.

48 Replies

flash-singh•5mo ago

serverless auto loads your images on max workers, when you add a job it will auto start one that is available, what are you trying to do beyond that?

testymctestfaceOP•5mo ago

Thanks @flash-singh - I'm trying to run a job as soon as the image is loaded. When a job runs on a new worker for the first time it has to load all the models etc which adds around 30 seconds. If I do this as soon as the image is loaded then the user won't need to experience this wait time as the models will be cached

flash-singh•5mo ago

model wont be loaded onto gpu without you adding jobs first or you can use active workers then workers will start first and you can add jobs

testymctestfaceOP•4mo ago

got it - I just wondered if there was a simple way to queue a job automatically once the container has been setup.

Jason•4mo ago

active workers?, but i dont think theres such thing

flash-singh•4mo ago

whats your goal? workers only start if there is job or active workers is setup, you can queue fake jobs and have your workers warm up early

testymctestfaceOP•4mo ago

Exactly, I want to queue a job that loads all the models I need. I'm just unclear the best way to initiate the the job once the worker is ready. The goal is to reduce the runtime of the first job requested through my web app. The first job takes about twice as long as subsequent runs.

Jason•4mo ago

Just queue a job or active workers if you can't predict when's the first request from your web app The queued job will wait until at least one will be idle (ready to take a request), but this won't be effective if your subsequent request from the web server is a long time relative after this first request, the absolute time cannot be specified more

testymctestfaceOP•4mo ago

Ideally I was looking for an automated solution. One of my workers is pretty much always being used but when additional workers are called they're sometimes slow to output a result due to not being used since the docker image was pulled.

yhlong00000•4mo ago

The Docker image is pulled only once, either when you create the endpoint or when you update it.

Jason•4mo ago

So like running up workers first before they're going to be used? You can always queue up extra jobs whenever you want it to run an extra workers ( when one worker isn't enough) or more intuitive way when your application feels like the load is increasing, increase active workers using graphql api

testymctestfaceOP•4mo ago

Thanks. But is there a way to auto run a job for that worker once the image has been pulled? I use it for comfyui image generation. The first run takes about 40 secs as the models need to be loaded into the GPU memory, subsequent runs are 15 seconds.

Jason•4mo ago

Just queue a job after deploying a request that'll do that

testymctestfaceOP•4mo ago

yeah I just thought there might be a way of doing it automatically. Either that, or I'll have to make them permanent... but then cost goes up

Jason•4mo ago

There is.. How would you deploy automatically? Just use any program or cli to send a proper request to the endpoint with your api keys, I think that's "automatic" already but what kind of automatically were you thinking?

flash-singh•4mo ago

i think you want to remove cold start times and get better startup time, we try to do that with flashboot but due to nature of scale here, its not possible to do so without taking on the cost, otherwise the cost is put on us, cold starts are charged

testymctestfaceOP•4mo ago

ok - I'm just going to use dedicated active workers for now and try to optimise my docker image. Thanks and appreciate everyone chipping in with ideas.

3WaD•4mo ago

I am now fighting with cold-start too and automatically pre-warming the workers with sample job as soon as they spawn is not ideal, but good idea. But I guess there's no tool to programatically get new workers or even their count, neither you can send the job to the specific one.

Jason•4mo ago

/health endpoint, Check docs just like /run in endpoints Also check graphql api Yeah I guess sending a job to a specific one would be ✨ because we can use one worker specifically for one model that is loaded if we have multiple models Why is that not ideal but a good idea?

3WaD•4mo ago

The ideal would be something that doesn't require additional execution cost and is more time-effective, like sharing loaded states between workers via network volume or something similar. Model loading can be optimized with special formats, but things like engine initializations are still a problem. I am talking about this also here: https://discord.com/channels/912829806415085598/1326321926469189754 Any ideas would be much appreciated

Jason•4mo ago

Having a constant flow of traffic, and a new optimization ( for speeding up model loading from ns) You're using currently from an model inside the docker image directly?

3WaD•4mo ago

The serverless idea is made around on-demand computation that scales from 0. If I have a constant and predictable flow of traffic I can use a dedicated pod/server. Sure, I am baking all models to my images as long as I can remember. But as you see in the VLLM log, the model loading is not the problem.

Jason•4mo ago

Yeah agreed without the autoscaling

3WaD•4mo ago

That being said, I'm starting to wonder if automatic worker warming is not currently the only possible solution for our cold starts problem. I am already thinking about the code in my head. It would require just adding a bit to the handler and your app server OR making a small utility service for PC, that would periodically fetch the workers to check changes and send the prewarm requests to them. Wanna cooperate on this @testymctestface ?

Jason•4mo ago

meaning that it will make your worker always running?

3WaD•4mo ago

It would ensure that all idle workers would be always pre-warmed with Flashboot. And that's what we need.

Jason•4mo ago

" It would require just adding a bit to the handler and your app server OR making a small utility service for PC" how will this work? theorotically it seems impossible in my mind unless you've got some flow of request that is distributed among the workers

3WaD•4mo ago

A high-level overview of how this would work is adding a warm-up routine to your handler that would not generate anything (or just blank/small data if the AI framework doesn't have manually callable init and generate methods), but just initialize the framework and load the models. It could be set to react to something like

{"input":{"prewarm": true}}

{"input":{"prewarm": true}}

. The magic here is, that since Flashboot cached workers are already initialized if you send such a prewarm request to this worker, it would execute in milliseconds and just return something like

{"output":{"warm": true}, ...}

{"output":{"warm": true}, ...}

, while the uninitialized workers would first load and then return it. You would then send the prewarm request as many times as needed to occupy all workers since you can't unfortunately target just the new specific one. The app code or pc utility would then periodically fetch either the /health endpoint or would web-scrape the GUI (if it's not against the TOS and depending on how complex we want to make it) for changes in workers (again, it would be ideal to have some endpoint we can subscribe to that would push data about newly assigned worker to our app but we have to work with what we have). As the changes in workers would be detected, you send the prewarm requests. Just a draft. What do you think?

Jason•4mo ago

Well explained, about the same as my suggestion of just sending a request to the endpoint.. Might work, Same as having a constant flow of request but a limitation is you wouldn't know how long flashboot would keep the model warm after. But yeah sure try it, and I guess don't scrape the web hi for data, just use graphql or /health endpoint if it's simpler Set the scale type to 1 job max per worker, the second one I think, not the delay time so you can spread out the jobs evenly

3WaD•4mo ago

Well, the Flashboot 'goes away' only with the worker availability, correct? Or does it expire after some time even if the worker is still ready as idle for your endpoint? I am not sure because I am always losing the whole worker and different ones are taking its place and I don't think one ever stayed for so long to test this. But I know the cached workers are prioritised for a new job even when they are idle in the "Extra workers" group and not the "latest".

Jason•4mo ago

I'm not sure about that, maybe they use some kind of formula, maybe an Ml, or Ai model.. To decide I don't think it goes away with "worker availability", not sure of what you mean

testymctestfaceOP•4mo ago

@3WaD you using it for text generation? If so your approach sounds ok. I'm using it for image generation and Comfyui which I don't think would work as a warm worker would need to generate an image, which takes about 15 secs.

Jason•4mo ago

Their method canprobably work by using a smaller resolution. Small model whatever makes the generation faster

3WaD•4mo ago

You can load both the server and models without full generation in Comfy. Sample workflow for SDXL would be using nodes: "Load Checkpoint" + "Empty Latent Image" (with minimum 16x16 size) -> "VAE Decode" where samples are connected from Latent and VAE from the model -> "Preview Image" which connects from VAE Decode. This only loads the model and almost instantly (few ms) returns 16x16 latent.

flash-singh•4mo ago

yes thats about right, we can't freeze worker for flashboot if capacity is needed for an active workload

3WaD•4mo ago

Thanks for the confirmation! I've already added the warm-up method to my image, and I will test whether it is enough to send periodic warm-up requests to the endpoint to keep it cached. If effective, I'll start adding this to all my community RunPod implementations for users who don't have enough traffic. Well even for those who do when I think about it. It would really be great to have an auto-prewarm option added so it's part of the initialization state of the worker (of course billed), so no job and end-user can ever experience the cold-start.

flash-singh•4mo ago

we already have this planned its labeled as "active flashboot" which will be charged, it will likely be closed beta for many months

Jason•4mo ago

what differs that with active workers then?

3WaD•4mo ago

As I imagine, the active workers are "dedicated," which means they basically run nonstop as a pod would. That's why you pay for them this whole time. This active flash boot, on the other hand, is the same as a normal flash boot. The worker can disappear at any time since it's not dedicated to your endpoint. The only major change is that it automatically loads the models and engines on worker initialization, and that's what we want. Correct me if I am wrong.

Jason•4mo ago

Hmm okay so the worker will be initialized like after a request? But normally all endpoints load models and engines on job start ( which initializes workers)

3WaD•4mo ago

Yes, but only after they receive the job. And that results in the cold-start. And this happens every time the workers are shifted. If it would happen automatically on their initialization, theoretically you would never experience the cold-start on a job itself.

Jason•4mo ago

I see

flash-singh•4mo ago

yeah close enough its now called priority flashboot

testymctestfaceOP•3mo ago

@flash-singh would love to be part of the beta tests if possible. I can DM you my runpod account details

Justin•3mo ago

Any chance to get in this beta? ❤️

Ardha•2mo ago

is there any update about this?

Jason•2mo ago

not yet i guess, they'll announce it in email and #📢｜announcements

flash-singh•2mo ago

we are going live with priority flashboot later this month but its meant to be a hidden feature we enable for endpoints with high volume, learn from that and eventually figure out how we can price it

Gaming

Programming

Running worker automatically once docker image has been pulled

Did you find this page helpful?