Run LLM Model on Runpod Serverless
Hi There,
I have an LLM model that is built into a Docker image, and the image is 40GB+.
I'm wondering, can I mount the model as a volume instead of adding it to the Docker image?
Thanks !
Yes, you can put your model on network storage and load it from there, but it's generally more performant to bake the model into the Docker image because network storage is incredibly slow. Network storage also limits GPU availability.
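For reference, a rough sketch of the network volume approach, assuming the volume shows up in the worker at /runpod-volume and you're loading a Hugging Face style model - the directory name is just a placeholder:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# serverless workers see an attached network volume at /runpod-volume;
# baking the weights into the image skips this (slow) network read entirely
MODEL_DIR = os.getenv("MODEL_DIR", "/runpod-volume/models/my-llm")

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, device_map="auto")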
Is it okay to bake the model into the Docker image @ashleyk?
Does it affect the cold start?
No, loading it from network storage affects cold start more.
How about the image pull strategy - does RunPod cache the image in an internal registry, or does it pull the image every time a worker is spawned?
Your Docker image is cached onto the workers in advance, so it has no impact on cold start times.
Alright thank you, I will try it first
@ashleyk I have tried to set up the serverless endpoint, how do I check the logs of the image pull?
How do I know if a worker successfully pulled the image?
Click on each worker and check. The workers will go "Idle" when they are done pulling the image.
It's stuck on "Initializing"
Does it return an error if the image pull fails?
Let's say I have misconfigured the registry access
Click on the workers to check the logs.
I see, do I get charged while the worker is in the Initializing state?
No, only while the container is running - cold start + execution time.
Wow, okay
If you can share a screenshot of your template, that would also be good
Sometimes people forget the tag, so just double check - it should be something like
username/image:1.0
but some people just write
username/image
Just use our pre-made worker vLLM image and attach a network volume
On startup, the worker will download the model to the network storage, and all the workers will have access to it
The image itself is only 3GB as well, and there's no need to build it
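Once the endpoint is up, you can hit it over the normal serverless REST route - a rough example using the requests library (the endpoint ID is a placeholder, and the exact input fields shown here are an assumption, so check the worker-vllm README for your version):
import os
import requests

# /runsync waits for the result; /run would queue the job asynchronously instead
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello!", "sampling_params": {"max_tokens": 100}}},
    timeout=120,
)
print(resp.json())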
I have successfully run my model, but it needs some adjustments, because inside the container I'm still running FastAPI for the endpoint.
Yeah you don't need FastAPI for serverless, serverless already provides an API layer for you.
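Very roughly, the body of your FastAPI route just becomes the handler - something like this (generate() here is only a stand-in for your own inference code):
import runpod

def generate(prompt):
    # stand-in for your real inference code (the body of your old FastAPI route)
    return f"echo: {prompt}"

def handler(job):
    # job["input"] is whatever the client sends in the "input" field of the request
    return {"output": generate(job["input"]["prompt"])}

# replaces uvicorn/FastAPI - RunPod exposes the HTTP layer for you
runpod.serverless.start({"handler": handler})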
Hi @ashleyk @Alpay Ariyak
I have tried deploying my LLM model on RunPod serverless. The image is over 40GB, and it isn't cost-effective to use Google Artifact Registry since they charge for egress outside of the GCP network. Any recommendations for a container registry?
Thank you
I just use Dockerhub, it's free
But they limit the pull requests right?
Yeah they have rate limiting by IP, but you can use your token to authenticate instead
I use dockerhub in my production serverless endpoints and never had any issues.
I was testing Llama / Mistral models that are close to the 35GB/40GB mark through Dockerhub with no issues - you can add your Docker credentials to RunPod too
Also, Dockerhub has one private repo per account - if your image is something sensitive, you'll obviously need to add your Docker credentials in the RunPod settings
Alright, I will try with dockerhub then, Thank you @ashleyk @justin
Can't believe Google charges you egress for a registry 👁️
I know they do for GCP bucket data, but really for a container registry? Damn
Google, AWS and Azure all charge massive egress costs for everything
I have moved my LLM model to Dockerhub, so I don't get haunted by GCP egress costs lol
I have another question
my cold start (loading the LLM model) is around 15s-30s, any way to optimize it? @ashleyk @justin
1. Enable FlashBoot if you haven't already done so.
2. Load the model outside of runpod.serverless.start() so that it is cached in the worker and not loaded on every single request.
So it's possible to preload the model into the worker?
You can also look at setting Active workers, but you are charged for those.
That's basically what FlashBoot does, but it doesn't really provide any benefit unless you have a constant flow of requests.
I'll try the 2nd option and let you know the result
No
But what he is saying is do:
model = load(model)

def handler():
    model.predict()
the model load will get added to your delay time
but on subsequent requests, if the worker is still active and didn't spin down, it doesn't need to reload the model into memory when it takes other requests
if you had it in function scope, the variable would be reset on every request
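so put together it's roughly this (load_model() / predict() are placeholders for whatever your own code does):
import runpod

# runs once when the worker starts, so warm requests skip the load entirely
model = load_model()  # placeholder for your own loading code

def handler(job):
    # the already-loaded model is reused for every request this worker serves
    return model.predict(job["input"])  # placeholder inference call

runpod.serverless.start({"handler": handler})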
Hi @WillyRL, is there a reason you don't want to use https://github.com/runpod-workers/worker-vllm ? It solves all of your problems already
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.