Best way to deploy a new LLM on serverless when I don't want to build large Docker images
With the infrastructure I have come across on RunPod, there is not much serverless support for quickly copying model weights from a local data centre. Can I get some suggestions on how I should plan my deployment? Building large Docker images, pushing them to a registry, and then having the server download them at cold start takes a massive amount of time. Help will be appreciated.
8 Replies
If you want to build images for really large models faster, I use Depot:
https://discord.com/channels/912829806415085598/1194693049897463848
You can leverage remote infrastructure to build and push, which is great compared to being bottlenecked locally
The other thing I would do is divide your iterations across two Dockerfiles:
https://discord.com/channels/912829806415085598/1194695853026328626
You can have one base image with all your dependencies + weights
And a second one that uses the first image but just copies any new handler.py over
That way your builds stay fast when you iterate
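A minimal sketch of that split, assuming a runpod/pytorch base and weights baked in with huggingface_hub (the exact base tag, registry name, and model id below are placeholders; swap in your own):

```dockerfile
# Dockerfile.base -- dependencies + model weights; rebuilt rarely
# The tag is an assumption; check Docker Hub for current runpod/pytorch tags.
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

RUN pip install --no-cache-dir runpod transformers huggingface_hub

# Bake the weights into the image so workers never download them at runtime.
# The model id is a placeholder; swap in your own.
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Mistral-7B-Instruct-v0.2', local_dir='/models/llm')"
```

```dockerfile
# Dockerfile.serverless -- rebuilt on every handler iteration (fast)
# "myregistry/llm-base" is a placeholder for the image built from Dockerfile.base.
FROM myregistry/llm-base:latest
COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]
```

The first build is slow because it pulls the weights, but after that, iterating on handler.py only rebuilds the tiny second image.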
The first time a worker initializes should be the only time you see that slow initialization; otherwise the image should already be cached on RunPod's end
But if you want to avoid it altogether, you can use a network volume with serverless.
The drawback here is that loading large LLMs from a network volume can be very slow
But if it were me, the way I would do it is:
1) Build a Dockerfile FROM the runpod/pytorch template + dependencies + model weights; this is also GPU Pod compatible, which is great because I can test and run it on a GPU Pod to make sure all my logic in handler.py is right, without the runpod.start() call.
2) Build a second Dockerfile with a handler.py that has the runpod.start() call (I forgot this once and kept wondering why my handler.py was hanging, lol); a minimal sketch follows this list.
3) Use Depot to get faster build times / remote infrastructure to build on + a large cache
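As a sketch of step 2, a bare-bones handler.py looks roughly like this (the inference part is a placeholder; the documented entry point in the RunPod Python SDK is runpod.serverless.start, which the steps above shorten to runpod.start()):

```python
# handler.py -- minimal serverless handler sketch
import runpod

# Load your model once here, at import time, so it is reused across requests
# (loading code omitted; it depends on your framework).

def handler(job):
    prompt = job["input"].get("prompt", "")
    # Run inference here; echoing the prompt back is just a stand-in.
    return {"output": prompt}

# Without this call the worker never picks up jobs and just appears to hang.
runpod.serverless.start({"handler": handler})
```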
We have a very fast serverless vLLM worker container that doesn’t require you to build any images
You can simply use our pre-built image and customize it with environment variables in the template when you deploy
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
https://github.com/runpod-workers/worker-vllm
If you attach a network volume to the endpoint, the model you specify with an environment variable will be downloaded there once at startup, and all workers will be able to use it
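For example, the endpoint template's environment variables might look roughly like this (the variable names are from memory of the worker-vllm README and may have changed, so double-check the repo; the model id is a placeholder):

```
# Hugging Face model to serve (placeholder id)
MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2
# Only needed for gated/private models
HF_TOKEN=<your Hugging Face token>
```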
@justin your suggestion seems to be to create separate pods for the model and the code, but that would just double the cost versus only using serverless and downloading the model from Hugging Face/GitHub repos. Thanks for the Depot suggestion, it looks interesting for working with Docker
@Alpay Ariyak thank you for the suggestion, this is the kind of thing I was looking for. It reduces Docker time and uses serverless, but if I have a large model, won't the worker download it every time it is sent a request?
If you have a network volume attached, there will be one download on the first request, then all workers can always access that downloaded model. If you don't have a network volume attached, each worker will need to download the model
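Roughly, the pattern inside the worker looks like this (assuming the network volume is mounted at /runpod-volume, which you should verify for your endpoint; the model id is a placeholder):

```python
import os
from huggingface_hub import snapshot_download

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder
CACHE_DIR = "/runpod-volume/models"               # assumed network-volume mount path

def get_model_path() -> str:
    os.makedirs(CACHE_DIR, exist_ok=True)
    # snapshot_download skips files that already exist in the cache, so the
    # first worker pays the download cost and later workers reuse the copy.
    return snapshot_download(MODEL_ID, cache_dir=CACHE_DIR)
```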
I am not.
I am saying that serverless is really just a Dockerfile like the one made for a GPU Pod, plus a handler.py
So you can create an image that is, let's say:
FROM a RunPod base image
download weights
download dependencies
(And ONLY if you want) you can use it on a GPU Pod to test; this is entirely up to you, and you can shut down a GPU Pod at any time
Then all you need to do for serverless is:
FROM my previous image
COPY handler.py
That way, when you are building the second image, you do not need to go through the process of constantly redownloading models and weights
just to iterate on the handler.py
Then you can take that second image and deploy it to serverless
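As an illustration, the build loop might look like this (the image and registry names are placeholders; the Dockerfile names match the sketch earlier in the thread):

```
# Build and push the heavy base image once (slow, but rare)
docker build -f Dockerfile.base -t myregistry/llm-base:latest .
docker push myregistry/llm-base:latest

# Iterate on the handler: this rebuild only copies handler.py, so it is fast
docker build -f Dockerfile.serverless -t myregistry/llm-serverless:v2 .
docker push myregistry/llm-serverless:v2
```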
This does not double your cost on RunPod.
I'm simply describing how to construct Docker images more efficiently so you can deploy to RunPod and not have to rebuild and redownload weights every time you want to add a small print() statement or something to your handler.py
Downloading of Docker images has nothing to do with cold start. The Docker images are downloaded to your workers ahead of time, and the workers sit idle waiting for requests, so image downloads have zero impact on cold start time.