Best way to deploy a new LLM on serverless without building large Docker images

From the RunPod infrastructure I have come across, there is not much serverless support for quickly copying model weights over from a local data centre. Can I get some suggestions on how I should plan my deployment? Building large Docker images, uploading them to a registry, and then having the server download them at cold start takes a massive amount of time. Help will be appreciated.
8 Replies
justin · 10mo ago
If you want to build really large images faster, I use depot: https://discord.com/channels/912829806415085598/1194693049897463848 You can do everything on remote infrastructure to build and push, which is great vs being bottlenecked locally.

The other thing is I would divide your iterations into two Dockerfiles: https://discord.com/channels/912829806415085598/1194695853026328626 You can have one base image with all your dependencies + weights, and a second one that uses the first image but just copies any new handler.py over. That way you don't have slow builds every time you iterate (there's a rough sketch right after this message).

The first time a worker initializes should be the only time you get that slow initialization; otherwise it should be cached on RunPod's end. But if you want to avoid it altogether, you can use a network volume with serverless. The drawback there is that loading large LLMs from a network volume can be very slow.

If it were me, the way I do it is:
1) Build a Dockerfile FROM the runpod/pytorch template + dependencies + model weights. This is also GPU Pod compatible, which is great because I can run it on a GPU Pod first to make sure all my logic in handler.py is right, without the runpod.start() call.
2) Build a second Dockerfile that adds a handler.py with the runpod.start() call (I forgot this once and kept wondering why my handler.py was hanging lol).
3) Use depot for faster build times / remote infra to build + large caching.
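A rough sketch of the two-Dockerfile split described above, assuming a Hugging Face-hosted model. The base image tag, the repo id your-org/your-model, and the image name you/llm-base are placeholders, not anything from the thread:

```dockerfile
# Dockerfile.base -- rebuilt rarely: dependencies + model weights baked in
# (tag is illustrative; pick a current runpod/pytorch tag)
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

RUN pip install --no-cache-dir runpod huggingface_hub transformers accelerate

# Bake the weights into the image so workers never re-download them.
# "your-org/your-model" is a placeholder Hugging Face repo id.
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('your-org/your-model', local_dir='/models/your-model')"
```

```dockerfile
# Dockerfile.serverless -- rebuilt often: only the handler changes
FROM you/llm-base:latest

COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]
```

With depot, the build command is meant to be a drop-in for docker build (something like depot build -t you/llm-base --push -f Dockerfile.base .), but check their docs for the exact invocation.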
Alpay Ariyak · 10mo ago
We have a very fast serverless vLLM worker container that doesn’t require you to build any images. You can simply use our pre-built image and customize it with environment variables in the template when you deploy.
Alpay Ariyak · 10mo ago
GitHub: runpod-workers/worker-vllm - The RunPod worker template for serving large language model endpoints. Powered by vLLM. https://github.com/runpod-workers/worker-vllm
Alpay Ariyak · 10mo ago
If you attach a network volume to the endpoint, it will download the model you specify with an environment variable there once at startup and all workers will be able to use it
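For concreteness, the endpoint template ends up looking roughly like this. The variable names and values below are assumptions for illustration; check the worker-vllm README for the current image tag and supported environment variables:

```
Container image:       runpod/worker-vllm:<tag from the repo's README>
Environment variables:
  MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2   # example HF repo to serve
  HF_TOKEN=<token>                                # only for gated/private models
Network volume:        attached, so the model is downloaded once and shared by all workers
```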
codeRetarded (OP) · 10mo ago
@justin your suggestion seems to be creating separate pods for the model and the code, but that would just double the cost compared to only using serverless and downloading the model from Hugging Face/GitHub repos. Thanks for the depot suggestion, it looks interesting for working with Docker.

@Alpay Ariyak thank you for the suggestion, this is the kind of thing I was looking for. It cuts the Docker build time and uses serverless, but if I have a large model, won't the worker download it every time it is sent a request?
Alpay Ariyak · 10mo ago
If you have a network volume attached, there will be one download on the first request, then all workers can always access that downloaded model. If you don't have a network volume attached, each worker will need to download the model
justin · 10mo ago
I am not. I am saying that serverless is really just the Dockerfile you would make for a GPU Pod, plus a handler.py. So you can create one image that, say: starts FROM a runpod base image, downloads the weights, and downloads the dependencies. And ONLY if you want, you can use that image on a GPU Pod to test; this is entirely up to you, and you can shut down a GPU Pod at any time.

Then all you need to do for serverless is: FROM my previous image, COPY handler.py. That way, when you build the second image, you do not need to keep re-downloading models and weights just to iterate on the handler.py. You then take that second image and deploy it to serverless.

This does not double your cost on RunPod. I'm simply describing how to construct the Docker images more efficiently, so you can deploy to RunPod and not have to rebuild and re-download the weights every time you want to add a small print() statement or something to your handler.py.
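For reference, a minimal sketch of the kind of handler.py being iterated on here; the input shape and generation logic are placeholders. In the Python SDK the entrypoint is runpod.serverless.start, which is the "runpod.start() call" mentioned above:

```python
# handler.py - minimal serverless handler sketch (placeholder inference logic)
import runpod

def handler(job):
    prompt = job["input"].get("prompt", "")
    # ... run inference here against the weights baked into the base image ...
    return {"output": f"echo: {prompt}"}

# Forgetting this call means the worker just hangs, as noted above.
runpod.serverless.start({"handler": handler})
```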
ashleyk · 10mo ago
Downloading of Docker images has nothing to do with cold start. The Docker images are downloaded to your workers ahead of time, and the workers sit idle waiting for requests, so it has zero impact on cold start time.