RunPod•6mo ago
wizardjoe

Setting up MODEL_BASE_PATH when building worker-vllm image

I'm a little confused about this parameter when setting up worker-vllm. It seems to default to /runpod-volume, which to me implies a network volume rather than the model getting baked into the image, but I'm not sure. A few questions:
1) If set to "/runpod-volume", does this mean that the model will be downloaded to that path automatically, and therefore won't be part of the image (resulting in a much smaller image)?
2) Will I therefore need to set up a network volume when creating the endpoint?
3) Does the model get downloaded every time workers are created from a cold start? If not, will I need to "run" a worker for a given amount of time at first to download the model?
wizardjoe•6mo ago
@Alpay Ariyak, any thoughts? If I try to bake in the mixtral-8x7b model, it results in a huge image that I'm having trouble pushing to Docker Hub, so I want to figure out how to set it up with volumes.
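(For reference, a bake-the-model-in build looks roughly like the sketch below. The MODEL_NAME and MODEL_BASE_PATH build args follow the worker-vllm README as it stood at the time, so treat the exact names as assumptions; the image tag is just a placeholder.)
```bash
# Sketch: bake the model into the image at build time.
# The weights are downloaded during the build and stored at MODEL_BASE_PATH
# inside the image, so the image grows by the full size of the model --
# for mixtral-8x7b that means a very large image.
docker build -t yourdockeruser/worker-vllm-mixtral:latest \
  --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --build-arg MODEL_BASE_PATH="/models" \
  .
```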
ashleyk•6mo ago
The original mixtral-8x7b requires at least 2 x A100s. You may want to consider using a quantized one, which will also make your Docker image smaller.
wizardjoe•6mo ago
@ashleyk Is there one in particular that you recommend?
ashleyk•6mo ago
I struggled with TheBloke/dolphin-2.7-mixtral-8x7b-AWQ, but TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ seems to work well in Oobabooga, and it's also uncensored if you prompt it correctly by bribing it and threatening to kill kittens 😆 The GPTQ version can fit into 32GB of VRAM and is not massive.
wizardjoe•6mo ago
Do you know how big the models are in terms of total disk space?
ashleyk•6mo ago
You can check on Hugging Face
ashleyk•6mo ago
It's around 24GB
wizardjoe•6mo ago
Aw man, I think RunPod serverless only supports either AWQ or SqueezeLLM quantization
ashleyk•6mo ago
No, it supports whatever your application supports
ashleyk•6mo ago
I am busy working on this, which supports it: https://github.com/ashleykleynhans/runpod-worker-exllamav2
wizardjoe•6mo ago
Sorry, I meant that the "worker-vllm" image RunPod provides only supports AWQ or SqueezeLLM, at least according to these docs: https://github.com/runpod-workers/worker-vllm
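(A hedged sketch of what building against a pre-quantized model would look like, assuming the QUANTIZATION build arg documented in that README and reusing the AWQ model mentioned above; the tag is a placeholder:)
```bash
# Sketch: build worker-vllm against an already-quantized AWQ model.
# QUANTIZATION tells vLLM which quantization scheme the weights use,
# and a quantized model keeps the baked-in image much smaller.
docker build -t yourdockeruser/worker-vllm-mixtral-awq:latest \
  --build-arg MODEL_NAME="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ" \
  --build-arg QUANTIZATION="awq" \
  .
```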
wizardjoe•6mo ago
If you're making your own, then I guess you can do anything really. Are the "dolphin" models mainly for coding, or are they good at other things?
ashleyk•6mo ago
They are uncensored. And v3 will be good at role play.
Solution
Alpay Ariyak•6mo ago
Hey, if you are downloading the model at build time, it will create a local folder within the image at whatever the model base path is and store the model there. If you want to download onto the network volume instead, you can either do the first option or build the image without any model-related arguments, specifying the env variables mentioned in the docs. For example:
1. sudo docker build -t xyz:123 . (and add the CUDA version arg if you need it)
2. Create an endpoint with 2x80GB (you might have to request it from our team) and attach a network volume
3. Specify the model name as mixtral in the environment variables
When you send the first request to the endpoint, the worker will download the model to the network volume.
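(Putting those steps together, a minimal sketch of the network-volume flow. The MODEL_NAME and MODEL_BASE_PATH variable names follow the worker-vllm docs discussed above; the CUDA build-arg name, its value, and the specific Mixtral repo are assumptions for illustration.)
```bash
# 1. Build the image with no model-related build args, so no weights are baked in
#    and the image stays small:
sudo docker build -t xyz:123 .
#    (optionally add a CUDA version arg if you need a specific one, e.g.
#     --build-arg WORKER_CUDA_VERSION=12.1.0 -- arg name assumed from the docs)

# 2. Create the endpoint with 2x80GB GPUs and attach a network volume
#    (mounted at /runpod-volume inside the worker).

# 3. Set the model via the endpoint's environment variables, e.g.:
#      MODEL_NAME=mistralai/Mixtral-8x7B-Instruct-v0.1
#      MODEL_BASE_PATH=/runpod-volume
# On the first request, the worker downloads the model to the network volume;
# subsequent cold starts reuse the weights already on the volume instead of
# downloading them again.
```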