Setting up MODEL_BASE_PATH when building worker-vllm image
I'm a little confused about this parameter in setting up worker-vllm. It seems to default to /runpod-volume, which to me implies a network volume, instead of getting baked into the image, but I'm not sure. A few questions:
1) If set to "/runpod-volume", does this mean that the model will be downloaded to that path automatically, and therefore won't be a part of the image (resulting in a much smaller image)?
2) Will I therefore need to set up a network volume when creating the endpoint?
3) Does the model get downloaded every time workers are created from a cold start? If not, then will I need to "run" a worker for a given amount of time at first to download the model?
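For reference, a rough sketch of the build being asked about, assuming the build-arg names from the worker-vllm README (only MODEL_BASE_PATH and its /runpod-volume default are mentioned above; the other names are assumptions worth checking against the repo):

# Hypothetical bake-the-model-in build. If a model name is passed at build time,
# the weights are downloaded into the image under MODEL_BASE_PATH; setting that
# path to /runpod-volume does not by itself make it a network volume.
sudo docker build -t myrepo/worker-vllm-mixtral:latest \
  --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --build-arg MODEL_BASE_PATH="/runpod-volume" \
  .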
@Alpay Ariyak, any thoughts? If I try to bake in the mixtral-8x7b model, it results in a huge image that I'm having trouble pushing to Docker Hub, so I want to figure out how to set it up with volumes.
The original mixtral-8x7b requires at least 2x A100, so you may want to consider a quantized version, which will also make your Docker image smaller.
@ashleyk Is there one in particular that you recommend?
I struggled with TheBloke/dolphin-2.7-mixtral-8x7b-AWQ, but TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ seems to work well in Oobabooga, and it's also uncensored if you prompt it correctly by bribing it and threatening to kill kittens 😆
TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ can fit into 32GB of VRAM and is not massive.
Do you know how big the models are in terms of total disk space?
You can check on Huggingface
Its around 24GB
Aw man, I think runpod serverless only supports either awq or squeezellm quantization
No, it supports whatever your application supports
I am busy working on this which supports it
https://github.com/ashleykleynhans/runpod-worker-exllamav2
Sorry, I meant that RunPod's "worker-vllm" image only supports AWQ or SqueezeLLM, at least according to these docs: https://github.com/runpod-workers/worker-vllm
If you're making your own then I guess you can do anything really
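For what it's worth, a hedged sketch of a quantized build along those lines, with the build-arg names (MODEL_NAME, QUANTIZATION) assumed from the worker-vllm docs rather than confirmed here:

# Sketch: baking a pre-quantized AWQ model into the worker-vllm image.
# QUANTIZATION is limited to what the worker supports (awq or squeezellm per the
# docs linked above); double-check the README for the current arg names.
sudo docker build -t myrepo/worker-vllm:dolphin-awq \
  --build-arg MODEL_NAME="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ" \
  --build-arg QUANTIZATION="awq" \
  .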
Are the "dolphin" images mainly for coding or are they good at other things?
They are uncensored
And v3 will be good at role play
Solution
Hey, if you are downloading the model at build time, it will be stored in a local folder inside the image under whatever the model base path is.
If you want to download onto the network volume instead, you can either do the first option or build the image without any model-related arguments and specify the env variables mentioned in the docs.
For example:
1. sudo docker build -t xyz:123 . and add the CUDA version arg if you need it (see the sketch after this list).
2. Create an endpoint with 2x80GB GPUs (you might have to request this from our team) and attach a network volume.
3. Specify the model name as Mixtral in the environment variables.
When you send the first request to the endpoint, the worker will download the model to the network volume.
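Putting those steps together, a hedged sketch of the network-volume route (the env var names MODEL_NAME, MODEL_BASE_PATH, and HF_TOKEN are the ones the docs refer to, but verify them against the current README):

# 1) Build without any model-related build args (add the CUDA version arg if needed):
sudo docker build -t xyz:123 .

# 2) Create the endpoint with 2x80GB GPUs and attach a network volume
#    (it is mounted at /runpod-volume inside the worker).

# 3) Set environment variables on the endpoint, for example:
#      MODEL_NAME=mistralai/Mixtral-8x7B-Instruct-v0.1
#      MODEL_BASE_PATH=/runpod-volume
#      HF_TOKEN=<only needed for gated models>

# The first request triggers the download onto the network volume; later cold
# starts load the weights from the volume instead of re-downloading them.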