Setting up MODEL_BASE_PATH when building worker-vllm image
I'm a little confused about this parameter in setting up worker-vllm. It seems to default to /runpod-volume, which to me implies a network volume, instead of getting baked into the image, but I'm not sure. A few questions:
1) If set to "/runpod-volume", does this mean that the model will be downloaded to that path automatically, and therefore won't be a part of the image (resulting in a much smaller image)?
2) Will I therefore need to set up a network volume when creating the endpoint?
3) Does the model get downloaded every time workers are created from a cold start? If not, then will I need to "run" a worker for a given amount of time at first to download the model?
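For reference, a rough sketch of the build being asked about, assuming the build-arg names from the worker-vllm README (only MODEL_BASE_PATH and its /runpod-volume default are mentioned above; the other names are assumptions worth checking against the repo):

# Hypothetical bake-the-model-in build. If a model name is passed at build time,
# the weights are downloaded into the image under MODEL_BASE_PATH; setting that
# path to /runpod-volume does not by itself make it a network volume.
sudo docker build -t myrepo/worker-vllm-mixtral:latest \
  --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --build-arg MODEL_BASE_PATH="/runpod-volume" \
  .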
@Alpay Ariyak, any thoughts? If I try to bake in the mixtral-8x7b model, it results in a huge image that I'm having trouble pushing to Docker Hub, so I want to figure out how to set it up with volumes.
The original mixtral-8x7b requires at least 2x A100, so you may want to consider a quantized version, which will also make your Docker image smaller.
@ashleyk Is there one in particular that you recommend?
I struggled with TheBloke/dolphin-2.7-mixtral-8x7b-AWQ, but TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ seems to work well in Oobabooga, and it's also uncensored if you prompt it correctly by bribing it and threatening to kill kittens 😆
TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ can fit into 32GB of VRAM and is not massive.
Do you know how big the models are in terms of total disk space?
You can check on Huggingface
Its around 24GB
Aw man, I think runpod serverless only supports either awq or squeezellm quantization
No, it supports whatever your application supports
I am busy working on this which supports it
https://github.com/ashleykleynhans/runpod-worker-exllamav2
Sorry, I meant that RunPod's "worker-vllm" image only supports AWQ or SqueezeLLM, at least according to these docs: https://github.com/runpod-workers/worker-vllm
If you're making your own then I guess you can do anything really
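For what it's worth, a hedged sketch of a quantized build along those lines, with the build-arg names (MODEL_NAME, QUANTIZATION) assumed from the worker-vllm docs rather than confirmed here:

# Sketch: baking a pre-quantized AWQ model into the worker-vllm image.
# QUANTIZATION is limited to what the worker supports (awq or squeezellm per the
# docs linked above); double-check the README for the current arg names.
sudo docker build -t myrepo/worker-vllm:dolphin-awq \
  --build-arg MODEL_NAME="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ" \
  --build-arg QUANTIZATION="awq" \
  .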
Are the "dolphin" images mainly for coding or are they good at other things?
They are uncensored
And v3 will be good at role play
Solution
Hey, if you are downloading the model at build time, it will be stored in a local folder inside the image under whatever the model base path is.
If you want to download onto the network volume instead, you can either do the first option or build the image without any model-related arguments and specify the env variables mentioned in the docs.
For example:
1. sudo docker build -t xyz:123 . and add the CUDA version arg if you need it (see the sketch after this list).
2. Create an endpoint with 2x80GB GPUs (you might have to request this from our team) and attach a network volume.
3. Specify the model name as Mixtral in the environment variables.
When you send the first request to the endpoint, the worker will download the model to the network volume.
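Putting those steps together, a hedged sketch of the network-volume route (the env var names MODEL_NAME, MODEL_BASE_PATH, and HF_TOKEN are the ones the docs refer to, but verify them against the current README):

# 1) Build without any model-related build args (add the CUDA version arg if needed):
sudo docker build -t xyz:123 .

# 2) Create the endpoint with 2x80GB GPUs and attach a network volume
#    (it is mounted at /runpod-volume inside the worker).

# 3) Set environment variables on the endpoint, for example:
#      MODEL_NAME=mistralai/Mixtral-8x7B-Instruct-v0.1
#      MODEL_BASE_PATH=/runpod-volume
#      HF_TOKEN=<only needed for gated models>

# The first request triggers the download onto the network volume; later cold
# starts load the weights from the volume instead of re-downloading them.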