RunPod · 2mo ago
Bj9000

Serverless quants

Hi, how do you specify a specific GGUF quant file from a Hugging Face repo when configuring a vLLM serverless endpoint? It only seems to let you specify the repo level.
6 Replies
GromitInWA · 2mo ago
I have the same question. Now that vLLM supports GGUF quants, I'm wondering if there's a way to specify the exact file through an environment variable. Also, I'm not sure what format to use for the tokenizer path: is it the full path or just the top-level HF repo for the original model?
Ryan · 2mo ago
Can anyone help on this?
SvenBrnn · 2mo ago
I was also searching for this last week. I ended up giving up, since there's already a ticket about it on the vLLM worker's GitHub: https://github.com/runpod-workers/worker-vllm/issues/98. It doesn't seem to be on their task list anytime soon, so I ended up building my own Ollama-based runner.
GitHub
Support GGUF models · Issue #98 · runpod-workers/worker-vllm
See vllm-project/vllm#1002, vllm-project/vllm#5191. Should be able to set gguf as QUANTIZATION envar, but we also need to provide exact quant. I'm thinking of some MODEL_FILENAME envar containi...
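For reference, what that issue proposes would boil down to a couple of endpoint environment variables. A purely hypothetical sketch of such a config follows; MODEL_FILENAME is only the suggestion from the issue and is not implemented, and the other variable names here are assumptions rather than confirmed worker-vllm settings:

# hypothetical endpoint env vars, per the proposal in issue #98 (NOT implemented in worker-vllm)
MODEL_NAME="bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF"      # HF repo holding the .gguf files
QUANTIZATION="gguf"                                            # per the issue, maps to vLLM's --quantization
MODEL_FILENAME="DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf"     # proposed: the exact quant file to load
TOKENIZER_NAME="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"     # original (unquantized) model repo for the tokenizer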
nerdylive · 4w ago
I think it's not supported yet; try opening an issue in the GitHub repo.
nerdylive · 4w ago
GitHub
Issues · runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - Issues · runpod-workers/worker-vllm
riverfog7 · 2w ago
I do have a running setup. install_requirements.sh:
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
VLLM_USE_NIGHTLY=1

apt-get update

if [ ! -x /usr/bin/sudo ]; then
    apt-get install -y sudo
fi

if [ ! -x /usr/bin/wget ]; then
    sudo apt-get install -y wget
fi

if [ ! -x /usr/bin/screen ]; then
    sudo apt-get install -y screen
fi

if [ ! -x /usr/bin/nvtop ]; then
    sudo apt-get install -y nvtop
fi

# install miniconda
mkdir -p ~/miniconda3
wget $MINICONDA_URL -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/condabin/conda init bash
source ~/.bashrc

# install vllm and dependencies
conda create -n vllm python=3.12 -y
conda activate vllm
python -m pip install --upgrade pip
pip install -U "huggingface_hub[cli]"

if [ $VLLM_USE_NIGHTLY -eq 1 ]; then
    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
else
    pip install vllm
fi
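A minimal sanity check after sourcing the script, assuming the vllm conda env ended up active, would be:

# quick check that the install worked (assumes the "vllm" conda env is active)
python -c "import vllm; print(vllm.__version__)"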
And start_vllm.sh:
VLLM_API_KEY="asdfasdf"
MODEL_REPO="bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF"
ORIG_MODEL="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
MODEL_FILE="DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf"
RANDOM_SEED=42

chmod +x /install_requirements.sh
source /install_requirements.sh
rm /install_requirements.sh

#huggingface-cli login --token $HF_TOKEN
if [ ! -f "/workspace/models/${MODEL_REPO}/${MODEL_FILE}" ]; then
    mkdir -p "/workspace/models/${MODEL_REPO}"
    huggingface-cli download "${MODEL_REPO}" "${MODEL_FILE}" --local-dir "/workspace/models/${MODEL_REPO}"
fi

vllm serve \
    "/workspace/models/${MODEL_REPO}/${MODEL_FILE}" \
    --port 80 --api-key "${VLLM_API_KEY}" \
    --enable-reasoning --reasoning-parser "deepseek_r1" \
    --tokenizer "${ORIG_MODEL}" --kv-cache-dtype "auto" \
    --max-model-len 16384 --pipeline-parallel-size 1 \
    --tensor-parallel-size 2 --seed "${RANDOM_SEED}" \
    --swap-space 4 --cpu-offload-gb 0 --gpu-memory-utilization 0.95 \
    --quantization "gguf" --device "cuda"
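Once it's up, vLLM exposes the usual OpenAI-compatible API; a minimal smoke test from the same pod could look like the following (the served model name defaults to the path that was passed to vllm serve):

# smoke test against the OpenAI-compatible endpoint started above
curl http://localhost:80/v1/chat/completions \
    -H "Authorization: Bearer asdfasdf" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/workspace/models/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "Hello"}]
    }'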
Change --tensor-parallel-size to match your GPU count.
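If you'd rather not hard-code that, one option (assuming nvidia-smi is on the PATH, as it is on CUDA pods) is to derive it:

# count the visible GPUs and use that for tensor parallelism
GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
# then pass --tensor-parallel-size "${GPU_COUNT}" to vllm serve instead of the hard-coded 2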
