RunPod · 2mo ago
Bj9000

Serverless quants

Hi, how do you specify a specific GGUF quant file from a Hugging Face repo when configuring a vLLM serverless endpoint? It only seems to let you specify the repo level.
6 Replies
GromitInWA · 2mo ago
I have the same question. Now that vLLM supports GGUF quants, I'm wondering if there's a way to specify the exact file through an environment variable. Also, I'm not sure what format to use for the tokenizer path: is it the full path or just the top-level HF repo for the original model?
Ryan · 2mo ago
Can anyone help on this?
SvenBrnn · 2mo ago
I was also searching for this last week. I ended up giving up, since there's already a ticket about it on the vLLM worker's GitHub: https://github.com/runpod-workers/worker-vllm/issues/98. It doesn't seem to be on their task list anytime soon, so I ended up building my own Ollama-based runner.
GitHub
Support GGUF models · Issue #98 · runpod-workers/worker-vllm
See vllm-project/vllm#1002, vllm-project/vllm#5191. Should be able to set gguf as QUANTIZATION envar, but we also need to provide exact quant. I'm thinking of some MODEL_FILENAME envar containi...
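For reference, what that issue proposes would boil down to a couple of endpoint environment variables. A purely hypothetical sketch of such a config follows; MODEL_FILENAME is only the suggestion from the issue and is not implemented, and the other variable names here are assumptions rather than confirmed worker-vllm settings:

# hypothetical endpoint env vars, per the proposal in issue #98 (NOT implemented in worker-vllm)
MODEL_NAME="bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF"      # HF repo holding the .gguf files
QUANTIZATION="gguf"                                            # per the issue, maps to vLLM's --quantization
MODEL_FILENAME="DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf"     # proposed: the exact quant file to load
TOKENIZER_NAME="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"     # original (unquantized) model repo for the tokenizer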
nerdylive · 4w ago
I think it's not supported yet; try opening an issue in the GitHub repo.
nerdylive · 4w ago
GitHub
Issues · runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - Issues · runpod-workers/worker-vllm
riverfog7 · 2w ago
I do have a running setup. install_requirements.sh:
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
VLLM_USE_NIGHTLY=1

apt-get update

if [ ! -x /usr/bin/sudo ]; then
    apt-get install -y sudo
fi

if [ ! -x /usr/bin/wget ]; then
    sudo apt-get install -y wget
fi

if [ ! -x /usr/bin/screen ]; then
    sudo apt-get install -y screen
fi

if [ ! -x /usr/bin/nvtop ]; then
    sudo apt-get install -y nvtop
fi

# install miniconda
mkdir -p ~/miniconda3
wget $MINICONDA_URL -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/condabin/conda init bash
source ~/.bashrc

# install vllm and dependencies
conda create -n vllm python=3.12 -y
conda activate vllm
python -m pip install --upgrade pip
pip install -U "huggingface_hub[cli]"

if [ $VLLM_USE_NIGHTLY -eq 1 ]; then
    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
else
    pip install vllm
fi
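A minimal sanity check after sourcing the script, assuming the vllm conda env ended up active, would be:

# quick check that the install worked (assumes the "vllm" conda env is active)
python -c "import vllm; print(vllm.__version__)"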
And start_vllm.sh:
VLLM_API_KEY="asdfasdf"
MODEL_REPO="bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF"
ORIG_MODEL="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
MODEL_FILE="DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf"
RANDOM_SEED=42

chmod +x /install_requirements.sh
source /install_requirements.sh
rm /install_requirements.sh

#huggingface-cli login --token $HF_TOKEN
if [ ! -f "/workspace/models/${MODEL_REPO}/${MODEL_FILE}" ]; then
    mkdir -p "/workspace/models/${MODEL_REPO}"
    huggingface-cli download "${MODEL_REPO}" "${MODEL_FILE}" --local-dir "/workspace/models/${MODEL_REPO}"
fi

vllm serve \
    "/workspace/models/${MODEL_REPO}/${MODEL_FILE}" \
    --port 80 --api-key "${VLLM_API_KEY}" \
    --enable-reasoning --reasoning-parser "deepseek_r1" \
    --tokenizer "${ORIG_MODEL}" --kv-cache-dtype "auto" \
    --max-model-len 16384 --pipeline-parallel-size 1 \
    --tensor-parallel-size 2 --seed "${RANDOM_SEED}" \
    --swap-space 4 --cpu-offload-gb 0 --gpu-memory-utilization 0.95 \
    --quantization "gguf" --device "cuda"
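Once it's up, vLLM exposes the usual OpenAI-compatible API; a minimal smoke test from the same pod could look like the following (the served model name defaults to the path that was passed to vllm serve):

# smoke test against the OpenAI-compatible endpoint started above
curl http://localhost:80/v1/chat/completions \
    -H "Authorization: Bearer asdfasdf" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/workspace/models/bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": "Hello"}]
    }'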
Change --tensor-parallel-size to match your GPU count.
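If you'd rather not hard-code that, one option (assuming nvidia-smi is on the PATH, as it is on CUDA pods) is to derive it:

# count the visible GPUs and use that for tensor parallelism
GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
# then pass --tensor-parallel-size "${GPU_COUNT}" to vllm serve instead of the hard-coded 2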
