How to run a quantized model on serverless? I'd like to run the 4/8-bit version of this model:
vLLM does support GGUF, but it looks like the official RunPod template doesn't. You'll have to build the image yourself with quantization set to GGUF. Using the tokenizer from the unquantized model is also recommended.
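For illustration, here is a minimal sketch of that setup using vLLM's offline Python API; the .gguf path and tokenizer name are placeholders, not the model from the question, and the RunPod worker exposes the equivalent settings via its build args / env variables rather than Python:

```python
from vllm import LLM, SamplingParams

# Placeholder paths: point at your downloaded GGUF file and use the
# tokenizer of the matching unquantized base model (recommended, since
# GGUF tokenizer conversion is lossy/slow).
llm = LLM(
    model="./model-q4_k_m.gguf",   # local GGUF checkpoint (placeholder)
    tokenizer="org/base-model",    # unquantized model's tokenizer (placeholder)
    quantization="gguf",
)
# For GPU-friendly on-the-fly quantization, vLLM also supports
# quantization="bitsandbytes" (see the note below).

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```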
Keep in mind that GGUF is optimized for CPU inference. For GPU you probably want to use dynamic bitsandbytes.

Understood, so this should run out of the box?
https://huggingface.co/neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic
Or do we need to set env vars to use a quantized model like this?
The author of the model states it's not required, but I believe you have to set the correct quantization for any type. The RunPod UI selection is limited to only AWQ, SqueezeLLM, and GPTQ, while vLLM itself currently supports many more methods. So for most quantization types you have to set it yourself with the QUANTIZATION env variable.
You can find all the info in the vLLM documentation and in the template source repository, runpod-workers/worker-vllm (https://github.com/runpod-workers/worker-vllm).
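As a rough sketch (not the worker's actual code, which builds its engine args in a more involved way), this is the kind of mapping the QUANTIZATION env variable implies; MODEL_NAME is assumed here as the companion variable for the model id:

```python
import os

from vllm import LLM

# Hypothetical illustration of endpoint env vars flowing into vLLM engine
# arguments. The model id below is just the example from this thread.
model_name = os.environ.get(
    "MODEL_NAME", "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic"
)
quantization = os.environ.get("QUANTIZATION")  # e.g. "awq", "gptq", "gguf", "fp8"

llm = LLM(
    model=model_name,
    # None lets vLLM auto-detect the method from the model's own config,
    # which may be why the model author says setting it isn't required.
    quantization=quantization,
)
```

On the RunPod side this just means adding QUANTIZATION (alongside the model id) as an environment variable on the serverless endpoint if the UI doesn't list the method you need.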