How to run a quantized model on serverless? I'd like to run the 4/8-bit version of this model:

3WaD · 5d ago
vLLM does support GGUF, but it looks like the official RunPod template doesn't. You'll have to build the image yourself with the quantization set to GGUF. Using the tokenizer from the unquantized model is also recommended. Keep in mind that GGUF is optimized for CPU inference; for GPU you probably want dynamic bitsandbytes instead.
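For context, here is a minimal sketch of what "quantization set to GGUF" plus an unquantized tokenizer looks like when calling vLLM directly. The model path and tokenizer repo are placeholders, not the specific model from the question:

```python
# Sketch: loading a GGUF quant with vLLM, using the tokenizer from the
# unquantized repo as recommended above. Paths/repos below are hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/model-Q4_K_M.gguf",           # local GGUF file (placeholder path)
    tokenizer="org/original-unquantized-model",  # tokenizer from the unquantized model
    quantization="gguf",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```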
codyman4488 (OP) · 5d ago
Or do we need to set env vars to use a quantized model like this?
3WaD · 4d ago
The author of the models states it's not required, but I believe you have to set the correct quantization for any type. The RunPod UI selection is limited to just AWQ, SqueezeLLM, and GPTQ, while vLLM currently supports:
aqlm, awq, deepspeedfp, tpu_int8, fp8, ptpc_fp8, fbgemm_fp8, modelopt, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq, experts_int8, neuron_quant, ipex, quark
So for most quantization types you have to set it yourself with the QUANTIZATION env variable. You can find all the details in the vLLM documentation and the template's source repository.
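As a rough illustration of what that env variable ends up doing inside a worker-vllm-style image, the sketch below simply forwards QUANTIZATION (plus a tokenizer override) into vLLM's engine arguments. QUANTIZATION is the variable discussed above; MODEL_NAME and TOKENIZER_NAME are assumed names for this sketch and may not match the template's exact wiring, so check the repository linked below:

```python
# Sketch only: forwarding template-style env vars into vLLM engine args.
# MODEL_NAME and TOKENIZER_NAME are assumed variable names, not confirmed
# against the official worker-vllm source.
import os
from vllm import LLM

engine_kwargs = {
    "model": os.environ["MODEL_NAME"],               # e.g. a GGUF file or HF repo
    "quantization": os.environ.get("QUANTIZATION"),  # e.g. "gguf", "bitsandbytes", "awq"
}

tokenizer = os.environ.get("TOKENIZER_NAME")         # optional unquantized-tokenizer override
if tokenizer:
    engine_kwargs["tokenizer"] = tokenizer

llm = LLM(**engine_kwargs)
```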
GitHub: runpod-workers/worker-vllm - The RunPod worker template for serving our large language model endpoints, powered by vLLM.
