How to run a quantized model on serverless? I'd like to run the 4/8-bit version of this model:

3WaD · 5d ago
vLLM does support GGUF, but it looks like the official RunPod template doesn't. You'll have to build the image yourself with the quantization set to GGUF. Using the tokenizer from the unquantized model is also recommended. Keep in mind that GGUF is optimized for CPU inference; for GPU you probably want dynamic bitsandbytes instead.
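For context, here is a minimal sketch of what "quantization set to GGUF" plus an unquantized tokenizer looks like when calling vLLM directly. The model path and tokenizer repo are placeholders, not the specific model from the question:

```python
# Sketch: loading a GGUF quant with vLLM, using the tokenizer from the
# unquantized repo as recommended above. Paths/repos below are hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/model-Q4_K_M.gguf",           # local GGUF file (placeholder path)
    tokenizer="org/original-unquantized-model",  # tokenizer from the unquantized model
    quantization="gguf",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```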
codyman4488 (OP) · 5d ago
Or do we need to set env vars to use a quantized model like this?
3WaD · 4d ago
The author of the models states it's not required, but I believe you have to set the correct quantization for any type. The RunPod UI selection is limited to just AWQ, SqueezeLLM, and GPTQ, while vLLM currently supports:
aqlm, awq, deepspeedfp, tpu_int8, fp8, ptpc_fp8, fbgemm_fp8, modelopt, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq, experts_int8, neuron_quant, ipex, quark
So for most quantization types you have to set it yourself with the QUANTIZATION env variable. You can find all the details in the vLLM documentation and the template's source repository.
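As a rough illustration of what that env variable ends up doing inside a worker-vllm-style image, the sketch below simply forwards QUANTIZATION (plus a tokenizer override) into vLLM's engine arguments. QUANTIZATION is the variable discussed above; MODEL_NAME and TOKENIZER_NAME are assumed names for this sketch and may not match the template's exact wiring, so check the repository linked below:

```python
# Sketch only: forwarding template-style env vars into vLLM engine args.
# MODEL_NAME and TOKENIZER_NAME are assumed variable names, not confirmed
# against the official worker-vllm source.
import os
from vllm import LLM

engine_kwargs = {
    "model": os.environ["MODEL_NAME"],               # e.g. a GGUF file or HF repo
    "quantization": os.environ.get("QUANTIZATION"),  # e.g. "gguf", "bitsandbytes", "awq"
}

tokenizer = os.environ.get("TOKENIZER_NAME")         # optional unquantized-tokenizer override
if tokenizer:
    engine_kwargs["tokenizer"] = tokenizer

llm = LLM(**engine_kwargs)
```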
GitHub: runpod-workers/worker-vllm - The RunPod worker template for serving our large language model endpoints, powered by vLLM.
