RunPod · 3d ago
Bj9000

Serverless quants

Hi, how do you specify a particular GGUF quant file from a Hugging Face repo when configuring a vLLM serverless endpoint? The config only seems to let you specify the repo level, not an individual file.
3 Replies
GromitInWA · 3d ago
I have the same question. Now that vLLM supports quants, I'm wondering if there's a way to specify the exact file through an environment variable. Also, I'm not sure what format to use for the tokenizer path: is it the full path to a file, or just the top-level HF repo of the original model?
Ryan · 3d ago
Can anyone help on this?
SvenBrnn · 2d ago
I was also searching for this last week and ended up giving up, as there's already a ticket about it on the vLLM worker's GitHub: https://github.com/runpod-workers/worker-vllm/issues/98. It doesn't seem to be on their task list anytime soon, so I ended up building my own Ollama-based runner instead.
GitHub
Support GGUF models · Issue #98 · runpod-workers/worker-vllm
See vllm-project/vllm#1002, vllm-project/vllm#5191. Should be able to set gguf as QUANTIZATION envar, but we also need to provide exact quant. I'm thinking of some MODEL_FILENAME envar containi...
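For anyone hitting this outside the RunPod worker: plain vLLM can serve a single GGUF file if you download that exact file first and point `--tokenizer` at the original, unquantized model repo. A minimal CLI sketch — the repo and file names here are just examples, not anything specific to this thread:

```shell
# Download one specific quant file from the GGUF repo (not the whole repo).
# huggingface-cli ships with the huggingface_hub package (pip install huggingface_hub).
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
    tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir .

# Serve that exact file. The tokenizer should point at the original
# (unquantized) model's top-level HF repo, since GGUF files don't carry
# a usable HF tokenizer config.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

If that's right, it also suggests an answer to the tokenizer question above: it's the top-level HF repo of the original model, not a path into the GGUF repo.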
