RunPod vLLM - How to use GGUF with vLLM
I have this repo, mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF, and I use this command:
"--host 0.0.0.0 --port 8000 --max-model-len 37472 --model mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF --dtype bfloat16 --gpu-memory-utilization 0.95 --quantization gguf", but it doesn't work.
It says: "2024-10-07T20:39:24.964316283Z ValueError: No supported config format found in mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF"
I don't have this problem with normal models, only with quantized ones.
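For reference, in vLLM builds that do ship GGUF support (it landed as experimental around v0.5.4), the quant repo ID can't be passed directly because GGUF repos usually carry no config.json; you point the model at one downloaded .gguf file instead and take the tokenizer from the unquantized base repo. A rough sketch using the offline LLM API, where the .gguf file name and the base-repo ID are assumptions:

```python
# Rough sketch, assuming a vLLM build with (experimental) GGUF support.
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# GGUF repos usually ship no config.json (hence "No supported config format found");
# download a single quant file and point vLLM at that path instead of the repo ID.
gguf_path = hf_hub_download(
    repo_id="mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF",
    filename="Llama-3.1-8B-Stheno-v3.4.i1-Q4_K_M.gguf",  # hypothetical file name, pick a real one from the repo
)

llm = LLM(
    model=gguf_path,
    tokenizer="Sao10K/Llama-3.1-8B-Stheno-v3.4",  # assumed unquantized base repo for the tokenizer
    quantization="gguf",
    max_model_len=8192,  # keep the KV cache modest; raise if the GPU has room
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

The same pairing applies to the OpenAI server flags, i.e. --model pointing at the downloaded .gguf file and --tokenizer pointing at the base repo.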
Meaning it doesn't support GGUF yet. The vLLM, that is.
Can I upgrade the vLLM version in my template? I use the RunPod vllm/openai one.
Hmm, your template? Sure, but RunPod's image is different, I think. You'd probably need to open a PR on the GitHub repo.
Or at least open an issue to request that feature.
If you find the vLLM docs for it (which don't exist right now), attach the links too.
Have you tried converting the model so that RunPod can use it? Here is example Python for converting it:
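A minimal sketch of such a conversion, assuming it goes through Hugging Face Optimum's ONNX export of the unquantized base model; the model ID below is an assumption, not taken from the thread:

```python
# Minimal sketch: export the unquantized base model to ONNX with Hugging Face Optimum.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "Sao10K/Llama-3.1-8B-Stheno-v3.4"  # assumed base (non-GGUF) repo
output_dir = "stheno-onnx"

# export=True runs the PyTorch -> ONNX conversion while loading the checkpoint
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the exported graph and tokenizer so they can be served later
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```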
Wow, model conversion. So that means the ONNX output is still quantized?
I don't know about such things... I would think so.