RunPod · 2mo ago
Sal ✨

Runpod VLLM - How to use GGUF with VLLM

I have this repo mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF and I use this command "--host 0.0.0.0 --port 8000 --max-model-len 37472 --model mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF --dtype bfloat16 --gpu-memory-utilization 0.95 --quantization gguf", but it doesn't work... It says "2024-10-07T20:39:24.964316283Z ValueError: No supported config format found in mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF". I don't have this problem with normal models, only with quantized ones...
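For context, recent vLLM builds do have experimental GGUF support, but they expect a single local .gguf file (and the tokenizer of the original, unquantized model) rather than a whole repo of quants, which would explain the missing-config error. Below is a rough sketch of pulling one quant file with huggingface_hub and building the adjusted flags; the exact filename and the Sao10K base-model repo are assumptions, so check the repo's file list before using them:

from huggingface_hub import hf_hub_download

# Download a single quant file from the repo (filename is illustrative --
# pick a real one from the repo's file listing)
gguf_path = hf_hub_download(
    repo_id="mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF",
    filename="Llama-3.1-8B-Stheno-v3.4.i1-Q4_K_M.gguf",
)

# Pass the local file as --model and point --tokenizer at the assumed base model
# so vLLM can find a config/tokenizer to go with the GGUF weights
print(
    "--host 0.0.0.0 --port 8000 "
    f"--model {gguf_path} "
    "--tokenizer Sao10K/Llama-3.1-8B-Stheno-v3.4 "
    "--quantization gguf"
)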
6 Replies
nerdylive · 2mo ago
Meaning the vLLM doesn't support GGUF yet.
Sal ✨ (OP) · 2mo ago
Can I upgrade the vLLM version in my template? I use the runpod vllm/openai image.
nerdylive · 2mo ago
Hmm, your template? Sure, but RunPod's image is different. I think you need to create a PR to the GitHub repo, or at least open an issue to request that feature. If you have the vLLM docs for it (which don't exist right now), attach the links too.
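If it helps, one quick way to see which vLLM the template actually ships (assuming you can open a Python shell inside the worker) before filing anything:

import importlib.metadata

# Print the vLLM version baked into the image; GGUF support depends on how new it is
print("vLLM version:", importlib.metadata.version("vllm"))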
Encyrption · 2mo ago
Have you tried converting the model so that RunPod can use it? Here is an example in Python for converting it:
import torch

# Load the PyTorch model (assumes the .pth file contains the full model object,
# not just a state_dict)
model = torch.load("your_model.pth")
model.eval()  # switch to inference mode before exporting

# Dummy input for the model export (shape assumes a 224x224 image model)
dummy_input = torch.randn(1, 3, 224, 224)

# Export the model to ONNX
torch.onnx.export(model, dummy_input, "model.onnx", verbose=True)
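If you wanted to sanity-check the exported file, here is a minimal onnxruntime load-and-run sketch, reusing the same 1x3x224x224 dummy shape (assumes onnxruntime is installed):

import numpy as np
import onnxruntime as ort

# Load the exported model and run one dummy forward pass
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {input_name: dummy})
print([o.shape for o in outputs])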
nerdylive · 2mo ago
Wow, model conversion. So does that mean the ONNX model is still quantized?
Encyrption · 2mo ago
I don't know such things... I would think so.