RunPod vLLM - How to use GGUF with vLLM
I have this repo, mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF, and I use this command:
"--host 0.0.0.0 --port 8000 --max-model-len 37472 --model mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF --dtype bfloat16 --gpu-memory-utilization 0.95 --quantization gguf", but it doesn't work.
It says: "2024-10-07T20:39:24.964316283Z ValueError: No supported config format found in mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF"
I don't have this problem with normal models, only with quantized ones.
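For reference, in vLLM builds that do ship GGUF support (it landed as experimental around v0.5.4), the quant repo ID can't be passed directly because GGUF repos usually carry no config.json; you point the model at one downloaded .gguf file instead and take the tokenizer from the unquantized base repo. A rough sketch using the offline LLM API, where the .gguf file name and the base-repo ID are assumptions:

```python
# Rough sketch, assuming a vLLM build with (experimental) GGUF support.
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# GGUF repos usually ship no config.json (hence "No supported config format found");
# download a single quant file and point vLLM at that path instead of the repo ID.
gguf_path = hf_hub_download(
    repo_id="mradermacher/Llama-3.1-8B-Stheno-v3.4-i1-GGUF",
    filename="Llama-3.1-8B-Stheno-v3.4.i1-Q4_K_M.gguf",  # hypothetical file name, pick a real one from the repo
)

llm = LLM(
    model=gguf_path,
    tokenizer="Sao10K/Llama-3.1-8B-Stheno-v3.4",  # assumed unquantized base repo for the tokenizer
    quantization="gguf",
    max_model_len=8192,  # keep the KV cache modest; raise if the GPU has room
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

The same pairing applies to the OpenAI server flags, i.e. --model pointing at the downloaded .gguf file and --tokenizer pointing at the base repo.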
Meaning it doesn't support GGUF yet. The vLLM, that is.
Can I upgrade the vLLM version in my template? I use the RunPod vllm/openai one.
Hmm, your template? Sure, but RunPod's image is different, I think. You'd probably need to open a PR on the GitHub repo.
Or at least open an issue to request that feature.
If you find the vLLM docs for it (which don't exist right now), attach the links too.
Have you tried converting the model so that RunPod can use it? Here is example Python for converting it:
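A minimal sketch of such a conversion, assuming it goes through Hugging Face Optimum's ONNX export of the unquantized base model; the model ID below is an assumption, not taken from the thread:

```python
# Minimal sketch: export the unquantized base model to ONNX with Hugging Face Optimum.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "Sao10K/Llama-3.1-8B-Stheno-v3.4"  # assumed base (non-GGUF) repo
output_dir = "stheno-onnx"

# export=True runs the PyTorch -> ONNX conversion while loading the checkpoint
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the exported graph and tokenizer so they can be served later
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```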
Wow, model conversion. So that means the ONNX output is still quantized?
I don't know about such things... I would think so.