GGUF in serverless vLLM
How do I run a GGUF quantized model?
I need to run this LLM: https://huggingface.co/mradermacher/OpenBioLLM-Llama3-70B-GGUF
What parameters should I specify?
Thank you
You will have to create your own serverless handler for it; the vLLM worker does not support GGUF because the underlying vLLM engine doesn't support it.
Such a shame that it doesn't. Can I run Ollama in serverless?
You can run whatever you want in serverless as long as you implement the RunPod serverless handler
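For reference, here is a minimal sketch of what such a handler could look like, assuming the `runpod` Python SDK and an Ollama server already running inside the worker image (the model tag, port, and timeout are illustrative assumptions, not a tested setup):

```python
# handler.py -- minimal RunPod serverless handler that forwards a prompt
# to an Ollama server assumed to be running inside the same worker image.
import requests
import runpod

# Assumed local Ollama endpoint; adjust to however your image starts Ollama.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def handler(job):
    prompt = job["input"].get("prompt", "")
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "openbiollm-llama3-70b",  # placeholder model tag
            "prompt": prompt,
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return {"output": resp.json().get("response", "")}

# Register the handler with the RunPod serverless runtime.
runpod.serverless.start({"handler": handler})
```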
Is there any template that supports it? If yes, then it can, and you can use that template.
If not, it still can, but you'll have to make your own template.
By templates, do you mean the very limited ones from "quick deploy", or any template that can be run in a normal pod, like in the screenshot? I can't select the ollama/ollama template for a serverless deployment.
You can't use pod templates in serverless; they don't work the same way. For serverless you need to invoke the serverless handler, as I mentioned above.
Where can I browse community templates for serverless? There has to be someone that already did this
Not sure why you're asking this; serverless templates aren't shared publicly like pod ones are.
The only serverless template available is the vllm one that RunPod created.
You can't. @nerdylive must have been confused.
https://discord.com/channels/912829806415085598/1221249312495898675
I found this thread. I will probably need to configure it myself. Thank you for your help.
yep, there's no place on the site for sharing community serverless templates yet
@Papa Madiator has this list of Open Source things tho:
https://github.com/kodxana/Awesome-RunPod
I made a template with Open WebUI
For a pod though, not serverless, isn't it? He is looking for a serverless solution.
oh
Run an Ollama Server on a RunPod CPU | RunPod Documentation
CPU inference isn't good enough, but thank you
You can also do it with GPU view: https://discord.com/channels/912829806415085598/1221249312495898675/1246110846720151674
@Armyk If you follow that guide, but just select GPU, you’ll get the same results.
[Core] Support loading GGUF model by Isotr0py · Pull Request #5191: adds support for loading GGUF-format models (and adds gguf to requirements); currently only Llama is modified.
Seems vLLM will have GGUF support soon!
I just created an account today and am looking at the serverless vLLM quick deploy settings. If GGUF isn't supported, what's this thing? I don't see a bpw / quant level setting.
The vLLM engine does not support GGUF; see the messages above.
I'm asking what that dropdown menu does.
It does what it says: select the quantization type.
It's not the quant level, it's the quant TYPE
Docs are available here:
https://github.com/runpod-workers/worker-vllm
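For context on that setting: the quick-deploy form maps to environment variables that worker-vllm reads. Roughly like this (variable names are taken from the worker-vllm README; the repo name and values below are illustrative assumptions, not a recommendation):

```python
# Rough sketch of the endpoint settings the quick-deploy form maps to.
# Variable names come from the worker-vllm README; values are placeholders.
endpoint_env = {
    "MODEL_NAME": "someuser/OpenBioLLM-Llama3-70B-AWQ",  # hypothetical pre-quantized repo
    "QUANTIZATION": "awq",      # the "quantization type" dropdown: e.g. awq or gptq
    "MAX_MODEL_LEN": "8192",    # optional context-length cap
}
```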
[redacted] I'm hoping not to need 360GB of VRAM to run an 8x22B.
Edit: Oh wait, that just means I can point serverless at a name/model-AWQ-or-GPTQ repository.
>two
looks like I wouldn't be able to run it anyway
Edit: I'll have to quantize this obscure model...
Which model? Many models are already available as quantized versions
https://huggingface.co/gghfez/WizardLM-2-8x22B-Beige
This one caught my interest. No idea if it's good though.
I see there are EXL2 quantized versions, but vLLM doesn't support the EXL2 quant type
Aphrodite Engine and TabbyAPI both support EXL2 tho.
Until vLLM supports more quant formats, you'll have to have an AWQ, SqueezeLLM, or GPTQ quant of the model. I used a Jupyter pod to make an AWQ of the model I wanted. Or if Aphrodite-Engine ever works on serverless, that will be an option too.
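For anyone wanting to reproduce that, here is a rough sketch of an AWQ conversion in a Jupyter pod, assuming the AutoAWQ library (the paths and quant_config are illustrative defaults, not necessarily what was used here):

```python
# Rough sketch: quantize a HF model to AWQ inside a Jupyter pod (assumes AutoAWQ is installed).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "gghfez/WizardLM-2-8x22B-Beige"   # source repo from the link above
quant_path = "WizardLM-2-8x22B-Beige-AWQ"      # local output directory

# Typical 4-bit AWQ settings; adjust as needed.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```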
I ended up using KoboldCpp's RunPod template for GGUF, lol. And sharing it with some people to spend less time idling. (I'm being an idiot, yes.)
If all else fails, just run the numbers to see if serverless will be better for your use cases. There's an amount of active time where pods become more cost efficient.
On last-gen 48 GB cards, for example, it's 40% active runtime (actively processing requests)
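As a concrete back-of-the-envelope check (all prices here are made-up placeholders; plug in current RunPod rates):

```python
# Serverless vs. pod break-even, using placeholder prices -- substitute real RunPod rates.
pod_price_per_hr = 0.50          # hypothetical 48 GB pod, billed the whole time it runs
serverless_price_per_hr = 1.25   # hypothetical 48 GB serverless, billed only while active

# Serverless stays cheaper while your active fraction is below this ratio.
break_even_active_fraction = pod_price_per_hr / serverless_price_per_hr
print(f"Break-even at {break_even_active_fraction:.0%} active runtime")  # -> 40%
```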
GGUF is a format for offline use on your own computer; it's not really meant for servers. Use AWQ or GPTQ until EXL2 is supported in vLLM.
GGUF supports both CPU and GPU, not just CPU
Yeah, and it's not the fastest though.
EXL2 is the fastest.
Yeah I just hope vLLM would one day support EXL2. It would open up so many new opportunities.
aphrodite engine supports it
Yes, but Aphrodite runs only on classic pods, and it's very expensive to run. 🙂 This is why I love serverless: it's cheap to begin with (but gets about 3x more expensive if you have constant traffic). Serverless is great for starting a project with minimal traffic; only if the project is a success and can generate money is it worth switching to a classic pod with Aphrodite.
You can port aphrodite to serverless too
But it will be experimental, right? I don't think it's that easy, and I'm so busy with coding as it is. 🙂
What's experimental?
He is probably referring to aphrodite-engine. It's not experimental; TabbyAPI is:
"TabbyAPI is a hobby project solely for a small amount of users. It is not meant to run on production servers. For that, please look at other backends that support those workloads."
No, I mean right now I can create a vLLM serverless endpoint directly from the RunPod dashboard.
The same isn't true for Aphrodite as serverless.
Hence I assume the latter is experimental.
As in, there is no turnkey solution or "Just Type This" solution to deploying Aphrodite Engine on serverless.
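If someone did want to try it, the same handler pattern as the Ollama sketch above would likely apply, since Aphrodite Engine exposes an OpenAI-compatible API. A rough, untested sketch, assuming the container entrypoint has already started Aphrodite on a local port (the port and model name are placeholders):

```python
# Rough sketch: RunPod serverless handler proxying to a locally running
# Aphrodite Engine server (OpenAI-compatible API). Port and model are placeholders.
import requests
import runpod

APHRODITE_URL = "http://127.0.0.1:2242/v1/completions"  # assumed local port

def handler(job):
    payload = {
        "model": job["input"].get("model", "your-exl2-model"),  # placeholder
        "prompt": job["input"].get("prompt", ""),
        "max_tokens": job["input"].get("max_tokens", 256),
    }
    resp = requests.post(APHRODITE_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

runpod.serverless.start({"handler": handler})
```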
Ohhh ya. There's no quick deploy for it yet, but vLLM also isn't really production-ready for big models; it has some bugs too.
True enough. I guess can we really say that any FOSS solution is 'production ready' right now?
I'm not sure if vLLM is, but RunPod's quick deploy still has some bugs.
Oh, some of them are fixed, yay: https://github.com/runpod-workers/worker-vllm/issues/29
Errors cause the instance to run indefinitely · Issue #29 · runpod-workers/worker-vllm: any errors caused by the payload cause the instance to hang in an error state indefinitely, so you have to terminate it manually or rack up a hefty bill.