GGUF in serverless vLLM
How do I run a GGUF quantized model?
I need to run this LLM: https://huggingface.co/mradermacher/OpenBioLLM-Llama3-70B-GGUF
What parameters should I specify?
Thank you
You will have to create your own serverless handler for it; the vLLM worker does not support GGUF because the underlying vLLM engine doesn't support it.
Such a shame that it doesn't. Can I run Ollama in serverless?
You can run whatever you want in serverless as long as you implement the RunPod serverless handler
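For reference, here is a minimal sketch of what such a handler could look like, assuming the `runpod` Python SDK and an Ollama server already running inside the worker image (the model tag, port, and timeout are illustrative assumptions, not a tested setup):

```python
# handler.py -- minimal RunPod serverless handler that forwards a prompt
# to an Ollama server assumed to be running inside the same worker image.
import requests
import runpod

# Assumed local Ollama endpoint; adjust to however your image starts Ollama.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def handler(job):
    prompt = job["input"].get("prompt", "")
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "openbiollm-llama3-70b",  # placeholder model tag
            "prompt": prompt,
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return {"output": resp.json().get("response", "")}

# Register the handler with the RunPod serverless runtime.
runpod.serverless.start({"handler": handler})
```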
Is there any template that supports it? If yes, then it can, and you can use that template.
If not, it still can, but you'll have to make your own template.
By templates, do you mean the very limited ones from "quick deploy", or any template that can be run in a normal pod, like in the screenshot? I can't select the ollama/ollama template for a serverless deployment.
You can't use pod templates in serverless; they don't work the same way. For serverless you need to invoke the serverless handler, as I mentioned above.
Where can I browse community templates for serverless? There has to be someone that already did this
Not sure why you're asking this; serverless templates aren't shared publicly like pod ones are.
The only serverless template available is the vllm one that RunPod created.
You can't. @nerdylive must have been confused.
https://discord.com/channels/912829806415085598/1221249312495898675
I found this thread. I will probably need to configure it myself. Thank you for your help.
yep, there's no place on the site for sharing community serverless templates yet
@Papa Madiator has this list of Open Source things tho:
https://github.com/kodxana/Awesome-RunPod
I made a template with Open WebUI
For a pod though, not serverless, isn't it? He is looking for a serverless solution.
oh
Run an Ollama Server on a RunPod CPU | RunPod Documentation
CPU inference isn't good enough, but thank you
You can also do it with GPU view: https://discord.com/channels/912829806415085598/1221249312495898675/1246110846720151674
@Armyk If you follow that guide, but just select GPU, you’ll get the same results.
[Core] Support loading GGUF model by Isotr0py · Pull Request #5191: adds support for loading GGUF-format models (and adds gguf to requirements); currently only Llama is modified.
Seems vLLM will have GGUF support soon!
I just created an account today and am looking at the serverless vLLM quick deploy settings. If GGUF isn't supported, what's this thing? I don't see a bpw / quant level setting.
The vLLM engine does not support GGUF; see the messages above.
I'm asking what that dropdown menu does.
It does what it says: select the quantization type.
It's not the quant level, it's the quant TYPE
Docs are available here:
https://github.com/runpod-workers/worker-vllm
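For context on that setting: the quick-deploy form maps to environment variables that worker-vllm reads. Roughly like this (variable names are taken from the worker-vllm README; the repo name and values below are illustrative assumptions, not a recommendation):

```python
# Rough sketch of the endpoint settings the quick-deploy form maps to.
# Variable names come from the worker-vllm README; values are placeholders.
endpoint_env = {
    "MODEL_NAME": "someuser/OpenBioLLM-Llama3-70B-AWQ",  # hypothetical pre-quantized repo
    "QUANTIZATION": "awq",      # the "quantization type" dropdown: e.g. awq or gptq
    "MAX_MODEL_LEN": "8192",    # optional context-length cap
}
```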
[redacted] I'm hoping not to need 360GB of VRAM to run an 8x22B.
Edit: Oh wait, that just means I can point serverless at a name/model-AWQ-or-GPTQ repository.
>two
looks like I wouldn't be able to run it anyway
Edit: I'll have to quantize this obscure model...
Which model? Many models are already available as quantized versions
https://huggingface.co/gghfez/WizardLM-2-8x22B-Beige
This one caught my interest. No idea if it's good though.
I see there are EXL2 quantized versions, but vLLM doesn't support the EXL2 quant type
Aphrodite Engine and TabbyAPI both support EXL2 tho.
Until vLLM supports more quant formats, you'll have to have an AWQ, SqueezeLLM, or GPTQ quant of the model. I used a Jupyter pod to make an AWQ of the model I wanted. Or if Aphrodite-Engine ever works on serverless, that will be an option too.
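For anyone wanting to reproduce that, here is a rough sketch of an AWQ conversion in a Jupyter pod, assuming the AutoAWQ library (the paths and quant_config are illustrative defaults, not necessarily what was used here):

```python
# Rough sketch: quantize a HF model to AWQ inside a Jupyter pod (assumes AutoAWQ is installed).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "gghfez/WizardLM-2-8x22B-Beige"   # source repo from the link above
quant_path = "WizardLM-2-8x22B-Beige-AWQ"      # local output directory

# Typical 4-bit AWQ settings; adjust as needed.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```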
I ended up using KoboldCpp's RunPod template for GGUF, lol. And sharing it with some people to spend less time idling. (I'm being an idiot, yes.)
If all else fails, just run the numbers to see if serverless will be better for your use cases. There's an amount of active time where pods become more cost efficient.
On last-gen 48 GB cards, for example, it's 40% active runtime (actively processing requests)
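As a concrete back-of-the-envelope check (all prices here are made-up placeholders; plug in current RunPod rates):

```python
# Serverless vs. pod break-even, using placeholder prices -- substitute real RunPod rates.
pod_price_per_hr = 0.50          # hypothetical 48 GB pod, billed the whole time it runs
serverless_price_per_hr = 1.25   # hypothetical 48 GB serverless, billed only while active

# Serverless stays cheaper while your active fraction is below this ratio.
break_even_active_fraction = pod_price_per_hr / serverless_price_per_hr
print(f"Break-even at {break_even_active_fraction:.0%} active runtime")  # -> 40%
```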
GGUF is a format for offline use on your own computer; it's not really meant for servers. Use AWQ or GPTQ until EXL2 is supported in vLLM.
GGUF supports both CPU and GPU, not just CPU
Yeah, and it's not the fastest though.
EXL2 is the fastest.
Yeah I just hope vLLM would one day support EXL2. It would open up so many new opportunities.
aphrodite engine supports it
Yes, but Aphrodite runs only on classic pods, and it's very expensive to run. 🙂 This is why I love serverless: it's cheap to begin with (but gets about 3x more expensive if you have constant traffic). Serverless is great for starting a project with minimal traffic; only if the project is a success and can generate money is it worth switching to a classic pod with Aphrodite.
You can port aphrodite to serverless too
But it will be experimental, right? I don't think it's that easy, and I'm so busy with coding as it is. 🙂
What's experimental?
He is probably referring to aphrodite-engine. It's not experimental; TabbyAPI is:
"TabbyAPI is a hobby project solely for a small amount of users. It is not meant to run on production servers. For that, please look at other backends that support those workloads."
No, I mean right now I can create a vLLM serverless endpoint directly from the RunPod dashboard.
The same isn't true for Aphrodite as serverless.
Hence I assume the latter is experimental.
As in, there is no turnkey solution or "Just Type This" solution to deploying Aphrodite Engine on serverless.
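If someone did want to try it, the same handler pattern as the Ollama sketch above would likely apply, since Aphrodite Engine exposes an OpenAI-compatible API. A rough, untested sketch, assuming the container entrypoint has already started Aphrodite on a local port (the port and model name are placeholders):

```python
# Rough sketch: RunPod serverless handler proxying to a locally running
# Aphrodite Engine server (OpenAI-compatible API). Port and model are placeholders.
import requests
import runpod

APHRODITE_URL = "http://127.0.0.1:2242/v1/completions"  # assumed local port

def handler(job):
    payload = {
        "model": job["input"].get("model", "your-exl2-model"),  # placeholder
        "prompt": job["input"].get("prompt", ""),
        "max_tokens": job["input"].get("max_tokens", 256),
    }
    resp = requests.post(APHRODITE_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

runpod.serverless.start({"handler": handler})
```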
Ohhh ya. There's no quick deploy for it yet, but vLLM also isn't really production-ready for big models; it has some bugs too.
True enough. I guess can we really say that any FOSS solution is 'production ready' right now?
I'm not sure if vLLM is, but RunPod's quick deploy still has some bugs.
Oh, some of them are fixed, yay: https://github.com/runpod-workers/worker-vllm/issues/29
Errors cause the instance to run indefinitely · Issue #29 · runpod-workers/worker-vllm: any errors caused by the payload cause the instance to hang in an error state indefinitely, so you have to terminate it manually or rack up a hefty bill.