RunPod6mo ago
Armyk

GGUF in serverless vLLM

How do I run a GGUF quantized model? I need to run this LLM: https://huggingface.co/mradermacher/OpenBioLLM-Llama3-70B-GGUF What parameters should I specify? Thank you
49 Replies
digigoblin
digigoblin6mo ago
You will have to create your own serverless handler for it, because the vLLM worker does not support GGUF; the underlying vLLM engine doesn't support it.
Armyk
ArmykOP6mo ago
Such a shame that it doesn't. Can I run Ollama in serverless?
digigoblin
digigoblin6mo ago
You can run whatever you want in serverless as long as you implement the RunPod serverless handler
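The handler itself is just a Python function you register with the `runpod` SDK. A minimal sketch (the inference part is a placeholder you'd replace with your GGUF runtime of choice):
```python
# Minimal RunPod serverless handler sketch.
# The model call is a placeholder; swap in llama.cpp, Ollama, etc.
import runpod

def handler(job):
    # job["input"] carries the JSON you send to the endpoint's /run or /runsync route
    prompt = job["input"].get("prompt", "")
    # ... run GGUF inference here ...
    return {"output": f"echo: {prompt}"}

runpod.serverless.start({"handler": handler})
```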
nerdylive
nerdylive6mo ago
Is there any template that supports it? If yes, then it can be done and you can use that; if not, it can still be done, but you'll have to make your own template.
Armyk
ArmykOP6mo ago
By templates, do you mean the very limited ones from "quick deploy", or any template that can be run in a normal pod, like in the screenshot? I can't input the ollama/ollama template on the serverless deployment.
[screenshot attached]
digigoblin
digigoblin6mo ago
You can't use pod templates in serverless; they don't work the same way. You need to invoke the serverless handler, as I mentioned above.
Armyk
ArmykOP6mo ago
Where can I browse community templates for serverless? There has to be someone that already did this
digigoblin
digigoblin6mo ago
Not sure why you're asking this; serverless templates aren't shared publicly like pod ones, so you can't browse them. The only serverless template available is the vLLM one that RunPod created. @nerdylive must have been confused.
Armyk
ArmykOP6mo ago
https://discord.com/channels/912829806415085598/1221249312495898675 I found this thread. I will probably need to configure it myself. Thank you for your help.
nerdylive
nerdylive6mo ago
Yep, there's no place for sharing community templates for serverless on the site yet.
digigoblin
digigoblin6mo ago
@Papa Madiator has this list of Open Source things tho: https://github.com/kodxana/Awesome-RunPod
GitHub
GitHub - kodxana/Awesome-RunPod: A curated list of amazing RunPod p...
A curated list of amazing RunPod projects, libraries, and resources - kodxana/Awesome-RunPod
Madiator2011
Madiator20116mo ago
I made a template with Open WebUI
digigoblin
digigoblin6mo ago
For a pod though, not serverless, isn't it? He is looking for a serverless solution.
Madiator2011
Madiator20116mo ago
oh
PatrickR
PatrickR6mo ago
Run an Ollama Server on a RunPod CPU | RunPod Documentation
Learn to set up and run an Ollama server on RunPod CPU for inference with this step-by-step tutorial.
Armyk
ArmykOP6mo ago
CPU inference isn't good enough, but thank you
PatrickR
PatrickR6mo ago
@Armyk If you follow that guide, but just select GPU, you’ll get the same results.
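If you wanted a serverless version instead, one possible shape is a thin handler that forwards requests to a local Ollama server. This is only a sketch: it assumes Ollama is already installed and running inside the worker container on its default port, and that the model was pulled at startup.
```python
# Sketch: serverless handler that proxies to a local Ollama server.
# Assumes Ollama is already running in the container and the model is pulled.
import requests
import runpod

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def handler(job):
    payload = {
        "model": job["input"].get("model", "llama3"),  # model name is just an example
        "prompt": job["input"].get("prompt", ""),
        "stream": False,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return {"output": resp.json().get("response", "")}

runpod.serverless.start({"handler": handler})
```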
Alpay Ariyak
Alpay Ariyak6mo ago
GitHub
[Core] Support loading GGUF model by Isotr0py · Pull Request #5191 ...
FILL IN THE PR DESCRIPTION HERE Related issue: #1002 Features: This PR adds support for loading GGUF format model This PR will also add gguf to requirements. Currently, only llama is modified for ...
Alpay Ariyak
Alpay Ariyak6mo ago
Seems vLLM will have GGUF support soon!
guestavius
guestavius6mo ago
I just created an account today and am looking at the serverless vLLM quick deploy settings. If GGUF isn't supported, what's this thing? I don't see a bpw / quant level setting.
[screenshot attached]
digigoblin
digigoblin6mo ago
The vLLM engine does not support GGUF; see the messages above.
guestavius
guestavius6mo ago
I'm asking what that dropdown menu does.
digigoblin
digigoblin6mo ago
It does what it says: select the quantization type. It's not a quant level, it's a quant TYPE.
digigoblin
digigoblin6mo ago
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
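For what it's worth, that dropdown corresponds to the engine's `quantization` argument, not a bits-per-weight setting. In plain vLLM it looks roughly like this (the AWQ repo name below is just an example):
```python
# What the "quantization type" setting maps to in the vLLM engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # example AWQ-quantized repo
    quantization="awq",                     # the quant TYPE, not a bpw level
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32)))
```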
guestavius
guestavius6mo ago
[redacted] I'm hoping not to need 360GB of VRAM to run an 8x22B. Edit: Oh wait, that just means I can point serverless at a name/model-AWQ-or-GPTQ repository.
guestavius
guestavius6mo ago
>two
Looks like I wouldn't be able to run it anyway. Edit: I'll have to quantize this obscure model...
[screenshot attached]
digigoblin
digigoblin6mo ago
Which model? Many models are already available as quantized versions
guestavius
guestavius6mo ago
https://huggingface.co/gghfez/WizardLM-2-8x22B-Beige This one caught my interest. No idea if it's good though.
digigoblin
digigoblin6mo ago
I see there are EXL2 quantized versions, but vLLM doesn't support the EXL2 quant type. Aphrodite Engine and TabbyAPI both support EXL2 though.
Charixfox
Charixfox6mo ago
Until vLLM supports more quant formats, you'll have to have an AWQ, SqueezeLLM, or GPTQ quant of the model. I used a Jupyter pod to make an AWQ of the model I wanted. Or if Aphrodite-Engine ever works on serverless, that will be an option too.
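The conversion itself is only a few lines with the AutoAWQ library. Here's a rough sketch (repo names are placeholders, and a 70B-class model needs a big GPU plus a fair amount of time for calibration):
```python
# Sketch: making an AWQ quant in a Jupyter pod with AutoAWQ (pip install autoawq).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "some-org/some-model"   # placeholder: source fp16 repo
quant_path = "some-model-awq"        # local output directory

# Common AutoAWQ settings: 4-bit weights, group size 128
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```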
guestavius
guestavius6mo ago
I ended up using KoboldCpp's RunPod template for GGUF, lol. And sharing with some people to spend less time idling. (I'm being an idiot, yes.)
Charixfox
Charixfox6mo ago
If all else fails, just run the numbers to see if serverless will be better for your use cases. There's a level of active time above which pods become more cost-efficient. On last-gen 48GB, for example, it's 40% active runtime (actively processing requests).
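The math is easy to sanity-check yourself. The prices below are placeholders, not current RunPod rates; plug in the real numbers for the GPU you want:
```python
# Back-of-the-envelope break-even between a pod and serverless.
# Prices are hypothetical placeholders -- substitute current RunPod rates.
pod_price_per_hr = 0.50              # hypothetical 48GB pod price per hour
serverless_price_per_sec = 0.00035   # hypothetical 48GB serverless price per active second

serverless_price_per_active_hr = serverless_price_per_sec * 3600

# Fraction of each hour you must be actively processing before the pod wins:
break_even = pod_price_per_hr / serverless_price_per_active_hr
print(f"Pod becomes cheaper above ~{break_even:.0%} active time")
```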
houmie
houmie6mo ago
GGUF is a format for offline use on your own computer; it's not really meant for servers. Use AWQ or GPTQ until EXL2 is supported on vLLM.
digigoblin
digigoblin6mo ago
GGUF supports both CPU and GPU, not just CPU.
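For example, with llama-cpp-python you offload GGUF layers to the GPU explicitly. Sketch only; the file path is a placeholder and it assumes a CUDA-enabled build of the library:
```python
# GGUF on GPU with llama-cpp-python (requires a CUDA-enabled build).
from llama_cpp import Llama

llm = Llama(
    model_path="/models/openbiollm-llama3-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU
    n_ctx=4096,
)
out = llm("Question: What is hypertension?\nAnswer:", max_tokens=128)
print(out["choices"][0]["text"])
```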
houmie
houmie6mo ago
Yeah, it's not the fastest though.
digigoblin
digigoblin6mo ago
EXL2 is fastest.
houmie
houmie5mo ago
Yeah, I just hope vLLM will one day support EXL2. It would open up so many new opportunities.
digigoblin
digigoblin5mo ago
Aphrodite Engine supports it
houmie
houmie5mo ago
Yes, but Aphrodite runs only on classic pods, and it's very expensive to run. 🙂 This is why I love serverless: it's cheap to begin with (but gets 3x more expensive if you have constant traffic). Serverless is great for starting a project with minimal traffic; only if the project is a success and can generate money is it worth switching to a classic pod with Aphrodite.
digigoblin
digigoblin5mo ago
You can port Aphrodite to serverless too
houmie
houmie5mo ago
But it would be experimental, right? I don't think it's that easy, and I'm so busy as it is with coding. 🙂
nerdylive
nerdylive5mo ago
What's experimental?
digigoblin
digigoblin5mo ago
He is probably referring to Aphrodite Engine. It's not experimental; TabbyAPI is: "TabbyAPI is a hobby project solely for a small amount of users. It is not meant to run on production servers. For that, please look at other backends that support those workloads."
houmie
houmie5mo ago
No, I mean that right now I can create a vLLM serverless endpoint directly from the RunPod dashboard. The same isn't true for Aphrodite as serverless, hence I assume the latter is experimental.
Charixfox
Charixfox5mo ago
As in, there is no turnkey solution or "Just Type This" solution to deploying Aphrodite Engine on serverless.
nerdylive
nerdylive5mo ago
Ohhh yeah, there's no quick deploy for it yet, but vLLM is also arguably not really production-ready for big models; it has some bugs too.
Charixfox
Charixfox5mo ago
True enough. Can we really say that any FOSS solution is 'production ready' right now, though?
nerdylive
nerdylive5mo ago
I'm not sure if vLLM is, but on RunPod's quick deploy it still has some bugs. Oh, some of them have been fixed, yay: https://github.com/runpod-workers/worker-vllm/issues/29
GitHub
Errors cause the instance to run indefinitely · Issue #29 · runpod-...
Any errors caused by the payload cause the instance to hang in an error state indefinitely. You have to manually terminate the instance or you'll rack up a hefty bill should you have several ru...