Deploying bitsandbytes-quantized Models on RunPod Serverless using Custom Docker Image
Hey everyone!
Looking for tips from anyone who's worked with bitsandbytes-quantized models on RunPod's serverless setup. It isn't available out of the box with the vLLM worker, and I was wondering if anyone has gotten it working? I saw a post in the serverless forum about possibly using a custom Docker image for this.
For context: I've fine-tuned LLaMA-3.1 70B-instruct using the unsloth library (which utilizes bitsandbytes for quantization) and am looking to deploy it.
Any insights would be greatly appreciated!
I'm not sure if there's a way, but maybe you could dequantize it somehow? Or convert it to another format that it supports.
Any updates? I want to do the same thing with the 3.3 version.
You can actually make custom workers
with some code, libraries, or frameworks that run inside Linux and can load bitsandbytes models
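A minimal sketch of that kind of custom worker, assuming the `runpod` Python SDK is installed in the image; `generate()` here is just a placeholder for your own bitsandbytes model code:

```python
# Minimal custom RunPod serverless worker (sketch, not the official vllm-worker).
# Assumes the `runpod` SDK is in the image; generate() is a placeholder for
# whatever bitsandbytes-based inference stack you bake into the container.
import runpod

def generate(prompt: str) -> str:
    # Placeholder: load and call your bitsandbytes-quantized model here,
    # e.g. via transformers + bitsandbytes or vLLM.
    return f"echo: {prompt}"

def handler(job):
    prompt = job["input"]["prompt"]
    return {"output": generate(prompt)}

runpod.serverless.start({"handler": handler})
```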
https://docs.vllm.ai/en/stable/quantization/bnb.html
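For reference, that doc boils down to roughly this (sketch only; the model path is a placeholder, and both arguments need to be set to "bitsandbytes"):

```python
# Sketch of loading a bnb-quantized checkpoint directly with vLLM, per the
# linked doc. The model path below is a placeholder, not a specific repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3.1-70b-instruct-bnb-4bit",  # placeholder checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```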
Oh actually
I'm not sure if there's an option for this in the RunPod vllm-worker
maybe something like this would work
What's the model's file extension for bnb?
.safetensors
I tried this, but the vllm-worker checks whether the variable is one of the defined choices
I forked the vllm-worker and changed it to accept bitsandbytes
yes
Oh the file format?
in this worker-config.json
it does not have bitsandbytes
Ohhh
On the website it doesn't accept anything other than those options, right?
Thanks for sharing this, it'll be helpful for others who want to use bnb in the future
This may work, I am going to test the runpod vllm-worker with LOAD_FORMAT
it supports bitsandbytes
Hope the src/engine will load it, but I think it will not, because in the GitHub repo they don't handle it fully like in https://docs.vllm.ai/en/stable/quantization/bnb.html
yes
I will inform you
What's missing? I thought they pass on those arguments? (not sure about the load_format one)
yeah..
I got the expected error
what error?
The param LOAD_FORMAT accepts "BitsAndBytes"
and if it is set to "BitsAndBytes", then QUANTIZATION must be "bitsandbytes" ("None" will not work)
but the QUANTIZATION options are "None", "AWQ", "SqueezeLLM", "GPTQ"
Here is the error:
engine.py:115 2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
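In other words, the worker's engine args have to end up with both fields set; a sketch of the target configuration (model name is a placeholder):

```python
# What the vLLM engine ultimately needs to be handed (sketch).
# If load_format is "bitsandbytes" while quantization stays None, vLLM raises
# exactly the error above, so both must be "bitsandbytes".
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="unsloth/your-bnb-4bit-checkpoint",  # placeholder
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
```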
https://github.com/runpod-workers/worker-vllm/issues/99
Bitsandbytes support · Issue #99 · runpod-workers/worker-vllm
Hi there! vllm supports bitsandbytes quantization, but there is no bitsandbytes dependency in requirements.txt. Is there any plans to fix that?
worker-config.json's QUANTIZATION does not have a 'bitsandbytes' opt... Here is the error: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization engine.py:26 2025-01-21 11:18:49,619 Engine args: AsyncEngineArgs(model='unsloth/t...
Oh, have you got it to work now?
Yeah, [https://github.com/mohamednaji7/worker-vllm/tree/main] I added a few lines to complete the option of using bitsandbytes
and this is my pull request [https://github.com/runpod-workers/worker-vllm/pull/146]