Deploying bitsandbytes-quantized Models on RunPod Serverless using Custom Docker Image
Hey everyone!
Looking for tips from anyone who's worked with bitsandbytes-quantized models on RunPod's serverless setup. It isn't available out of the box with the vLLM worker, and I was wondering if anyone has gotten it working? I saw a post in the serverless forum about possibly using a custom Docker image for this.
For context: I've fine-tuned LLaMA-3.1 70B-instruct using the unsloth library (which utilizes bitsandbytes for quantization) and am looking to deploy it.
Any insights would be greatly appreciated!
I'm not sure if there's a way, but maybe you could dequantize it somehow? Or convert it to another format that it supports.
Any updates? I want to do the same thing with the 3.3 version.
You can actually make custom workers
with some code, libraries, or frameworks that run inside Linux and can load bitsandbytes models
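A minimal sketch of that kind of custom worker, assuming the `runpod` Python SDK is installed in the image; `generate()` here is just a placeholder for your own bitsandbytes model code:

```python
# Minimal custom RunPod serverless worker (sketch, not the official vllm-worker).
# Assumes the `runpod` SDK is in the image; generate() is a placeholder for
# whatever bitsandbytes-based inference stack you bake into the container.
import runpod

def generate(prompt: str) -> str:
    # Placeholder: load and call your bitsandbytes-quantized model here,
    # e.g. via transformers + bitsandbytes or vLLM.
    return f"echo: {prompt}"

def handler(job):
    prompt = job["input"]["prompt"]
    return {"output": generate(prompt)}

runpod.serverless.start({"handler": handler})
```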
https://docs.vllm.ai/en/stable/quantization/bnb.html
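For reference, that doc boils down to roughly this (sketch only; the model path is a placeholder, and both arguments need to be set to "bitsandbytes"):

```python
# Sketch of loading a bnb-quantized checkpoint directly with vLLM, per the
# linked doc. The model path below is a placeholder, not a specific repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3.1-70b-instruct-bnb-4bit",  # placeholder checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```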
Oh actually
I'm not sure if there's an option for this in the RunPod vllm-worker
maybe something like this would work
What's the model's file extension for bnb?
.safetensors
I tried this, but the vllm-worker checks whether the variable is one of the defined choices
I forked the vllm-worker and changed it to accept bitsandbytes
yes
Oh the file format?
in this worker-config.json
it does not have bitsandbytes
Ohhh
On the website it doesn't accept anything other than those options, right?
Thanks for sharing this, it'll be helpful for others who want to use bnb in the future
This may work, I am going to test the runpod vllm-worker with LOAD_FORMAT
it supports bitsandbytes
Hope the src/engine will load it, but I think it will not, because in the GitHub repo they don't handle it fully like in https://docs.vllm.ai/en/stable/quantization/bnb.html
yes
I will inform you
What's missing? I thought they pass on those arguments? (not sure about the load_format one)
yeah..
I got the expected error
what error?
The param LOAD_FORMAT accepts "BitsAndBytes"
and if it is set to "BitsAndBytes", then QUANTIZATION must be "bitsandbytes" ("None" will not work)
but the QUANTIZATION options are "None", "AWQ", "SqueezeLLM", "GPTQ"
Here is the error:
engine.py:115 2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
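In other words, the worker's engine args have to end up with both fields set; a sketch of the target configuration (model name is a placeholder):

```python
# What the vLLM engine ultimately needs to be handed (sketch).
# If load_format is "bitsandbytes" while quantization stays None, vLLM raises
# exactly the error above, so both must be "bitsandbytes".
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="unsloth/your-bnb-4bit-checkpoint",  # placeholder
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
```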
https://github.com/runpod-workers/worker-vllm/issues/99
Bitsandbytes support · Issue #99 · runpod-workers/worker-vllm
Hi there! vllm supports bitsandbytes quantization, but there is no bitsandbytes dependency in requirements.txt. Is there any plans to fix that?
worker-config.json's QUANTIZATION does not have a 'bitsandbytes' opt... Here is the error: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization engine.py:26 2025-01-21 11:18:49,619 Engine args: AsyncEngineArgs(model='unsloth/t...
Oh, have you got it to work now?
Yeah, [https://github.com/mohamednaji7/worker-vllm/tree/main] I added a few lines to complete the option of using bitsandbytes
and this is my pull request [https://github.com/runpod-workers/worker-vllm/pull/146]