LoRA adapter on Runpod.io (using vLLM Worker)
Hi, I hope everyone is doing well. I'm looking for advice on an issue I'm running into while trying to deploy a serverless API endpoint on RunPod.io. The model was fine-tuned with a LoRA adapter, and I seem to be stuck because a configuration file is missing.
Because the model was adapted with a LoRA adapter, I don't have a traditional configuration file (config.json), only the adapter's configuration (see screenshot).
Given the technical nature of this issue, I was hoping someone here has run into a similar situation or could offer guidance on how to proceed. Specifically, I'm looking for advice on how to bypass the config file requirement in this context, or on whether there is another way to supply the configuration the deployment process needs.
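The distinction the question hinges on can be sketched as a small check: a full checkpoint ships config.json, while an adapter-only repository ships adapter_config.json plus adapter weights. This is a minimal stdlib sketch; the file names are the usual Hugging Face/PEFT conventions, and `classify_checkpoint` is a hypothetical helper, not part of RunPod or vLLM.

```python
from pathlib import Path

# Conventional Hugging Face / PEFT file names (assumed, not an exhaustive list).
FULL_MODEL_MARKER = "config.json"
ADAPTER_MARKER = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")


def classify_checkpoint(files):
    """Return 'full_model', 'lora_adapter', or 'unknown' for a list of file names."""
    names = {Path(f).name for f in files}
    if FULL_MODEL_MARKER in names:
        return "full_model"
    if ADAPTER_MARKER in names and any(w in names for w in ADAPTER_WEIGHTS):
        return "lora_adapter"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical listing of an adapter-only repository.
    listing = ["README.md", "adapter_config.json", "adapter_model.safetensors", "tokenizer.json"]
    print(classify_checkpoint(listing))
```

A repository classified as `lora_adapter` has no config.json of its own, which is why a loader that expects a full model fails on it.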
Hey, how do you run the model?
Using this:
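from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM
# Read the adapter's configuration (adapter_config.json) from the adapter repo
config = PeftConfig.from_pretrained("alsokit/eLM-mini-4B-4K-4bit-v01")
# Load the 4-bit base model the adapter was trained on
base_model = AutoModelForCausalLM.from_pretrained("unsloth/phi-3-mini-4k-instruct-bnb-4bit")
# Attach the LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(base_model, "alsokit/eLM-mini-4B-4K-4bit-v01")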
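In the snippet above, `config` is loaded but never used. Its usual role is to tell you which base model the adapter expects, through the `base_model_name_or_path` field of adapter_config.json (so one could write `AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)`). A stdlib sketch of reading that field; the example JSON values are hypothetical and trimmed to a few PEFT-style keys.

```python
import json


def base_model_from_adapter_config(text):
    """Return the base model recorded in an adapter_config.json payload."""
    cfg = json.loads(text)
    base = cfg.get("base_model_name_or_path")
    if not base:
        raise ValueError("adapter_config.json has no base_model_name_or_path")
    return base


# Hypothetical adapter_config.json contents (illustrative values only).
example = json.dumps({
    "peft_type": "LORA",
    "base_model_name_or_path": "unsloth/phi-3-mini-4k-instruct-bnb-4bit",
    "r": 16,
    "lora_alpha": 16,
})
print(base_model_from_adapter_config(example))
```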
Do you get any warnings when using that?
Or maybe errors?
Well, I also get an error:
ValueError: Can't find 'adapter_config.json' at 'alsokit/eLM-mini-4B-4K-4bit-v01'
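Try this (example: initialize a LoRA config):
from peft import LoraConfig
# Note: to_k/to_q/to_v/to_out.0 are attention module names from diffusion models;
# target_modules must match the module names of the model you are adapting
lora_config = LoraConfig(init_lora_weights="gaussian", target_modules=["to_k", "to_q", "to_v", "to_out.0"])
I'm not sure whether it works, or whether these config values are right.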
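Whether a `LoraConfig` "works" depends on its `target_modules` matching real module names; note also that a new `LoraConfig` defines a fresh adapter for training and does not recover the config of an already-trained one. The sketch below approximates PEFT's matching for a list of names (exact name, or a `.<target>` suffix). `matching_modules` is a hypothetical helper and the module names are illustrative, in the style of a decoder-only LLM layer; print `model.named_modules()` to see the real names for your model.

```python
def matching_modules(module_names, target_modules):
    """Approximate PEFT's list-style matching: exact name or '.<target>' suffix."""
    hits = []
    for name in module_names:
        if name in target_modules or any(name.endswith("." + t) for t in target_modules):
            hits.append(name)
    return hits


# Hypothetical module names for one transformer layer (illustrative only).
llm_modules = [
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_up_proj",
    "model.layers.0.mlp.down_proj",
]
diffusion_targets = ["to_k", "to_q", "to_v", "to_out.0"]

print(matching_modules(llm_modules, diffusion_targets))          # nothing matches
print(matching_modules(llm_modules, ["qkv_proj", "o_proj"]))     # the two attention projections
```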
Oh, it seems to work after these steps:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
# model_name = "unsloth/mistral-7b-v0.3", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
model_name = "unsloth/Phi-3-mini-4k-instruct",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
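The `dtype = None` comment above encodes a rule of thumb: float16 on Tesla T4/V100, bfloat16 on Ampere or newer. Ampere corresponds to CUDA compute capability 8.x, V100 is 7.0 and T4 is 7.5. A minimal sketch of that rule; `pick_dtype` is hypothetical, and on a real machine the capability tuple would come from `torch.cuda.get_device_capability()` (or you would just call `torch.cuda.is_bf16_supported()`).

```python
def pick_dtype(compute_capability):
    """Mirror the comment's rule: bfloat16 on Ampere (sm_80) or newer, else float16."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# Known compute capabilities: V100 = 7.0, T4 = 7.5, A100 = 8.0.
for gpu, cc in [("V100", (7, 0)), ("T4", (7, 5)), ("A100", (8, 0))]:
    print(gpu, pick_dtype(cc))
```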
Does RunPod serverless support LoRA adapters?
The problem is, I only have adapter_config.json.
2024-06-26T12:49:19.602715611Z OSError: alsokit/eLM-mini-4B-4K-4bit-v01 does not appear to have a file named config.json. Checkout 'https://huggingface.co/alsokit/eLM-mini-4B-4K-4bit-v01/tree/main' for available files.
Is there any way to set up an endpoint with my model without config.json (because I only have adapter_config.json), or do I have to change the model somehow?
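On whether a LoRA can be served without its own config.json: vLLM can serve an adapter on top of a base model, in which case config.json comes from the base model and the adapter only needs adapter_config.json plus its weights. The sketch builds a vLLM OpenAI-server command from an adapter config using the `--enable-lora`, `--lora-modules name=path` and `--max-lora-rank` options. Assumptions: `vllm_lora_command` is a hypothetical helper, the base model id and local adapter path are placeholders, and whether RunPod's vLLM worker exposes these options (or supports a bitsandbytes 4-bit base with LoRA) is not covered in this thread.

```python
import json
import shlex


def vllm_lora_command(adapter_config_text, adapter_path, adapter_name="my-lora"):
    """Build a vLLM OpenAI-server command that serves a LoRA on top of its base model."""
    cfg = json.loads(adapter_config_text)
    base = cfg["base_model_name_or_path"]
    rank = int(cfg.get("r", 16))
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", base,                                   # the base model supplies config.json
        "--enable-lora",
        "--lora-modules", f"{adapter_name}={adapter_path}",  # name=path of the adapter
        "--max-lora-rank", str(rank),                      # must be >= the adapter's rank
    ]


# Placeholder values for illustration only.
cfg_text = json.dumps({"base_model_name_or_path": "microsoft/Phi-3-mini-4k-instruct", "r": 16})
cmd = vllm_lora_command(cfg_text, "/models/my-adapter")
print(shlex.join(cmd))
```

Clients would then request the adapter by its name (`my-lora` here) as the model in API calls.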
I guess they do, like other GPU providers.
With PEFT, I think you need the config.
Are there any ways to avoid this error (no config found)?
Yes, with PEFT I need it, but with the other method I don't.
So, as far as I can see, config.json is also a must-have for an API endpoint?
I guess so. What's the other method?
Why not use that?
The code I posted above doesn't use PEFT directly.
After merging the adapter into the base model I do have a config, but the model quality gets a lot worse,
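One plausible contributor to the quality drop, not confirmed in this thread: a LoRA merge computes W' = W + (alpha / r) * B A, and if the merged weights are then quantized again (for example back to 4-bit), small LoRA deltas can round away. A toy sketch with made-up 2x2 matrices; `fake_quantize` is a hypothetical uniform rounding, a crude stand-in for real low-bit formats.

```python
def matmul(a, b):
    """Plain-Python matrix product (lists of rows)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(w, b, a, alpha, r):
    """Return W + (alpha / r) * B @ A, the standard LoRA merge."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


def fake_quantize(w, step):
    """Round every weight to the nearest multiple of `step` (uniform rounding)."""
    return [[round(x / step) * step for x in row] for row in w]


W = [[0.5, -0.25], [0.0, 0.75]]
B = [[0.1], [0.0]]   # rank r = 1
A = [[0.2, 0.0]]
merged = merge_lora(W, B, A, alpha=1, r=1)
print(merged[0][0])                        # about 0.52: the delta is present in float
print(fake_quantize(merged, 0.125) == W)   # the rounded merge is identical to W
```

Keeping the adapter separate (or merging into unquantized base weights) avoids this particular rounding step.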
so I thought I could set up an API endpoint without a config.
Ahh yeah, maybe that's why you should keep the LoRA config, haha.
Was the LoRA trained on the same base model?
Yes.
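A quick way to answer the same-base-model question is to compare the adapter's recorded base with the id actually passed to the loader. The sketch uses a strict string comparison (an assumption: quantized re-uploads and renamed variants count as different even if they derive from the same weights); `check_base_match` is hypothetical. The two ids in the example are the ones used in the two code snippets earlier in the thread.

```python
import json


def check_base_match(adapter_config_text, loaded_base):
    """Return (matches, recorded_base) for an adapter config and a loaded base model id."""
    recorded = json.loads(adapter_config_text).get("base_model_name_or_path", "")
    return recorded == loaded_base, recorded


adapter_cfg = json.dumps({"base_model_name_or_path": "unsloth/phi-3-mini-4k-instruct-bnb-4bit"})
print(check_base_match(adapter_cfg, "unsloth/phi-3-mini-4k-instruct-bnb-4bit"))
print(check_base_match(adapter_cfg, "unsloth/Phi-3-mini-4k-instruct"))
```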
How do we deploy this on a serverless endpoint?
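Once an endpoint is running, RunPod serverless endpoints are called over HTTP, e.g. `POST https://api.runpod.ai/v2/<endpoint_id>/runsync` with a JSON body of the form `{"input": {...}}`. A stdlib sketch that builds (but does not send) such a request. Assumptions: the endpoint id and API key are placeholders, `build_runsync_request` is hypothetical, and the `prompt`/`sampling_params` input fields follow the vLLM worker's input format as I understand it; check the worker's README for the exact schema.

```python
import json
import urllib.request


def build_runsync_request(endpoint_id, api_key, prompt, max_tokens=128):
    """Build (but do not send) a synchronous request to a RunPod serverless endpoint."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = {
        "input": {
            "prompt": prompt,
            "sampling_params": {"max_tokens": max_tokens},
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_runsync_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", "Hello!")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```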