codeRetarded
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
Then you mean exporting the variable before running the code? But I don't understand why it works correctly the first time the worker is spawned.
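For reference, a minimal sketch of what "exporting the variable before running the code" could look like, assuming the variable being discussed is CUDA_VISIBLE_DEVICES (an assumption; the thread does not name it):

import os

# Assumption: the variable in question is CUDA_VISIBLE_DEVICES. Setting it
# before torch is imported means every process in the worker sees the same GPUs.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch
print(torch.cuda.device_count())  # should report 2 if both GPUs are exposed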
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
Oh, you mean adding the devices in the Dockerfile while creating the container?
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
I don't know whether I need to make any changes to the RunPod source code for multi-GPU?
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
So this is my code, where I am trying to run a chat model; get_chat_response is the handler.
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel


def get_chat_response(job):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_query = job["input"]["input_query"]
    base_model, llama_tokenizer = create_base_model()
    prompt = f"""
    something
    """
    model_input = llama_tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = len(prompt)

    base_model.eval()
    with torch.no_grad():
        resp = llama_tokenizer.decode(
            base_model.generate(**model_input, max_new_tokens=500)[0],
            skip_special_tokens=True,
        )
    # extract_regex is a helper defined elsewhere in this codebase
    resp = extract_regex(resp)
    return resp


def create_base_model():
    model_id = "/base/13B-chat"
    peft_id = "/base/LLM_Finetune/tmp3/llama-output"

    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # quantization_config=quant_config,
        device_map='auto',
    )
    base_model.config.use_cache = False
    base_model.config.pretraining_tp = 1
    llama_tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    llama_tokenizer.pad_token = llama_tokenizer.eos_token
    llama_tokenizer.padding_side = "right"  # Fix for fp16

    # Attach the fine-tuned LoRA adapter on top of the base model
    base_model = PeftModel.from_pretrained(
        base_model,
        peft_id,
    )

    return base_model, llama_tokenizer
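One pattern that may help with the behaviour described in this thread (a sketch under assumptions, not the thread's actual fix): load the model once at module import instead of inside the handler, and register the handler with the runpod SDK, so the load cost is paid once per worker rather than on every request.

import runpod  # RunPod serverless worker SDK

# Sketch: reuse create_base_model() from above, but call it a single time
# when the worker starts instead of inside the handler on each request.
BASE_MODEL, LLAMA_TOKENIZER = create_base_model()

def handler(job):
    # Same generation logic as get_chat_response above, minus the per-request
    # model load; the preloaded model/tokenizer are reused here.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_query = job["input"]["input_query"]
    prompt = f"""
    something
    """
    model_input = LLAMA_TOKENIZER(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = BASE_MODEL.generate(**model_input, max_new_tokens=500)[0]
    return LLAMA_TOKENIZER.decode(output, skip_special_tokens=True)

runpod.serverless.start({"handler": handler})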
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
Update: if I wait for a long time and then send a request, it works. I think it works every time after some refresh. Please help.
13 replies
RunPod
Created by codeRetarded on 2/2/2024 in #⚡|serverless
Docker daemon is not started by default?
Okay, I have a particular use case; is there a workaround?
5 replies
RunPod
Created by codeRetarded on 1/31/2024 in #⚡|serverless
Best way to deploy a new LLM serverless, where I don't want to build large docker images
@Alpay Ariyak thank you for the suggestion, this is the kind of thing I was looking for. It cuts Docker build time and stays serverless, but if I have a large model, won't the worker download it every time it is sent a request?
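For reference, a minimal sketch of one way to avoid re-downloading on every cold start, assuming a RunPod network volume mounted at /runpod-volume (an assumption) and the Hugging Face cache redirected there:

import os

# Assumption: a network volume is attached and mounted at /runpod-volume.
# Pointing the Hugging Face cache there means the model is downloaded once
# and reused by later workers instead of being fetched per request.
os.environ.setdefault("HF_HOME", "/runpod-volume/huggingface")

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # hypothetical model id
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)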
27 replies
RunPod
Created by codeRetarded on 1/31/2024 in #⚡|serverless
Best way to deploy a new LLM serverless, where I don't want to build large docker images
@justin your suggestion seems to be to create separate pods for the model and the code, but that would roughly double the cost compared to using only serverless and downloading the model from Hugging Face/GitHub repos. Thanks for the Depot suggestion; it looks interesting for working with Docker.
27 replies