codeRetarded
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
Then you mean exporting the variable before running the code? But I don't understand why it works correctly the first time the worker is spawned.
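For reference, a minimal sketch of what "exporting the variable before running the code" could look like, assuming the variable being discussed is CUDA_VISIBLE_DEVICES (an assumption; the thread does not name it):

import os

# Assumption: the variable in question is CUDA_VISIBLE_DEVICES. Setting it
# before torch is imported means every process in the worker sees the same GPUs.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch
print(torch.cuda.device_count())  # should report 2 if both GPUs are exposed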
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
Oh, you mean adding the devices in the Dockerfile while creating the container?
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
I don't know whether I need to make any changes to the RunPod source code for multi-GPU?
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
So this is my code, where I am trying to run a chat model; get_chat_response is the handler.
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel


def get_chat_response(job):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_query = job["input"]["input_query"]
    base_model, llama_tokenizer = create_base_model()
    prompt = f"""
    something
    """
    model_input = llama_tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = len(prompt)

    base_model.eval()
    with torch.no_grad():
        resp = llama_tokenizer.decode(
            base_model.generate(**model_input, max_new_tokens=500)[0],
            skip_special_tokens=True,
        )
    # extract_regex is a helper defined elsewhere in this codebase
    resp = extract_regex(resp)
    return resp


def create_base_model():
    model_id = "/base/13B-chat"
    peft_id = "/base/LLM_Finetune/tmp3/llama-output"

    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # quantization_config=quant_config,
        device_map='auto',
    )
    base_model.config.use_cache = False
    base_model.config.pretraining_tp = 1
    llama_tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    llama_tokenizer.pad_token = llama_tokenizer.eos_token
    llama_tokenizer.padding_side = "right"  # Fix for fp16

    # Attach the fine-tuned LoRA adapter on top of the base model
    base_model = PeftModel.from_pretrained(
        base_model,
        peft_id,
    )

    return base_model, llama_tokenizer
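One pattern that may help with the behaviour described in this thread (a sketch under assumptions, not the thread's actual fix): load the model once at module import instead of inside the handler, and register the handler with the runpod SDK, so the load cost is paid once per worker rather than on every request.

import runpod  # RunPod serverless worker SDK

# Sketch: reuse create_base_model() from above, but call it a single time
# when the worker starts instead of inside the handler on each request.
BASE_MODEL, LLAMA_TOKENIZER = create_base_model()

def handler(job):
    # Same generation logic as get_chat_response above, minus the per-request
    # model load; the preloaded model/tokenizer are reused here.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_query = job["input"]["input_query"]
    prompt = f"""
    something
    """
    model_input = LLAMA_TOKENIZER(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = BASE_MODEL.generate(**model_input, max_new_tokens=500)[0]
    return LLAMA_TOKENIZER.decode(output, skip_special_tokens=True)

runpod.serverless.start({"handler": handler})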
13 replies
RunPod
Created by codeRetarded on 3/12/2024 in #⚡|serverless
Serverless multi gpu
Update: if I wait for a long time and then send a request, it works. I think it works every time after some refresh. Please help.
13 replies
RunPod
Created by codeRetarded on 2/2/2024 in #⚡|serverless
Docker daemon is not started by default?
Okay, I have a particular use case; is there a workaround?
5 replies
RunPod
Created by codeRetarded on 1/31/2024 in #⚡|serverless
Best way to deploy a new LLM serverless, where I don't want to build large docker images
@Alpay Ariyak thank you for the suggestion, this is the kind of thing I was looking for. It cuts Docker build time and stays serverless, but if I have a large model, won't the worker download it every time it is sent a request?
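For reference, a minimal sketch of one way to avoid re-downloading on every cold start, assuming a RunPod network volume mounted at /runpod-volume (an assumption) and the Hugging Face cache redirected there:

import os

# Assumption: a network volume is attached and mounted at /runpod-volume.
# Pointing the Hugging Face cache there means the model is downloaded once
# and reused by later workers instead of being fetched per request.
os.environ.setdefault("HF_HOME", "/runpod-volume/huggingface")

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # hypothetical model id
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)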
27 replies
RunPod
Created by codeRetarded on 1/31/2024 in #⚡|serverless
Best way to deploy a new LLM serverless, where I don't want to build large docker images
@justin your suggestion seems to be to create separate pods for the model and the code, but that would roughly double the cost compared to using only serverless and downloading the model from Hugging Face/GitHub repos. Thanks for the Depot suggestion; it looks interesting for working with Docker.
27 replies