Serverless multi-GPU
I have a model deployed on 2× 48 GB GPUs and 1 worker. It ran correctly the first time with CUDA distributed, but then it fails with this error:

"error_message": "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)",
"error_traceback": "Traceback (most recent call last):
  File \"/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py\"
What can be the issue here?
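For reference, this particular message just means a torch.cat call received tensors living on two different GPUs. A minimal reproduction (a sketch, not the code from this deployment):

```python
# Minimal reproduction of the error above: torch.cat refuses to concatenate
# tensors that live on different devices.
import torch

a = torch.randn(2, 4, device="cuda:0")
b = torch.randn(2, 4, device="cuda:1")

torch.cat([a, b])               # RuntimeError: Expected all tensors to be on the same
                                # device, but found at least two devices, cuda:0 and cuda:1!

torch.cat([a, b.to(a.device)])  # works once everything is moved onto one device
```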
9 Replies
Update: if I stop for a long time and then send a request, it works. It seems to work every time after the worker refreshes. Please help.
What model? What are you running on serverless?
Impossible to help without full information.
So this is my code, where I am trying to run a chat model; get_chat_response is the handler.
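The code itself isn't reproduced in the thread; a handler for a chat model sharded across two GPUs commonly looks roughly like the sketch below (the model name, prompt format, and hard-coded device are assumptions for illustration, not the poster's actual code):

```python
# Rough sketch of a RunPod serverless chat handler, NOT the poster's actual code.
# MODEL_ID and the prompt handling are assumptions for illustration.
import runpod
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # hypothetical model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# device_map="auto" lets accelerate shard the weights across both 48 GB GPUs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def get_chat_response(job):
    prompt = job["input"]["prompt"]
    # Hard-coding cuda:0 here can clash with layers the sharding placed on
    # cuda:1; see the device discussion further down the thread.
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

runpod.serverless.start({"handler": get_chat_response})
```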
I am facing a similar issue!
I don't know if I should make any changes to the RunPod source code for multi-GPU?
You usually need to set CUDA_VISIBLE_DEVICES to use more than one GPU, or configure your code to do so; it doesn't happen magically by itself.
Oh, you mean adding devices in the Dockerfile while creating the container?
No, that won't work
Then do you mean exporting the variable before running the code? But I don't understand why it works correctly the first time the worker is spawned.
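For completeness, here is a sketch of the two changes being discussed, assuming the model is loaded with accelerate's device_map="auto" (the model name is hypothetical and this is not verified against the poster's deployment): make CUDA_VISIBLE_DEVICES visible before torch initializes CUDA, for example via the endpoint's environment variables rather than the Dockerfile, and place the inputs on whichever GPU actually holds the model's input embeddings instead of a hard-coded device:

```python
# Sketch of the adjustments discussed above; assumptions, not a confirmed fix.
import os

# CUDA_VISIBLE_DEVICES must be set before torch initializes CUDA, e.g. as an
# endpoint environment variable or at the very top of the handler script.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # hypothetical model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def get_chat_response(job):
    prompt = job["input"]["prompt"]
    # Put inputs on whichever GPU holds the input embeddings; with
    # device_map="auto" that is not guaranteed to be cuda:0.
    input_device = model.get_input_embeddings().weight.device
    inputs = tokenizer(prompt, return_tensors="pt").to(input_device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```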