Multi-GPU
I was conducting an experiment to run LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement in the results; in fact, the throughput was even worse.
For sequential calls, the throughput with 1x GPU is better than with 2x GPUs!
Code for the sequential calls:
import requests
from time import time

def tgi_server(prompt):
    # Send a single /generate request to the LoRAX (TGI-compatible) server.
    headers = {'Content-Type': 'application/json'}
    url = f'.../generate'
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "temperature": 1.0,
            "top_p": 0.99,
            "do_sample": False,
            "seed": 42
        }
    }
    response = requests.post(url, json=data, headers=headers)
    res = response.json()
    return res

if __name__ == '__main__':
    # input_sample_data: the list of samples to run sequentially (defined elsewhere)
    for index, sample in enumerate(input_sample_data):
        input_text = '...'
        input_str = f'"""{input_text}"""'
        template = f"""[INST] <<SYS>> ...
<</SYS>>
{input_str}[/INST]"""
        print("starting on {}".format(index))  # originally logged an ID taken from the sample
        s0 = time()
        response = tgi_server(template)
        s1 = time()
        response = response["generated_text"]
        print("latency: {:.2f}s".format(s1 - s0))
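For the 1x vs 2x GPU comparison, the per-call timings can be rolled up into a single throughput number. A minimal sketch of that aggregation, assuming tgi_server from above; the helper name measure_throughput and the prompts argument are illustrative, not part of the original code:

from time import time

def measure_throughput(prompts):
    # Run the prompts one after another (same sequential pattern as above)
    # and report requests per second for the deployment tgi_server points at.
    start = time()
    for prompt in prompts:
        tgi_server(prompt)
    elapsed = time() - start
    return len(prompts) / elapsed

Running the same prompt set against the 1x GPU and 2x GPU endpoints (by pointing the url in tgi_server at each deployment) gives directly comparable numbers.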
I asked the LoRAX team about this, and they replied:
"This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference. The main situations where you would want to use multi-GPU would be:
- when the model is too large to fit on a single GPU
- when your GPUs are connected by NVLink
If neither condition is met, you're definitely better off on a single GPU."
I am using 2x L40 on RunPod.
You need to ask this question on the GitHub repo, not here.
@ashleyk, the question for RunPod is whether the GPUs are connected by NVLink or PCIe. Is there a GitHub repo for RunPod where I can ask this?
Ah yeah, it would have been better to just ask that. @flash-singh can probably answer this.
@flash-singh Could you please help me?
He is in the US, so you'll probably have to wait a few hours for him to come online.
L40s are only PCIe.
@flash-singh Are RTX 4090s connected via NVLink? Because I have the same issue on RTX 4090 as well.
Could you please mention which GPU types are connected via NVLink?
Only SXM GPUs have NVLink or fast interconnects, i.e. A100 or H100.