Multi-GPU
I was conducting an experiment to run LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement in the results; in fact, the throughput was even worse.
For sequential calls, the throughput with 1x GPU is better than with 2x GPUs!
Code for the sequential calls:
import requests
from time import time

def tgi_server(prompt):
    # Send a single /generate request to the LoRAX (TGI-compatible) server.
    headers = {'Content-Type': 'application/json'}
    url = f'.../generate'
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "temperature": 1.0,
            "top_p": 0.99,
            "do_sample": False,
            "seed": 42
        }
    }
    response = requests.post(url, json=data, headers=headers)
    res = response.json()
    return res

if __name__ == '__main__':
    # input_sample_data: the list of samples to run sequentially (defined elsewhere)
    for index, sample in enumerate(input_sample_data):
        input_text = '...'
        input_str = f'"""{input_text}"""'
        template = f"""[INST] <<SYS>> ...
<</SYS>>
{input_str}[/INST]"""
        print("starting on {}".format(index))  # originally logged an ID taken from the sample
        s0 = time()
        response = tgi_server(template)
        s1 = time()
        response = response["generated_text"]
        print("latency: {:.2f}s".format(s1 - s0))
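For the 1x vs 2x GPU comparison, the per-call timings can be rolled up into a single throughput number. A minimal sketch of that aggregation, assuming tgi_server from above; the helper name measure_throughput and the prompts argument are illustrative, not part of the original code:

from time import time

def measure_throughput(prompts):
    # Run the prompts one after another (same sequential pattern as above)
    # and report requests per second for the deployment tgi_server points at.
    start = time()
    for prompt in prompts:
        tgi_server(prompt)
    elapsed = time() - start
    return len(prompts) / elapsed

Running the same prompt set against the 1x GPU and 2x GPU endpoints (by pointing the url in tgi_server at each deployment) gives directly comparable numbers.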
I asked the LoRAX team about this, and they replied:
"This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference. The main situations where you would want to use multi-GPU would be:
- when the model is too large to fit on a single GPU
- when your GPUs are connected by NVLink
If neither condition is met, you're definitely better off on a single GPU."
I am using 2x L40 on RunPod.
You need to ask this question on the GitHub repo, not here.
@ashleyk, the question for RunPod is whether the GPUs are connected by NVLink or PCIe. Is there a GitHub repo for RunPod where I can ask this?
Ah yeah, it would have been better to just ask that. @flash-singh can probably answer this.
@flash-singh Could you please help me?
He is in the US, so you'll probably have to wait a few hours for him to come online.
L40s are only PCIe.
@flash-singh Are RTX 4090s connected via NVLink? Because I have the same issue on RTX 4090 as well.
Could you please mention which GPU types are connected via NVLink?
Only SXM GPUs have NVLink or fast interconnects, i.e. A100 or H100.