Siamak
Run LoRAX on Runpod (Serverless)
I created a Docker image for LoRAX similar to the worker-tgi entrypoint (https://github.com/runpod-workers/worker-tgi/blob/main/src/entrypoint.sh), but inside the Docker image I am getting a connection refused error.
Could you please check it?
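For reference, a "connection refused" inside the container usually just means nothing is listening on the host/port the handler is calling yet, for example because the model is still loading or the server is bound to a different port. Below is a minimal readiness-check sketch; it assumes the LoRAX server listens on 127.0.0.1:80 and exposes a TGI-style /health route, both of which may differ in your image:

import time
import requests

# Poll the server until it accepts connections.
# Assumption: LoRAX listens on 127.0.0.1:80 and exposes a TGI-style /health
# route; adjust the URL to match your entrypoint.
def wait_for_server(url="http://127.0.0.1:80/health", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            # Still refusing connections: the server is not listening yet.
            pass
        time.sleep(5)
    return False

if __name__ == "__main__":
    if not wait_for_server():
        raise SystemExit("LoRAX server never became reachable on the expected port")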
15 replies
Multi GPU
I was conducting an experiment to run LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement in the results; in fact, the throughput was even worse.
For sequential calls, the throughput on 1x GPU is better than on 2x GPU!
Code for the sequential calls:
import requests
from time import time

def tgi_server(prompt):
    # Send a single blocking /generate request to the LoRAX (TGI-compatible) server.
    headers = {'Content-Type': 'application/json'}
    url = f'.../generate'
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "temperature": 1.0,
            "top_p": 0.99,
            "do_sample": False,
            "seed": 42
        }
    }
    response = requests.post(url, json=data, headers=headers)
    # print(response.status_code)
    res = response.json()
    # print(res)
    return res

if __name__ == '__main__':
    # input_sample_data and InsightSourceId are defined elsewhere in the original script.
    for index, sample in enumerate(input_sample_data):
        input_text = '...'
        input_str = f'"""{input_text}"""'
        template = f"""[INST] <<SYS>> ...
<</SYS>>
{input_str}[/INST]"""
        print("starting on {}".format(InsightSourceId))
        s0 = time()
        # print(template)
        response = tgi_server(template)
        s1 = time()
        # print(response)
        response = response["generated_text"]
I asked the LoRAX team about this, and they mentioned:
This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference. The main situations where you would want to use multi-GPU are: when the model is too large to fit on a single GPU, or when your GPUs are connected by NVLink. If neither condition is met, you're definitely better off on a single GPU.
I am using 2x L40 on Runpod.
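A quick way to check whether the two GPUs are actually linked by NVLink or only by PCIe is nvidia-smi's topology matrix; here is a minimal sketch, assuming nvidia-smi is available inside the pod:

import subprocess

# Print the GPU interconnect topology. In the GPU0<->GPU1 cell, NV# means an
# NVLink connection; PIX/PXB/PHB/SYS mean the traffic goes over PCIe or the
# host, which is the bottleneck case the LoRAX team describes above.
topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True, check=True)
print(topo.stdout)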
11 replies