Siamak
RunPod
Created by Siamak on 2/21/2024 in #⛅|pods
Run Lorax on Runpod (Serverless)
I created a docker image for LoRAX similar to this one (https://github.com/runpod-workers/worker-tgi/blob/main/src/entrypoint.sh), but inside the docker image I am getting "connection refused". Could you please check it?
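A common cause of "connection refused" in this kind of setup is the handler trying to reach the LoRAX HTTP server before the model has finished loading. A minimal readiness probe along these lines can rule that out; the port (8080) and the /health route are assumptions here, so match them to whatever --port your launcher actually uses in the entrypoint:

```python
# Hypothetical readiness probe: poll the local LoRAX endpoint until it
# accepts connections before the handler starts sending /generate requests.
import time
import requests

def wait_for_lorax(url="http://127.0.0.1:8080/health", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            # "Connection refused" here just means the server is not listening yet.
            time.sleep(2)
    return False

if __name__ == "__main__":
    if not wait_for_lorax():
        raise RuntimeError("LoRAX server never became reachable")
```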
15 replies
RunPod
Created by Siamak on 2/18/2024 in #⛅|pods
Multi GPU
I was conducting an experiment to run LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement in the results; in fact, the throughput was even worse. For sequential calls, the throughput with 1x GPU is better than with 2x GPUs!

Code for sequential calls:

```python
import requests
from time import time

def tgi_server(prompt):
    headers = {'Content-Type': 'application/json'}
    url = f'.../generate'
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "temperature": 1.0,
            "top_p": 0.99,
            "do_sample": False,
            "seed": 42
        }
    }
    response = requests.post(url, json=data, headers=headers)
    # print(response.status_code)
    res = response.json()
    # print(res)
    return res

if __name__ == '__main__':
    for index, sample in enumerate(input_sample_data):
        input_text = '...'
        input_str = f'"""{input_text}"""'
        template = f"""[INST] <<SYS>>
...
<</SYS>>
{input_str}[/INST]"""
        print("starting on {}".format(InsightSourceId))
        s0 = time()
        # print(template)
        response = tgi_server(template)
        s1 = time()
        # print(response)
        response = response["generated_text"]
```

I asked the LoRAX team about this, and they mentioned:
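One thing worth noting about this benchmark: a loop of strictly sequential calls only ever measures single-request latency, so a second GPU cannot raise throughput unless it also cuts per-request latency. A rough sketch of a concurrent benchmark, reusing the tgi_server function above (the prompt list and worker count are placeholders), would look something like this:

```python
# Minimal sketch of a concurrent throughput test: issue several requests at
# once so the server's batching can actually be exercised.
# Assumes tgi_server(prompt) is defined as in the snippet above.
from concurrent.futures import ThreadPoolExecutor
from time import time

def run_concurrent(prompts, max_workers=8):
    start = time()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(tgi_server, prompts))
    elapsed = time() - start
    print(f"{len(prompts)} prompts in {elapsed:.1f}s "
          f"({len(prompts) / elapsed:.2f} req/s)")
    return results
```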
This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference. The main situations where you would want to use multi-GPU would be:
- When the model is too large to fit on a single GPU
- When your GPUs are connected by NVLink
If neither condition is met, you're definitely better off on a single GPU.
I am using 2x L40 on Runpod.
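For reference, a quick, generic way to confirm how the two GPUs are connected is the nvidia-smi topology matrix: entries like NV1/NV2 indicate NVLink, while PIX/PXB/PHB/NODE/SYS mean traffic goes over PCIe and/or the host bridge, which is the bottleneck case the LoRAX team describes. A small sketch:

```python
# Print the GPU interconnect matrix on the pod (not specific to LoRAX).
import subprocess

def gpu_topology():
    out = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout)

if __name__ == "__main__":
    gpu_topology()
```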
11 replies