Siamak
Run LoRAX on Runpod (Serverless)
I created a Docker image for LoRAX similar to the worker-tgi entrypoint (https://github.com/runpod-workers/worker-tgi/blob/main/src/entrypoint.sh), but inside the Docker image I am getting a connection refused error.
Could you please check it?
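For reference, a "connection refused" inside the container usually just means nothing is listening on the host/port the handler is calling yet, for example because the model is still loading or the server is bound to a different port. Below is a minimal readiness-check sketch; it assumes the LoRAX server listens on 127.0.0.1:80 and exposes a TGI-style /health route, both of which may differ in your image:

import time
import requests

# Poll the server until it accepts connections.
# Assumption: LoRAX listens on 127.0.0.1:80 and exposes a TGI-style /health
# route; adjust the URL to match your entrypoint.
def wait_for_server(url="http://127.0.0.1:80/health", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            # Still refusing connections: the server is not listening yet.
            pass
        time.sleep(5)
    return False

if __name__ == "__main__":
    if not wait_for_server():
        raise SystemExit("LoRAX server never became reachable on the expected port")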
15 replies
Multi GPU
I was conducting an experiment to run LoRAX (https://github.com/predibase/lorax) on multiple GPUs. However, I did not observe any improvement in the results; in fact, the throughput was even worse.
For sequential calls, the throughput on 1x GPU is better than on 2x GPU!
Code for the sequential calls:
import requests
from time import time

def tgi_server(prompt):
    # Send a single blocking /generate request to the LoRAX (TGI-compatible) server.
    headers = {'Content-Type': 'application/json'}
    url = f'.../generate'
    data = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 1000,
            "temperature": 1.0,
            "top_p": 0.99,
            "do_sample": False,
            "seed": 42
        }
    }
    response = requests.post(url, json=data, headers=headers)
    # print(response.status_code)
    res = response.json()
    # print(res)
    return res

if __name__ == '__main__':
    # input_sample_data and InsightSourceId are defined elsewhere in the original script.
    for index, sample in enumerate(input_sample_data):
        input_text = '...'
        input_str = f'"""{input_text}"""'
        template = f"""[INST] <<SYS>> ...
<</SYS>>
{input_str}[/INST]"""
        print("starting on {}".format(InsightSourceId))
        s0 = time()
        # print(template)
        response = tgi_server(template)
        s1 = time()
        # print(response)
        response = response["generated_text"]
I asked the LoRAX team about this, and they mentioned:
This isn't surprising if your GPUs are connected via PCIe. Unless you're using NVLink, the network overhead of GPU-to-GPU communication will, in most cases, be the bottleneck for inference. The main situations where you would want to use multi-GPU are: when the model is too large to fit on a single GPU, or when your GPUs are connected by NVLink. If neither condition is met, you're definitely better off on a single GPU.
I am using 2x L40 on Runpod.
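A quick way to check whether the two GPUs are actually linked by NVLink or only by PCIe is nvidia-smi's topology matrix; here is a minimal sketch, assuming nvidia-smi is available inside the pod:

import subprocess

# Print the GPU interconnect topology. In the GPU0<->GPU1 cell, NV# means an
# NVLink connection; PIX/PXB/PHB/SYS mean the traffic goes over PCIe or the
# host, which is the bottleneck case the LoRAX team describes above.
topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True, check=True)
print(topo.stdout)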
11 replies