Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving
Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. I'm aiming for a disaggregated setup, something like:
1 worker with 8 GPUs total, where 4 GPUs serve 1 prefill task and 4 GPUs serve 1 decode task.
Can anyone help me set this up with vLLM on RunPod serverless? I'm going for this approach because I want very low latency, and I think sharding the model separately for prefill and decode with tensor parallelism will help me get there.
Additionally, I want a prefill batch size of 1 and a decode batch size of 16.
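In other words, something like this for the two engines (just a sketch of how I'd map the requirements onto vLLM's engine arguments; the model id and anything beyond the TP size and batch caps are placeholders):

```python
# Sketch of the per-role engine arguments described above; these are
# assumptions about how the requirements map onto vLLM's EngineArgs,
# not a tested configuration.
prefill_args = dict(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder model id
    tensor_parallel_size=4,   # shard across 4 of the 8 GPUs
    max_num_seqs=1,           # prefill batch size = 1
)

decode_args = dict(
    model="meta-llama/Llama-3.2-3B-Instruct",
    tensor_parallel_size=4,   # shard across the other 4 GPUs
    max_num_seqs=16,          # decode batch size = 16
)
# Each engine would run in its own process, pinned to its half of the GPUs
# via CUDA_VISIBLE_DEVICES (see the disaggregated-prefill sketch further
# down in the thread).
```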
4 Replies
I haven’t tried this setup before, but given that the model is relatively small, using multiple GPUs might not be beneficial. If the GPUs you’re using aren’t connected via NVLink, the communication overhead between them could actually make it slower than running everything on a single GPU.
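You can check what you actually got on a pod with `nvidia-smi topo -m` (it prints the GPU interconnect matrix), or programmatically with something like this (rough sketch using the `pynvml` / `nvidia-ml-py` bindings; assumes that package is installed on the pod):

```python
# Rough check for active NVLink links on each visible GPU (sketch; assumes
# the nvidia-ml-py / pynvml package is available on the pod).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        active_links = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                    active_links += 1
            except pynvml.NVMLError:
                break  # no (more) NVLink links on this GPU, or NVLink unsupported
        print(f"GPU {i} ({name}): {active_links} active NVLink link(s)")
finally:
    pynvml.nvmlShutdown()
```

If that reports 0 links everywhere, the GPUs are talking over PCIe and the tensor-parallel all-reduces will eat into the latency you're trying to save.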
https://docs.vllm.ai/en/latest/features/disagg_prefill.html
might be relevant
https://docs.vllm.ai/en/latest/getting_started/examples/disaggregated_prefill.html
Maybe you can launch 2 vLLM instances per worker, with config settings like those in the .sh example — roughly like the sketch below.
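Here's roughly what that could look like adapted to your 4 + 4 split, based on vLLM's offline disaggregated-prefill example (untested sketch: the connector name, the `KVTransferConfig` fields, and whether `kv_parallel_size` stays at 2 when each instance runs TP=4 all depend on your vLLM version, so double-check against the linked docs):

```python
# Rough sketch of one worker running a prefill and a decode vLLM instance,
# adapted from vLLM's disaggregated-prefill example. Untested; config fields
# and connector names vary between vLLM versions.
import os
import time
from multiprocessing import Event, Process

MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # assumed model id
PROMPTS = ["Hello, my name is"]             # both sides must see the same prompts


def run_prefill(prefill_done):
    # Pin the prefill engine to the first 4 GPUs and shard it with TP=4.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}'
    )
    llm = LLM(
        model=MODEL,
        kv_transfer_config=ktc,
        tensor_parallel_size=4,
        max_num_seqs=1,            # prefill batch size = 1
        gpu_memory_utilization=0.8,
    )
    # max_tokens=1: the prefill side only needs to produce the KV cache.
    llm.generate(PROMPTS, SamplingParams(temperature=0, max_tokens=1))
    prefill_done.set()
    # Keep the prefill process (and its KV buffer) alive until the parent
    # terminates it after decode has finished.
    while True:
        time.sleep(1)


def run_decode(prefill_done):
    # Pin the decode engine to the remaining 4 GPUs, also sharded with TP=4.
    os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer",'
        '"kv_rank":1,"kv_parallel_size":2}'
    )
    llm = LLM(
        model=MODEL,
        kv_transfer_config=ktc,
        tensor_parallel_size=4,
        max_num_seqs=16,           # decode batch size = 16
        gpu_memory_utilization=0.8,
    )
    prefill_done.wait()            # wait until the KV cache has been produced
    outputs = llm.generate(PROMPTS, SamplingParams(temperature=0, max_tokens=64))
    for out in outputs:
        print(out.outputs[0].text)


if __name__ == "__main__":
    prefill_done = Event()
    prefill = Process(target=run_prefill, args=(prefill_done,))
    decode = Process(target=run_decode, args=(prefill_done,))
    prefill.start()
    decode.start()
    decode.join()        # wait for the decode side to finish
    prefill.terminate()  # then shut down the still-running prefill side
```

For online serving (which is what a RunPod serverless endpoint would front), the linked .sh example does the same thing with two OpenAI-compatible API servers started with `--kv-transfer-config`, plus a small proxy that sends each request to the prefill instance first and then to the decode instance, as far as I remember.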
A high-throughput use case can be served with interleaved decoding on a single GPU.
However, I'm interested in a low-latency setup.
Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
What kind of help do you need?