Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving
Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. I'm aiming for a disaggregated setup, something like:
1 worker with 8 GPUs total, where 4 GPUs serve 1 prefill task and 4 GPUs serve 1 decode task.
Can anyone help me set this up with vLLM on RunPod serverless? I'm going for this approach because I want very low latency, and I think sharding the model separately for prefill and decode with tensor parallelism will help me get there.
Additionally, I want a prefill batch size of 1 and a decode batch size of 16.
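In other words, something like this for the two engines (just a sketch of how I'd map the requirements onto vLLM's engine arguments; the model id and anything beyond the TP size and batch caps are placeholders):

```python
# Sketch of the per-role engine arguments described above; these are
# assumptions about how the requirements map onto vLLM's EngineArgs,
# not a tested configuration.
prefill_args = dict(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder model id
    tensor_parallel_size=4,   # shard across 4 of the 8 GPUs
    max_num_seqs=1,           # prefill batch size = 1
)

decode_args = dict(
    model="meta-llama/Llama-3.2-3B-Instruct",
    tensor_parallel_size=4,   # shard across the other 4 GPUs
    max_num_seqs=16,          # decode batch size = 16
)
# Each engine would run in its own process, pinned to its half of the GPUs
# via CUDA_VISIBLE_DEVICES (see the disaggregated-prefill sketch further
# down in the thread).
```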
4 Replies
I haven’t tried this setup before, but given that the model is relatively small, using multiple GPUs might not be beneficial. If the GPUs you’re using aren’t connected via NVLink, the communication overhead between them could actually make it slower than running everything on a single GPU.
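You can check what you actually got on a pod with `nvidia-smi topo -m` (it prints the GPU interconnect matrix), or programmatically with something like this (rough sketch using the `pynvml` / `nvidia-ml-py` bindings; assumes that package is installed on the pod):

```python
# Rough check for active NVLink links on each visible GPU (sketch; assumes
# the nvidia-ml-py / pynvml package is available on the pod).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        active_links = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                    active_links += 1
            except pynvml.NVMLError:
                break  # no (more) NVLink links on this GPU, or NVLink unsupported
        print(f"GPU {i} ({name}): {active_links} active NVLink link(s)")
finally:
    pynvml.nvmlShutdown()
```

If that reports 0 links everywhere, the GPUs are talking over PCIe and the tensor-parallel all-reduces will eat into the latency you're trying to save.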
https://docs.vllm.ai/en/latest/features/disagg_prefill.html
might be relevant
https://docs.vllm.ai/en/latest/getting_started/examples/disaggregated_prefill.html
Maybe you can launch 2 vLLM instances per worker, with config settings like those in the .sh example — roughly like the sketch below.
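Here's roughly what that could look like adapted to your 4 + 4 split, based on vLLM's offline disaggregated-prefill example (untested sketch: the connector name, the `KVTransferConfig` fields, and whether `kv_parallel_size` stays at 2 when each instance runs TP=4 all depend on your vLLM version, so double-check against the linked docs):

```python
# Rough sketch of one worker running a prefill and a decode vLLM instance,
# adapted from vLLM's disaggregated-prefill example. Untested; config fields
# and connector names vary between vLLM versions.
import os
import time
from multiprocessing import Event, Process

MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # assumed model id
PROMPTS = ["Hello, my name is"]             # both sides must see the same prompts


def run_prefill(prefill_done):
    # Pin the prefill engine to the first 4 GPUs and shard it with TP=4.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}'
    )
    llm = LLM(
        model=MODEL,
        kv_transfer_config=ktc,
        tensor_parallel_size=4,
        max_num_seqs=1,            # prefill batch size = 1
        gpu_memory_utilization=0.8,
    )
    # max_tokens=1: the prefill side only needs to produce the KV cache.
    llm.generate(PROMPTS, SamplingParams(temperature=0, max_tokens=1))
    prefill_done.set()
    # Keep the prefill process (and its KV buffer) alive until the parent
    # terminates it after decode has finished.
    while True:
        time.sleep(1)


def run_decode(prefill_done):
    # Pin the decode engine to the remaining 4 GPUs, also sharded with TP=4.
    os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer",'
        '"kv_rank":1,"kv_parallel_size":2}'
    )
    llm = LLM(
        model=MODEL,
        kv_transfer_config=ktc,
        tensor_parallel_size=4,
        max_num_seqs=16,           # decode batch size = 16
        gpu_memory_utilization=0.8,
    )
    prefill_done.wait()            # wait until the KV cache has been produced
    outputs = llm.generate(PROMPTS, SamplingParams(temperature=0, max_tokens=64))
    for out in outputs:
        print(out.outputs[0].text)


if __name__ == "__main__":
    prefill_done = Event()
    prefill = Process(target=run_prefill, args=(prefill_done,))
    decode = Process(target=run_decode, args=(prefill_done,))
    prefill.start()
    decode.start()
    decode.join()        # wait for the decode side to finish
    prefill.terminate()  # then shut down the still-running prefill side
```

For online serving (which is what a RunPod serverless endpoint would front), the linked .sh example does the same thing with two OpenAI-compatible API servers started with `--kv-transfer-config`, plus a small proxy that sends each request to the prefill instance first and then to the decode instance, as far as I remember.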
A high-throughput use case can be served with interleaved decoding on a single GPU.
However, I'm interested in a low-latency setup.
Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
What kind of help do you need?