Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving

Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The setup would be one worker with 8 GPUs in total: 4 GPUs for one prefill instance and 4 GPUs for one decode instance. Can anyone help me set this up using vLLM on RunPod Serverless? I'm going for this approach because I want very low latency, and I think sharding the model with tensor parallelism, separately for prefill and decode, will help me achieve that. Additionally, I want a prefill batch size of 1 and a decode batch size of 16.
4 Replies
yhlong00000 (2mo ago)
I haven’t tried this setup before, but given that the model is relatively small, using multiple GPUs might not be beneficial. If the GPUs you’re using aren’t connected via NVLink, the communication overhead between them could actually make it slower than running everything on a single GPU.
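If you want to check this on a given pod, nvidia-smi can show how the GPUs are connected (just a quick check, nothing vLLM-specific):

```bash
# Show the GPU interconnect matrix: "NV#" entries mean NVLink,
# while PIX / PHB / SYS mean traffic goes over PCIe or through the CPU.
nvidia-smi topo -m
```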
nerdylive (2mo ago)
https://docs.vllm.ai/en/latest/features/disagg_prefill.html might be relevant, along with https://docs.vllm.ai/en/latest/getting_started/examples/disaggregated_prefill.html. Maybe you can launch 2 vLLM instances per worker, with config settings like those in the .sh example.
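For concreteness, here is a rough sketch of what that two-instance layout could look like on one 8-GPU worker, loosely following the linked .sh example. The model id, ports, GPU split, and the kv-transfer JSON fields are assumptions based on the vLLM disaggregated prefill example and can change between vLLM releases, so double-check them against the docs for the version you're running:

```bash
# Sketch only: one 8-GPU worker, prefill on GPUs 0-3, decode on GPUs 4-7.
MODEL=meta-llama/Llama-3.2-3B-Instruct   # assumed model id

# Prefill instance: TP=4, capped at batch size 1
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve $MODEL \
    --port 8100 \
    --tensor-parallel-size 4 \
    --max-num-seqs 1 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}' &

# Decode instance: TP=4, capped at batch size 16
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve $MODEL \
    --port 8200 \
    --tensor-parallel-size 4 \
    --max-num-seqs 16 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}' &

wait
```

The `--max-num-seqs` values are how you'd cap the prefill instance at batch size 1 and the decode instance at 16. The example in the docs also includes a small proxy script that sends each request to the prefill instance first and then streams tokens from the decode instance; you'd need something like that in front of the two ports.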
cellular-automaton (OP, 2mo ago)
A high-throughput use case can be served with interleaved decoding on a single GPU. However, I'm interested in a low-latency setup. Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
nerdylive (2mo ago)
What kind of help do you need?
