Created by cellular-automaton on 1/10/2025 in #⚡|serverless
Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving
Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The topology would be something like:
1 worker with 8 GPUs in total, where 4 GPUs serve one prefill task and 4 GPUs serve one decode task.
Could someone help me set this up using vLLM on RunPod serverless? I am going for this approach because I want very low latency, and I think sharding the model separately for prefill and decode with tensor parallelism will help me achieve that.
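To make it concrete, here is the kind of layout I am imagining on the single 8-GPU worker. This is only a sketch: the `--kv-transfer-config` JSON follows vLLM's experimental disaggregated-prefill example (a `PyNcclConnector` with `kv_producer`/`kv_consumer` roles), and those field names change between vLLM versions, so please correct me if this is not how it should be wired up on RunPod.

```python
# Sketch: two vLLM OpenAI-compatible servers on one 8-GPU worker.
# GPUs 0-3 -> prefill engine (TP=4), GPUs 4-7 -> decode engine (TP=4).
# The kv-transfer settings mirror vLLM's experimental disaggregated-prefill
# example and may look different in your vLLM version.
import json
import os
import subprocess

MODEL = "meta-llama/Llama-3.2-3B-Instruct"

def launch(kv_role: str, kv_rank: int, gpus: str, port: int) -> subprocess.Popen:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpus          # pin this engine to 4 GPUs
    kv_cfg = json.dumps({
        "kv_connector": "PyNcclConnector",      # assumed connector name
        "kv_role": kv_role,                     # "kv_producer" or "kv_consumer"
        "kv_rank": kv_rank,
        "kv_parallel_size": 2,                  # two participating instances
    })
    return subprocess.Popen(
        ["vllm", "serve", MODEL,
         "--port", str(port),
         "--tensor-parallel-size", "4",         # shard across the 4 visible GPUs
         "--kv-transfer-config", kv_cfg],
        env=env,
    )

prefill = launch("kv_producer", 0, "0,1,2,3", port=8100)
decode = launch("kv_consumer", 1, "4,5,6,7", port=8200)
prefill.wait()
decode.wait()
```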
Additionally, I want the prefill batch size to be 1 and the decode batch size to be 16.
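For the batch sizes, I am assuming `--max-num-seqs 1` on the prefill engine and `--max-num-seqs 16` on the decode engine (added to the launch commands above) are the right caps. On the serverless side, I am picturing a handler along the lines of vLLM's disaggregated-prefill proxy pattern: send the prompt to the prefill server with `max_tokens=1` so the KV cache is produced and transferred, then let the decode server generate the full completion. A rough sketch using the RunPod Python SDK; the ports and request fields just match my sketch above, so they are assumptions, not defaults:

```python
# Sketch of a RunPod serverless handler in front of the two local vLLM servers.
# Step 1 runs the prompt on the prefill engine with max_tokens=1 (producing and
# transferring the KV cache); step 2 lets the decode engine generate the output.
import requests
import runpod

PREFILL_URL = "http://localhost:8100/v1/completions"   # assumed port from the sketch above
DECODE_URL = "http://localhost:8200/v1/completions"    # assumed port from the sketch above
MODEL = "meta-llama/Llama-3.2-3B-Instruct"

def handler(job):
    prompt = job["input"]["prompt"]
    max_tokens = job["input"].get("max_tokens", 256)

    # 1) Prefill pass: a single token forces the prompt's KV cache to be built.
    requests.post(PREFILL_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 1,
    }, timeout=60)

    # 2) Decode pass: the decode engine reuses the transferred KV cache.
    resp = requests.post(DECODE_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }, timeout=300)
    return resp.json()

runpod.serverless.start({"handler": handler})
```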