cellular-automaton
RunPod
Created by cellular-automaton on 1/10/2025 in #⚡|serverless
Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving
Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The idea is a disaggregated setup, something like 1 worker with 8 total GPUs, where 4 GPUs handle one prefill task and 4 GPUs handle one decode task. Could the experts here help me set this up using vLLM on RunPod serverless? I'm going for this approach because I want very low latency, and I think sharding the model separately for prefill and decode with tensor parallelism will help me achieve that. Additionally, I want a prefill batch size of 1 and a decode batch size of 16.
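For reference, here's roughly what I'm imagining, modeled on vLLM's experimental disaggregated prefill example (KVTransferConfig with the PyNcclConnector). This is a sketch, not something I have working: the exact imports, flags, and connector names are assumptions on my part and are version-dependent, and I don't think the stock RunPod serverless vLLM worker exposes any of this, so it would likely need a custom worker image.

```python
# Sketch of a 4+4 GPU prefill/decode split with vLLM's experimental
# disaggregated prefill. Assumes a vLLM version (~0.6.6-0.7.x) that exposes
# KVTransferConfig and the PyNcclConnector; APIs may differ in other releases.
import os
from multiprocessing import Event, Process

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

MODEL = "meta-llama/Llama-3.2-3B-Instruct"
PROMPTS = ["Explain tensor parallelism in one paragraph."]


def run_prefill(prefill_done):
    # Prefill instance: GPUs 0-3, TP=4, batch size 1, acts as the KV producer.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
        '"kv_rank":0,"kv_parallel_size":2}')
    llm = LLM(model=MODEL,
              tensor_parallel_size=4,
              max_num_seqs=1,              # prefill batch size 1
              kv_transfer_config=ktc,
              gpu_memory_utilization=0.8)
    # max_tokens=1: the prefill side only computes and ships the KV cache.
    llm.generate(PROMPTS, SamplingParams(temperature=0, max_tokens=1))
    prefill_done.set()


def run_decode(prefill_done):
    # Decode instance: GPUs 4-7, TP=4, batch size 16, acts as the KV consumer.
    os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
    ktc = KVTransferConfig.from_cli(
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer",'
        '"kv_rank":1,"kv_parallel_size":2}')
    llm = LLM(model=MODEL,
              tensor_parallel_size=4,
              max_num_seqs=16,             # decode batch size 16
              kv_transfer_config=ktc,
              gpu_memory_utilization=0.8)
    prefill_done.wait()                    # wait until the KV cache is produced
    outputs = llm.generate(PROMPTS, SamplingParams(temperature=0, max_tokens=128))
    for out in outputs:
        print(out.outputs[0].text)


if __name__ == "__main__":
    done = Event()
    prefill = Process(target=run_prefill, args=(done,))
    decode = Process(target=run_decode, args=(done,))
    prefill.start()
    decode.start()
    decode.join()
    # The prefill instance has to stay alive until decode pulls the KV cache.
    prefill.terminate()
```

For an online setup, my understanding is the equivalent would be two `vllm serve` processes (one launched with a kv_producer role and one with a kv_consumer role via `--kv-transfer-config`) behind a small proxy, but I haven't tried wiring that into a serverless worker yet.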
5 replies