RunPod
Created by cellular-automaton on 1/10/2025 in #⚡|serverless
Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving
A high-throughput use case can be served with interleaved decoding on a single GPU. However, I'm interested in a low-latency setup. Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
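For reference, a minimal sketch of the low-latency side of this setup, assuming vLLM as the serving engine (the thread does not name one) and a hypothetical checkpoint name; tensor parallelism shards every layer across all 8 GPUs, which trades throughput for per-token latency and is why the NVLink interconnect matters:

```python
# Sketch only: tensor-parallel serving of Llama 3.2 3B across 8 GPUs with vLLM.
# vLLM, the checkpoint name, and the flag values are illustrative assumptions.
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 splits each layer's weights across all 8 GPUs,
# so every decode step is an all-reduce over the interconnect (NVLink).
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed checkpoint
    tensor_parallel_size=8,
)

params = SamplingParams(max_tokens=128, temperature=0.0)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Disaggregated serving (separate prefill and decode workers) would sit on top of this and is not shown here.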