RunPod
Created by cellular-automaton on 1/10/2025 in #⚡|serverless
Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving
A high-throughput use case can be served with interleaved decoding on a single GPU. However, I'm interested in a low-latency setup. Agreed on the NVLink part. Do you have any guidance on how to set that up on RunPod?
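For reference, a minimal sketch of the low-latency side of this setup, assuming vLLM as the serving engine (the thread does not name one) and a hypothetical checkpoint name; tensor parallelism shards every layer across all 8 GPUs, which trades throughput for per-token latency and is why the NVLink interconnect matters:

```python
# Sketch only: tensor-parallel serving of Llama 3.2 3B across 8 GPUs with vLLM.
# vLLM, the checkpoint name, and the flag values are illustrative assumptions.
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 splits each layer's weights across all 8 GPUs,
# so every decode step is an all-reduce over the interconnect (NVLink).
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed checkpoint
    tensor_parallel_size=8,
)

params = SamplingParams(max_tokens=128, temperature=0.0)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Disaggregated serving (separate prefill and decode workers) would sit on top of this and is not shown here.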