RunPod, 8mo ago
octopus

Distributing model across multiple GPUs using vLLM

vLLM has a parameter, TENSOR_PARALLEL_SIZE, to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same whether the model ran on a single GPU or on multiple GPUs.
7 Replies
haris, 8mo ago
cc: @Alpay Ariyak
Alpay Ariyak, 8mo ago
You don't need it, as it's automatically set to the number of GPUs of the worker
nerdylive, 8mo ago
Hey, I think this had problems last time, like on Llama 3 70B with 6 GPUs * 24. I forgot what the error was, but it had to do with this value being set automatically to the number of GPUs. If I'm not wrong, it works with 8 GPUs but not 6.
Alpay Ariyak, 8mo ago
Yeah, that's a vLLM issue; it doesn't allow 6 or 10.
Charixfox, 8mo ago
vLLM specifically says 64 / (GPU count) must leave no remainder. So: 1, 2, 4, 8, 16, 32, and 64.
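To make the divisibility rule concrete, here is a minimal sketch that lists which GPU counts pass the check. It assumes the 64 is the model's attention-head count (which is the case for Llama 3 70B). The function names are hypothetical helpers for illustration, not part of vLLM or the RunPod template.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int = 64) -> list[int]:
    """Return every GPU count that divides the head count with no remainder.

    Assumption: 64 is the attention-head count of the model in question.
    """
    return [
        n for n in range(1, num_attention_heads + 1)
        if num_attention_heads % n == 0
    ]


def is_valid_gpu_count(gpu_count: int, num_attention_heads: int = 64) -> bool:
    """Check a single GPU count against the same divisibility rule."""
    return gpu_count > 0 and num_attention_heads % gpu_count == 0


if __name__ == "__main__":
    print(valid_tensor_parallel_sizes())  # [1, 2, 4, 8, 16, 32, 64]
    print(is_valid_gpu_count(6))          # False
    print(is_valid_gpu_count(8))          # True
```

Under this assumption, a 6-GPU or 10-GPU worker fails the check, while dropping to 4 GPUs or going up to 8 passes.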
nerdylive, 8mo ago
Ah, that sucks.
Charixfox, 7mo ago
It does. I blame vLLM.