RunPod · 3w ago
Teddy

vLLM and multiple GPUs

Hi, I am trying to deploy a 3B model (LLM) on RunPod with vLLM. I have tried different configurations (4xL4, 2xL40, etc.), but in all of them I get a CUDA out-of-memory error, as if the GPUs are not sharing memory. I have tried pipeline-parallel-size and tensor-parallel-size, but I still get the same error.
8 Replies
Dj · 3w ago
Some machines don't have the required technology for using multiple GPUs (NCCL) enabled, but the error for that should be very straightforward. This is how I personally set up vLLM with multiple GPUs:
import os
from vllm import LLM

model = LLM(
    model="mistralai/Ministral-8B-Instruct-2410",
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
    # RUNPOD_GPU_COUNT is set by RunPod to the number of GPUs on the pod; fall back to 1 locally.
    tensor_parallel_size=int(os.environ.get("RUNPOD_GPU_COUNT", "1")),
)
This uses the $RUNPOD_GPU_COUNT variable we set for you; it's the number of GPUs you selected. If it's not set, like it wouldn't be on your localhost, it just falls back to 1.
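A quick sanity check (assuming PyTorch is available in the container, which the standard vLLM images include) to confirm the variable matches what the CUDA runtime actually exposes:

import os
import torch

# Compare RunPod's advertised GPU count with what CUDA actually sees.
runpod_count = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))
visible_count = torch.cuda.device_count()
print(f"RUNPOD_GPU_COUNT = {runpod_count}, torch sees {visible_count} GPU(s)")
if runpod_count != visible_count:
    print("Mismatch: check CUDA_VISIBLE_DEVICES or the pod configuration.")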
Teddy (OP) · 3w ago
Is there any way to know whether the machines have NCCL?
Jason · 3w ago
If there's no way to check from the NVIDIA SDK inside the Docker container, then you've got to ask support for it.
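One rough way to check from inside the container, sketched under the assumption that PyTorch is installed (the vLLM images ship with it): print the NCCL version bundled with torch and run a tiny all_reduce across all visible GPUs. If NCCL is unusable on that host, the all_reduce will raise or hang instead of printing a result.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # One process per GPU; they rendezvous over localhost and do a single
    # NCCL all_reduce. Every rank should print world_size as the result.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)
    print(f"rank {rank}: all_reduce result = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    print("NCCL version bundled with torch:", torch.cuda.nccl.version())
    n = torch.cuda.device_count()
    if n < 2:
        print("Fewer than 2 GPUs visible; nothing to test.")
    else:
        mp.spawn(worker, args=(n,), nprocs=n)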
Teddy (OP) · 3w ago
So I need to reserve, for example, 4xL4 and then check it? And if it doesn't work, I should contact the support team directly? It would be nice to know this before reserving.
Jason · 3w ago
No, I mean you need to figure out a way to check it once, then you can reuse the same method on other pods (if NVIDIA's tooling offers something for it). But if there's no way without extra permissions, then you should contact RunPod to check manually, or ask the support team directly about how to check.
yhlong00000 · 3w ago
Usually, if a model can fit on a single GPU, it’s best to use just one. Using multiple GPUs adds overhead for splitting and aggregating the workload.
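As a rough back-of-envelope illustration (the numbers here are assumptions: bf16 weights, a single 24 GB L4, vLLM's default gpu_memory_utilization of 0.9), a 3B model should fit on one GPU with plenty of room left for KV cache:

# Back-of-envelope memory budget for a 3B model on a single 24 GB GPU.
params = 3e9               # 3B parameters
bytes_per_param = 2        # bf16 / fp16
weight_gb = params * bytes_per_param / 1e9

gpu_gb = 24                # e.g. one L4
usable_gb = gpu_gb * 0.9   # vLLM's default gpu_memory_utilization
kv_cache_gb = usable_gb - weight_gb

print(f"weights ≈ {weight_gb:.1f} GB of the ≈ {usable_gb:.1f} GB budget")
print(f"≈ {kv_cache_gb:.1f} GB left for KV cache and activations")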
Teddy (OP) · 3w ago
And what do you recommend to reach at least 4,000 requests per minute? Serving the model on several independent GPUs directly?
Jason · 3w ago
Well, test it out; your case might differ from others', so testing it yourself is the best way to know. Sure, that works, or just one strong GPU with a lot of VRAM, or a multi-GPU split in vLLM.
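For the "independent GPUs" option, here is a minimal sketch of round-robining requests across single-GPU vLLM replicas behind the OpenAI-compatible API; the endpoint URLs and model name are placeholders, not anything RunPod-specific:

import itertools
import requests

# One single-GPU vLLM server per GPU, each on its own port (placeholder URLs).
ENDPOINTS = itertools.cycle([
    "http://localhost:8000/v1/completions",
    "http://localhost:8001/v1/completions",
])

def complete(prompt: str) -> str:
    # Send each request to the next replica in turn.
    url = next(ENDPOINTS)
    resp = requests.post(url, json={
        "model": "my-3b-model",   # placeholder model name
        "prompt": prompt,
        "max_tokens": 64,
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Hello"))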
