vLLM and multiple GPUs
Hi, I am trying to deploy a 3B model (LLM) on RunPod with vLLM. I have tried different configurations (4xL4, 2xL40, etc.), but in all of them I get a CUDA out-of-memory error, as if the GPUs are not sharing memory. I have tried --pipeline-parallel-size and --tensor-parallel-size, but I still get the same error.
Some machines don't have the required technology for multi-GPU communication (NCCL) enabled, but the error for that should be super straightforward. This is how I personally set up vLLM with multiple GPUs.
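Roughly like this - a minimal sketch using vLLM's offline LLM API; the model name is just a placeholder, swap in your own:

```python
import os

from vllm import LLM, SamplingParams

# Number of GPUs on the pod; RunPod sets RUNPOD_GPU_COUNT for you.
# Falls back to 1 when the variable isn't set (e.g. on localhost).
tp_size = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))

# Shard the model across all visible GPUs with tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder 3B model
    tensor_parallel_size=tp_size,
)

# Quick smoke test.
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```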
This uses the $RUNPOD_GPU_COUNT variable we set for you; it's the number of GPUs you selected. If it's not set - like it wouldn't be on your localhost - it just falls back to 1.
Is there any way to know if the machines have NCCL?
If there's no way to check it from the NVIDIA SDK inside the Docker container, then you'll have to ask support for it.
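One quick probe you can try from inside the pod - just a sketch, assuming PyTorch is installed in the container:

```python
import torch

# NCCL version that this PyTorch build can load.
print("NCCL version:", torch.cuda.nccl.version())

# Check whether each GPU pair supports peer-to-peer access,
# which multi-GPU communication benefits from.
n = torch.cuda.device_count()
print("GPUs visible:", n)
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"P2P {i} -> {j}: {ok}")
```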
So I need to reserve, for example, 4xL4 and then check it? And if it doesn't work, I should contact the support team directly?
It would be nice to know this before reserving.
No, I mean you need to figure out a way once, then you can reuse that same way to check other pods (if there is such a feature in NVIDIA's tooling). But if there's no way other than getting more permissions, then you should contact RunPod to check manually.
Or you can ask the support team directly about how to check it.
Usually, if a model can fit on a single GPU, it’s best to use just one. Using multiple GPUs adds overhead for splitting and aggregating the workload.
And what do you recommend to get at least 4,000 rpm?
Serve the model on separate, independent GPUs directly?
Well, test it out; your case might differ from other people's use.
The best way to know is by testing it yourself.
Sure, that works. Or just one strong GPU with lots of VRAM, or a multi-GPU split in vLLM.
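If you go the independent-replica route, here's a rough sketch of launching one vLLM server per GPU - assuming a recent vLLM with the `vllm serve` CLI installed; the model name and ports are placeholders, and you'd still need a load balancer in front:

```python
import os
import subprocess

MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder model name
NUM_GPUS = int(os.environ.get("RUNPOD_GPU_COUNT", "1"))
BASE_PORT = 8000

procs = []
for gpu in range(NUM_GPUS):
    env = os.environ.copy()
    # Pin each server to a single GPU so the replicas stay independent.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    procs.append(
        subprocess.Popen(
            ["vllm", "serve", MODEL, "--port", str(BASE_PORT + gpu)],
            env=env,
        )
    )

# Each replica now listens on its own port (8000, 8001, ...);
# put a load balancer (e.g. nginx) in front to spread the requests.
for p in procs:
    p.wait()
```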