What is the recommended GPU_MEMORY_UTILIZATION?
All LLM frameworks, such as Aphrodite or OobaBooga, take a parameter where you can specify how much of the GPU's memory should be allocated to the LLM.
1) What is the right value? By default, most frameworks are set to use 90% (0.9) or 95% (0.95) of the GPU memory. What is the reason for not using the entire 100%?
2) Is my assumption correct that increasing the memory allocation to 0.99 would enhance performance, but also poses a slight risk of an out-of-memory error? This seems paradoxical: if the model doesn't fit into VRAM, it should throw an out-of-memory error at load time. Yet I have noticed that it is possible to get an out-of-memory error even after the model has been loaded into memory at 0.99. Could it be that memory usage can sometimes exceed this allocation, necessitating a bit of buffer room?
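For reference, this is how I pass the parameter when launching the server. A sketch, assuming Aphrodite-engine mirrors vLLM's OpenAI-compatible server module path and flag names; check `--help` on your version for the exact spelling:

```shell
# Launch an OpenAI-compatible server, capping the engine at 94% of VRAM.
# Module path and flag names are assumptions based on vLLM's conventions.
python -m aphrodite.endpoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --gpu-memory-utilization 0.94
```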
Yes, it's to prevent OOM. The setting is different for different model types.
Some models have to have it as low as 0.8
I see. Do you know what the recommended value is for Llama3-70B?
Solution
0.94 works
@nerdylive I'm planning to deploy it on an RTX 4000 Ada, which comes with 20 GB VRAM. If I choose 0.99, it leaves 200 MB available for the OS. 0.94 leaves 1.2 GB unutilised. Isn't that too much wasted?
I understand that on smaller GPUs 0.94 would make sense, but I'm unsure about bigger ones.
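Here's the back-of-the-envelope I'm doing (the 20 GB figure is the card's spec; the rest is plain arithmetic):

```python
# Headroom left on a 20 GB card at different utilization settings.
# The engine pre-allocates util * total for weights + KV cache;
# everything else (CUDA context, driver buffers, spikes) must fit in the rest.
total_mb = 20 * 1024  # 20 GB in MiB

for util in (0.90, 0.94, 0.99):
    budget_mb = total_mb * util
    headroom_mb = total_mb - budget_mb
    print(f"util={util:.2f}  budget={budget_mb:7.0f} MiB  headroom={headroom_mb:6.0f} MiB")
```

At 0.99 the headroom is about 205 MiB; at 0.94 it's about 1.2 GB.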
No, that's not how it works.
Some of that memory will be used for other things.
But unlike Windows, there isn't much graphical stuff to load on a Linux server, is there?
Wait, are you using vLLM?
If not, then I'm not sure about that.
No, Aphrodite-engine
It has nothing to do with graphics on the server, just other things that are needed to run the LLM.
So the LLM itself + some other things for inferencing, you can imagine it like that.
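You can picture the budget roughly like this (all numbers invented for illustration, not measurements from any real deployment):

```python
# Illustrative memory budget for serving a large model on one GPU.
# Every figure below is a made-up example, not a measured value.
budget_gb = {
    "model weights": 14.0,         # e.g. a quantized large model
    "KV cache": 4.0,               # pre-allocated by the engine
    "CUDA context + libraries": 0.6,  # lives outside the engine's budget
    "activation workspace": 0.8,   # transient allocations during inference
}
total_gb = sum(budget_gb.values())
print(f"total ~= {total_gb:.1f} GB")  # already close to a 20 GB card's limit
```

The point is that the last two rows exist even after the model has "loaded", which is why a sliver of headroom can still be eaten at runtime.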
I see.
yeah
I forgot what it's called, but it takes GPU memory.
Well, on vLLM there is FlashBoot, which takes extra GPU memory. But I'm not using vLLM. 🙂
Yeah, I'm not sure what the architecture is like in Aphrodite-engine.
Not the FlashBoot, though.
Aphrodite-engine is actually very similar to vLLM.
Takes the best parts of various different things all rolled up into one awesome engine
@digigoblin Agreed. I really like Aphrodite engine. Are you using Llama3 by any chance? What is your experience with the optimal setting? 0.94 or higher?
Which version? There are different sizes of Llama3.
3 70b
I don't use that, it's too big and too slow.
It takes extremely long just to load the model from network storage before you can even do anything with it.