RunPod•6mo ago
houmie

What is the recommended GPU_MEMORY_UTILIZATION?

All LLM frameworks, such as Aphrodite or OobaBooga, take a parameter that specifies how much of the GPU's memory should be allocated to the LLM.

1) What is the right value? By default, most frameworks use 90% (0.9) or 95% (0.95) of the GPU memory. Why not use the full 100%?

2) Is my assumption correct that increasing the allocation to 0.99 would improve performance, but at a slight risk of an out-of-memory error? That seems paradoxical: if the model doesn't fit into VRAM, it should throw an out-of-memory error at load time. Yet I have noticed that it is possible to get an out-of-memory error even after the model has loaded successfully at 0.99. Could it be that memory usage can sometimes exceed the allocation, so a bit of buffer room is needed?
Solution:
0.94 works
digigoblin
digigoblin•6mo ago
Yes, it's to prevent OOM. The setting differs between model types; some models need it as low as 0.8.
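For illustration, a minimal sketch of checking how much VRAM is already consumed before any engine starts; it assumes PyTorch is installed and uses torch.cuda.mem_get_info, which reports free and total device memory in bytes. The CUDA context created here is part of the overhead the utilization fraction has to leave room for:

```python
# Minimal sketch: see how much VRAM is already in use before an engine starts.
import torch

torch.cuda.init()  # force creation of the CUDA context on the default device
free, total = torch.cuda.mem_get_info()  # both values in bytes
print(f"total VRAM      : {total / 1024**3:.2f} GiB")
print(f"free VRAM       : {free / 1024**3:.2f} GiB")
print(f"already consumed: {(total - free) / 1024**3:.2f} GiB")  # CUDA context etc.
```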
houmie
houmieOP•6mo ago
I see. Do you know the recommended value for Llama3-70B?
Solution
nerdylive
nerdylive•6mo ago
0.94 works
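As a hedged sketch of where this value gets passed, here is vLLM's Python API, which Aphrodite-engine largely mirrors (see the discussion further down); treat the exact signature as an assumption and check the engine's own docs. The model ID and the 0.94 value are simply the ones from this thread:

```python
from vllm import LLM

# gpu_memory_utilization is the fraction of total VRAM the engine may claim
# for weights + KV cache; the rest is left as headroom for everything else.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # model from this thread
    gpu_memory_utilization=0.94,
)
```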
houmie
houmieOP•6mo ago
@nerdylive I'm planning to deploy it on an RTX 4000 Ada, which comes with 20 GB of VRAM. If I choose 0.99, it leaves 200 MB available for the OS; 0.94 leaves 1.2 GB unutilized. Isn't that too much waste? I understand 0.94 makes sense on smaller GPUs, but I'm unsure about bigger ones.
nerdylive
nerdylive•6mo ago
No, that's not how it works. Some of that memory gets used for other things.
houmie
houmieOP•6mo ago
But unlike Windows, there isn't much graphical overhead to load on a Linux server, is there?
nerdylive
nerdylive•6mo ago
Wait, are you using vLLM? If not, then I'm not sure about that.
houmie
houmieOP•6mo ago
No, Aphrodite-engine
nerdylive
nerdylive•6mo ago
It has nothing to do with graphics loaded on the server. It's the other things needed to run the LLM: you can picture it as the LLM itself plus some extra memory for inferencing.
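A quick back-of-the-envelope for the 20 GB example above; the overhead figures in the comments are rough assumptions, not measurements:

```python
# Headroom left on a 20 GB card at different utilization settings.
total_gib = 20.0
for util in (0.99, 0.94):
    claimed = total_gib * util
    headroom_mib = total_gib * (1 - util) * 1024
    print(f"util={util}: engine claims {claimed:.2f} GiB, headroom {headroom_mib:.0f} MiB")

# util=0.99 leaves ~205 MiB, which can be less than the CUDA context alone
# (assumed here to be very roughly 300-800 MiB), so the process can OOM even
# though the weights themselves loaded fine.
# util=0.94 leaves ~1229 MiB for the context, allocator fragmentation, etc.
```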
houmie
houmieOP•6mo ago
I see.
nerdylive
nerdylive•6mo ago
Yeah, I forget what it's called, but it takes GPU memory.
houmie
houmieOP•6mo ago
Well, on vLLM there is FlashBoot, which takes extra GPU memory. But I'm not using vLLM. 🙂
nerdylive
nerdylive•6mo ago
Yeah, I'm not sure what the architecture is like in Aphrodite-engine; it's not FlashBoot.
digigoblin
digigoblin•6mo ago
Aphrodite-engine is actually very similar to vLLM. It takes the best parts of various different things, all rolled up into one awesome engine.
houmie
houmieOP•6mo ago
@digigoblin Agreed. I really like Aphrodite-engine. Are you using Llama3 by any chance? What has been your experience with the optimal setting: 0.94 or higher?
digigoblin
digigoblin•6mo ago
Which version? There are different sizes of Llama3.
nerdylive
nerdylive•6mo ago
3 70b
digigoblin
digigoblin•6mo ago
I don't use that; it's too big and too slow. It takes extremely long just to load the model from network storage before you can even do anything with it.