RunPod•6mo ago
houmie

What is the recommended GPU_MEMORY_UTILIZATION?

All LLM frameworks, such as Aphrodite or OobaBooga, take a parameter that specifies how much of the GPU's memory should be allocated to the LLM.

1) What is the right value? By default, most frameworks use 90% (0.9) or 95% (0.95) of the GPU memory. Why not use the full 100%?

2) Is my assumption correct that increasing the allocation to 0.99 would improve performance, but at a slight risk of an out-of-memory error? That seems paradoxical: if the model doesn't fit into VRAM, it should throw an out-of-memory error at load time. Yet I have noticed that it is possible to get an out-of-memory error even after the model has loaded successfully at 0.99. Could it be that memory usage can sometimes exceed the allocation, so a bit of buffer room is needed?
Solution:
0.94 works
digigoblin
digigoblin•6mo ago
Yes, it's to prevent OOM. The setting differs between model types; some models need it as low as 0.8.
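For illustration, a minimal sketch of checking how much VRAM is already consumed before any engine starts; it assumes PyTorch is installed and uses torch.cuda.mem_get_info, which reports free and total device memory in bytes. The CUDA context created here is part of the overhead the utilization fraction has to leave room for:

```python
# Minimal sketch: see how much VRAM is already in use before an engine starts.
import torch

torch.cuda.init()  # force creation of the CUDA context on the default device
free, total = torch.cuda.mem_get_info()  # both values in bytes
print(f"total VRAM      : {total / 1024**3:.2f} GiB")
print(f"free VRAM       : {free / 1024**3:.2f} GiB")
print(f"already consumed: {(total - free) / 1024**3:.2f} GiB")  # CUDA context etc.
```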
houmie
houmieOP•6mo ago
I see. Do you know the recommended value for Llama3-70B?
Solution
nerdylive
nerdylive•6mo ago
0.94 works
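As a hedged sketch of where this value gets passed, here is vLLM's Python API, which Aphrodite-engine largely mirrors (see the discussion further down); treat the exact signature as an assumption and check the engine's own docs. The model ID and the 0.94 value are simply the ones from this thread:

```python
from vllm import LLM

# gpu_memory_utilization is the fraction of total VRAM the engine may claim
# for weights + KV cache; the rest is left as headroom for everything else.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # model from this thread
    gpu_memory_utilization=0.94,
)
```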
houmie
houmieOP•6mo ago
@nerdylive I'm planning to deploy it on an RTX 4000 Ada, which comes with 20 GB of VRAM. If I choose 0.99, it leaves 200 MB available for the OS; 0.94 leaves 1.2 GB unutilized. Isn't that too much waste? I understand 0.94 makes sense on smaller GPUs, but I'm unsure about bigger ones.
nerdylive
nerdylive•6mo ago
No, that's not how it works. Some of that memory gets used for other things.
houmie
houmieOP•6mo ago
But unlike Windows, there isn't much graphical overhead to load on a Linux server, is there?
nerdylive
nerdylive•6mo ago
Wait, are you using vLLM? If not, then I'm not sure about that.
houmie
houmieOP•6mo ago
No, Aphrodite-engine
nerdylive
nerdylive•6mo ago
It has nothing to do with graphics loaded on the server. It's the other things needed to run the LLM: you can picture it as the LLM itself plus some extra memory for inferencing.
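A quick back-of-the-envelope for the 20 GB example above; the overhead figures in the comments are rough assumptions, not measurements:

```python
# Headroom left on a 20 GB card at different utilization settings.
total_gib = 20.0
for util in (0.99, 0.94):
    claimed = total_gib * util
    headroom_mib = total_gib * (1 - util) * 1024
    print(f"util={util}: engine claims {claimed:.2f} GiB, headroom {headroom_mib:.0f} MiB")

# util=0.99 leaves ~205 MiB, which can be less than the CUDA context alone
# (assumed here to be very roughly 300-800 MiB), so the process can OOM even
# though the weights themselves loaded fine.
# util=0.94 leaves ~1229 MiB for the context, allocator fragmentation, etc.
```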
houmie
houmieOP•6mo ago
I see.
nerdylive
nerdylive•6mo ago
Yeah, I forget what it's called, but it takes GPU memory.
houmie
houmieOP•6mo ago
Well, on vLLM there is FlashBoot, which takes extra GPU memory. But I'm not using vLLM. 🙂
nerdylive
nerdylive•6mo ago
Yeah, I'm not sure what the architecture is like in Aphrodite-engine; it's not FlashBoot.
digigoblin
digigoblin•6mo ago
Aphrodite-engine is actually very similar to vLLM. It takes the best parts of various different things, all rolled up into one awesome engine.
houmie
houmieOP•6mo ago
@digigoblin Agreed. I really like Aphrodite-engine. Are you using Llama3 by any chance? What has been your experience with the optimal setting: 0.94 or higher?
digigoblin
digigoblin•6mo ago
Which version? There are different sizes of Llama3.
nerdylive
nerdylive•6mo ago
3 70b
digigoblin
digigoblin•6mo ago
I don't use that; it's too big and too slow. It takes extremely long just to load the model from network storage before you can even do anything with it.