What is the recommended GPU_MEMORY_UTILIZATION? | Runpod

queen brook Jun 13, 2024, 7:28 AM

#

All LLM frameworks, such as Aphrodite or OobaBooga, take a parameter where you can specify how much of the GPU's memory should be allocated to the LLM.

What is the right value? By default, most frameworks use 90% (0.9) or 95% (0.95) of the GPU memory. Why not use the full 100%?
Am I right that raising the allocation to 0.99 would improve performance, at a slight risk of an out-of-memory error? That seems paradoxical: if the model doesn't fit into GPU memory, I would expect the out-of-memory error at load time. Yet I have seen out-of-memory errors even after the model was loaded at 0.99. Could memory usage sometimes exceed this allocation, so that a bit of buffer room is needed?

vestal moat Jun 13, 2024, 7:30 AM

#

Yes, it's to prevent OOM. The right setting differs between model types.

#

Some models need it as low as 0.8

queen brook Jun 13, 2024, 7:49 AM

#

I see. Do you know what the recommended value is for Llama3-70B?

radiant pecan Jun 13, 2024, 10:24 AM

#

queen brook I see. Do you know what the recommended value is for Llama3-70B?

0.94 works

queen brook Jun 13, 2024, 5:27 PM

#

@radiant pecan I'm planning to deploy it on an RTX 4000 Ada, which comes with 20 GB of VRAM. If I choose 0.99, that leaves 200 MB available for the OS; 0.94 leaves 1.2 GB unutilised. Isn't that too much wasted?

#

I understand that 0.94 would make sense on smaller GPUs, but I'm unsure about bigger ones.
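
The headroom figures in this message can be checked with a few lines of Python. This is only a sketch: `vram_headroom_gb` is a hypothetical helper, not part of Aphrodite or vLLM, and it treats 1 GB as 1000 MB as the message does.

```python
def vram_headroom_gb(total_vram_gb: float, utilization: float) -> float:
    """Return the VRAM (in GB) left outside the engine's memory budget."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_vram_gb * (1.0 - utilization)


# 20 GB card, as in the RTX 4000 Ada example
for u in (0.90, 0.94, 0.99):
    print(f"{u:.2f} -> {vram_headroom_gb(20, u):.2f} GB outside the budget")
```

The same fraction leaves more absolute headroom on a larger card, which is why the question of "wasted" memory comes up at all.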

radiant pecan Jun 13, 2024, 5:27 PM

#

No, that's not how it works

#

Some of that will be used for other things

queen brook Jun 13, 2024, 5:28 PM

#

But unlike Windows, there isn't much graphical stuff to load on a Linux server, is there?

radiant pecan Jun 13, 2024, 5:29 PM

#

Wait, are you using vLLM?

#

If not, then I'm not sure about that

queen brook Jun 13, 2024, 5:29 PM

#

No, Aphrodite-engine

radiant pecan Jun 13, 2024, 5:29 PM

#

queen brook But unlike Windows there isn't much graphical to load on a Linux server, isn't i...

It has nothing to do with graphics being loaded on the server, just other things that are needed to run the LLM

#

So it's the LLM itself plus some other things for inference; you can think of it like that
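
One way to picture "the LLM itself plus other things" is as a budget. The sketch below assumes a vLLM-style engine, where the utilization fraction caps the memory used for model weights, activations, and KV cache together; whether Aphrodite accounts for memory in exactly this way is not confirmed in this thread. The function name and all sizes are hypothetical, chosen only for illustration.

```python
def kv_cache_budget_gb(
    total_vram_gb: float,
    utilization: float,
    weights_gb: float,
    activation_peak_gb: float,
) -> float:
    """Estimate the memory (GB) left for KV cache inside the engine's budget.

    Assumption: budget = total VRAM * utilization, and the budget must hold
    the weights, the peak activation memory, and the KV cache.
    """
    budget = total_vram_gb * utilization
    remaining = budget - weights_gb - activation_peak_gb
    if remaining <= 0:
        raise MemoryError("weights and activations alone exceed the budget")
    return remaining


# Hypothetical sizes: 16 GB of weights and a 1.5 GB activation peak on a 20 GB card
for u in (0.90, 0.94, 0.99):
    print(f"{u:.2f} -> {kv_cache_budget_gb(20, u, 16.0, 1.5):.2f} GB for KV cache")
```

Under these assumptions, a higher fraction only buys more KV cache. Anything the process allocates outside the budget still has to fit in what is left, which is one way an out-of-memory error can appear after the model has loaded.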

queen brook Jun 13, 2024, 5:29 PM

#

I see.

radiant pecan Jun 13, 2024, 5:31 PM

#

yeah

#

I forgot what it's called, but it takes GPU memory

queen brook Jun 13, 2024, 5:31 PM

#

Well, on vLLM there is FlashBoot, which takes extra GPU memory. But I'm not using vLLM. 🙂

radiant pecan Jun 13, 2024, 5:32 PM

#

Yeah, I'm not sure what the architecture is in Aphrodite engine

#

Not FlashBoot

vestal moat Jun 13, 2024, 7:02 PM

#

Aphrodite engine is actually very similar to vLLM

#

Takes the best parts of various different things all rolled up into one awesome engine

queen brook Jun 14, 2024, 8:50 AM

#

@vestal moat Agreed. I really like Aphrodite engine. Are you using Llama3 by any chance? What is your experience with the optimal setting? 0.94 or higher?

vestal moat Jun 14, 2024, 8:51 AM

#

queen brook @vestal moat Agreed. I really like Aphrodite engine. Are you using Llam...

Which version? There are different sizes of Llama3

radiant pecan Jun 14, 2024, 8:57 AM

#

3 70b

vestal moat Jun 14, 2024, 8:58 AM

#

I don't use that; it's too big and too slow
