Very slow inference. Low % of GPU utilization. | Mistral AI | Page 1

spiral mantle Apr 12, 2024, 2:37 PM

#

The problem:
My goal is to use Mixtral-8x7B-Instruct for large text summarization. From the screenshot you can see that the model was successfully loaded into VRAM (~63% of each GPU's VRAM is used). But when I load a text of ~3.5k tokens and start inference, each GPU's utilization only gets up to 15-20% (8% and 12% on the screenshot). As a result, summarizing that text takes about 140 seconds.

I've tried:

Loading with 4-bit quantization
Batching
Increasing text length
Switching to Mistral-7B-Instruct and running it on a single GPU (it fits)
Looking through HF and Langchain docs and GitHub issues

Question 1:
Is this expected behavior? Is it normal for the GPUs not to be fully utilized?

Question 2:
Are there any ways to increase the utilization / inference speed?

My specs:

2xRTX3090
48 GB VRAM total
32 GB RAM
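
As a rough sanity check on the VRAM figure above, the weight memory can be estimated from the parameter count. This is only a sketch: it assumes the commonly quoted ~46.7B total parameters for Mixtral 8x7B, treats GB and GiB as interchangeable, and ignores the KV cache, activations and quantization metadata. The function name is made up for illustration.

```python
# Rough weight-memory estimate for Mixtral 8x7B at different precisions.
# Assumption: ~46.7e9 total parameters (publicly quoted figure).
TOTAL_PARAMS = 46.7e9
VRAM_TOTAL_GB = 48        # 2 x RTX 3090, from the specs above
REPORTED_FRACTION = 0.63  # "~63% of each GPU's VRAM is used"


def weight_gb(params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


reported_used_gb = VRAM_TOTAL_GB * REPORTED_FRACTION

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gb(TOTAL_PARAMS, bits):6.2f} GB")
print(f"reported usage: {reported_used_gb:6.2f} GB of {VRAM_TOTAL_GB} GB")
```

Under these assumptions, 16-bit weights would not fit in 48 GB at all, while 4-bit weights come in below the reported usage, which is consistent with the model having been loaded in 4-bit.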

Frameworks:

Hugging Face Transformers
Langchain

Would really appreciate it if @lofty carbon could take a look at this

lofty carbon Apr 12, 2024, 2:37 PM

#

batching in MoE does not really work

#

not with vLLM either

#

that's normal

#

as the full model is pretty much hot - so it runs like a dense model

spiral mantle Apr 12, 2024, 2:39 PM

#

Yeah, I've seen GitHub issues with answers saying that batching won't work, but I tried anyways

#

"pretty much hot"

wym?

lofty carbon Apr 12, 2024, 2:44 PM

#

experts are selected per token - if you have N requests at the same time, odds are that all experts will be active at all times

#

makes sense yes ?
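
A back-of-the-envelope way to see the point about experts: Mixtral 8x7B routes each token to 2 of its 8 experts in every layer. The sketch below assumes routing is uniform and independent across tokens, which is a simplification (real routers are learned and not uniform), and the function names are made up for illustration.

```python
from math import comb

NUM_EXPERTS = 8  # experts per MoE layer in Mixtral 8x7B
TOP_K = 2        # experts selected per token


def expected_active_experts(num_tokens: int,
                            num_experts: int = NUM_EXPERTS,
                            top_k: int = TOP_K) -> float:
    """Expected number of distinct experts hit by `num_tokens` tokens."""
    # Chance that one particular expert is skipped by a single token.
    p_skipped = comb(num_experts - 1, top_k) / comb(num_experts, top_k)
    return num_experts * (1 - p_skipped ** num_tokens)


def prob_all_experts_active(num_tokens: int,
                            num_experts: int = NUM_EXPERTS,
                            top_k: int = TOP_K) -> float:
    """Probability that every expert is hit at least once (inclusion-exclusion)."""
    choices = comb(num_experts, top_k)
    prob = 0.0
    for idle in range(num_experts + 1):
        # Chance that one token avoids a fixed set of `idle` experts.
        avoid = comb(num_experts - idle, top_k) / choices
        prob += (-1) ** idle * comb(num_experts, idle) * avoid ** num_tokens
    return prob


for n in (1, 4, 16, 64):
    print(n, round(expected_active_experts(n), 3),
          round(prob_all_experts_active(n), 4))
```

With a single token exactly 2 experts are active; with a few dozen tokens in flight, under the uniform-routing assumption, essentially all 8 are.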

#

so instead of running like a dense 13B it will have the same speed as a dense 50ish B model
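
The 13B versus 50-ish B figures can be reproduced roughly from the model's shape. The sketch below assumes the dimensions from the public Mixtral-8x7B config (hidden size 4096, 32 layers, 8 experts with 2 active per token, grouped-query attention with 8 KV heads); treat the constants as assumptions if you adapt it to another checkpoint.

```python
# Parameter count for Mixtral 8x7B, assuming the public config values.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
VOCAB = 32000
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
NUM_EXPERTS = 8
TOP_K = 2


def layer_params(experts_counted: int) -> int:
    """Parameters in one decoder layer, counting `experts_counted` experts."""
    # Q and O projections are full size; K and V are smaller (grouped-query attention).
    attention = 2 * HIDDEN * NUM_HEADS * HEAD_DIM + 2 * HIDDEN * NUM_KV_HEADS * HEAD_DIM
    expert_mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up and down projections
    router = HIDDEN * NUM_EXPERTS
    norms = 2 * HIDDEN
    return attention + experts_counted * expert_mlp + router + norms


def model_params(experts_counted: int) -> int:
    """Whole-model parameters, counting `experts_counted` experts per layer."""
    embeddings = 2 * VOCAB * HIDDEN  # input embedding plus untied output head
    final_norm = HIDDEN
    return LAYERS * layer_params(experts_counted) + embeddings + final_norm


total = model_params(NUM_EXPERTS)  # all experts resident and in use
active = model_params(TOP_K)       # weights touched for a single token
print(f"total:  {total / 1e9:.1f}B")   # about 46.7B
print(f"active: {active / 1e9:.1f}B")  # about 12.9B
```

So a single token only touches about 12.9B parameters, but once enough tokens are in flight to hit every expert, each forward pass has to read all ~46.7B.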

spiral mantle Apr 12, 2024, 2:47 PM

#

Shouldn't keeping all experts active increase the %?

lofty carbon Apr 12, 2024, 2:47 PM

#

you still pass through the routing

spiral mantle Apr 12, 2024, 2:49 PM

#

Ok, but I've tried "regular" Mistral. And the result is the same (% is even lower)

#

I still don't understand the answer to Question 1.

lofty carbon Apr 12, 2024, 2:53 PM

#

plenty of factors affect GPU utilisation .. if the batch is not large enough, the GPU sits idle .. you need to push requests into the GPU fast enough

#

so cpu threads / bandwidth and what not

#

all affect that

#

regardless .. that's an inference question

#

and has 0 to do with the model
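
One way to check whether the GPU is simply under-fed, as suggested above, is to time generation at a few batch sizes and compare tokens per second. The helper below is a hypothetical sketch, not part of Transformers or Langchain: `generate` stands for whatever wrapper you write around your own model call, and the result is an upper bound because it assumes every sequence produces exactly `max_new_tokens` tokens.

```python
import time
from typing import Callable, Dict, Iterable, List


def measure_throughput(generate: Callable[[List[str], int], object],
                       prompt: str,
                       batch_sizes: Iterable[int],
                       max_new_tokens: int,
                       clock: Callable[[], float] = time.perf_counter
                       ) -> Dict[int, float]:
    """Return generated tokens per second for each batch size."""
    results: Dict[int, float] = {}
    for batch_size in batch_sizes:
        batch = [prompt] * batch_size
        start = clock()
        generate(batch, max_new_tokens)  # user-supplied model call
        elapsed = clock() - start
        # Assumes each sequence really generated max_new_tokens tokens.
        results[batch_size] = batch_size * max_new_tokens / elapsed
    return results


if __name__ == "__main__":
    # Stand-in for a real model: constant latency regardless of batch size.
    def fake_generate(prompts: List[str], max_new_tokens: int) -> None:
        time.sleep(0.01)

    for size, tps in measure_throughput(fake_generate, "hello", [1, 2, 4], 16).items():
        print(f"batch {size}: {tps:.0f} tokens/s")
```

If tokens per second keeps rising as the batch grows, the GPU had spare capacity at batch size 1; if it stays flat, the bottleneck is elsewhere (CPU-side overhead, data transfer, or the model already saturating the device).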

spiral mantle Apr 12, 2024, 2:56 PM

#

Understood, I thought it might be something model-specific. If so, I'll keep experimenting
