#Very slow inference. Low % of GPU utilization.


spiral mantle
#

The problem:
My goal is to use Mixtral-8x7B-Instruct to summarize large texts. As the screenshot shows, the model loaded into VRAM successfully (~63% of each GPU's VRAM is used). But when I load a text of ~3.5k tokens and start inference, each GPU's utilization only reaches 15-20% (8% and 12% in the screenshot). As a result, summarizing that text takes about 140 seconds.

I've tried:

  1. Loading with 4-bit quantization
  2. Batching
  3. Increasing text length
  4. Switching to Mistral-7B-Instruct and running it on a single GPU (it fits)
  5. Looking through HF and Langchain docs and GitHub issues

Question 1:
Is this expected behavior? Is it normal for the GPUs not to be fully utilized?

Question 2:
Are there any ways to increase the utilization / inference speed?

My specs:

  • 2xRTX3090
  • 48 GB VRAM total
  • 32 GB RAM

Frameworks:

  • Hugging Face Transformers
  • Langchain

Would really appreciate it if @lofty carbon could take a look at this

lofty carbon
#

batching in MoE does not really work

#

not with vLLM either

#

that's normal

#

as the full model is pretty much hot - so it runs like a dense model

spiral mantle
#

Yeah, I've seen GitHub issues with answers saying that batching won't work, but I tried anyways

#

"pretty much hot"

What do you mean?

lofty carbon
#

experts are selected per token - if you have N requests at the same time .. odds are that all experts will be active at all times

#

makes sense, yes?

#

so instead of running inference like a dense 13B it will have the same speed as a dense 50ish B model
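
To make the "all experts are hot" point concrete, here is a rough back-of-the-envelope sketch. It assumes Mixtral's 8 experts with 2 selected per token and, as a simplification that a trained router does not actually obey, that every token picks its pair uniformly at random and independently of the others. The parameter counts (about 12.9B touched per token, about 46.7B in total) are approximate published figures, and the split into shared and per-expert weights is derived from those two numbers rather than read from the model config.

```python
# Rough estimate of how many expert weights one forward pass touches.
# Assumptions (not from the thread): uniform, independent top-2 routing
# and approximate parameter counts in billions.
NUM_EXPERTS = 8
TOP_K = 2
ACTIVE_B = 12.9   # params touched by a single token (approx.)
TOTAL_B = 46.7    # params in the whole model (approx.)

# Solve: shared + TOP_K * per_expert = ACTIVE_B
#        shared + NUM_EXPERTS * per_expert = TOTAL_B
PER_EXPERT_B = (TOTAL_B - ACTIVE_B) / (NUM_EXPERTS - TOP_K)
SHARED_B = ACTIVE_B - TOP_K * PER_EXPERT_B


def expected_active_experts(num_tokens: int) -> float:
    """Expected number of distinct experts hit by num_tokens tokens."""
    # One token misses a given expert with probability 1 - TOP_K / NUM_EXPERTS.
    p_all_miss = (1 - TOP_K / NUM_EXPERTS) ** num_tokens
    return NUM_EXPERTS * (1 - p_all_miss)


def expected_weights_read_b(num_tokens: int) -> float:
    """Expected billions of parameters read in one forward pass."""
    return SHARED_B + expected_active_experts(num_tokens) * PER_EXPERT_B


for n in (1, 4, 16, 64):
    print(f"{n:>3} tokens in flight: "
          f"{expected_active_experts(n):.2f} experts, "
          f"~{expected_weights_read_b(n):.1f}B params read")
```

Under these assumptions a single token touches the "dense 13B" amount of weights, while a few dozen tokens processed together (several requests, or the prompt tokens of one long request) already touch nearly the whole model.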

spiral mantle
#

Shouldn't keeping all experts active increase the %?

lofty carbon
#

you still pass through the routing

spiral mantle
#

OK, but I've tried "regular" Mistral, and the result is the same (the utilization % is even lower)

#

I still don't understand the answer to Question 1.

lofty carbon
#

plenty of factors affect GPU utilisation .. if the batch is not large enough .. you need to push the requests into the GPU

#

so CPU threads / bandwidth and whatnot

#

all affect that

#

regardless .. that's an inference question

#

and has 0 to do with the model
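
Since utilization depends on how the inference loop feeds the GPUs, it can help to log it over the whole run instead of relying on a single screenshot. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_stats`, `sample_gpus`, and `log_utilization` are hypothetical and not part of any library.

```python
import subprocess
import time

# Query per-GPU utilization and memory as bare CSV (no header, no units).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "gpu": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def sample_gpus() -> list[dict]:
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def log_utilization(duration_s: float, interval_s: float = 1.0) -> list[list[dict]]:
    """Sample all GPUs every interval_s seconds for duration_s seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(sample_gpus())
        time.sleep(interval_s)
    return samples
```

One way to use it is to start `log_utilization` in a background thread just before calling generation, then compare the samples across different batch sizes or prompt lengths to see whether the GPUs are ever kept busy.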

spiral mantle
#

Understood, I thought it might be something model-specific. If so, I'll keep experimenting