The problem:
My goal is to use Mixtral8x7B-Instruct for large text summarization. From the screenshot you can see that the model was successfully loaded into VRAM (~63% of each GPUs VRAM is used). But when I load a text of ~3500k tokens and start inference, each GPU's utilization gets only up to 15-20% (8% and 12% on screenshot). Thus, summarization of that text takes about 140 seconds to finish.
I've tried:
- Loading with 4-bit quantization
- Batching
- Increasing text length
- Switching to Mixtral7B-Instruct and running it on a single GPU (it fits)
- Looking through HF and Langchain docs and GitHub issues
Question 1:
Is this expected behavior? Are GPUs supposed to be not fully utilized?
Question 2:
Are there any ways to increase the utilization / inference speed?
My specs:
- 2xRTX3090
- 48 GB VRAM total
- 32 GB RAM
Frameworks:
- Hugging Face Transformers
- Langchain
Would really appreciate if @lofty carbon could take a look at this