Optimal way of running a LLM | Text Generation WebUI | Page 1

pale forge Dec 10, 2023, 6:42 PM


I am unsure how best to run an LLM. GPU only or GPU + CPU? What about AWQ models?

I have had good success running GPTQ models, but anything bigger than 7B at 4096 context performs horribly, probably because I run out of memory. I'm curious what you would recommend for my PC specs and whether there are any tricks to load larger models with decent performance:

2080TI w/ 11 gb vram
AMD 3700x cpu
31gb system ram

rough acorn Dec 10, 2023, 7:08 PM


2096 context? Not 4096?
For 7B GPTQ models you shouldn't have issues with 11GB VRAM even at 4096 context, and if you use Mistral 7B instead of Llama2-7B, you might be able to go to a higher context, as it consumes less memory.
If you use a Windows machine, you could check the Performance tab in Task Manager.
If memory usage spills over from dedicated GPU memory into shared memory, then you are indeed running out of memory, but I doubt that should happen...
What loader do you use? Exllama1/2 is by far the fastest for GPTQ/exl2 models. AutoGPTQ and GPTQ-for-LLaMa are incredibly slow in comparison.
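
As an alternative to eyeballing Task Manager, dedicated VRAM usage can also be read from `nvidia-smi`. The sketch below assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH; the helper names (`parse_memory_csv`, `vram_headroom_mib`, `query_gpus`) are made up for illustration. Note that it only reports dedicated memory, so a spill into shared memory shows up as the card sitting at (or very near) its total.

```python
import subprocess

# Real nvidia-smi query flags: memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn lines like '10240, 11264' into (used_mib, total_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append((used, total))
    return gpus


def vram_headroom_mib(text):
    """Free dedicated VRAM per GPU, in MiB."""
    return [total - used for used, total in parse_memory_csv(text)]


def query_gpus():
    """Run nvidia-smi and return (used_mib, total_mib) per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)

# Usage (needs an NVIDIA GPU and driver): print(query_gpus())
```

If the headroom is only a few hundred MiB while a model is loaded, a spill into shared memory is a plausible reason for a sudden drop in speed.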

GPU+CPU is a good choice if you wish to use bigger models than your graphics card can handle. I think yours might handle 7B Mistral, 7B Llama1/2 or perhaps even 13B Llama1/2 at low context (1024-2048; the full 4096 likely won't fit).

In my limited experience AWQ uses significantly more memory than GPTQ or EXL2 without clear benefits. But I only tried it when it first appeared, so that may have been down to immature support at the time...

pale forge Dec 10, 2023, 9:11 PM


I do use 4096 mostly. Wrote the wrong number in my OP. I use exllama2 when it works. For reference I can run TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ at 4096 w/ exllama2. But only if I use the "cache_8bit" option. I get 11.56 T/s. Without cache_8bit checked it's 0.13 T/s.
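
A rough budget may explain why `cache_8bit` makes such a difference here. The sketch below assumes the standard Llama-13B shape (40 layers, 40 attention heads of dimension 128), an fp16 cache at 2 bytes per value versus 1 byte with the 8-bit cache, and a guessed 6.8 GiB for the 4-bit weights. The weight figure is an assumption, and activations, the CUDA context and the desktop are not counted at all.

```python
GIB = 1024 ** 3

# Llama-13B shape: keys and values are stored for every layer and head.
N_LAYERS, N_HEADS, HEAD_DIM = 40, 40, 128
N_CTX = 4096

# ASSUMPTION: approximate size of a 4-bit GPTQ 13B checkpoint.
WEIGHTS_GIB = 6.8

values_per_token = 2 * N_LAYERS * N_HEADS * HEAD_DIM  # K and V


def cache_gib(bytes_per_value):
    """KV cache size in GiB for the full context window."""
    return values_per_token * N_CTX * bytes_per_value / GIB


for label, width in (("fp16 cache", 2), ("8-bit cache", 1)):
    total = WEIGHTS_GIB + cache_gib(width)
    print(f"{label}: cache {cache_gib(width):.2f} GiB, total about {total:.1f} GiB")
```

Under these assumptions the fp16 cache alone is over 3 GiB, which puts the total uncomfortably close to an 11 GB card once everything else is added. The 8-bit cache halves that, which would be consistent with the jump from 0.13 T/s to 11.56 T/s if the fp16 run was spilling into shared memory.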

rough acorn Dec 11, 2023, 10:01 AM


Not sure what exactly you want: bigger context or bigger models.

You could try the Mistral-7B family. It has better quality than Llama2-7B and some claim it rivals Llama2-13B.
I have mixed impressions when I compare it with 13B, but it's definitely much better than the average 7B, and it handles bigger contexts better than Llama models.
It can handle 6k context more or less reliably with usable quality, although quality slowly drops as the context gets bigger.
But it doesn't have any issues with GPU memory at all, even if you try the high-context versions. Even MistralLite with 16k should load perfectly fine.
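
The memory difference comes largely from grouped-query attention: Mistral-7B keeps 8 key/value heads per layer where Llama2-7B keeps 32, so its KV cache is a quarter of the size at the same context. A minimal sketch of that arithmetic, assuming an fp16 cache and ignoring Mistral's sliding-window attention (which some loaders can use to shrink the cache further); the `Shape` class is just an illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    name: str
    n_layers: int
    n_kv_heads: int
    head_dim: int

    def kv_cache_gib(self, n_ctx, bytes_per_value=2):
        """Size of the key + value cache in GiB for n_ctx tokens."""
        total = (2 * self.n_layers * self.n_kv_heads * self.head_dim
                 * n_ctx * bytes_per_value)
        return total / 1024 ** 3


LLAMA2_7B = Shape("Llama2-7B", n_layers=32, n_kv_heads=32, head_dim=128)
MISTRAL_7B = Shape("Mistral-7B", n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 16384):
    for model in (LLAMA2_7B, MISTRAL_7B):
        print(f"{model.name:>10} @ {ctx:>5}: {model.kv_cache_gib(ctx):.2f} GiB")
```

At 16k the Mistral cache comes out at the size a Llama2-7B cache would have at 4k, which fits the claim that the 16k variants load without trouble.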

If you really want bigger models instead, take a look at GGUF models that run on CPU+GPU.
I have no idea what speed you'd get for 13B GGUF models.

But I can run 33B GGUF models at about 0.6t/s on a similar setup with a much older CPU and ancient DDR3 memory.
My setup is an RTX3060/12GB, 32GB DDR3 and an i7-3370, which lacks quite a few optimizations your CPU has.
I'd assume your setup should handle this noticeably better than mine.
And even this 0.6t/s for 33B is better than your exllama2 speed with 13B without 8bit cache. So I think this option is worth testing.
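
For GGUF, the main knob is how many layers get offloaded to the GPU (`--n-gpu-layers` in llama.cpp, `n_gpu_layers` in llama-cpp-python). Here is a rough way to pick a starting value, assuming all layers are about the same size. The 19 GiB file size for a 4-bit 33B model and the 2.5 GiB reserve for context and overhead are hypothetical numbers, and `layers_that_fit` is a made-up helper, so treat the result as a first guess to tune from.

```python
import math


def layers_that_fit(model_gib, n_layers, vram_gib, reserve_gib):
    """Estimate how many equally sized layers fit in the usable VRAM."""
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, math.floor(usable_gib / per_layer_gib))


# ASSUMPTIONS: a 33B model with 60 layers in a roughly 19 GiB GGUF file,
# an 11 GiB card, and 2.5 GiB kept free for KV cache and overhead.
n_gpu_layers = layers_that_fit(model_gib=19.0, n_layers=60,
                               vram_gib=11.0, reserve_gib=2.5)
print(n_gpu_layers)
```

If loading fails or VRAM spills over, lower the value by a few layers; if there is headroom left, raise it.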

pale forge Dec 11, 2023, 11:33 PM


Thanks, I'll give that a try. As for what I am looking for... well, not really sure. Still learning this stuff and just trying to find the best model for my hardware. I do plan on upgrading to an RTX 3090 soon.

Context seems to be a real issue. I suppose I should try smaller models with high context settings to see how that performs. Have not experimented much with that yet.

pale forge Dec 12, 2023, 1:39 AM


The Yarn-Mistral-7B-128k-AWQ model runs really well with high context. Using AutoAWQ I can get 19.26 T/s with 26k context. 32k context works fast too, but it takes a bit for the text to start streaming. Once it streams, it's fast. About to try a 33B model and see how that goes.

jaunty knoll Dec 12, 2023, 2:51 AM


Currently I'm using an RTX 4070 and a Core i9, and it takes around 1-4 secs to generate a max of 200 tokens. Will it be faster if I change to a 4090, or something higher than that such as an NVIDIA A100?

pale forge Dec 15, 2023, 4:52 PM


jaunty knoll currently I'm using rtx 4070 and core i9th, it run around 1-4 secs to generate m...

Look into renting some GPU access on a platform like https://vast.ai/ . You can see what sort of performance you can expect with different cards, or combos of cards. And it's cheap to just try a few for some testing.
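
Before renting anything, a back-of-envelope estimate can hint at what to expect. Single-user generation is usually limited by memory bandwidth, since the weights are read once per token, so bandwidth divided by model size gives a loose upper bound. The bandwidth figures below are approximate published specs, and the 4 GB model size (roughly a 7B model at 4 bits) is an assumption; real speeds will be lower.

```python
# Approximate published memory bandwidth in GB/s.
BANDWIDTH_GB_S = {
    "RTX 4070": 504,
    "RTX 4090": 1008,
    "A100 40GB": 1555,
}

# ASSUMPTION: model weights of about 4 GB (e.g. a 7B model at 4 bits).
MODEL_GB = 4.0


def ceiling_tokens_per_s(bandwidth_gb_s, model_gb):
    """Loose upper bound: every token reads all weights once."""
    return bandwidth_gb_s / model_gb


for card, bandwidth in BANDWIDTH_GB_S.items():
    print(f"{card}: at most ~{ceiling_tokens_per_s(bandwidth, MODEL_GB):.0f} T/s")
```

By this measure a 4090 has about twice the ceiling of a 4070. How much of that shows up in practice depends on the loader and the model, which is where a short rental test helps.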


pale forge Dec 17, 2023, 6:07 PM


2024 be like this... 😕

rough acorn Dec 17, 2023, 6:09 PM


~~$428~~ $600 /s

pale forge Dec 17, 2023, 6:10 PM


oops. posted this in wrong discord. haha

rough acorn Dec 17, 2023, 6:10 PM


oh, it reminds me of the crypto epidemic, when there was a GPU shortage until suddenly it just stopped


I barely managed to buy a 2GB card in February 2022 and then already had a 3060/12GB by the end of spring

pale forge Dec 17, 2023, 6:12 PM


Yeah, bitcoin reached a milestone where the mining difficulty took a large leap, crushing the profit margins of the non-datacenter mining operations.
