slow generation on GTX4080 | Text Generation WebUI | Page 1

marble marlin Dec 19, 2023, 7:55 PM

I have an RTX 4080 with 16 GB VRAM,
32 GB system RAM,
Windows 11.

I am new to Oobabooga and am using TheBloke_LLaMA2-13B-Tiefighter-AWQ. Text generation is sometimes quite slow for me (200 to 600 seconds per generation), and I am trying to figure out why. Are there settings I should check in the WebUI to troubleshoot?

paper burrow Dec 19, 2023, 7:59 PM

You could try the GPTQ version of the same model with the exllamav2_hf loader.
I'm not sure what the exact problem is here, but AWQ has never worked well on my end.
I'm not even sure what the point of using it is, since it doesn't do well in benchmarks against other quantization formats.
I think TheBloke makes them mostly for vLLM.

marble marlin Dec 19, 2023, 8:01 PM

My understanding is that GPTQ is for when you want to use your processor, which is slower than the GPU...

paper burrow Dec 19, 2023, 8:02 PM

no, GPTQ/AWQ/exl2 are for GPU and GGUF is for CPU (and optionally +GPU)

so GPTQ/AWQ/exl2 are meant to be used when the model fits in your VRAM, and GGUF when it doesn't (e.g. you have a weak GPU or want to run big models)
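
The rule of thumb above can be sketched as a quick back-of-the-envelope check. This is only a rough illustration, not something from the thread: the function names are hypothetical, the 3 GiB allowance for context cache and runtime overhead is an assumed placeholder, and real quantized files run somewhat larger than a flat bits-per-weight figure suggests.

```python
# Rough sketch: do a quantized model's weights plausibly fit in VRAM?
# Function names and the overhead allowance are illustrative assumptions.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def suggest_format(params_billion: float, bits_per_weight: float,
                   vram_gib: float, overhead_gib: float = 3.0) -> str:
    """Pick a format family using the 'enough VRAM or not' rule of thumb."""
    # overhead_gib is an assumed allowance for KV cache, activations, etc.
    needed = weight_gib(params_billion, bits_per_weight) + overhead_gib
    if needed <= vram_gib:
        return "GPTQ/AWQ/exl2 (GPU only)"
    return "GGUF (CPU, optionally with some layers on the GPU)"


if __name__ == "__main__":
    # A 13B model at roughly 4 bits per weight on a 16 GiB card
    print(f"{weight_gib(13, 4):.1f} GiB of weights")
    print(suggest_format(13, 4, vram_gib=16))
    # A 70B model at roughly 4 bits per weight on the same card
    print(suggest_format(70, 4, vram_gib=16))
```

By this estimate a 13B model at about 4 bits per weight needs roughly 6 GiB for the weights, so it should sit comfortably on a 16 GB card in any of the GPU-only formats.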

marble marlin Dec 19, 2023, 8:03 PM

Sometimes it is fast though, like 8 seconds, but other times it gets very slow, even after a restart.

paper burrow Dec 19, 2023, 8:04 PM

try GPTQ for now

marble marlin Dec 19, 2023, 8:12 PM

Does it download a lot of models or just one for GPTQ?

paper burrow Dec 19, 2023, 8:12 PM

it would download a folder with one single model version

the file listing would look the same as with AWQ

marble marlin Dec 19, 2023, 8:13 PM

Got it

Loaded the model, but got no welcome message from the AI assistant

paper burrow Dec 19, 2023, 8:16 PM

does it reply to messages at all?
