#slow generation on RTX 4080

marble marlin
#

I have an RTX 4080 with 16 GB of VRAM.
32 GB system RAM.
Windows 11.

I am new to oobabooga and am using TheBloke_LLaMA2-13B-Tiefighter-AWQ. Text generation is sometimes quite slow for me (200-600 seconds per generation), and I am trying to figure out why. Are there settings I should check in the web UI to troubleshoot?
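
Seconds per generation depends on how long the reply is, so when comparing runs it may help to convert to tokens per second first. A minimal sketch, assuming you can read the number of generated tokens and the elapsed time for each run; the 200-token reply length is a made-up figure for illustration, and `tokens_per_second` is a hypothetical helper, not part of the web UI:

```python
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Generation speed for one reply: tokens produced divided by elapsed time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


# Hypothetical 200-token reply, timed at both ends of the reported range.
fast_run = tokens_per_second(200, 8)    # the occasional quick generation
slow_run = tokens_per_second(200, 400)  # a typical slow generation

print(f"fast: {fast_run:.1f} tok/s, slow: {slow_run:.1f} tok/s")
```

If the two numbers differ by a large factor for replies of similar length, the slowdown is in the generation itself rather than in the reply simply being longer.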

paper burrow
#

You could try the GPTQ version of the same model with the ExLlamav2_HF loader.
I'm not sure what the exact problem is here, but AWQ has never worked well on my end.
I'm not even sure what the point of using it is, since it doesn't benchmark well against other quantization formats.
I think TheBloke makes them mostly for vLLM.

marble marlin
#

My understanding is that GPTQ is for when you want to run on the CPU, which is slower than the GPU...

paper burrow
#

No, GPTQ/AWQ/EXL2 are for GPU, and GGUF is for CPU (optionally with part of the model offloaded to the GPU).

#

So GPTQ/AWQ/EXL2 are meant to be used when you have enough VRAM to hold the whole model, and GGUF when you don't (i.e. you have a weak GPU or want to run big models).
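
As a rough illustration of "enough VRAM", the weight size of a quantized model can be estimated from its parameter count and bits per weight. This is a back-of-the-envelope sketch, not how any loader actually decides: the function names are hypothetical, `overhead_gib` is an assumed placeholder for the context cache and other runtime buffers, and real 4-bit formats store a little more than 4 bits per weight because of scales and zero points.

```python
def estimate_weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(n_params_billion: float, bits_per_weight: float,
                 vram_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    needed = estimate_weights_gib(n_params_billion, bits_per_weight) + overhead_gib
    return needed <= vram_gib


# 13B model at 4 bits on a 16 GiB card: weights are roughly 6 GiB.
print(round(estimate_weights_gib(13, 4), 2), fits_in_vram(13, 4, 16))

# 70B model at 4 bits on the same card: weights alone exceed 16 GiB.
print(round(estimate_weights_gib(70, 4), 2), fits_in_vram(70, 4, 16))
```

By this estimate a 4-bit 13B model fits a 16 GB card with room to spare, which is the case where a GPU-only format is the usual choice; the 70B case is where GGUF with partial offload comes in.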

marble marlin
#

Sometimes it is fast though, like 8 seconds, but other times it gets very slow, even after a restart.
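
One way to gather more data on an intermittent slowdown like this is to look at VRAM use while a slow generation is running. The sketch below assumes an NVIDIA GPU with `nvidia-smi` available on the PATH; the helper names are hypothetical. If usage sits close to the card's total during the slow runs, memory pressure is one possible explanation worth ruling out, though this check alone does not prove it.

```python
import subprocess


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def query_vram() -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) for each GPU by calling nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_memory_line(line)
            for line in result.stdout.splitlines() if line.strip()]


def format_report(readings: list[tuple[int, int]]) -> str:
    """Render one line per GPU with used/total MiB and percentage used."""
    return "\n".join(
        f"GPU {i}: {used} / {total} MiB ({used / total:.0%} used)"
        for i, (used, total) in enumerate(readings)
    )


# Usage (requires nvidia-smi): print(format_report(query_vram()))
```

Running it once during a fast generation and once during a slow one gives two readings to compare.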

paper burrow
#

try GPTQ for now

marble marlin
#

Does it download a lot of models or just one for GPTQ?

paper burrow
#

It would download a folder containing a single version of the model.

#

The file listing would look just like the AWQ one.

marble marlin
#

Got it

#

Loaded the model, but got no welcome message from the AI assistant.

paper burrow
#

Does it reply to messages at all?