#slow generation on RTX 4080
You could try the GPTQ version of the same model with the exllamav2_hf loader.
I'm not sure what the exact problem is here, but AWQ never worked well on my end.
So I'm not sure what the point of using it is, since it doesn't even benchmark well against other quantization formats.
I think TheBloke makes them mostly for vLLM.
My understanding is that GPTQ is for when you want to run on your processor, which is slower than the GPU...
no, GPTQ/AWQ/exl2 are for GPU, and GGUF is for CPU (optionally with part of the model offloaded to the GPU)
so GPTQ/AWQ/exl2 are meant for when you have enough VRAM to hold the model, and GGUF for when you don't (i.e. a weak GPU, or you want to run big models)
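To make that rule of thumb concrete, here's a rough sketch in Python. The function name `pick_quant_format` and the 2 GB headroom for context/cache are assumptions for illustration, not something any loader actually implements, and the sizes in the example are made-up numbers (the 16 GB figure is the VRAM on a 4080).

```python
def pick_quant_format(model_size_gb: float, vram_gb: float,
                      overhead_gb: float = 2.0) -> str:
    """Suggest a quantization family from the rule of thumb above.

    model_size_gb: size of the quantized weights on disk (approximate).
    vram_gb:       total VRAM on the GPU.
    overhead_gb:   assumed headroom for context/cache; 2.0 is a guess,
                   not a measured value.
    """
    # If the weights plus headroom fit in VRAM, a GPU-only format is an option.
    if model_size_gb + overhead_gb <= vram_gb:
        return "GPTQ/AWQ/exl2"
    # Otherwise fall back to GGUF, which runs on CPU with optional GPU offload.
    return "GGUF"


# Hypothetical sizes on a 16 GB card:
print(pick_quant_format(7.5, 16.0))
print(pick_quant_format(20.0, 16.0))
```

It's only a first check: actual memory use also depends on context length and the loader, so treat the result as a starting point.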
Sometimes it is fast though, like 8 seconds. But other times it gets really slow, even after a restart.
try GPTQ for now
For GPTQ, does it download a lot of models or just one?
it would download one folder containing a single model version
the file listing would be the same as with AWQ
does it reply with anything at all to your messages?