#Starting out, performance woes


sterile mantle
#

You can try using GGUF with llama.cpp or ctransformers and see if that improves anything

#

Also, a 128k context can be a large RAM/VRAM burden, since the KV cache grows with context length
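
To put a rough number on that, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) is an assumption for illustration, not the shape of the model discussed in this thread, and `kv_cache_bytes` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: one K and one V vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# Hypothetical model shape, assumed for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n_ctx in (4096, 131072):
    gib = kv_cache_bytes(n_ctx, **SHAPE) / 2**30
    print(f"n_ctx={n_ctx}: ~{gib:.1f} GiB of KV cache")
```

Under those assumptions the cache is about 2 GiB at 4k context and about 64 GiB at 128k, on top of the weights. Models that use grouped-query attention have fewer KV heads and will come out lower, so treat this as an order-of-magnitude check only.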

#

Try reducing it to something like 4k when loading the model

#

See if that improves anything
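
Something like this is what I mean, as a minimal sketch assuming you go the llama.cpp route through the llama-cpp-python bindings (my assumption, you may be using something else). `llama_kwargs` and `load_model` are hypothetical helper names; `model_path`, `n_ctx` and `n_gpu_layers` are the `Llama` constructor parameters.

```python
def llama_kwargs(model_path, n_ctx=4096, n_gpu_layers=0):
    """Build constructor arguments with a reduced context window."""
    return {
        "model_path": model_path,      # path to a local GGUF file
        "n_ctx": n_ctx,                # context length in tokens (4k instead of 128k)
        "n_gpu_layers": n_gpu_layers,  # layers to offload to the GPU; 0 keeps everything on CPU
    }


def load_model(model_path, n_ctx=4096, n_gpu_layers=0):
    """Load a GGUF model; needs the llama-cpp-python package installed."""
    from llama_cpp import Llama  # imported lazily so the helper above works without it

    return Llama(**llama_kwargs(model_path, n_ctx, n_gpu_layers))
```

Then `llm = load_model("your-model.Q4_K_M.gguf")`, where the file name is a placeholder for whatever GGUF you downloaded. If 4k runs well, raise `n_ctx` step by step until memory becomes the bottleneck again.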

#

As for which GGUF quant to use, maybe go for Q4_K_M.

sterile mantle
#

Did you try reducing the context?

#

128k context is very large