# Starting out, performance woes
7 messages · Page 1 of 1 (latest)
You can try a GGUF version of the model with llama.cpp or ctransformers and see if that improves performance
Also, a 128k context can be a large RAM/VRAM burden, since the KV cache grows with the context length
Try reducing it to something like 4k when loading the model
See if that improves anything
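To get a feel for why the context size matters, here is a rough back-of-the-envelope estimate of KV cache size. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are hypothetical placeholders, not the actual values for the model discussed here, so substitute the numbers from your model's config. It also ignores the weights and any compute buffers.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Each token stores one key vector and one value vector (hence the
    factor of 2) per layer, each of size n_kv_heads * head_dim.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# Hypothetical model dimensions, for illustration only.
DIMS = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 131072):
    gib = kv_cache_bytes(n_ctx, **DIMS) / 2**30
    print(f"n_ctx={n_ctx:>6}: ~{gib:.1f} GiB of KV cache")
```

With these assumed dimensions the cache comes to about 0.5 GiB at 4k and about 16 GiB at 128k, on top of the model weights, so dropping the context is a cheap first thing to test.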
As for which GGUF quant to use, maybe go for Q4_K_M.
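Putting the suggestions together, a minimal sketch of loading a Q4_K_M GGUF with a 4k context might look like this. It assumes the llama-cpp-python bindings (`pip install llama-cpp-python`) rather than the llama.cpp CLI or ctransformers; the model path and the helper names `load_kwargs` and `load_model` are made up for illustration.

```python
from pathlib import Path

# Hypothetical path: point this at whichever Q4_K_M GGUF file you downloaded.
MODEL_PATH = Path("models/your-model.Q4_K_M.gguf")


def load_kwargs(model_path, n_ctx=4096, n_gpu_layers=0):
    """Build keyword arguments for llama_cpp.Llama with a reduced context."""
    return {
        "model_path": str(model_path),
        "n_ctx": n_ctx,  # 4k instead of the model's full 128k
        # 0 keeps everything on the CPU; -1 offloads all layers,
        # which only helps if the library was built with GPU support.
        "n_gpu_layers": n_gpu_layers,
    }


def load_model(model_path=MODEL_PATH, **overrides):
    # Imported lazily so this file can be read without the library installed.
    from llama_cpp import Llama

    return Llama(**load_kwargs(model_path, **overrides))


if __name__ == "__main__" and MODEL_PATH.exists():
    llm = load_model()
    out = llm("Q: What is a GGUF file? A:", max_tokens=64)
    print(out["choices"][0]["text"])
```

If 4k runs comfortably, you can raise `n_ctx` step by step until memory use or speed becomes a problem again.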