I tried to run 14B Q4_K. If I run it from the Ollama console, it eats ~14GB of RAM and runs OK. But from HA (no control) it throws an error: Sorry, I had a problem talking to the Ollama server: model requires more system memory (36.2 GiB) than is available (17.3 GiB)
Is it the context window? What is the default Ollama console context then? Why does it ask for a whopping 36GB?...
#Qwen 14B wants too much?
how many entities are you exposing to it?
I said there - no control. So zero.
Even 7B says it needs 23GB... Geez.
Alright, using a 2048 context window on 7B, it didn't fail at least.
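For reference, a minimal sketch of pinning the context window per request through Ollama's REST API by passing `num_ctx` in `options`. The host, the port (Ollama's default 11434), and the `qwen2.5:7b` model tag are assumptions, not taken from the thread. The request is only built here, not sent.

```python
import json
import urllib.request


def build_generate_request(prompt, model="qwen2.5:7b", num_ctx=2048,
                           host="http://localhost:11434"):
    """Build (but do not send) a POST to Ollama's /api/generate endpoint."""
    payload = {
        "model": model,          # assumed model tag, adjust to what you pulled
        "prompt": prompt,
        "stream": False,
        # num_ctx caps the context window for this request
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Hello")
print(req.full_url)
print(json.loads(req.data)["options"])
```

To actually send it, pass `req` to `urllib.request.urlopen` while the Ollama server is running.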
Turn on context quantization, that may help
Also enable flash attention, that helps as well
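A rough sketch of what those two suggestions look like as server settings, assuming they map to the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` environment variables (as far as I know, the quantized KV cache only takes effect with flash attention on). The helper name `ollama_server_env` is made up for illustration, and `q8_0` is just one possible cache type.

```python
import os


def ollama_server_env(base_env=None, kv_cache_type="q8_0"):
    """Return a copy of the environment with the memory-saving settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"           # turn on flash attention
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type   # quantize the context (KV) cache
    return env


env = ollama_server_env(base_env={})
print(env)

# To launch the server with these settings (needs the `ollama` binary on PATH):
# import subprocess
# subprocess.run(["ollama", "serve"], env=ollama_server_env())
```

If Ollama runs as a service or in Docker, the same two variables would go into the service or container environment instead.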
I have set the environment variable OLLAMA_NUM_PARALLEL=1 because it would default to 4 on my 3090 and use considerably more VRAM with larger context windows. This was a while ago, maybe things have changed.
OLLAMA_NUM_PARALLEL: This parameter controls the maximum number of parallel requests each model can process simultaneously. The default value is automatically selected as either 4 or 1, depending on the available memory.
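To see why parallel slots matter for memory, here is a back-of-the-envelope sketch of KV cache size, assuming the server allocates context for `num_ctx * num_parallel` tokens. The layer and head counts below are my assumption for a 14B-class model with grouped-query attention, not numbers from the thread, and this covers only the KV cache (no weights or compute buffers), so it is not meant to reproduce the 36.2 GiB figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx,
                   num_parallel=1, bytes_per_element=2):
    """Estimate KV cache size in bytes (bytes_per_element=2 means f16)."""
    total_tokens = num_ctx * num_parallel  # each parallel slot gets its own context
    # Factor 2: one tensor for keys and one for values, per layer
    return 2 * n_layers * n_kv_heads * head_dim * total_tokens * bytes_per_element


GIB = 1024 ** 3
# Assumed dimensions for a 14B-class model; check your model's metadata
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for parallel in (1, 4):
    size = kv_cache_bytes(num_ctx=8192, num_parallel=parallel, **dims)
    print(f"num_parallel={parallel}: {size / GIB:.1f} GiB")
# num_parallel=1: 1.5 GiB
# num_parallel=4: 6.0 GiB
```

Under these assumptions the cache grows linearly with both the context window and the number of parallel slots, which would explain why dropping either one frees a lot of memory.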
Damn Nick, you know so much about this! Cool!
Think num parallel only matters when you have multiple things using your instance at once? Haven't really messed with that setting tbh