#Qwen 14B wants too much?


rigid orchid
#

I tried to run 14B Q4_K. If I run it from the Ollama console, it eats ~14GB of RAM and runs OK. But from HA (no control) it throws an error: Sorry, I had a problem talking to the Ollama server: model requires more system memory (36.2 GiB) than is available (17.3 GiB)
Is it the context window? What is the default Ollama console context then? Why does it ask for a whopping 36GB?...

acoustic sparrow
#

how many entities are you exposing to it?

rigid orchid
#

Even 7B says it needs 23GB... Geez.

#

Alright, using 2048 context window, on 7B, it didn't fail at least. 🙂
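For anyone reproducing this outside HA, here is a minimal Python sketch of passing the same 2048 context window through Ollama's REST API via `options.num_ctx`. The model tag `qwen2.5:7b`, the prompt, and the default `localhost:11434` address are assumptions for illustration, not taken from this thread.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_ctx: int = 2048) -> dict:
    """Build a non-streaming generate request with an explicit context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("qwen2.5:7b", "Hello")` needs a running server; `build_payload` alone just shows where `num_ctx` goes in the request.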

jade relic
#

Turn on context quantization, that may help

#

Also enable flash attention that helps as well 🙂
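A rough sketch of what those two settings look like as environment variables for the Ollama server, assuming it is started by hand with `ollama serve` rather than as a system service. `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` are the variables I believe control this; the `q8_0` value and the helper names are illustrative choices, not something from this thread.

```python
import os
import subprocess


def ollama_server_env(base=None, kv_cache_type="q8_0"):
    """Return a copy of the environment with flash attention and a quantized KV cache enabled."""
    env = dict(os.environ if base is None else base)
    # Context (KV cache) quantization relies on flash attention being on.
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    # Assumed choice: q8_0 instead of the default f16 cache.
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


def start_server():
    """Hypothetical helper: launch `ollama serve` with the settings above."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_server_env())
```

If Ollama runs as a service (systemd, Docker, the desktop app), the same variables would go into that service's environment instead.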

serene monolith
#

I have set the environment variable OLLAMA_NUM_PARALLEL=1 because it would default to 4 on my 3090 and use considerably more VRAM with larger context windows. This was a while ago, maybe things have changed.
OLLAMA_NUM_PARALLEL: This parameter controls the maximum number of parallel requests each model can process simultaneously. The default value is automatically selected as either 4 or 1, depending on the available memory.
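To put rough numbers on that, a back-of-the-envelope estimate of KV cache size, assuming the cache is sized for context length times parallel slots (which is how I understand it worked when I set this; it may differ now). The layer and head counts below are placeholder values for a 14B-class model with grouped-query attention, not confirmed for the exact model in this thread, so check the model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, num_parallel=1, bytes_per_elem=2):
    """Estimate KV cache size: keys and values for every layer, token and parallel slot."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * num_ctx * num_parallel


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Assumed architecture: 48 layers, 8 KV heads, head dimension 128, f16 cache.
single = kv_cache_bytes(48, 8, 128, num_ctx=8192, num_parallel=1)
four = kv_cache_bytes(48, 8, 128, num_ctx=8192, num_parallel=4)
print(f"parallel=1: {to_gib(single):.1f} GiB, parallel=4: {to_gib(four):.1f} GiB")
```

Under those assumptions an 8192-token context costs about 1.5 GiB of cache with one slot and about 6 GiB with four, on top of the model weights.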

jade relic
#

Think num parallel only matters when you have multiple things using your instance at once? Haven't really messed with that setting tbh