#llama.cpp Gemma 4 -- what's the correct thinking mode to use?
1 messages · Page 1 of 1 (latest)
Yep, that’s normal, at least to a point.
llama-server web UI is basically a direct chat request. OpenClaw adds agent overhead: more prompt/context, session state, tool plumbing, and any thinking mode you enabled. On local llama.cpp models, that extra work is usually noticeable.
For lowest latency in OpenClaw, try:
/think off- use a faster model for normal chat
- enable prompt caching for longer sessions
- use the raw llama-server UI when you only want plain chat, and OpenClaw when you want tools/memory/automation
Docs:
- https://docs.openclaw.ai/tools/thinking
- https://docs.openclaw.ai/reference/prompt-caching
- https://docs.openclaw.ai/concepts/models
If it feels unusually slow, paste:
openclaw --version