# OpenClaw: slow Telegram responses


old wedgeBOT

A 14B Q8 Ollama model can absolutely feel “slow” on a 24 GB unified-memory Mac, especially once the KV cache grows (long chats, large context): generation becomes memory-bandwidth bound, and the machine may start swapping.
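Why memory bandwidth matters: during generation, each new token has to stream essentially all of the model’s weights through memory once. Dividing bandwidth by model size therefore gives a rough ceiling on tokens per second. Below is a minimal sketch of that arithmetic. The 120 GB/s bandwidth and the ~16 GB (Q8) and ~9 GB (Q4) model sizes are illustrative assumptions, not measurements of this Mac. Real throughput lands below the ceiling once KV cache reads and any CPU offload are added.

```shell
#!/usr/bin/env bash
# Rough decode-speed ceiling: tokens/s <= memory bandwidth / bytes read per token.
# Assumption: each generated token reads all weights once; KV cache reads are ignored.
max_tokens_per_sec() {
  local bandwidth_gbs="$1"  # memory bandwidth in GB/s (illustrative; depends on the chip)
  local weights_gb="$2"     # model size in GB, e.g. the SIZE column of `ollama list`
  awk -v bw="$bandwidth_gbs" -v w="$weights_gb" 'BEGIN { printf "%.1f\n", bw / w }'
}

# Illustrative numbers only: a 120 GB/s chip with a ~16 GB Q8 vs a ~9 GB Q4 model.
echo "Q8 ceiling: $(max_tokens_per_sec 120 16) tok/s"  # prints "Q8 ceiling: 7.5 tok/s"
echo "Q4 ceiling: $(max_tokens_per_sec 120 9) tok/s"   # prints "Q4 ceiling: 13.3 tok/s"
```

The ratio matters more than the absolute numbers. On the same chip, a smaller file raises the ceiling roughly in proportion to how much smaller it is.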

A few quick checks that usually pinpoint what’s going on:

  1. First: what version + what model is OpenClaw actually using?
openclaw --version
openclaw models status

(or in Telegram: /model status)

  2. Is Ollama actually the bottleneck vs Telegram delivery?
  • Compare latency in the TUI (no Telegram involved): openclaw tui
  • Tail logs and look for the LLM request duration vs “send message” duration:
openclaw logs --follow

(please don’t paste your bot token; if you share logs, just share the timing lines)

  3. Make sure you’re not using the wrong Ollama URL (this one bites people):
  4. If it’s genuinely model inference speed:
  • Try reducing the effective context window (big context = big KV cache = slow). In OpenClaw, that means overriding the model’s contextWindow to something smaller (e.g. 4k–8k tokens), or simply starting a fresh chat with /new and keeping threads shorter.
  • If you want speed, use a smaller quantization (Q8 is the largest and slowest of the common ones; a Q4 variant of the same model is roughly half the size) or a smaller model.
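To separate inference time from Telegram delivery, you can time Ollama directly, with OpenClaw and Telegram out of the loop. This is a minimal sketch. It assumes Ollama listens on its default address (http://127.0.0.1:11434) and that `curl` is installed. The non-streaming `/api/generate` response includes `eval_count` (tokens generated), `eval_duration` and `total_duration` (both in nanoseconds), and tokens per second follow from those. `BASE_URL`, the helper names and the prompt are made up for this sketch.

```shell
#!/usr/bin/env bash
# Time one Ollama request directly, with OpenClaw and Telegram out of the loop.
# BASE_URL is a name made up for this sketch; it defaults to Ollama's standard address.
BASE_URL="${BASE_URL:-http://127.0.0.1:11434}"

# Print the first integer value of JSON field $1 from stdin (avoids needing jq).
json_int() {
  grep -o "\"$1\":[0-9][0-9]*" | head -n 1 | cut -d: -f2
}

# Tokens per second from a token count and a duration in nanoseconds.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Nanoseconds to seconds, one decimal place.
ns_to_sec() {
  awk -v ns="$1" 'BEGIN { printf "%.1f\n", ns / 1e9 }'
}

time_ollama() {
  local model="$1" resp n ns total
  resp=$(curl -s "$BASE_URL/api/generate" \
    -d "{\"model\": \"$model\", \"prompt\": \"Reply with one short sentence.\", \"stream\": false}") \
    || { echo "could not reach Ollama at $BASE_URL" >&2; return 1; }
  n=$(printf '%s' "$resp" | json_int eval_count)
  ns=$(printf '%s' "$resp" | json_int eval_duration)
  total=$(printf '%s' "$resp" | json_int total_duration)
  if [ -z "$n" ] || [ -z "$ns" ] || [ -z "$total" ]; then
    echo "unexpected response: $resp" >&2
    return 1
  fi
  echo "$n tokens at $(tokens_per_sec "$n" "$ns") tok/s, $(ns_to_sec "$total") s total"
}
```

Usage: `time_ollama qwen3:14b-q8_0`. Run it twice. The first call can include model load time (reported separately as `load_duration`), so the second number is the fairer comparison with what you see in the TUI and in Telegram.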

If you paste just the output of openclaw --version + the relevant part of openclaw models status (provider/model + contextWindow), I can tell you which of the above is most likely in your setup.


I can help, but I need 2–3 concrete outputs to tell whether the bottleneck is Ollama inference or Telegram delivery / OpenClaw queueing.

On the Mac where the Gateway is running, paste:

openclaw --version
openclaw models status
openclaw status

And if you can, also:

ollama ps

What I’m looking for:

  • Is OpenClaw actually using ollama/qwen3:14b-q8_0 (vs some other default)?
  • What context window it’s running with (big context = slow KV cache).
  • Whether the gateway is healthy / blocked.
  • Whether Ollama is running on the intended backend (and not thrashing memory).

Once you paste those, I’ll point to the specific knob (usually: lower quant / cap contextWindow / start a fresh session / fix Ollama baseUrl).
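If copying four outputs by hand is tedious, a small helper can collect them into one file. In this sketch, the function name and the default output filename are made up for illustration. It runs only the commands listed above, skips any binary that isn’t on PATH and records non-zero exit codes. Skim the file before pasting, in case anything sensitive shows up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the diagnostic commands above and save their output to one file.
collect_diagnostics() {
  local out="${1:-openclaw-diag.txt}" cmd
  : > "$out"   # start with an empty file
  for cmd in "openclaw --version" "openclaw models status" "openclaw status" "ollama ps"; do
    printf '$ %s\n' "$cmd" >> "$out"
    if command -v "${cmd%% *}" >/dev/null 2>&1; then
      # Capture stdout and stderr; note a non-zero exit status if the command fails.
      sh -c "$cmd" >> "$out" 2>&1 || printf '(exit status %d)\n' "$?" >> "$out"
    else
      printf '(%s not found on PATH)\n' "${cmd%% *}" >> "$out"
    fi
    printf '\n' >> "$out"
  done
  printf '%s\n' "$out"
}
```

Usage: `collect_diagnostics`, then `cat openclaw-diag.txt`.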

blissful galleon

I just ran ollama ps: it shows processor 19%/81% CPU/GPU, and context is 40960.

old wedgeBOT

Yep — 40960 ctx is almost certainly why it feels glacial.

Even if your actual chat is short, a model configured with a 40k num_ctx tends to reserve a KV cache sized for the full context when it loads. On a 24 GB unified-memory Mac with a 14B Q8 model, that extra memory can push part of the model off the GPU (your 19% CPU share) and leave everything memory-bound or swapping.
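To see why num_ctx matters even for short chats, here is the usual back-of-the-envelope KV cache estimate: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. The architecture numbers in the sketch are assumed values for Qwen3-14B: 40 layers, 8 KV heads and head dimension 128. The 2 bytes per element assumes an f16 cache. Treat all of them as assumptions and check the model card.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope KV cache size for a given context length.
# ASSUMED Qwen3-14B shape: 40 layers, 8 KV heads (GQA), head dim 128; f16 cache = 2 bytes.
LAYERS=40
KV_HEADS=8
HEAD_DIM=128
BYTES_PER_ELEM=2

kv_cache_mib() {
  local ctx="$1"
  # Factor 2 covers keys and values; bash arithmetic is 64-bit, so this does not overflow.
  echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * ctx / 1024 / 1024 ))
}

# Prints 6400 MiB for 40960, 1280 MiB for 8192 and 640 MiB for 4096.
for ctx in 40960 8192 4096; do
  echo "num_ctx $ctx -> ~$(kv_cache_mib "$ctx") MiB of KV cache"
done
```

Under these assumptions, 40960 tokens of context costs about 6.3 GiB on top of roughly 16 GB of Q8 weights. On a 24 GB machine, that plausibly explains part of the model spilling onto the CPU. At 8192 the cache drops to about 1.3 GiB.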

Fix: cap the model context in Ollama (recommended)

Create a variant of the model with a smaller context (try 8192 first; if still slow, try 4096):

cat > Modelfile <<'EOF'
FROM qwen3:14b-q8_0
PARAMETER num_ctx 8192
EOF

ollama create qwen3:14b-q8_0-8k -f Modelfile

Then switch OpenClaw to that model, however you normally select models (/model, the Control UI, or config). If you’re using implicit Ollama discovery, it should show up automatically.


After switching, re-check:

ollama ps

You should see context closer to 8192 (not 40960), and Telegram replies should speed up noticeably.
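Besides the context value, watch the PROCESSOR column. Once the smaller cache lets the whole model fit on the GPU, it should read 100% GPU instead of 19%/81%. If you’d rather script the check, Ollama’s `/api/ps` endpoint reports each loaded model’s total `size` and the portion held in GPU memory (`size_vram`), which is roughly the ratio PROCESSOR shows. This minimal sketch assumes the default address and looks only at the first loaded model. `BASE_URL` and the function names are made up.

```shell
#!/usr/bin/env bash
# Report what share of the first loaded Ollama model sits in GPU memory.
# Assumes Ollama's default address and no jq; BASE_URL and function names are illustrative.
BASE_URL="${BASE_URL:-http://127.0.0.1:11434}"

json_int() {  # first integer value of field $1 in the JSON on stdin
  grep -o "\"$1\":[0-9][0-9]*" | head -n 1 | cut -d: -f2
}

gpu_pct() {  # $1 = size, $2 = size_vram (bytes) -> integer percent in GPU memory
  echo $(( $2 * 100 / $1 ))
}

check_gpu_share() {
  local resp size vram
  resp=$(curl -s "$BASE_URL/api/ps") || { echo "Ollama not reachable at $BASE_URL" >&2; return 1; }
  size=$(printf '%s' "$resp" | json_int size)
  vram=$(printf '%s' "$resp" | json_int size_vram)
  if [ -z "$size" ] || [ "$size" -eq 0 ]; then
    echo "no model currently loaded" >&2
    return 1
  fi
  echo "GPU share of first loaded model: $(gpu_pct "$size" "${vram:-0}")%"
}
```

Usage: send one message so the model loads, then run `check_gpu_share`.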

Also helps (quick wins)

  • In Telegram, start a fresh session (/new) so you’re not dragging a long thread along.
  • Keep responses shorter while testing (long outputs are slower and Telegram can’t really “stream”).

If you paste the output of openclaw --version and openclaw models status, I can tell you the cleanest place in your setup to set the new …-8k model as the default.

blissful galleon

How do I do this through the terminal?

old wedgeBOT

On macOS it’s just a couple of shell commands. Open Terminal.app, then copy/paste this whole block:

cd ~
cat > Modelfile <<'EOF'
FROM qwen3:14b-q8_0
PARAMETER num_ctx 8192
EOF

ollama create qwen3-14b-q8_0-8k -f Modelfile

Sanity-check it exists:

ollama list | grep qwen3-14b-q8_0-8k

Now point OpenClaw at it (quick test):

openclaw models set ollama/qwen3-14b-q8_0-8k

And verify the context dropped:

ollama ps

If it still feels sluggish, repeat the same process with num_ctx 4096 (make a …-4k model) and switch OpenClaw to that.
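If you end up trying several context sizes, the Modelfile step can be wrapped in a small function. In this sketch, `modelfile_for` and `make_ctx_variant` are made-up helper names, and you choose the new model name yourself. The function writes the same two-line Modelfile as above to a temporary directory, runs `ollama create` on it, and then cleans up.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for building context-capped variants of an Ollama model.

# Print a two-line Modelfile: base model plus a fixed num_ctx.
modelfile_for() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# Usage: make_ctx_variant <base-model> <num_ctx> <new-model-name>
make_ctx_variant() {
  local base="$1" ctx="$2" name="$3" dir rc
  case "$ctx" in
    ''|*[!0-9]*) echo "num_ctx must be a whole number, got '$ctx'" >&2; return 1 ;;
  esac
  if [ -z "$base" ] || [ -z "$name" ]; then
    echo "usage: make_ctx_variant <base-model> <num_ctx> <new-model-name>" >&2
    return 1
  fi
  dir=$(mktemp -d) || return 1
  modelfile_for "$base" "$ctx" > "$dir/Modelfile"
  ollama create "$name" -f "$dir/Modelfile"
  rc=$?          # keep ollama's exit status before cleaning up
  rm -rf "$dir"
  return "$rc"
}
```

For example: `make_ctx_variant qwen3:14b-q8_0 4096 <your-4k-name>`, followed by `openclaw models set ollama/<your-4k-name>`.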

If you tell me whether you want this as the default for all chats or just this Telegram session, I’ll give the cleanest “set default” way for your setup.

blissful galleon

It’s still at context 40960.

old wedgeBOT

If ollama ps is still showing 40960, one of these is happening:

  1. You’re still running the old model (qwen3:14b-q8_0) and the new …-8k model never got loaded, or
  2. OpenClaw is forcing num_ctx based on the model’s contextWindow (so it keeps requesting 40960).
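To tell these two cases apart, first confirm that the new model really has the smaller context baked in. `ollama show <model> --modelfile` prints the model’s Modelfile, including any `PARAMETER num_ctx` line. If that line is there and `ollama ps` still reports 40960, the request itself is asking for the bigger context, which points at cause 2. Ollama applies per-request `num_ctx` options over the Modelfile’s defaults. Below is a sketch of the check; `has_num_ctx` and `check_model_ctx` are made-up helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: does a Modelfile (read from stdin) pin num_ctx to the given value?
has_num_ctx() {
  grep -Eq "^PARAMETER[[:space:]]+num_ctx[[:space:]]+$1[[:space:]]*$"
}

# Usage: check_model_ctx <model> <expected num_ctx>
check_model_ctx() {
  local model="$1" want="$2"
  if ollama show "$model" --modelfile | has_num_ctx "$want"; then
    echo "$model pins num_ctx $want; a larger value in ollama ps means the client requests it"
  else
    echo "$model does not pin num_ctx $want (or was not found); re-create it from the Modelfile"
  fi
}
```

For example: `check_model_ctx qwen3-14b-q8_0-8k 8192`.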

Step 1: confirm which model is actually running

Please paste the full output of:

ollama ps

Look at the NAME column.

Step 2 (most common): stop the old one + force the new one to run

If NAME shows qwen3:14b-q8_0, stop it: