# Performance issue

pine breach
#

A direct local Ollama session produces 30 tps on my 4 GB VRAM GPU. The same query through OpenClaw 2026.4.15 (041266a) waits 25 seconds before responding and then produces an estimated 5 tps.

Why?

wise dustBOT
versed veldtBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

wise dustBOT
edgy masonBOT
#

Let me verify the local docs paths, then I’ll answer with the likely bottlenecks and what to measure.

#

Mostly because this is not the same workload.

The raw Ollama CLI measures near-pure model decode speed.

OpenClaw adds several layers on top:

  • a much larger prompt: system prompt + tool schemas + session/thread history
  • Discord thread history, default channels.discord.historyLimit is 20
  • agent/runtime bookkeeping before the model starts answering
  • Discord streaming is off by default, so you often see the reply only after generation finishes

On a 4 GB VRAM card, that prompt prefill cost gets brutal. So 30 tps raw can easily feel like “5 tps” end-to-end through OpenClaw.
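The arithmetic behind that: if the wait before the first token (model load plus prefill) is a fixed cost in front of decoding, the perceived rate is output tokens divided by the whole elapsed time. A minimal Python sketch, assuming a hypothetical 150-token answer and a decode rate that stays at the raw 30 tps under OpenClaw:

```python
def effective_tps(output_tokens: int, wait_s: float, decode_tps: float) -> float:
    """Tokens per second as perceived end-to-end, counting the wait before the first token."""
    decode_s = output_tokens / decode_tps  # time spent actually generating
    return output_tokens / (wait_s + decode_s)


if __name__ == "__main__":
    # Hypothetical 150-token answer decoded at 30 tps (the raw Ollama rate).
    print(effective_tps(150, 25.0, 30.0))  # 25 s wait, as reported: 5.0
    print(round(effective_tps(150, 2.0, 30.0), 1))  # 2 s wait: 21.4
```

With the reported 25 s wait, an unchanged 30 tps decode already looks like 5 tps. Cutting the wait to 2 s brings the same answer to about 21 tps, which is why shrinking the prompt matters more than decode speed here.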

Also, double-check this one because it matters a lot:

  • if you configured Ollama as http://...:11434/v1, that is the OpenAI-compatible path
  • for OpenClaw + Ollama, use the native Ollama API: http://...:11434 with no /v1

#

Things worth trying:

  • lower channels.discord.historyLimit
  • agents.defaults.experimental.localModelLean: true
  • if you do not need tools for that model, compat.supportsTools: false

If you paste your Ollama provider config block, I can tell you pretty quickly which of those is biting you hardest.

pine breach
#

Discord thread history, default channels.discord.historyLimit is 20?

What does Discord have to do with the performance of my local OpenClaw instance? I have no Discord channel configured. Are you hallucinating?

edgy masonBOT
#

You’re right, that Discord bit was wrong. I mixed up the support thread we’re in with your local runtime. Sorry.

If your local OpenClaw instance is not using Discord, then Discord has nothing to do with the slowdown.

The real reason is usually this:

  • ollama run = thin raw chat call
  • OpenClaw = full agent runtime

So OpenClaw sends a much fatter prompt:

  • system/developer instructions
  • tool schemas
  • session history
  • workspace/bootstrap context
  • sometimes memory/context plumbing

On a 4 GB VRAM card, that extra prompt ingestion (prefill) often hurts far more than decode speed does. So raw Ollama can show ~30 tps, while OpenClaw feels much slower end-to-end, especially on the first token.
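To see whether prefill or decode dominates, Ollama's native /api/generate response (with "stream": false) reports token counts and nanosecond durations for both phases. A minimal Python sketch using only the standard library, assuming a local server at http://localhost:11434 and the model name passed as the first argument. It measures Ollama alone, so to mimic OpenClaw you would swap "hi" for a prompt of similar size:

```python
import json
import sys
import urllib.request

NS = 1e9  # Ollama reports durations in nanoseconds


def timing_split(resp: dict) -> dict:
    """Split an Ollama /api/generate response into load, prefill and decode figures.

    Missing fields default to 0 so a partial response does not raise.
    """
    prefill_s = resp.get("prompt_eval_duration", 0) / NS
    decode_s = resp.get("eval_duration", 0) / NS
    return {
        "load_s": resp.get("load_duration", 0) / NS,
        "prefill_s": prefill_s,
        "prefill_tps": resp.get("prompt_eval_count", 0) / prefill_s if prefill_s else 0.0,
        "decode_s": decode_s,
        "decode_tps": resp.get("eval_count", 0) / decode_s if decode_s else 0.0,
    }


def generate(model: str, prompt: str, base: str = "http://localhost:11434") -> dict:
    """Call Ollama's native generate endpoint once, without streaming."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


if __name__ == "__main__":
    stats = timing_split(generate(sys.argv[1], "hi"))  # model name from the command line
    for key, value in stats.items():
        print(f"{key:12} {value:8.2f}")
```

`ollama run --verbose` prints the same split interactively as "prompt eval rate" (prefill) and "eval rate" (decode).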

#

A few concrete gotchas:

  • If you pointed OpenClaw at Ollama’s /v1 OpenAI-compatible endpoint, that can be worse. The native Ollama URL is better: http://host:11434 with no /v1.
  • First turn is often the worst because OpenClaw may load workspace/session context.
  • Small local backends often struggle with OpenClaw’s full tool surface. The docs explicitly call out agents.defaults.experimental.localModelLean: true as a workaround.
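The URL rule in the first bullet is easy to check mechanically. A hypothetical Python helper (not part of OpenClaw or Ollama) that strips a trailing /v1 so a base URL targets the native API; as written it also drops any query string or fragment:

```python
from urllib.parse import urlsplit, urlunsplit


def native_ollama_base(url: str) -> str:
    """Return the base URL with a trailing /v1 (OpenAI-compatible path) removed."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path.endswith("/v1"):
        path = path[: -len("/v1")]
    # Query and fragment are intentionally discarded for a base URL.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(native_ollama_base("http://127.0.0.1:11434/v1"))  # http://127.0.0.1:11434
```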

Best isolation test:

openclaw infer model run --model ollama/<your-model> --prompt "hi" --json

That uses the normal agent runtime. If this is already much slower than ollama run, the gap is prompt/runtime overhead, not the transport.
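The comparison above can be scripted. A minimal Python sketch, assuming both CLIs are on PATH and the model name is passed as the first argument; the openclaw arguments are copied from the command above, and `ollama run MODEL PROMPT` is Ollama's one-shot form. It measures wall-clock time only, and the first call may include model load, so a warm-up run before timing gives a fairer comparison:

```python
import subprocess
import sys
import time


def wall_time(argv: list[str]) -> float:
    """Run a command to completion and return its wall-clock time in seconds.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    model = sys.argv[1]  # the name you use with `ollama run`
    raw = wall_time(["ollama", "run", model, "hi"])
    agent = wall_time(
        ["openclaw", "infer", "model", "run", "--model", f"ollama/{model}", "--prompt", "hi", "--json"]
    )
    print(f"ollama run: {raw:.1f} s   openclaw infer: {agent:.1f} s   ratio: {agent / raw:.1f}x")
```

A large ratio here points at prompt/runtime overhead; a ratio near 1 means the slowdown comes from elsewhere.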

Things worth trying:

  • native Ollama API, not /v1