#ollama local timeout
Let me verify the docs and the exact Ollama behavior before I tell them anything.
For plain ollama run, yeah, a 0.8B model can answer fast. For OpenClaw agent turns, not always, because the prompt is much fatter and usually includes tool schemas. Tiny local models often fall over there even when direct Ollama looks fine.
A few concrete checks:
- First, paste:
openclaw --version
- Make sure your Ollama base URL is the native one, not /v1:
baseUrl: "http://<host>:11434"
api: "ollama"
Not:
http://<host>:11434/v1
- Run these and paste the output:
openclaw status --all
openclaw models status
openclaw infer model run --model ollama/<your-model> --prompt "hi" --json
openclaw logs --follow
Most likely causes from the docs/source:
- using Ollama via /v1 instead of native API
- tiny model choking on OpenClaw's full agent prompt/tool surface
- local backend/model context too small for normal agent turns
If it only fails on agent turns, try this pressure-release flag:
{
  agents: {
    defaults: {
      experimental: {
        localModelLean: true
      }
    }
  }
}
If you are on an OpenAI-compatible /v1 path for some reason, docs also say tool calling/streaming can be unreliable there, and you may need:
params: { streaming: false }
Docs:
- https://docs.openclaw.ai/help/faq
- https://docs.openclaw.ai/providers/ollama
- https://docs.openclaw.ai/gateway/local-models
- https://docs.openclaw.ai/gateway/troubleshooting
Paste the version plus the infer model run result and the timeout line from logs, and I can narrow it down fast.
🦞 OpenClaw 2026.4.20 (115f05d)
One CLI to rule them all, and one more restart because you changed the port.
OpenClaw status --all
Overview
┌─────────────────┬────────────────────────────────────────────────────────────┐
│ Item            │ Value                                                      │
├─────────────────┼────────────────────────────────────────────────────────────┤
│ Version         │ 2026.4.20                                                  │
│ OS              │ linux 6.12.75+rpt-rpi-2712 (arm64)                         │
│ Node            │ 22.22.2                                                    │
│ Config          │ ~/.openclaw/openclaw.json                                  │
│ Dashboard       │ http://XXXXXXXXXXXXXXX:18789/                              │
│ Tailscale       │ off                                                        │
│ Channel         │ stable (default)                                           │
│ Update          │ pnpm · up to date · npm latest 2026.4.20                   │
│ Gateway         │ local · ws://127.0.0.1:18789 (local loopback) · reachable  │
│                 │ 28ms · auth token · Pironman5-MAX (192.168.178.131) app    │
│                 │ 2026.4.20 linux 6.12.75+rpt-rpi-2712                       │
│ Security        │ Run: openclaw security audit --deep                        │
│ Gateway self    │ unknown                                                    │
│ Gateway service │ systemd installed · enabled · running (pid 390845, state   │
│                 │ active)                                                    │
│ Node service    │ systemd not installed                                      │
│ Agents          │ 1 total · 1 bootstrapping · 1 active · 96 sessions         │
│ Secrets         │ none                                                       │
└─────────────────┴────────────────────────────────────────────────────────────┘
Channels
┌──────────┬─────────┬────────┬────────────────────────────────────────────────┐
│ Channel  │ Enabled │ State  │ Detail                                         │
├──────────┼─────────┼────────┼────────────────────────────────────────────────┤
│ Telegram │ ON      │ OK     │ token config (sha256:XXXX · len 46) ·          │
│          │         │        │ accounts 1/1                                   │
└──────────┴─────────┴────────┴────────────────────────────────────────────────┘
Telegram accounts
xxx
Agents
┌────────────┬────────────────┬──────────┬──────────┬──────────────────────────┐
│ Agent      │ Bootstrap file │ Sessions │ Active   │ Store                    │
├────────────┼────────────────┼──────────┼──────────┼──────────────────────────┤
│ main       │ PRESENT        │ 96       │ 2m ago   │ ~/.openclaw/agents/main/ │
│            │                │          │          │ sessions/sessions.json   │
└────────────┴────────────────┴──────────┴──────────┴──────────────────────────┘
Diagnosis (read-only)
Gateway connection details:
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/alex/.openclaw/openclaw.json
Bind: lan
----
🦞 OpenClaw 2026.4.20 (115f05d)
Gateway online - please keep hands, feet, and appendages inside the shell at all times.
Config : ~/.openclaw/openclaw.json
Agent dir : ~/.openclaw/agents/main/agent
Default : ollama/glm-5.1:cloud
Fallbacks (0) : -
Image model : ollama/kimi-k2.5:cloud
Image fallbacks (0): -
Aliases (0) : -
Configured models (8): ollama/glm-4.7-9b, ollama/qwen3.5:9b, ollama/glm-5:cloud, ollama/glm-5.1:cloud, ollama/qwen3.5:0.8b, ollama/qwen3.5:demo, ollama/qwen2.5:3b, ollama/qwen2.5:0.5b
- none
----
openclaw infer model run --model ollama/qwen3.5:0.8b --prompt "hi" --json
22:49:07 [agent/embedded] Profile ollama:default timed out. Trying next account...
22:49:07 [agent/embedded] [llm-idle-timeout] ollama/qwen3.5:0.8b produced no reply before the idle watchdog; retrying same model
Nice, that narrows it down a lot. I'm checking the idle-timeout path so I don't guess.
Yep, this is much more useful.
The key clue is this part:
[llm-idle-timeout] ollama/qwen3.5:0.8b produced no reply before the idle watchdog
So this is not the big agent prompt/tool-schema problem I mentioned earlier, because openclaw infer model run ... --prompt "hi" is already timing out. This is lower-level: OpenClaw is not getting a first reply chunk back from that model.
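A side note for readers who haven't met an idle watchdog before: the sketch below shows the general pattern in Python, assuming the watchdog simply limits how long the client waits for the next chunk. It is not OpenClaw's actual code (that isn't shown in this thread), and `read_with_idle_watchdog`, `IdleTimeout`, and `fake_model` are made-up names for illustration only.

```python
import asyncio


class IdleTimeout(Exception):
    """Raised when no chunk arrives within the idle window."""


async def read_with_idle_watchdog(stream, idle_seconds):
    """Collect chunks from an async stream; fail if any wait exceeds idle_seconds."""
    chunks = []
    iterator = stream.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so only silence counts.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return chunks
        except asyncio.TimeoutError:
            raise IdleTimeout(f"no chunk for {idle_seconds}s") from None
        chunks.append(chunk)


async def fake_model(first_delay, chunks):
    """Stand-in for a model that is silent for first_delay seconds, then streams."""
    await asyncio.sleep(first_delay)
    for chunk in chunks:
        yield chunk


async def demo():
    fast = await read_with_idle_watchdog(fake_model(0.01, ["Hel", "lo"]), idle_seconds=0.5)
    try:
        await read_with_idle_watchdog(fake_model(0.3, ["Hel", "lo"]), idle_seconds=0.05)
        slow = "answered"
    except IdleTimeout:
        slow = "idle timeout"
    return fast, slow


result = asyncio.run(demo())
```

The second run fails even though the fake model would eventually answer, which is the same shape as the `[llm-idle-timeout]` log line: a slow first chunk looks identical to no reply at all.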
I'd test these next, in order:
1) Hit the exact Ollama API OpenClaw uses
OpenClaw uses Ollama's native /api/chat, not ollama run.
curl http://127.0.0.1:11434/api/chat \
-H 'content-type: application/json' \
-d '{"model":"qwen3.5:0.8b","messages":[{"role":"user","content":"hi"}],"stream":false}'
If this hangs or returns junk, it's an Ollama/model/API-path issue, not really OpenClaw.
2) Try a control model
openclaw infer model run --model ollama/qwen2.5:3b --prompt "hi" --json
If 3B works and 0.8B does not, then it's probably specific to that tiny model on the API path.
3) Force non-streaming for that one model
You already have the model in agents.defaults.models, so add this under that entry:
{
  agents: {
    defaults: {
      models: {
        "ollama/qwen3.5:0.8b": {
          streaming: false
        }
      }
    }
  }
}
There's a config comment in source saying Ollama models may need streaming: false.
4) Raise or disable the idle watchdog
{
  agents: {
    defaults: {
      llm: {
        idleTimeoutSeconds: 300
        // or 0 to disable for testing
      }
    }
  }
}
Docs/source confirm this path:
agents.defaults.llm.idleTimeoutSeconds
5) Double-check you are not using /v1
For Ollama it should be:
baseUrl: "http://127.0.0.1:11434"
api: "ollama"
Not:
http://127.0.0.1:11434/v1
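If you'd rather check the URL mechanically than by eye, a tiny Python helper like this would do it. `looks_like_native_ollama_url` is a hypothetical name, not an OpenClaw or Ollama function, and it only checks for a trailing /v1 path segment, nothing else.

```python
from urllib.parse import urlparse


def looks_like_native_ollama_url(base_url: str) -> bool:
    """Return False when the base URL ends in the OpenAI-compatible /v1 path."""
    path = urlparse(base_url).path.rstrip("/")
    return not path.endswith("/v1")


native_ok = looks_like_native_ollama_url("http://127.0.0.1:11434")
compat_ok = looks_like_native_ollama_url("http://127.0.0.1:11434/v1")
print(native_ok, compat_ok)
```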
If you paste:
- the /api/chat curl result
- whether qwen2.5:3b works
- your models.providers.ollama block (redact anything you want)
…I can probably tell you exactly which branch this is.
curl http://127.0.0.1:11434/api/chat -H 'content-type: application/json' -d '{"model":"qwen3.5:0.8b","messages":[{"role":"user","content":"hi"}],"stream":false}'
{"model":"qwen3.5:0.8b","created_at":"2026-04-21T20:57:00.676511342Z","message":{"role":"assistant","content":"Hello! How can I help you today?","thinking":"Okay, the user just said \"hi\" in English. My goal is to acknowledge their greeting and offer help. Since I'm an AI, I should be friendly and ready to assist. Let's respond with a warm greeting and invite them to ask anything. Keep it concise and friendly. Maybe offer to help with a question or task. Alright, let's make sure this is natural and helpful. I think we're on the same page, so just a friendly response and an invitation to get started."},"done":true,"done_reason":"stop","total_duration":19913468542,"load_duration":353657437,"prompt_eval_count":11,"prompt_eval_duration":376968092,"eval_count":116,"eval_duration":18928502447}
Perfect, that isolates it.
Your raw Ollama call works, and it returns in about 20s. So the local model itself is fine. The problem is very likely OpenClaw's streamed response path for that model, not Ollama generally.
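To see where those ~20s went, the timing fields in the response can be converted by hand. Ollama reports durations in nanoseconds; the numbers below are copied from the curl output above, and `summarize_timings` is just an illustrative helper name.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_timings(resp: dict) -> dict:
    """Convert Ollama timing fields into seconds and tokens per second."""
    prompt_s = resp["prompt_eval_duration"] / NS_PER_S
    gen_s = resp["eval_duration"] / NS_PER_S
    return {
        "total_s": resp["total_duration"] / NS_PER_S,
        "load_s": resp["load_duration"] / NS_PER_S,
        "prompt_tok_per_s": resp["prompt_eval_count"] / prompt_s,
        "gen_tok_per_s": resp["eval_count"] / gen_s,
    }


timings = summarize_timings({
    "total_duration": 19913468542,
    "load_duration": 353657437,
    "prompt_eval_count": 11,
    "prompt_eval_duration": 376968092,
    "eval_count": 116,
    "eval_duration": 18928502447,
})
for name, value in timings.items():
    print(f"{name}: {value:.2f}")
```

Almost all of the wall time is generation: 116 output tokens at roughly 6 tokens per second, with load and prompt evaluation together under a second for this tiny prompt.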
I'd do this next:
Set that one model to non-streaming
In ~/.openclaw/openclaw.json, under the existing model entry:
{
  agents: {
    defaults: {
      models: {
        "ollama/qwen3.5:0.8b": {
          streaming: false
        }
      }
    }
  }
}
If that entry already has alias or params, keep them and just add streaming: false.
Then restart the gateway and retest:
openclaw gateway restart
openclaw infer model run --model ollama/qwen3.5:0.8b --prompt "hi" --json
Why I think this is it
- curl ... /api/chat ... "stream": false works
- OpenClaw logs show idle watchdog
- that means OpenClaw is likely waiting for a first streamed chunk that never arrives in the shape it expects
Optional smoking-gun test
If you want to confirm it, try Ollama with streaming on:
curl http://127.0.0.1:11434/api/chat \
-H 'content-type: application/json' \
-d '{"model":"qwen3.5:0.8b","messages":[{"role":"user","content":"hi"}],"stream":true}'
If that streamed form is weird, delayed, or silent until the end, that's basically the bug.
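If eyeballing the curl stream is too fuzzy, here is a rough Python sketch that times the first chunk instead. It assumes the streamed /api/chat reply is newline-delimited JSON with a `message` object per line and a final `"done": true`, and that reasoning arrives in `message.thinking` while visible text arrives in `message.content`; treat those field details as assumptions to verify against your Ollama version. `first_chunk_times` and `stream_ollama_chat` are made-up helper names.

```python
import json
import time
import urllib.request


def first_chunk_times(lines, clock=time.monotonic):
    """Return (seconds to first chunk of any kind, seconds to first visible content)."""
    start = clock()
    first_any = first_content = None
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        elapsed = clock() - start
        if first_any is None:
            first_any = elapsed
        if first_content is None and chunk.get("message", {}).get("content"):
            first_content = elapsed
        if chunk.get("done"):
            break
    return first_any, first_content


def stream_ollama_chat(model, prompt, base_url="http://127.0.0.1:11434"):
    """Lazily yield raw response lines from a streaming /api/chat call."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    request = urllib.request.Request(
        base_url + "/api/chat",
        data=body,
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        yield from response


# Offline check with canned lines and a fake clock, so nothing here needs a server.
canned = [
    '{"message": {"role": "assistant", "content": "", "thinking": "hm"}}',
    '{"message": {"role": "assistant", "content": "Hi"}}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
ticks = iter([0.0, 1.0, 5.0, 6.0])
offline = first_chunk_times(canned, clock=lambda: next(ticks))
```

Against the real server you would call `first_chunk_times(stream_ollama_chat("qwen3.5:0.8b", "hi"))`. A long gap before the first chunk of any kind, or between the first chunk and the first visible content, is the number worth comparing with the idle timeout.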
If streaming: false fixes it, I'd call this a real bug/regression and file it here with the repro you already have:
https://github.com/openclaw/openclaw/issues
If you want, paste your agents.defaults.models["ollama/qwen3.5:0.8b"] entry after editing and I'll sanity check it.
from my ai: Joel, [21.04.2026, 23:02:21]:
Found it. The config path is agents.defaults.llm.idleTimeoutSeconds. Default is 120 seconds (2 minutes). For local models processing 12k+ tokens of context on CPU, that's way too short.
The curl test showed the 0.8b model can respond to 11 tokens in 20s. But OpenClaw injects ~12k tokens of context, so prompt eval alone would take much longer.
Let me also check if there's a streaming disable option per model:
Found it. The key line:
/** Enable streaming for this model (default: true, false for Ollama to avoid SDK issue #1205). */
So there IS a per-model streaming config option. And Ollama defaults to false already due to an SDK bug. But the idle timeout is still the problem - it's 120s by default, and with 12k tokens of context on CPU, that's not enough....
Joel found two real knobs, but I don't think the diagnosis is fully right yet.
The part I'd push back on is this:
"120s is too short because OpenClaw injects ~12k tokens"
That might matter for normal agent chats, but your earlier repro was:
openclaw infer model run --model ollama/qwen3.5:0.8b --prompt "hi" --json
That already hit the idle watchdog, and your raw Ollama non-streaming call returned in about 20s. So 120s alone does not explain the current minimal repro.
My read is:
- First suspect: streamed/non-streamed handling for that model
- Second suspect: idle timeout on bigger real agent turns
So I'd test in this order:
{
  agents: {
    defaults: {
      llm: {
        idleTimeoutSeconds: 300
      },
      models: {
        "ollama/qwen3.5:0.8b": {
          streaming: false
        }
      }
    }
  }
}
Then:
openclaw gateway restart
openclaw infer model run --model ollama/qwen3.5:0.8b --prompt "hi" --json
How I'd interpret the result
- If this fixes infer model run: likely a streaming-path bug/quirk
- If infer model run still fails but normal curl works: we've got a more specific OpenClaw/Ollama integration bug
- If infer model run works but normal chat still times out: then Joel's "large context on CPU" theory becomes much more plausible
So, close, but I wouldn't treat "120s is the root cause" as proven yet.
If you want, paste the result after that config change and I'll help narrow the next branch.
Doesn't work.
My AI now suspects:
The direct curl proves the model works fine. 11 prompt tokens → 20 seconds response. The problem is OpenClaw injecting 12k+ tokens of context before the model even starts thinking.
Notice the log: MEMORY.md is 20500 chars (limit 12000); truncating. That's from the infer model run command, which still loads the full agent context. The model has to process all of that before generating a single token.
Streaming isn't the bottleneck here. The bottleneck is prompt processing time. With 12k+ tokens of context on CPU:
Let me verify whether infer model run really pulls in the full agent bootstrap, because that changes the diagnosis a lot.
Yeah, this is the first explanation that really fits the evidence.
The important bit is:
- openclaw infer model run reuses the agent runtime
- normal runs inject workspace bootstrap files
- docs explicitly say that includes MEMORY.md
- your log shows MEMORY.md ... truncating
So the tiny local model is probably burning time on bootstrap/context ingestion before first token, and the idle watchdog fires before it answers.
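A back-of-envelope check in Python, using the prompt-eval figures from the earlier curl result (11 tokens in 376968092 ns). Big caveats: an 11-token prompt is a very noisy basis for a rate, the ~12k token count is the estimate quoted in this thread rather than a measurement, and real throughput on long prompts may be noticeably higher or lower.

```python
NS_PER_S = 1_000_000_000


def estimated_prompt_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to ingest a prompt at a constant (assumed) prompt-eval rate."""
    return prompt_tokens / tokens_per_second


# Rate observed on the 11-token "hi" prompt from the curl test.
measured_rate = 11 / (376968092 / NS_PER_S)

# ~12k tokens is the context size estimated earlier in the thread, not a measurement.
estimate = estimated_prompt_seconds(12_000, measured_rate)

for budget in (120, 300):
    verdict = "fits" if estimate <= budget else "exceeds"
    print(f"~{estimate:.0f}s of prompt eval {verdict} a {budget}s idle window")
```

Under those assumptions the prompt alone would take several minutes before the first token, longer than both a 120s and a 300s idle window, so shrinking the prompt matters more than nudging the timeout.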
One important correction though:
- agents.defaults.skipBootstrap is not the fix here, it only stops auto-creating those files
- the useful knobs are the injection size knobs
What I'd try now, in this order
Recommended: shrink injected bootstrap a lot
{
  agents: {
    defaults: {
      bootstrapMaxChars: 2000,
      bootstrapTotalMaxChars: 8000,
      llm: {
        idleTimeoutSeconds: 300
      },
      experimental: {
        localModelLean: true
      }
    }
  }
}
Then restart and retry:
openclaw gateway restart
openclaw infer model run --model ollama/qwen3.5:0.8b --prompt "hi" --json
Why these help
- bootstrapMaxChars / bootstrapTotalMaxChars reduce how much MEMORY.md and friends get stuffed into prompt context
- localModelLean: true trims heavyweight default tools from the runtime prompt
- idleTimeoutSeconds: 300 gives the model a bit more breathing room on Pi CPU
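To make the two size knobs concrete, here is a toy Python model of a per-file cap plus a total cap. This is only my guess at how such caps could interact, not OpenClaw's actual truncation logic. `apply_bootstrap_caps` is a made-up helper, the AGENTS.md entry and its size are hypothetical, and only the 20500-char MEMORY.md size and the 12000 limit come from the log in this thread.

```python
def apply_bootstrap_caps(files: dict, per_file_max: int, total_max: int) -> dict:
    """Truncate each file to per_file_max chars and stop adding once total_max is spent."""
    injected = {}
    remaining = total_max
    for name, text in files.items():
        piece = text[:min(per_file_max, remaining)]
        injected[name] = piece
        remaining -= len(piece)
    return injected


files = {
    "MEMORY.md": "m" * 20500,  # size taken from the truncation warning in the log
    "AGENTS.md": "a" * 3000,   # hypothetical second bootstrap file
}

# Assumed current behaviour: 12000-char per-file limit, no meaningful total cap.
before = apply_bootstrap_caps(files, per_file_max=12000, total_max=10**9)
# Values from the suggested config.
after = apply_bootstrap_caps(files, per_file_max=2000, total_max=8000)

before_chars = sum(len(text) for text in before.values())
after_chars = sum(len(text) for text in after.values())
print(before_chars, after_chars)
```

In this toy setup the injected bootstrap drops from 15000 to 4000 characters, which is the kind of reduction the config above is aiming for.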
My read now
Your AI is probably right that this is prompt/context pressure, not raw Ollama failure.
The smoking guns are:
- raw /api/chat works
- infer model run reuses agent runtime
- docs say bootstrap files are injected
- truncation warning for MEMORY.md appears during the failing path
Docs:
- https://docs.openclaw.ai/cli/infer
- https://docs.openclaw.ai/reference/token-use
- https://docs.openclaw.ai/gateway/configuration-reference
- https://docs.openclaw.ai/concepts/experimental-features
If that config makes it work, I'd call the thread solved: 0.8B is fine for raw Ollama, but not with OpenClaw's default prompt payload unless you slim the bootstrap/tool surface down.
If it still fails after that, paste the new infer model run output and I'll keep digging.
is there another way to run local models like this?
Short answer: yes, but there isn't a magic backend that makes a 0.8B model handle full OpenClaw agent context well on a Pi5.
My recommendation:
Best option
Keep Ollama, but give the local model a tiny/minimal agent setup
- slim bootstrap
- localModelLean: true
- maybe a separate agent/workspace for local use
That keeps the model local without forcing it to swallow your whole normal agent prompt.
Most practical option
Use cloud for normal assistant chats, local only for narrow inference
- openclaw infer model run
- summarizing small text
- cheap/routine tasks
This is the least frustrating setup.
Other supported local backends
OpenClaw also supports local/OpenAI-compatible routes like:
- llama.cpp
- LM Studio
- vLLM
- SGLang
But on a Pi5, I honestly would not expect those to be better than Ollama for this specific problem. They usually add more compat pain, not less.
If you want, I can sketch the smallest reasonable "local-only lightweight agent" config for your Pi so you can keep cloud + local side by side.
and without ollama?
Yep. Without Ollama, the realistic non-Ollama local path on a Pi is llama.cpp.
OpenClaw can also talk to other local/OpenAI-compatible backends, but on a Pi5 I'd rank them like this:
- llama.cpp server: best non-Ollama option on Pi
- generic OpenAI-compatible local server: if you already have one
- vLLM / SGLang / LM Studio: supported in general, but not what I'd pick for a Pi
Important bit: switching away from Ollama probably won't fix the core issue by itself.
Your bottleneck looks like tiny model + big OpenClaw prompt/context, not "Ollama specifically is broken".
So without Ollama, I'd only try it if you want:
- a lighter runtime stack
- direct control over the server
- a very minimal local-only setup
If you want, I can give you the smallest viable llama.cpp + OpenClaw config for a Pi5.
I notice the model is thinking a lot
Yep, that could absolutely be part of it. Let me check whether there's a clean OpenClaw-side way to turn thinking off for this kind of model.
Yep, that absolutely matters.
Your raw Ollama response literally included:
"thinking": "Okay, the user just said ..."
So the model is spending time in a reasoning phase before giving visible text. On a Pi, that can be brutal.
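A quick way to put a number on that, sketched in Python: compare how much of the reply was thinking versus visible content. The message below is copied from the curl result earlier in the thread (minus the trailing emoji, which got garbled); `thinking_share` is an illustrative helper, and character counts are only a rough proxy for tokens.

```python
def thinking_share(message: dict) -> float:
    """Fraction of the reply's characters that were thinking rather than visible content."""
    thinking = len(message.get("thinking") or "")
    content = len(message.get("content") or "")
    total = thinking + content
    return thinking / total if total else 0.0


message = {
    "role": "assistant",
    "content": "Hello! How can I help you today?",
    "thinking": (
        "Okay, the user just said \"hi\" in English. My goal is to acknowledge their "
        "greeting and offer help. Since I'm an AI, I should be friendly and ready to "
        "assist. Let's respond with a warm greeting and invite them to ask anything. "
        "Keep it concise and friendly. Maybe offer to help with a question or task. "
        "Alright, let's make sure this is natural and helpful. I think we're on the "
        "same page, so just a friendly response and an invitation to get started."
    ),
}

share = thinking_share(message)
print(f"{share:.0%} of the reply was thinking")
```

For a one-word "hi", the large majority of what the model generated was reasoning the user never sees, and at a few tokens per second on a Pi that is most of the 20s.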
Fastest test
Send this in the session:
/think off
That should disable thinking for the current session.
If it helps, make it permanent
{
  agents: {
    defaults: {
      thinkingDefault: "off"
    }
  }
}
You can also do it per-agent instead of globally.
My take
I think you now have two things stacking:
- big OpenClaw prompt/bootstrap
- the model doing visible/internal thinking
So yes, turning thinking off is a very good next test, and I'd combine it with the smaller bootstrap / lean-local settings we talked about.
If /think off changes the behavior, then we've found a real part of the slowdown.
That model can't be stopped from thinking.
Yeah, then I think we've hit the real limit.
If that model's template forces reasoning/thinking, OpenClaw can't really save it. On a Pi5, a tiny always-thinking model + full agent context is just a bad fit.
So at this point Iβd recommend one of these:
- Do not use that exact model as the main OpenClaw chat model
- Pick a non-thinking instruct-style local model instead
- Or keep cloud for normal agent chats and use local only for very small/raw tasks
I wouldn't keep fighting this particular model, honestly.
For current model recommendations, check #models and the docs here:
- https://docs.openclaw.ai/concepts/models
- https://docs.openclaw.ai/concepts/models#model-discussions-community-notes
If you want, I can help you with the OpenClaw config shape for a lightweight local model setup, without picking the model itself.
So how do all these people on YouTube do it? They have a Pi5 with a local model running and it's snappy.
Or is it all a lie?
Not a lie, exactly, but definitely benchmark theater.
"Runs locally" usually means:
- short raw prompt
- no big system prompt
- no tool schema
- no chat history
- no forced reasoning
That is very different from:
- 11 prompt tokens → 20 seconds response on a raw test
vs
- 12k+ tokens of OpenClaw context, tools, bootstrap, and a thinking-heavy model
So the model probably does run locally. It just doesn't run this workload well on a Pi5.
That's the real distinction:
- local inference ✅
- full agentic OpenClaw chat on that tiny model ❌
If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution.
This helps Answer Overflow index the solution for future users.
Could adding an LLM-8850 in the second slot, next to my SSD on the Pironman Max board of my Pi5, make it faster?
I'm checking what that board actually is before I guess.
Maybe, but I would not expect it to be a drop-in fix for your OpenClaw timeout issue.
The important distinction is:
- an accelerator can speed up model inference
- it does not shrink OpenClaw's prompt, tool schema, bootstrap files, or forced thinking
So for your case, it only helps if your local runtime can actually use that card.
My honest take:
- If Ollama/OpenClaw can't offload to the LLM-8850, then it won't help
- If you have to switch to the vendor runtime/server, it might help generation speed
- but it still won't fix the big-context + thinking-model mismatch by itself
Also, on that Pironman Max setup, the extra card is sharing the Pi's PCIe path with the SSD/switch, so I wouldn't assume miracle gains.
So, could it make inference faster? maybe.
Would I buy it expecting this exact OpenClaw problem to disappear? probably not.
If you want, I can help check the concrete question:
"Does LLM-8850 have a usable path with Ollama or an OpenAI-compatible server that OpenClaw can talk to?"
Okay, thanks for clearing that up. So cloud is the way to go even on a Pi?
Yeah, for real OpenClaw agent use, cloud is usually the sane default even on a Pi.
The Pi can still be great for:
- running the gateway
- handling channels/tools
- doing lightweight local stuff
- hosting storage/automation
But for the main chat model, cloud is usually the smoother path.
So the practical split is:
- Pi = host/orchestrator
- Cloud = main brain
- Local models = side experiments or narrow tasks
That's honestly a pretty normal setup, not a failure.
Docs:
- https://docs.openclaw.ai/concepts/models
- https://docs.openclaw.ai/concepts/models#model-discussions-community-notes
Glad we untangled it.
Nothing got fixed, but a lot more is cleared up now, and I see that my initial idea was based on wrong assumptions.
That's still a solid outcome, honestly.
We didn't "fix" it, but we did reduce the mystery:
- your local runtime itself is fine
- OpenClaw is a much heavier workload than a raw Ollama prompt
- that model thinks a lot and can't really be forced into a lightweight agent role
- so the original expectation was just based on the wrong comparison
That's not wasted effort, that's good debugging.
If you revisit it later, I'd treat the next decision as:
- cloud for main OpenClaw chat
- local for small/narrow tasks
- or a different non-thinking local model if you want to keep experimenting
Glad we got it clearer.
Before I uselessly open a new post: my OpenClaw is not switching to the image model when I send a picture via Telegram.
That's a different issue, so please open a new #help post/thread for it, otherwise this one turns into a grab bag and gets hard to index.
Short version though: yes, OpenClaw is supposed to use imageModel as fallback when the selected/default model can't accept image input.
In the new thread, please include:
- openclaw --version
- your agents.defaults.model and agents.defaults.imageModel config
- whether this works from CLI: openclaw infer image describe --file ./test.jpg --json
- the gateway log lines right after you send the Telegram image