Persistent exec tool instability and missing tool result errors on Linux | Friends of the Crustacean 🦞🤝 | Page 1

mild hare Feb 11, 2026, 9:37 AM

#

I'm experiencing significant, persistent, and intermittent instability with the exec tool, causing various tasks to fail. openclaw doctor --deep reports no core issues.

Primary issues observed

[openclaw] missing tool result in session history errors:
• Description: Common and disruptive. Commands execute, but OpenClaw fails to capture the subprocess result (likely due to a crash, hang, or unexpected exit).
• Affected scenarios:
  • browser(action="snapshot"), browser(action="open"): Frequent failures, often leading to "Can't reach browser control service" and Gateway restarts. Browser processes terminate unexpectedly.
  • gog gmail thread get --full --json: Failed with large email content.
  • git clone: Intermittent missing tool result even on successful clones.
  • cp -r: Intermittent failures, especially with directories.
  • openclaw hooks enable: Failed repeatedly.
  • khal new: Caused partial calendar event creation failures.
  • browser(action="type"): Failed with "Unknown ref" after browser restart.
  • default_api.write to ~/.clawface/avatar_state.json: Intermittently failed to update avatar state.

Commands hanging/silent failures:
• Description: Commands appear to hang indefinitely within exec.
• Affected scenario: curl -s "wttr.in/Sao+Paulo?format=3": Reported as "stuck/frozen" in cron reports, missing weather data.

gog calendar events instability:
• Description: Even after fixing the syntax for gog calendar calendars, gog calendar events list shows intermittent failures like "Google API error (404 notFound)" in cron jobs, despite valid calendar IDs. Not a permissions issue for listing.

sudo commands:
• Description: sudo commands (e.g., sudo apt install) consistently fail in background exec: without pty=true they stall at the password prompt, and with pty=true they fail if the prompt times out.

General observations:
• Intermittent: Not always reproducible; retries or Gateway restarts sometimes work.
• Complexity: More complex commands and commands with large output are more prone to failure.

neon shoalBOT Feb 11, 2026, 9:37 AM

#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

lyric prairieBOT Feb 11, 2026, 9:40 AM

#

“missing tool result in session history” is almost always a symptom (the transcript has a tool call without the matching tool result), not the root cause. In practice it happens when the Gateway or the tool runner dies/restarts mid-tool, or when a huge tool output triggers a crash/abort before it can be persisted. Once it happens, strict providers (notably OpenAI/Codex) will often refuse the next run until you /new.
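To make the "tool call without a matching tool result" shape concrete, here is a minimal sketch of the pairing check. It assumes a hypothetical, simplified event list with one `tool_use <id>` or `tool_result <id>` per line. OpenClaw's real transcript format is not shown in this thread, so both the input format and the `find_unpaired_calls` name are illustrative only.

```shell
# Hypothetical input: one event per line, "<kind> <id>", where kind is
# tool_use or tool_result. This is NOT OpenClaw's real transcript format.
find_unpaired_calls() {
  awk '
    $1 == "tool_use"    { pending[$2] = NR }    # remember each call by id
    $1 == "tool_result" { delete pending[$2] }  # a result closes its call
    END { for (id in pending) print id }        # calls that never got a result
  ' | sort
}
```

Any id this prints is a call whose result was never recorded, which is the state a mid-tool restart can leave behind.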

1) Grab the 3 outputs that usually reveal the real cause

Please paste these (they’re designed to be shareable/redacted):

openclaw --version
openclaw status --all
openclaw gateway status

Then, when the failure happens, also paste the log tail around it:

openclaw logs --limit 300
# or live:
openclaw logs --follow

Docs reference: https://docs.openclaw.ai/gateway/troubleshooting (see “Status & Diagnostics” + logging)

2) On Linux, check for OOM kills / service restarts (super common with browser + large outputs)

If you’re on a small VPS (1GB/2GB) and running the browser tool, OOM kills can look exactly like what you describe (browser control dies, Gateway restarts, exec result “missing”).

Read-only checks:

free -h
df -h
journalctl -k -b | grep -i -E "oom|killed process" || true
journalctl --user -u openclaw-gateway.service -n 200 --no-pager
# (service name may be openclaw-gateway-<profile>.service)
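If you want a single number instead of scrolling through the grep output, the kernel-log check above can be wrapped in a small helper. This is only a sketch: `count_oom_lines` is a made-up name, and the pattern is the same loose one used above, so unrelated lines that happen to contain "oom" would be counted too.

```shell
# Count lines on stdin that look like kernel OOM-killer activity.
# Prints 0 (and still exits successfully) when nothing matches.
count_oom_lines() {
  grep -c -i -E "oom|killed process|out of memory" || true
}
```

Usage would look like `journalctl -k -b | count_oom_lines`. A result of 0 makes OOM kills unlikely for the current boot.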

3) Mitigations per symptom (while we chase root cause)

A) Huge stdout/stderr (gmail thread --full --json, etc.)
If a command can dump megabytes to stdout, redirect it to a file and post-process:

gog gmail thread get --full --json ... > thread.json
jq '.snippet, .messages[].payload.headers[]? | select(.name=="Subject")' thread.json

This avoids “tool result too big → crash/abort” style failures.
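The same redirect idea can be made reusable. Below is a rough sketch of a wrapper that sends stdout to a file (and stderr to a sibling `.err` file, so JSON output stays parseable) and prints only a one-line summary. `run_to_file` is a hypothetical helper name, not something OpenClaw ships.

```shell
# Run a command with stdout sent to a file and stderr to "<file>.err",
# then print a one-line summary so the tool result stays small.
run_to_file() {
  out_file="$1"
  shift
  status=0
  "$@" > "$out_file" 2> "$out_file.err" || status=$?
  bytes=$(wc -c < "$out_file" | tr -d ' ')
  printf 'exit=%s bytes=%s file=%s\n' "$status" "$bytes" "$out_file"
  return "$status"
}
```

For example, `run_to_file thread.json gog gmail thread get --full --json ...` would leave the full payload on disk for jq while the tool result itself is a single short line.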

B) “curl wttr.in …” hanging in cron
Make it fail fast and/or set an exec timeout:

curl -fsS --connect-timeout 5 --max-time 10 "wttr.in/Sao+Paulo?format=3"

(Exec tool also has its own timeout parameter; default is 1800s. Docs: https://docs.openclaw.ai/tools/exec)
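For commands that have no timeout flags of their own, a generic deadline wrapper gives the same fail-fast behavior. This sketch assumes GNU coreutils `timeout` is installed (it exits with 124 when the limit is hit); `with_deadline` is a hypothetical helper name.

```shell
# Run a command under a hard wall-clock limit using coreutils `timeout`.
# Writes a short marker to stderr instead of hanging when the limit is hit.
with_deadline() {
  seconds="$1"
  shift
  status=0
  timeout "$seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "timed out after ${seconds}s" >&2
  fi
  return "$status"
}
```

In a cron job this could wrap the weather call as `with_deadline 15 curl -fsS "wttr.in/Sao+Paulo?format=3"`, so a stuck request surfaces as exit code 124 rather than a frozen report.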

C) Browser tool: “Can’t reach browser control service” + restarts
On Ubuntu, this is very often Snap Chromium (it’s a wrapper + AppArmor confinement). The Linux browser troubleshooting doc is spot-on here:
https://docs.openclaw.ai/tools/browser-linux-troubleshooting

If you’re using snap Chromium, the recommended fix is installing google-chrome-stable and setting browser.executablePath. (Yes, that involves sudo, so do it manually as the operator.)

D) “Unknown ref” after browser restart
That one is expected: element refs are only valid for the current page snapshot. If the browser restarted, you must re-snapshot and use fresh refs.

E) sudo in background exec
Also expected: anything that prompts for a password will hang/fail unless you’ve explicitly designed for non-interactive sudo. If you want it to fail fast instead of hanging, use sudo -n ... (no prompt). Generally: do system package installs manually, not via the agent.
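A small guard built on `sudo -n` can tell a script up front whether a privileged step would prompt. This is a sketch with a hypothetical function name. It relies only on the `-n` (non-interactive) flag, which makes sudo fail instead of asking for a password.

```shell
# Succeeds only if sudo can run without prompting (cached credentials or
# a NOPASSWD rule). Never blocks waiting for a password.
can_sudo_without_prompt() {
  command -v sudo >/dev/null 2>&1 || return 1
  sudo -n true >/dev/null 2>&1
}
```

A background job can check `can_sudo_without_prompt` first and skip or report the privileged step when it returns non-zero, rather than stalling at a prompt nobody can answer.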

4) Two key questions (affects where we look next)

Are you running on OpenAI/Codex right now? (They do not do tool-result pairing repair, so one dropped result “poisons” the transcript until /new.)
Is your exec running in sandbox (docker) or gateway host mode? If you tell me what openclaw status --all says about sandboxing / tool host defaults, I can suggest a clean A/B test (sandbox vs host) to narrow this down quickly.

If you paste the 3 outputs from section (1) + the log tail around a failure, we can usually pinpoint whether this is OOM, browser spawn/crash, docker/sandbox churn, or an actual OpenClaw tool persistence bug (in which case we should file a GitHub issue with those logs).

mild hare Feb 11, 2026, 10:38 AM

#

Here are the answers to your questions and the system status:

1. Are you running on OpenAI/Codex right now?

• Answer: No, I am running on google/gemini-2.5-flash.

2. Is your exec running in sandbox (docker) or gateway host mode?

• Answer: Based on openclaw status --all and openclaw gateway status, it appears to be running in Gateway host mode (bind=loopback, port=18789, systemd service running locally). There's no indication of Docker/sandbox.

Diagnostic Outputs:

// --- 1. openclaw --version ---
2026.2.9

// --- 2. openclaw status --all ---
OpenClaw status --all

Overview
┌─────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Item │ Value │
├─────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Version │ 2026.2.9 │
│ OS │ linux 6.8.0-94-generic (x64) │
│ Node │ 22.22.0 │
│ Config │ ~/.openclaw/openclaw.json │
│ Dashboard │ http://127.0.0.1:18789/ │
│ Tailscale │ off │
│ Channel │ stable (default) │
│ Update │ pnpm · npm latest 2026.2.9 │
│ Gateway │ local · ws://127.0.0.1:18789 (local loopback) · reachable 57ms · auth token │
│ Security │ Run: openclaw security audit --deep │
│ Gateway self │ Note-TOP (192.168.0.247) app unknown linux 6.8.0-94-generic │
│ Gateway service │ systemd installed · enabled · running (pid 10121) │
│ Node service │ systemd not installed │
│ Agents │ 1 total · 1 bootstrapping · 1 active · 22 sessions │
└─────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘

Channels
┌──────────┬─────────┬────────┬───────────────────────────────────────────────────────────────────────────────────────┐
│ Channel │ Enabled │ State │ Detail │
├──────────┼─────────┼────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│ Telegram │ ON │ OK │ token config (sha256:9ba22658 · len 46) · accounts 1/1 │
└──────────┴──────────┴────────┴───────────────────────────────────────────────────────────────────────────────────────┘

Telegram accounts
┌──────────┬──────────┬───────────────────────────────────────────────────────────────────────────────────────────────┐
│ Account │ Status │ Notes │

lyric prairieBOT Feb 11, 2026, 10:41 AM

#

Nice, thanks — version/Node look totally normal (2026.2.9, Node 22.22.0), so this is likely Gateway restarts/crashes mid-tool (which produces the “missing tool result” poison-pill in the transcript).

Two things I need next (they usually pinpoint the root cause in minutes):

1) Paste `openclaw gateway status` (and ideally `--deep`)

openclaw gateway status
openclaw gateway status --deep

gateway status often includes Last gateway error + which service/unit is actually running.

2) Paste the log tail right after one of the “missing tool result” incidents

openclaw logs --limit 400
# or reproduce while watching:
openclaw logs --follow

If it’s a Linux-kernel kill (OOM), it won’t always show in OpenClaw logs, so also:

journalctl -k -b | grep -i -E "oom|killed process|out of memory" || true
# and service restarts:
journalctl --user -u openclaw-gateway.service -n 200 --no-pager || true
journalctl -u openclaw-gateway.service -n 200 --no-pager || true

Quick hypothesis based on your symptom list

Browser tool + “can’t reach browser control service” + gateway restarts: very commonly Snap Chromium on Ubuntu, or browser spawning repeatedly and chewing memory until OOM. If you’re on Ubuntu, can you paste:

which chromium chromium-browser google-chrome-stable 2>/dev/null || true
snap list | grep -i chromium || true

(Docs: https://docs.openclaw.ai/tools/browser-linux-troubleshooting)

Huge outputs (gmail --full --json, etc.): can trigger memory spikes. Workaround is redirect to file and jq it (keeps tool output small). But I still want the logs because it shouldn’t take the gateway down.

One important question

What model/provider are you running most of the time (Anthropic vs OpenAI/Codex)?
Reason: OpenAI/Codex currently does no tool-result pairing repair (by design), so once a tool result goes missing, the session can become “fragile” until /new — even if the underlying crash is fixed.

If you paste openclaw gateway status + the log tail around a failure, I’ll tell you exactly which bucket this is (OOM, browser spawn crash, service restart loop, output blowup, etc.) and the clean fix.

#

Got it — Telegram looks healthy, so this isn’t a “whole gateway is offline” situation. With the pattern you described (tool ran, but later you get “missing tool result”), the most likely root cause is still: Gateway process restart/crash mid-tool or mid-transcript-write (often OOM, uncaught exception, or browser spawn crash).

What I need next are the two views that show restarts + crash reasons:

A) Paste `openclaw gateway status` (+ `--deep`)

openclaw gateway status
openclaw gateway status --deep

Please include the lines that mention things like Last gateway error, Config (service), and any listener/port notes.

B) Paste the systemd logs around the restarts

Depending on whether it’s a user unit or system unit, one of these will show the real crash/exit reason:

# user unit
journalctl --user -u openclaw-gateway.service -n 300 --no-pager

# system unit
journalctl -u openclaw-gateway.service -n 300 --no-pager

C) Quick OOM check (very common with browser + large outputs)

journalctl -k -b | grep -i -E "oom|killed process|out of memory" || true
free -h

D) Which model/provider are you running?

This matters a lot because OpenAI/Codex does no tool-result pairing repair, so one dropped tool result can “poison” a session until /new.

Please paste:

openclaw models status

While you collect that: 2 high-probability culprits to confirm

1) Linux browser instability (Snap Chromium)
If your browser tool is involved and you’re on Ubuntu with snap Chromium, it can cause repeated browser deaths and sometimes cascade into gateway churn.

Paste:

which chromium chromium-browser google-chrome-stable 2>/dev/null || true
snap list | grep -i chromium || true

Docs: https://docs.openclaw.ai/tools/browser-linux-troubleshooting

2) Giant stdout/stderr tools (gmail full json, git clone output bursts, etc.)
Workaround (until we confirm the crash cause): redirect big outputs to files and jq them, and add timeouts to network calls (curl --connect-timeout 5 --max-time 10 ...).

If you send gateway status + journalctl + OOM grep + models status, I can tell you very concretely whether this is:

OOM kill
service restart loop / watchdog
browser spawn crash (snap / sandbox)
or a real OpenClaw bug that deserves a GitHub issue (with the exact log snippet to attach).

mild hare Feb 11, 2026, 11:35 AM

#

Thanks for the precise diagnostic steps! I've gone through all of them and have a much clearer picture.

Conclusion: The primary cause of "missing tool result in session history" and general exec instability appears to be frequent Gateway restarts. The journalctl logs consistently show [gateway] signal SIGUSR1 received; restarting often triggered by [reload] config change detected when the agent attempts to modify configuration files (e.g., during hook installation or other operations that write to OpenClaw's config). This interrupts ongoing exec tool calls, leading to lost results.

Specific findings based on diagnostics:

Model/Provider: google/gemini-2.5-flash.
exec Mode: Running in Gateway host mode (bind=loopback, port=18789, systemd service running locally).
Memory (free -h): Ample available memory (4.2Gi available). No direct OOM kills detected.
Disk (df -h): Plenty of disk space.
OOM Kills (journalctl -k -b | grep -i -E "oom|killed process" || true): (no output) - No kernel-level OOM kills reported.
Browser Tool Context: • which chromium chromium-browser google-chrome-stable returned (no output).
• snap list | grep -i chromium returned (no output).
• The browser tool likely defaults to brave as the executable, but its service becomes unreachable after Gateway restarts (e.g., Can't reach the OpenClaw browser control service).
journalctl --user -u openclaw-gateway.service -n 300 --no-pager (excerpt around restarts):
// --- Snippets showing Gateway restarts ---
fev 10 08:04:15 Note-TOP node[52360]: 2026-02-10T11:04:15.570Z [gateway] signal SIGUSR1 received

Working Hypothesis: The core issue is the Gateway restarting due to config changes (e.g., when default_api.write is used to modify .openclaw/openclaw.json or when hooks are enabled/copied). These restarts abort any active exec tool calls, leading to "missing tool result". The TypeError: fetch failed also suggests underlying network/Node.js issues that could contribute to instability.

Next Steps (based on Discord's advice):

• Mitigation for large outputs: (Already applied to web_search, but relevant if gog gmail thread get --full causes issues)
• Mitigation for hanging curl: Will implement --connect-timeout and --max-time for network calls in cron jobs.
• Browser tool: Confirming Brave is the default executable path /usr/bin/brave-browser. The instability is likely tied to Gateway restarts.
• exec sandbox vs. host mode test: This could be useful, but given the strong evidence of Gateway restarts from config changes, addressing that might be more direct.

What are your thoughts on this conclusion and the plan forward?

lyric prairieBOT Feb 11, 2026, 11:39 AM

#

That diagnosis lines up perfectly with the “missing tool result” pattern: if the Gateway restarts while a tool call is in-flight, you can end up with a persisted tool_use but no matching tool_result (and then the next run chokes on transcript shape).

The good news: you don’t have to “fix exec” — you have to stop the config hot-reload from doing full Gateway restarts during active work.

Fix: change `gateway.reload.mode` so config edits don’t restart the Gateway

In your config reference there’s an explicit knob for this:

gateway.reload.mode: "restart" = restart on any config change (consistent with the behavior in your logs)
gateway.reload.mode: "hybrid" = only restart when needed (default)
gateway.reload.mode: "hot" = apply hot-safe changes, never full-restart automatically
gateway.reload.mode: "off" = no watching/reload at all

Docs: https://docs.openclaw.ai/gateway/configuration (search “gateway.reload (Config hot reload)”)

Recommended setting for stability: `hot`

This keeps things like channels/browser/cron able to restart internally, but prevents the “SIGUSR1 → gateway restart → tool results lost” cascade.

openclaw config get gateway.reload
openclaw config set gateway.reload.mode "hot"
openclaw gateway restart

If you want maximum stability (no reloads at all), use "off" instead:

openclaw config set gateway.reload.mode "off"
openclaw gateway restart

Operational workaround (important)

Even with hot/off, commands like openclaw hooks enable still require a restart to take effect (docs say so). So do config-changing operations when idle, then do one manual restart after:

run config/hook/plugin changes
then openclaw gateway restart
then resume “real work” (exec/browser/etc.)
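The "batch, then restart once" sequence above can be sketched as a small script. It uses only the `openclaw config set` and `openclaw gateway restart` commands already mentioned in this thread; the `apply_config_batch` wrapper and its `DRY_RUN` switch are hypothetical. It defaults to dry-run, so nothing is executed unless you opt in with DRY_RUN=0.

```shell
# Apply several "key value" config changes, then restart once at the end.
# DRY_RUN defaults to 1, so the commands are only printed unless DRY_RUN=0.
apply_config_batch() {
  run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
  }
  while [ "$#" -ge 2 ]; do
    run openclaw config set "$1" "$2" || return 1
    shift 2
  done
  run openclaw gateway restart
}
```

Calling `apply_config_batch gateway.reload.mode hot` prints the two commands it would run. Run it for real only while no exec or browser work is in flight.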

Repairing the “poisoned” sessions

Once you have a session with missing tool results, the quickest fix is just /new in that chat (start a fresh session). That avoids hitting the broken transcript again.

If you want to turn this into a bug report (worth it)

This is a legit footgun: config reload causing a restart mid-tool shouldn’t leave the transcript in a broken state. If you file a GitHub issue, include:

the SIGUSR1 + “[reload] config change detected” log snippet
your gateway.reload setting
“repro: start long exec, edit openclaw.json, observe missing tool result”

(You’ll probably get traction on “Gateway should synthesize/close tool results on restart” or “don’t persist tool_use until tool_result is persisted”.)

If you paste your current gateway.reload block (or openclaw config get gateway.reload output), I can tell you which mode you’re on right now and what I’d set given your workflow.

#

Your conclusion is solid, and it explains all the weirdness in a single mechanism:

Gateway restarts mid-run ⇒ in-flight tool calls get aborted ⇒ the session transcript can end up with a tool_use without tool_result ⇒ you see “missing tool result” and everything downstream gets flaky (browser refs invalid, process sessions lost, etc.).

Given your logs show SIGUSR1 + [reload] config change detected, I’d treat config hot-reload behavior as the primary root cause, and everything else (browser “unreachable”, “unknown ref”, intermittent exec hangs) as secondary symptoms.
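To check whether a later change actually reduces these restarts, you can count the SIGUSR1 lines in the journal before and after. A minimal sketch, matching the exact log text quoted earlier in this thread; the function name is made up.

```shell
# Count Gateway restarts triggered by SIGUSR1 in journal text on stdin.
# Prints 0 (and still exits successfully) when there are none.
count_sigusr1_restarts() {
  grep -c -F "[gateway] signal SIGUSR1 received" || true
}
```

For example: `journalctl --user -u openclaw-gateway.service -n 300 --no-pager | count_sigusr1_restarts`. If the count stops growing during active work, the reload-triggered restarts are under control.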

Plan forward (what I’d do)

1) Stop full Gateway restarts on config writes

Check what your reload mode is set to, then change it:

openclaw config get gateway.reload

Recommended for stability: set gateway.reload.mode to "hot" (or "off" if you prefer fully manual restarts).

openclaw config set gateway.reload.mode "hot"
openclaw gateway restart

"hot" = applies hot-safe changes; logs if something would require restart, but does not restart the whole gateway automatically.
"off" = no file watching; you restart manually when you’re ready.

Docs (this is the exact section): https://docs.openclaw.ai/gateway/configuration (search “gateway.reload”)

2) Avoid editing `~/.openclaw/openclaw.json` via generic file writes during active work

If an agent is doing write/default_api.write directly to the config file, that’s basically guaranteed to trip reload watchers and can even momentarily leave the file in a half-written state.

Safer patterns:

Do config edits via openclaw config set ... (ideally as the human operator).
Batch changes, then do one deliberate openclaw gateway restart at the end (when idle).
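If a direct file write to a watched file is truly unavoidable, the usual way to avoid the half-written window is write-to-temp-then-rename. The sketch below is a generic pattern, not an OpenClaw feature, and `atomic_write` is a hypothetical helper. It only prevents readers from seeing a partial file; the rename is still a change that a file watcher will notice.

```shell
# Replace a file atomically: write the new content to a temp file in the
# same directory, then rename it over the target in one step.
atomic_write() {
  target="$1"
  tmp=$(mktemp "${target}.tmp.XXXXXX") || return 1
  if cat > "$tmp"; then
    mv -f "$tmp" "$target"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage is `some_generator | atomic_write path/to/file.json`. Keeping the temp file in the same directory matters because a rename is only atomic within one filesystem.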

3) Keep your “cron/network hygiene” mitigations (they’re still good)

curl --connect-timeout ... --max-time ... is absolutely the right move for cron reliability.
Redirecting huge outputs to files is still worth doing (it reduces tool-result bloat and avoids edge-case truncation issues), but it won’t fix the “missing tool result” if the gateway restarts mid-command.

4) Browser tool expectations

Even with a perfectly stable gateway:

Any restart (gateway or browser service) will invalidate refs ⇒ “Unknown ref” is expected until you re-snapshot.
If the gateway is restarting due to config writes, browser will look “unreachable” because the browser control service gets torn down and restarted.

Once reload restarts are tamed, browser stability usually improves dramatically.

About `TypeError: fetch failed`

I wouldn’t treat that as a separate “Node/network instability” yet. In OpenClaw, fetch failed often shows up when:

the process is restarting mid-request, or
a local service (browser control, hooks endpoint, etc.) disappears during a call.

After you set gateway.reload.mode="hot"/"off", if you still see TypeError: fetch failed, then it’s time to isolate:
