🎫 Context Overflow / Session Stability — Ongoing Issue
Environment:
OpenClaw 2026.3.7 (42a1394) on macOS 26.3.1 (arm64, Mac mini M4)
Node v25.6.1
Model: claude-opus-4-6 (150K context) for DMs
25 Discord channels on Sonnet, DMs on Opus
Gateway via LaunchAgent (auto-restart on crash)
The Problem:
During heavy work sessions (building iOS apps, reading logs, running sub-agents), the main DM session hits its context limit and triggers compaction. Compaction has run 7 times in tonight's session alone. When it gets bad enough, the gateway appears to restart entirely, which kills the session, loses all logs, and leaves the user with no response at all.
Timeline of tonight's incident:
Session was active with heavy tool use (xcodebuild output, file reads, grep results, sub-agent spawns)
At ~11:05 PM PDT, the agent froze mid-response
Gateway restarted at 11:43 PM, roughly 38 minutes after the freeze (PID 47474 → 2570)
Previous log file was completely lost: no rotation, no .old file, so no post-mortem is possible
After restart, a config validation error repeated every ~2 seconds until the config was auto-rewritten
What we've tried / current mitigations:
sessionRetention: "2h" (on cron config) — prunes old sessions
historyLimit: 20 was set previously (may have been lost in config rewrites)
channelHealthCheckMinutes: 0 — disabled health checks to reduce overhead
listenerTimeout: 120000 — increased listener timeout
All 25 Discord channels set to Sonnet (cheaper, smaller transcripts)
Only DMs use Opus
All 5 cron jobs on claude-sonnet-4-20250514 with lightContext: true
Manual session resets when channels hit ~80-100K tokens
Sub-agents for heavy coding work (but tool results still land in parent session)
Why it keeps happening:
Even with historyLimit: 20, individual tool call results can be massive. A single xcodebuild archive output is thousands of tokens. A grep across a codebase can be hundreds of lines. 20 messages × heavy tool output = 100K+ tokens fast. The compaction summary itself is also large (preserving build UUIDs, file paths, exact identifiers needed for continuity).
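For a rough sense of scale, the arithmetic above can be sketched as follows. The 5,000-token average per message is an illustrative assumption, not a measurement from the session. The 150K limit and the 20-message history come from the environment and mitigation notes. transcriptTokens is a hypothetical helper name.

```typescript
// Back-of-the-envelope transcript estimate.
// ASSUMPTION: the average tokens per message is made up for illustration.
const CONTEXT_LIMIT = 150_000; // Opus context size from the environment notes
const HISTORY_LIMIT = 20;      // historyLimit: 20

// Hypothetical helper: total tokens held by the retained message history.
function transcriptTokens(
  avgTokensPerMessage: number,
  messages: number = HISTORY_LIMIT,
): number {
  return avgTokensPerMessage * messages;
}

const total = transcriptTokens(5_000);
const share = total / CONTEXT_LIMIT;
console.log(`${total} tokens, ${(share * 100).toFixed(0)}% of context`);
// prints "100000 tokens, 67% of context"
```

So an average of only 5K tokens per retained message already fills two thirds of the window, before the compaction summary is counted.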
What would help (feature requests):
Log rotation on gateway restart — When the gateway restarts (crash or launchd respawn), the old log file is overwritten. There's no .old or rotated copy. Makes post-mortem impossible. Even keeping the last 1-2 log files would help.
Tool result auto-truncation — Cap tool call results at a configurable max (e.g., 2000 tokens) before they enter the transcript. Build logs, grep output, and file reads are the biggest offenders. The agent can always re-read if it needs more.
Graceful compaction without gateway restart — Currently when compaction can't keep up, the gateway seems to crash/restart. If the compaction could happen in-place (summarize + truncate transcript without killing the process), it would prevent the user-facing freeze.
Config validation: strip unknown keys silently — Tonight an unrecognized "label" key on a channel config caused error spam every ~2 seconds. Could strip unknown keys with a single warning instead of repeating.
Configurable compaction aggressiveness — Option to discard tool results entirely during compaction (keep only the summary/final answer), or to set a target post-compaction size.
Transcript size visibility — A way for the agent to check current transcript size (tokens used / remaining) so it can proactively manage context before hitting limits.
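To make the tool result auto-truncation request concrete, here is a minimal sketch of what such a cap might look like. Nothing here is an existing OpenClaw API: truncateToolResult, estimateTokens, and the tailFraction parameter are hypothetical names, and the 4-characters-per-token figure is a rough heuristic standing in for a real tokenizer. It keeps both the head and the tail of the output, on the assumption that build logs tend to put the failure near the end.

```typescript
// Hypothetical helper, not part of OpenClaw.
// ASSUMPTION: ~4 characters per token; a real tokenizer would be more accurate.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Caps a tool result at roughly maxTokens, keeping the start and the end
// and replacing the middle with a short marker.
function truncateToolResult(
  text: string,
  maxTokens: number = 2000,
  tailFraction: number = 0.5,
): string {
  if (estimateTokens(text) <= maxTokens) {
    return text; // already within budget, pass through unchanged
  }
  const budgetChars = maxTokens * CHARS_PER_TOKEN;
  const tailChars = Math.floor(budgetChars * tailFraction);
  const headChars = budgetChars - tailChars;
  const omitted = text.length - budgetChars;
  // The marker adds a few dozen characters on top of the budget.
  const marker = `\n[... ${omitted} characters omitted; re-run the tool for full output ...]\n`;
  return text.slice(0, headChars) + marker + text.slice(text.length - tailChars);
}

// Usage: a simulated 100,000-character build log capped at ~2000 tokens.
const buildLog = "x".repeat(100_000);
const capped = truncateToolResult(buildLog);
console.log(capped.length < buildLog.length); // prints "true"
```

The marker tells the agent that output was dropped, so it can re-read the file or re-run the command when it actually needs the middle section.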
Current session stats (post-compaction):
Context: 92k/150k (62%) — already at 62% right after compaction
Compactions: 7 (in one session tonight)
The 92K is mostly the compaction summary carrying forward exact identifiers, build UUIDs, file paths, and project state needed for continuity. Each compaction preserves this, then new work piles on top, triggering the next compaction in ~20-30 minutes of active use.
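The cycle described above implies a rough refill rate, sketched below. The 92K and 150K figures and the 20-30 minute interval come from the stats above. The trigger point is an assumption: the sketch treats compaction as firing at 100% of the window, and the real threshold may be lower, which would reduce these numbers. refillRate and triggerFraction are hypothetical names.

```typescript
// Rough refill-rate estimate between compactions.
// ASSUMPTION: compaction triggers at triggerFraction of the context window;
// the actual trigger point is not known from the ticket.
const CONTEXT_LIMIT = 150_000;  // from the session stats
const POST_COMPACTION = 92_000; // context size right after compaction

// Hypothetical helper: tokens added per minute of active use.
function refillRate(
  minutesToNextCompaction: number,
  triggerFraction: number = 1.0,
): number {
  const headroom = CONTEXT_LIMIT * triggerFraction - POST_COMPACTION;
  return headroom / minutesToNextCompaction;
}

console.log(Math.round(refillRate(20))); // prints 2900
console.log(Math.round(refillRate(30))); // prints 1933
```

Under that assumption, active use adds about 2-3K tokens per minute, so each extra 10K trimmed from the compaction summary would buy only a few more minutes before the next compaction.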