#Incorrect surfaced error, terminated appears to mask actual context overflow in persistent Discord s


waxen sierra
#

Hi, I’m reporting what looks like an error-classification / observability issue in OpenClaw.

In our persistent Discord channel session, the error often surfaced to us is:

terminated

But in the logs, the actual root error is:

Context overflow: estimated context size exceeds safe threshold during tool loop.

So from an operator perspective, terminated looks like the primary failure, when it is actually a downstream symptom after context overflow and compaction/retry activity.

Environment
OpenClaw: 2026.4.14
Model: openai-codex/gpt-5.4
Agent: editorial-acp
Surface: persistent Discord channel session
Session key: agent:editorial-acp:discord:channel:1483839578682032390

Observed pattern
Session grows over time in a long-lived, tool-heavy Discord channel.
A turn hits: Context overflow: estimated context size exceeds safe threshold during tool loop.
OpenClaw attempts auto-compaction/retry.
The run later surfaces as: terminated

Concrete evidence
From /tmp/openclaw/openclaw-2026-04-14.log:
At 2026-04-17T19:10:51.490Z:
[context-overflow-diag] sessionKey=agent:editorial-acp:discord:channel:1483839578682032390 ... error=Context overflow: estimated context size exceeds safe threshold during tool loop.

Immediately after:
context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.4

Then at 2026-04-17T19:11:26.797Z:
Full summarization failed: Summarization failed: terminated

Then later:
embedded_run_agent_end ... isError:true error:"terminated"

I observed the same pattern earlier the same day as well, including around 2026-04-17T18:58:08.266Z and 2026-04-17T11:22:50.328Z.
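
As a rough illustration of the ordering I mean, here is a small sketch that walks a run's log lines and compares the first overflow diagnostic with the error on the final agent-end record. The marker strings come from the excerpts above; the function name, the return shape, and the "context_overflow" label are my own assumptions, not anything OpenClaw exposes.

```typescript
// Hypothetical helper: the names and return shape are illustrative only.
type RunDiagnosis = { surfaced: string | null; rootCause: string | null };

function diagnoseRun(lines: string[]): RunDiagnosis {
  let rootCause: string | null = null;
  let surfaced: string | null = null;
  for (const line of lines) {
    // Treat the first overflow diagnostic in the run as the root cause.
    if (rootCause === null && line.includes("[context-overflow-diag]")) {
      rootCause = "context_overflow";
    }
    // The agent-end record carries the error text the operator ends up seeing.
    const m = line.match(/embedded_run_agent_end .*error:"([^"]*)"/);
    if (m) surfaced = m[1] ?? null;
  }
  return { surfaced, rootCause };
}

// Log lines copied from the excerpts above (elisions kept as-is).
const sample = [
  "[context-overflow-diag] sessionKey=... error=Context overflow: estimated context size exceeds safe threshold during tool loop.",
  "context overflow detected (attempt 1/3); attempting auto-compaction for openai-codex/gpt-5.4",
  "Full summarization failed: Summarization failed: terminated",
  'embedded_run_agent_end ... isError:true error:"terminated"',
];
const result = diagnoseRun(sample);
console.log(result);
```

For the sequence above, the surfaced error and the root cause disagree, which is the mismatch I am reporting.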

Additional correlated signal
I also see repeated Active Memory timeouts around these failing turns, e.g.:
active-memory ... done status=timeout elapsedMs=14120 summaryChars=0

This may not be the root cause, but it appears to add latency/noise around already fragile turns.

Relevant config context
contextTokens: 200000
compaction.mode: "safeguard"
editorial-acp.memorySearch.enabled: true
sources: ["memory", "sessions"]
experimental.sessionMemory: true
active-memory enabled for editorial-acp
active-memory.allowedChatTypes: ["channel", "direct"]
active-memory.timeoutMs: 9000

This is a long-lived shared channel session with many heavy read / exec tool outputs. I do not currently see contextPruning configured.

Expected behavior
If a context-overflow-diag occurs before terminated, I would expect OpenClaw to surface context overflow as the primary/root error, with terminated only as a secondary compaction/retry consequence.

Main ask
Could you please check:
whether terminated is incorrectly masking the real primary error,
whether the compaction/retry path should preserve context overflow as the surfaced root cause,
whether Active Memory on persistent channel sessions is expected to worsen this path,
whether contextPruning should be recommended or auto-enabled more aggressively for this kind of session.

#

@humble sphinx help

humble sphinxBOT
#

Yep, digging now. This looks like a real masking problem, not just operator confusion.

#

Yeah, I think your read is correct.

I checked the docs + code, and this looks like a real masking/observability bug in 2026.4.14, not just bad wording.

What I found:

  • The preemptive overflow you saw is a real first-class path:
    • src/agents/pi-embedded-runner/tool-result-context-guard.ts
    • src/agents/pi-embedded-runner/run.ts
  • In that run-loop code, overflow is explicitly supposed to stay classified as context_overflow (or compaction_failure if compaction itself fails).
  • But the lifecycle emitter at src/agents/pi-embedded-subscribe.handlers.lifecycle.ts reports the final assistant error text from the last failed step.
  • So if compaction/summarization dies later with bare terminated, the user-facing lifecycle/error surface can end up showing only terminated, even though the earlier root trigger was overflow.
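
To make the difference concrete, here is a minimal sketch of "report the last failure" versus "preserve the root cause". Everything in it (the StepError shape, the function names, the message format) is hypothetical and written for illustration; it is not the actual lifecycle emitter code.

```typescript
// Illustrative only: these types and names are assumptions, not OpenClaw's API.
type ErrorKind = "context_overflow" | "compaction_failure" | "other";

interface StepError {
  kind: ErrorKind;
  message: string;
}

// Masking behavior: surface whatever failed last.
function lastErrorText(errors: StepError[]): string | undefined {
  return errors[errors.length - 1]?.message;
}

// Root-preserving behavior: lead with the first overflow in the run and
// keep the final failure as secondary detail.
function surfacedErrorText(errors: StepError[]): string | undefined {
  const last = errors[errors.length - 1];
  if (last === undefined) return undefined;
  const root = errors.find((e) => e.kind === "context_overflow");
  if (root === undefined || root === last) return last.message;
  return `${root.message} (then: ${last.message})`;
}

const overflowMessage =
  "Context overflow: estimated context size exceeds safe threshold during tool loop.";
const errors: StepError[] = [
  { kind: "context_overflow", message: overflowMessage },
  { kind: "compaction_failure", message: "terminated" },
];

console.log(lastErrorText(errors));
console.log(surfacedErrorText(errors));
```

The first call shows only the downstream failure, while the second keeps the overflow in front and demotes terminated to a follow-on detail.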

So to your asks:

  1. Is terminated incorrectly masking the primary error?
    Probably yes. At least in this path, it looks like the later compaction failure is overwriting the more useful root cause.

  2. Should compaction/retry preserve context overflow as the surfaced root cause?
    Yes, I think so. Especially when a [context-overflow-diag] happened first in the same run.

  3. Is Active Memory worsening this?
    Likely latency/noise, yes. The docs describe it as a blocking sub-agent, and your summaryChars=0 timeout logs suggest it timed out without contributing any recall. So it is probably not the root cause here, but it is making fragile turns slower and noisier. For long-lived channel sessions I would be cautious with it.

  4. Should contextPruning be recommended here?
    Yes. Strongly, for this exact session shape. It trims old tool results from in-memory context before each LLM call. It is documented here:
    https://docs.openclaw.ai/concepts/session-pruning
    https://docs.openclaw.ai/gateway/configuration-reference#agentsdefaultscontextpruning
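
Conceptually, TTL-based pruning of tool results looks something like the sketch below. This is only a model of the idea: the message shape, the placeholder text, and the exact rule are assumptions on my part, and the real "cache-ttl" mode may differ in what it trims and when.

```typescript
// Hypothetical message shape and pruning rule, for illustration only.
interface ContextMessage {
  role: "user" | "assistant" | "tool";
  content: string;
  timestampMs: number;
}

// Replace tool results older than the TTL with a short placeholder.
// Returns a new array; the stored session messages are left untouched.
function pruneOldToolResults(
  messages: ContextMessage[],
  nowMs: number,
  ttlMs: number,
): ContextMessage[] {
  return messages.map((msg) =>
    msg.role === "tool" && nowMs - msg.timestampMs > ttlMs
      ? { ...msg, content: "[tool result pruned]" }
      : msg,
  );
}

const HOUR_MS = 60 * 60 * 1000;
const nowMs = 10 * HOUR_MS;
const sessionMessages: ContextMessage[] = [
  { role: "tool", content: "x".repeat(50000), timestampMs: nowMs - 3 * HOUR_MS },
  { role: "user", content: "what changed?", timestampMs: nowMs - 2 * HOUR_MS },
  { role: "tool", content: "recent output", timestampMs: nowMs - 5 * 60 * 1000 },
];

const pruned = pruneOldToolResults(sessionMessages, nowMs, HOUR_MS);
const totalChars = (msgs: ContextMessage[]): number =>
  msgs.reduce((n, m) => n + m.content.length, 0);
console.log(totalChars(sessionMessages), "->", totalChars(pruned));
```

The point is that a single stale read / exec output can dominate the context, and dropping it before the next LLM call keeps the session further from the overflow threshold.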

For your setup, I’d recommend at least:

{
  agents: {
    defaults: {
      contextPruning: {
        mode: "cache-ttl",
        ttl: "1h",
      },
    },
  },
}

Also worth considering:

#

Net: your diagnosis seems right. The system appears to be surfacing a downstream terminated from compaction/summarization instead of preserving the earlier overflow as the primary operator-visible failure.

This is GitHub-issue-worthy. If you file one, I’d include exactly the timestamps/log sequence you posted plus the note that embedded_run_agent_end ends up reporting the final assistant error instead of the earlier [context-overflow-diag] root cause: https://github.com/openclaw/openclaw/issues

waxen sierra
#

OK, thanks. Also, why does my OpenClaw bot in Discord sometimes stop showing the "typing" indicator while it's still working? I can't really tell what's going on.

humble sphinxBOT
#

Let me check the Discord typing path, because that’s usually either a channel-client timeout or a gap between streaming/working phases.

#

Yeah, that can happen, and I found a concrete reason.

For Discord, typing is best-effort, not a guaranteed “still alive” signal. In your case there are two likely pieces:

  • Start behavior is controlled by typingMode
  • More importantly: the runtime currently has an internal typing TTL of 2 minutes
    • src/auto-reply/reply/typing.ts
    • if a run stays busy for a long time in tool work / compaction / silent housekeeping without fresh visible deltas, typing can auto-stop even though the run is still going

That matches what you’re seeing pretty well: long tool-heavy / compaction-heavy turns keep working, but the Discord typing bubble disappears.
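
Here is a simplified model of that TTL behavior, just to show why the bubble can vanish mid-run. The names and the "visible delta refreshes the timer" rule are my assumptions for illustration; the real logic lives in src/auto-reply/reply/typing.ts and may differ.

```typescript
// Simplified model with made-up names, not the actual runtime code.
const TYPING_TTL_MS = 2 * 60 * 1000; // the 2-minute TTL mentioned above

interface TypingState {
  lastVisibleMs: number; // time of the last visible activity
  ttlMs: number;
}

function startTyping(nowMs: number, ttlMs: number = TYPING_TTL_MS): TypingState {
  return { lastVisibleMs: nowMs, ttlMs };
}

// Assumption: a fresh visible delta refreshes the TTL window.
function noteVisibleDelta(state: TypingState, nowMs: number): TypingState {
  return { ...state, lastVisibleMs: nowMs };
}

// Typing is shown only while the last visible activity is within the TTL.
function isTypingShown(state: TypingState, nowMs: number): boolean {
  return nowMs - state.lastVisibleMs < state.ttlMs;
}

// Silent run: tool work and compaction produce no visible deltas.
const silentRun = startTyping(0);
console.log(isTypingShown(silentRun, 90_000)); // still inside the TTL window
console.log(isTypingShown(silentRun, 150_000)); // past the TTL, run still busy

// Same run, but a visible delta arrives at 100 seconds.
const chattyRun = noteVisibleDelta(startTyping(0), 100_000);
console.log(isTypingShown(chattyRun, 150_000));
```

In this model the silent run loses its typing indicator after two minutes even though nothing has failed, while the run that emits a visible delta keeps it.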

Also relevant here:

  • silent NO_REPLY housekeeping suppresses typing/draft streaming
  • compaction can be silent unless you enable a notice
#

So if you want better UX, I’d try:

{
  agents: {
    defaults: {
      typingMode: "instant",
      typingIntervalSeconds: 6,
      compaction: {
        notifyUser: true,
      },
    },
  },
}

That gives you: