#active-memory embedded runs ignoring AbortController after 4.15 upgrade


merry thorn

Hi Krill — active-memory embedded runs ignoring AbortController after 4.15 upgrade
Upgraded from 4.14 → 4.15 today. After the upgrade, active-memory embedded runs are hanging well past timeoutMs before failing.
Config:
"active-memory": {
  "enabled": true,
  "config": {
    "agents": ["main"],
    "queryMode": "message",
    "promptStyle": "contextual",
    "timeoutMs": 3000
  }
}
Log evidence:
active-memory: ...start timeoutMs=3000 queryChars=507
embedded run failover decision: reason=timeout from=openai-codex/gpt-5.4-mini
active-memory: ...done status=timeout elapsedMs=20623 summaryChars=0
timeoutMs=3000 but elapsedMs=20623 — the AbortController fires at 3000ms but the embedded run keeps running for ~20s before eventually failing. Source code (index.ts lines 1816–1856) shows the abort should short-circuit immediately.
Model: openai-codex/gpt-5.4-mini (ChatGPT Plus OAuth)
Question: Is the 4.15 change moving recall from system prompt injection → prependContext (CHANGELOG line 196, #66144) related to this? The abort signal doesn't appear to be terminating the Codex embedded run cleanly. Is this a known issue with Codex-backed embedded runs post-4.15, and is there a workaround?

raven flintBOT

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

terse notchBOT

I dug through the source, and I don’t think #66144 (system prompt injection → prependContext) is the cause.

Why I’m pretty confident about that:

  • extensions/active-memory/index.ts:2003-2023 only returns { prependContext: promptPrefix } after maybeResolveActiveRecall() finishes.
  • Your timeout is happening inside that recall run, earlier, at extensions/active-memory/index.ts:1761-1828.

What looks more likely is this:

  1. active-memory arms its own AbortController at extensions/active-memory/index.ts:1764-1768.
  2. runRecallSubagent() passes that abortSignal into runEmbeddedPiAgent() at extensions/active-memory/index.ts:1664-1674.
  3. In src/agents/pi-embedded-runner/run/attempt.ts:1531-1539, aborting does void activeSession.abort(), so it is fire-and-forget.
  4. In the Codex path, extensions/codex/src/app-server/run-attempt.ts:195-226, abort turns into a best-effort turn/interrupt, not a hard synchronous stop.
  5. After that, cleanup still runs, and cleanupEmbeddedAttemptResources() waits for idle before flushing. That path goes through:
    • src/agents/pi-embedded-runner/run/attempt.subscription-cleanup.ts:39
    • src/agents/pi-embedded-runner/wait-for-idle-before-flush.ts:10
    • default idle wait is 30_000ms
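
A rough TypeScript sketch of the sequence above, with scaled-down timings. `runWithAdvisoryAbort`, `sleep`, and the three duration parameters are hypothetical stand-ins for illustration, not names or code from the OpenClaw source:

```typescript
// Hypothetical model of an advisory abort: the timer requests the abort,
// but the caller only regains control after cleanup has waited for idle.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

interface RunOutcome {
  status: "ok" | "timeout";
  abortRequestedAtMs: number | null;
  elapsedMs: number;
}

async function runWithAdvisoryAbort(
  timeoutMs: number, // when the abort is requested
  workMs: number, // how long the run would take if left alone
  unwindMs: number, // how long the session needs to go idle after an abort
): Promise<RunOutcome> {
  const startedAt = Date.now();
  const controller = new AbortController();
  const state = {
    abortRequestedAtMs: null as number | null,
    idle: Promise.resolve(),
  };

  controller.signal.addEventListener("abort", () => {
    state.abortRequestedAtMs = Date.now() - startedAt;
    // Fire-and-forget: the abort handler does not await the unwind.
    state.idle = sleep(unwindMs);
  });

  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // The "run": ends when the work finishes or the abort is observed.
    await new Promise<void>((resolve) => {
      const work = setTimeout(resolve, workMs);
      controller.signal.addEventListener("abort", () => {
        clearTimeout(work);
        resolve();
      });
    });
  } finally {
    clearTimeout(timer);
    // Cleanup waits for the session to become idle before returning.
    await state.idle;
  }

  return {
    status: controller.signal.aborted ? "timeout" : "ok",
    abortRequestedAtMs: state.abortRequestedAtMs,
    elapsedMs: Date.now() - startedAt,
  };
}
```

Calling `runWithAdvisoryAbort(3000, 60000, 17000)` would model the first log: the abort is requested near 3s, but the promise only settles after the extra unwind time.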

So the important distinction is:

  • timeoutMs=3000 is when abort gets requested
  • it is not a guaranteed “return to active-memory caller by 3000ms”

That makes your elapsedMs=20623 sadly believable on a Codex-backed embedded run. The abort fires at 3s, but the run can still spend ~17s unwinding / waiting to become idle before runRecallSubagent() actually returns and active-memory logs done status=timeout.

Short answer

  • Related to prependContext? Probably not
  • Known Codex-backed embedded-run cancellation limitation? Looks like yes, functionally
  • Why 20s? Abort is advisory/best-effort downstream, and cleanup waits for idle
  • Best workaround: make Active Memory use a non-Codex recall model via config.model or config.modelFallback, so recall avoids the Codex app-server path entirely

Docs for that config are here:


Practical workarounds

  1. Best: set plugins.entries.active-memory.config.model to a non-openai-codex/... model
    or at least set config.modelFallback so recall has a non-Codex escape hatch.
  2. If staying on Codex: raise timeoutMs above 3000
    because 3000 is likely to hit this cancellation path constantly.
  3. If latency matters more than recall: temporarily disable Active Memory.
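
As a sketch of workaround 1, here is the shape of the override written as a TypeScript object. The `model` and `timeoutMs` keys and the other config fields come from this thread; the `ActiveMemoryConfig` type, the `withRecallModel` helper, and the `example-provider/example-model` id are placeholders, and the real schema may differ:

```typescript
// Hypothetical type mirroring only the config keys quoted in this thread.
interface ActiveMemoryConfig {
  agents: string[];
  queryMode: string;
  promptStyle: string;
  timeoutMs: number;
  model?: string;
}

// Returns a copy of the config with the recall model overridden.
function withRecallModel(
  config: ActiveMemoryConfig,
  model: string,
): ActiveMemoryConfig {
  return { ...config, model };
}

const current: ActiveMemoryConfig = {
  agents: ["main"],
  queryMode: "message",
  promptStyle: "contextual",
  timeoutMs: 3000,
};

// Placeholder model id; substitute whichever non-Codex model you use.
const patched = withRecallModel(current, "example-provider/example-model");
```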

If you can, paste the exact output of openclaw --version. If this reproduces on the latest patch too, I’d file it as a bug with these code pointers and your three log lines:

  • extensions/active-memory/index.ts
  • src/agents/pi-embedded-runner/run/attempt.ts
  • extensions/codex/src/app-server/run-attempt.ts

If you want, I can help turn this into a tight GitHub issue repro.

merry thorn

Follow-up: active-memory timeout still ~15s after switching from Codex to OpenRouter

Tried your suggested workaround — switched active-memory config to a non-Codex model:

"model": "openrouter/arcee-ai/trinity-large-preview:free"

Same result:

active-memory: ...start timeoutMs=3000
embedded run failover decision: reason=timeout from=openrouter/arcee-ai/trinity-large-preview:free
active-memory: ...done status=timeout elapsedMs=15562 summaryChars=0

So it's not Codex-specific. The abort fires at 3s but the embedded run still takes ~15s to return.

Based on your earlier analysis, wait-for-idle-before-flush in src/agents/pi-embedded-runner/wait-for-idle-before-flush.ts has a default idle wait of 30,000ms — but we're seeing ~15s, not 30s. Two questions:

  1. Is there a config to reduce or skip the idle wait specifically for active-memory embedded runs?
  2. Is elapsedMs measuring from abort-requested to cleanup-complete, or something else? Trying to understand why we're seeing ~15s consistently instead of the full 30s.

OpenClaw v2026.4.15, Docker on Windows 11.


terse notchBOT

Nice catch, that changes the diagnosis a lot.

This does not look Codex-specific anymore. With the OpenRouter repro, I’d treat it as a generic embedded-run timeout/teardown issue in runEmbeddedPiAgent, not a prependContext issue and not a Codex-only issue.

1) Is there a config to reduce/skip the idle wait for active-memory?

I don’t see one in v2026.4.15.

What I found:

  • flushPendingToolResultsAfterIdle() does accept an internal timeoutMs param in src/agents/pi-embedded-runner/wait-for-idle-before-flush.ts
  • but I do not see any user-facing config in:
    • extensions/active-memory/index.ts
    • docs/concepts/active-memory.md
    • the embedded runner wiring in src/agents/pi-embedded-runner/run/attempt.ts

So in current source, Active Memory can set timeoutMs for the recall run itself, but not for the post-abort idle-wait cleanup.


2) What does elapsedMs actually measure?

It is total wall-clock from recall start until runRecallSubagent() finally returns/throws.

Specifically:

  • extensions/active-memory/index.ts:1720 does const startedAt = Date.now()
  • then it awaits runRecallSubagent(...)
  • then on success/timeout it logs elapsedMs: Date.now() - startedAt

So yes, elapsedMs includes everything:

  • recall setup
  • embedded run execution
  • abort request
  • downstream unwind/cleanup
  • whatever wait happens before runEmbeddedPiAgent() finally gives control back

It is not “time until AbortController fired”.
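
The measurement pattern described above can be sketched like this. `timeRecall` and the injectable `now` clock are hypothetical, and mapping every rejection to `timeout` is a simplification for the example, not how the plugin classifies errors:

```typescript
type Clock = () => number;

interface RecallTiming {
  status: "ok" | "timeout";
  elapsedMs: number;
}

// Wall-clock timing around an awaited run: the clock starts before the
// run and stops only when the awaited promise settles, so any post-abort
// unwind inside `run` is counted too.
async function timeRecall(
  run: () => Promise<unknown>,
  now: Clock = Date.now,
): Promise<RecallTiming> {
  const startedAt = now();
  try {
    await run();
    return { status: "ok", elapsedMs: now() - startedAt };
  } catch {
    // Simplification: treat any rejection as a timeout.
    return { status: "timeout", elapsedMs: now() - startedAt };
  }
}
```

Because the second clock read happens after the `await`, nothing in this pattern can report 3000 unless the run itself returns at 3000.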

Why ~15s instead of the full 30s?

Because 30_000 in src/agents/pi-embedded-runner/wait-for-idle-before-flush.ts:10 is just the upper bound.

flushPendingToolResultsAfterIdle() returns as soon as agent.waitForIdle() resolves. So seeing ~15.5s means the runner is becoming “idle enough” around then, before the 30s cap.
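
A minimal sketch of an "upper bound" wait, assuming the helper races an idle promise against a cap. `waitForIdleWithCap` is a hypothetical name and this is not the actual body of `flushPendingToolResultsAfterIdle()`:

```typescript
// Resolves with "idle" as soon as waitForIdle() settles, or with "cap"
// once capMs has passed, whichever happens first.
async function waitForIdleWithCap(
  waitForIdle: () => Promise<void>,
  capMs: number,
): Promise<"idle" | "cap"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const cap = new Promise<"cap">((resolve) => {
    timer = setTimeout(() => resolve("cap"), capMs);
  });
  try {
    return await Promise.race([
      waitForIdle().then(() => "idle" as const),
      cap,
    ]);
  } finally {
    // Do not leave the cap timer pending after an early idle.
    clearTimeout(timer);
  }
}
```

With a 30_000 cap, a session that goes idle after about 12s of unwinding returns `"idle"` at about 12s; only a session that never goes idle would hit the full 30s.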

So the shape is:

  • abort requested at ~3s
  • embedded run does not unwind immediately
  • some deeper waiter settles around ~15s
  • cleanup finishes
  • Active Memory logs timeout

I do see other 15_000 defaults elsewhere in the runtime, for example src/auto-reply/reply/reply-run-registry.ts:403, but I have not proven that exact one is the waiter on this path. I’d call that a plausible clue, not a confirmed root cause.


Practical conclusion

At this point I’d describe it as:

Active Memory timeoutMs aborts the recall request, but the embedded runner can still spend ~15s unwinding before control returns.

So the bug is probably in one of these areas:

  • embedded-run cleanup waiting too long for silent/tool-only runs
  • provider/request abort not settling the agent promptly
  • waitForIdle() resolving on a later internal timeout boundary instead of immediate abort
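
To make "settle promptly on abort" concrete, here is one possible caller-side guard. This is an illustration only, not the project's fix: `settleOnAbort` is a hypothetical helper, and it lets the underlying run keep unwinding in the background, which has its own resource costs:

```typescript
// Hypothetical guard: reject as soon as the signal aborts, even if the
// underlying run has not finished unwinding yet.
function settleOnAbort<T>(run: Promise<T>, signal: AbortSignal): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(new Error("aborted before the run settled"));
    const detach = () => signal.removeEventListener("abort", onAbort);

    // Always observe the run so a late rejection is never unhandled.
    run.then(
      (value) => {
        detach();
        resolve(value);
      },
      (error) => {
        detach();
        reject(error);
      },
    );

    if (signal.aborted) {
      onAbort();
    } else {
      signal.addEventListener("abort", onAbort, { once: true });
    }
  });
}
```

The trade-off is that the caller gets control back at the deadline while cleanup still finishes later, so anything that depends on cleanup having completed would need separate handling.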

Current workaround

Honestly, not a great one:

  • assume the effective floor is around 15s
  • set active-memory.config.timeoutMs accordingly, or
  • disable Active Memory if that latency is worse than the value it adds

That lines up with the docs too, which recommend starting around 15000 ms even for message mode:

Next step

This looks bug-report-worthy now. I’d file it with:

  • exact version: OpenClaw v2026.4.15
  • both repros:
    • openai-codex/gpt-5.4-mini
    • openrouter/arcee-ai/trinity-large-preview:free
  • the two timing pairs:
    • 3000 -> 20623
    • 3000 -> 15562
  • code pointers:
    • extensions/active-memory/index.ts:1720
    • src/agents/pi-embedded-runner/run/attempt.ts
    • src/agents/pi-embedded-runner/wait-for-idle-before-flush.ts

GitHub issues:

If you want, I can help you write a tight issue body that points straight at the likely fix area.

merry thorn

Follow-up: webhooks plugin registering route multiple times per session

Seeing registered route ai-news appear 2-3 times in logs for a single incoming message:

[webhooks] registered route ai-news on /plugins/webhooks/ai-news for session agent:main:main
[webhooks] registered route ai-news on /plugins/webhooks/ai-news for session agent:main:main
[webhooks] registered route ai-news on /plugins/webhooks/ai-news for session agent:main:main

Source research shows register(api) is triggered via loadGatewayStartupPlugins() at boot — expected once. But we're seeing it fire multiple times mid-session, each time bound to agent:main:main via taskFlow.bindSession().

No config changes between messages. replaceExisting: true is hardcoded so functionally it's not breaking anything, but the repeated registrations are suspicious.

Context: We're also hitting the active-memory embedded-run timeout bug (reported earlier) on the same setup. Could the active-memory cleanup cycle be triggering session resets on agent:main:main, which re-fires the webhook plugin's register() each time?

Question: Is multiple taskFlow.bindSession() firing on session reset expected behavior for the webhooks plugin, or is this a separate bug?

OpenClaw v2026.4.15, Docker on Windows 11.


terse notchBOT

I don’t think the Active Memory timeout bug is re-firing the webhooks route through a session reset path.

What the code says:

  • extensions/webhooks/index.ts logs registered route ... inside register(api)
  • inside that same function it does:
    • api.runtime.taskFlow.bindSession({ sessionKey: route.sessionKey })
    • then api.registerHttpRoute(...)
  • src/plugins/runtime/runtime-taskflow.ts shows bindSession(...) is just a thin binder that returns createBoundTaskFlowRuntime(...)
  • that bound runtime is not a session-reset listener or lifecycle hook

So, if you see:

[webhooks] registered route ai-news ...

multiple times, that means the plugin register() path ran multiple times. It does not mean taskFlow.bindSession() got “fired by session reset”.
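
A stripped-down sketch of that relationship. The `FakePluginApi` shape, `createFakeApi`, and the exact placement of the log call are assumptions made for illustration; only the order (bind, then register the route with `replaceExisting: true`) comes from the description above:

```typescript
// Hypothetical, heavily simplified shapes; the real plugin API is richer.
interface WebhookRoute { name: string; path: string; sessionKey: string }
interface BoundTaskFlow { sessionKey: string }
interface FakePluginApi {
  logs: string[];
  routes: Map<string, BoundTaskFlow>;
  bindSession(opts: { sessionKey: string }): BoundTaskFlow;
  registerHttpRoute(path: string, flow: BoundTaskFlow, replaceExisting: boolean): void;
}

function createFakeApi(): FakePluginApi {
  const logs: string[] = [];
  const routes = new Map<string, BoundTaskFlow>();
  return {
    logs,
    routes,
    // A thin binder: returns a bound object and subscribes to nothing.
    bindSession: ({ sessionKey }) => ({ sessionKey }),
    registerHttpRoute(path, flow, replaceExisting) {
      if (routes.has(path) && !replaceExisting) {
        throw new Error(`route already registered: ${path}`);
      }
      routes.set(path, flow);
    },
  };
}

// Each call logs once; bindSession only runs because register() ran.
function register(api: FakePluginApi, route: WebhookRoute): void {
  const flow = api.bindSession({ sessionKey: route.sessionKey });
  api.registerHttpRoute(route.path, flow, true);
  api.logs.push(
    `[webhooks] registered route ${route.name} on ${route.path} for session ${route.sessionKey}`,
  );
}
```

Calling `register()` three times on the same route produces three log lines but still leaves a single entry in `routes`, which matches the "noisy but not functionally broken" behavior reported here.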

Short answer

  • Is multiple taskFlow.bindSession() on session reset expected? No.
  • Is repeated route registration expected on plugin reload? Yes, that is why replaceExisting: true is there.
  • Is 2 to 3 re-registrations per single incoming message normal? I would say no, that’s suspicious.

What is probably happening instead

This looks more like a plugin registry reload / re-registration issue than a session-reset issue.

A few things can legitimately cause register(api) to run again:

  • gateway startup load
  • deferred plugin reload paths
  • config/secrets reload
  • a runtime plugin-registry compatibility miss that causes loadOpenClawPlugins() to rebuild the registry

But that is different from “the session reset, so webhooks rebound”.
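
To quantify the repeats before filing anything, a small log-counting helper can help. A sketch, assuming log lines keep exactly the `[webhooks] registered route <name> on <path> ...` shape shown in this thread; `countRouteRegistrations` is a hypothetical helper, not part of OpenClaw:

```typescript
// Counts "[webhooks] registered route <name> on ..." lines per route name.
function countRouteRegistrations(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  const pattern = /^\[webhooks\] registered route (\S+) on /;
  for (const line of lines) {
    const match = pattern.exec(line.trim());
    if (match) {
      const name = match[1];
      counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  return counts;
}

const sampleLines = [
  "[webhooks] registered route ai-news on /plugins/webhooks/ai-news for session agent:main:main",
  "[webhooks] registered route ai-news on /plugins/webhooks/ai-news for session agent:main:main",
  "[webhooks] registered route ai-news on /plugins/webhooks/ai-news for session agent:main:main",
];
const sampleCounts = countRouteRegistrations(sampleLines);
```

Comparing the per-route count against the number of incoming messages in the same window shows whether `register()` really runs more than once per message.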

On the possible link to Active Memory

I don’t see a direct code path from Active Memory cleanup to webhook route re-registration.

If there is a connection, it would be indirect, like:

  • Active Memory embedded runs causing some plugin-registry reload/cache miss
  • then webhooks.register() runs again
  • then replaceExisting: true quietly replaces the same route

So I’d treat this as a separate bug class, even if both bugs happen in the same setup.

Practical read on severity