# Need help debugging a recurring Codex app-server/session-lane failure
I checked current docs/source and the v2026.5.22..v2026.5.27 range. Short answer: yes, this looks like a known family of Codex app-server/session-lane failure modes, and I would upgrade before tuning anything else.
What I found:
- 2026.5.27 includes relevant Codex/session fixes after 2026.5.22, including:
  - fix(codex): arm completion idle watch after rawResponseItem/completed with no active items
  - fix(codex): keep attempt watchdog for queued terminal turns
  - fix(agents): release session lock on timeout abort
  - fix(agents): recover failed subagent lifecycle completions
  - fix(codex): preserve native hook relay across restarts
  - several bounded compaction/steering retry fixes
- There was also an earlier fix for stale lane locks when Codex turn/completed handling throws: "Codex/app-server: release the session lane when a downstream consumer throws while draining the turn/completed notification" (#69072 / fixes #67996). Your symptom is very close to that class, but 5.27 has more follow-up hardening.
Recommended path:
- Upgrade Bob/OpenClaw and @openclaw/codex to at least 2026.5.27 (or latest available), then restart the gateway/container cleanly.
- Keep the disk-receipt workflow. That design is good. But for long autonomous product workflows, I would avoid making WebChat the durable orchestrator. Treat WebChat as a control/inspection surface, and drive the actual workflow from a server-side supervisor/worker or isolated agent job that can resume from disk.
- Tune timeouts only after upgrading:
  - overall agent run timeout: agents.defaults.timeoutSeconds
  - LLM idle timeout: agents.defaults.llm.idleTimeoutSeconds (set 0 to disable; use carefully)
  - compaction timeout: agents.defaults.compaction.timeoutSeconds (docs default says 900)

  Docs: https://docs.openclaw.ai/gateway/configuration-reference
- I do not see a public config knob for subagent announce retry count. In source it is currently hardcoded around 3 retries with short backoff, plus expiry windows.
- I also do not see a documented command that “imports/consumes” a completed subagent result from arbitrary disk output after direct announce fails. Recovery is currently via the session/subagent lifecycle machinery, or your own durable supervisor prompting/continuing from receipts.
- /new or /reset is a reasonable workaround for the poisoned current session, but it should not be considered the desired steady state. Try /stop first if the lane is wedged; recent code says /stop clears queued followups and pending session-lane work. If the transcript/session itself is bad, /reset is still the cleaner escape hatch.
- The active-subagent marker should clear through lifecycle/announce cleanup, not merely because a workflow-specific output file exists. So “output exists but active marker remains” is plausible when announce/lifecycle cleanup failed. The 5.27 “recover failed subagent lifecycle completions” fix is especially relevant there.
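The timeout keys in the list above are dotted paths into the gateway config. Here is a minimal Python sketch of how they nest, assuming the config is a JSON-like object. `expand_dotted` is a hypothetical helper for illustration, not part of OpenClaw. The values 1800 and 300 are placeholders rather than recommendations; only 900 comes from the docs default mentioned above.

```python
import json


def expand_dotted(flat):
    """Turn {"a.b.c": 1} into {"a": {"b": {"c": 1}}}."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Placeholder values, not recommendations; only 900 is the documented default.
timeouts = {
    "agents.defaults.timeoutSeconds": 1800,
    "agents.defaults.llm.idleTimeoutSeconds": 300,  # 0 would disable the idle timeout
    "agents.defaults.compaction.timeoutSeconds": 900,
}

config_fragment = expand_dotted(timeouts)
print(json.dumps(config_fragment, indent=2))
```

Check the configuration reference for the actual file format and defaults before copying any of this into a live config.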
For next diagnostics, please provide after upgrading/retesting:
openclaw --version
openclaw status --all
openclaw health --verbose
openclaw models status
openclaw logs --follow
If you’re running inside Docker, run those inside the container, or include the container log from the first “codex app-server client closed before turn completed” line through the subsequent subagent announce give-up. Also include whether the failure happens before or after compaction starts, and the exact Bob agent config block for agents.defaults, agents.list, session, and any Codex/plugin runtime overrides. If it still reproduces on 2026.5.27+, this is worth filing as a GitHub issue with the sanitized log bundle: https://github.com/openclaw/openclaw/issues
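The one-shot commands listed above can be gathered into files in a single pass so the bundle is easy to attach. This is a rough Python sketch, not an official tool: `collect_diagnostics` and `run_command` are hypothetical helper names, and the output file naming is an assumption. `openclaw logs --follow` is left out because it streams until interrupted.

```python
import subprocess
from pathlib import Path

# One-shot commands from the list above. `openclaw logs --follow` streams
# indefinitely, so capture it separately.
DIAGNOSTIC_COMMANDS = [
    ["openclaw", "--version"],
    ["openclaw", "status", "--all"],
    ["openclaw", "health", "--verbose"],
    ["openclaw", "models", "status"],
]


def run_command(argv, timeout=60):
    """Run one command and return its stdout followed by its stderr."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary or a hang: record the failure instead of raising.
        return f"FAILED: {exc}\n"


def collect_diagnostics(out_dir, commands=DIAGNOSTIC_COMMANDS, runner=run_command):
    """Write each command's output to its own file and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for argv in commands:
        name = "_".join(argv).replace("-", "").replace("/", "_") + ".txt"
        target = out_path / name
        target.write_text(f"$ {' '.join(argv)}\n{runner(argv)}")
        written.append(target)
    return written
```

Run it inside the container, for example `collect_diagnostics("diag-after-upgrade")`, and review the files for secrets before sharing them.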
Thanks, this is very helpful. We are currently on OpenClaw/@openclaw-codex 2026.5.22, Docker on Ubuntu, runtime openai-codex/gpt-5.5. We will first finish/safely stop the active product workflow, then prepare a backup and upgrade-readiness audit, then upgrade to at least 2026.5.27 or latest stable.
The strongest clue is that /new or /reset restores progress from disk, while the old session remains poisoned after “codex app-server client closed before turn completed”. We also see subagent output exists on disk but subagent completion announce fails, leaving the parent session not consuming it.
We will test after upgrade with:
- openclaw --version
- openclaw status --all
- openclaw health --verbose
- openclaw models status
- container logs around first client-closed line
- supervisor audit-only against known terminal product fixtures
- subagent launch/complete smoke test
We’ll keep Discord disabled during product workflows and treat WebChat as control/inspection only, with disk supervisor/watchdog as durable workflow source.
That sounds like the right order.
One extra thing I’d capture before the upgrade: a small snapshot of the session/subagent state while it is still “poisoned”, if you can do it safely. The useful bit is the log from the first “client closed before turn completed” line through the “Subagent announce give up” line, plus whether the child session is still listed as active after its disk output exists.
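To cut that window out of a long container log, something like the Python sketch below may help. It assumes the two log lines contain the phrases quoted in this thread; the exact wording in your version may differ, so adjust the markers. `extract_failure_window` is a hypothetical helper name.

```python
START_MARKER = "client closed before turn completed"
END_MARKER = "subagent announce give"  # intended to match "give up" and "give-up"


def extract_failure_window(lines, context_before=20, context_after=20):
    """Return the lines from the first start marker through the next end marker.

    Matching is case-insensitive. If no end marker follows the start marker,
    the window runs to the end of the log. Returns [] if the start marker
    never appears.
    """
    lowered = [line.lower() for line in lines]
    start = next((i for i, text in enumerate(lowered) if START_MARKER in text), None)
    if start is None:
        return []
    end = next(
        (i for i in range(start, len(lowered)) if END_MARKER in lowered[i]),
        len(lines) - 1,
    )
    first = max(0, start - context_before)
    last = min(len(lines), end + context_after + 1)
    return lines[first:last]
```

Feed it the lines of a saved container log (for example the output of `docker logs <container>` redirected to a file) and sanitize the result before attaching it.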
For the upgrade audit, I’d specifically preserve:
- /data/.openclaw/config*
- /data/.openclaw/sessions* / agent session store
- /data/.openclaw/agents/
- the active workflow receipt/output directory
- container image/tag and env vars, with secrets redacted
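A small Python sketch of that snapshot, using only the standard library. The paths are the ones listed above; `snapshot_state` is a hypothetical helper, and the receipt/output directory has to be passed in because its location is specific to your workflow. It does not capture the image tag or env vars, which still need to be recorded (and redacted) separately.

```python
import glob
import tarfile
from pathlib import Path

# Patterns from the list above; adjust if your state directory differs.
BACKUP_PATTERNS = [
    "/data/.openclaw/config*",
    "/data/.openclaw/sessions*",
    "/data/.openclaw/agents",
]


def snapshot_state(archive_path, patterns=BACKUP_PATTERNS, extra_paths=()):
    """Archive every existing match into a gzip tarball; return what was added."""
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for pattern in patterns:
            for match in sorted(glob.glob(pattern)):
                archive.add(match)  # directories are added recursively
                added.append(match)
        for path in extra_paths:
            # e.g. the active workflow receipt/output directory
            if Path(path).exists():
                archive.add(str(path))
                added.append(str(path))
    return added
```

Write the archive outside the directories being backed up, and ideally take the snapshot while the gateway is stopped or idle so the session store is not changing underneath it.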
After upgrade, your smoke test should be: parent launches tiny subagent → subagent writes a known receipt → parent consumes completion without /new or /reset → active subagents returns to 0. If that still fails on 2026.5.27+ or latest stable, I’d file it with the sanitized bundle because that’s a clean repro shape: https://github.com/openclaw/openclaw/issues
And yes, keeping WebChat as control/inspection while the durable supervisor owns workflow state is the architecture I’d trust for this kind of long product pipeline.
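As a rough illustration of "the durable supervisor owns workflow state", here is a Python sketch of resume-from-disk logic. The receipt layout is entirely hypothetical (one `<step>.receipt.json` file per step with `step` and `status` fields); your existing receipt format will differ, and nothing here is an OpenClaw API.

```python
import json
from pathlib import Path


def completed_steps(receipt_dir):
    """Return the names of steps that have a readable receipt marked done."""
    done = set()
    for receipt in Path(receipt_dir).glob("*.receipt.json"):
        try:
            data = json.loads(receipt.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # a partial or corrupt receipt counts as not done
        if isinstance(data, dict) and data.get("status") == "done":
            done.add(data.get("step"))
    return done


def next_step(pipeline, receipt_dir):
    """Return the first step in order without a done receipt, or None."""
    done = completed_steps(receipt_dir)
    for step in pipeline:
        if step not in done:
            return step
    return None
```

The point of the design is that progress is derived from disk on every pass, so a wedged or reset chat session cannot lose or duplicate a step; the supervisor simply asks `next_step(...)` again after any restart.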
Good capture. Two answers:
For verifying the fixes are active after upgrade, there isn’t a special “subagent lifecycle recovery enabled” toggle I found. The practical verification is version + behavior + diagnostics:
openclaw --version
openclaw status --all
openclaw health --verbose
openclaw models status
openclaw gateway stability --json
openclaw gateway diagnostics export --output openclaw-diagnostics.zip
Docs: https://docs.openclaw.ai/gateway/diagnostics
Trajectory capture is also useful for this exact class of bug and is on by default unless disabled with OPENCLAW_TRAJECTORY=0. You can export the active session with /export-trajectory / /trajectory, or set OPENCLAW_TRAJECTORY_DIR before starting the gateway if you want sidecars in a known directory.
Docs: https://docs.openclaw.ai/tools/trajectory
For a poisoned lane, the supported first-line “clear only current run/lane work” command is:
/stop
Current docs/source describe /stop as stopping the current run, and recent changelog/code says it clears queued followups and pending session-lane work. If the stale part is a child run/marker, also check:
/subagents list
/subagents info <id|#>
/subagents kill <id|#|all>
I would not treat /subagents kill all as routine automation during a product workflow, but it is the supported operator cleanup path for stuck controlled children.
If /stop + subagent cleanup does not revive the old WebChat session, then /new or /reset is still the correct escape hatch. On 2026.5.22, I would not spend much time trying to surgically recover the old lane; the relevant fixes landed after that. On 2026.5.27+, if /stop cannot clear it and a new session is still required, that is strong GitHub-issue material with the diagnostics zip and your 7-step repro shape: https://github.com/openclaw/openclaw/issues