#2026.5.12-beta.4 session rollout file


leaden dragon
#

got a ton of data when i asked it to look at the session rollout file

#

Position: 🟡 YELLOW. The Forge Codex harness is mostly functional, but the rollout/session files show several real reliability risks beyond raw model latency.

Top findings:

1. Codex thread metadata is bloated. state_5.sqlite has 8 threads, and every threads.title is huge: current thread title is 42,136 chars, prior ones are 19k-47k. That title is effectively storing the full inbound prompt/context, not a title. That is bad for list/read paths and leaks internal context into Codex metadata.

2. Rollout files are growing fast. Current rollout is already 687 KB; previous Forge rollout files are 0.25 MB to 2.9 MB. Current rollout has large records: session_meta 66 KB, response items up to 53 KB. That is not catastrophic yet, but it is shaped like future per-turn drag if Codex reads or indexes these repeatedly.

3. The slow path is session-lane local, not gateway-wide. Logs during this review still show chat.history mostly 240-270ms and sessions.list mostly 430-700ms, while the Forge lane is processing,q=1 with activeWorkKind=embedded_run or tool_call. That explains why UI endpoints look okay while turns feel slow.

4. The liveness diagnostics are confused by embedded Codex state. It reports queued_work_without_active_run, then skips recovery because activeWorkKind=embedded_run. That means recovery knows an active run exists, but the stale-session classifier initially cannot model it cleanly.

5. Codex tool shells have the wrong default HOME for OpenClaw CLI use. Inside Codex, HOME=/home/lumadmin/.openclaw/agents/forge/agent/codex-home/home, so openclaw config get falsely reports ~/.openclaw/openclaw.json missing unless OPENCLAW_CONFIG_PATH=/home/lumadmin/.openclaw/openclaw.json is set. That will poison diagnostics and doctor/config checks.

6. The Codex runtime still hardcodes this developer instruction in thread-lifecycle-DG94Po4d.js: visible replies use message. That conflicts with our direct WebChat mitigation. The inbound WebChat directive is fixed by config, but the harness prompt still nudges Codex toward the broken message path.

7. The binding file stores the full dynamic tool fingerprint as a 36 KB JSON string. Any schema/catalog drift can clear the binding and start a new Codex thread. Forge currently has 22 dynamic tools per thread across 8 Codex threads.
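
Finding 1 can be rechecked directly against the state database. A minimal sketch, assuming `state_5.sqlite` has a `threads` table with a `title` column (both named in the finding) plus an `id` column; the `id` column and the 200-character threshold are assumptions for illustration, not confirmed schema:

```python
import sqlite3


def oversized_titles(db_path, max_chars=200):
    """Return (thread_id, title_length) for titles longer than max_chars, longest first.

    Assumes a `threads` table with `id` and `title` columns (hypothetical schema).
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, length(title) AS n FROM threads "
            "WHERE length(title) > ? ORDER BY n DESC",
            (max_chars,),
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Running this against a copy of the database rather than the live file avoids contending with the harness while it is writing.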

Confidence: medium-high. The evidence points at Codex harness rollout/state handling plus serialized session-lane behavior. HyperMem is present, but the measured context compile path is fast and stable around 75 KB; it is not the primary suspect from this pass.
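
The record sizes in finding 2 can be tallied with a few lines. This is a sketch only, assuming the rollout file is JSONL (one JSON object per line) and that each record carries a top-level `type` field; that field name is an assumption, and the real rollout schema may differ:

```python
import json
from collections import defaultdict


def record_sizes(lines):
    """Map record type -> (count, total_bytes, max_bytes) over JSONL lines."""
    stats = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        size = len(line.encode("utf-8"))
        try:
            # `type` is a hypothetical field name for the record kind
            kind = str(json.loads(line).get("type", "unknown"))
        except (json.JSONDecodeError, AttributeError):
            kind = "unparseable"
        entry = stats[kind]
        entry[0] += 1
        entry[1] += size
        entry[2] = max(entry[2], size)
    return {kind: tuple(entry) for kind, entry in stats.items()}
```

Usage would look like `with open(path, encoding="utf-8") as f: record_sizes(f)`, which makes it easy to see whether `session_meta` or the response items dominate a given rollout.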

leaden dragon
#

im going to flip it back to pi and try to repeat the analysis and see if it behaves differently. asking my 2026.5.7 prod server to do a similar analysis is so different its not useful

tame raven
leaden dragon
#

i updated to beta.5 and its a lot better on the areas i hit

#

ive not been able to do as much testing as i had hoped, its a bit crazy with $work atm

tame raven
#

I would just ask your codex to inspect the rollouts to answer question 5 in my list ^ there

#

When you have a moment

leaden dragon
#

i can copy paste that no problem

#

beta.4 was rough, beta.5 is looking quite good

#
Position: 🟡 YELLOW. The 150s beta.4 GPT-5.5/codex harness turns were not dynamic tool waits and not startup retry. They were mostly single model/agent-thinking spans with no tool activity recorded.

beta.4 read

From Forge trajectory 23393686...:

09:05:45 turn 147.9s, tools=0
09:22:58 turn 158.5s, tools=0
09:25:45 turn 263.9s, tools=0
Those are clean prompt.submitted -> model.completed spans. No tool.call, no tool.result, no failed status, no retry marker in the trajectory. So the maintainer answer is:

For the worst beta.4 turns, the time was inside the Codex/model turn itself. It was not waiting on dynamic tools. It does not look like startup retry. If there were retries below the Codex client layer, they are not surfaced in the OpenClaw trajectory.

There were other beta.4 long turns with tools:

08:31:32 turn 143.2s, tool total 9.5s
08:46:19 turn 130.1s, tool total 19.6s
Even there, most time was still model/agent time, not tool execution.
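
The split between tool time and model/agent time can be sketched as below. The event names `prompt.submitted`, `model.completed`, `tool.call`, and `tool.result` come from the trajectory described above; the `ts` (seconds), `type`, and `id` field names are assumptions for illustration, not the actual trajectory schema:

```python
def summarize_turn(events):
    """Split one turn into total, tool, and non-tool (model/agent) seconds.

    Each event is a dict with hypothetical fields: `type`, `ts`, and, for
    tool events, an `id` that links a call to its result.
    """
    start = end = None
    open_calls = {}
    tool_total = 0.0
    for ev in events:
        kind = ev["type"]
        if kind == "prompt.submitted":
            start = ev["ts"]
        elif kind == "model.completed":
            end = ev["ts"]
        elif kind == "tool.call":
            open_calls[ev["id"]] = ev["ts"]
        elif kind == "tool.result":
            began = open_calls.pop(ev["id"], None)
            if began is not None:
                tool_total += ev["ts"] - began
    if start is None or end is None:
        raise ValueError("turn is missing prompt.submitted or model.completed")
    total = end - start
    return {"total": total, "tools": tool_total, "model": total - tool_total}
```

Tool durations are summed per call, so overlapping parallel calls would be counted twice; for the mostly sequential shell inspections discussed here that should be a small error.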

beta.5 read

From GPT-5.5/codex harness trajectory 7958c67b...:

18:59:51 first post-reboot turn 127.9s
19:02:11 turn 18.2s
19:10:22 turn 15.9s
19:12:06 turn 6.2s
19:13:04 turn 5.2s
19:14:10 turn 51.5s
19:15:11 turn 107.3s
The first beta.5 turn overlaps the known first-turn warmup/stuck-lane event, and the session file shows lots of real inspection tool work for the beta.5 comparison. The later 107s turn corresponds to the anthropic connection investigation, which also used multiple shell inspections. So beta.5’s long spans are mostly real investigative work plus first-turn warmup, not the beta.4 pattern of text-only 150s model waits.

Top risk: trajectory and mirrored session JSONL disagree on tool-call visibility in beta.5, so maintainers should not rely on trajectory alone for tool attribution.

Confidence: medium-high. The distinction is clear enough: beta.4 worst turns were model/agent waits; beta.5 long turns were warmup or actual inspection work.
#

comparison of beta.4 and beta.5, much happier on all fronts

#

the anthropic investigation was just really a turn speed comparison with gpt-5.5/codex harness

#

sonnet on adaptive was pretty quick but it said the overall turn latency was about 3s

#

gpt-5.5 on medium (my default) was maybe slower but it said the turn latency was 1s (probably just the wrong calculation point)

#

both match my expected behavior

#

even my own memory system, which is a dual slot contextEngine/memory system, seems to be working, though one of my health monitors is triggering some alerts related to asymmetric tool calls, did codex harness change how tools are handled? i had to build in tool call catching to make some of my context management handle model failovers properly

#

im digging in on this, nothing seems to be failing, so this could just be some changes in tool call pairing persistence, but not in a way that destroys things

#
Codex persists many tool calls as separate assistant messages, often one call per message.
• Multi-tool turns can produce odd parent chains:
assistant toolCall A
assistant toolCall B
toolResult B
toolResult A
or results parented under previous tool results.
#
HyperMem’s global pair repair can handle some of that if both IDs are present. The drop happens when replay assembly, compaction, truncation, or budget slicing includes only part of the pair.

So yes, this could be a Codex harness difference. Codex’s mirrored format creates more small, separately persisted tool-call/tool-result messages, which gives HyperMem more boundaries where a pair can be split. Anthropic-style multi-call messages have a different failure shape, but Pylon shows the same parent-chain oddity can happen there too.
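
A health check for this kind of split can be sketched as follows. The message shape (`kind` of `toolCall` or `toolResult`, plus a shared `id`) is a simplified assumption for illustration, not HyperMem's or Codex's actual persisted format:

```python
def find_split_pairs(window):
    """Report tool-call IDs whose call or result is missing from a replay window.

    `window` is a list of dicts with hypothetical fields `kind` and `id`.
    Order within the window does not matter, so B-before-A results are fine.
    """
    calls = {m["id"] for m in window if m["kind"] == "toolCall"}
    results = {m["id"] for m in window if m["kind"] == "toolResult"}
    return {
        "calls_without_result": sorted(calls - results),
        "results_without_call": sorted(results - calls),
    }
```

Because the check compares ID sets rather than adjacency, the interleaved A/B ordering shown above passes cleanly; only a window that slices through a pair is flagged.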

Top risk: future context can lose tool-call evidence even though the live turn succeeded.

Action: fix in two layers:

1. HyperMem should preserve tool-call/result groups atomically during replay selection and trim.
2. OpenClaw/Codex should persist explicit stable pair metadata, independent of parent chain adjacency.
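
Layer 1 could look roughly like this. It is a sketch under stated assumptions, not HyperMem's actual trim logic: messages are dicts with hypothetical `kind`, `id`, and `text` fields, and the budget is counted in characters of `text`:

```python
TOOL_KINDS = ("toolCall", "toolResult")


def trim_atomically(messages, budget):
    """Keep the newest messages within `budget` chars, never half of a tool pair.

    A pair whose call or result falls outside the budget window is dropped
    entirely instead of being kept half-present.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg.get("text", ""))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()

    calls = {m["id"] for m in kept if m["kind"] == "toolCall"}
    results = {m["id"] for m in kept if m["kind"] == "toolResult"}
    complete = calls & results  # IDs with both halves inside the window
    return [m for m in kept if m["kind"] not in TOOL_KINDS or m["id"] in complete]
```

Dropping an incomplete pair keeps the budget guarantee intact; the alternative is to extend the window backwards until the pair is whole, which preserves more evidence at the cost of overshooting the budget.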
#

im just tracking it and not causing any issues with it, so, i think this is just a hygiene check adjustment