#codex harness stalling

reef thunder
#

Is this diagnosis accurate?

Finding
This is not CPU and not Discord. The bad 12:14 message was a Codex harness stall.

Agatha received the prompt at about 12:14:09, then the native Codex app-server never sent the terminal
turn/completed event. OpenClaw waited until its 30-minute terminal watchdog fired at 12:45:05, then
surfaced the partial text Agatha had drafted. No env file was changed and no tool/file operation
actually completed.

The same pattern appears repeatedly on May 13-14: codex app-server turn idle timed out waiting for
terminal event, with last activity like notification:item/completed,
notification:item/agentMessage/delta, rawResponseItem/completed, and account/rateLimits/updated.

CPU is not the cause: NUC load was low (0.35, 0.61, 0.69), OpenClaw was around 3% CPU, Codex around
1-3%.

Current State
Agatha currently inherits the global native Codex route:

openai/gpt-5.5 + agentRuntime.id: "codex"

I tested an isolated PI config copy without changing live config:

openai-codex/gpt-5.5 + agentRuntime: pi

That returned OK with agentHarnessId: "pi" and a model duration of about 10.9s. Real command
time was 34.5s because the local test had cold plugin/CLI startup overhead. So PI is viable, but the
live config will not allow that model for Agatha until we add it.

Recommendation
Switch only Agatha back to PI

red otterBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

stone sleetBOT
#

Yes — based on what you wrote, that diagnosis is internally consistent and matches how the runtimes are split.

The key tell is: openai/gpt-5.5 + agentRuntime.id: "codex" means the native Codex app-server owns the turn loop. If the turn was accepted, emitted deltas/progress, but never emitted the terminal completion event until OpenClaw’s watchdog/timeout path surfaced a partial, that points at a Codex harness/app-server liveness failure, not Discord delivery and not CPU starvation.

The PI test is also a good control: openai-codex/gpt-5.5 + agentRuntime: pi uses Codex OAuth through OpenClaw’s normal PI runner, not the native Codex app-server harness. If that completes cleanly, switching only Agatha back to PI is a reasonable mitigation.

Two caveats I’d keep in the writeup:

  • Existing sessions may keep their recorded runtime; config changes usually affect new/reset sessions, not necessarily an already-bound transcript.
  • “No env/file operation completed” is only proven if the logs show no tool/file completion events after the prompt. The terminal-event stall alone proves the run didn’t finish normally, not automatically that no side effect happened.

So: recommendation looks good. I’d make it agent-scoped for Agatha only, then start/reset Agatha’s session and verify /status or logs show agentHarnessId: "pi" on the next run.

Docs refs: https://docs.openclaw.ai/concepts/agent-runtimes and https://docs.openclaw.ai/providers/openai

timid elbow
#

I had a similar problem
Had to move to the PI harness… a million times better in performance and speed of replies

stone sleetBOT
#

That’s a useful data point, thanks. That lines up with the split here: PI keeps OpenClaw owning the loop, while the native Codex app-server path depends on Codex sending the terminal turn event reliably.

If you still have logs from the bad runs, the interesting lines are any turn idle timed out waiting for terminal event / missing turn/completed patterns plus the OpenClaw version. That makes it much easier to tell whether this is the same harness failure mode versus a different slowdown.
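
For anyone collecting those lines, here is a minimal Python sketch of a filter. It assumes each log entry sits on a single line with key=value fields like the ones quoted in this thread (idleMs, timeoutMs, lastActivityReason); the real OpenClaw log layout may differ, and find_stalls is a hypothetical helper name, not part of any OpenClaw tooling.

```python
import re

# Message fragments quoted in this thread for the stall.
STALL_MARKERS = (
    "turn idle timed out waiting for terminal event",
    "turn idle timed out waiting for turn/completed",
)

# Assumed layout: key=value pairs, values either bare or double-quoted.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def find_stalls(lines):
    """Return the key=value fields of every line that carries a stall marker."""
    stalls = []
    for line in lines:
        if not any(marker in line for marker in STALL_MARKERS):
            continue
        fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
        stalls.append(fields)
    return stalls
```

Feeding it the lines of a saved log file gives one dict per stalled turn, which makes it easy to compare idleMs, timeoutMs, and lastActivityReason across runs.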

reef thunder
#

The specific Agatha bad turn is here:

Session:

  • /home/george/Scripts/Clawdbot/state/agents/agatha/sessions/f90d7aec-ac35-4852-91a9-655e8070e0e6.jsonl
  • /home/george/Scripts/Clawdbot/state/agents/agatha/sessions/f90d7aec-ac35-4852-91a9-655e8070e0e6.trajectory.jsonl

Log line:
2026-05-14T12:45:05.669+01:00
threadId=019e260c-ea05-7b62-917d-c7aa733176a0
turnId=019e2631-aa4d-78b3-bd7d-6262ce82d6cc
idleMs=1800949
timeoutMs=1800000
lastActivityReason=notification:item/completed
message="codex app-server turn idle timed out waiting for terminal event"

Immediately after:
2026-05-14T12:45:05.766+01:00
rawErrorPreview="codex app-server attempt timed out"
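
The numbers in that log line can be cross-checked with a little arithmetic. The sketch below assumes idleMs counts the time since the last app-server activity, measured at the moment the timeout was logged; that reading is an inference from the field names, not something the logs state.

```python
from datetime import datetime, timedelta

# Values copied from the log line above.
timeout_logged_at = datetime.fromisoformat("2026-05-14T12:45:05.669+01:00")
idle_ms = 1800949
timeout_ms = 1800000

# Assumption: idleMs is measured back from the timeout log timestamp.
last_activity_at = timeout_logged_at - timedelta(milliseconds=idle_ms)
overshoot_ms = idle_ms - timeout_ms

# The prompt arrival time is only known approximately ("about 12:14:09").
prompt_received_at = datetime.fromisoformat("2026-05-14T12:14:09+01:00")
active_seconds = (last_activity_at - prompt_received_at).total_seconds()

print(last_activity_at.isoformat(timespec="milliseconds"))  # 2026-05-14T12:15:04.720+01:00
print(timeout_ms / 60000)  # 30.0 (minutes)
print(overshoot_ms)  # 949
print(active_seconds)  # roughly 55.7
```

Under that assumption the app-server went quiet about 56 seconds after the prompt arrived, and the watchdog fired 949 ms past its 30-minute limit.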

And the trajectory confirms the final bad turn went:

session.started
context.compiled
prompt.submitted
turn.terminal_idle_timeout
model.completed timedOut=true aborted=true promptError="codex app-server attempt timed out"
session.ended status=error promptError="codex app-server turn idle timed out waiting for
turn/completed"

I also checked the final bad turn segment for tool calls: there were no tool.call / tool.result events
between that prompt.submitted and the timeout. No files under Agatha’s workspace were modified after
2026-05-14 12:14.
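
A workspace check like that can be reproduced with a short Python sketch. modified_after is a hypothetical helper, the workspace path is whatever directory the agent writes to, and the cutoff is interpreted in the machine's local timezone.

```python
from datetime import datetime
from pathlib import Path


def modified_after(root, cutoff):
    """Return the files under root whose mtime is later than cutoff, sorted by path."""
    cutoff_ts = cutoff.timestamp()  # a naive datetime is treated as local time
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime > cutoff_ts
    )
```

Calling modified_after(workspace, datetime(2026, 5, 14, 12, 14)) and getting an empty list matches the claim above. It only covers files that still exist under that directory, so deletions or writes elsewhere would not show up.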

stone sleetBOT
#

Yep, that evidence makes the diagnosis much stronger.

That sequence is basically the clean signature:

prompt.submitted → no tool.call / tool.result → no turn/completed → turn.terminal_idle_timeout at ~30 min → model.completed timedOut=true aborted=true → session.ended status=error

So I’d phrase it as: the native Codex app-server accepted the turn and produced some activity, but never delivered the terminal turn/completed event OpenClaw needed to close the run. Since there were no tool events in that segment and no workspace mtimes after 12:14, it’s fair to say the bad turn did not complete tool/file work.

Switching Agatha only to PI is the right mitigation. For confidence after the change, start/reset Agatha’s session and check the next trajectory/log shows agentHarnessId: "pi" and a normal session.ended status=ok.
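
The signature can also be checked mechanically. The sketch below works on an ordered list of trajectory event names, using only the names quoted in this thread; how to pull those names out of the .trajectory.jsonl file is left out because the file's schema is not shown here, and classify_last_turn is a hypothetical helper.

```python
TOOL_EVENTS = ("tool.call", "tool.result")


def classify_last_turn(events):
    """Classify the segment after the last prompt.submitted in a trajectory.

    events is an ordered list of event names such as "prompt.submitted".
    """
    if "prompt.submitted" not in events:
        return "no_prompt"
    # Index of the last prompt.submitted, found by searching the reversed list.
    start = len(events) - 1 - events[::-1].index("prompt.submitted")
    segment = events[start + 1:]
    if "turn.terminal_idle_timeout" not in segment:
        return "no_stall"
    if any(name in TOOL_EVENTS for name in segment):
        return "stall_with_tools"
    return "stall_without_tools"
```

A result of stall_without_tools corresponds to the bad turn described here; stall_with_tools would mean the run stalled after tool activity, in which case side effects cannot be ruled out from the trajectory alone.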

reef thunder
#

what model are you using @stone sleet ? what makes you so snappy?

stone sleetBOT
#

I shouldn’t treat my current runtime as model advice — it changes and can be misleading. For picking models, #models has the freshest real-world notes, and the docs are here: https://docs.openclaw.ai/concepts/models

What makes me feel snappy is mostly the execution path: I’m on the OpenClaw PI/default runtime here, not the native Codex app-server harness. PI keeps the turn loop inside OpenClaw, so replies don’t depend on Codex app-server terminal events in the same way. Also this thread has relatively little context, no heavy tool work, and good cache hit rate — all claws, no molasses 🦞

hollow kernel
#

I'm having the same problem. Is this a bug or bad config?

stone sleetBOT
#

Short version: if your logs match George’s pattern, I’d treat it as a native Codex harness/app-server stall, not “bad config” in the usual sense.

The tell is:

prompt.submitted -> some activity -> no tool.call / tool.result -> no turn/completed -> after ~30 min: codex app-server turn idle timed out waiting for terminal event

That means OpenClaw handed the turn off, but the native Codex app-server path never delivered the terminal event OpenClaw needs to close the run. Config matters only in the sense that you may be routed onto the fragile path: openai/gpt-5.5 with embeddedHarness.runtime: "codex" is the native Codex route, while openai-codex/gpt-5.5 is the PI/Codex OAuth route.

First thing: what does openclaw --version show?

Then grab the safe status/log info:

openclaw status --all
openclaw logs --follow