#codex runtime stalling

icy shadow
#

Question: is this a known 2026.5.12 native Codex harness/app-server cleanup bug? If yes, is the recommended workaround to move these agents off native Codex temporarily, reset the affected session row, or restart the gateway/app-server?

OpenClaw version: 2026.5.12 (f066dd2)
Install: pnpm, stable/default
Runtime/model path: openai/gpt-5.5 via native Codex app-server / Codex harness
Gateway: local loopback, Telegram direct sessions

Relevant log pattern:

08:52:53 warn agent/embedded
codex app-server connection closed during startup; restarting app-server and retrying

08:52:54 warn agents/harness
{"harnessId":"codex","provider":"openai","modelId":"gpt-5.5","error":"codex app-server client is closed"}
Codex agent harness failed; not falling back to embedded PI backend

Then for the affected agent/session:

08:56:20 warn diagnostic
stalled session: sessionId=b5f02d3e... sessionKey=agent:chan:telegram:direct:... state=processing age=145s queueDepth=1 reason=active_work_without_progress classification=stalled_agent_run activeWorkKind=embedded_run lastProgress=codex_app_server:notification:rawResponseItem/completed terminalProgressStale=true recovery=none

Later, after a reset/new attempt:

09:16:46 warn diagnostic
stalled session: sessionId=ec349745... sessionKey=agent:chan:telegram:direct:... state=processing age=123s queueDepth=1 reason=active_work_without_progress classification=stalled_agent_run activeWorkKind=embedded_run lastProgress=codex_app_server:notification:rawResponseItem/completed terminalProgressStale=true recovery=none

The weird part: the transcript shows the agent did send a Telegram message and produced a final assistant “Sent.”, but the session store still says status="running". After that, /stop and RPC abort report no active abortable run, while the lane/session remains marked busy. So it looks like the abort controller/final event is gone, but session/lane cleanup did not complete.

tepid anvilBOT
# icy shadow Question: is this a known 2026.5.12 native Codex harness/app-server cleanup bug?...

Rodrigo: yes, that pattern lines up with a native Codex app-server / Gateway session-lane cleanup issue in the 2026.5.12 line, not a normal model stall.

The strongest signal is this combo:

  • app-server closes during startup, then Codex agent harness failed; not falling back...
  • later diagnostics show lastProgress=codex_app_server:notification:rawResponseItem/completed with terminalProgressStale=true
  • transcript has a visible final reply, but the session store still says running
  • /stop says there is no active abortable run

That means the actual Codex turn appears terminal, but OpenClaw's session/lane bookkeeping missed cleanup. Current main has fixes after f066dd2 that are directly adjacent: one prevents Codex app-server surrogate stalls, and another shortens stalled Codex recovery from the old safer window to the newer default recovery window. The docs now describe stalled embedded runs being abort-drained after diagnostics.stuckSessionAbortMs so queued turns can resume.
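
The signal combination above can be checked mechanically against the diagnostic lines. What follows is a minimal sketch, not part of OpenClaw: `parse_diagnostic` and `looks_like_missed_cleanup` are hypothetical helper names, and the only input assumed is the `key=value` layout visible in the quoted `stalled session:` log line.

```python
import re


def parse_diagnostic(line: str) -> dict[str, str]:
    """Split a 'stalled session:' diagnostic line into its key=value fields."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def looks_like_missed_cleanup(fields: dict[str, str]) -> bool:
    """True when the session is still 'processing' although the last progress
    event is stale and no recovery was attempted (the pattern in this thread)."""
    return (
        fields.get("state") == "processing"
        and fields.get("terminalProgressStale") == "true"
        and fields.get("recovery") == "none"
    )


# Sample copied from the log excerpt quoted earlier in the thread.
line = (
    "stalled session: sessionId=b5f02d3e... "
    "sessionKey=agent:chan:telegram:direct:... state=processing age=145s "
    "queueDepth=1 reason=active_work_without_progress "
    "classification=stalled_agent_run activeWorkKind=embedded_run "
    "lastProgress=codex_app_server:notification:rawResponseItem/completed "
    "terminalProgressStale=true recovery=none"
)

fields = parse_diagnostic(line)
print(fields["lastProgress"], looks_like_missed_cleanup(fields))
```

A line that matches this check while the transcript already shows a final reply points at bookkeeping rather than at the model turn itself.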

Recommended order:

  1. Upgrade to a build newer than 2026.5.12 / f066dd2 if you can, then restart the Gateway.
  2. For the already-wedged conversation, prefer a Gateway restart over only restarting the app-server. The app-server is only one side; the stuck lane/session ownership is in Gateway memory/session state.
  3. After restart, send /new or /reset in the affected Telegram session. Harness choice is pinned per session id, so config/runtime changes do not fully apply until a new/reset session.
  4. If it repeats, temporarily move those agents off native Codex. For GPT-5.5 via the PI/OAuth path, use openai-codex/gpt-5.5; for native app-server, it is openai/gpt-5.5 plus embeddedHarness.runtime: "codex".
  5. I would not live-edit the session row first. If it remains stuck after restart + reset, stop the Gateway, back up the session store, then do any surgical session-store cleanup offline. Also file this with the sanitized log pattern because it is useful repro material: https://github.com/openclaw/openclaw/issues

Relevant docs:
https://docs.openclaw.ai/plugins/codex-harness
https://docs.openclaw.ai/concepts/session
https://docs.openclaw.ai/cli/sessions

icy shadow
#

but there's no newer build after 2026.5.12

tepid anvilBOT
# icy shadow but there's no newer build after 2026.5.12

You're right. I should have said “newer than that commit / from main,” not “newer stable build.”

If you’re staying on stable 2026.5.12, the practical workaround is:

  1. Restart the Gateway, not just the Codex app-server.
  2. Send /new or /reset in the affected Telegram session so the pinned harness/session binding is refreshed.
  3. If it repeats, move the affected agents off native Codex temporarily: use openai-codex/gpt-5.5 through the PI/OAuth path instead of openai/gpt-5.5 + embeddedHarness.runtime: "codex".
  4. If you’re comfortable running unreleased code, the fixes I saw are on main, so a git/source install from main is the “newer build” path until the next stable lands.

I’d avoid manually editing the session row unless restart + reset does not clear it. If it gets that far, stop the Gateway first, back up sessions.json, then clean it offline.
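
The "back up first, then inspect offline" step could look roughly like the sketch below. It is only an illustration under loud assumptions: the thread does not show the layout of sessions.json, so the code assumes one JSON object mapping session key to a record with a `status` field (the thread only mentions `status="running"`). The helper names, the `idle` value, and the demo keys are hypothetical. Run anything like this only while the Gateway is stopped.

```python
import json
import shutil
import tempfile
import time
from pathlib import Path


def backup_store(store: Path) -> Path:
    """Copy the store next to itself with a timestamp suffix; return the copy."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = store.with_name(f"{store.name}.bak-{stamp}")
    shutil.copy2(store, backup)
    return backup


def running_sessions(store: Path) -> list[str]:
    """List session keys whose record still reports status == 'running'.

    ASSUMPTION: the store is a JSON object of {session_key: record}.
    """
    data = json.loads(store.read_text(encoding="utf-8"))
    return [
        key
        for key, record in data.items()
        if isinstance(record, dict) and record.get("status") == "running"
    ]


# Demo against a throwaway file, not a real OpenClaw session store.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sessions.json"
    demo.write_text(
        json.dumps(
            {
                "agent:chan:telegram:direct:example": {"status": "running"},
                "agent:chan:telegram:direct:other": {"status": "idle"},
            }
        ),
        encoding="utf-8",
    )
    print(backup_store(demo).name, running_sessions(demo))
```

The sketch deliberately only reads and copies; any actual edit to a wedged row is left as a manual decision once the backup exists.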

icy shadow
#

in openai-codex/gpt-5.5 should I use runtime pi?

tepid anvilBOT
# icy shadow in openai-codex/gpt-5.5 should I use runtime pi?

Or remove the forced Codex harness override entirely and keep runtime: "auto" with fallback: "pi", but if you’re specifically avoiding the app-server stall, I’d force runtime: "pi" for now.

After changing it, restart the Gateway and send /new or /reset in the affected Telegram conversation, because harness choice is pinned per session id.

icy shadow
#

is embeddedHarness the correct syntax? give me docs with proof

tepid anvilBOT
# icy shadow is embeddedHarness the correct syntax? give me docs with proof

Yes, embeddedHarness is the documented key.

Proof from the docs:

lucid hazel
#

OpenClaw config is invalid
File: ~/.openclaw/openclaw.json
Problem:

  • agents.defaults.embeddedHarness: Unrecognized key: "fallback"
    Legacy config keys detected:
  • agents.defaults.embeddedHarness: agents.defaults.embeddedHarness is legacy and ignored; set provider/model runtime policy instead. Run "openclaw doctor --fix".
tepid anvilBOT
# lucid hazel OpenClaw config is invalid File: ~/.openclaw/openclaw.json Problem: - agents.d...
    defaults: {
      model: "openai/gpt-5.5"
    }
  }
}

Do not use openai-codex/gpt-5.5 as the model ref on that build. The 2026.5.12 docs say legacy openai-codex/gpt-* config should be repaired with openclaw doctor --fix; the auth profile can still be openai-codex, but the model ref should be openai/gpt-5.5.

Proof from the 2026.5.12 docs/source:

So the cleanup path is: remove agents.defaults.embeddedHarness, run openclaw doctor --fix, set models.providers.openai.agentRuntime.id = "pi", restart Gateway, then /new or /reset.

lucid hazel
#

I moved to main, how do I override to pi?

tepid anvilBOT
# lucid hazel I moved to main, how do I override to pi?

On main, don’t use embeddedHarness. It is still accepted only as a legacy input and runtime selection ignores it. Use provider/model runtime policy.

For all OpenAI agent turns through PI:

pnpm openclaw doctor --fix
pnpm openclaw config set models.providers.openai.agentRuntime.id pi
pnpm openclaw config validate
pnpm openclaw gateway restart

Keep the model as openai/gpt-5.5, not openai-codex/gpt-5.5:

pnpm openclaw config set agents.defaults.model.primary openai/gpt-5.5

If you still have the old key, remove it:

pnpm openclaw config unset agents.defaults.embeddedHarness

Then send /new or /reset in the affected Telegram conversation.

Docs proof: current main says runtime selection uses model-scoped agentRuntime first, then models.providers.<provider>.agentRuntime; whole-agent embeddedHarness is legacy/ignored. OpenAI docs also show openai/gpt-5.5 + provider/model agentRuntime.id: "pi" as the PI route.
https://docs.openclaw.ai/concepts/agent-runtimes
https://docs.openclaw.ai/providers/openai
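
The lookup order described in that message can be pictured with a small sketch. This is not OpenClaw's implementation: `resolve_agent_runtime` is a hypothetical function, and the config is modeled as a plain nested dict whose key names are taken from the thread.

```python
from typing import Any, Optional


def resolve_agent_runtime(config: dict[str, Any], model_ref: str) -> Optional[str]:
    """Return the runtime id for model_ref, or None when nothing is configured.

    Order, as described in the thread for main:
      1. agents.defaults.models[model_ref].agentRuntime.id  (model-scoped)
      2. models.providers.<provider>.agentRuntime.id        (provider-scoped)
    agents.defaults.embeddedHarness is never consulted (legacy, ignored).
    """
    provider = model_ref.split("/", 1)[0]
    defaults = config.get("agents", {}).get("defaults", {})

    model_scoped = defaults.get("models", {}).get(model_ref, {}).get("agentRuntime", {})
    if model_scoped.get("id"):
        return model_scoped["id"]

    providers = config.get("models", {}).get("providers", {})
    return providers.get(provider, {}).get("agentRuntime", {}).get("id")


config = {
    "agents": {
        "defaults": {
            "embeddedHarness": {"runtime": "codex"},  # legacy key, ignored here
            "models": {"openai/gpt-5.5": {"agentRuntime": {"id": "pi"}}},
        }
    }
}
print(resolve_agent_runtime(config, "openai/gpt-5.5"))
```

With only the legacy `embeddedHarness` key present, the function returns `None`, which mirrors why the old override stops having any effect.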

lucid hazel
#

openclaw config set models.providers.openai.agentRuntime.id pi

🦞 OpenClaw 2026.5.16 (3064d61) — One CLI to rule them all, and one more restart because you changed the port.

Error: Config validation failed: models.providers.openai.baseUrl: Invalid input: expected string, received undefined

tepid anvilBOT
# lucid hazel openclaw config set models.providers.openai.agentRuntime.id pi 🦞 OpenClaw 2026...

That error is because setting models.providers.openai.agentRuntime.id creates/edits the provider block, and provider blocks require baseUrl. Don’t use the provider-level override unless you’re also defining the full provider.

Use the model-scoped override instead:

pnpm openclaw config set agents.defaults.models '{"openai/gpt-5.5":{"agentRuntime":{"id":"pi"}}}' --strict-json --merge
pnpm openclaw config set agents.defaults.model.primary openai/gpt-5.5
pnpm openclaw config validate
pnpm openclaw gateway restart

Then send /new or /reset in the affected Telegram conversation.

If that failed command left a rejected file, you can ignore the openclaw.json.rejected.* file. If pnpm openclaw config validate still complains about models.providers.openai.baseUrl, check whether a partial provider block actually exists:

pnpm openclaw config get models.providers.openai --json

If it only contains agentRuntime, remove it:

pnpm openclaw config unset models.providers.openai

Docs basis: model-scoped runtime policy wins at agents.defaults.models["provider/model"].agentRuntime, and OpenAI docs list openai/gpt-5.5 + provider/model agentRuntime.id: "pi" as the PI route.
https://docs.openclaw.ai/concepts/agent-runtimes
https://docs.openclaw.ai/providers/openai
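
To see why the provider-level `config set` failed validation, here is a small sketch of the rule as stated in the thread: any block under `models.providers` needs a string `baseUrl`. The function name is hypothetical and the real validator is certainly richer; the config is again modeled as a plain dict.

```python
from typing import Any


def incomplete_providers(config: dict[str, Any]) -> list[str]:
    """Names of models.providers.* blocks that lack a string baseUrl."""
    providers = config.get("models", {}).get("providers", {})
    return [
        name
        for name, block in providers.items()
        if not (isinstance(block, dict) and isinstance(block.get("baseUrl"), str))
    ]


# What the failed provider-level command would have produced: a partial block.
partial = {"models": {"providers": {"openai": {"agentRuntime": {"id": "pi"}}}}}

# The model-scoped override never touches models.providers at all.
model_scoped = {
    "agents": {
        "defaults": {"models": {"openai/gpt-5.5": {"agentRuntime": {"id": "pi"}}}}
    }
}

print(incomplete_providers(partial), incomplete_providers(model_scoped))
```

A non-empty result for `openai` corresponds to the case where the block only contains `agentRuntime` and should be unset.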

lucid hazel
#

still seems off /status shows this:

🦞 OpenClaw 2026.5.16 (3064d61)
⏱️ Uptime: gateway 1m 52s · system 15d
🧠 Model: openai/gpt-5.5 · 🔑 api-key (env: OPENAI_API_KEY)
↪️ Fallback: openai-codex/gpt-5.5 · 🔑 oauth (openai-codex:noreply@gmail.com (noreply@gmail.com)) (selected model unavailable)
🗄️ Cache: 100% hit · 32k cached, 0 new
📚 Context: 0/200k (0%) · 🧹 Compactions: 0
📊 Usage: 5h 97% left ⏱4h 4m · Week 59% left ⏱3d 10h
🧵 Session: agent:main:telegram:group:-1003773061671:topic:2 • updated just now
⚙️ Execution: direct · Runtime: OpenClaw Pi Default · Think: medium · Text: low

tepid anvilBOT
# lucid hazel still seems off /status shows this: 🦞 OpenClaw 2026.5.16 (3064d61) ⏱️ Uptime: ...

That status is actually good on the runtime side:

Runtime: OpenClaw Pi Default = you are off native Codex/app-server and on PI.

The weird part is auth/model selection:

Model: openai/gpt-5.5 · api-key (env: OPENAI_API_KEY)
then
Fallback: openai-codex/gpt-5.5 · oauth ... (selected model unavailable)

So OpenClaw is trying your OPENAI_API_KEY path first, that path does not have gpt-5.5, then it falls back to the Codex OAuth route. If you want PI + Codex OAuth as the primary path, make the OpenAI auth order prefer the Codex OAuth profile and remove the legacy model fallback:

pnpm openclaw models auth list --provider openai-codex
pnpm openclaw config set auth.order.openai '["openai-codex:noreply@gmail.com"]' --strict-json
pnpm openclaw config set agents.defaults.model '{"primary":"openai/gpt-5.5","fallbacks":[]}' --strict-json
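
The reading of that /status output can be sketched as a tiny parser. It is illustrative only: `summarize_status` is a hypothetical helper, the sample is the three relevant lines from the /status output above with the emoji stripped, and the patterns assume the `Model:`, `Fallback:`, and `Runtime:` labels appear exactly as shown there.

```python
import re
from typing import Optional

STATUS = """\
Model: openai/gpt-5.5 · api-key (env: OPENAI_API_KEY)
Fallback: openai-codex/gpt-5.5 · oauth (openai-codex:noreply@gmail.com (noreply@gmail.com)) (selected model unavailable)
Execution: direct · Runtime: OpenClaw Pi Default · Think: medium · Text: low
"""


def _first(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of pattern in text, stripped, or None."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


def summarize_status(text: str) -> dict:
    """Pull out the fields that matter for the runtime/auth diagnosis."""
    return {
        "model": _first(r"Model: (\S+)", text),
        "fallback": _first(r"Fallback: (\S+)", text),
        "runtime": _first(r"Runtime: ([^·\n]+)", text),
        "primary_unavailable": "(selected model unavailable)" in text,
    }


summary = summarize_status(STATUS)
print(summary)
```

Here `runtime` confirms the PI path, while `primary_unavailable` being true is what indicates the API-key route is being tried first and then abandoned for the OAuth fallback.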
lucid hazel
#

rolled back to 2026.5.7

tepid anvilBOT
# lucid hazel rolled back to 2026.5.7

On 2026.5.7, the config schema is different again. The simple PI + Codex OAuth route is:

  • model ref: openai-codex/gpt-5.5
  • runtime: PI, either omitted or agents.defaults.agentRuntime.id: "pi"
  • do not use model-scoped agents.defaults.models["openai/gpt-5.5"].agentRuntime; that was for newer main and 2026.5.7 does not support it.

Try:

pnpm openclaw config set agents.defaults.model '{"primary":"openai-codex/gpt-5.5","fallbacks":[]}' --strict-json
pnpm openclaw config set agents.defaults.agentRuntime.id pi
pnpm openclaw config unset agents.defaults.embeddedHarness
pnpm openclaw config validate
pnpm openclaw gateway restart

Then /new or /reset.

On 2026.5.7, /status should show roughly:
Model: openai-codex/gpt-5.5 · oauth ...
and
Runtime: OpenClaw Pi Default

If validation now complains about agents.defaults.models.*.agentRuntime, that’s leftover main-branch config. Remove just that model-scoped agentRuntime key; 2026.5.7 only knows whole-agent agents.defaults.agentRuntime, not per-model runtime policy.
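
Because the advice in this thread changes with the build, the PI route per version can be collected in one place. The table below is a sketch that only restates what was suggested in the thread, not something verified against each release; `PI_ROUTE_BY_VERSION` and `pi_route` are hypothetical names, and the 2026.5.12 entry in particular was suggested but not confirmed to validate without a full provider block.

```python
# PI route per build, as suggested in this thread (not independently verified).
PI_ROUTE_BY_VERSION = {
    "2026.5.7": {
        "model_ref": "openai-codex/gpt-5.5",
        "runtime_key": "agents.defaults.agentRuntime.id",
    },
    "2026.5.12": {
        "model_ref": "openai/gpt-5.5",
        "runtime_key": "models.providers.openai.agentRuntime.id",
    },
    "main": {
        "model_ref": "openai/gpt-5.5",
        "runtime_key": 'agents.defaults.models["openai/gpt-5.5"].agentRuntime.id',
    },
}


def pi_route(version: str) -> dict:
    """Look up the suggested model ref and runtime key; raise on unknown builds."""
    try:
        return PI_ROUTE_BY_VERSION[version]
    except KeyError:
        raise ValueError(f"no PI route recorded in this thread for {version!r}") from None


for version in PI_ROUTE_BY_VERSION:
    route = pi_route(version)
    print(version, route["model_ref"], route["runtime_key"])
```

In every row the runtime value is `pi`, and `agents.defaults.embeddedHarness` is absent because the thread treats it as legacy on all three builds.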

lucid hazel
#

all good, I had a backup config