#Hi! App-server codex timeout…

sonic finch
#

Hi. We are seeing multiple regressions after switching OpenClaw agents to the Codex runtime.

Environment:

  • OpenClaw: 2026.5.12
  • Node: 22.22.1
  • Gateway: systemd user service, loopback ws://127.0.0.1:18789
  • Telegram direct chats + DM topics
  • agents.defaults.model.primary = openai/gpt-5.5
  • model config uses agentRuntime.id = "codex"
  • Telegram streaming is enabled:
    channels.telegram.streaming.mode = "partial"
    preview.toolProgress = true
    progress.toolProgress = true
  • Active Memory is enabled with official safe-default config:
    agents = ["main"]
    allowedChatTypes = ["direct"]
    queryMode = "recent"
    promptStyle = "balanced"
    timeoutMs = 15000

Main symptoms:

  1. Codex app-server turns hang and never emit a terminal event.
    Logs show repeated:

    • "codex app-server turn idle timed out waiting for terminal event"
    • idleMs around 1800000, timeoutMs 1800000
    • followed by "codex app-server client retired after timed-out turn"
    • then "Profile openai-codex:pashaganson@gmail.com timed out. Trying next account..."
    • then embedded_run_failover_decision with failoverReason="timeout", timedOut=true, aborted=true, status=408

    Counts from 2026-05-15 log:

    • codex app-server idle timeout: 10
    • codex app-server client retired: 10
    • profile openai-codex timed out: 9
    • embedded_run_failover_decision: 23

    Example:
    time=2026-05-15T15:05:54+03:00
    message="codex app-server turn idle timed out waiting for terminal event"
    extra={"idleMs":1800002,"timeoutMs":1800000,"lastActivityReason":"turn:start"}

  2. User-facing runs sometimes fail with:
    "Request timed out before a response was generated. Please try again, or increase agents.defaults.timeoutSeconds in your config."

    I did not find this exact UI string in local logs, but the underlying runtime logs show the same failure path:
    codex app-server timeout -> client retired -> profile timed out -> failover decision surface_error.
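The per-signature counts above can be reproduced with a small script. This is only a sketch: it matches the message substrings quoted in this report against plain text lines and assumes nothing else about the log format, so adjust the needles if your log wording differs.

```python
# Substrings taken from the log messages quoted in this thread.
SIGNATURES = {
    "idle_timeout": "codex app-server turn idle timed out waiting for terminal event",
    "client_retired": "codex app-server client retired after timed-out turn",
    "profile_timed_out": "timed out. Trying next account",
    "failover_decision": "embedded_run_failover_decision",
}


def count_signatures(lines):
    """Count how many lines contain each known failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts
```

To use it, pass any iterable of lines, for example an open file handle for the day's gateway log.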

Please help me 🙏

hexed harborBOT
# sonic finch Hi. We are seeing multiple regressions after switching OpenClaw agents to the Co...

PashaGans, I checked the current FAQ + Codex harness docs/source. The important bit: this is not primarily a Telegram streaming or Active Memory symptom. lastActivityReason:"turn:start" means OpenClaw got as far as starting the Codex app-server turn, then saw no lifecycle/progress/terminal event for 30 minutes. Raising agents.defaults.timeoutSeconds will mostly just make the stuck run wait longer.
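To illustrate why a larger timeout only delays the failure, here is a minimal idle-watchdog sketch. It is not OpenClaw's implementation; the class and method names are hypothetical, and only the reported fields (idleMs, timeoutMs, lastActivityReason) come from the log above.

```python
class IdleWatchdog:
    """Hypothetical watchdog: remembers the last activity and reports long idles."""

    def __init__(self, timeout_ms, now_ms):
        self.timeout_ms = timeout_ms
        self.last_activity_ms = now_ms
        # Until any event arrives, the only recorded activity is the turn start.
        self.last_activity_reason = "turn:start"

    def touch(self, reason, now_ms):
        """Record an event (progress, notification, ...) and reset the idle clock."""
        self.last_activity_ms = now_ms
        self.last_activity_reason = reason

    def check(self, now_ms):
        """Return a timeout record if the turn has been idle too long, else None."""
        idle_ms = now_ms - self.last_activity_ms
        if idle_ms >= self.timeout_ms:
            return {
                "idleMs": idle_ms,
                "timeoutMs": self.timeout_ms,
                "lastActivityReason": self.last_activity_reason,
            }
        return None
```

If no event ever calls touch(), doubling timeout_ms produces the same record with the same "turn:start" reason, just later.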

Things I’d check in this order:

  1. Confirm you’re using one runtime path consistently.

For native Codex app-server:

agents: {
  defaults: {
    model: { primary: "openai/gpt-5.5" },
    embeddedHarness: {
      runtime: "codex",
      fallback: "none",
    },
  },
}

Docs: https://docs.openclaw.ai/plugins/codex-harness and https://docs.openclaw.ai/gateway/configuration-reference

  2. After any harness/model change, start a fresh session with /new or /reset. Harness choice is pinned per session, so old Telegram DM/topic sessions may keep the previous runner even after config changes.

  3. Run these from the affected chat/session:

/status
/codex status
/codex account

/codex status should show app-server connectivity and version. The docs require Codex app-server 0.118.0+; 2026.5.12 normally pins newer Codex bits, but systemd/PATH can still run a different codex binary than your shell.

  4. Because this is a systemd user service, check privately that the service sees the same auth/home state as your shell. Don’t paste secrets publicly, but verify:

openclaw gateway status
openclaw status --all
openclaw models status
openclaw doctor
systemctl --user show-environment

The likely bad states here are: stale/different codex on PATH, missing ~/.codex auth for the service user, Codex entitlement/rate-limit/auth refresh trouble, or a native Codex app-server bug where the turn is accepted but never emits events.

  5. Practical workaround while debugging: switch production chats back to PI + openai-codex/gpt-5.5, then use a new test session for native codex harness. That separates “Codex OAuth works” from “native app-server lifecycle is stuck”.
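For the "stale/different codex on PATH" case, one way to compare what the shell and the service would resolve is to list every match on a given PATH string rather than only the first. A minimal sketch, assuming you copy the PATH value from systemctl --user show-environment yourself; the helper name is made up for illustration.

```python
import os


def find_executables(name, path_value):
    """Return every executable called `name` on a PATH string, in lookup order."""
    found = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            found.append(candidate)
    return found
```

Running find_executables("codex", os.environ["PATH"]) in your shell and again with the service's PATH value shows whether the first hit differs between the two.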

If the fresh-session repro still hits codex app-server turn idle timed out waiting for terminal event on 2026.5.12, this is worth filing as a GitHub issue with the redacted openclaw status --all, /codex status, /codex account, the exact config block, and one timestamped failed run: https://github.com/openclaw/openclaw/issues

FAQ debug loop is here too: https://docs.openclaw.ai/help/faq

sonic finch
#

We’re seeing intermittent Codex app-server terminal idle timeouts on OpenClaw 2026.5.12 with Codex CLI/app-server 0.130.0.

Environment:

  • OpenClaw: 2026.5.12 (f066dd2)
  • Node: 22.22.1
  • OS: Linux 5.15.0-174-generic x64
  • Gateway: systemd user service, active/running
  • Codex app-server process is running:
    /root/.openclaw/npm/node_modules/@openai/codex/bin/codex.js app-server --listen stdio://
  • Codex auth is OK via OAuth, account expires in 10d, quota is fine.
  • Agent runtime is OpenAI Codex for gpt-5.5.

Current agent sessions can work, but we have a fresh failure where a turn was submitted and then no terminal event arrived for 30 minutes:

{
  "ts": "2026-05-15T12:05:54.701Z",
  "type": "turn.terminal_idle_timeout",
  "sessionKey": "agent:arkadiy:telegram:direct:854067528🧵854067528:806808",
  "provider": "openai",
  "modelId": "gpt-5.5",
  "modelApi": "openai-responses",
  "idleMs": 1800002,
  "timeoutMs": 1800000,
  "lastActivityReason": "turn:start"
}

The run then ended with:

promptError: codex app-server turn idle timed out waiting for turn/completed

This looks like the Codex app-server accepts the turn, but OpenClaw receives no lifecycle/progress/terminal event afterward. Increasing timeout would only delay the failure.
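The reading above can be written down as a small check on the event record. This is a sketch that uses only the fields shown in the JSON above; the function name and the returned wording are illustrative.

```python
import json


def describe_stall(event_json):
    """Summarize where a turn stalled, based on a turn.terminal_idle_timeout record."""
    event = json.loads(event_json)
    if event["idleMs"] < event["timeoutMs"]:
        return "not a timeout"
    reason = event.get("lastActivityReason", "")
    if reason == "turn:start":
        # Nothing was seen after the turn was submitted.
        return "stalled before any event after turn start"
    return f"stalled after {reason}"
```

A record whose lastActivityReason is something other than "turn:start" indicates the stream went quiet mid-turn rather than right after submission.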

One extra symptom: openclaw doctor completes, but its gateway health check sometimes reports:
GatewayTransportError: gateway timeout after 3000ms
while the gateway service itself is still running.

Question: is there a known issue in 2026.5.12 / Codex app-server 0.130.0 where a submitted turn can stall before any terminal event, especially with lastActivityReason: "turn:start"? What additional debug logs/flags should we capture?
hexed harborBOT
# sonic finch We’re seeing intermittent Codex app-server terminal idle timeouts on OpenClaw 20...

PashaGans, I don’t see a public issue hit for the exact turn.terminal_idle_timeout + lastActivityReason:"turn:start" signature. But 2026.5.12’s changelog does contain several Codex app-server fixes around stranded native turns, timed-out clients, and post-tool stalls, so this area is actively moving.

For this specific trace, your read is right: turn:start means OpenClaw submitted/started the app-server turn and then received no useful app-server lifecycle/progress/terminal event for the 30 minute terminal idle watchdog. Increasing agents.defaults.timeoutSeconds is not the fix.

The extra GatewayTransportError: gateway timeout after 3000ms from doctor is worth treating as a second signal. If the gateway is sometimes too busy to answer a lightweight health RPC, capture gateway liveness/stability at the same timestamp as the Codex stall.

Please capture this bundle for a GitHub issue:

openclaw status --all
openclaw health --verbose --timeout 15000
openclaw gateway status
openclaw gateway stability
openclaw gateway diagnostics export

Also capture Codex-side status from the affected OpenClaw chat/session:

liveness
stability
event loop
memory

I’d also include these checks because systemd/PATH mismatches are still common:

systemctl --user show-environment
ps -eo pid,ppid,etime,pcpu,pmem,args | rg 'openclaw|codex app-server'
node -p "process.version"

My working hypothesis from the evidence: either Codex app-server 0.130.0 accepted a turn and then wedged before emitting any event, or the gateway/app-server bridge was starved enough that events/health RPCs were not processed. The diagnostics export + stability bundle should help separate those.

sonic finch
# hexed harbor If this reproduces after a fresh `/new` session and gateway restart, please file...

Audited per your checklist: removed snap codex + system-global npm codex; 5/10 per-agent codex-home/ dirs had no auth.json; one codex app-server was shared across all 10 agents with CODEX_HOME pinned to angela's home; wiped all 10 codex-home/ dirs (kept auth-profiles.json), removed /root/.codex, ran openclaw doctor --fix, restarted gateway. Not claiming this fixed anything: the post-restart window is small, and the wedge can still be there.
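The "5/10 codex-home/ dirs had no auth.json" check can be scripted along these lines. This is a sketch under an assumed layout of <agents_root>/<agent>/codex-home/auth.json; the real directory structure may differ, so treat the paths as hypothetical.

```python
from pathlib import Path


def agents_missing_auth(agents_root):
    """List agent directory names whose codex-home/ has no auth.json.

    Assumes a hypothetical layout: <agents_root>/<agent>/codex-home/auth.json
    """
    missing = []
    for agent_dir in sorted(Path(agents_root).iterdir()):
        codex_home = agent_dir / "codex-home"
        if codex_home.is_dir() and not (codex_home / "auth.json").is_file():
            missing.append(agent_dir.name)
    return missing
```

It only reports file presence and never reads the auth file, so the output is safe to paste.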

From arkadiy DM after restart:
/status → gpt-5.5 · Context 135k/272k · Cache 94% hit · Runtime Codex · Think high · session ...🧵854067528:807447
/codex status → connected.
/codex account → Subscription · Weekly 64% · Short-term 1% · active
/codex threads arkadiy → "No Codex threads returned."
Test: 3 messages to agent arkadiy from 3 chats at ~17:25Z: analytics (80c) to the DM, "привет тест" ("hello test") to a DM topic, and the same text to a group topic. Inbound message log lines fire at spool dequeue:

  • 17:25:14Z analytics DM
  • 17:33:00Z DM topic (+7m46s)
  • 17:33:21Z group (+8m07s)

Msgs 2-3 sat in /root/.openclaw/telegram/ingress-spool-arkadiy/ for ~8 min before openclaw logged them as inbound. The spool did not release them until the analytics turn (thinking:high, multi-tool) finished, so until then they never reached the embedded runner. One app-server child for arkadiy, ~5% CPU, no errors. A long turn freezes ALL inbound for the agent across channels.

Also: the 8-min reply was cut mid-sentence. The TG 4096 limit is ruled out. No "client retired" / "idle timed out" / surface_error in the logs.

Candidates: finish_reason=length from thinking:high, codex item/agentMessage cap, or streaming.mode=partial race.

Questions:
(1) Is per-agent ingress-spool-<agent> supposed to block dequeue while the embedded runner is busy? Is there a decoupling knob? acp.maxConcurrentSessions:8 seems ACP-only.
(2) Is there a way to surface the finish_reason + usage.output_tokens that the app-server got on terminal?

What else to capture before filing? Ready to flip diagnostics.flags=["*"] + logging.level=debug and reproduce.
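For anyone re-checking the dequeue delays quoted above, the offsets follow directly from the three timestamps. A small sketch (same-day HH:MM:SS values assumed; the helper names are illustrative):

```python
from datetime import datetime


def spool_delays(first, others):
    """Seconds between the first dequeue time and each later one (same day)."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(first, fmt)
    return [int((datetime.strptime(t, fmt) - start).total_seconds()) for t in others]


def format_delay(seconds):
    """Render a delay as +<minutes>m<seconds>s."""
    minutes, secs = divmod(seconds, 60)
    return f"+{minutes}m{secs:02d}s"


delays = spool_delays("17:25:14", ["17:33:00", "17:33:21"])
print([format_delay(d) for d in delays])  # ['+7m46s', '+8m07s']
```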
hexed harborBOT
# sonic finch Audited per your checklist: removed snap codex + system-global npm codex; 5/10 p...

PashaGans, good audit. The shared app-server + CODEX_HOME pinned to Angela’s home is a serious enough finding that I’d keep it in the issue timeline, even if it’s not proven fixed.

Answers from source:

  1. Yes, with Telegram isolated ingress the spool drain is currently serial. The worker writes updates to telegram/ingress-spool-<accountId>; the gateway drain loop does:
for update in updates:
  await bot.handleUpdate(update)
  delete spooled file

So if bot.handleUpdate includes the full agent turn, one long turn can head-of-line block later spooled Telegram updates for that account/agent before they even reach OpenClaw’s active-run queue. acp.maxConcurrentSessions is ACP-only. agents.defaults.maxConcurrent is global agent-run concurrency, but “each session still serialized” and it does not decouple this Telegram spool drain. The /queue / messages.queue settings only help after the message has been dequeued and routed into an active session, so they won’t fix files sitting in ingress-spool-arkadiy.

That’s issue-worthy by itself: “Telegram isolated ingress spool has head-of-line blocking while an agent turn is active.”
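The head-of-line effect of a serial drain can be shown with a toy asyncio model. This is only an illustration of the loop shape quoted above, not OpenClaw's code: handle_update stands in for bot.handleUpdate, and drain_concurrent is a hypothetical contrast case, not a proposed fix (a real change would still need per-session ordering).

```python
import asyncio


async def handle_update(name, duration, log):
    """Stand-in for bot.handleUpdate: 'runs a turn' for `duration` seconds."""
    log.append(f"start:{name}")
    await asyncio.sleep(duration)
    log.append(f"end:{name}")


async def drain_serial(updates):
    """Each update is awaited to completion before the next one starts."""
    log = []
    for name, duration in updates:
        await handle_update(name, duration, log)
    return log


async def drain_concurrent(updates):
    """Hypothetical contrast: start every handler, then wait for all of them."""
    log = []
    await asyncio.gather(*(handle_update(n, d, log) for n, d in updates))
    return log
```

With a long update queued first, the serial drain does not start the short one until the long one ends, which matches the ~8 min spool wait reported above.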

  2. For usage/output tokens, use:

/usage full
/status

/usage full appends per-response usage; OAuth hides cost but should still show tokens. For completion metadata, use:

/trace raw
/export-trajectory

Current Codex harness appears to normalize terminal stop state mostly to stop / error / aborted / toolUse. I would not assume raw provider finish_reason=length is surfaced cleanly in chat. The trajectory is the best thing to capture because Codex app-server records model.completed with usage, assistant text, promptError, timeout/aborted flags, and message snapshots.

Before filing, I’d capture one clean repro with:

agent_end
Telegram

For the mid-sentence cut, try one repro with Telegram partial streaming off or final-only if you can tolerate it briefly. If final-only still cuts mid-sentence and /usage full shows high output tokens or trajectory shows the assistant text is already truncated, that points toward Codex/model/output cap. If trajectory has the full text but Telegram receives a partial, that points toward delivery/partial streaming.
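The triage in the paragraph above can be expressed as a simple comparison once you have both texts. This is a heuristic sketch: it assumes you already know the delivered reply was cut off, that you have extracted the assistant text from the trajectory by hand, and that both strings are plain text without formatting differences.

```python
def classify_truncation(trajectory_text, delivered_text):
    """Heuristic triage for a reply that arrived cut mid-sentence."""
    if delivered_text == trajectory_text:
        # The trajectory holds the same cut-off text that was delivered.
        return "trajectory already truncated: suspect Codex/model/output cap"
    if trajectory_text.startswith(delivered_text):
        # The trajectory has more text than the chat received.
        return "delivery shorter than trajectory: suspect delivery/partial streaming"
    return "texts diverge: inspect manually"
```

Whitespace or markup differences between the two sources will land in the "inspect manually" branch, so normalize the strings first if needed.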

Docs: https://docs.openclaw.ai/reference/token-use and https://docs.openclaw.ai/tools/trajectory

I’d file this now with two suspected threads in the same issue: the Telegram isolated-ingress HOL blocking, and the truncated final reply. The old terminal idle timeout may be related, but this new evidence is concrete enough on its own.

sonic finch
#

Krill, filed it: https://github.com/openclaw/openclaw/issues/82274

Two threads in one issue per your suggestion:
(A) Telegram isolated-ingress HOL: reproduced live with spool drain timing (light → 30s wait → heavy → 3m10s wait → light)
(B) Codex app-server stops emitting JSON-RPC notifications mid-turn after rawResponseItem/completed of type custom_tool_call_output. Watchdog fires at the 30-min mark with lastActivityReason: notification:rawResponseItem/completed. Internally codex kept running 1000+ events for ~7 more min before going fully idle in futex_wait, so the JSON-RPC outgoing stream goes silent earlier than the codex session loop does.

Artifact bundle (trajectory diff success vs fail, log slice, spool monitor) linked from the issue body. Diagnostics zip from gateway diagnostics export available on request.

hexed harborBOT
# sonic finch Krill, filed it: https://github.com/openclaw/openclaw/issues/82274 Two thread...

PashaGans, I opened #82274. This is a strong report — the custom_tool_call_output detail is the key upgrade from “timeout somewhere” to “Codex internal loop kept working while the app-server JSON-RPC notification stream stopped.”

Two small additions I’d make if you’re comfortable:

  • Attach the sanitized gateway diagnostics export zip now rather than waiting. Since it’s payload-free/redacted and only 33 KiB, it gives maintainers config shape + stability context without another round trip.
  • Add the exact relevant config shape for Telegram isolated ingress and runtime selection, even if redacted: channels.telegram..., plugins.entries.codex..., and the model/runtime block. The prose says agentRuntime.id="codex", but a minimal JSON5 block helps maintainers spot whether this is legacy per-model runtime selection vs embeddedHarness.runtime.

I’d keep the two threads in one issue for now because the repro window ties them together, but they may split it later. The HOL part is clearly OpenClaw-side. The notification-stream stall smells more Codex app-server-side, but OpenClaw can still improve the watchdog/diagnostics around it.

Nice work getting the success/fail trajectory diff. That’s exactly the kind of artifact that makes this actionable instead of folklore.
