#STUCK SESSION


ocean hawk
#

01:21:34 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=135s queueDepth=1
01:22:04 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=165s queueDepth=1
01:22:34 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=195s queueDepth=1
01:23:04 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=225s queueDepth=1
01:23:34 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=255s queueDepth=1
01:24:04 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=285s queueDepth=1
01:24:19 [agent/embedded] embedded run timeout: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e sessionId=0aa3896c-c615-419c-a8d5-b075a4566c8c timeoutMs=300000
01:24:49 [agent/embedded] Profile openai-codex:default timed out. Trying next account...
01:24:49 [agent/embedded] embedded run failover decision: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e stage=assistant decision=fallback_model reason=timeout from=openai-codex/gpt-5.5 profile=sha256:06bfb5171eff
01:24:49 [diagnostic] lane task error: lane=main durationMs=340196 error="FailoverError: LLM request timed out."
01:24:49 [diagnostic] lane task error: lane=session:agent:medicare-web:main durationMs=340199 error="FailoverError: LLM request timed out."
01:24:49 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5 reason=timeout next=openai-codex/gpt-5.4 detail=LLM request timed out.
01:24:52 [plugins] memory-lancedb-pro: all 3 memories were filtered out due to redundancy policy
01:25:01 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent medicare-web
01:25:42 [agent/embedded] embedded run failover decision: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4 profile=sha256:06bfb5171eff
01:25:42 [model-fallback/decision] model fallback decision: decision=candidate_succeeded requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.4 reason=unknown next=none
01:28:04 [plugins] memory-lancedb-pro: injecting 3 memories into context for agent peanut
01:28:24 [plugins] memory-lancedb-pro: regex fallback found 0 capturable texts for agent peanut

cloud whaleBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

kind glacierBOT
#

That log looks like a provider/model timeout, not a memory/plugin issue.

What’s happening:

  • stuck session ... processing is the diagnostic warning while the agent turn is still running.
  • At timeoutMs=300000, the embedded run hit the 5-minute timeout.
  • OpenClaw then tried failover from openai-codex/gpt-5.5 to openai-codex/gpt-5.4.
  • The memory-lancedb-pro lines are probably incidental context injection, not the cause.
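
If you want to pull those fields out of the log yourself, here is a minimal Python sketch. It assumes the `key=value` layout seen in the pasted lines; `parse_fields` is a hypothetical helper, not part of OpenClaw.

```python
import re

# Matches key=value pairs; a value is either double-quoted or runs to the next space.
# Limitation: an unquoted value containing spaces (e.g. detail=LLM request timed out.)
# is cut at its first space.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_fields(line: str) -> dict[str, str]:
    """Return the key=value fields found in one log line (hypothetical helper)."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}

line = (
    "01:24:19 [agent/embedded] embedded run timeout: "
    "runId=301f22c8-0ccf-4c38-9936-192b9cf5535e "
    "sessionId=0aa3896c-c615-419c-a8d5-b075a4566c8c timeoutMs=300000"
)
fields = parse_fields(line)

# 300000 ms / 60000 ms per minute = 5 minutes.
timeout_minutes = int(fields["timeoutMs"]) / 60_000
print(timeout_minutes)  # 5.0
```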

First thing: what version are you on?

openclaw --version
openclaw status --all

If this is still blocking the medicare-web session, try a fresh session/reset for that agent, or restart the gateway once. If it comes back repeatedly, the useful bug report is: version, openclaw status --all, and ~100 log lines before the first embedded run timeout.

Relevant docs: https://docs.openclaw.ai/help/faq and https://docs.openclaw.ai/concepts/model-failover

ocean hawk
#

Version 4.23

It's on a fresh /new session

kind glacierBOT
#

Got it — if /new is fresh, then I’d stop looking at stale session state. This is almost certainly the live model call hanging until the runner’s 300s timeout.

Two quick clarifiers:

  1. Is the exact version output 2026.4.23? If yes, I’d update first before chasing ghosts.
  2. Did the fallback reply from openai-codex/gpt-5.4 ever arrive, or did the user-facing turn still fail?

For triage, please grab:

openclaw status --all
openclaw gateway status

And the log window from ~1 minute before embedded run timeout through the final model-fallback/decision line.

If you need to unblock immediately: switch that agent/session to a different configured working model/provider, or remove the slow Codex model from that agent’s fallback chain temporarily. The queueDepth=1 part is just the next message stuck behind the active turn; the root symptom is the 5-minute LLM timeout.

ocean hawk
#

It's not 4.23 - something else is wrong.

The answer from the agent is usually: 01:25:42 [agent/embedded] embedded run failover decision: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4 profile=sha256:06bfb5171eff

kind glacierBOT
#

Ah yep — I misread your 4.23, sorry. The important line is this one:

stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4

That means the fallback candidate also timed out, and OpenClaw had no successful next step left, so it surfaced the timeout instead of producing an assistant answer. In other words: fresh /new rules out stale session history; this is the active LLM call path timing out twice.

The weird part is your later line:

candidate_succeeded ... candidate=openai-codex/gpt-5.4

paired with surface_error reason=timeout. That suggests the fallback wrapper considered the gpt-5.4 candidate “handled”, but the embedded runner handled it by surfacing a timeout/error payload. That’s worth filing as a bug if the user-visible reply is just bad/internal instead of a clean timeout message.
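
To spot that pairing in a longer log, something like the sketch below could work. It is only an illustration: `find_contradictions` is a hypothetical helper, and it assumes the `decision=`, `from=` and `candidate=` fields always look like the two lines you pasted.

```python
import re

FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_fields(line):
    """Return the key=value fields of one log line (hypothetical helper)."""
    return {k: v.strip('"') for k, v in FIELD_RE.findall(line)}

def find_contradictions(lines):
    """Return models logged as candidate_succeeded after a surface_error from them."""
    surfaced = set()  # models whose run ended in decision=surface_error
    flagged = []
    for line in lines:
        fields = parse_fields(line)
        decision = fields.get("decision")
        if decision == "surface_error":
            surfaced.add(fields.get("from"))
        elif decision == "candidate_succeeded" and fields.get("candidate") in surfaced:
            flagged.append(fields["candidate"])
    return flagged

log = [
    "01:25:42 [agent/embedded] embedded run failover decision: "
    "runId=301f22c8-0ccf-4c38-9936-192b9cf5535e stage=assistant "
    "decision=surface_error reason=timeout from=openai-codex/gpt-5.4 "
    "profile=sha256:06bfb5171eff",
    "01:25:42 [model-fallback/decision] model fallback decision: "
    "decision=candidate_succeeded requested=openai-codex/gpt-5.5 "
    "candidate=openai-codex/gpt-5.4 reason=unknown next=none",
]
print(find_contradictions(log))  # ['openai-codex/gpt-5.4']
```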

Next things I’d check:

openclaw status --all
openclaw models status

And whether a direct minimal run on that same provider/model hangs too. If both openai-codex/gpt-5.5 and openai-codex/gpt-5.4 are timing out under the same auth profile hash, I’d temporarily move this agent to a different known-good provider/model or remove Codex from its fallback chain to unblock.

For a GitHub issue, include the exact log window you pasted plus config model/fallback shape with secrets redacted: https://github.com/openclaw/openclaw/issues

ocean hawk
#

To be clear, this was the exact sequence over multiple attempts:

queueDepth=1
01:24:04 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=285s queueDepth=1
01:24:19 [agent/embedded] embedded run timeout: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e sessionId=0aa3896c-c615-419c-a8d5-b075a4566c8c timeoutMs=300000
01:24:49 [agent/embedded] Profile openai-codex:default timed out. Trying next account...
01:24:49 [agent/embedded] embedded run failover decision: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e stage=assistant decision=fallback_model reason=timeout from=openai-codex/gpt-5.5 profile=sha256:06bfb5171eff
01:24:49 [diagnostic] lane task error: lane=main durationMs=340196 error="FailoverError: LLM request timed out."
01:24:49 [diagnostic] lane task error: lane=session:agent:medicare-web:main durationMs=340199 error="FailoverError: LLM request timed out."
01:24:49 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5 reason=timeout next=openai-codex/gpt-5.4 detail=LLM request timed out.
01:24:52 [plugins] memory-lancedb-pro: all 3 memories were filtered out due to redundancy policy
01:25:01 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent medicare-web
01:25:42 [agent/embedded] embedded run failover decision: runId=301f22c8-0ccf-4c38-9936-192b9cf5535e stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4 profile=sha256:06bfb5171eff
01:25:42 [model-fallback/decision] model fallback decision: decision=candidate_succeeded requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.4 reason=unknown next=none
01:28:04 [plugins] memory-lancedb-pro: injecting 3 memories into context for agent peanut
01:28:24 [plugins] memory-lancedb-pro: regex fallback found 0 capturable texts for agent peanut
01:34:18 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=120s queueDepth=1
01:34:49 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=151s queueDepth=1
01:35:19 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=181s queueDepth=1

kind glacierBOT
#

Yep, this sequence makes it clearer. I’d read it as:

  1. queueDepth=1 = another message/turn is waiting behind the active medicare-web run. It’s a symptom, not the root cause.
  2. 01:24:19 embedded run timeout ... timeoutMs=300000 = the primary openai-codex/gpt-5.5 run hit the 5-minute run timeout.
  3. 01:24:49 Profile openai-codex:default timed out. Trying next account... = profile rotation was attempted, but there wasn’t a usable next profile/account, so it escalated.
  4. candidate_failed ... next=openai-codex/gpt-5.4 = model fallback kicked in correctly.
  5. 01:25:42 ... surface_error reason=timeout from=openai-codex/gpt-5.4 = the fallback model also produced a timeout-shaped failure.
  6. candidate_succeeded ... candidate=openai-codex/gpt-5.4 is the suspicious bit. That looks like the outer fallback wrapper treated “the fallback candidate returned a surfaced error result” as success, instead of logging it as candidate_failed / final failure. So the logging/state here is probably misleading even though the underlying issue is still: both Codex candidates timed out.

So I don’t think this is memory, stale /new state, or LanceDB. The memory-lancedb-pro lines are just context being rebuilt for the fallback/new attempt.
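
The timestamps are consistent with that reading. Here is a quick arithmetic check in Python, assuming `age` counts seconds since the turn entered `processing` (my interpretation; the log does not define it):

```python
from datetime import datetime, timedelta

def at(hhmmss: str) -> datetime:
    """Parse a log timestamp; the date is arbitrary because the logs omit it."""
    return datetime.strptime(hhmmss, "%H:%M:%S")

# 01:24:04 ... state=processing age=285s  ->  inferred start of the turn
turn_start = at("01:24:04") - timedelta(seconds=285)

# 01:24:19 ... embedded run timeout ... timeoutMs=300000
elapsed = (at("01:24:19") - turn_start).total_seconds()

# 01:24:49 fallback to gpt-5.4 chosen, 01:25:42 surface_error reason=timeout
fallback_seconds = (at("01:25:42") - at("01:24:49")).total_seconds()

print(turn_start.strftime("%H:%M:%S"))  # 01:19:19
print(elapsed)                          # 300.0
print(fallback_seconds)                 # 53.0
```

So the primary run lines up exactly with the 300000 ms limit, while the gpt-5.4 attempt ended about 53 seconds after it was selected. The log does not say which limit applied to the fallback, so that is worth including in the issue.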

The real question is why openai-codex is timing out for that agent. I’d check three things:

openclaw status --all
openclaw models status

Then temporarily switch medicare-web away from openai-codex/gpt-5.5 + gpt-5.4 to any known-good model/provider. If that fixes it, we’ve isolated this to the Codex provider/auth path or that model fallback chain.

Also: please file this log sequence as a GitHub issue. The surface_error reason=timeout followed by candidate_succeeded is worth fixing or at least making much clearer: https://github.com/openclaw/openclaw/issues

ocean hawk
#

Are older versions of OpenClaw maintained and updated?

kind glacierBOT
#

Short answer: no, not as long-lived maintenance branches.

OpenClaw moves fast and support is generally for current latest / beta / dev builds, not old pinned versions. Older versions may still exist on npm/GitHub, but bugfixes are not usually backported to them.

So for a bug like this, first test/update to current stable or beta/dev if asked by maintainers:

openclaw update status
openclaw update

Docs: https://docs.openclaw.ai/help/faq-first-run

ocean hawk
#

Updating to a newer version isn't an option as it bricks my entire OpenClaw

kind glacierBOT
#

Oof, that’s frustrating — and it changes the priority. If newer OpenClaw bricks the install, that’s the first bug to isolate, because staying pinned means you’re effectively stuck on a snapshot with no backported fixes.

I wouldn’t tell you to blindly update in-place. Safer path:

  1. Keep your current working install pinned/backed up.
  2. Test the newer version in a separate install/state dir if possible.
  3. Capture the actual “bricks my OpenClaw” failure so it can be fixed.

Useful info:

openclaw --version
openclaw update status
openclaw status --all

And for the failed newer version: what exactly breaks — gateway won’t start, config migration fails, auth disappears, sessions won’t load, tools fail, Discord stops replying?

If you can paste the first fatal error from the newer-version logs, we can triage that without risking your current setup. If it’s reproducible, it should go to GitHub too: https://github.com/openclaw/openclaw/issues

ocean hawk
#

What CLI is the latest beta branch running?

kind glacierBOT
#

Latest npm dist-tags right now:

latest: 2026.5.7
beta: 2026.5.12-beta.6

So the beta CLI would be openclaw@2026.5.12-beta.6.

Check locally with:

openclaw --version
openclaw update status

ocean hawk
#

What harness?

kind glacierBOT
#

By “harness” I mean the runner layer OpenClaw uses to talk to the model/CLI and turn it into an agent session — not a separate thing you install.

For openai-codex/..., that’s the OpenClaw embedded/agent runner path around the Codex provider: it sends the prompt/tools, watches streaming/output, enforces timeouts, and reports failover. Your logs are from that layer:

[agent/embedded] embedded run timeout
[model-fallback/decision]

So in your case: the OpenClaw runner/harness is timing out waiting for the Codex model response.
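
As a rough mental model only, a runner that enforces a per-call timeout and walks a fallback chain can be sketched like this. Every name here (`call_model`, `run_with_fallback`, `model-a`, `model-b`) is hypothetical, and the decision labels are just borrowed from your log; this is not OpenClaw's actual implementation.

```python
import asyncio

async def call_model(model: str, delay: float) -> str:
    """Hypothetical stand-in for a provider call that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return f"reply from {model}"

async def run_with_fallback(chain, timeout_s):
    """Try each (model, delay) in order; return (reply, decisions)."""
    decisions = []
    for model, delay in chain:
        try:
            # wait_for cancels the call and raises TimeoutError once timeout_s elapses.
            reply = await asyncio.wait_for(call_model(model, delay), timeout_s)
        except asyncio.TimeoutError:
            decisions.append((model, "candidate_failed"))
            continue  # move on to the next model in the chain
        decisions.append((model, "candidate_succeeded"))
        return reply, decisions
    # Every candidate timed out: nothing left to try, so surface the error.
    return None, decisions

# Both simulated models take 0.2 s, longer than the 0.05 s timeout.
chain = [("model-a", 0.2), ("model-b", 0.2)]
reply, decisions = asyncio.run(run_with_fallback(chain, timeout_s=0.05))
print(reply, decisions)
```

In this sketch a timed-out candidate is always recorded as `candidate_failed`, which is what I would expect a clean log to show when no candidate produces a reply.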

ocean hawk
#

o OpenAI OAuth complete
Config write rejected: C:\Users\desti\.openclaw\openclaw.json (size-drop:37509->12843). Rejected payload saved to C:\Users\desti\.openclaw\openclaw.json.rejected.2026-05-15T06-10-57-064Z.
Error: Config write rejected: C:\Users\desti\.openclaw\openclaw.json (size-drop:37509->12843). Rejected payload saved to C:\Users\desti\.openclaw\openclaw.json.rejected.2026-05-15T06-10-57-064Z.

kind glacierBOT
#

First: what does openclaw --version show?

That error is OpenClaw’s config safety guard doing its job. The OAuth flow tried to write a config that was much smaller than your current one (37509 -> 12843 bytes), so OpenClaw rejected it instead of risking clobbering your existing openclaw.json. The active config should still be the original file; the smaller attempted write was saved as:

C:\Users\desti\.openclaw\openclaw.json.rejected.2026-05-15T06-10-57-064Z

Please don’t paste either config file here, since they may contain secrets.
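
For a sense of scale, here is a small Python sketch of what a size-drop check like this might look like. The `max_drop` threshold of 0.5 is purely an assumption for illustration; the error message only shows the two sizes, not the limit OpenClaw actually uses.

```python
def size_drop_ratio(old_size: int, new_size: int) -> float:
    """Fraction by which the file would shrink (0.0 if it grows or stays the same)."""
    if old_size <= 0 or new_size >= old_size:
        return 0.0
    return 1 - new_size / old_size

def should_reject(old_size: int, new_size: int, max_drop: float = 0.5) -> bool:
    # max_drop is a hypothetical threshold, not OpenClaw's real value.
    return size_drop_ratio(old_size, new_size) > max_drop

# Sizes taken from the error: size-drop:37509->12843
ratio = size_drop_ratio(37509, 12843)
print(f"{ratio:.1%}")               # 65.8%
print(should_reject(37509, 12843))  # True
```

In other words, the attempted write would have thrown away roughly two thirds of the file, which is why it looks like a clobber rather than a normal edit.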

Run these in PowerShell:

openclaw --version
openclaw config validate
openclaw doctor
openclaw doctor --fix
openclaw config validate