#Trouble with session stalls


slate sedge

Logs:
20:13:34+00:00 warn diagnostic {"subsystem":"diagnostic"} stuck session: sessionId=unknown sessionKey=agent:main:telegram:direct:8668358572 state=processing age=143s queueDepth=1 reason=queued_work_without_active_run classification=stale_session_state recovery=checking
20:13:34+00:00 warn diagnostic {"subsystem":"diagnostic"} stuck session recovery skipped: reason=active_reply_work action=keep_lane sessionId=c29a1f24-a652-466e-98f1-ac8644cd00e6 sessionKey=agent:main:telegram:direct:8668358572 age=143s queueDepth=1 activeSessionId=c29a1f24-a652-466e-98f1-ac8644cd00e6

half blazeBOT

Bandit, this log is the stuck-session watchdog, not necessarily the root error.

What it’s saying:

  • queued_work_without_active_run means the diagnostic state sees queued work for agent:main:telegram:direct:8668358572 but doesn’t see the normal active-run marker for that queued item.
  • recovery=checking means OpenClaw considered auto-recovery.
  • recovery skipped: reason=active_reply_work action=keep_lane ... activeSessionId=... means it found an active embedded reply/run for that same lane, so it refused to clear/reset the queue while work might still be legitimately running. That’s intentional protective behavior.
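If you want to check those fields without reading the line by eye, here is a minimal Python sketch. It assumes the watchdog keeps emitting space-separated key=value tokens as in the lines you pasted; the helper names are made up for illustration and are not part of OpenClaw.

```python
import re

# Sample copied from the log above (timestamp and subsystem prefix trimmed).
SAMPLE = (
    "stuck session recovery skipped: reason=active_reply_work action=keep_lane "
    "sessionId=c29a1f24-a652-466e-98f1-ac8644cd00e6 "
    "sessionKey=agent:main:telegram:direct:8668358572 age=143s queueDepth=1 "
    "activeSessionId=c29a1f24-a652-466e-98f1-ac8644cd00e6"
)


def parse_fields(line: str) -> dict[str, str]:
    """Collect every key=value token on a diagnostic line into a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def skipped_for_active_work(line: str) -> bool:
    """True when recovery was skipped because a reply/run is still active."""
    fields = parse_fields(line)
    return "recovery skipped" in line and fields.get("reason") == "active_reply_work"


fields = parse_fields(SAMPLE)
print(fields["reason"], fields["age"], fields["queueDepth"])
print(skipped_for_active_work(SAMPLE))
```

In the sample, sessionId and activeSessionId are the same value, which fits the reading that the watchdog found live work on that lane and left it alone.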

First check the version, because session-recovery behavior has been changing in recent releases:

openclaw --version
openclaw status --all
openclaw logs --follow

If Telegram never replies and the same session keeps repeating this for several minutes, the least invasive recovery is usually to send /reset or /new in that Telegram chat, then try again. If that doesn’t release it, restart the gateway and re-check.

slate sedge

Is Anthropic having issues? I keep getting these fallbacks, and they keep ending in failures.

half blazeBOT

I checked the public Claude status page just now: Claude API is showing Operational. That doesn’t rule out regional/account/rate-limit trouble, but there isn’t an obvious global outage signal from the status page.

If OpenClaw is falling back and each fallback also resolves to a failure, I’d check these in order:

openclaw --version
openclaw models status --probe
openclaw status --all
openclaw logs --follow

Things to look for in the logs:

  • 429 / rate_limit_error / overloaded_error: likely provider capacity, quota, or long-context eligibility.
  • 401 / auth / OAuth expiry: credential/profile issue, not Anthropic status.
  • fallback targets using the same provider/account: the fallback can fail for the same reason as the primary.
  • Model is not allowed: agents.defaults.models allowlist doesn’t include the selected fallback.
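
A rough way to triage saved log lines against that checklist is sketched below. The patterns come from the bullets above; the category labels and the function are illustrative assumptions rather than anything OpenClaw provides, and the matching is deliberately loose, so treat a hit as a hint only.

```python
import re

# Ordered rules: the first matching pattern wins. Labels are made up for this sketch.
RULES = [
    ("rate_or_capacity", re.compile(r"\b429\b|rate_limit_error|overloaded_error")),
    ("credentials", re.compile(r"\b401\b|oauth|auth", re.IGNORECASE)),
    ("allowlist", re.compile(r"Model is not allowed")),
]


def classify(line: str) -> str:
    """Return the first checklist category whose pattern appears in the line."""
    for label, pattern in RULES:
        if pattern.search(line):
            return label
    return "other"
```

Calling classify on each line of a saved log and counting the labels should show quickly whether the failures cluster around capacity, credentials, or the allowlist.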

slate sedge

I'm on 5.7.
Logs: 20:10:05+00:00 warn agent/cli-backend {"subsystem":"agent/cli-backend"} claude live session turn failed: provider=claude-cli model=claude-sonnet-4-6 durationMs=195668 error=FailoverError
20:10:05+00:00 warn model-fallback/decision {"subsystem":"model-fallback/decision"} {"event":"model_fallback_decision","tags":["error_handling","model_fallback","candidate_failed"],"runId":"b9a9ec7d-0f5d-4089-bcab-1faa3e39683b","sessionId":"c29a1f24-a652-466e-98f1-ac8644cd00e6","lane":"main","decision":"candidate_failed","requestedProvider":"anthropic","requestedModel":"claude-sonnet-4-6","candidateProvider":"anthropic","candidateModel":"claude-sonnet-4-6","attempt":1,"total":1,"reason":"timeout","status":408,"errorPreview":"CLI produced no output for 180s and was terminated.","errorHash":"sha256:8349e7990402","fallbackStepType":"fallback_step","fallbackStepFromModel":"anthropic/claude-sonnet-4-6","fallbackStepFromFailureReason":"timeout","fallbackStepFromFailureDetail":"CLI produced no output for 180s and was terminated.","fallbackStepChainPosition":1,"fallbackStepFinalOutcome":"chain_exhausted","isPrimary":true,"requestedModelMatched":true,"fallbackConfigured":false} model fallback decision

half blazeBOT

This one points more at the Claude CLI backend hanging locally than a clean Anthropic API outage.

Key bits:

  • provider=claude-cli means OpenClaw is driving the local claude CLI, not calling the Anthropic API directly.
  • errorPreview: "CLI produced no output for 180s and was terminated." means the CLI process was silent for 3 minutes, so OpenClaw killed that turn.
  • attempt:1 total:1 + fallbackConfigured:false means there was no real model fallback chain configured. The fallbackStep... chain_exhausted wording is the failover machinery reporting “primary failed and there’s nowhere else to go.”
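
To make that reading reproducible, here is a small Python sketch that decodes the JSON payload and reports which situation it describes. The payload is trimmed from your log line; the summarize function and its wording are assumptions for illustration, not an OpenClaw API.

```python
import json

# Fields trimmed from the model_fallback_decision line quoted above.
PAYLOAD = (
    '{"event":"model_fallback_decision","decision":"candidate_failed",'
    '"requestedProvider":"anthropic","candidateProvider":"anthropic",'
    '"attempt":1,"total":1,"reason":"timeout","status":408,'
    '"fallbackStepFinalOutcome":"chain_exhausted","fallbackConfigured":false}'
)


def summarize(payload: str) -> str:
    """Describe a fallback decision using the fields discussed above."""
    d = json.loads(payload)
    if not d.get("fallbackConfigured") and d.get("total") == 1:
        # Only the primary was ever tried, so "chain_exhausted" is trivially true.
        return f"no fallback chain: primary failed with {d.get('reason')} (status {d.get('status')})"
    if d.get("requestedProvider") == d.get("candidateProvider"):
        return "fallback candidate shares the primary's provider"
    return "fallback candidate uses a different provider"


print(summarize(PAYLOAD))
```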

I’d check the CLI directly on the same machine/user that runs the gateway:

claude --version
claude auth status
claude -p "say hi"
openclaw models status --probe
openclaw logs --follow

If claude -p "say hi" also hangs or takes ages, update the Claude CLI or log in again before anything else. If direct Claude CLI works but OpenClaw keeps timing out, try /reset or /new in that Telegram session, then restart the gateway.
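
If you would rather measure the direct CLI check than eyeball it, a minimal Python sketch follows. It runs a command with a hard timeout and reports whether it finished, failed, hung, or was not found. The probe helper and the 60-second default are assumptions chosen for illustration; OpenClaw's own cutoff in your log was 180s of silence.

```python
import subprocess
import time


def probe(cmd: list[str], timeout_s: float = 60.0) -> tuple[str, float]:
    """Run cmd and return (status, elapsed_seconds).

    status is "ok", "failed" (non-zero exit), "hung" (timeout), or "missing".
    """
    start = time.monotonic()
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "hung", time.monotonic() - start
    except FileNotFoundError:
        return "missing", time.monotonic() - start
    status = "ok" if result.returncode == 0 else "failed"
    return status, time.monotonic() - start
```

Run it as the same user that runs the gateway, for example probe(["claude", "-p", "say hi"]). A "hung" result there points at the CLI itself rather than at OpenClaw.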