#stuck session


frank pond
#

Constantly getting these errors

12:16:46 [agent/embedded] embedded run failover decision: runId=af171053-690b-4007-ba93-bbe94a7f7d6e stage=assistant decision=surface_error reason=timeout from=openai-codex/gpt-5.4 profile=sha256:06bfb5171eff
12:16:46 [model-fallback/decision] model fallback decision: decision=candidate_succeeded requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.4 reason=unknown next=none
12:21:04 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent medicare-web
12:23:29 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=140s queueDepth=1
12:23:59 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=170s queueDepth=1
12:24:29 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=200s queueDepth=1
12:24:59 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=230s queueDepth=1
12:25:29 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=260s queueDepth=1
12:25:59 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=290s queueDepth=1
12:26:09 [agent/embedded] embedded run timeout: runId=9a4d87d3-40c5-4ca2-9d2a-9a51a89c8e56 sessionId=7547c0ce-a7e8-4e17-a0c9-c9ab9264b6b3 timeoutMs=300000
12:26:10 [agent/embedded] Profile openai-codex:default timed out. Trying next account...
12:26:10 [agent/embedded] embedded run failover decision: runId=9a4d87d3-40c5-4ca2-9d2a-9a51a89c8e56 stage=assistant decision=fallback_model reason=timeout from=openai-codex/gpt-5.5 profile=sha256:06bfb5171eff
12:26:10 [diagnostic] lane task error: lane=main durationMs=308564 error="FailoverError: LLM request timed out."
12:26:10 [diagnostic] lane task error: lane=session:agent:medicare-web:main durationMs=308565 error="FailoverError: LLM request timed out."
12:26:10 [model-fallback/decision] model fallback decision: decision=candidate_failed requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5 reason=timeout next=openai-codex/gpt-5.4 detail=LLM request timed out.
12:26:12 [plugins] memory-lancedb-pro: all 3 memories were filtered out due to redundancy policy
12:26:21 [plugins] memory-lancedb-pro: injecting 1 memories into context for agent medicare-web
12:26:31 [plugins] memory-lancedb-pro: regex fallback found 2 capturable text(s) for agent medicare-web
12:26:32 [model-fallback/decision] model fallback decision: decision=candidate_succeeded requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.4 reason=unknown next=none
12:26:32 [plugins] memory-lancedb-pro: auto-captured 1 memories for agent medicare-web in scope agent:medicare-web
12:28:36 [plugins] memory-lancedb-pro: injecting 2 memories into context for agent medicare-web
12:30:59 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=138s queueDepth=1

12:43:40 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=134s queueDepth=1
12:44:10 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=164s queueDepth=1
12:44:40 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=194s queueDepth=1
12:45:10 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=224s queueDepth=1
12:45:40 [diagnostic] stuck session: sessionId=medicare-web sessionKey=agent:medicare-web:main state=processing age=254s queueDepth=1

patent wyvernBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

heady pelicanBOT
#

This is the agent model call timing out, not the memory lines themselves.

The key bit is:
embedded run timeout ... timeoutMs=300000 → the assistant call sat for 5 minutes and OpenClaw aborted it. The stuck session ... state=processing age=... lines are diagnostics saying “this session has been busy for a while”; they’re usually a symptom of the model/provider call hanging, not a separate root cause.
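
To make that arithmetic concrete, here is a minimal Python sketch, assuming nothing beyond the two log lines pasted above. It is not part of OpenClaw; the helper name parse_kv is made up for illustration. It pulls the key=value fields out of a log line, converts timeoutMs to minutes, and reads the last age the stuck-session diagnostic reported before the abort.

```python
import re

# Hypothetical helper (not an OpenClaw API): collect simple key=value tokens.
def parse_kv(line: str) -> dict[str, str]:
    return dict(re.findall(r"(\w+)=(\S+)", line))

log = [
    "12:25:59 [diagnostic] stuck session: sessionId=medicare-web "
    "sessionKey=agent:medicare-web:main state=processing age=290s queueDepth=1",
    "12:26:09 [agent/embedded] embedded run timeout: "
    "runId=9a4d87d3-40c5-4ca2-9d2a-9a51a89c8e56 "
    "sessionId=7547c0ce-a7e8-4e17-a0c9-c9ab9264b6b3 timeoutMs=300000",
]

stuck = parse_kv(log[0])
timeout = parse_kv(log[1])

last_age_s = int(stuck["age"].rstrip("s"))        # age field carries an "s" suffix
timeout_min = int(timeout["timeoutMs"]) / 60_000  # milliseconds to minutes

print(f"last stuck-session age: {last_age_s}s")
print(f"embedded timeout: {timeout_min:g} minutes")
```

The last stuck-session line (age=290s) lands 10 seconds before the timeout line, which fits the reading that both describe the same hung model call.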

First thing: what does openclaw --version show? Version mismatches cause weird runtime behavior, so check that early.

Then please run/paste the safe report:

openclaw status --all
openclaw gateway status
openclaw models status

Things I’d check next:

  • If this is only happening on openai-codex/gpt-5.5 and it succeeds after fallback to gpt-5.4, your primary model/profile is likely slow/unhealthy/rate-limited rather than the session being corrupted.
  • Try a fresh /new or /reset in that agent/session. If old session context is huge, every turn can become a slow 5-minute request.
  • If you pinned the model manually with /model, remember explicit user selections are stricter; configured defaults/fallbacks behave differently.
  • If this is a Discord thread, the Discord stuck session log doesn’t mean Discord timed it out — Discord hands the turn to the agent queue and waits for the agent runtime to finish/abort.

Docs for the relevant pieces:
https://docs.openclaw.ai/help/faq
https://docs.openclaw.ai/gateway/troubleshooting
https://docs.openclaw.ai/concepts/model-failover

If this keeps reproducing on the latest version with a tiny fresh prompt, I’d treat it as a provider/model reliability issue and bring the openclaw status --all output + logs to a GitHub issue: https://github.com/openclaw/openclaw/issues

frank pond
#

Going to post in multiple sections because of the 4000-character limit.

heady pelicanBOT
#

Thanks — this already shows two useful things.

Main suspect: medicare-web (Luke) | Bootstrap file | PRESENT.

That agent is the one in your stuck-session logs, and a lingering BOOTSTRAP.md can keep an agent acting like it’s still in first-run/bootstrap mode. That can cause repeated long turns, big prompts, and the 5-minute model timeout loop you’re seeing.

Try this first:

openclaw gateway stop
ren %USERPROFILE%\.openclaw\workspace-medicare-web\BOOTSTRAP.md BOOTSTRAP.md.bak
openclaw gateway start

If that path doesn’t exist, find it with:

dir %USERPROFILE%\.openclaw /s /b | findstr /i "workspace-medicare-web BOOTSTRAP.md"

Also: your gateway/service looks healthy. The Port 18789 is already in use warning is expected because the gateway is already running on that port; it’s not the cause.

Secondary recommendation: update when you can. You’re on 2026.4.23, and status says npm update 2026.5.7 is available. Since your previous auto-update restart failed, I’d do it manually:

npm update -g openclaw
openclaw gateway restart
openclaw --version

After moving BOOTSTRAP.md, start a fresh turn in that agent — ideally /new or /reset — and see if medicare-web stops producing stuck session + embedded run timeout logs.

Docs: https://docs.openclaw.ai/help/faq and https://docs.openclaw.ai/gateway/troubleshooting

frank pond
#

Luke is still on gpt-5.5. There is no model override to 5.4 in config or the active session.

Current state:

  • Global primary model: openai-codex/gpt-5.5
  • Luke / medicare-web model: gpt-5.5
  • Luke thinkingLevel: minimal
  • Context: ~114k / 400k
  • BOOTSTRAP.md exists but agents.defaults.skipBootstrap=true, and prior run report showed BOOTSTRAP.md rawChars ~530 but injectedChars 0.

The failure pattern I’m seeing is:

  • gpt-5.5 attempt runs until the 300s embedded timeout
  • OpenClaw falls back to gpt-5.4
  • gpt-5.4 succeeds and posts shortly after

Question: does 4.23 have any known issue where gpt-5.5 embedded PI runs can hang/tool-loop until timeout even when config/thinking/context look fine? Also, is there any other session-level cache besides agents/<id>/sessions/sessions.json that could keep old thinking/model behavior?

heady pelicanBOT
#

Yep — with that extra detail, I’d stop blaming BOOTSTRAP. If skipBootstrap=true and the run report says injectedChars 0, that file existing is probably just noise here.

For 4.23 specifically: I don’t want to overstate it as “known gpt-5.5 always hangs”, but there is documented GPT-5.5/Codex embedded-runtime parity work around exactly the class of issues you’re circling: strict agentic follow-through, tool schema compatibility, and long-task/replay/liveness surfacing. In other words, 4.23 may be old enough that GPT-5.5 can get stuck in a bad embedded Pi turn where 5.4 succeeds after fallback.

Docs page I’d anchor this to: https://docs.openclaw.ai/help/gpt55-codex-agentic-parity

Given your pattern:

  • primary gpt-5.5 hits the hard 300s embedded timeout
  • fallback gpt-5.4 succeeds quickly
  • context is only ~114k/400k
  • thinking is minimal

…I’d treat this as a gpt-5.5 + 4.23 embedded-runtime/provider behavior issue unless logs show an actual tool call looping.
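
If you want to confirm that pattern across a longer log, a rough Python sketch like the one below tallies the model-fallback decisions per candidate model. It only assumes the three decision lines you pasted; parse_kv is an illustrative helper, not an OpenClaw function.

```python
import re
from collections import Counter

# Hypothetical helper (not an OpenClaw API): collect simple key=value tokens.
def parse_kv(line: str) -> dict[str, str]:
    return dict(re.findall(r"(\w+)=(\S+)", line))

decision_lines = [
    "12:16:46 [model-fallback/decision] model fallback decision: "
    "decision=candidate_succeeded requested=openai-codex/gpt-5.5 "
    "candidate=openai-codex/gpt-5.4 reason=unknown next=none",
    "12:26:10 [model-fallback/decision] model fallback decision: "
    "decision=candidate_failed requested=openai-codex/gpt-5.5 "
    "candidate=openai-codex/gpt-5.5 reason=timeout next=openai-codex/gpt-5.4 "
    "detail=LLM request timed out.",
    "12:26:32 [model-fallback/decision] model fallback decision: "
    "decision=candidate_succeeded requested=openai-codex/gpt-5.5 "
    "candidate=openai-codex/gpt-5.4 reason=unknown next=none",
]

# Count (candidate model, outcome) pairs.
tally = Counter()
for line in decision_lines:
    kv = parse_kv(line)
    tally[(kv["candidate"], kv["decision"])] += 1

for (candidate, decision), count in sorted(tally.items()):
    print(f"{candidate:24s} {decision:20s} x{count}")
```

On these lines every gpt-5.5 attempt failed and every gpt-5.4 attempt succeeded, which is the split described above.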

Session/cache answer: the important durable session state is not only agents/<id>/sessions/sessions.json. That file is the index/session metadata, but it usually points at per-session transcript/history files in the same sessions dir. Those can keep the old conversation/tool history alive even if sessions.json has no model override.

Places I’d check for stale model/thinking behavior:

%USERPROFILE%\.openclaw\agents\medicare-web\sessions\sessions.json
%USERPROFILE%\.openclaw\agents\medicare-web\sessions\*.jsonl
%USERPROFILE%\.openclaw\agents\medicare-web\agent\auth-state.json
%USERPROFILE%\.openclaw\openclaw.json

auth-state.json can affect auth/profile routing/cooldowns, not usually thinking/model choice. For model/thinking, the big ones are config + session entry + live gateway in-memory state until restart.
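
A quick way to see whether old transcript history is large is to list the session files by size. This is a minimal Python sketch, assuming the directory layout listed above (sessions.json plus *.jsonl in the agent's sessions dir); the function name session_files is made up for illustration.

```python
import os
from pathlib import Path

def session_files(sessions_dir: Path) -> list[tuple[str, int]]:
    """Return (name, size_bytes) for sessions.json and *.jsonl, largest first."""
    files = [
        p for p in sessions_dir.iterdir()
        if p.is_file() and (p.name == "sessions.json" or p.suffix == ".jsonl")
    ]
    return sorted(
        ((p.name, p.stat().st_size) for p in files),
        key=lambda item: item[1],
        reverse=True,
    )

if __name__ == "__main__":
    # Assumption: the path mirrors the list above; adjust the agent id as needed.
    base = Path(os.environ.get("USERPROFILE", str(Path.home())))
    sessions = base / ".openclaw" / "agents" / "medicare-web" / "sessions"
    if sessions.is_dir():
        for name, size in session_files(sessions):
            print(f"{size:>12,d}  {name}")
    else:
        print(f"not found: {sessions}")
```

It only reads file names and sizes, so it is safe to run while the gateway is up.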

Best next steps, in order:

  1. Update from 2026.4.23 to current (2026.5.7 showed available), then restart gateway.
  2. Start a genuinely fresh Luke session with /new or /reset after restart.
  3. If it still reproduces, temporarily make Luke primary openai-codex/gpt-5.4 or remove 5.5 as primary for that agent.
  4. If 5.5 still times out on a fresh tiny prompt after update, file it with logs/runId: https://github.com/openclaw/openclaw/issues

One more useful diagnostic: in the 300s window, check whether logs show repeated tool calls or just silence after the model request starts. If it’s silence, it’s provider/runtime timeout. If it’s repeated tool calls, that’s a real loop and worth filing separately.

frank pond
#

That makes sense. Since I previously had real issues on 5.x and intentionally rolled back to 2026.4.23, I don’t want to blindly update without a rollback plan.

  1. Is the GPT-5.5/Codex parity work only available in 2026.5.7+, or is any of strict-agentic / liveness surfacing configurable in 4.23?

  2. What is the safest Windows upgrade path with rollback?
    Current setup is OpenClaw 2026.4.23, gateway on Scheduled Task/Desktop launcher, config at %USERPROFILE%\.openclaw\openclaw.json.

  3. During the timeout window, logs mostly show stuck session → embedded run timeout → fallback to gpt-5.4 succeeded. I don’t see obvious repeated tool calls. Does that point more to provider/runtime timeout than an agent tool loop?

  4. After updating, should I enable strict-agentic for GPT-5.5 agents? If yes, what exact config key should I use?

  5. If I want to keep GPT-5.5 as primary, is increasing the embedded timeout above 300s recommended, or would that just hide the bug?

heady pelicanBOT
#

Good instinct not to blindly jump versions. Answers in order:

  1. strict-agentic is present in 2026.4.23. I checked the 2026.4.23 package: it has agents.defaults.embeddedPi.executionContract / per-agent agents.list[].embeddedPi.executionContract, and the resolver already auto-activates strict-agentic for openai / openai-codex GPT-5-family models unless explicitly opted out with "default".

So if Luke is on openai-codex/gpt-5.5 and you did not set embeddedPi.executionContract: "default", 4.23 should already be using the strict-agentic lane. Updating may still help because later 5.x has additional parity/liveness/runtime fixes, but strict-agentic itself is not only 5.7+.
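
Roughly, the resolution rule described above can be sketched like this in Python. This is an illustration of the described behavior only, not OpenClaw's actual resolver: the function name, the provider set, and the "gpt-5" prefix check for detecting the GPT-5 family are all assumptions.

```python
from typing import Optional

# Assumption: these are the providers that get the automatic strict lane.
STRICT_PROVIDERS = {"openai", "openai-codex"}

def resolve_execution_contract(model_ref: str, configured: Optional[str]) -> str:
    """Illustrative only: mirrors the behavior described in this thread."""
    if configured is not None:
        # An explicit setting ("default" or "strict-agentic") always wins.
        return configured
    provider, _, model = model_ref.partition("/")
    # Assumption: the GPT-5 family is recognized by a "gpt-5" model-id prefix.
    if provider in STRICT_PROVIDERS and model.startswith("gpt-5"):
        return "strict-agentic"
    return "default"

print(resolve_execution_contract("openai-codex/gpt-5.5", None))
print(resolve_execution_contract("openai-codex/gpt-5.5", "default"))
```

The first call stands for Luke with no executionContract set; the second shows the explicit opt-out.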

  2. Safest Windows upgrade/rollback path:

Before upgrade, stop gateway and make a real backup:

openclaw gateway stop
mkdir %USERPROFILE%\openclaw-backups
robocopy %USERPROFILE%\.openclaw %USERPROFILE%\openclaw-backups\.openclaw-before-2026-5-7 /MIR
copy %USERPROFILE%\.openclaw\openclaw.json %USERPROFILE%\openclaw-backups\openclaw.json.before-2026-5-7

Then preview/update:

openclaw update --dry-run
openclaw update
openclaw doctor
openclaw gateway restart
openclaw --version
openclaw status --all

If that update path fails, manual npm is the fallback:

npm i -g openclaw@2026.5.7
openclaw doctor
openclaw gateway install --force
openclaw gateway restart

Rollback package only:

npm i -g openclaw@2026.4.23
copy %USERPROFILE%\openclaw-backups\openclaw.json.before-2026-5-7 %USERPROFILE%\.openclaw\openclaw.json
openclaw doctor
openclaw gateway install --force
openclaw gateway restart

If rollback complains that a newer OpenClaw last touched the config, use the downgrade escape hatch for the one service command:

cmd /C "set OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1&& openclaw gateway install --force"
cmd /C "set OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1&& openclaw gateway restart"

Docs: https://docs.openclaw.ai/install/updating

  3. Yes: stuck session → embedded run timeout → fallback succeeds, with no repeated tool-call logs, points much more to provider/runtime/model-call timeout than an agent tool loop. A tool loop usually leaves a trail of tool invocations/results before timeout.

  4. For GPT-5.5 agents, I’d explicitly set it so the config is unambiguous, even though 4.23 should auto-enable it for GPT-5-family runs:

{
  agents: {
    list: [
      {
        id: "medicare-web",
        embeddedPi: { executionContract: "strict-agentic" },
      },
    ],
  },
}

Global default version:

{
  agents: {
    defaults: {
      embeddedPi: { executionContract: "strict-agentic" },
    },
  },
}

Docs: https://docs.openclaw.ai/providers/openai#strict-agentic-gpt-mode

  5. I would not increase the 300s embedded timeout as the first fix. In your case, it would mostly make every bad 5.5 attempt take even longer before falling back to 5.4. The useful signal is that 5.4 succeeds shortly after; a longer timeout hides that signal and makes the bot feel more stuck.

My recommendation: keep the 300s timeout, make strict-agentic explicit for Luke, test on 4.23 once. If 5.5 still silently sits until timeout, either pin Luke to 5.4 for now or do the backed-up 5.7 upgrade test and roll back if the earlier 5.x issues return.
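
The cost of a longer timeout is easy to put in numbers. In this small Python sketch the 22 s fallback time comes from your earlier log (fallback decision at 12:26:10, gpt-5.4 success at 12:26:32); the 600 s value is a hypothetical raised timeout, not a recommended setting.

```python
def worst_case_latency_s(primary_timeout_s: float, fallback_s: float) -> float:
    """Time until a reply when the primary model hangs and the fallback answers."""
    return primary_timeout_s + fallback_s

FALLBACK_S = 22  # 12:26:10 fallback decision -> 12:26:32 candidate_succeeded

for timeout_s in (300, 600):  # 600 is a hypothetical raised timeout
    total = worst_case_latency_s(timeout_s, FALLBACK_S)
    print(f"timeout={timeout_s}s -> reply after about {total:.0f}s")
```

Doubling the timeout roughly doubles the wait on every hung 5.5 attempt without making the eventual 5.4 answer any better.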

frank pond
#

What's default context?

heady pelicanBOT
#

If you mean context window: it’s the token budget OpenClaw can send to the model for one run — system prompt, AGENTS/BOOTSTRAP snippets, memory injections, chat history, tool results, attachments, etc.

There are two related values:

  • contextWindow = native/catalog model metadata, basically “what the model can theoretically handle”
  • contextTokens = OpenClaw’s effective runtime cap, often lower for latency/reliability

For your specific Luke logs showing ~114k / 400k, OpenClaw is treating that run’s effective budget as 400k. So 114k is large, but not near the configured hard limit.
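
As a quick sanity check on "not near the limit", here is the fraction as a tiny Python sketch. The 114k figure is the approximate value from your run report, so the result is approximate too.

```python
def context_usage(used_tokens: int, cap_tokens: int) -> float:
    """Fraction of the effective context budget already used."""
    return used_tokens / cap_tokens

used = 114_000  # "~114k" from the run report (approximate)
cap = 400_000   # effective budget shown for that run

print(f"{context_usage(used, cap):.1%} of the effective budget")
```

That is a bit under a third of the budget, so prompt size alone does not explain a 5-minute hang.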

Current docs say the Codex OAuth route can have a smaller default runtime cap for reliability, and you can override it with:

{
  models: {
    providers: {
      "openai-codex": {
        models: [{ id: "gpt-5.5", contextTokens: 160000 }],
frank pond
#

Updated from 4.23 to 5.7 very slow response times

18:08:58 [diagnostic] liveness warning: reasons=event_loop_delay interval=30s eventLoopDelayP99Ms=32 eventLoopDelayMaxMs=2598.4 eventLoopUtilization=0.23 cpuCoreRatio=0.228 active=1 waiting=0 queued=1 phase=channels.discord.start-account recentPhases=sidecars.subagent-recovery:7ms,sidecars.main-session-recovery:4ms,post-attach.update-sentinel:0ms,sidecars.model-prewarm:4516ms,sidecars.session-locks:43ms,post-ready.maintenance:1124ms work=[active=agent:medicare-web:main(processing,q=1,age=3s last=run:completed) queued=agent:medicare-web:main(processing,q=1,age=3s last=run:completed)]

18:09:30 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=32s eventLoopDelayP99Ms=10594.8 eventLoopDelayMaxMs=10594.8 eventLoopUtilization=1 cpuCoreRatio=1.007 active=1 waiting=0 queued=1 phase=channels.discord.start-account recentPhases=sidecars.subagent-recovery:7ms,sidecars.main-session-recovery:4ms,post-attach.update-sentinel:0ms,sidecars.model-prewarm:4516ms,sidecars.session-locks:43ms,post-ready.maintenance:1124ms work=[active=agent:medicare-web:main(processing/model_call,q=1,age=4s last=model_call:started) queued=agent:medicare-web:main(processing/model_call,q=1,age=4s last=model_call:started)]

heady pelicanBOT
#

That log is not the old “5.5 sat for 300s then fallback” shape. This one says the Gateway process itself got CPU/event-loop starved:

  • eventLoopUtilization=1 = Node’s event loop was effectively 100% busy for that sample
  • cpuCoreRatio=1.007 = about one full CPU core pegged
  • eventLoopDelayP99Ms=10594.8 = timers/callbacks were delayed by ~10.6s
  • active=1 queued=1 = Luke has one active turn and another queued behind it, so replies will feel stuck/serialized
  • phase=channels.discord.start-account means this happened while Discord startup/account work was still settling, not necessarily that Discord is the root cause

So yes: 5.7 is doing something CPU/blocking-y on your box, at least during startup / first model call. I’d separate this into “first 2 minutes after restart” vs “every response forever.”
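
The recentPhases field in the same warning gives a rough view of where startup time went. A minimal Python sketch, assuming the name:NNNms format seen in your log (the function name slowest_phases is made up for illustration):

```python
import re

def slowest_phases(line: str, top: int = 3) -> list[tuple[str, int]]:
    """Return the slowest (phase, milliseconds) pairs from a liveness warning."""
    match = re.search(r"recentPhases=(\S+)", line)
    if not match:
        return []
    phases = []
    for item in match.group(1).split(","):
        name, _, ms = item.rpartition(":")
        phases.append((name, int(ms.removesuffix("ms"))))
    return sorted(phases, key=lambda p: p[1], reverse=True)[:top]

# Excerpt of the 18:08:58 warning (only the fields used here).
excerpt = (
    "phase=channels.discord.start-account "
    "recentPhases=sidecars.subagent-recovery:7ms,"
    "sidecars.main-session-recovery:4ms,post-attach.update-sentinel:0ms,"
    "sidecars.model-prewarm:4516ms,sidecars.session-locks:43ms,"
    "post-ready.maintenance:1124ms"
)

for name, ms in slowest_phases(excerpt):
    print(f"{ms:>6d} ms  {name}")
```

In this sample model prewarm (about 4.5 s) and post-ready maintenance (about 1.1 s) dominate, which is consistent with the startup-window theory but does not prove it.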

Do this now:

openclaw gateway restart
openclaw status --deep
openclaw doctor

Then send one tiny fresh Luke prompt like “reply pong only” and watch whether the same liveness warnings repeat after the gateway has been up for a few minutes.

If it only happens right after restart, it may be startup sidecars / post-update maintenance / model prewarm and less scary. If it happens on every Luke turn, I’d either rollback or file this as a 5.7 regression.

Useful quick mitigation to test: lower Luke’s effective context cap so 5.5 has less prompt/runtime pressure:

{
  models: {
    providers: {
      "openai-codex": {
        models: [{ id: "gpt-5.5", contextTokens: 160000 }],
      },
    },
  },
}

Then restart gateway.

If 5.7 stays slow, rollback is reasonable:

npm i -g openclaw@2026.4.23
copy %USERPROFILE%\openclaw-backups\openclaw.json.before-2026-5-7 %USERPROFILE%\.openclaw\openclaw.json
openclaw doctor
openclaw gateway install --force
openclaw gateway restart

If the install/restart is blocked because newer OpenClaw touched config:

cmd /C "set OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1&& openclaw gateway install --force"
cmd /C "set OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1&& openclaw gateway restart"

For a bug report, the useful evidence is: 5.7 version, Windows, Node 24.15, the liveness warning lines, openclaw status --all, and whether it reproduces on a tiny fresh /new prompt. Issue link: https://github.com/openclaw/openclaw/issues

frank pond
#

🦞 OpenClaw 2026.5.7 (eeef486) — If you're lost, run doctor; if you're brave, run prod; if you're wise, run tests.

OpenClaw status

Overview
+----------------------+-----------------------------------------------------------------------------------------------+
| Item | Value |
+----------------------+-----------------------------------------------------------------------------------------------+
| OS | windows 10.0.19045 (x64) · node 24.15.0 |
| Dashboard | http://127.0.0.1:18789/ |
| Tailscale exposure | off |
| Channel | stable (default) |
| Update | pnpm · up to date · npm latest 2026.5.7 |
| Gateway | local · ws://127.0.0.1:18789 (local loopback) · unreachable (timeout) |
| Gateway service | Scheduled Task installed · registered · running (pid 24316, Startup-folder login item |
| | installed; listener detected on port 18789.) |
| Node service | Scheduled Task not installed |
| Agents | 11 · 3 bootstrap files present · sessions 13 · default dev-sam active 56m ago |
| Memory | enabled (plugin memory-lancedb-pro) · not checked |
| Plugin compatibility | none |
| Probes | enabled |
| Events | none |
| Tasks | 0 active · 0 queued · 0 running · 22 issues · audit 13 warn · 916 tracked |
| Heartbeat | disabled (dev-sam), disabled (crm-kota), disabled (discord), disabled (discord-bot), |
| | disabled (ebay-spook), disabled (medicare-web), disabled (orchestrator), disabled (peanut), |
| | disabled (photoroom), disabled (printify-ethan), disabled (product-research) |
| Last heartbeat | unavailable |
| Sessions | 13 active · default gpt-5.5 (200k ctx) · 11 stores |
+----------------------+-----------------------------------------------------------------------------------------------+

Security audit
Summary: 0 critical · 3 warn · 1 info

heady pelicanBOT
#

That status confirms the 5.7 gateway is wedged, not just slow:

running + listener detected on port 18789 but Gateway unreachable (timeout) means the Node process is still bound to the port, but it isn’t answering the WS health/status request. Combined with the earlier eventLoopUtilization=1 / 10s event-loop delay, I’d treat this as event-loop starvation or a hard stuck gateway process.
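
To show the difference between "port is bound" and "process is answering", here is a rough Python sketch. It is not how openclaw status probes the gateway (that is a WS health request whose protocol isn't shown here); it assumes only that the port answers plain HTTP, since status lists a dashboard URL on the same port. The function name probe and the three labels are made up for illustration.

```python
import socket

def probe(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Return 'no_listener', 'listener_unresponsive', or 'responsive'."""
    try:
        # The OS can complete the TCP handshake even if the process is stuck.
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "no_listener"
    with sock:
        try:
            # Assumption: the port speaks HTTP (dashboard URL on the same port).
            sock.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
            data = sock.recv(1)
        except OSError:
            # Timed out or reset: something is bound but not answering.
            return "listener_unresponsive"
    return "responsive" if data else "listener_unresponsive"
```

Calling probe("127.0.0.1", 18789) against a wedged gateway would be expected to return listener_unresponsive: the connect succeeds because the kernel still holds the listening socket, but no bytes come back before the timeout.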

Before killing it, grab the tail of the file log if you can:

powershell -NoProfile -Command "Get-Content C:\tmp\openclaw\openclaw-2026-05-13.log -Tail 200"

If that path doesn’t exist, use the path shown by openclaw gateway status.

Then try the normal recovery:

openclaw gateway restart
openclaw gateway status
openclaw status --deep

If restart also hangs or leaves it unreachable, kill the stuck gateway PID shown in status and let the Scheduled Task/login item restart it:

taskkill /PID 24316 /F
openclaw gateway start
openclaw gateway status

If it wedges again on 5.7 after a clean restart, I’d rollback rather than keep tuning:

npm i -g openclaw@2026.4.23
copy %USERPROFILE%\openclaw-backups\openclaw.json.before-2026-5-7 %USERPROFILE%\.openclaw\openclaw.json
openclaw doctor
openclaw gateway install --force
openclaw gateway restart

The default gpt-5.5 (200k ctx) bit looks reasonable — context cap is not obviously the immediate problem here. The immediate problem is the gateway process becoming unresponsive.

frank pond
#

Tailed the log first.
Ran openclaw gateway restart.
It reproduced the bad state: listener on 18789, PID 22856, but WS probe timed out.
Killed that wedged PID.
Started the gateway again.
New gateway is healthy on PID 19032.
Current state:

openclaw gateway status: reachable, admin-capable, listening on 127.0.0.1:18789.
openclaw status --deep: gateway reachable, OpenClaw 2026.5.7, default gpt-5.5, 200k ctx.
openclaw doctor: completed. Warnings only, no plugin load errors.
Discord is connected, but logs still show event-loop starvation during Discord/user probe windows.
tasks audit: 13 warnings for lost backing sessions, no errors.
plugins doctor: memory-lancedb-pro loads, but warns it uses legacy before_agent_start.

heady pelicanBOT
#

Nice, that narrows it down a lot.

At this point I’d call the original 5.7 wedged-PID issue recovered, but the remaining event-loop starvation during Discord/user probe windows is still a real 5.7 regression candidate. The important bit is: doctor is clean, plugin load is clean, gateway can be healthy after a hard PID kill, but something during channel/probe work still blocks the Node event loop.

I’d do a controlled isolation pass before rolling back:

  1. Capture 30–60 lines around one starvation event:
powershell -NoProfile -Command "Select-String -Path C:\tmp\openclaw\openclaw-2026-05-13.log -Pattern 'liveness warning|user probe|discord|event_loop' -Context 5,10 | Select-Object -Last 80"
  2. Temporarily test with memory plugin disabled, because before_agent_start is legacy and it’s in the hot path before agent turns:
{
  plugins: {
    entries: {
      "memory-lancedb-pro": { enabled: false },
    },
  },
}

Restart gateway, send one tiny Discord message, and see if liveness warnings continue.

  3. If memory disabled does not change it, test Discord startup/probe isolation by starting gateway with channels skipped just once:
cmd /C "set OPENCLAW_SKIP_CHANNELS=1&& openclaw gateway"

If that stays healthy while normal service mode starves, the repro is very likely in Discord channel startup/probe handling.

  4. The tasks audit: lost backing sessions warnings are probably cleanup debt, not the cause. Don’t chase those first unless logs point there.

If step 2 or 3 identifies the trigger, that’s excellent GitHub issue material. Include:

  • Windows 10, Node 24.15.0, OpenClaw 2026.5.7
  • upgraded from 2026.4.23
  • gateway restart produced listener-but-WS-timeout until PID kill
  • liveness warnings during Discord/user probe windows
  • whether disabling memory-lancedb-pro changes it
  • the log excerpt around the warning

If this machine is production-ish and you need reliability now: rollback to 4.23 and pin Luke to 5.4/160k ctx. But if you can spare 15 minutes, the memory-vs-Discord isolation above will make the bug much easier to fix.

frank pond
#

19:08:51 [diagnostic] liveness warning: reasons=event_loop_delay interval=34s eventLoopDelayP99Ms=32.1 eventLoopDelayMaxMs=5251.3 eventLoopUtilization=0.247 cpuCoreRatio=0.256 active=1 waiting=0 queued=1 phase=channels.discord.start-account recentPhases=sidecars.subagent-recovery:6ms,sidecars.main-session-recovery:4ms,post-attach.update-sentinel:0ms,sidecars.model-prewarm:4301ms,sidecars.session-locks:33ms,post-ready.maintenance:1073ms work=[active=agent:medicare-web:main(processing,q=1,age=8s) queued=agent:medicare-web:main(processing,q=1,age=8s)]

STILL HAPPENING

heady pelicanBOT
#

Yep — still happening, but this sample is less catastrophic than the earlier one:

  • eventLoopDelayMaxMs=5251.3 = one callback/timer got delayed ~5.2s
  • eventLoopDelayP99Ms=32.1 = most of the interval was fine
  • eventLoopUtilization=0.247 / cpuCoreRatio=0.256 = not pegged this time

So this is a spike, not full sustained event-loop starvation like the eventLoopUtilization=1 sample. Still worth chasing because it lines up with Discord startup and Luke being queued.
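
The spike-versus-sustained distinction can be written down as a small rule. This Python sketch uses the fields from your two warnings; the thresholds (0.9 utilization, 1000 ms delay) and the labels are my own assumptions for illustration, not values OpenClaw uses.

```python
import re

# Hypothetical helper (not an OpenClaw API): collect simple key=value tokens.
def parse_kv(line: str) -> dict[str, str]:
    return dict(re.findall(r"(\w+)=(\S+)", line))

def classify(line: str, util_threshold: float = 0.9,
             delay_threshold_ms: float = 1000.0) -> str:
    """Label a liveness warning as 'sustained', 'spike', or 'ok' (assumed thresholds)."""
    kv = parse_kv(line)
    util = float(kv["eventLoopUtilization"])
    p99 = float(kv["eventLoopDelayP99Ms"])
    worst = float(kv["eventLoopDelayMaxMs"])
    if util >= util_threshold and p99 >= delay_threshold_ms:
        return "sustained"  # loop busy for most of the interval
    if worst >= delay_threshold_ms:
        return "spike"      # one long stall, otherwise healthy
    return "ok"

# Excerpts of the 18:09:30 and 19:08:51 warnings (only the fields used here).
earlier = ("eventLoopDelayP99Ms=10594.8 eventLoopDelayMaxMs=10594.8 "
           "eventLoopUtilization=1 cpuCoreRatio=1.007")
latest = ("eventLoopDelayP99Ms=32.1 eventLoopDelayMaxMs=5251.3 "
          "eventLoopUtilization=0.247 cpuCoreRatio=0.256")

print(classify(earlier), classify(latest))
```

With these assumed thresholds the earlier sample comes out as sustained and the latest one as a spike, matching the reading above.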

At this point I’d stop tuning model/context and isolate the startup/channel path:

  1. Temporarily disable the memory plugin and restart:
{
  plugins: {
    entries: {
      "memory-lancedb-pro": { enabled: false },
    },
  },
}

If the warning goes away, that legacy before_agent_start hook is implicated.

  2. If it still happens, test with Discord skipped once:
cmd /C "set OPENCLAW_SKIP_CHANNELS=1&& openclaw gateway --port 18790"

Then probe that test gateway. If no liveness warnings there, Discord startup/probe is implicated.

  3. If you need the bot reliable right now, rollback to 4.23. You already reproduced a 5.7 listener-but-WS-timeout wedge, and this is enough to justify rollback for production use.

For a GitHub issue, include both samples: the earlier eventLoopUtilization=1 one and this lower-CPU spike. The contrast matters: one shows hard starvation/wedge, the other shows repeat startup-delay spikes during channels.discord.start-account.

frank pond
#

Issue not resolved; can close.

heady pelicanBOT
#

Got it — sorry we didn’t get this fully resolved here.

Since 5.7 reproduced a gateway/event-loop wedge and rollback/isolation is the next real step, closing the thread is fine. If you want this tracked for a fix, please file it on GitHub with the liveness logs + Windows/Node/OpenClaw versions: https://github.com/openclaw/openclaw/issues

frank pond
#

03:17:24 [ws] protocol mismatch conn=8ffef174-015b-4770-80a5-a3156e078024 remote=127.0.0.1 client=openclaw-control-ui webchat vcontrol-ui
03:17:24 [ws] closed before connect conn=8ffef174-015b-4770-80a5-a3156e078024 peer=127.0.0.1:52810->127.0.0.1:18789 remote=127.0.0.1 fwd=n/a origin=http://127.0.0.1:18789 host=127.0.0.1:18789 ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 code=1002 reason=protocol mismatch
03:17:24 [discord] client initialized as 1040690868140658798; awaiting gateway readiness
03:17:39 [ws] protocol mismatch conn=0c08f256-b6ab-4b7a-882e-0ad57f4ea2a1 remote=127.0.0.1 client=openclaw-control-ui webchat vcontrol-ui
03:17:39 [ws] closed before connect conn=0c08f256-b6ab-4b7a-882e-0ad57f4ea2a1 peer=127.0.0.1:52817->127.0.0.1:18789 remote=127.0.0.1 fwd=n/a origin=http://127.0.0.1:18789 host=127.0.0.1:18789 ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 code=1002 reason=protocol mismatch
03:17:54 [ws] protocol mismatch conn=cad83267-9ec5-45c9-bc6a-e1b965f1fc82 remote=127.0.0.1 client=openclaw-control-ui webchat vcontrol-ui
03:17:54 [ws] closed before connect conn=cad83267-9ec5-45c9-bc6a-e1b965f1fc82 peer=127.0.0.1:52823->127.0.0.1:18789 remote=127.0.0.1 fwd=n/a origin=http://127.0.0.1:18789 host=127.0.0.1:18789 ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 code=1002 reason=protocol mismatch
03:18:09 [ws] protocol mismatch conn=3b888538-e535-42b4-9770-69039d1b5fe3 remote=127.0.0.1 client=openclaw-control-ui webchat vcontrol-ui
03:18:09 [ws] closed before connect conn=3b888538-e535-42b4-9770-69039d1b5fe3 peer=127.0.0.1:52829->127.0.0.1:18789 remote=127.0.0.1 fwd=n/a origin=http://127.0.0.1:18789 host=127.0.0.1:18789 ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 code=1002 reason=protocol mismatch
03:18:24 [ws] protocol mismatch conn=1f3c15aa-9b01-4ab9-9579-7b694780d8ef remote=127.0.0.1 client=openclaw-control-ui webchat vcontrol-ui
03:18:24 [ws] closed before connect conn=1f3c15aa-9b01-4ab9-9579-7b694780d8ef peer=127.0.0.1:52834->127.0.0.1:18789 remote=127.0.0.1 fwd=n/a origin=http://127.0.0.1:18789 host=127.0.0.1:18789 ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 code=1002 reason=protocol mismatch

Updated; now getting spammed with these.

heady pelicanBOT
#

That spam is almost certainly the browser Control UI reconnecting every 15s with an older WS protocol than the gateway expects. The key bit is client=openclaw-control-ui ... vcontrol-ui + reason=protocol mismatch — not Discord itself.
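
The 15s figure comes straight from the timestamps on your protocol mismatch lines. A minimal Python sketch of that check:

```python
from datetime import datetime

# Timestamps of the five "protocol mismatch" lines pasted above.
stamps = ["03:17:24", "03:17:39", "03:17:54", "03:18:09", "03:18:24"]

times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

print(gaps)  # [15, 15, 15, 15]
```

A perfectly regular gap like that looks like a client retry timer rather than user activity, which is why a stale browser tab is the first suspect.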

Quick stop-the-spam move: close any open Control UI / webchat tab pointed at 127.0.0.1:18789.

Then fix the mismatch:

  1. Please check openclaw --version first.
  2. Run openclaw doctor — there’s a UI rebuild/repair path for protocol mismatch cases.
  3. Restart the gateway: openclaw gateway restart
  4. Reopen the dashboard/Control UI and do a hard refresh. If it still loops, clear site data for http://127.0.0.1:18789 or try an incognito window.

If you’re running from source/dev checkout, rebuild the UI too: pnpm ui:build.

Docs refs: https://docs.openclaw.ai/help/faq and https://docs.openclaw.ai/web/control-ui

frank pond
#

On version v2026.5.12-beta.6, still getting errors:

03:43:01 [diagnostic] liveness warning: reasons=event_loop_delay interval=38s eventLoopDelayP99Ms=34.8 eventLoopDelayMaxMs=18723.4 eventLoopUtilization=0.645 cpuCoreRatio=0.652 active=1 waiting=0 queued=1 phase=channels.discord.start-account recentPhases=sidecars.main-session-recovery:4ms,sidecars.restart-sentinel:22ms,post-attach.update-sentinel:11ms,post-ready.maintenance:369ms,sidecars.model-prewarm:5341ms,sidecars.session-locks:4684ms work=[active=agent:medicare-web:main(processing,q=1,age=21s) queued=agent:medicare-web:main(processing,q=1,age=21s)]

heady pelicanBOT
#

Yep — this is a different failure than the Control UI protocol mismatch.

This log says the gateway event loop is getting starved while Discord is starting, and the stuck work item is:
agent:medicare-web:main(processing,q=1,age=21s)

We’ve seen this pattern on 2026.5.12-beta.6: stale/corrupt session state can keep a session “processing” and cause gateway-wide liveness warnings.

Safest recovery path:

  1. Stop/close whatever is actively talking to that medicare-web agent/session if you can.
  2. Run:
    openclaw status --all
    openclaw doctor
    
  3. If it still repeats, move the stale session state aside rather than deleting it. Look in the agent session storage for the medicare-web session (on your setup that should be %USERPROFILE%\.openclaw\agents\medicare-web\sessions\, as listed earlier in this thread), back it up, then restart:
   openclaw gateway restart

Also check the agent workspace for a leftover BOOTSTRAP.md; that can keep an agent stuck in first-run/bootstrapping behavior. If present, move it aside and restart.

If you can paste the openclaw status --all tail around medicare-web + any session-file errors, we can be more precise. Docs: https://docs.openclaw.ai/gateway/troubleshooting

frank pond
#

04:02:14 [agents/harness] Codex agent harness failed; not falling back to embedded PI backend
04:02:14 [diagnostic] lane task error: lane=cron-nested durationMs=6432 error="Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './plugin-sdk/codex-mcp-projection' is not defined by "exports" in C:\Users\desti\.openclaw\npm\node_modules\@openclaw\codex\node_modules\openclaw\package.json imported from C:\Users\desti\.openclaw\npm\node_modules\@openclaw\codex\dist\thread-lifecycle-DMn8qeke.js"
04:02:14 [diagnostic] lane task error: lane=session:agent:photoroom:main durationMs=6436 error="Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: Package subpath './plugin-sdk/codex-mcp-projection' is not defined by "exports" in C:\Users\desti\.openclaw\npm\node_modules\@openclaw\codex\node_modules\openclaw\package.json imported from C:\Users\desti\.openclaw\npm\node_modules\@openclaw\codex\dist\thread-lifecycle-DMn8qeke.js"

heady pelicanBOT
#

That one is a packaging/version skew in the Codex harness, not a stuck Discord channel.

@openclaw/codex@2026.5.12-beta.6 imports:
openclaw/plugin-sdk/codex-mcp-projection

…but the matching openclaw@2026.5.12-beta.6 package does not export that subpath, so Node throws ERR_PACKAGE_PATH_NOT_EXPORTED. I verified the published package metadata — this looks like a beta packaging bug / mismatched plugin SDK export.
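
For reference, this is roughly the check Node performs, as a simplified Python sketch. It is not Node's full resolution algorithm (it ignores conditions, null targets, and array fallbacks), and the package.json contents below are hypothetical, not the real openclaw manifest.

```python
def subpath_exported(package_json: dict, subpath: str) -> bool:
    """Simplified check: does "exports" define this subpath (e.g. './a/b')?"""
    exports = package_json.get("exports")
    if exports is None:
        return True  # no "exports" field: subpaths are not restricted
    if not isinstance(exports, dict) or not any(k.startswith(".") for k in exports):
        return subpath == "."  # shorthand forms only describe the main entry
    for key in exports:
        if key == subpath:
            return True
        if key.count("*") == 1:
            prefix, suffix = key.split("*")
            # The "*" must stand for at least one character.
            if (subpath.startswith(prefix) and subpath.endswith(suffix)
                    and len(subpath) >= len(key)):
                return True
    return False

# Hypothetical manifest for illustration only.
pkg = {"exports": {".": "./dist/index.js", "./plugin-sdk/core": "./dist/core.js"}}

print(subpath_exported(pkg, "./plugin-sdk/codex-mcp-projection"))
```

When a lookup like this fails for an import the plugin relies on, Node raises ERR_PACKAGE_PATH_NOT_EXPORTED even if the file physically exists inside the package.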

Workaround options:

  1. If you don’t need the Codex harness right now, disable it and restart:

    openclaw plugins disable codex
    openclaw gateway restart
    
  2. If you do need Codex, try pinning/downgrading the Codex plugin to the last stable line, then restart: