#Errors: liveness warning: reasons=event_loop_delay and stuck session: sessionId=unknown


low heath
#

My agents aren't responsive at all, and I have tried many different things, nothing seems to help. When I send a message in Dashboard, it appears the agent begins to respond, but (with some very rare exceptions) never actually does. When I look at the session logs, I typically see error messages like " liveness warning: reasons=event_loop_delay […]", "stalled session: […]", and "stuck session: sessionId=unknown […]".

What can I do to fix this?

worn tigerBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

faint compassBOT
# low heath My agents aren't responsive at all, and I have tried many different things, noth...

First thing: what does openclaw --version show? Version mismatch/stale install is a very common cause of weird gateway/session behavior.

Those log lines usually mean the Gateway process is not just “waiting on the model”; its Node event loop is getting blocked and sessions are not returning to idle. I’d triage it in this order:

  1. Get a safe snapshot:
openclaw --version
openclaw status --all
openclaw models status
openclaw doctor
openclaw gateway status
  2. Restart the Gateway, then immediately tail logs while sending one Dashboard message:
openclaw gateway restart
openclaw logs --follow
  3. If it still hangs after restart, check for stuck session state. We’ve seen stale/corrupt session state cause whole-gateway stalls. Don’t delete it first; move it aside so it’s recoverable:
openclaw gateway stop
mkdir -p ~/.openclaw/backup
mv ~/.openclaw/agents/main/sessions ~/.openclaw/backup/sessions-$(date +%Y%m%d-%H%M%S)
openclaw gateway start

If your agent id is not main, use the matching directory under ~/.openclaw/agents/.

  4. Also check the workspace for a leftover BOOTSTRAP.md. If the agent is stuck bootstrapping forever, move that aside too:
ls -la ~/.openclaw/workspace/BOOTSTRAP.md
low heath
#

Current version of OpenClaw is 2026.5.12 (f066dd2). No ~/.openclaw/workspace/BOOTSTRAP.md present; however, fwiw: I noticed in the logs that the memory file is about 60k and gets truncated to 12K.

Steps mentioned at 2. have all been done hundreds of times already.

Re: 3: it seems that would move ALL sessions out of the way; that seems overkill. I just want to get the last couple of sessions 'unstuck'; I want to convert all existing session logs into more readable html format with tags for tweaking .css . Both are under 12h old now; a command that only would move sessions less than 12h old out of the way seems better here, no?

Agent id is "main" for all relevant sessions - I have already attempted to remove 'stuck' sessions so previously existing long running sessions for Telegram and 3 Discord sessions are not visible anymore - but I would prefer not to lose those conversations, of course. Associated session names are all over the place and quasi-randomly change from " agent:main:main " to "agent:main:dashboard:somemostoftentruncatedinclihex".

faint compassBOT
# low heath Current version of OpenClaw is 2026.5.12 (f066dd2). No `~/.openclaw/workspace/BO...

Yeah, moving all sessions is overkill if the bad state is only in recent rows. The session index for main should be here by default:

~/.openclaw/agents/main/sessions/sessions.json

Docs I checked: https://docs.openclaw.ai/cli/sessions and https://docs.openclaw.ai/gateway/health. Session lists are persisted state, not channel liveness, and event_loop_delay means the gateway event loop is getting blocked/saturated.

I’d do this as a targeted quarantine, not deletion. Stop the gateway first so it does not rewrite sessions.json while you edit it, then run dry-run:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'
const fs = require("fs");
const path = require("path");

const storePath = process.env.STORE;
const hours = Number(process.env.HOURS || 12);
const apply = process.env.APPLY === "1";
const cutoff = Date.now() - hours * 60 * 60 * 1000;
const dir = path.dirname(storePath);
const stamp = new Date().toISOString().replace(/[:.]/g, "-");
const out = path.join(dir, `quarantine-recent-${hours}h-${stamp}`);

const store = JSON.parse(fs.readFileSync(storePath, "utf8"));
const selected = Object.entries(store).filter(([, e]) => {
  const ts = Number(e.updatedAt ?? e.lastInteractionAt ?? e.startedAt ?? 0);
  return Number.isFinite(ts) && ts >= cutoff;
});

console.log(`${apply ? "Applying" : "Dry-run"}: ${selected.length} sessions updated in last ${hours}h`);
for (const [key, e] of selected) console.log("-", key, e.sessionId || "", new Date(Number(e.updatedAt || 0)).toISOString());

if (!apply) process.exit(0);

fs.mkdirSync(out, { recursive: true });
fs.copyFileSync(storePath, path.join(out, "sessions.json.before"));

const files = new Set();
const names = new Set(fs.readdirSync(dir));
for (const [, e] of selected) {
  if (e.sessionFile && path.resolve(e.sessionFile).startsWith(path.resolve(dir) + path.sep)) files.add(path.resolve(e.sessionFile));
  if (e.sessionId) {
    for (const name of names) {
      if (name === `${e.sessionId}.jsonl` || name.startsWith(`${e.sessionId}.`)) files.add(path.join(dir, name));
    }
  }
}
for (const [key] of selected) delete store[key];
fs.writeFileSync(storePath, JSON.stringify(store, null, 2) + "
");

for (const file of files) {
  if (!fs.existsSync(file)) continue;
  fs.renameSync(file, path.join(out, path.basename(file)));
}

fs.writeFileSync(path.join(out, "sessions.quarantined.json"), JSON.stringify(Object.fromEntries(selected), null, 2) + "
");
console.log(`Moved index rows + ${files.size} transcript/artifact files to ${out}`);
NODE

If the dry-run list looks right, apply it:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 APPLY=1 node ./that-same-script.js

If you paste it directly instead of saving it, just rerun the same heredoc with APPLY=1 added before node.

For HTML export, this is read-only and gives you tweakable tags/classes:

node <<'NODE'
const fs = require("fs");
const path = require("path");

const dir = path.join(process.env.HOME, ".openclaw/agents/main/sessions");
const out = path.join(dir, "html-export");
fs.mkdirSync(out, { recursive: true });

function esc(v) {
  return String(v ?? "").replace(/[&<>"]/g, c => ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" }[c]));
}

for (const name of fs.readdirSync(dir).filter(n => n.endsWith(".jsonl"))) {
  const src = path.join(dir, name);
  const rows = fs.readFileSync(src, "utf8").split(/\n+/).filter(Boolean);
  const body = rows.map((line, i) => {
    let o;
    try { o = JSON.parse(line); } catch { o = { type: "raw", text: line }; }
    const role = o.role || o.type || o.kind || "event";
    const text = o.text || o.content || o.message || o.delta || JSON.stringify(o, null, 2);
    return `<article class="entry role-${esc(role)}" data-line="${i + 1}"><header>${esc(role)}</header><pre>${esc(text)}</pre></article>`;
  }).join("\n");
  fs.writeFileSync(path.join(out, name.replace(/\.jsonl$/, ".html")), `<!doctype html><meta charset="utf-8"><link rel="stylesheet" href="sessions.css"><main>
${body}
</main>`);
}

fs.writeFileSync(path.join(out, "sessions.css"), `body{font:14px/1.45 system-ui;margin:24px;max-width:1100px}.entry{border:1px solid #ddd;border-radius:6px;padding:10px;margin:10px 0}.entry header{font-weight:700;margin-bottom:6px}pre{white-space:pre-wrap;margin:0}.role-user{border-left:5px solid #2878ff}.role-assistant{border-left:5px solid #18a058}.role-tool,.role-function{border-left:5px solid #9a6bff}
`);
console.log("Wrote", out);
NODE

Two notes:

  • The 60k memory file being truncated to 12k is probably prompt/context trimming, not the file being damaged.
  • If quarantining recent sessions fixes it, keep the quarantine folder and file a GitHub issue with the diagnostics export rather than deleting the evidence: https://github.com/openclaw/openclaw/issues
low heath
#

That first script throws an error:

[stdin]:36
fs.writeFileSync(storePath, JSON.stringify(store, null, 2) + "
                                                             ^
Expected ',', got 'string literal'

SyntaxError: Invalid or unexpected token
    at makeContextifyScript (node:internal/vm:194:14)
    at compileScript (node:internal/process/execution:388:10)
    at evalTypeScript (node:internal/process/execution:260:22)
    at node:internal/main/eval_stdin:51:5
    at ReadStream.<anonymous> (node:internal/process/execution:205:5)
    at ReadStream.emit (node:events:508:28)
    at endReadableNT (node:internal/streams/readable:1729:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:90:21)

Node.js v24.14.0
faint compassBOT
# low heath That first script throws an error: ```zsh [stdin]:36 fs.writeFileSync(storePath,...

Yep, that one is on my script formatting. Discord mangled the "\n" string into a real line break inside the JS string.

Use this corrected version. It avoids newline escape strings entirely:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'
const fs = require("fs");
const path = require("path");
const os = require("os");

const storePath = process.env.STORE;
const hours = Number(process.env.HOURS || 12);
const apply = process.env.APPLY === "1";
const cutoff = Date.now() - hours * 60 * 60 * 1000;
const dir = path.dirname(storePath);
const stamp = new Date().toISOString().replace(/[:.]/g, "-");
const out = path.join(dir, "quarantine-recent-" + hours + "h-" + stamp);

const store = JSON.parse(fs.readFileSync(storePath, "utf8"));
const selected = Object.entries(store).filter(([, e]) => {
  const ts = Number(e.updatedAt ?? e.lastInteractionAt ?? e.startedAt ?? 0);
  return Number.isFinite(ts) && ts >= cutoff;
});

console.log((apply ? "Applying" : "Dry-run") + ": " + selected.length + " sessions updated in last " + hours + "h");
for (const [key, e] of selected) {
  const when = e.updatedAt ? new Date(Number(e.updatedAt)).toISOString() : "no updatedAt";
  console.log("-", key, e.sessionId || "", when);
}
if (!apply) process.exit(0);

fs.mkdirSync(out, { recursive: true });
fs.copyFileSync(storePath, path.join(out, "sessions.json.before"));

const files = new Set();
const names = new Set(fs.readdirSync(dir));
const dirResolved = path.resolve(dir) + path.sep;

for (const [, e] of selected) {
  if (e.sessionFile) {
    const resolved = path.resolve(e.sessionFile);
    if (resolved.startsWith(dirResolved)) files.add(resolved);
  }
  if (e.sessionId) {
    for (const name of names) {
      if (name === e.sessionId + ".jsonl" || name.startsWith(e.sessionId + ".")) {
        files.add(path.join(dir, name));
      }
    }
  }
}

for (const [key] of selected) delete store[key];

fs.writeFileSync(storePath, JSON.stringify(store, null, 2) + os.EOL);
fs.writeFileSync(
  path.join(out, "sessions.quarantined.json"),
  JSON.stringify(Object.fromEntries(selected), null, 2) + os.EOL
);

for (const file of files) {
  if (!fs.existsSync(file)) continue;
  fs.renameSync(file, path.join(out, path.basename(file)));
}

console.log("Moved index rows + " + files.size + " transcript/artifact files to " + out);
NODE

If the dry-run list looks right, apply with the exact same heredoc, just add APPLY=1 before node:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 APPLY=1 node <<'NODE'
# paste the same JS body here
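Since the quarantine is meant to be recoverable, here is a minimal sketch of the reverse step. It assumes the folder layout the script above produces (a `sessions.quarantined.json` file plus the moved transcript files inside the quarantine folder). The function name `restoreQuarantine` and the `QUARANTINE` environment variable are made up for this example and are not part of OpenClaw.

```javascript
// Sketch only: puts quarantined index rows and files back. Not an OpenClaw tool.
const fs = require("fs");
const path = require("path");
const os = require("os");

function restoreQuarantine(storePath, quarantineDir, apply = false) {
  const dir = path.dirname(storePath);
  const store = JSON.parse(fs.readFileSync(storePath, "utf8"));
  const rows = JSON.parse(
    fs.readFileSync(path.join(quarantineDir, "sessions.quarantined.json"), "utf8")
  );

  // Restore only keys that are absent now, so newer state is never overwritten.
  const keys = Object.keys(rows).filter((key) => !(key in store));

  // Bookkeeping files written by the quarantine script stay where they are.
  const skip = new Set(["sessions.json.before", "sessions.quarantined.json"]);
  const files = fs.readdirSync(quarantineDir).filter((name) => !skip.has(name));

  if (apply) {
    for (const key of keys) store[key] = rows[key];
    fs.writeFileSync(storePath, JSON.stringify(store, null, 2) + os.EOL);
    for (const name of files) {
      const target = path.join(dir, name);
      // Never clobber a file that already exists in the sessions folder.
      if (!fs.existsSync(target)) fs.renameSync(path.join(quarantineDir, name), target);
    }
  }
  return { keys, files };
}

// Hypothetical command-line use: STORE=... QUARANTINE=... [APPLY=1] node restore.js
if (process.env.STORE && process.env.QUARANTINE) {
  const apply = process.env.APPLY === "1";
  const result = restoreQuarantine(process.env.STORE, process.env.QUARANTINE, apply);
  console.log((apply ? "Restored" : "Dry-run") + ": " + result.keys.length + " rows, " + result.files.length + " files");
}
```

As with the quarantine itself, stop the gateway first so it does not rewrite sessions.json mid-edit, and run it without APPLY=1 once to see what it would restore.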
low heath
#

I /did/ save it to a file. I had no idea you meant .js - I had used .sh and sudo sh scriptname.sh - don't presume I am familiar with all terms and tools; most of it sounds absolute nonsense to me (and every time I look into something, it turns out that it actually /is/ nonsense, but on many more levels than initially may appear - but I digress).

Even heredoc fails here half the time because I'm supposed to know where every dot and comma and whitespace should be in 100+ languages, most of the time in languages that are specifically designed to be machine-readable but not intended to ever be read by humans such as json - but e.g. the OpenClaw documents and bots generally provide only INVALID json snippets for humans to manually correct before feeding it to the machines.

I also prefer /not/ to /have to/ become a domain expert in everything to have stuff "just work". I don't need to know even what language that script is in. I just need to know how to make it do what it does. A button with a clear label would be great, but unfortunately we seem to have to work on creating the future using the latest bleeding edge tech through mainly 1969 UX like typing commands in a fugly CLI.

So I guess that "NODE" at the end is the clue, and I should now run:

node scriptname.sh

It's still silly that I can't just DOUBLE-CLICK on that file in Finder in macOS to make it "just work".

Anyway, next error: <tbc>

#
STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'
                                                           ^^^^^

SyntaxError: Unexpected identifier 'HOURS'
    at wrapSafe (node:internal/modules/cjs/loader:1743:18)
    at Module._compile (node:internal/modules/cjs/loader:1786:20)
    at Object..js (node:internal/modules/cjs/loader:1943:10)
    at Module.load (node:internal/modules/cjs/loader:1533:32)
    at Module._load (node:internal/modules/cjs/loader:1335:12)
    at wrapModuleLoad (node:internal/modules/cjs/loader:255:19)
    at Module.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:154:5)
    at node:internal/main/run_main_module:33:47
faint compassBOT
# low heath I /did/ save it to a file. I had no idea you meant `.js` - I had used .`sh` and ...

You’re right. I assumed too much there, and the that-same-script.js wording was especially unhelpful. Sorry.

Short version:

  • Do not use sudo here. It can change $HOME to root’s home and point at the wrong OpenClaw files.
  • The file extension does not matter much. What matters is which program runs it.

If your saved file starts with this kind of line:

const fs = require("fs");

run it with:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node scriptname.sh

To apply after the dry-run:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 APPLY=1 node scriptname.sh

If your saved file starts with this kind of line:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'

then it is a shell wrapper and should be run with:

bash scriptname.sh

To apply, edit the first line to include APPLY=1, like:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 APPLY=1 node <<'NODE'

and run:

bash scriptname.sh

The simplest path from where you are is probably: save only the JavaScript body, starting at const fs = require("fs");, then use the node scriptname.sh commands above. The .sh filename is ugly but fine.

faint compassBOT
# low heath ```zsh STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node ...

That error means the file is being run by node, but the first line in the file is a shell line:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'

So for that exact saved file, run it with bash, not node:

bash scriptname.sh

To apply it, edit the first line of the file from:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'

to:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 APPLY=1 node <<'NODE'

then run:

bash scriptname.sh

Mental model, without jargon: the file you saved is a tiny shell script that starts Node and feeds JavaScript into it. If you run that whole file with Node directly, Node tries to read the shell wrapper as JavaScript and blows up exactly where you saw it.

Also: no sudo.
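The "which program runs it" distinction above can be checked mechanically. The sketch below guesses the right runner from the first non-empty line of a saved file. `guessRunner` is a hypothetical helper written for this thread, and its rules are rough heuristics rather than a real file-type detector.

```javascript
// Sketch: guess which program should run a saved script from its first non-empty line.
function guessRunner(source) {
  const first = source.split(/\r?\n/).find((line) => line.trim() !== "") || "";

  // A heredoc opener such as `... node <<'NODE'` means the file is a shell wrapper.
  if (/<<-?\s*['"]?\w+['"]?\s*$/.test(first)) return "shell";

  // A shebang line names the interpreter directly.
  if (first.startsWith("#!")) return /node/.test(first) ? "node" : "shell";

  // Typical ways a plain JavaScript file starts.
  if (/^\s*(const|let|var|import|function|\/\/|["']use strict["'])/.test(first)) return "node";

  return "unknown";
}

console.log(guessRunner('const fs = require("fs");'));
```

"shell" here means any sh-compatible interpreter: `bash scriptname.sh` and `zsh scriptname.sh` both handle the heredoc wrapper, so the macOS default shell is not a concern.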

low heath
#

Re: "Do not use sudo here." - well, otherwise it did not work at all.

Re: "
Of course not - it's exactly the first line that /you/ pasted above for me to use (why would I even change it?):

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 node <<'NODE'

(3 more pages of text)
TLDR: I said I do NOT want to become a domain expert in everything.

faint compassBOT
# low heath Re: "Do not use sudo here." - well, otherwise it did not work at all. Re: " Of ...

You’re right, and I’m sorry. I gave you a pasteable block but then talked about it like you were expected to classify it. That’s bad support.

Do exactly this:

bash scriptname.sh

Not node scriptname.sh.

If it only works with sudo, your session files are probably owned by root now from an earlier sudo run. Fix the ownership once:

sudo chown -R "$USER:$(id -gn)" "$HOME/.openclaw/agents/main/sessions"

Then run the script without sudo:

bash scriptname.sh

To apply instead of dry-run, change only the first line by adding APPLY=1 before node:

STORE="$HOME/.openclaw/agents/main/sessions/sessions.json" HOURS=12 APPLY=1 node <<'NODE'

Then:

bash scriptname.sh

And yes: you should not have to become a shell/Node/JSON expert for this. This is exactly the kind of recovery task that should be exposed as a clear OpenClaw command or UI action.

low heath
#

Re: "run it with bash" - did you know when bash STOPPED being the default shell for macOS? See, I /am/ a domain expert in seeing that kind of slop coming a mile away. I would bet good money that you would also paste invalid JSON when not warned explicitly NOT to do so, and so on.

And there is no such thing as "a clear OpenClaw command or UI action." :p

Also: the clean(er) solution here would already have had "APPLY=0" in the initial version.

Also of note: I have now just updated to v2026.5.14-beta.1 (from v2026.5.12). There are many fixes in areas that have been plaguing me and many other crustacean aficionados in recent weeks, keeping claws crossed that this takes care of a few more (still many, many to go… but we'll get there, even if sideways and with tiny steps - we'll just keep marching like crazy).

Now (just before updating OpenClaw) the script worked as expected:

Applying: 2 sessions updated in last 12h
- agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a bd7ea47d-daaf-430e-b2d6-bb5844a962da 2026-05-15T04:04:29.933Z
- agent:main:main ced4cfb0-aaeb-471a-a607-f35241cd3b99 2026-05-15T03:45:37.244Z
faint compassBOT
# low heath Re: "run it with bash" - did you know when `bash` STOPPED being the default shel...

Fair hit. macOS switched the default interactive shell to zsh years ago; bash scriptname.sh was meant as “use the bash interpreter for this file,” not “your shell is bash,” but I made the whole path too fiddly. APPLY=0 in the first version would also have been clearer.

Good: that output means it quarantined exactly the two recent rows:

  • agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a
  • agent:main:main

After the 2026.5.14-beta.1 update, I’d do the boring verification pass:

openclaw gateway restart
openclaw status --deep
openclaw sessions --agent main --active 720 --limit 20

If the liveness/stuck-session warning comes back on beta.1, keep the quarantine folder and run:

low heath
#

I still got a "liveness warning: reasons=event_loop_delay" error:

liveness warning: reasons=event_loop_delay interval=37s eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=11467.2 eventLoopUtilization=0.346 cpuCoreRatio=0.381 active=1 waiting=0 queued=1 recentPhases=sidecars.main-session-recovery:4ms,sidecars.restart-sentinel:22ms,post-attach.update-sentinel:9ms,sidecars.session-locks:300ms,sidecars.model-prewarm:623ms,post-ready.maintenance:33ms work=[active=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a(processing,q=1,age=13s) queued=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a(processing,q=1,age=13s)]

Then my agent responded: ONCE.

As happened before when I wiped everything and rebooted, and then: nothing. Endless spinning; on reload: nothing, again. My last sent message = GONE. Activity indicator still spinning (new in UI, probably related to codex server, I guess).
At this point (as usual), I get these types of errors:

stalled session: sessionId=8817bd7d-38e3-4f28-9bdf-e48d9770ae31 sessionKey=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a state=processing age=146s queueDepth=1 reason=active_work_without_progress classification=stalled_agent_run activeWorkKind=embedded_run lastProgress=codex_app_server:notification:thread/tokenUsage/updated lastProgressAge=141s recovery=none
faint compassBOT
# low heath I still got a "liveness warning: reasons=event_loop_delay" error: ``` liveness w...

That log is important: this is no longer pointing at the old files you quarantined. It is the same session key pattern, but a new active run:

sessionId=8817bd7d-38e3-4f28-9bdf-e48d9770ae31
sessionKey=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a
classification=stalled_agent_run
activeWorkKind=embedded_run
lastProgress=codex_app_server:notification:thread/tokenUsage/updated

So the likely failure is: the Codex app-server harness accepts the turn, emits token usage once, then stops producing terminal/progress events, leaving OpenClaw’s per-session lane stuck. The event_loop_delay warning is probably a symptom around that wedged active run, not the root cause.
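To pull those fields out of a log line without reading it by eye, a small parser is enough. This is a sketch that assumes the `name: key=value key=value` shape seen in the stalled-session line pasted above; `parseDiagnosticLine` is a hypothetical helper, not an OpenClaw API. It does not handle the bracketed `work=[...]` section of the liveness line, which contains spaces.

```javascript
// Sketch: split a diagnostics line into its label and key=value fields.
function parseDiagnosticLine(line) {
  const colon = line.indexOf(": ");
  const kind = colon === -1 ? "" : line.slice(0, colon);
  const rest = colon === -1 ? line : line.slice(colon + 2);

  const fields = {};
  for (const token of rest.split(/\s+/)) {
    const eq = token.indexOf("=");
    // Split on the first "=" only; values such as session keys contain ":" but no spaces.
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return { kind, fields };
}

const sample =
  "stalled session: sessionId=8817bd7d-38e3-4f28-9bdf-e48d9770ae31 " +
  "sessionKey=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a " +
  "state=processing age=146s queueDepth=1 reason=active_work_without_progress " +
  "classification=stalled_agent_run activeWorkKind=embedded_run " +
  "lastProgress=codex_app_server:notification:thread/tokenUsage/updated " +
  "lastProgressAge=141s recovery=none";

const parsed = parseDiagnosticLine(sample);
console.log(parsed.kind, parsed.fields.classification, parsed.fields.lastProgressAge);
```

The fields worth quoting in a bug report are `classification`, `activeWorkKind`, `lastProgress`, `lastProgressAge`, and `recovery`.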

For immediate recovery, I’d do this in the affected dashboard chat first:

/queue interrupt

If that does not release it:

/reset

Then restart the gateway. After restart, verify with:

openclaw status --deep
openclaw sessions --agent main --active 720 --limit 20

For the bug report, this is now good evidence. Please include:

  • that you updated from 2026.5.12 to 2026.5.14-beta.1
  • the liveness line
  • the stalled-session line
  • that quarantining the two recent session rows did not prevent a newly-created dashboard Codex run from wedging
  • diagnostics export:
openclaw gateway diagnostics export

Issue link: https://github.com/openclaw/openclaw/issues

Docs backing the interpretation:
https://docs.openclaw.ai/concepts/queue
https://docs.openclaw.ai/plugins/codex-harness
https://docs.openclaw.ai/gateway/diagnostics

If you need a temporary workaround while this is investigated, test whether the same dashboard chat works when not using the native Codex harness. The docs call that out as the runtime split: Codex app-server is the codex runtime; direct OpenAI API traffic goes through pi. I’m not going to hand-wave the exact config for your setup without seeing it, but the relevant docs are here: https://docs.openclaw.ai/concepts/agent-runtimes

low heath
#

That warning message was (more or less, minor variations e.g. response times) repeated 17 times, then two more, then agent DID answer again. The last warning diagnostic messages:

stuck session recovery: sessionId=8817bd7d-38e3-4f28-9bdf-e48d9770ae31 sessionKey=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a age=626s action=abort_embedded_run aborted=true drained=true released=0

stuck session recovery outcome: status=aborted action=abort_embedded_run sessionId=8817bd7d-38e3-4f28-9bdf-e48d9770ae31 sessionKey=agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a activeSessionId=8817bd7d-38e3-4f28-9bdf-e48d9770ae31 activeWorkKind=embedded_run lane=session:agent:main:dashboard:bd4d3e53-ad6e-4e5a-af2d-f17d3ae7941a aborted=true drained=true forceCleared=false released=0

Let's see if the agent now stays responding, at least.

faint compassBOT
# low heath That warning message was (more or less, minor variations e.g. response times) re...

That recovery output is actually useful: OpenClaw eventually did what the newer diagnostics are supposed to do. It classified the run as stalled, aborted the embedded Codex run, drained it, and the lane became usable enough for the queued turn to answer.

The bad part is the delay: age=626s means you waited ~10 minutes before recovery. That lines up with the documented default behavior: stalled embedded runs only get abort-drained after the safer extended abort window, not at the first liveness warning.

If this keeps happening, I’d file the issue even if it “recovers,” because the root bug is still “Codex app-server turn accepts work, emits token usage, then stops producing progress/terminal events.” The recovery working after 10 minutes is mitigation, not a fix.

Temporary knob, if you want faster recovery while debugging:

{
  diagnostics: {
    stuckSessionWarnMs: 30000,
    stuckSessionAbortMs: 120000,
  },
}
low heath
#

I assume that "temporary knob" is for openclaw.json? Very welcome indeed.

And see, I was right on the money there:

I would bet good money that you would also paste invalid JSON when not warned explicitly NOT to do so, and so on.

You just pasted INVALID json, dear Krill. 🙂

faint compassBOT
# low heath I assume that "temporary knob" is for `openclaw.json`? Very welcome indeed. And...

Yep, caught. The file is ~/.openclaw/openclaw.json, and OpenClaw parses it as JSON5 despite the .json name. My previous block was valid JSON5, not valid JSON. That distinction matters when humans are copying things by hand.

Valid strict JSON version:

{
  "diagnostics": {
    "stuckSessionWarnMs": 30000,
    "stuckSessionAbortMs": 120000
  }
}

If your existing openclaw.json already has other top-level keys, do not replace the whole file with that. Add just this top-level "diagnostics" block next to the existing ones. In strict JSON, every key needs quotes, and commas go between sibling keys.
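The difference between the two snippets, and the "add it next to the existing keys" advice, can be seen with Node's built-in strict JSON parser. This is an illustrative sketch: `isStrictJson` and `mergeDiagnostics` are hypothetical helpers, and `someOtherKey` stands in for whatever your config already contains.

```javascript
// The unquoted snippet from earlier: acceptable as JSON5, rejected by a strict JSON parser.
const json5Style = "{ diagnostics: { stuckSessionWarnMs: 30000, stuckSessionAbortMs: 120000, }, }";
const strictJson = '{ "diagnostics": { "stuckSessionWarnMs": 30000, "stuckSessionAbortMs": 120000 } }';

function isStrictJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Add or update the diagnostics block without dropping any other top-level key.
function mergeDiagnostics(config, patch) {
  return { ...config, diagnostics: { ...(config.diagnostics || {}), ...patch } };
}

console.log(isStrictJson(json5Style), isStrictJson(strictJson));

// "someOtherKey" is a placeholder for existing settings.
const merged = mergeDiagnostics({ someOtherKey: true }, JSON.parse(strictJson).diagnostics);
console.log(JSON.stringify(merged, null, 2));
```

Printing the merged object with `JSON.stringify` always yields strict JSON, which sidesteps hand-placed quotes and commas.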

Safer CLI way, no manual comma surgery:

openclaw config set diagnostics.stuckSessionWarnMs 30000
openclaw config set diagnostics.stuckSessionAbortMs 120000
openclaw config validate
openclaw gateway restart

Docs confirming the config path/format: https://docs.openclaw.ai/gateway/configuration
Diagnostics keys: https://docs.openclaw.ai/gateway/configuration-reference#diagnostics

And yes: point taken. I’ll be explicit about “JSON5, not strict JSON” when giving config snippets.

low heath
#

I don't even know the difference between those dialects, it is a machine-only language as far as I'm concerned. But whatever format, it is more forgiving than OpenClaw. Taking longer than a second to add a comma? BAM, file """clobbered""" , start over.

30s and 120s still sounds like WAAAAY too long, though. 15s and 30s is more like it, I think.

faint compassBOT
# low heath I don't even know the difference between those dialecs, it is a machine-only lan...

I checked the code this time.

Yes, the knob is in ~/.openclaw/openclaw.json. OpenClaw accepts low explicit values:

  • stuckSessionWarnMs minimum is 1000 ms.
  • stuckSessionAbortMs is clamped to be at least the warn value.
  • The ~10 minute default only applies when stuckSessionAbortMs is unset.

So 15s / 30s is valid.
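Those three rules can be written down as a small function. This is a sketch of the rules as described in this message, not OpenClaw's actual code: the function name is hypothetical, and the 600000 ms constant is an assumption standing in for the "~10 minute" default.

```javascript
// Sketch of the clamping rules described above. Not taken from the OpenClaw source.
const MIN_WARN_MS = 1000;
// Assumption: approximates the "~10 minute" default used when stuckSessionAbortMs is unset.
const ASSUMED_DEFAULT_ABORT_MS = 10 * 60 * 1000;

function effectiveThresholds({ stuckSessionWarnMs, stuckSessionAbortMs }) {
  // Warn threshold cannot go below the stated minimum.
  const warnMs = Math.max(MIN_WARN_MS, stuckSessionWarnMs);
  // Abort threshold is never lower than the warn threshold.
  const requested = stuckSessionAbortMs === undefined ? ASSUMED_DEFAULT_ABORT_MS : stuckSessionAbortMs;
  const abortMs = Math.max(warnMs, requested);
  return { warnMs, abortMs };
}

console.log(effectiveThresholds({ stuckSessionWarnMs: 15000, stuckSessionAbortMs: 30000 }));
console.log(effectiveThresholds({ stuckSessionWarnMs: 15000 }));
```

Under these rules 15s / 30s passes through unchanged, while leaving the abort value unset falls back to the long default.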

Safer command route:

openclaw config set diagnostics.stuckSessionWarnMs 15000
openclaw config set diagnostics.stuckSessionAbortMs 30000
openclaw config validate
low heath
#

Also: defining this in ms is nuts. Just typing extra 000s everywhere as if 0 means nothing. :p

low heath
#

Better: if you see "stalled session:" twice in a row, abort it.

faint compassBOT
# low heath Better: if you see "stalled session:" twice in a row, abort it.

That’s a better product shape than making people tune milliseconds.

Implementation-wise I’d phrase it as something like:

“Abort a stalled embedded run after N consecutive stalled diagnostics for the same session generation, with no progress between samples.”

The “same session generation” / “no progress between samples” bit matters so it does not abort a slow run that briefly looked stalled but then resumed emitting progress. For your case, N=2 would have cut the wait from ~10 minutes to roughly one extra diagnostic interval.
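A minimal sketch of that rule, to make the two guard conditions concrete. Everything here is hypothetical and describes the proposed behavior, not anything OpenClaw currently does: each sample is assumed to carry a `generation` id, a `stalled` flag, and a `lastProgress` marker.

```javascript
// Sketch of the proposed heuristic: abort after n consecutive stalled samples
// for the same session generation with no progress between them.
// samples are ordered oldest first: { generation, stalled, lastProgress }
function shouldAbort(samples, n = 2) {
  if (samples.length < n) return false;
  const recent = samples.slice(-n);
  const first = recent[0];
  return recent.every(
    (s) =>
      s.stalled &&
      s.generation === first.generation && // same run, not a newly started one
      s.lastProgress === first.lastProgress // nothing new was emitted in between
  );
}

const wedged = [
  { generation: 1, stalled: true, lastProgress: "tokenUsage/updated" },
  { generation: 1, stalled: true, lastProgress: "tokenUsage/updated" },
];
const slowButAlive = [
  { generation: 1, stalled: true, lastProgress: "tokenUsage/updated" },
  { generation: 1, stalled: true, lastProgress: "some-later-event" },
];

console.log(shouldAbort(wedged), shouldAbort(slowButAlive));
```

The second case is the one the guard protects: the run looked stalled twice, but its progress marker changed, so it is left alone.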

Worth filing as a bug/feature request alongside the logs: https://github.com/openclaw/openclaw/issues

low heath
#

Now I get these intermixed with the "stalled session: " warnings, taking turns, one each, then the other, many times over - then finally the gateway did restart 'automatically':

gateway
still draining 2 active task(s) and 1 active embedded run(s) before restart
faint compassBOT
# low heath Now I get these intermixed with the "stalled session: " warnings, taking turns, ...

That line is the restart drain guard. On restart, OpenClaw tries not to kill in-flight work immediately; it marks the gateway as draining, rejects new work, and waits for active tasks / embedded runs to finish. The warning repeats every ~30s while it is waiting.

In your case it is almost certainly waiting on the same wedged Codex embedded run, so the restart path is also getting held hostage by the stalled run until recovery/timeout catches up.

Two practical knobs:

openclaw config set diagnostics.stuckSessionWarnMs 15000
openclaw config set diagnostics.stuckSessionAbortMs 30000
openclaw config set gateway.reload.deferralTimeoutMs 45000
openclaw config validate
openclaw gateway restart

What those do:

  • diagnostics.*: abort stalled embedded runs faster.
  • gateway.reload.deferralTimeoutMs: do not let a restart wait the default 300000ms / 5 minutes for active work to drain.
low heath
#

I have asked Codex CLI to analyse this from its end as well now. Codex is still perfectly responsive, as expected. And we had a LOT of changes in recent updates - but my problems predated some of the latest and biggest changes. I think these problems started around the same time when OpenClaw asked me to import stuff from Codex - it just might be related.

faint compassBOT
# low heath I have asked Codex CLI to analyse this from its end as well now. Codex is still ...

That timing is plausible enough to investigate. Codex CLI being responsive does not rule it out, because OpenClaw’s native Codex path is not just “run your Codex CLI.” It launches Codex app-server with a per-agent CODEX_HOME under OpenClaw state and may use migrated Codex skills/plugins/native app config.

Relevant state path from the docs:

~/.openclaw/agents/main/agent/codex-home/

I’d check what Codex migration actually enabled:

openclaw config get plugins.entries.codex
openclaw doctor
openclaw migrate codex --dry-run

If codexPlugins is enabled, I’d test with native Codex plugins disabled temporarily: