#Agents stop mid work

crystal warren
#

More and more often, my agents just stop mid-work. I have to send a message to wake them up so they continue. I am using v2026.04.26.

spice fableBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

austere gorgeBOT
#

That sounds like a real stuck-run issue, not normal behavior. 2026.4.26 is currently the latest GitHub release I see, so first thing is to capture the stuck state before nudging it again. (github.com)

When it happens, please grab:

openclaw status --all
openclaw tasks list --json
openclaw tasks audit --json
openclaw logs --plain --limit 300

In the logs, look especially for:

stopReason: "length"
finish_reason: length
approval-timeout
An async command did not run
LLM request timed out
model stream idle
Provider in cooldown
fetch failed
processing

There are a few known “looks frozen until I poke it” patterns:

  • output token limit / stopReason: length can silently strand work; there’s an existing issue describing exactly that kind of freeze. (github.com)
  • async/approval mismatch can make the agent say “I’ll do it now” but not actually continue; there’s also a similar regression report with approval-timeout / An async command did not run as useful log strings. (github.com)
  • local or slow model backends can idle-timeout mid-run; for those, prefer increasing models.providers.<id>.timeoutSeconds before touching the whole agent timeout. Docs: https://docs.openclaw.ai/concepts/agent-loop

If the logs show one of those signatures, file it with the captured output here: https://github.com/openclaw/openclaw/issues — especially since you’re already on latest.

GitHub

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞 - Issues · openclaw/openclaw

#

Short-term workaround: try splitting very large tasks into phases/subagents, and if you’re on local/OpenAI-compatible models, raise the provider timeout for that provider.
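
To see which of the log signatures listed above actually occur, a small tally over a saved log can help. This is only a sketch: it assumes the log was first written to a plain-text file (for example by redirecting `openclaw logs --plain --limit 300` into `openclaw-logs.txt`), and `count_signatures` is a hypothetical helper name, not an OpenClaw command.

```shell
# Tally how often each stall signature appears in a saved log file.
# Assumption: $1 is a plain-text log file saved beforehand.
count_signatures() {
  local logfile="$1"
  local sig
  for sig in \
    'stopReason: "length"' \
    'finish_reason: length' \
    'approval-timeout' \
    'An async command did not run' \
    'LLM request timed out' \
    'model stream idle' \
    'Provider in cooldown' \
    'fetch failed' \
    'processing'
  do
    # grep -c counts matching lines; -F treats the signature as a literal string.
    printf '%5d  %s\n' "$(grep -cF -- "$sig" "$logfile")" "$sig"
  done
}
```

Usage would be `count_signatures openclaw-logs.txt`. A non-zero count next to one signature is a hint about which pattern to look at first, not a diagnosis by itself.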

crystal warren
#

openclaw status --all
What I think is important in the log:

#

Gateway: bind=loopback (127.0.0.1), port=18789 (service args)
Probe target: ws://127.0.0.1:18789

austere gorgeBOT
#

Looks like the actual log snippet didn’t come through — Discord may have eaten it.

Please paste either as a code block or attach a .txt file with:

openclaw status --all
openclaw tasks list --json
openclaw tasks audit --json
openclaw logs --plain --limit 300

Most useful: the log lines right when the agent stops, plus the timestamp/message where you “wake” it again. status --all is designed to redact tokens, but still skim for anything private before posting.

#

That part looks normal/healthy: loopback 127.0.0.1:18789 just means the Gateway is listening locally and the CLI probes it over local WebSocket. It doesn’t explain the agent stopping.

The useful parts will be lower in status --all / logs — especially Tasks, Sessions, and the log lines around the stop. Please paste as a code block or attach .txt with:

openclaw status --all
openclaw tasks list --json
openclaw tasks audit --json
openclaw logs --plain --limit 300

If it’s huge, search/filter for these and paste matches:

openclaw logs --plain --limit 1000 | grep -Ei 'timeout|idle|stopReason|length|approval|processing|stuck|error|cooldown|abort|disconnect'
crystal warren
#

Listening: [::1]:18789
Node service │ systemd not installed

Gateway restart attempts (tail): /home/najef/.openclaw/logs/gateway-restart.log
[2026-04-23T20:38:11Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-23T20:38:11Z] openclaw restart done source=update
[2026-04-25T09:24:22Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-25T09:24:22Z] openclaw restart done source=update
[2026-04-25T20:49:39Z] openclaw restart attempt source=update target=openclaw-gateway.service

[2026-04-28T20:40:09Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-28T20:40:09Z] openclaw restart done source=update
! Channel issues skipped (gateway unreachable)

Gateway health:
timeout

Pasteable debug report. Auth tokens redacted.

openclaw tasks list --json is really long; a quick scan only shows crontabs timing out.

  "taskId": "662599a4-e33c-4739-bd0f-d40cb2090076",
  "runtime": "cron",
  "sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
  "requesterSessionKey": "",
  "ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
  "scopeKind": "system",
  "childSessionKey": "agent:neon:telegram:direct:2101884310",
  "agentId": "neon",
  "runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1776852000027",
  "label": "Orion Heartbeat",
  "task": "Orion Heartbeat",
  "status": "timed_out",
  "deliveryStatus": "not_applicable",
  "notifyPolicy": "silent",
  "createdAt": 1776852000027,
  "startedAt": 1776852000027,
  "endedAt": 1776852120036,
  "lastEventAt": 1776852120036,
  "cleanupAfter": 1777456920036,
  "error": "cron: job execution timed out"
},
#

openclaw tasks audit --json

{
  "kind": "task",
  "severity": "warn",
  "code": "lost",
  "detail": "backing session missing",
  "ageMs": 294423004,
  "status": "lost",
  "token": "639b39ff-d575-45dd-91b9-15fa04724bad",
  "task": {
    "taskId": "639b39ff-d575-45dd-91b9-15fa04724bad",
    "runtime": "cron",
    "sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
    "requesterSessionKey": "",
    "ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
    "scopeKind": "system",
    "childSessionKey": "agent:neon:telegram:direct:2101884310",
    "agentId": "neon",
    "runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1777159870461",
    "label": "Orion Heartbeat",
    "task": "Orion Heartbeat",
    "status": "lost",
    "deliveryStatus": "not_applicable",
    "notifyPolicy": "silent",
    "createdAt": 1777159870461,
    "startedAt": 1777159870461,
    "endedAt": 1777160193868,
    "lastEventAt": 1777160193868,
    "cleanupAfter": 1777764993868,
    "error": "backing session missing"
austere gorgeBOT
#

That’s a much better clue. The key problem is not the cron entry itself — it’s this:

! Channel issues skipped (gateway unreachable)
Gateway health:
  timeout
Listening: [::1]:18789

So the Gateway has something listening, but the health/RPC probe is timing out. That can absolutely look like “agents stop until I poke them”.

Two things to check next:

openclaw health --verbose --timeout 30000
curl -m 5 http://127.0.0.1:18789/healthz
curl -g -m 5 'http://[::1]:18789/healthz'

If [::1] works but 127.0.0.1 times out/fails, it’s likely an IPv6-vs-IPv4 loopback bind mismatch. In that case try forcing IPv4 loopback:

openclaw config set gateway.bind custom
openclaw config set gateway.customBindHost 127.0.0.1
openclaw gateway restart
openclaw health --verbose

The cron timeout you pasted is also useful:

#
label: Orion Heartbeat
status: timed_out
error: cron: job execution timed out

It timed out after ~120s. If that heartbeat is expected to do real work, bump it:

openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
openclaw cron edit 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --timeout-seconds 600

But first I’d focus on the Gateway health timeout. Please paste the output of the three health/curl commands above — that’ll tell us whether this is bind mismatch vs gateway event-loop wedged.
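
The "~120s" figure can be checked from the pasted task record itself: `startedAt` and `endedAt` are epoch milliseconds, so their difference is the run duration. A minimal sketch using only the values from the paste:

```shell
# Values copied from the pasted task record (epoch milliseconds).
started_at=1776852000027
ended_at=1776852120036

# Duration of the timed-out cron run.
elapsed_ms=$(( ended_at - started_at ))
echo "elapsed: ${elapsed_ms} ms (~$(( elapsed_ms / 1000 )) s)"
```

This prints `elapsed: 120009 ms (~120 s)`, which suggests the job hit a limit of roughly two minutes. Whether that limit is a default or a configured value is not shown in the thread.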

#

Yep — that audit output is useful, but it’s mostly a symptom/history marker, not the stuck run itself.

lost / backing session missing means the task ledger has a cron task, but the backing child session (agent:neon:telegram:direct:2101884310) is gone. For old retained tasks this is a warning until cleanup.

Run:

openclaw tasks maintenance --json
openclaw tasks maintenance --apply

Then check the actual cron:

openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
openclaw cron runs --id 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --limit 20
#

If Orion Heartbeat keeps timing out/lost, temporarily disable it to see if normal agents stop freezing:

openclaw cron edit 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --disable
openclaw gateway restart

But the bigger red flag is still:

Gateway health: timeout
Listening: [::1]:18789

So I’d prioritize confirming whether the gateway is wedged or only bound weirdly:

#
openclaw health --verbose --timeout 30000
curl -m 5 http://127.0.0.1:18789/healthz
curl -g -m 5 'http://[::1]:18789/healthz'

If health still times out, paste that plus:

openclaw logs --plain --limit 500 | grep -Ei '639b39ff|Orion Heartbeat|backing session|timeout|stuck|processing|health|error'
crystal warren
#

How do I upload a log file? The .txt is too much to copy-paste.

austere gorgeBOT
#

Easiest: save it to a file, then drag/drop the .txt into this Discord thread or use the + / upload file button.

Good commands:

openclaw status --all > openclaw-status.txt
openclaw tasks audit --json > openclaw-tasks-audit.json
openclaw logs --plain --limit 1000 > openclaw-logs.txt

Then upload those files here.

If Discord says it’s too large, make a smaller filtered log:

openclaw logs --plain --limit 3000 \
  | grep -Ei 'Orion Heartbeat|backing session|timeout|stuck|processing|health|error|lost|cron|abort|disconnect' \
  > openclaw-filtered-logs.txt

Please skim for anything private before uploading. status --all redacts auth tokens, but logs can still include paths, usernames, message text, etc.

crystal warren
#

I cannot drag and drop, and the upload file button does not exist.

austere gorgeBOT
#

No worries — we can do it in small pasted chunks.

Run this and paste the output:

openclaw health --verbose --timeout 30000 2>&1
curl -m 5 http://127.0.0.1:18789/healthz 2>&1
curl -g -m 5 'http://[::1]:18789/healthz' 2>&1

Then this filtered log, limited enough for Discord:

openclaw logs --plain --limit 3000 \
  | grep -Ei 'Orion Heartbeat|backing session|timeout|stuck|processing|health|error|lost|cron|abort|disconnect' \
  | tail -80
#

If that’s still too much, use:

... | tail -40

Main thing I need next is: do the two curl /healthz commands work, or do they timeout? That tells us if this is IPv4/IPv6 bind weirdness or a wedged Gateway.
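
The decision described above (bind mismatch versus wedged Gateway) can be sketched as a small script around the two curl probes. The exit codes are curl's own (0 success, 7 could not connect, 28 timed out by `-m`); the function names `classify_health` and `probe_gateway` are made up for illustration, and the wording of each diagnosis is a rough reading, not something OpenClaw reports.

```shell
# Map the exit codes of the two curl probes to a rough diagnosis.
# curl exits 0 on success, 7 when it cannot connect, 28 when -m expires.
classify_health() {
  local v4="$1" v6="$2"
  if [ "$v4" -eq 0 ] && [ "$v6" -eq 0 ]; then
    echo "both loopbacks answer: bind looks fine"
  elif [ "$v4" -ne 0 ] && [ "$v6" -eq 0 ]; then
    echo "only [::1] answers: possible IPv4/IPv6 bind mismatch"
  elif [ "$v4" -eq 0 ] && [ "$v6" -ne 0 ]; then
    echo "only 127.0.0.1 answers: IPv6 loopback not served"
  elif [ "$v4" -eq 28 ] || [ "$v6" -eq 28 ]; then
    echo "probes time out: Gateway may be wedged or overloaded"
  else
    echo "nothing answers: Gateway may not be running"
  fi
}

# Run both probes from the thread and classify the result.
probe_gateway() {
  local v4=0 v6=0
  curl -s -o /dev/null -m 5 http://127.0.0.1:18789/healthz || v4=$?
  curl -s -o /dev/null -g -m 5 'http://[::1]:18789/healthz' || v6=$?
  classify_health "$v4" "$v6"
}
```

Calling `probe_gateway` on the server runs both probes. Without `-f`, curl treats any HTTP response as success, so this only checks that something answers within five seconds.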

crystal warren
#

openclaw logs --plain --limit 1000 | grep -Ei 'timeout|idle|stopReason|length|approval|processing|stuck|error|cooldown|abort|disconnect'

-info plugins {"subsystem":"plugins"} active-memory: agent=arnold session=agent:arnold:telegram:direct:2101884310 activeProvider=minimax activeModel=MiniMax-M2.5 start timeoutMs=15000 queryChars=808
-info plugins {"subsystem":"plugins"} active-memory: agent=arnold session=agent:arnold:telegram:direct:2101884310 activeProvider=minimax activeModel=MiniMax-M2.5 done status=timeout elapsedMs=16511 summaryChars=0
-info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1006 reason=n/a conn=a45cc6f8-a0f2-4bde-b412-67eb51a0c3c2
-warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_failover_decision","tags":["error_handling","failover","assistant","surface_error"],"runId":"active-memory-moju70jx-b2082d40","stage":"assistant","decision":"surface_error","failoverReason":"timeout","profileFailureReason":"timeout","provider":"minimax","model":"MiniMax-M2.5","sourceProvider":"minimax","sourceModel":"MiniMax-M2.5","profileId":"sha256:d347a6986db6","fallbackConfigured":true,"timedOut":true,"aborted":true} embedded run failover decision
-error telegram/api {"subsystem":"telegram/api"} telegram sendChatAction failed: Network request for 'sendChatAction' failed!
-info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1006 reason=n/a conn=849b92d7-abd0-413a-9eda-b86ede36b304
-info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1006 reason=n/a conn=0dd1de89-cd2b-46af-bef4-5529d059cf45
-error (node:377477) ExperimentalWarning: SQLite is an experimental feature and might change at any time
-warn gateway/ws {"subsystem":"gateway/ws"} handshake timeout conn=8e95457f-cb2a-4ba3-ba65-1762feeac639 peer=127.0.0.1:55550->127.0.0.1:18789 remote=127.0.0.1

#

-warn gateway/ws {"subsystem":"gateway/ws"} {"cause":"handshake-timeout","handshake":"failed","durationMs":37191,"lastFrameType":"req","lastFrameMethod":"connect","lastFrameId":"c219e03f-74ab-4059-b95a-799ee3196628","host":"openclaw.tail967935.ts.net","origin":"https://openclaw.tail967935.ts.net","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36 Edg/147.0.0.0","forwardedFor":"100.108.130.102","remoteAddr":"127.0.0.1","remotePort":55550,"localAddr":"127.0.0.1","localPort":18789,"endpoint":"127.0.0.1:55550->127.0.0.1:18789","handshakeMs":10214} closed before connect conn=8e95457f-cb2a-4ba3-ba65-1762feeac639 peer=127.0.0.1:55550->127.0.0.1:18789 remote=127.0.0.1 fwd=100.108.130.102 origin=https://openclaw.tail967935.ts.net host=openclaw.tail967935.ts.net ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36 Edg/147.0.0.0 code=1000 reason=n/a
-warn bundle-mcp {"subsystem":"bundle-mcp"} failed to start server "minimax" (uvx minimax-coding-plan-mcp -y): Error: MCP server connection timed out after 30000ms
-error [tools] edit failed: Could not find the exact text in /home/najef/.openclaw/agents/arnold/workspace/memory/2026-04-29.md. The old text must match exactly including all whitespace and newlines.

austere gorgeBOT
#

This log points at two likely culprits:

  1. Active Memory is timing out

    active-memory ... MiniMax-M2.5 ... done status=timeout elapsedMs=16511
    embedded_run_failover_decision ... failoverReason="timeout" ... aborted=true
    

    Active Memory is a blocking pre-reply memory sub-agent, so if it keeps timing out, every affected agent turn can feel like it “stops”.

    Quick test: temporarily turn it off and see if agents stop freezing:

    /active-memory off --global
    

    Or scope it away from the affected agent/session. Docs: https://docs.openclaw.ai/concepts/active-memory

  2. Gateway local WebSocket handshakes are timing out

#
   gateway/ws handshake timeout ... peer=127.0.0.1...->127.0.0.1:18789

That matches your earlier Gateway health: timeout. It suggests the Gateway is reachable but sometimes too wedged/busy to complete local RPC handshakes.

The SQLite warning is harmless. The Telegram sendChatAction failure may be network noise unless it repeats a lot.

Next I’d do this:

openclaw health --verbose --timeout 30000
curl -m 5 http://127.0.0.1:18789/healthz
curl -g -m 5 'http://[::1]:18789/healthz'

And as a mitigation test:

#
/active-memory off --global
openclaw gateway restart

Then try the same kind of agent task again. If it stops freezing, the issue is probably Active Memory + MiniMax timeout/failover pressure rather than the main agent loop.
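
To judge how often Active Memory overruns its budget, the `elapsedMs` values of timed-out runs can be pulled out of the log. This sketch assumes log lines keep the `active-memory: ... status=timeout elapsedMs=N` shape seen in the paste above; `active_memory_timeouts` is a hypothetical helper name.

```shell
# Print elapsedMs for every Active Memory run that ended with status=timeout.
# Reads log lines on stdin; lines that do not match are ignored.
active_memory_timeouts() {
  sed -n 's/.*active-memory:.*status=timeout elapsedMs=\([0-9][0-9]*\).*/\1/p'
}
```

A possible use is `openclaw logs --plain --limit 1000 | active_memory_timeouts`. For the line pasted above it would print `16511`, which is just past the `timeoutMs=15000` shown on the matching start line.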

#

This adds two concrete suspects:

failed to start server "minimax" ... timed out after 30000ms

That means every run that exposes MCP tools may block ~30s trying to start the minimax MCP server. If you don’t actively need that MCP server, disable it for the test:

openclaw mcp show minimax --json
openclaw mcp unset minimax
openclaw gateway restart

If you want a broader test without removing the server config, deny bundled MCP tools temporarily:

#
openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart

Also keep Active Memory off for the test, since your previous log showed MiniMax timing out there too:

/active-memory off --global

Then try one normal agent task again.

The edit failed: Could not find the exact text...memory/2026-04-29.md line is probably not the main freeze — that’s a failed exact-text edit, often from stale/concurrent file contents.

The Tailscale line says the Control UI/browser handshake is timing out through openclaw.tail...; close that browser tab during testing too, just to reduce noise.

crystal warren
#

najef@OpenClaw-Server:~/.openclaw$ openclaw health --verbose --timeout 30000

🦞 OpenClaw 2026.4.26 (be8c246) — Less clicking, more shipping, fewer "where did that file go" moments.

│
◇
Gateway connection:
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback
Telegram: configured
Agents: sarah, marcus, david, jordan, emma, gemini, susan, arthur, edgar, oscar, virginia, james, orion, riley, alex, neon (default), arnold
Heartbeat interval: 25m (sarah), 25m (marcus), 25m (david), 25m (jordan), 25m (emma), 25m (gemini), 25m (susan), 25m (arthur), 25m (edgar), 25m (oscar), 25m (virginia), 25m (james), 25m (orion), 25m (riley), 25m (alex), 25m (neon), 25m (arnold)

Session store (marcus): /home/najef/.openclaw/agents/marcus/sessions/sessions.json (2 entries)

  • agent:marcus:main:heartbeat (12m ago)
  • agent:marcus:main (42m ago)

Session store (susan): /home/najef/.openclaw/agents/susan/sessions/sessions.json (2 entries)

  • agent:susan:main:heartbeat (6m ago)
  • agent:susan:main (7392m ago)

Session store (neon): /home/najef/.openclaw/agents/neon/sessions/sessions.json (168 entries)

  • agent:neon:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed (3m ago)
  • agent:neon:main:heartbeat (23m ago)
  • agent:neon:telegram:direct:2101884310 (40m ago)
  • agent:neon:cron:1404af42-bf12-4a73-b197-b7a670d80d20:run:2bb9f514-b10a-48d1-8f89-6502dcb964d4:heartbeat (128m ago)
  • agent:neon:cron:1404af42-bf12-4a73-b197-b7a670d80d20 (143m ago)

Session store (arnold): /home/najef/.openclaw/agents/arnold/sessions/sessions.json (4 entries)

  • agent:arnold:main:heartbeat (0m ago)
  • agent:arnold:telegram:direct:2101884310 (39m ago)
  • agent:arnold:telegram:slash:2101884310 (4375m ago)
  • agent:arnold:cron:7c6caa02-af84-4e17-9dfc-ee02b1ac255b:run:b53e970e-7699-4d26-b82e-63c7189e2ed1 (24623m ago)

Other agents same...
austere gorgeBOT
#

This output actually confirms the Gateway is reachable now — so the earlier Gateway health: timeout looks intermittent/load-related, not a permanent bind issue.

Big red flag here:

Heartbeat interval: 25m (...) for 17 agents

That’s a lot of full agent turns in the background. Combined with the earlier:

  • Active Memory timing out on MiniMax
  • minimax MCP server startup timing out after 30s
  • WebSocket handshake timeouts

…I’d test by reducing background pressure first.

Try this mitigation bundle:

#
openclaw config set agents.defaults.heartbeat.skipWhenBusy true
openclaw config set agents.defaults.heartbeat.isolatedSession true
openclaw config set agents.defaults.heartbeat.lightContext true
openclaw config set agents.defaults.heartbeat.every "2h"
openclaw gateway restart

If you want a stronger test, temporarily disable heartbeats:

openclaw config set agents.defaults.heartbeat.every "0m"
openclaw gateway restart

Then run openclaw health --verbose again. If it still shows 25m per-agent, those are per-agent heartbeat overrides in agents.list[], so check:

#
grep -n "heartbeat" ~/.openclaw/openclaw.json

I’d also keep these disabled during the test:

/active-memory off --global

and either:

openclaw mcp unset minimax

or:

crystal warren
austere gorgeBOT
#

Good — that rules out IPv4/IPv6 bind mismatch. Gateway is listening fine on both:

127.0.0.1 ✅
::1 ✅

So the pattern is much more likely load / blocking work inside the Gateway, not networking.

I’d do the isolation test in this order:

# reduce heartbeat load
openclaw config set agents.defaults.heartbeat.every "2h"
openclaw config set agents.defaults.heartbeat.skipWhenBusy true
openclaw config set agents.defaults.heartbeat.isolatedSession true
openclaw config set agents.defaults.heartbeat.lightContext true

# disable the MCP server path that is timing out
openclaw config set tools.deny '["bundle-mcp"]' --strict-json

openclaw gateway restart

And in chat:

/active-memory off --global

Then run one task that used to stall. If it stops stalling, re-enable pieces one by one:

#
  1. heartbeats
  2. active-memory
  3. bundle-mcp / minimax MCP

My bet: the 17 agents heartbeating every 25m + Active Memory MiniMax timeouts + Minimax MCP startup timeout are saturating the runtime enough that runs appear “asleep” until a new message wakes/queues progress.
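
For a sense of scale, the background load from heartbeats alone can be estimated with simple arithmetic. This sketch assumes each heartbeat is one full agent turn, as described above; `turns_per_hour` is a hypothetical helper name.

```shell
# Estimate background heartbeat turns per hour.
# $1 = number of agents, $2 = heartbeat interval in minutes
turns_per_hour() {
  LC_ALL=C awk -v agents="$1" -v minutes="$2" \
    'BEGIN { printf "%.1f\n", agents * 60 / minutes }'
}

turns_per_hour 17 25    # current setting: 17 agents every 25m
turns_per_hour 17 120   # proposed setting: 17 agents every 2h
```

This prints `40.8` and then `8.5`, so moving from 25m to 2h would cut heartbeat turns to about a fifth. It says nothing about how expensive each turn is, which depends on the model, Active Memory, and MCP startup.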

crystal warren
#

najef@OpenClaw-Server:~/.openclaw$ openclaw mcp show minimax --json
{
"command": "uvx",
"args": [
"minimax-coding-plan-mcp",
"-y"
],
"env": {
"MINIMAX_API_KEY": "sk-Removed",
"MINIMAX_API_HOST": "https://api.minimax.io"
}

austere gorgeBOT
#

Yep, that matches the timeout line exactly:

uvx minimax-coding-plan-mcp -y

For now I’d remove it from OpenClaw’s MCP registry and restart:

openclaw mcp unset minimax
openclaw gateway restart

Then confirm it’s gone:

#
openclaw mcp list

If you need to keep the config around instead of deleting it, use the global deny test:

openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart

Also: good call redacting the key. If that key was ever pasted unredacted anywhere public, rotate it.

crystal warren
#

Can we look at /active-memory? Why is MiniMax not working?

austere gorgeBOT
#

Yes — and important distinction: there are two MiniMax paths in your logs.

  1. Active Memory MiniMax model call

    active-memory ... activeProvider=minimax activeModel=MiniMax-M2.5 ... status=timeout
    

    This is the one affecting /active-memory.

  2. MCP MiniMax server

    uvx minimax-coding-plan-mcp ... timed out
    

    Separate issue. That’s the MCP tool server, not the chat model.

For /active-memory, check this first in chat:

#
/active-memory status
/active-memory status --global
/verbose on
/trace on

Then from shell:

openclaw config get plugins.entries.active-memory
openclaw models status --probe --probe-provider minimax --probe-timeout 30000
openclaw models list --provider minimax

Your log says Active Memory is using MiniMax-M2.5, but current MiniMax docs/defaults are around MiniMax-M2.7 / M2.7-highspeed. So also smoke-test the exact model:

#
openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json

If M2.5 times out/fails but M2.7 works, pin Active Memory to the working model:

openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw config set plugins.entries.active-memory.config.timeoutMs 30000
openclaw gateway restart

If you use OAuth instead of API key, the provider may be minimax-portal/... instead of minimax/... — use whatever models list/status shows.

Docs: https://docs.openclaw.ai/concepts/active-memory and https://docs.openclaw.ai/providers/minimax

crystal warren
#

Active Memory: on for this session.
Active Memory: on globally.

"enabled": true,
"config": {
"enabled": true,
"agents": [
"neon",
"main",
"arnold",
"orion",
"sarah",
"marcus",
"david",
"emma",
"jordan",
"alex",
"gemini",
"susan",
"arthur",
"edgar",
"oscar",
"virginia",
"james",
"riley"
],
"allowedChatTypes": [
"direct"
],
"modelFallback": "default-remote",
"queryMode": "recent",
"promptStyle": "balanced",
"timeoutMs": 15000,
"maxSummaryChars": 220,
"persistTranscripts": false,
"logging": true,
"model": "minimax/MiniMax-M2.5",
"modelFallbackPolicy": "default-remote",
"thinking": "low",
"qmd": {
"searchMode": "inherit"

austere gorgeBOT
#

Yep, that explains a lot.

Active Memory is enabled for 18 agents, and it’s pinned to:

model: "minimax/MiniMax-M2.5"
timeoutMs: 15000
thinking: "low"

Your logs show exactly that model timing out at ~16.5s. Since Active Memory is blocking, this can make normal turns feel stalled.

I’d change Active Memory to either use a current MiniMax model or a faster fallback.

Try:

#
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7-highspeed
openclaw config set plugins.entries.active-memory.config.timeoutMs 30000
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw gateway restart

If that model isn’t available, first check:

openclaw models list --provider minimax
openclaw models status --probe --probe-provider minimax --probe-timeout 30000

Or safest / simplest: let Active Memory inherit the working session model instead of pinning MiniMax:

#
openclaw config unset plugins.entries.active-memory.config.model
openclaw config set plugins.entries.active-memory.config.modelFallback google/gemini-3-flash
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw gateway restart

Also consider scoping it to fewer agents while testing:

openclaw config set plugins.entries.active-memory.config.agents '["neon","arnold"]' --strict-json

modelFallbackPolicy is deprecated/no longer meaningful for runtime behavior, so don’t worry about that one.

crystal warren
#

openclaw models status --probe --probe-provider minimax --probe-timeout 30000

Config : ~/.openclaw/openclaw.json
Agent dir : ~/.openclaw/agents/main/agent
Default : minimax/MiniMax-M2.7
Fallbacks (1) : minimax/MiniMax-M2.5
Aliases (2) : Claude -> anthropic/claude-sonnet-4-6, Minimax -> minimax/MiniMax-M2.7
Configured models (3): anthropic/claude-sonnet-4-6, minimax/MiniMax-M2.5, minimax/MiniMax-M2.7

Auth overview
Auth store : ~/auth-profiles.json
Shell env : off
Providers w/ OAuth/tokens (0): -

  • anthropic effective=models.json:sk-ant-a...8A3PcgAA | models.json=sk-ant-a...8A3PcgAA | source=models.json: ~/.openclaw/agents/main/agent/models.json
  • metaclaw effective=models.json:me...aw | models.json=me...aw | source=models.json: ~models.json
  • minimax effective=profiles:~/.openclaw/agents/main/agent/auth-profiles.json | profiles=1 (oauth=0, token=0, api_key=1) | minimax:global=sk-cp-x7...eDA9Y_cc
  • openrouter effective=profiles:~auth-profiles.json | profiles=1 (oauth=0, token=0, api_key=1) | openrouter:default=sk-or-v1...3e63b85d
  • zai effective=profiles:~auth-profiles.json | profiles=1 (oauth=0, token=0, api_key=1) | zai:default=cc393639...ZMqFyAia

Model                │ Profile                  │ Status
minimax/MiniMax-M2.7 │ minimax:global (api_key) │ unknown · 54s
  ↳ No callable tools remain after resolving explicit tool allowlist (agents.neon.tools.allow: *, message); tools are disabled for this run. Fix the allowlist or enable the plugin that registers the requested tool.

austere gorgeBOT
#

Yep — this shows the issue pretty clearly.

Active Memory is pinned to minimax/MiniMax-M2.5, while your MiniMax default is M2.7 and the probe only checked M2.7. Also the probe result is not clean:

Status: unknown · 54s

So first bypass agent/tools and test raw MiniMax:

openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json

If M2.7 works and M2.5 is slow/fails, fix Active Memory:

#
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw config set plugins.entries.active-memory.config.timeoutMs 30000
openclaw gateway restart

Also, this part is suspicious but separate:

agents.neon.tools.allow: *, message

For the MiniMax probe, avoid that noise with:

#
openclaw models status --probe --probe-provider minimax --agent main --probe-timeout 30000

Docs note: Coding Plan OAuth uses minimax-portal/...; API-key setup uses minimax/.... Since your auth shows sk-cp..., if raw minimax/MiniMax-M2.7 fails, you may need to re-onboard with the matching MiniMax auth path. https://docs.openclaw.ai/providers/minimax

crystal warren
#

can both have same alias?
openclaw models list --provider minimax

Model Input Ctx Local Auth Tags
minimax/MiniMax-M2.7 text 195k no yes default,configured,alias:Minimax
minimax/MiniMax-M2.5 text 195k no yes fallback#1,configured,alias:Minimax

austere gorgeBOT
#

I wouldn’t keep the same alias on both. It’s ambiguous and can make Minimax resolve in surprising ways depending on default/fallback ordering.

Use distinct aliases, e.g.:

openclaw config set 'agents.defaults.models.minimax/MiniMax-M2.7.alias' Minimax
openclaw config set 'agents.defaults.models.minimax/MiniMax-M2.5.alias' MinimaxOld

Or remove the alias from M2.5 entirely:

openclaw config unset 'agents.defaults.models.minimax/MiniMax-M2.5.alias'

Then for Active Memory, don’t use the alias — pin the exact model:

#
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw gateway restart

That avoids it accidentally selecting M2.5 again.
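
A shared alias like this is easy to miss by eye once more models are configured. As a rough check, the `alias:` tags in the last column of the listing can be counted. The sketch assumes the table layout pasted above (tags as the last whitespace-separated field, comma-separated); `duplicate_aliases` is a hypothetical helper name.

```shell
# Print every alias that is attached to more than one model.
# Reads the model listing on stdin; the header line carries no alias tag.
duplicate_aliases() {
  awk '
    {
      n = split($NF, tags, ",")
      for (i = 1; i <= n; i++)
        if (tags[i] ~ /^alias:/) count[substr(tags[i], 7)]++
    }
    END { for (a in count) if (count[a] > 1) print a }
  '
}
```

A possible use is `openclaw models list --provider minimax | duplicate_aliases`. For the two rows pasted above it would print `Minimax`.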

crystal warren
#

I am running openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json and I can see the CPU at 100% the whole time. It's the gateway process, and it did not use to consume this much CPU.

austere gorgeBOT
#

Yeah — don’t let that keep spinning. Ctrl-C it.

Important: openclaw infer model run --local should not route through the Gateway; it’s a lean local provider smoke test. So if the gateway process is pegged at 100%, either:

  1. it’s coincidental background load — heartbeats / active-memory / MCP, or
  2. the process you’re seeing is actually the openclaw infer CLI node process, not the gateway.

Check the PID/cmd while it’s happening:

ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20

Also run the probe with a hard timeout:

#
timeout 60s openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json
timeout 60s openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json

If M2.5 hangs / burns CPU but M2.7 works, remove M2.5 from anything important:

openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw config set agents.defaults.model.fallbacks '[]' --strict-json
openclaw gateway restart

Given what we’ve seen, I would not use MiniMax-M2.5 for Active Memory. It’s exactly the model timing out in your logs.

crystal warren
#

I am doing the smoke test...

austere gorgeBOT
#

Got it — while it’s running, please grab the process list in another terminal:

ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20

That will tell us whether the CPU is the smoke-test CLI process or the long-running gateway.

If it runs longer than ~60s, stop it with Ctrl-C and try the same for M2.7:

timeout 60s openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json

If M2.7 returns quickly and M2.5 doesn’t, we’ve basically found the bad model path.

crystal warren
#

najef@OpenClaw-Server:~$ ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20
377477 1466 322 9.3 01:06:48 openclaw-gateway
377568 377540 0.4 0.5 01:06:47 /home/najef/.nvm/versions/node/v24.14.1/bin/node --require /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/preflight.cjs --import file:///home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/loader.mjs server/index.ts
377538 377537 0.1 0.6 01:06:47 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/vite
377558 377538 0.0 0.0 01:06:47 /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/@esbuild/linux-x64/bin/esbuild --service=0.27.2 --ping
377540 377539 0.0 0.2 01:06:47 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/tsx watch server/index.ts
377504 377503 0.0 0.2 01:06:48 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/npm-run-all --parallel dev dev:server
346387 2098 0.0 0.1 02:47:36 /usr/bin/pluma /home/najef/.openclaw/openclaw.json
2495 2475 0.0 0.0 15:33:17 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/110/doc
2433 2402 0.0 0.0 15:33:17 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /home/najef/.cache/doc
1744 1728 0.0 0.0 15:33:18 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/1000/doc

austere gorgeBOT
#

Yep — that’s the smoking gun:

openclaw-gateway 322% CPU

infer model run --local should not be using the Gateway, so this is almost certainly background Gateway load, not the smoke test itself.

Do this mitigation now:

/active-memory off --global

Then shell:

#
openclaw mcp unset minimax
openclaw config set agents.defaults.heartbeat.every "2h"
openclaw config set agents.defaults.heartbeat.skipWhenBusy true
openclaw gateway restart

After restart, watch CPU:

ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw-gateway|openclaw|node' | grep -v grep | sort -k3 -nr | head -20

If Gateway jumps back to high CPU while idle, grab:

#
openclaw tasks list --status running
openclaw gateway stability --json --limit 100
openclaw logs --plain --limit 500 | grep -Ei 'active-memory|minimax|heartbeat|cron|handshake|timeout|error|stuck'

Also: those vite/tsx processes under .openclaw/.../workspace look like an agent-started dev server. Not necessarily bad, but if no current task needs it, it’s extra load/noise. The Gateway CPU is the real blocker though.

crystal warren
#

You are proposing workarounds; I am more keen to know why something is not working.

austere gorgeBOT
#

You’re right — I drifted into mitigation mode. Let’s separate what we know from what we still need to prove.

What we know from your logs:

  1. Gateway is actually burning CPU

    openclaw-gateway 322% CPU
    

    So agents “stopping” is likely because the Gateway/event loop is saturated, not because Telegram/WebChat is disconnected.

  2. Active Memory is pinned to MiniMax M2.5

    "model": "minimax/MiniMax-M2.5"
    

    and logs show:

    active-memory ... MiniMax-M2.5 ... status=timeout

    So Active Memory is one trigger candidate.

  3. MiniMax MCP is also timing out

    uvx minimax-coding-plan-mcp ... timed out after 30000ms
    

    Separate from Active Memory, but another MiniMax-related blocking path.

  4. Gateway handshakes time out under load

    gateway/ws handshake timeout
    

    That’s likely an effect of the CPU saturation.

#

To find the actual cause, don’t change config yet. While CPU is high, run:

openclaw gateway stability --json --limit 100
openclaw tasks list --status running --json
openclaw logs --plain --limit 500 | grep -Ei 'diagnostic|liveness|active-memory|minimax|heartbeat|cron|mcp|timeout|error'

Also:

curl -s http://127.0.0.1:18789/readyz | python3 -m json.tool

What I’m looking for: event-loop delay, active/waiting sessions, repeated active-memory starts, MCP startup loops, or heartbeat/cron floods.

The current best hypothesis is: a bug/hot loop or repeated blocking retry inside Gateway, triggered by MiniMax M2.5 Active Memory and/or MiniMax MCP startup, amplified by many agents/heartbeats. But we need the stability/liveness output to say why confidently.

#

They appear to be on Linux/Ubuntu-ish from the prompt and process output:

najef@OpenClaw-Server
systemd not installed
/home/najef/.nvm/versions/node/v24.14.1/...

But better to confirm with:

uname -a
cat /etc/os-release
node --version
openclaw --version
crystal warren
#

"model": "minimax/MiniMax-M2.5" should be quicker than "model": "minimax/MiniMax-M2.7" also maybe the problem is not with the model but with the cpu being 100% I say this because openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json also takes long time to finish.

austere gorgeBOT
#

Yes — I agree with your read. The slow infer --local test may be correlated, not causal.

If openclaw-gateway is already at 300% CPU, then any normal command can feel slow because the machine is saturated. The model request itself should mostly be network wait and should not burn local CPU.

So the question is now: what is the Gateway doing at 300% CPU?

Please capture these while CPU is high:

openclaw gateway stability --json --limit 100
openclaw tasks list --status running --json
curl -s http://127.0.0.1:18789/readyz | python3 -m json.tool

And this log slice:

#
openclaw logs --plain --limit 1000 \
  | grep -Ei 'diagnostic|liveness|event-loop|cpu|active-memory|minimax|heartbeat|cron|mcp|timeout|stuck|processing|error' \
  | tail -120

Also useful to see if it’s spawning/reaping loops:

pstree -ap 377477 | head -100

If gateway stability shows high event-loop delay / active session queues, that explains why agents only continue after a nudge: they’re not “asleep”; the Gateway is overloaded and delayed. Then we can pin whether the overload is active-memory, heartbeat fanout, cron, MCP startup, or a bug/hot loop.

crystal warren
#

najef@OpenClaw-Server:~$ openclaw gateway stability --json --limit 100
Gateway stability failed: Error: gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback

PRETTY_NAME="Ubuntu 24.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.4 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
v24.14.1
OpenClaw 2026.4.26 (be8c246)

austere gorgeBOT
#

Fair. The key fact now is:

openclaw gateway stability ... gateway timeout after 10000ms
openclaw-gateway 322% CPU

So the Gateway is not just “busy”; it is saturated enough that its own diagnostic RPC can’t answer. To know why, we need OS-level evidence because Gateway RPC is unreliable while it’s pegged.

Please run these while CPU is high:

PID=$(pgrep -xo openclaw-gateway)
echo "PID=$PID"

ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
#

This tells us whether one JS thread is hot-looping or many worker/native threads are burning CPU.

Then:

curl -m 10 -s http://127.0.0.1:18789/readyz | python3 -m json.tool | head -160

And if strace exists:

timeout 15s strace -f -p "$PID" -c 2>&1

Interpretation:

  • lots of openat/stat/read → file/plugin/session scanning loop
  • lots of futex/epoll + CPU still high → worker/native contention
  • almost no syscalls → JS hot loop / GC / CPU-bound code

This is the path to “why.” Current hypothesis is not “MiniMax is slow”; it’s “Gateway entered a CPU-saturation path, and MiniMax/active-memory/MCP are just nearby triggers.”

crystal warren
#

openclaw tasks list --status running --json
{
"count": 1,
"runtime": null,
"status": "running",
"tasks": [
{
"taskId": "a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40",
"runtime": "cron",
"sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"requesterSessionKey": "",
"ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"scopeKind": "system",
"childSessionKey": "agent:neon:telegram:direct:2101884310",
"agentId": "neon",
"runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1777458386285",
"label": "Orion Heartbeat",
"task": "Orion Heartbeat",
"status": "running",
"deliveryStatus": "not_applicable",
"notifyPolicy": "silent",
"createdAt": 1777458386285,
"startedAt": 1777458386285,
"lastEventAt": 1777458386285
}
]
}

curl -s http://127.0.0.1:18789/readyz | python3 -m json.tool
{
"ready": true,
"failing": [],
"uptimeMs": 4511038
}

austere gorgeBOT
#

This narrows it.

readyz being green means the HTTP liveness/readiness path can answer. But gateway stability timing out means the Gateway WS/RPC path is too saturated to answer within 10s. So the Gateway isn’t dead; it’s overloaded/wedged in a way that mainly affects agent/RPC work.

The running task is also very suspicious:

label: Orion Heartbeat
runtime: cron
agentId: neon
childSessionKey: agent:neon:telegram:direct:2101884310
lastEventAt == startedAt

That means this cron run started, but has emitted no progress event since start. And it’s using the same neon Telegram direct session, not some obviously isolated throwaway session.
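
If you want to repeat that check later without eyeballing the JSON, here is a minimal Python sketch. It reads the output of openclaw tasks list --status running --json and flags tasks whose lastEventAt still equals startedAt. The field names are taken from your paste; the helper name silent_tasks and the 60-second threshold are my own assumptions, not anything OpenClaw ships.

```python
import json

def silent_tasks(payload: str, now_ms: int, min_age_ms: int = 60_000) -> list[dict]:
    """Return running tasks that have emitted no event since they started."""
    data = json.loads(payload)
    flagged = []
    for task in data.get("tasks", []):
        started = task.get("startedAt")
        last = task.get("lastEventAt")
        if task.get("status") != "running" or started is None:
            continue
        # No progress event yet, and the task has been running for a while.
        if last == started and now_ms - started >= min_age_ms:
            flagged.append({
                "label": task.get("label"),
                "agentId": task.get("agentId"),
                "ageSeconds": (now_ms - started) // 1000,
            })
    return flagged

# Shortened copy of the task from the paste above.
sample = json.dumps({
    "count": 1,
    "tasks": [{
        "taskId": "a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40",
        "label": "Orion Heartbeat",
        "agentId": "neon",
        "status": "running",
        "startedAt": 1777458386285,
        "lastEventAt": 1777458386285,
    }],
})

# Pretend "now" is five minutes after the task started.
result = silent_tasks(sample, now_ms=1777458386285 + 300_000)
print(result)
```

In real use you would pass the command output as payload and int(time.time() * 1000) as now_ms.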

Next, inspect exactly what that cron task is doing:

#
openclaw tasks show a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40 --json
openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
openclaw cron runs --id 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --limit 10

And still need this OS-level CPU view:

PID=$(pgrep -xo openclaw-gateway)
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30

If the high CPU lines are one/few threads, that points to a hot loop. If many threads are hot, it’s more likely worker/native/parallel work.

crystal warren
#

PID=$(pgrep -xo openclaw-gateway)
echo "PID=$PID"

ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
pgrep: pattern that searches for process name longer than 15 characters will result in zero matches
Try `pgrep -f' option to match against the complete command line.
PID=
error: list of process IDs must follow -p

Usage:
ps [options]

Try 'ps --help <simple|list|output|threads|misc|all>'
or 'ps --help <s|l|o|t|m|a>'
for additional help text.

For more details see ps(1).

austere gorgeBOT
#

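Ah, the kernel’s process name (comm) is capped at 15 characters and openclaw-gateway is 16, so pgrep -x against the name can never match. Use -f to match the full command line instead: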

PID=$(pgrep -of 'openclaw-gateway')
echo "PID=$PID"

ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30

If that still misses, use the PID from your earlier ps output directly:

PID=377477
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
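
For the curious, the length limit is easy to see in a few lines of Python. This is only an illustration of the 15-character rule; the constant and function names are made up for the example.

```python
# The kernel's comm field holds 15 visible characters (16 bytes with the NUL).
TASK_COMM_MAX = 15

def pgrep_exact_can_match(pattern: str) -> bool:
    """A name-only exact match can only succeed if the pattern fits in comm."""
    return len(pattern) <= TASK_COMM_MAX

name = "openclaw-gateway"
print(len(name), pgrep_exact_can_match(name))
# What the kernel actually stores as the process name:
print(name[:TASK_COMM_MAX])
```

The truncated form is what shows up in the comm column of ps, which is why matching on the full command line with -f is the reliable route.
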
crystal warren
#

openclaw tasks show a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40 --json

{
"taskId": "a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40",
"runtime": "cron",
"sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"requesterSessionKey": "",
"ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"scopeKind": "system",
"childSessionKey": "agent:neon:telegram:direct:2101884310",
"agentId": "neon",
"runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1777458386285",
"label": "Orion Heartbeat",
"task": "Orion Heartbeat",
"status": "timed_out",
"deliveryStatus": "not_applicable",
"notifyPolicy": "silent",
"createdAt": 1777458386285,
"startedAt": 1777458386285,
"endedAt": 1777458715376,
"lastEventAt": 1777458715376,
"cleanupAfter": 1778063515376,
"error": "cron: job execution timed out"
}

openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed

🦞 OpenClaw 2026.4.26 (be8c246) — Pairing codes exist because even bots believe in consent—and good security hygiene.

│
gateway connect failed: Error: gateway closed (1000):
◇
Error: gateway closed (1000 normal closure): no close reason
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback

austere gorgeBOT
#

This is useful: the cron task timed out, and now normal Gateway RPC is closing during connect:

gateway closed (1000 normal closure)

That means we still need the OS/thread view because Gateway is not reliable enough to introspect itself.

Run this exact version:

PID=$(pgrep -of 'openclaw-gateway')
echo "PID=$PID"

ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
#

And if strace is installed:

timeout 15s strace -f -p "$PID" -c 2>&1

The cron evidence says Orion Heartbeat is one reproducible stuck/timed-out workload, but the reason may still be underneath it: Gateway CPU saturation, active-memory, or MCP. The closed (1000) during cron show is another sign the Gateway WS handler is unhealthy under load.

crystal warren
#

najef@OpenClaw-Server:~$ PID=$(pgrep -of 'openclaw-gateway')
echo "PID=$PID"

ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
PID=377477
377477 418703 0 78.4 10.0 libuv-worker openclaw-gateway
377477 418732 0 63.9 10.0 libuv-worker openclaw-gateway
377477 418700 1 46.5 10.0 libuv-worker openclaw-gateway
377477 418699 6 45.1 10.0 libuv-worker openclaw-gateway
377477 418702 4 44.5 10.0 libuv-worker openclaw-gateway
377477 418734 4 44.3 10.0 libuv-worker openclaw-gateway
377477 418704 2 39.5 10.0 libuv-worker openclaw-gateway
377477 418701 2 39.2 10.0 libuv-worker openclaw-gateway
377477 418723 5 37.7 10.0 libuv-worker openclaw-gateway
377477 418733 3 36.0 10.0 libuv-worker openclaw-gateway
377477 418724 2 32.7 10.0 libuv-worker openclaw-gateway
377477 418725 0 32.2 10.0 libuv-worker openclaw-gateway
377477 377590 4 22.1 10.0 libuv-worker openclaw-gateway
377477 377591 5 21.9 10.0 libuv-worker openclaw-gateway
377477 377593 5 21.5 10.0 libuv-worker openclaw-gateway
377477 377592 1 21.5 10.0 libuv-worker openclaw-gateway
377477 377477 1 21.1 10.0 openclaw-gatewa openclaw-gateway
377477 377484 1 0.6 10.0 V8Worker openclaw-gateway
377477 377483 6 0.6 10.0 V8Worker openclaw-gateway
377477 377482 1 0.6 10.0 V8Worker openclaw-gateway
377477 377481 1 0.6 10.0 V8Worker openclaw-gateway
PID TID PSR %CPU %MEM COMMAND COMMAND
377477 377490 4 0.0 10.0 SignalInspector openclaw-gateway
377477 377480 6 0.0 10.0 DelayedTaskSche openclaw-gateway

#

najef@OpenClaw-Server:~$ timeout 15s strace -f -p "$PID" -c 2>&1
strace: attach: ptrace(PTRACE_SEIZE, 377477): Operation not permitted
najef@OpenClaw-Server:~$

austere gorgeBOT
#

This is the first really diagnostic output. It means:

The CPU burn is in libuv-worker threads, not mostly the JS main thread.
So this is probably not “MiniMax response is slow” by itself. Something in Gateway is flooding Node’s worker pool — commonly filesystem work, crypto, compression, DNS, sqlite/native work, etc.

That also explains the weird split:

  • /readyz works: simple HTTP path can answer.
  • WS/RPC commands time out/close: connect/auth/task paths likely need queued worker work and get starved.
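
To put numbers on the thread breakdown, here is a small Python sketch that sums %CPU per thread name from ps -L output. It assumes the same column order as the -o pid,tid,psr,pcpu,pmem,comm,args format used above and thread names without spaces; the function name is illustrative and the sample is a shortened excerpt of your paste.

```python
from collections import defaultdict

def cpu_by_thread_name(ps_output: str) -> dict[str, float]:
    """Sum the %CPU column per thread name (comm) from `ps -L` output."""
    totals: dict[str, float] = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.split()
        # Expected columns: pid tid psr pcpu pmem comm args...
        if len(parts) < 6 or not parts[0].isdigit():
            continue  # skip blank lines and the header row
        totals[parts[5]] += float(parts[3])
    return {name: round(total, 1) for name, total in totals.items()}

sample = """\
377477 418703 0 78.4 10.0 libuv-worker openclaw-gateway
377477 418732 0 63.9 10.0 libuv-worker openclaw-gateway
377477 377477 1 21.1 10.0 openclaw-gatewa openclaw-gateway
377477 377484 1 0.6 10.0 V8Worker openclaw-gateway
PID TID PSR %CPU %MEM COMMAND COMMAND
"""

totals = cpu_by_thread_name(sample)
for name, pct in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {pct:6.1f}")
```

If the libuv-worker total dwarfs the main thread, as it does in your full listing, the work is being pushed to the worker pool rather than spinning in JS.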

Next, find what those workers are doing:

PID=$(pgrep -of 'openclaw-gateway')

tr '\0' '\n' < /proc/$PID/environ | grep -E 'UV_THREADPOOL_SIZE|NODE_OPTIONS|OPENCLAW'
#

timeout 15s strace -f -p "$PID" -c 2>&1

If strace says permission denied:

sudo timeout 15s strace -f -p "$PID" -c 2>&1

Interpretation:

  • tons of openat/stat/read → file/session/plugin scanning loop
  • tons of sqlite/file locking syscalls → sqlite/session/task store pressure
  • mostly futex / little syscall activity while CPU high → native CPU-bound worker / crypto / compression / tight loop

If you have perf, this is even better:

#
sudo timeout 15s perf top -p "$PID" --stdio | head -100

The “why” right now is: Gateway worker pool saturation. We still need strace/perf to identify the subsystem causing it.

#

That’s expected on Ubuntu: kernel.yama.ptrace_scope defaults to 1, so a non-root user can only attach strace to its own child processes.

Use one of these:

sudo timeout 15s strace -f -p "$PID" -c 2>&1

If you don’t have sudo, check thread wait channels instead — no special perms usually:

PID=$(pgrep -of 'openclaw-gateway')
for T in $(ps -L -p "$PID" -o tid= | head -40); do
  printf "%s " "$T"
  cat /proc/$PID/task/$T/wchan 2>/dev/null
done | sort | uniq -c
#

And grab open files summary:

ls -l /proc/$PID/fd 2>/dev/null | awk '{print $NF}' | sed 's/[0-9]\+$/#/' | sort | uniq -c | sort -nr | head -40

If sudo strace works, that’s the best next evidence.

crystal warren
#

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.05    0.039992         121       330           epoll_pwait
 18.85    0.010919          12       869           statx
  3.53    0.002042          14       139         6 futex
  3.42    0.001982          24        81           write
  3.37    0.001949          21        90           read
  1.69    0.000980         490         2           sendto
  0.05    0.000027           1        27           clone3
  0.02    0.000011           0       108           rt_sigprocmask
  0.01    0.000005           0         8           access
  0.01    0.000004           0        27           rseq
  0.00    0.000001           0        27           madvise
  0.00    0.000001           0         4           socket
  0.00    0.000001           0         4         2 connect
  0.00    0.000001           0         4           newfstatat
  0.00    0.000001           0        27           set_robust_list
  0.00    0.000001           0         6           io_uring_enter
  0.00    0.000000           0         6           close
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         4           poll
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0        24           getpid
  0.00    0.000000           0         2           recvfrom
  0.00    0.000000           0        22           setsockopt
  0.00    0.000000           0         2           getsockopt
  0.00    0.000000           0         2           epoll_ctl
  0.00    0.000000           0         2           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.057917          31      1823         8 total
najef@OpenClaw-Server:~$

austere gorgeBOT
#

This is not looking like “MiniMax model latency” anymore.

Key read from this sample:

869 statx
27 clone3
total syscall time: 0.058s over 15s

So the Gateway is spending CPU mostly outside syscalls, inside libuv worker/native userland work. The statx count hints at filesystem scanning/polling, but it’s not enough by itself to explain 300% CPU unless it’s happening constantly or mixed with native work.
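
The arithmetic behind that, as a quick Python sketch: it parses the strace -c summary and compares total syscall time with the sampling window. It assumes the 15-second window from the timeout 15s command; the sample is a shortened excerpt of your table, and the parser is illustrative rather than a general strace parser.

```python
def parse_strace_summary(text: str) -> dict[str, tuple[float, int]]:
    """Map syscall name -> (seconds, calls) from `strace -c` rows."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows: %time seconds usecs/call calls [errors] syscall
        if len(parts) < 5:
            continue
        try:
            seconds, calls = float(parts[1]), int(parts[3])
        except ValueError:
            continue  # header or separator line
        rows[parts[-1]] = (seconds, calls)
    return rows

sample = """\
% time seconds usecs/call calls errors syscall
69.05 0.039992 121 330 epoll_pwait
18.85 0.010919 12 869 statx
3.53 0.002042 14 139 6 futex
100.00 0.057917 31 1823 8 total
"""

rows = parse_strace_summary(sample)
window_s = 15.0  # length of the `timeout 15s` sample
total_s, total_calls = rows["total"]
share = total_s / window_s * 100
print(f"syscall time: {total_s:.3f}s of {window_s:.0f}s ({share:.2f}%)")
print(f"statx rate: {rows['statx'][1] / window_s:.1f} calls/s")
```

Well under 1% of the window is accounted for by syscalls, which is why the statx calls alone cannot explain several cores of CPU.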

Two next checks:

PID=$(pgrep -of 'openclaw-gateway')
#
tr '\0' '\n' < /proc/$PID/environ | grep -E 'UV_THREADPOOL_SIZE|NODE_OPTIONS|OPENCLAW'

And path-level strace, so we see what is being statx’d:

sudo timeout 10s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat \
  -s 200 2>&1 \
  | grep -E 'openclaw.plugin.json|/dist/extensions/|sessions.json|memory/|models.json|auth-profiles|cron|node_modules' \
  | head -120

If perf is available, that’s the best answer:

#
sudo timeout 15s perf top -p "$PID" --stdio | head -100

Right now I’d phrase the “why” as:

Gateway’s libuv worker pool is saturated by native/background work. That starves WS/RPC operations, so agents appear stopped and diagnostic RPCs time out. We need path/function samples to identify which subsystem is feeding the worker pool.

The running/timed-out Orion Heartbeat is likely one workload that triggers it, but the thread data says the deeper failure is worker-pool saturation.

crystal warren
#

PID=$(pgrep -of 'openclaw-gateway') shows nothing.

OPENCLAW_GATEWAY_PORT=18789
OPENCLAW_SYSTEMD_UNIT=openclaw-gateway.service
OPENCLAW_WINDOWS_TASK_NAME=OpenClaw Gateway
OPENCLAW_SERVICE_MARKER=openclaw
OPENCLAW_SERVICE_KIND=gateway
OPENCLAW_SERVICE_VERSION=2026.4.26

[pid 377593] openat(AT_FDCWD, "/home/najef/.openclaw/cron/jobs-state.json", O_RDONLY|O_CLOEXEC) = 37
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=139264, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/index.js", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0775, stx_size=2855, ...}) = 0

#

[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json", O_RDONLY|O_CLOEXEC) = 39
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json", O_RDONLY|O_CLOEXEC) = 39
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=139264, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/extensions", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=4096, ...}

austere gorgeBOT
#

Good catch. That sample points away from MiniMax itself.

What we now know:

  • Gateway PID was 377477; pgrep not finding it now may mean the process title changed or it restarted, but your strace lines are still from that PID.
  • The hot Gateway is doing filesystem/module-resolution work:
    /home/najef/.openclaw/cron/jobs-state.json
    .../node_modules/openclaw/dist/index.js
    
  • Combined with the running/timed-out Orion Heartbeat, the strongest theory is now: cron/heartbeat state handling is repeatedly waking work and causing heavy module/fs resolution in libuv workers, starving Gateway WS/RPC.

Next useful capture:

ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20
#

Then use the real PID from that output and run:

PID=<gateway-pid>
sudo timeout 20s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat \
  -s 240 2>&1 \
  | grep -E 'jobs-state.json|jobs.json|sessions.json|openclaw.plugin.json|dist/index.js|auth-profiles|models.json' \
  | head -200

Also inspect the cron job:

openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --json
#

If cron show can’t connect, directly inspect config/store paths:

grep -R "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed\|Orion Heartbeat" ~/.openclaw -n | head -50

At this point I’d file an issue with: high CPU in many libuv-worker threads, WS/RPC timeouts, Orion Heartbeat cron timing out, and strace showing repeated cron state + OpenClaw dist stat calls. That’s a real bug-shaped report.

#

This is a better clue. The dist/package.json ENOENT itself is normal-ish Node/package resolution probing, but seeing it repeatedly while the Gateway is hot is not normal.

This points to repeated OpenClaw package/plugin metadata resolution, especially:

.../openclaw/package.json
.../openclaw/dist/extensions

So I’d refine the “why” to:

Gateway is repeatedly rebuilding/resolving OpenClaw plugin/package metadata on a hot path, likely triggered by cron/heartbeat/agent tool setup. That saturates libuv workers, so WS/RPC handshakes and agent progress stall.

Let’s count exactly what’s looping:

#
PID=377477
sudo timeout 20s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat \
  -s 240 2>&1 \
  | sed -n 's/.*"\([^"]*\)".*/\1/p' \
  | grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
  | sort | uniq -c | sort -nr | head -50

If top entries are dist/extensions / openclaw.plugin.json / package metadata, that’s strong evidence for a plugin-discovery/tool-catalog hot loop.

For PID detection, use:

ps -eo pid,pcpu,pmem,etime,args | grep '[o]penclaw'
#

because pgrep can miss depending on process title/argv.

crystal warren
#

Orion Heartbeat does not have any special heartbeat that could cause this.

austere gorgeBOT
#

That’s fair — I shouldn’t overfit to the name. The task label being “Orion Heartbeat” may just be the visible victim, not the cause.

The stronger evidence is independent of that cron:

many hot libuv-worker threads
repeated stat/open of openclaw package + dist/extensions
WS/RPC timing out while /readyz works

So the likely cause is not “Orion Heartbeat has a special prompt”. It’s more likely a Gateway hot path that repeatedly resolves package/plugin metadata during some run/session/tool setup.

To prove/disprove, the best next command is the counted strace:

PID=377477
#
sudo timeout 20s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat \
  -s 240 2>&1 \
  | sed -n 's/.*"\([^"]*\)".*/\1/p' \
  | grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
  | sort | uniq -c | sort -nr | head -50

If that shows lots of dist/extensions / openclaw.plugin.json, then the root bug is plugin/package discovery being rebuilt too often, not any one heartbeat.

crystal warren
#

what components does the gateway have that could use cpu like that?

austere gorgeBOT
#

Good question. Since the hot threads are libuv-worker, I’d think “async native/background work” more than pure JS.

Gateway components that can plausibly burn CPU / worker pool:

  • Plugin discovery / tool catalog building
    Scans dist/extensions, reads openclaw.plugin.json, package metadata. Your strace already points here.

  • Session/task/cron state stores
    Reads/writes sessions.json, tasks, cron/jobs-state.json, cleanup/maintenance scans.

  • Memory / Active Memory / embeddings / QMD
    Index/search/bootstrap memory files, embeddings, recall prep. Active Memory was timing out, but may be a trigger rather than root.

  • MCP tool startup/catalog
    Starting stdio MCP servers (uvx minimax-coding-plan-mcp) and discovering their tools can block/timeout.

  • Message/channel runtime
    Telegram send/action retries, channel health checks, delivery queues.

  • File watching / workspace scanning
    Workspace bootstrap, AGENTS/skills/memory scanning, project file inspection.

  • Compression/crypto/auth
    Token/device auth, hashing, websocket compression, TLS/DNS-ish worker tasks.

  • Node 24 / native runtime behavior
    You’re on Node v24.14.1; if a regression is in Node/libuv/fs workers, it could amplify plugin/fs scans.

From your evidence, top suspects are currently:

  1. plugin/package metadata resolution (dist/extensions, package.json)
  2. cron/session state churn (jobs-state.json)
  3. Active Memory/MCP triggering repeated tool/plugin setup
#

The counted strace will tell us whether plugin discovery is truly dominating or just incidental.

crystal warren
#

how can we check one by one what they are doing?

austere gorgeBOT
#

Yes — do it like a controlled bisect: observe CPU + one signal, disable one subsystem, restart, compare. Don’t change five things at once.

0) Baseline

ps -eo pid,pcpu,pmem,etime,args | grep '[o]penclaw'
openclaw logs --plain --limit 300 | grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error' | tail -80

1) Plugin/package discovery

Evidence check:

PID=<gateway-pid>
sudo timeout 20s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat -s 240 2>&1 \
  | sed -n 's/.*"\([^"]*\)".*/\1/p' \
  | grep -E 'openclaw.plugin.json|dist/extensions|package.json' \
  | sort | uniq -c | sort -nr | head -50

2) Cron

Check:

openclaw cron list
openclaw tasks list --status running
openclaw cron runs --id 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --limit 10

Temporary isolate:

openclaw config set cron.enabled false --strict-json
openclaw gateway restart
#

3) Heartbeats

Check:

openclaw system heartbeat last
grep -n "heartbeat" ~/.openclaw/openclaw.json | head -80

Temporary isolate:

openclaw config set agents.defaults.heartbeat.every "0m"
openclaw gateway restart

If per-agent overrides exist, they may need clearing too.

4) Active Memory

Check:

#
/active-memory status --global

Shell:

openclaw config get plugins.entries.active-memory
openclaw logs --plain --limit 500 | grep -i active-memory | tail -80

Temporary isolate:

/active-memory off --global

5) MCP

Check:

#
openclaw mcp list
openclaw mcp show minimax --json
openclaw logs --plain --limit 500 | grep -Ei 'bundle-mcp|mcp' | tail -80

Temporary isolate:

openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart

6) Channels

Check:

openclaw channels status --probe
#
openclaw logs --plain --limit 500 | grep -Ei 'telegram|channel|sendChatAction|disconnect|timeout' | tail -80

Best order: MCP → Active Memory → Cron → Heartbeats → plugin discovery evidence. After each restart, wait 2–3 min and check ps CPU again.

crystal warren
#

najef@OpenClaw-Server:~/.openclaw$ ps -eo pid,pcpu,pmem,etime,args | grep '[o]penclaw'
346387 0.0 0.1 04:03:58 /usr/bin/pluma /home/najef/.openclaw/openclaw.json
377477 504 10.3 02:23:09 openclaw-gateway
377504 0.0 0.2 02:23:09 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/npm-run-all --parallel dev dev:server
377538 0.1 0.6 02:23:09 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/vite
377540 0.0 0.2 02:23:09 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/tsx watch server/index.ts
377558 0.0 0.0 02:23:08 /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/@esbuild/linux-x64/bin/esbuild --service=0.27.2 --ping
377568 0.2 0.5 02:23:08 /home/najef/.nvm/versions/node/v24.14.1/bin/node --require /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/preflight.cjs --import file:///home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/loader.mjs server/index.ts
najef@OpenClaw-Server:~/.openclaw$ openclaw logs --plain --limit 300 | grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error' | tail -80
gateway connect failed: Error: gateway closed (1000):
Gateway not reachable. Is it running and accessible?
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback
Hint: run openclaw doctor.

austere gorgeBOT
#

At 504% CPU, the Gateway is wedged enough that Gateway-backed commands won’t be trustworthy. So openclaw logs failing is expected now.

Use file/process-level checks instead:

PID=377477

sudo timeout 20s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat \
  -s 240 2>&1 \
  | sed -n 's/.*"\([^"]*\)".*/\1/p' \
  | grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
  | sort | uniq -c | sort -nr | head -50

And raw logs, bypassing Gateway RPC:

#
ls -lt ~/.openclaw/logs /tmp/openclaw 2>/dev/null | head -30

tail -n 500 "$(ls -t ~/.openclaw/logs/*.log /tmp/openclaw/openclaw-*.log 2>/dev/null | head -1)" \
  | grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error' \
  | tail -100

If perf exists, this is the clearest “what is CPU doing” command:

sudo timeout 15s perf top -p 377477 --stdio | head -100

At this point the main fact is: Gateway is not merely slow; it’s CPU-saturated in libuv workers and WS/RPC is failing. Capture one strace count / perf sample, then restart Gateway — otherwise the system is too wedged to debug through OpenClaw commands.

crystal warren
#

PID gateway-pid
PID: command not found

austere gorgeBOT
#

Sorry — I meant replace <gateway-pid> with the actual number. In your case it’s:

PID=377477

Then run commands using $PID, for example:

PID=377477

sudo timeout 20s strace -f -p "$PID" \
  -e trace=statx,openat,newfstatat \
  -s 240 2>&1 \
  | sed -n 's/.*"\([^"]*\)".*/\1/p' \
  | grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
  | sort | uniq -c | sort -nr | head -50

Or simply:

sudo timeout 15s perf top -p 377477 --stdio | head -100
crystal warren
#

ls -lt ~/.openclaw/logs /tmp/openclaw 2>/dev/null | head -30
/tmp/openclaw:
total 13272
-rw-rw-r-- 1 najef najef 10994156 Apr 29 14:12 openclaw-2026-04-29.log
drwx------ 2 najef najef 4096 Apr 29 11:14 openclaw-active-memory-YiXRVS
drwx------ 2 najef najef 4096 Apr 29 11:13 tts-73ZiJY
drwx------ 2 najef najef 4096 Apr 29 10:15 tts-RMy68Y
drwxrwxr-x 2 najef najef 4096 Apr 29 09:50 downloads
-rw-rw-r-- 1 najef najef 2545032 Apr 28 23:59 openclaw-2026-04-28.log
drwx------ 2 najef najef 4096 Apr 28 23:56 tts-ScNU2h
drwx------ 2 najef najef 4096 Apr 28 23:56 tts-THeNJm
drwx------ 2 najef najef 4096 Apr 28 23:26 tts-4ubDGh
drwx------ 2 najef najef 4096 Apr 28 23:24 tts-9XIqxu

.openclaw/logs:
total 280
-rw------- 1 najef najef 3289 Apr 29 11:28 config-health.json
-rw------- 1 najef najef 264115 Apr 29 00:00 config-audit.jsonl
-rw-rw-r-- 1 najef najef 918 Apr 28 22:40 gateway-restart.log
-rw-rw-r-- 1 najef najef 4936 Apr 22 23:19 commands.log

austere gorgeBOT
#

Good — use the raw log file directly:

tail -n 1000 /tmp/openclaw/openclaw-2026-04-29.log \
  | grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error|liveness|diagnostic|worker|event-loop' \
  | tail -150

Also check whether logs are being spammed by one message:

tail -n 5000 /tmp/openclaw/openclaw-2026-04-29.log \
  | sed 's/[0-9a-f]\{8\}-[0-9a-f-]\{27,36\}/<uuid>/g' \
  | sort | uniq -c | sort -nr | head -30

That openclaw-active-memory-* temp dir is another clue Active Memory was running recently. Not proof by itself, but it lines up with the earlier timeout logs.
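One caveat with the spam check above: it masks UUIDs only, so repeated warnings that differ by timestamp or by an `age=...s` counter will not collapse into one count. A hedged sketch that also masks those fields follows; the function names are made up for illustration, and the patterns are assumptions based on the log lines seen in this thread.

```shell
# Mask volatile fields so identical messages collapse under `uniq -c`.
# Assumption: the log uses ISO-8601 timestamps, standard UUIDs, and age=<n>s counters.
normalize_log() {
  sed -E \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?/<ts>/g' \
    -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<uuid>/g' \
    -e 's/age=[0-9]+s/age=<n>s/g'
}

# $1 = log file, $2 = number of trailing lines to inspect (default 5000)
top_repeated() {
  tail -n "${2:-5000}" "$1" | normalize_log | sort | uniq -c | sort -nr | head -30
}
```

For example: `top_repeated /tmp/openclaw/openclaw-2026-04-29.log`.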

crystal warren
#

RN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:07:44.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"{"subsystem":"diagnostic"}","parentNames":["openclaw"],"date":"2026-04-29T12:08:14.720Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:08:14.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6793s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"{"subsystem":"diagnostic"}","parentNames":["openclaw"],"date":"2026-04-29T12:08:44.721Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:08:44.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6793s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6823s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"{"subsystem":"diagnostic"}","parentNames":["openclaw"],"date":"2026-04-29T12:09:14.721Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:09:14.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6823s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6853s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"

austere gorgeBOT
#

This is a real smoking gun, but it’s a different one than Orion:

stuck session:
sessionKey=agent:arnold:main:heartbeat
state=processing
age=6800s+
queueDepth=0

Meaning: Gateway thinks Arnold’s heartbeat session has been processing for ~2 hours, with no queue behind it. That explains “agents stop mid-work”: at least one heartbeat/session lane is stuck in processing.
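To see every stuck session and how long it has been stuck at a glance, the warning lines can be reduced to one row per session key. This is a rough sketch that assumes the `stuck session: ... age=<n>s` message format shown above; `stuck_summary` is a hypothetical helper name.

```shell
# stdin: raw log lines. Prints "<sessionKey> <longest age as XhYYm>" per stuck session.
# Assumes the "stuck session:" message layout seen in this thread.
stuck_summary() {
  grep -o 'stuck session: sessionId=[^ ]* sessionKey=[^ ]* state=[^ ]* age=[0-9]*s' \
    | sed -E 's/.*sessionKey=([^ ]+) .*age=([0-9]+)s/\1 \2/' \
    | awk '{ if ($2 + 0 > max[$1] + 0) max[$1] = $2 + 0 }
           END { for (k in max) printf "%s %dh%02dm\n", k, max[k] / 3600, (max[k] % 3600) / 60 }' \
    | sort
}
```

Possible usage: `stuck_summary < /tmp/openclaw/openclaw-2026-04-29.log`. An age of 6793s is reported as 1h53m.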

Now we need the first cause, not the repeated warning. Run:

LOG=/tmp/openclaw/openclaw-2026-04-29.log
#

grep -nE 'agent:arnold:main:heartbeat|sessionId=arnold|stuck session recovery|heartbeat.*arnold|active-memory.*arnold|timeout|abort|error' "$LOG" | tail -250

Also specifically:

grep -n 'stuck session: sessionId=arnold' "$LOG" | head -1

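Then show the ~200 lines before that first match. This version captures the line number itself, so there is no placeholder to fill in:

LINE=$(grep -n 'stuck session: sessionId=arnold' "$LOG" | head -1 | cut -d: -f1)
START=$((LINE > 200 ? LINE - 200 : 1))
sed -n "${START},$((LINE+20))p" "$LOG"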

What I’m looking for:

  • stuck session recovery skipped ... active run
  • active-memory ... arnold ... timeout
  • model timeout / abort cleanup failure
  • heartbeat run started but never ended

Current diagnosis: a scheduled heartbeat session (arnold:main:heartbeat) got stuck in processing; the Gateway keeps warning every 30s but is not recovering it. That’s a bug-shaped state.

#

I did the bootstrap check first: I don’t have a useful resume file/highlight trail for this thread, so I’m going off what you posted here.

This does look like a real bug on v2026.4.26. I can see several fixes landed on main right after April 26, 2026 that match “agent stops until I poke it again” pretty closely:

  • stale session-lane recovery
  • defer agent dispatch after accept
  • yield after accepted ack
  • heartbeat/cron busy-lane deferral

So my read: you’re probably hitting a stuck queue/session-lane problem, not just a bad prompt/config.

Two good paths:

  1. Fastest check: test current main / next build if you’re comfortable.
  2. If staying on 2026.4.26: grab one real stall capture:
    • openclaw status --all
    • openclaw logs --follow
    • reproduce once
    • paste the actual log lines around the stall
      (the ls output helps show where logs are, but not why it wedged)

If you’re using subagents, cron, or heartbeats, say so too — that matters here.

If you paste openclaw status --all plus ~100-200 log lines from the stall, I can help narrow which of those fixes you’re actually hitting.

#

Bootstrap check first: I don’t have a useful resume file/highlight trail for this thread, so I’m rebuilding context from what you pasted here. I may be missing earlier details, but this log is enough to say something concrete.

That log is the smoking gun:

stuck session: sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0

So yes — this looks like a real stuck-session / heartbeat-lane bug on v2026.4.26, not just “the model got lazy.”

What it means:

  • the heartbeat session stayed in processing
  • it sat there for ~2 hours
  • queueDepth=0 means nothing was queued behind it, so the lane was wedged rather than working through a backlog

That lines up really well with fixes that landed after v2026.4.26:

  • stale session-lane recovery
  • defer agent dispatch after accept
  • yield after accepted ack
  • heartbeat busy-lane deferral

So my recommendation is:

  1. Best fix: test current main / next release if you can.
    Your version is just before the stuck-session recovery work.

  2. Best workaround on 2026.4.26: temporarily disable heartbeat for that agent and see if the “agents stop mid work” symptom disappears.
    Docs use:

    agents: {
      defaults: {
        heartbeat: {
          every: "0m"
        }
      }
    }

then restart the gateway.

  3. If you want heartbeats back on a newer build, I’d also try:
    • isolatedSession: true
    • lightContext: true
    • skipWhenBusy: true
      That last one is from the newer heartbeat fix.

If you want, paste:

  • openclaw status --all
  • the 20-50 log lines before the first stuck session warning

and I can help tell whether this is just heartbeat wedging or heartbeat + another lane-pressure issue.

crystal warren
#

How do I give you the 20-50 log lines before the first stuck session warning?

austere gorgeBOT
#

Use this:

LOG=/tmp/openclaw/openclaw-2026-04-29.log

LINE=$(grep -n 'stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat' "$LOG" | head -1 | cut -d: -f1)

echo "first stuck-session line: $LINE"

sed -n "$((LINE-50)),$((LINE+10))p" "$LOG"

If that’s too much/noisy, filter it:

sed -n "$((LINE-50)),$((LINE+10))p" "$LOG" \
  | grep -Ei 'arnold|heartbeat|active-memory|timeout|abort|error|session|processing|recovery|model|failover'

Paste that output in a code block.
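If the pattern does not match, or the first match is within the first 50 lines, the `sed` range above ends up with an empty or non-positive start and fails. A more defensive sketch of the same idea follows; `context_before_first` is a hypothetical helper name, and it uses only standard `grep`, `head`, `cut`, and `sed`.

```shell
# Print the lines around the first occurrence of a fixed string in a log file.
# $1 = log file, $2 = fixed string, $3 = lines before (default 50), $4 = lines after (default 10)
context_before_first() {
  cbf_line=$(grep -n -F -- "$2" "$1" | head -1 | cut -d: -f1)
  if [ -z "$cbf_line" ]; then
    echo "no match for: $2" >&2
    return 1
  fi
  cbf_start=$((cbf_line - ${3:-50}))
  # Clamp to line 1 so sed never receives a zero or negative address.
  if [ "$cbf_start" -lt 1 ]; then cbf_start=1; fi
  sed -n "${cbf_start},$((cbf_line + ${4:-10}))p" "$1"
}
```

For example: `context_before_first "$LOG" 'stuck session: sessionId=arnold' 50 10`.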

crystal warren
#

How can I safely downgrade?

austere gorgeBOT
#

Yes — safe downgrade flow:

mkdir -p ~/openclaw-backups
openclaw backup create --verify --output ~/openclaw-backups

Then stop the wedged gateway:

openclaw gateway stop

If that can’t connect, terminate the PID you saw:

#
kill -TERM 377477
sleep 5
ps -p 377477 || true

Then install a previous version, for example:

npm install -g openclaw@2026.4.25
openclaw --version
openclaw gateway install --force
openclaw gateway start

If you want to go back to 2026.4.24 instead:

#
npm install -g openclaw@2026.4.24

After downgrade:

openclaw status
openclaw gateway status

One caution: don’t downgrade too far. OpenClaw has config-version guardrails, so an older binary may refuse some service/config mutations if openclaw.json was touched by a newer version. If that happens, stop and paste the exact error before forcing anything.

crystal warren
#

i want v23 or even v22

austere gorgeBOT
#

Yep, both exist: 2026.4.23 and 2026.4.22.

I’d try 2026.4.23 first:

mkdir -p ~/openclaw-backups
openclaw backup create --verify --output ~/openclaw-backups

openclaw gateway stop
npm install -g openclaw@2026.4.23
openclaw --version
openclaw gateway install --force
openclaw gateway start
openclaw status

If you want 2026.4.22 instead:

#
openclaw gateway stop
npm install -g openclaw@2026.4.22
openclaw --version
openclaw gateway install --force
openclaw gateway start

If gateway stop can’t reach the wedged gateway, use the PID:

kill -TERM 377477
sleep 5

If the older binary refuses because config was touched by newer OpenClaw, don’t force yet — paste that exact error.

crystal warren
#

Is v22 too far?

austere gorgeBOT
#

2026.4.22 is not crazy far, but I’d avoid jumping straight there unless 2026.4.23 still has the problem.

Reason: each step back increases the chance of config/state migrations or package/runtime dependency differences biting you. 2026.4.23 is a safer rollback target and likely far enough if this came from .24/.25/.26.

Recommended order:

npm install -g openclaw@2026.4.23

Test. If still broken:

npm install -g openclaw@2026.4.22

Definitely make the backup first:

#
openclaw backup create --verify --output ~/openclaw-backups
crystal warren
#

what should I write for issue I will open?

austere gorgeBOT
#

Use something like this:

## Summary

OpenClaw Gateway becomes CPU-saturated and agents stop progressing mid-work until I send another message. During the bad state, HTTP `/healthz` and sometimes `/readyz` can respond, but Gateway WS/RPC commands time out or close during connect.

## Environment

- OpenClaw: 2026.4.26 (be8c246)
- OS: Ubuntu 24.04.4 LTS
- Node: v24.14.1
- Install: npm global via nvm
- Gateway bind: loopback, port 18789
- systemd: not installed

## Symptoms

- Agents stop mid-work and only continue after I send a message.
- `openclaw-gateway` reaches very high CPU:
  - observed 322% CPU
  - later observed 504% CPU
- Gateway RPC becomes unreliable:
  - `openclaw gateway stability --json --limit 100` failed with `gateway timeout after 10000ms`
  - `openclaw cron show ...` failed with `gateway closed (1000 normal closure): no close reason`
  - `openclaw logs --plain ...` failed with `gateway closed (1000)`
- HTTP liveness still worked:
  - `curl http://127.0.0.1:18789/healthz` → `{"ok":true,"status":"live"}`
  - `curl http://[::1]:18789/healthz` → `{"ok":true,"status":"live"}`
  - `/readyz` returned `{"ready":true,"failing":[]}` at least once

## Evidence

Thread view while CPU was high:

```text
openclaw-gateway 504% CPU
```

Per-thread CPU showed many hot libuv worker threads:

```text
libuv-worker 78.4%
libuv-worker 63.9%
libuv-worker 46.5%
libuv-worker 45.1%
...
openclaw-gateway main thread ~21%
```

`strace -f -p <gateway-pid> -c` showed mostly:

```text
epoll_pwait
statx
futex
write/read
```

Path-level strace showed repeated OpenClaw package/plugin metadata resolution:

```text
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json = ENOENT
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/extensions
/home/najef/.openclaw/cron/jobs-state.json
```

Logs repeatedly showed a stuck heartbeat session:

```text
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0
...
```

There was also a cron task repeatedly timing out:

```json
{
  "runtime": "cron",
  "label": "Orion Heartbeat",
  "agentId": "neon",
  "childSessionKey": "agent:neon:telegram:direct:2101884310",
  "status": "timed_out",
  "error": "cron: job execution timed out"
}
```

Active Memory was enabled globally and configured for many agents. It was pinned to MiniMax M2.5:

```json
{
  "plugins.entries.active-memory.config.model": "minimax/MiniMax-M2.5",
  "timeoutMs": 15000,
  "thinking": "low"
}
```

Logs showed Active Memory timeouts:

```text
active-memory: agent=arnold session=agent:arnold:telegram:direct:2101884310 activeProvider=minimax activeModel=MiniMax-M2.5 start timeoutMs=15000
active-memory ... done status=timeout elapsedMs=16511 summaryChars=0
embedded_run_failover_decision ... failoverReason="timeout" ... timedOut=true aborted=true
```

MCP also had a MiniMax server configured:

```json
{
  "command": "uvx",
  "args": ["minimax-coding-plan-mcp", "-y"]
}
```

Logs showed:

```text
failed to start server "minimax" (uvx minimax-coding-plan-mcp -y): Error: MCP server connection timed out after 30000ms
```

## Current hypothesis

Gateway appears to enter CPU saturation / worker-pool saturation. It may be triggered by heartbeat/cron/active-memory/MCP setup, but the deeper symptom is many busy libuv-worker threads plus repeated package/plugin metadata resolution. Once saturated, WS/RPC commands fail while simple HTTP health endpoints may still respond.

## Expected behavior

Gateway should recover stuck heartbeat/session work, not remain in `processing` for hours, and WS/RPC should not become unusable under this condition.

## Workarounds considered

- Downgrade to 2026.4.23 or 2026.4.22
- Disable Active Memory globally
- Disable bundled MCP / remove `minimax` MCP
- Reduce heartbeat frequency

I have not yet confirmed which of these avoids the CPU saturation.
crystal warren
#

Maybe leave out things we are not sure are an issue, like MiniMax M2.5?

austere gorgeBOT
#

Yes, agreed. Keep the issue focused on confirmed evidence, and put uncertain things under “possibly related” or omit them.

Use this cleaner version:

## Summary

OpenClaw Gateway becomes CPU-saturated and agents stop progressing mid-work until I send another message. During the bad state, simple HTTP health endpoints can respond, but Gateway WS/RPC commands time out or close during connect.

## Environment

- OpenClaw: 2026.4.26 (be8c246)
- OS: Ubuntu 24.04.4 LTS
- Node: v24.14.1
- Install: npm global via nvm
- Gateway bind: loopback, port 18789
- systemd: not installed

## Symptoms

- Agents stop mid-work and only continue after I send another message.
- `openclaw-gateway` reaches very high CPU:
  - observed 322% CPU
  - later observed 504% CPU
- Gateway RPC becomes unreliable:
  - `openclaw gateway stability --json --limit 100` failed with `gateway timeout after 10000ms`
  - `openclaw cron show ...` failed with `gateway closed (1000 normal closure): no close reason`
  - `openclaw logs --plain ...` failed with `gateway closed (1000)`
- HTTP liveness still worked:
  - `curl http://127.0.0.1:18789/healthz` → `{"ok":true,"status":"live"}`
  - `curl http://[::1]:18789/healthz` → `{"ok":true,"status":"live"}`
  - `/readyz` returned `{"ready":true,"failing":[]}` at least once

## Evidence

Process list while CPU was high:

```text
openclaw-gateway 504% CPU
```

Per-thread CPU showed many hot libuv-worker threads:

```text
libuv-worker 78.4%
libuv-worker 63.9%
libuv-worker 46.5%
libuv-worker 45.1%
...
openclaw-gateway main thread ~21%
```

`strace -f -p <gateway-pid> -c` showed activity including:

```text
epoll_pwait
statx
futex
write/read
```

Path-level strace showed repeated package/plugin/cron-state paths:

```text
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json = ENOENT
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/extensions
/home/najef/.openclaw/cron/jobs-state.json
```

Logs repeatedly showed a stuck heartbeat session:

```text
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0
...
```

A cron task also timed out while this was happening:

```json
{
  "runtime": "cron",
  "label": "Orion Heartbeat",
  "agentId": "neon",
  "childSessionKey": "agent:neon:telegram:direct:2101884310",
  "status": "timed_out",
  "error": "cron: job execution timed out"
}
```


## Expected behavior

Gateway should recover or fail stuck heartbeat/session work, not remain in `processing` for hours, and WS/RPC should not become unusable while HTTP health endpoints still respond.

## Notes