#Agents stop mid work
That sounds like a real stuck-run issue, not normal behavior. 2026.4.26 is currently the latest GitHub release I see, so first thing is to capture the stuck state before nudging it again. (github.com)
When it happens, please grab:
openclaw status --all
openclaw tasks list --json
openclaw tasks audit --json
openclaw logs --plain --limit 300
In the logs, look especially for:
stopReason: "length"
finish_reason: length
approval-timeout
An async command did not run
LLM request timed out
model stream idle
Provider in cooldown
fetch failed
processing
There are a few known “looks frozen until I poke it” patterns:
- output token limit / stopReason: length can silently strand work; there’s an existing issue describing exactly that kind of freeze. (github.com)
- async/approval mismatch can make the agent say “I’ll do it now” but not actually continue; there’s also a similar regression report with approval-timeout / An async command did not run as useful log strings. (github.com)
- local or slow model backends can idle-timeout mid-run; for those, prefer increasing models.providers.<id>.timeoutSeconds before touching the whole agent timeout. Docs: https://docs.openclaw.ai/concepts/agent-loop
If the logs show one of those signatures, file it with the captured output here: https://github.com/openclaw/openclaw/issues — especially since you’re already on latest.
Short-term workaround: try splitting very large tasks into phases/subagents, and if you’re on local/OpenAI-compatible models, raise the provider timeout for that provider.
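If the log is too long to read by eye, a rough tally of the signatures listed above can show which pattern dominates. This is only a sketch: count_signatures is a hypothetical helper name, not an OpenClaw command, and it assumes nothing about the log format beyond the literal strings (the two "length" signatures are shortened to their key names).

```shell
# Count how many log lines contain each stuck-run signature.
# Reads log text from stdin; prints "<count><TAB><signature>" per signature.
count_signatures() {
  local log sig
  log="$(cat)"
  for sig in 'stopReason' 'finish_reason' 'approval-timeout' \
             'An async command did not run' 'LLM request timed out' \
             'model stream idle' 'Provider in cooldown' 'fetch failed' \
             'processing'; do
    # -c counts matching lines, -F treats the signature as a literal string.
    printf '%s\t%s\n' "$(printf '%s\n' "$log" | grep -cF -- "$sig")" "$sig"
  done
}
```

Usage, with the log command from this thread: openclaw logs --plain --limit 1000 | count_signatures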
openclaw status --all
What I think is important in the log:
Gateway: bind=loopback (127.0.0.1), port=18789 (service args)
Probe target: ws://127.0.0.1:18789
Looks like the actual log snippet didn’t come through — Discord may have eaten it.
Please paste either as a code block or attach a .txt file with:
openclaw status --all
openclaw tasks list --json
openclaw tasks audit --json
openclaw logs --plain --limit 300
Most useful: the log lines right when the agent stops, plus the timestamp/message where you “wake” it again. status --all is designed to redact tokens, but still skim for anything private before posting.
That part looks normal/healthy: loopback 127.0.0.1:18789 just means the Gateway is listening locally and the CLI probes it over local WebSocket. It doesn’t explain the agent stopping.
The useful parts will be lower in status --all / logs — especially Tasks, Sessions, and the log lines around the stop. Please paste as a code block or attach .txt with:
openclaw status --all
openclaw tasks list --json
openclaw tasks audit --json
openclaw logs --plain --limit 300
If it’s huge, search/filter for these and paste matches:
openclaw logs --plain --limit 1000 | grep -Ei 'timeout|idle|stopReason|length|approval|processing|stuck|error|cooldown|abort|disconnect'
Docs for the relevant tools:
https://docs.openclaw.ai/help/faq
https://docs.openclaw.ai/automation/tasks
Listening: [::1]:18789
Node service │ systemd not installed
Gateway restart attempts (tail): /home/najef/.openclaw/logs/gateway-restart.log
[2026-04-23T20:38:11Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-23T20:38:11Z] openclaw restart done source=update
[2026-04-25T09:24:22Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-25T09:24:22Z] openclaw restart done source=update
[2026-04-25T20:49:39Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-28T20:40:09Z] openclaw restart attempt source=update target=openclaw-gateway.service
[2026-04-28T20:40:09Z] openclaw restart done source=update
! Channel issues skipped (gateway unreachable)
Gateway health:
timeout
Pasteable debug report. Auth tokens redacted.
openclaw tasks list --json is really long; a quick scan only shows cron jobs timing out.
"taskId": "662599a4-e33c-4739-bd0f-d40cb2090076",
"runtime": "cron",
"sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"requesterSessionKey": "",
"ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"scopeKind": "system",
"childSessionKey": "agent:neon:telegram:direct:2101884310",
"agentId": "neon",
"runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1776852000027",
"label": "Orion Heartbeat",
"task": "Orion Heartbeat",
"status": "timed_out",
"deliveryStatus": "not_applicable",
"notifyPolicy": "silent",
"createdAt": 1776852000027,
"startedAt": 1776852000027,
"endedAt": 1776852120036,
"lastEventAt": 1776852120036,
"cleanupAfter": 1777456920036,
"error": "cron: job execution timed out"
},
openclaw tasks audit --json
{
"kind": "task",
"severity": "warn",
"code": "lost",
"detail": "backing session missing",
"ageMs": 294423004,
"status": "lost",
"token": "639b39ff-d575-45dd-91b9-15fa04724bad",
"task": {
"taskId": "639b39ff-d575-45dd-91b9-15fa04724bad",
"runtime": "cron",
"sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"requesterSessionKey": "",
"ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"scopeKind": "system",
"childSessionKey": "agent:neon:telegram:direct:2101884310",
"agentId": "neon",
"runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1777159870461",
"label": "Orion Heartbeat",
"task": "Orion Heartbeat",
"status": "lost",
"deliveryStatus": "not_applicable",
"notifyPolicy": "silent",
"createdAt": 1777159870461,
"startedAt": 1777159870461,
"endedAt": 1777160193868,
"lastEventAt": 1777160193868,
"cleanupAfter": 1777764993868,
"error": "backing session missing"
That’s a much better clue. The key problem is not the cron entry itself — it’s this:
! Channel issues skipped (gateway unreachable)
Gateway health:
timeout
Listening: [::1]:18789
So the Gateway has something listening, but the health/RPC probe is timing out. That can absolutely look like “agents stop until I poke them”.
Two things to check next:
openclaw health --verbose --timeout 30000
curl -m 5 http://127.0.0.1:18789/healthz
curl -g -m 5 'http://[::1]:18789/healthz'
If [::1] works but 127.0.0.1 times out/fails, it’s likely an IPv6-vs-IPv4 loopback bind mismatch. In that case try forcing IPv4 loopback:
openclaw config set gateway.bind custom
openclaw config set gateway.customBindHost 127.0.0.1
openclaw gateway restart
openclaw health --verbose
The cron timeout you pasted is also useful:
label: Orion Heartbeat
status: timed_out
error: cron: job execution timed out
It timed out after ~120s. If that heartbeat is expected to do real work, bump it:
openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
openclaw cron edit 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --timeout-seconds 600
But first I’d focus on the Gateway health timeout. Please paste the output of the three health/curl commands above — that’ll tell us whether this is bind mismatch vs gateway event-loop wedged.
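The ~120s figure comes straight from the task record: endedAt minus startedAt, both in epoch milliseconds. A small sketch of that arithmetic, where duration_s is a hypothetical helper name:

```shell
# Whole seconds between two epoch-millisecond timestamps (start, end).
duration_s() {
  echo $(( ($2 - $1) / 1000 ))
}

# startedAt and endedAt from the timed_out "Orion Heartbeat" task above.
duration_s 1776852000027 1776852120036   # prints 120
```

The same helper applied to the lost task in the audit output (1777159870461 to 1777160193868) gives 323 seconds, so that run lasted longer before its session went missing.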
Yep — that audit output is useful, but it’s mostly a symptom/history marker, not the stuck run itself.
lost / backing session missing means the task ledger has a cron task, but the backing child session (agent:neon:telegram:direct:2101884310) is gone. For old retained tasks this is a warning until cleanup.
Run:
openclaw tasks maintenance --json
openclaw tasks maintenance --apply
Then check the actual cron:
openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
openclaw cron runs --id 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --limit 20
If Orion Heartbeat keeps timing out/lost, temporarily disable it to see if normal agents stop freezing:
openclaw cron edit 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --disable
openclaw gateway restart
But the bigger red flag is still:
Gateway health: timeout
Listening: [::1]:18789
So I’d prioritize confirming whether the gateway is wedged or only bound weirdly:
openclaw health --verbose --timeout 30000
curl -m 5 http://127.0.0.1:18789/healthz
curl -g -m 5 'http://[::1]:18789/healthz'
If health still times out, paste that plus:
openclaw logs --plain --limit 500 | grep -Ei '639b39ff|Orion Heartbeat|backing session|timeout|stuck|processing|health|error'
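The ageMs field in the audit entry gives a feel for how stale that lost task is, which supports reading it as history rather than a live stuck run. A minimal sketch of the conversion, with ms_to_dhm as a hypothetical helper name:

```shell
# Convert a millisecond age into days, hours, and minutes.
ms_to_dhm() {
  local s=$(( $1 / 1000 ))
  echo "$(( s / 86400 ))d $(( (s % 86400) / 3600 ))h $(( (s % 3600) / 60 ))m"
}

# ageMs from the "lost" audit entry above.
ms_to_dhm 294423004   # prints 3d 9h 47m
```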
How do I upload a log file? The .txt is too much to copy and paste.
Easiest: save it to a file, then drag/drop the .txt into this Discord thread or use the + / upload file button.
Good commands:
openclaw status --all > openclaw-status.txt
openclaw tasks audit --json > openclaw-tasks-audit.json
openclaw logs --plain --limit 1000 > openclaw-logs.txt
Then upload those files here.
If Discord says it’s too large, make a smaller filtered log:
openclaw logs --plain --limit 3000 \
| grep -Ei 'Orion Heartbeat|backing session|timeout|stuck|processing|health|error|lost|cron|abort|disconnect' \
> openclaw-filtered-logs.txt
Please skim for anything private before uploading. status --all redacts auth tokens, but logs can still include paths, usernames, message text, etc.
I cannot drag and drop, and the + / upload file button does not exist.
No worries — we can do it in small pasted chunks.
Run this and paste the output:
openclaw health --verbose --timeout 30000 2>&1
curl -m 5 http://127.0.0.1:18789/healthz 2>&1
curl -g -m 5 'http://[::1]:18789/healthz' 2>&1
Then this filtered log, limited enough for Discord:
openclaw logs --plain --limit 3000 \
| grep -Ei 'Orion Heartbeat|backing session|timeout|stuck|processing|health|error|lost|cron|abort|disconnect' \
| tail -80
If that’s still too much, use:
... | tail -40
Main thing I need next is: do the two curl /healthz commands work, or do they timeout? That tells us if this is IPv4/IPv6 bind weirdness or a wedged Gateway.
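To make the decision rule explicit, here is a rough sketch that maps the two curl exit statuses to a diagnosis. classify_loopback is a hypothetical helper, and the wording of each verdict is an assumption, not OpenClaw output.

```shell
# Map the exit statuses of the two /healthz probes (0 = success) to a verdict.
# $1 = status of the 127.0.0.1 probe, $2 = status of the [::1] probe.
classify_loopback() {
  local v4=$1 v6=$2
  if [ "$v4" -eq 0 ] && [ "$v6" -eq 0 ]; then
    echo "both reachable: bind mismatch unlikely"
  elif [ "$v4" -ne 0 ] && [ "$v6" -eq 0 ]; then
    echo "only ::1 reachable: possible IPv4/IPv6 bind mismatch"
  elif [ "$v4" -eq 0 ]; then
    echo "only 127.0.0.1 reachable: check what is bound on ::1"
  else
    echo "neither reachable: Gateway may be wedged or down"
  fi
}

# Usage sketch (same URLs as above; -f makes HTTP errors return non-zero):
#   curl -fsS -m 5 http://127.0.0.1:18789/healthz >/dev/null; v4=$?
#   curl -g -fsS -m 5 'http://[::1]:18789/healthz' >/dev/null; v6=$?
#   classify_loopback "$v4" "$v6"
```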
openclaw logs --plain --limit 1000 | grep -Ei 'timeout|idle|stopReason|length|approval|processing|stuck|error|cooldown|abort|disconnect'
-info plugins {"subsystem":"plugins"} active-memory: agent=arnold session=agent:arnold:telegram:direct:2101884310 activeProvider=minimax activeModel=MiniMax-M2.5 start timeoutMs=15000 queryChars=808
-info plugins {"subsystem":"plugins"} active-memory: agent=arnold session=agent:arnold:telegram:direct:2101884310 activeProvider=minimax activeModel=MiniMax-M2.5 done status=timeout elapsedMs=16511 summaryChars=0
-info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1006 reason=n/a conn=a45cc6f8-a0f2-4bde-b412-67eb51a0c3c2
-warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_failover_decision","tags":["error_handling","failover","assistant","surface_error"],"runId":"active-memory-moju70jx-b2082d40","stage":"assistant","decision":"surface_error","failoverReason":"timeout","profileFailureReason":"timeout","provider":"minimax","model":"MiniMax-M2.5","sourceProvider":"minimax","sourceModel":"MiniMax-M2.5","profileId":"sha256:d347a6986db6","fallbackConfigured":true,"timedOut":true,"aborted":true} embedded run failover decision
-error telegram/api {"subsystem":"telegram/api"} telegram sendChatAction failed: Network request for 'sendChatAction' failed!
-info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1006 reason=n/a conn=849b92d7-abd0-413a-9eda-b86ede36b304
-info gateway/ws {"subsystem":"gateway/ws"} webchat disconnected code=1006 reason=n/a conn=0dd1de89-cd2b-46af-bef4-5529d059cf45
-error (node:377477) ExperimentalWarning: SQLite is an experimental feature and might change at any time
-warn gateway/ws {"subsystem":"gateway/ws"} handshake timeout conn=8e95457f-cb2a-4ba3-ba65-1762feeac639 peer=127.0.0.1:55550->127.0.0.1:18789 remote=127.0.0.1
-warn gateway/ws {"subsystem":"gateway/ws"} {"cause":"handshake-timeout","handshake":"failed","durationMs":37191,"lastFrameType":"req","lastFrameMethod":"connect","lastFrameId":"c219e03f-74ab-4059-b95a-799ee3196628","host":"openclaw.tail967935.ts.net","origin":"https://openclaw.tail967935.ts.net","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36 Edg/147.0.0.0","forwardedFor":"100.108.130.102","remoteAddr":"127.0.0.1","remotePort":55550,"localAddr":"127.0.0.1","localPort":18789,"endpoint":"127.0.0.1:55550->127.0.0.1:18789","handshakeMs":10214} closed before connect conn=8e95457f-cb2a-4ba3-ba65-1762feeac639 peer=127.0.0.1:55550->127.0.0.1:18789 remote=127.0.0.1 fwd=100.108.130.102 origin=https://openclaw.tail967935.ts.net host=openclaw.tail967935.ts.net ua=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36 Edg/147.0.0.0 code=1000 reason=n/a
-warn bundle-mcp {"subsystem":"bundle-mcp"} failed to start server "minimax" (uvx minimax-coding-plan-mcp -y): Error: MCP server connection timed out after 30000ms
-error [tools] edit failed: Could not find the exact text in /home/najef/.openclaw/agents/arnold/workspace/memory/2026-04-29.md. The old text must match exactly including all whitespace and newlines.
This log points at two likely culprits:
- Active Memory is timing out
active-memory ... MiniMax-M2.5 ... done status=timeout elapsedMs=16511
embedded_run_failover_decision ... failoverReason="timeout" ... aborted=true
Active Memory is a blocking pre-reply memory sub-agent, so if it keeps timing out, every affected agent turn can feel like it “stops”.
Quick test: temporarily turn it off and see if agents stop freezing:
/active-memory off --global
Or scope it away from the affected agent/session. Docs: https://docs.openclaw.ai/concepts/active-memory
- Gateway local WebSocket handshakes are timing out
gateway/ws handshake timeout ... peer=127.0.0.1...->127.0.0.1:18789
That matches your earlier Gateway health: timeout. It suggests the Gateway is reachable but sometimes too wedged/busy to complete local RPC handshakes.
The SQLite warning is harmless. The Telegram sendChatAction failure may be network noise unless it repeats a lot.
Next I’d do this:
openclaw health --verbose --timeout 30000
curl -m 5 http://127.0.0.1:18789/healthz
curl -g -m 5 'http://[::1]:18789/healthz'
And as a mitigation test:
/active-memory off --global
openclaw gateway restart
Then try the same kind of agent task again. If it stops freezing, the issue is probably Active Memory + MiniMax timeout/failover pressure rather than the main agent loop.
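To check how often Active Memory is actually timing out, and by how much, the elapsedMs values can be pulled out of the log. A minimal sketch, assuming the log lines keep the "active-memory: ... status=timeout elapsedMs=N" shape seen in the paste above; active_memory_timeouts is a hypothetical helper name.

```shell
# Print the elapsedMs value of every Active Memory run that ended in a timeout.
# Reads log lines from stdin, one value per output line.
active_memory_timeouts() {
  grep -F 'active-memory:' \
    | grep -F 'status=timeout' \
    | sed -n 's/.*elapsedMs=\([0-9][0-9]*\).*/\1/p'
}
```

Usage: openclaw logs --plain --limit 1000 | active_memory_timeouts. Values that cluster just above the configured timeout would suggest the model call is being cut off rather than failing fast.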
This adds two concrete suspects:
failed to start server "minimax" ... timed out after 30000ms
That means every run that exposes MCP tools may block ~30s trying to start the minimax MCP server. If you don’t actively need that MCP server, disable it for the test:
openclaw mcp show minimax --json
openclaw mcp unset minimax
openclaw gateway restart
If you want a broader test without removing the server config, deny bundled MCP tools temporarily:
openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart
Also keep Active Memory off for the test, since your previous log showed MiniMax timing out there too:
/active-memory off --global
Then try one normal agent task again.
The edit failed: Could not find the exact text...memory/2026-04-29.md line is probably not the main freeze — that’s a failed exact-text edit, often from stale/concurrent file contents.
The Tailscale line says the Control UI/browser handshake is timing out through openclaw.tail...; close that browser tab during testing too, just to reduce noise.
najef@OpenClaw-Server:~/.openclaw$ openclaw health --verbose --timeout 30000
🦞 OpenClaw 2026.4.26 (be8c246) — Less clicking, more shipping, fewer "where did that file go" moments.
│
◇
Gateway connection:
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback
Telegram: configured
Agents: sarah, marcus, david, jordan, emma, gemini, susan, arthur, edgar, oscar, virginia, james, orion, riley, alex, neon (default), arnold
Heartbeat interval: 25m (sarah), 25m (marcus), 25m (david), 25m (jordan), 25m (emma), 25m (gemini), 25m (susan), 25m (arthur), 25m (edgar), 25m (oscar), 25m (virginia), 25m (james), 25m (orion), 25m (riley), 25m (alex), 25m (neon), 25m (arnold)
Session store (marcus): /home/najef/.openclaw/agents/marcus/sessions/sessions.json (2 entries)
- agent:marcus:main:heartbeat (12m ago)
- agent:marcus:main (42m ago)
Session store (susan): /home/najef/.openclaw/agents/susan/sessions/sessions.json (2 entries)
- agent:susan:main:heartbeat (6m ago)
- agent:susan:main (7392m ago)
Session store (neon): /home/najef/.openclaw/agents/neon/sessions/sessions.json (168 entries)
- agent:neon:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed (3m ago)
- agent:neon:main:heartbeat (23m ago)
- agent:neon:telegram:direct:2101884310 (40m ago)
- agent:neon:cron:1404af42-bf12-4a73-b197-b7a670d80d20:run:2bb9f514-b10a-48d1-8f89-6502dcb964d4:heartbeat (128m ago)
- agent:neon:cron:1404af42-bf12-4a73-b197-b7a670d80d20 (143m ago)
Session store (arnold): /home/najef/.openclaw/agents/arnold/sessions/sessions.json (4 entries)
- agent:arnold:main:heartbeat (0m ago)
- agent:arnold:telegram:direct:2101884310 (39m ago)
- agent:arnold:telegram:slash:2101884310 (4375m ago)
- agent:arnold:cron:7c6caa02-af84-4e17-9dfc-ee02b1ac255b:run:b53e970e-7699-4d26-b82e-63c7189e2ed1 (24623m ago)
Other agents are the same...
This output actually confirms the Gateway is reachable now — so the earlier Gateway health: timeout looks intermittent/load-related, not a permanent bind issue.
Big red flag here:
Heartbeat interval: 25m (...) for 17 agents
That’s a lot of full agent turns in the background. Combined with the earlier:
- Active Memory timing out on MiniMax
- minimax MCP server startup timing out after 30s
- WebSocket handshake timeouts
…I’d test by reducing background pressure first.
Try this mitigation bundle:
openclaw config set agents.defaults.heartbeat.skipWhenBusy true
openclaw config set agents.defaults.heartbeat.isolatedSession true
openclaw config set agents.defaults.heartbeat.lightContext true
openclaw config set agents.defaults.heartbeat.every "2h"
openclaw gateway restart
If you want a stronger test, temporarily disable heartbeats:
openclaw config set agents.defaults.heartbeat.every "0m"
openclaw gateway restart
Then run openclaw health --verbose again. If it still shows 25m per-agent, those are per-agent heartbeat overrides in agents.list[], so check:
grep -n "heartbeat" ~/.openclaw/openclaw.json
I’d also keep these disabled during the test:
/active-memory off --global
and either:
openclaw mcp unset minimax
or:
openclaw config set tools.deny '["bundle-mcp"]' --strict-json
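For a sense of scale, the background load can be estimated from the agent count and the interval. This sketch assumes every heartbeat is one agent turn and that they are spread evenly, which may not match how the scheduler really staggers them; heartbeats_per_hour is a hypothetical helper name.

```shell
# Estimated heartbeat turns per hour for N agents at a given interval (minutes).
heartbeats_per_hour() {
  awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n * 60 / m }'
}

heartbeats_per_hour 17 25    # prints 40.8 (current 25m setting)
heartbeats_per_hour 17 120   # prints 8.5  (proposed 2h setting)
```

So the 2h setting would cut background turns by roughly a factor of five under these assumptions.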
curl -m 5 http://127.0.0.1:18789/healthz
{"ok":true,"status":"live"}
curl -g -m 5 'http://[::1]:18789/healthz'
{"ok":true,"status":"live"}
Good — that rules out IPv4/IPv6 bind mismatch. Gateway is listening fine on both:
127.0.0.1 ✅
::1 ✅
So the pattern is much more likely load / blocking work inside the Gateway, not networking.
I’d do the isolation test in this order:
# reduce heartbeat load
openclaw config set agents.defaults.heartbeat.every "2h"
openclaw config set agents.defaults.heartbeat.skipWhenBusy true
openclaw config set agents.defaults.heartbeat.isolatedSession true
openclaw config set agents.defaults.heartbeat.lightContext true
# disable the MCP server path that is timing out
openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart
And in chat:
/active-memory off --global
Then run one task that used to stall. If it stops stalling, re-enable pieces one by one:
- heartbeats
- active-memory
- bundle-mcp / minimax MCP
My bet: the 17 agents heartbeating every 25m + Active Memory MiniMax timeouts + Minimax MCP startup timeout are saturating the runtime enough that runs appear “asleep” until a new message wakes/queues progress.
najef@OpenClaw-Server:~/.openclaw$ openclaw mcp show minimax --json
{
"command": "uvx",
"args": [
"minimax-coding-plan-mcp",
"-y"
],
"env": {
"MINIMAX_API_KEY": "sk-Removed",
"MINIMAX_API_HOST": "https://api.minimax.io"
}
Yep, that matches the timeout line exactly:
uvx minimax-coding-plan-mcp -y
For now I’d remove it from OpenClaw’s MCP registry and restart:
openclaw mcp unset minimax
openclaw gateway restart
Then confirm it’s gone:
openclaw mcp list
If you need to keep the config around instead of deleting it, use the global deny test:
openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart
Also: good call redacting the key. If that key was ever pasted unredacted anywhere public, rotate it.
can we look at /active-memory... why minimax not working?
Yes — and important distinction: there are two MiniMax paths in your logs.
- Active Memory MiniMax model call
active-memory ... activeProvider=minimax activeModel=MiniMax-M2.5 ... status=timeout
This is the one affecting /active-memory.
- MCP MiniMax server
uvx minimax-coding-plan-mcp ... timed out
Separate issue. That’s the MCP tool server, not the chat model.
For /active-memory, check this first in chat:
/active-memory status
/active-memory status --global
/verbose on
/trace on
Then from shell:
openclaw config get plugins.entries.active-memory
openclaw models status --probe --probe-provider minimax --probe-timeout 30000
openclaw models list --provider minimax
Your log says Active Memory is using MiniMax-M2.5, but current MiniMax docs/defaults are around MiniMax-M2.7 / M2.7-highspeed. So also smoke-test the exact model:
openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json
If M2.5 times out/fails but M2.7 works, pin Active Memory to the working model:
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw config set plugins.entries.active-memory.config.timeoutMs 30000
openclaw gateway restart
If you use OAuth instead of API key, the provider may be minimax-portal/... instead of minimax/... — use whatever models list/status shows.
Docs: https://docs.openclaw.ai/concepts/active-memory and https://docs.openclaw.ai/providers/minimax
Active Memory: on for this session.
Active Memory: on globally.
"enabled": true,
"config": {
"enabled": true,
"agents": [
"neon",
"main",
"arnold",
"orion",
"sarah",
"marcus",
"david",
"emma",
"jordan",
"alex",
"gemini",
"susan",
"arthur",
"edgar",
"oscar",
"virginia",
"james",
"riley"
],
"allowedChatTypes": [
"direct"
],
"modelFallback": "default-remote",
"queryMode": "recent",
"promptStyle": "balanced",
"timeoutMs": 15000,
"maxSummaryChars": 220,
"persistTranscripts": false,
"logging": true,
"model": "minimax/MiniMax-M2.5",
"modelFallbackPolicy": "default-remote",
"thinking": "low",
"qmd": {
"searchMode": "inherit"
Yep, that explains a lot.
Active Memory is enabled for 18 agents, and it’s pinned to:
model: "minimax/MiniMax-M2.5"
timeoutMs: 15000
thinking: "low"
Your logs show exactly that model timing out at ~16.5s. Since Active Memory is blocking, this can make normal turns feel stalled.
I’d change Active Memory to either use a current MiniMax model or a faster fallback.
Try:
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7-highspeed
openclaw config set plugins.entries.active-memory.config.timeoutMs 30000
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw gateway restart
If that model isn’t available, first check:
openclaw models list --provider minimax
openclaw models status --probe --probe-provider minimax --probe-timeout 30000
Or safest / simplest: let Active Memory inherit the working session model instead of pinning MiniMax:
openclaw config unset plugins.entries.active-memory.config.model
openclaw config set plugins.entries.active-memory.config.modelFallback google/gemini-3-flash
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw gateway restart
Also consider scoping it to fewer agents while testing:
openclaw config set plugins.entries.active-memory.config.agents '["neon","arnold"]' --strict-json
modelFallbackPolicy is deprecated/no longer meaningful for runtime behavior, so don’t worry about that one.
openclaw models status --probe --probe-provider minimax --probe-timeout 30000
Config : ~/.openclaw/openclaw.json
Agent dir : ~/.openclaw/agents/main/agent
Default : minimax/MiniMax-M2.7
Fallbacks (1) : minimax/MiniMax-M2.5
Aliases (2) : Claude -> anthropic/claude-sonnet-4-6, Minimax -> minimax/MiniMax-M2.7
Configured models (3): anthropic/claude-sonnet-4-6, minimax/MiniMax-M2.5, minimax/MiniMax-M2.7
Auth overview
Auth store : ~/auth-profiles.json
Shell env : off
Providers w/ OAuth/tokens (0): -
- anthropic effective=models.json:sk-ant-a...8A3PcgAA | models.json=sk-ant-a...8A3PcgAA | source=models.json: ~/.openclaw/agents/main/agent/models.json
- metaclaw effective=models.json:me...aw | models.json=me...aw | source=models.json: ~models.json
- minimax effective=profiles:~/.openclaw/agents/main/agent/auth-profiles.json | profiles=1 (oauth=0, token=0, api_key=1) | minimax:global=sk-cp-x7...eDA9Y_cc
- openrouter effective=profiles:~auth-profiles.json | profiles=1 (oauth=0, token=0, api_key=1) | openrouter:default=sk-or-v1...3e63b85d
- zai effective=profiles:~auth-profiles.json | profiles=1 (oauth=0, token=0, api_key=1) | zai:default=cc393639...ZMqFyAia
Model │ Profile │ Status
minimax/MiniMax-M2.7 │ minimax:global (api_key) │ unknown · 54s
↳ No callable tools remain after resolving explicit tool allowlist (agents.neon.tools.allow: *, message); tools are disabled for this run. Fix the allowlist or enable the plugin that registers the requested tool.
Yep — this shows the issue pretty clearly.
Active Memory is pinned to minimax/MiniMax-M2.5, while your MiniMax default is M2.7 and the probe only checked M2.7. Also the probe result is not clean:
Status: unknown · 54s
So first bypass agent/tools and test raw MiniMax:
openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json
openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json
If M2.7 works and M2.5 is slow/fails, fix Active Memory:
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw config set plugins.entries.active-memory.config.timeoutMs 30000
openclaw gateway restart
Also, this part is suspicious but separate:
agents.neon.tools.allow: *, message
For the MiniMax probe, avoid that noise with:
openclaw models status --probe --probe-provider minimax --agent main --probe-timeout 30000
Docs note: Coding Plan OAuth uses minimax-portal/...; API-key setup uses minimax/.... Since your auth shows sk-cp..., if raw minimax/MiniMax-M2.7 fails, you may need to re-onboard with the matching MiniMax auth path. https://docs.openclaw.ai/providers/minimax
can both have same alias?
openclaw models list --provider minimax
Model Input Ctx Local Auth Tags
minimax/MiniMax-M2.7 text 195k no yes default,configured,alias:Minimax
minimax/MiniMax-M2.5 text 195k no yes fallback#1,configured,alias:Minimax
I wouldn’t keep the same alias on both. It’s ambiguous and can make Minimax resolve in surprising ways depending on default/fallback ordering.
Use distinct aliases, e.g.:
openclaw config set 'agents.defaults.models.minimax/MiniMax-M2.7.alias' Minimax
openclaw config set 'agents.defaults.models.minimax/MiniMax-M2.5.alias' MinimaxOld
Or remove the alias from M2.5 entirely:
openclaw config unset 'agents.defaults.models.minimax/MiniMax-M2.5.alias'
Then for Active Memory, don’t use the alias — pin the exact model:
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw gateway restart
That avoids it accidentally selecting M2.5 again.
I am running openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json and I can see the CPU at 100% the whole time. It is the gateway process, and it did not use to consume this much CPU.
Yeah — don’t let that keep spinning. Ctrl-C it.
Important: openclaw infer model run --local should not route through the Gateway; it’s a lean local provider smoke test. So if the gateway process is pegged at 100%, either:
- it’s coincidental background load — heartbeats / active-memory / MCP, or
- the process you’re seeing is actually the openclaw infer CLI node process, not the gateway.
Check the PID/cmd while it’s happening:
ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20
Also run the probe with a hard timeout:
timeout 60s openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json
timeout 60s openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json
If M2.5 hangs / burns CPU but M2.7 works, remove M2.5 from anything important:
openclaw config set plugins.entries.active-memory.config.model minimax/MiniMax-M2.7
openclaw config set plugins.entries.active-memory.config.thinking off
openclaw config set agents.defaults.model.fallbacks '[]' --strict-json
openclaw gateway restart
Given what we’ve seen, I would not use MiniMax-M2.5 for Active Memory. It’s exactly the model timing out in your logs.
I am doing the smoke test...
Got it — while it’s running, please grab the process list in another terminal:
ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20
That will tell us whether the CPU is the smoke-test CLI process or the long-running gateway.
If it runs longer than ~60s, stop it with Ctrl-C and try the same for M2.7:
timeout 60s openclaw infer model run --local --model minimax/MiniMax-M2.7 --prompt "Reply with exactly: pong" --json
If M2.7 returns quickly and M2.5 doesn’t, we’ve basically found the bad model path.
najef@OpenClaw-Server:~$ ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20
377477 1466 322 9.3 01:06:48 openclaw-gateway
377568 377540 0.4 0.5 01:06:47 /home/najef/.nvm/versions/node/v24.14.1/bin/node --require /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/preflight.cjs --import file:///home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/loader.mjs server/index.ts
377538 377537 0.1 0.6 01:06:47 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/vite
377558 377538 0.0 0.0 01:06:47 /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/@esbuild/linux-x64/bin/esbuild --service=0.27.2 --ping
377540 377539 0.0 0.2 01:06:47 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/tsx watch server/index.ts
377504 377503 0.0 0.2 01:06:48 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/npm-run-all --parallel dev dev:server
346387 2098 0.0 0.1 02:47:36 /usr/bin/pluma /home/najef/.openclaw/openclaw.json
2495 2475 0.0 0.0 15:33:17 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/110/doc
2433 2402 0.0 0.0 15:33:17 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /home/najef/.cache/doc
1744 1728 0.0 0.0 15:33:18 fusermount3 -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/1000/doc
Yep — that’s the smoking gun:
openclaw-gateway 322% CPU
infer model run --local should not be using the Gateway, so this is almost certainly background Gateway load, not the smoke test itself.
Do this mitigation now:
/active-memory off --global
Then shell:
openclaw mcp unset minimax
openclaw config set agents.defaults.heartbeat.every "2h"
openclaw config set agents.defaults.heartbeat.skipWhenBusy true
openclaw gateway restart
After restart, watch CPU:
ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw-gateway|openclaw|node' | grep -v grep | sort -k3 -nr | head -20
If Gateway jumps back to high CPU while idle, grab:
openclaw tasks list --status running
openclaw gateway stability --json --limit 100
openclaw logs --plain --limit 500 | grep -Ei 'active-memory|minimax|heartbeat|cron|handshake|timeout|error|stuck'
Also: those vite/tsx processes under .openclaw/.../workspace look like an agent-started dev server. Not necessarily bad, but if no current task needs it, it’s extra load/noise. The Gateway CPU is the real blocker though.
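One caveat when reading that number: on Linux with procps, the pcpu column from ps is CPU time divided by the process lifetime, so 322 with an etime of 01:06:48 is an average of about 3.2 cores since the Gateway started, not a momentary spike. To log it over time, the value can be extracted from the same ps output. This is a sketch; gateway_cpu is a hypothetical helper and assumes the column order pid,ppid,pcpu,pmem,etime,args used above.

```shell
# Print the pcpu column for every openclaw-gateway row.
# Expects `ps -eo pid,ppid,pcpu,pmem,etime,args` output on stdin,
# so field 3 is pcpu and field 6 is the start of the command.
gateway_cpu() {
  awk '$6 ~ /openclaw-gateway/ { print $3 }'
}
```

Usage: ps -eo pid,ppid,pcpu,pmem,etime,args | gateway_cpu. Because the value is a lifetime average, compare readings taken after a restart, or use a tool such as top to see the current rate.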
You are proposing workarounds; I am more keen to know why something is not working.
You’re right — I drifted into mitigation mode. Let’s separate what we know from what we still need to prove.
What we know from your logs:
- Gateway is actually burning CPU
openclaw-gateway 322% CPU
So agents “stopping” is likely because the Gateway/event loop is saturated, not because Telegram/WebChat is disconnected.
- Active Memory is pinned to MiniMax M2.5
"model": "minimax/MiniMax-M2.5"
and logs show:
active-memory ... MiniMax-M2.5 ... status=timeout
So Active Memory is one trigger candidate.
- MiniMax MCP is also timing out
uvx minimax-coding-plan-mcp ... timed out after 30000ms
Separate from Active Memory, but another MiniMax-related blocking path.
- Gateway handshakes timeout under load
gateway/ws handshake timeout
That’s likely an effect of the CPU saturation.
To find the actual cause, don’t change config yet. While CPU is high, run:
openclaw gateway stability --json --limit 100
openclaw tasks list --status running --json
openclaw logs --plain --limit 500 | grep -Ei 'diagnostic|liveness|active-memory|minimax|heartbeat|cron|mcp|timeout|error'
Also:
curl -s http://127.0.0.1:18789/readyz | python3 -m json.tool
What I’m looking for: event-loop delay, active/waiting sessions, repeated active-memory starts, MCP startup loops, or heartbeat/cron floods.
The current best hypothesis is: a bug/hot loop or repeated blocking retry inside Gateway, triggered by MiniMax M2.5 Active Memory and/or MiniMax MCP startup, amplified by many agents/heartbeats. But we need the stability/liveness output to say why confidently.
They appear to be on Linux/Ubuntu-ish from the prompt and process output:
najef@OpenClaw-Server
systemd not installed
/home/najef/.nvm/versions/node/v24.14.1/...
But better to confirm with:
uname -a
cat /etc/os-release
node --version
openclaw --version
"model": "minimax/MiniMax-M2.5" should be quicker than "model": "minimax/MiniMax-M2.7" also maybe the problem is not with the model but with the cpu being 100% I say this because openclaw infer model run --local --model minimax/MiniMax-M2.5 --prompt "Reply with exactly: pong" --json also takes long time to finish.
Yes — I agree with your read. The slow infer --local test may be correlated, not causal.
If openclaw-gateway is already at 300% CPU, then any normal command can feel slow because the machine is saturated. The model request itself should mostly be network wait, not burn local CPU.
So the question is now: what is the Gateway doing at 300% CPU?
Please capture these while CPU is high:
openclaw gateway stability --json --limit 100
openclaw tasks list --status running --json
curl -s http://127.0.0.1:18789/readyz | python3 -m json.tool
And this log slice:
openclaw logs --plain --limit 1000 \
| grep -Ei 'diagnostic|liveness|event-loop|cpu|active-memory|minimax|heartbeat|cron|mcp|timeout|stuck|processing|error' \
| tail -120
Also useful to see if it’s spawning/reaping loops:
pstree -ap 377477 | head -100
If gateway stability shows high event-loop delay / active session queues, that explains why agents only continue after a nudge: they’re not “asleep”; the Gateway is overloaded and delayed. Then we can pin whether the overload is active-memory, heartbeat fanout, cron, MCP startup, or a bug/hot loop.
najef@OpenClaw-Server:~$ openclaw gateway stability --json --limit 100
Gateway stability failed: Error: gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback
PRETTY_NAME="Ubuntu 24.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.4 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
v24.14.1
OpenClaw 2026.4.26 (be8c246)
Fair. The key fact now is:
openclaw gateway stability ... gateway timeout after 10000ms
openclaw-gateway 322% CPU
So the Gateway is not just “busy”; it is saturated enough that its own diagnostic RPC can’t answer. To know why, we need OS-level evidence because Gateway RPC is unreliable while it’s pegged.
Please run these while CPU is high:
PID=$(pgrep -xo openclaw-gateway)
echo "PID=$PID"
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
This tells us whether one JS thread is hot-looping or many worker/native threads are burning CPU.
Then:
curl -m 10 -s http://127.0.0.1:18789/readyz | python3 -m json.tool | head -160
And if strace exists:
timeout 15s strace -f -p "$PID" -c 2>&1
Interpretation:
- lots of openat/stat/read → file/plugin/session scanning loop
- lots of futex/epoll + CPU still high → worker/native contention
- almost no syscalls → JS hot loop / GC / CPU-bound code
This is the path to “why.” Current hypothesis is not “MiniMax is slow”; it’s “Gateway entered a CPU-saturation path, and MiniMax/active-memory/MCP are just nearby triggers.”
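The interpretation list above can be applied mechanically to a `strace -c` table. This is a rough sketch, not an OpenClaw tool: `strace_summary` is a hypothetical helper name, and the grouping of syscalls into "fs" and "wait" buckets is my own assumption based on the three cases listed.

```shell
# strace_summary: read a `strace -c` table on stdin and bucket the call counts.
# The bucket membership below is an assumption for illustration only.
strace_summary() {
  awk '
    # Skip the header, the ruler lines, the total row and any strace chatter.
    $1 !~ /^[0-9.]+$/ || $NF == "total" { next }
    $4 ~ /^[0-9]+$/ {
      name = $NF; calls = $4; total += calls
      if (name ~ /^(open|openat|stat|statx|lstat|fstat|newfstatat|access|read)$/)
        fs += calls            # file/plugin/session scanning candidates
      else if (name ~ /^(futex|poll|epoll_wait|epoll_pwait)$/)
        wait += calls          # lock and event-loop waiting
    }
    END {
      if (total == 0) { print "no syscalls sampled"; exit }
      printf "fs=%d wait=%d other=%d total=%d\n", fs, wait, total - fs - wait, total
    }'
}
```

Usage would be something like: sudo timeout 15s strace -f -p "$PID" -c 2>&1 | strace_summary. A very small total while CPU stays high corresponds to the third case (CPU-bound code outside syscalls).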
openclaw tasks list --status running --json
{
"count": 1,
"runtime": null,
"status": "running",
"tasks": [
{
"taskId": "a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40",
"runtime": "cron",
"sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"requesterSessionKey": "",
"ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"scopeKind": "system",
"childSessionKey": "agent:neon:telegram:direct:2101884310",
"agentId": "neon",
"runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1777458386285",
"label": "Orion Heartbeat",
"task": "Orion Heartbeat",
"status": "running",
"deliveryStatus": "not_applicable",
"notifyPolicy": "silent",
"createdAt": 1777458386285,
"startedAt": 1777458386285,
"lastEventAt": 1777458386285
}
]
}
curl -s http://127.0.0.1:18789/readyz | python3 -m json.tool
{
"ready": true,
"failing": [],
"uptimeMs": 4511038
}
This narrows it.
readyz being green means the HTTP liveness/readiness path can answer. But gateway stability timing out means the Gateway WS/RPC path is too saturated to answer within 10s. So the Gateway isn’t dead; it’s overloaded/wedged in a way that mainly affects agent/RPC work.
The running task is also very suspicious:
label: Orion Heartbeat
runtime: cron
agentId: neon
childSessionKey: agent:neon:telegram:direct:2101884310
lastEventAt == startedAt
That means this cron run started, but has emitted no progress event since start. And it’s using the same neon Telegram direct session, not some obviously isolated throwaway session.
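The "no progress since start" check can be done without eyeballing the timestamps. A minimal sketch, assuming the task JSON was saved to a file and pretty-printed one key per line as in the output above; `json_num` and `task_idle_ms` are hypothetical helper names, and the sed-based parsing is only good enough for flat numeric fields.

```shell
# json_num KEY: print the first integer value of "KEY" found on stdin.
json_num() {
  sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# task_idle_ms FILE: milliseconds between startedAt and lastEventAt.
# 0 means the task has not emitted any event since it started.
task_idle_ms() {
  started=$(json_num startedAt < "$1")
  last=$(json_num lastEventAt < "$1")
  # Refuse to compute if either field is missing.
  [ -n "$started" ] && [ -n "$last" ] || return 1
  echo $(( last - started ))
}
```

For example: openclaw tasks list --status running --json > /tmp/tasks.json, then task_idle_ms /tmp/tasks.json. With several tasks in the file this only looks at the first one.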
Next, inspect exactly what that cron task is doing:
openclaw tasks show a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40 --json
openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
openclaw cron runs --id 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --limit 10
And still need this OS-level CPU view:
PID=$(pgrep -xo openclaw-gateway)
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
If the high CPU lines are one/few threads, that points to a hot loop. If many threads are hot, it’s more likely worker/native/parallel work.
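To make the "one/few threads vs. many threads" call easier, the per-thread lines can be grouped by thread name. A small sketch, assuming the column order from the `ps -L ... -o pid,tid,psr,pcpu,pmem,comm,args` command above and thread names without spaces; `thread_cpu_by_name` is a hypothetical helper name.

```shell
# thread_cpu_by_name: read `ps -L -o pid,tid,psr,pcpu,pmem,comm,args` output on
# stdin and print "<thread name> <thread count> <summed %CPU>", busiest first.
thread_cpu_by_name() {
  awk '
    $4 ~ /^[0-9.]+$/ {        # skips the header row, whose %CPU column is not numeric
      n[$6]++
      cpu[$6] += $4
    }
    END { for (k in n) printf "%s %d %.1f\n", k, n[k], cpu[k] }
  ' | sort -k3,3nr
}
```

Usage: ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | thread_cpu_by_name. Note that `ps` reports each thread's CPU as an average over that thread's lifetime, so the sums are a ranking aid rather than an exact current load.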
PID=$(pgrep -xo openclaw-gateway)
echo "PID=$PID"
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
pgrep: pattern that searches for process name longer than 15 characters will result in zero matches
Try `pgrep -f' option to match against the complete command line.
PID=
error: list of process IDs must follow -p
Usage:
ps [options]
Try 'ps --help <simple|list|output|threads|misc|all>'
or 'ps --help <s|l|o|t|m|a>'
for additional help text.
For more details see ps(1).
Ah, Linux truncates process names at 15 chars for plain pgrep. Use -f:
PID=$(pgrep -of 'openclaw-gateway')
echo "PID=$PID"
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
If that still misses, use the PID from your earlier ps output directly:
PID=377477
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
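For context on the 15-character limit: the kernel keeps only the first 15 bytes of a process name, so an exact-name match has to use the truncated form. A sketch of a lookup that tries that first and then falls back to the full command line; `comm_name` and `find_oldest_pid` are hypothetical helper names.

```shell
# comm_name NAME: NAME as the kernel stores it (at most 15 bytes).
comm_name() {
  printf '%.15s' "$1"
}

# find_oldest_pid NAME: exact match on the truncated name first,
# then fall back to matching anywhere in the full command line.
find_oldest_pid() {
  pgrep -o -x "$(comm_name "$1")" || pgrep -o -f "$1"
}
```

Usage: PID=$(find_oldest_pid openclaw-gateway); echo "PID=$PID". If both lookups come back empty, read the PID off the `ps` output directly.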
openclaw tasks show a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40 --json
{
"taskId": "a5cfc26f-9e9f-4dc1-8138-3640fa2b1f40",
"runtime": "cron",
"sourceId": "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"requesterSessionKey": "",
"ownerKey": "system:cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed",
"scopeKind": "system",
"childSessionKey": "agent:neon:telegram:direct:2101884310",
"agentId": "neon",
"runId": "cron:4e71329a-aaf2-46bc-94d1-ee573b0ca0ed:1777458386285",
"label": "Orion Heartbeat",
"task": "Orion Heartbeat",
"status": "timed_out",
"deliveryStatus": "not_applicable",
"notifyPolicy": "silent",
"createdAt": 1777458386285,
"startedAt": 1777458386285,
"endedAt": 1777458715376,
"lastEventAt": 1777458715376,
"cleanupAfter": 1778063515376,
"error": "cron: job execution timed out"
}
openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed
🦞 OpenClaw 2026.4.26 (be8c246) — Pairing codes exist because even bots believe in consent—and good security hygiene.
│
gateway connect failed: Error: gateway closed (1000):
◇
Error: gateway closed (1000 normal closure): no close reason
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback
This is useful: the cron task timed out, and now normal Gateway RPC is closing during connect:
gateway closed (1000 normal closure)
That means we still need the OS/thread view because Gateway is not reliable enough to introspect itself.
Run this exact version:
PID=$(pgrep -of 'openclaw-gateway')
echo "PID=$PID"
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
And if strace is installed:
timeout 15s strace -f -p "$PID" -c 2>&1
The cron evidence says Orion Heartbeat is one reproducible stuck/timed-out workload, but the reason may still be underneath it: Gateway CPU saturation, active-memory, or MCP. The closed (1000) during cron show is another sign the Gateway WS handler is unhealthy under load.
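The timestamps in the `tasks show` output also say how long the run lasted before the timeout fired. A tiny sketch for the arithmetic; `elapsed_s` is a hypothetical helper name, and the inputs are the `startedAt` and `endedAt` epoch-millisecond values from the JSON.

```shell
# elapsed_s START_MS END_MS: whole seconds between two epoch-millisecond stamps.
elapsed_s() {
  echo $(( ($2 - $1) / 1000 ))
}
```

With the values above, elapsed_s 1777458386285 1777458715376 gives 329, so the job ran for about five and a half minutes before being marked timed_out.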
najef@OpenClaw-Server:~$ PID=$(pgrep -of 'openclaw-gateway')
echo "PID=$PID"
ps -L -p "$PID" -o pid,tid,psr,pcpu,pmem,comm,args | sort -k4 -nr | head -30
PID=377477
377477 418703 0 78.4 10.0 libuv-worker openclaw-gateway
377477 418732 0 63.9 10.0 libuv-worker openclaw-gateway
377477 418700 1 46.5 10.0 libuv-worker openclaw-gateway
377477 418699 6 45.1 10.0 libuv-worker openclaw-gateway
377477 418702 4 44.5 10.0 libuv-worker openclaw-gateway
377477 418734 4 44.3 10.0 libuv-worker openclaw-gateway
377477 418704 2 39.5 10.0 libuv-worker openclaw-gateway
377477 418701 2 39.2 10.0 libuv-worker openclaw-gateway
377477 418723 5 37.7 10.0 libuv-worker openclaw-gateway
377477 418733 3 36.0 10.0 libuv-worker openclaw-gateway
377477 418724 2 32.7 10.0 libuv-worker openclaw-gateway
377477 418725 0 32.2 10.0 libuv-worker openclaw-gateway
377477 377590 4 22.1 10.0 libuv-worker openclaw-gateway
377477 377591 5 21.9 10.0 libuv-worker openclaw-gateway
377477 377593 5 21.5 10.0 libuv-worker openclaw-gateway
377477 377592 1 21.5 10.0 libuv-worker openclaw-gateway
377477 377477 1 21.1 10.0 openclaw-gatewa openclaw-gateway
377477 377484 1 0.6 10.0 V8Worker openclaw-gateway
377477 377483 6 0.6 10.0 V8Worker openclaw-gateway
377477 377482 1 0.6 10.0 V8Worker openclaw-gateway
377477 377481 1 0.6 10.0 V8Worker openclaw-gateway
PID TID PSR %CPU %MEM COMMAND COMMAND
377477 377490 4 0.0 10.0 SignalInspector openclaw-gateway
377477 377480 6 0.0 10.0 DelayedTaskSche openclaw-gateway
najef@OpenClaw-Server:~$ timeout 15s strace -f -p "$PID" -c 2>&1
strace: attach: ptrace(PTRACE_SEIZE, 377477): Operation not permitted
najef@OpenClaw-Server:~$
This is the first really diagnostic output. It means:
The CPU burn is in libuv-worker threads, not mostly the JS main thread.
So this is probably not “MiniMax response is slow” by itself. Something in Gateway is flooding Node’s worker pool — commonly filesystem work, crypto, compression, DNS, sqlite/native work, etc.
That also explains the weird split:
- /readyz works: simple HTTP path can answer.
- WS/RPC commands time out/close: connect/auth/task paths likely need queued worker work and get starved.
Next, find what those workers are doing:
PID=$(pgrep -of 'openclaw-gateway')
tr '\0' '\n' < /proc/$PID/environ | grep -E 'UV_THREADPOOL_SIZE|NODE_OPTIONS|OPENCLAW'
timeout 15s strace -f -p "$PID" -c 2>&1
If strace says permission denied:
sudo timeout 15s strace -f -p "$PID" -c 2>&1
Interpretation:
- tons of openat/stat/read → file/session/plugin scanning loop
- tons of sqlite/file locking syscalls → sqlite/session/task store pressure
- mostly futex / little syscall activity while CPU high → native CPU-bound worker / crypto / compression / tight loop
If you have perf, this is even better:
sudo timeout 15s perf top -p "$PID" --stdio | head -100
The “why” right now is: Gateway worker pool saturation. We still need strace/perf to identify the subsystem causing it.
That’s expected on Ubuntu with ptrace_scope restrictions.
Use one of these:
sudo timeout 15s strace -f -p "$PID" -c 2>&1
If you don’t have sudo, check thread wait channels instead — no special perms usually:
PID=$(pgrep -of 'openclaw-gateway')
for T in $(ps -L -p "$PID" -o tid= | head -40); do
printf "%s " "$T"
cat /proc/$PID/task/$T/wchan 2>/dev/null
done | sort | uniq -c
And grab open files summary:
ls -l /proc/$PID/fd 2>/dev/null | awk '{print $NF}' | sed 's/[0-9]\+$/#/' | sort | uniq -c | sort -nr | head -40
If sudo strace works, that’s the best next evidence.
% time seconds usecs/call calls errors syscall
69.05 0.039992 121 330 epoll_pwait
18.85 0.010919 12 869 statx
3.53 0.002042 14 139 6 futex
3.42 0.001982 24 81 write
3.37 0.001949 21 90 read
1.69 0.000980 490 2 sendto
0.05 0.000027 1 27 clone3
0.02 0.000011 0 108 rt_sigprocmask
0.01 0.000005 0 8 access
0.01 0.000004 0 27 rseq
0.00 0.000001 0 27 madvise
0.00 0.000001 0 4 socket
0.00 0.000001 0 4 2 connect
0.00 0.000001 0 4 newfstatat
0.00 0.000001 0 27 set_robust_list
0.00 0.000001 0 6 io_uring_enter
0.00 0.000000 0 6 close
0.00 0.000000 0 2 fstat
0.00 0.000000 0 4 poll
0.00 0.000000 0 2 lseek
0.00 0.000000 0 2 ioctl
0.00 0.000000 0 24 getpid
0.00 0.000000 0 2 recvfrom
0.00 0.000000 0 22 setsockopt
0.00 0.000000 0 2 getsockopt
0.00 0.000000 0 2 epoll_ctl
0.00 0.000000 0 2 openat
100.00 0.057917 31 1823 8 total
najef@OpenClaw-Server:~$
This is not looking like “MiniMax model latency” anymore.
Key read from this sample:
869 statx
27 clone3
total syscall time: 0.058s over 15s
So the Gateway is spending CPU mostly outside syscalls, inside libuv worker/native userland work. The statx count hints at filesystem scanning/polling, but it’s not enough by itself to explain 300% CPU unless it’s happening constantly or mixed with native work.
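To put a number on "mostly outside syscalls": compare the syscall time from the `strace -c` total row with the CPU-seconds the process burned in the same window. A rough sketch; `syscall_share` is a hypothetical helper name, and it assumes the Gateway was still near the 322% CPU seen earlier while the 15-second sample ran.

```shell
# syscall_share SYSCALL_SECONDS WALL_SECONDS CPU_PERCENT
# Prints the percentage of consumed CPU time that was spent inside syscalls.
syscall_share() {
  awk -v sys="$1" -v wall="$2" -v pct="$3" 'BEGIN {
    cpu = wall * pct / 100            # CPU-seconds consumed during the window
    if (cpu <= 0) { print "n/a"; exit }
    printf "%.2f\n", sys / cpu * 100
  }'
}
```

With these numbers, syscall_share 0.057917 15 322 prints 0.12, i.e. roughly a tenth of a percent of the CPU time was in the kernel and the rest was userland work.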
Two next checks:
PID=$(pgrep -of 'openclaw-gateway')
tr '\0' '\n' < /proc/$PID/environ | grep -E 'UV_THREADPOOL_SIZE|NODE_OPTIONS|OPENCLAW'
And path-level strace, so we see what is being statx’d:
sudo timeout 10s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat \
-s 200 2>&1 \
| grep -E 'openclaw.plugin.json|/dist/extensions/|sessions.json|memory/|models.json|auth-profiles|cron|node_modules' \
| head -120
If perf is available, that’s the best answer:
sudo timeout 15s perf top -p "$PID" --stdio | head -100
Right now I’d phrase the “why” as:
Gateway’s libuv worker pool is saturated by native/background work. That starves WS/RPC operations, so agents appear stopped and diagnostic RPCs time out. We need path/function samples to identify which subsystem is feeding the worker pool.
The running/timed-out Orion Heartbeat is likely one workload that triggers it, but the thread data says the deeper failure is worker-pool saturation.
PID=$(pgrep -of 'openclaw-gateway') shows nothing.
OPENCLAW_GATEWAY_PORT=18789
OPENCLAW_SYSTEMD_UNIT=openclaw-gateway.service
OPENCLAW_WINDOWS_TASK_NAME=OpenClaw Gateway
OPENCLAW_SERVICE_MARKER=openclaw
OPENCLAW_SERVICE_KIND=gateway
OPENCLAW_SERVICE_VERSION=2026.4.26
[pid 377593] openat(AT_FDCWD, "/home/najef/.openclaw/cron/jobs-state.json", O_RDONLY|O_CLOEXEC) = 37
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=139264, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/index.js", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0775, stx_size=2855, ...}) = 0
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json", O_RDONLY|O_CLOEXEC) = 39
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 377477] openat(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json", O_RDONLY|O_CLOEXEC) = 39
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=4096, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=139264, ...}) = 0
[pid 377477] statx(AT_FDCWD, "/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/extensions", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0775, stx_size=4096, ...}
Good catch. That sample points away from MiniMax itself.
What we now know:
- Gateway PID was 377477; pgrep not finding it now may mean the process title changed or it restarted, but your strace lines are still from that PID.
- The hot Gateway is doing filesystem/module-resolution work:
/home/najef/.openclaw/cron/jobs-state.json
.../node_modules/openclaw/dist/index.js
- Combined with the running/timed-out Orion Heartbeat, the strongest theory is now: cron/heartbeat state handling is repeatedly waking work and causing heavy module/fs resolution in libuv workers, starving Gateway WS/RPC.
Next useful capture:
ps -eo pid,ppid,pcpu,pmem,etime,args | grep -E 'openclaw|node' | grep -v grep | sort -k3 -nr | head -20
Then use the real PID from that output and run:
PID=<gateway-pid>
sudo timeout 20s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat \
-s 240 2>&1 \
| grep -E 'jobs-state.json|jobs.json|sessions.json|openclaw.plugin.json|dist/index.js|auth-profiles|models.json' \
| head -200
Also inspect the cron job:
openclaw cron show 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --json
If cron show can’t connect, directly inspect config/store paths:
grep -R "4e71329a-aaf2-46bc-94d1-ee573b0ca0ed\|Orion Heartbeat" ~/.openclaw -n | head -50
At this point I’d file an issue with: high CPU in many libuv-worker threads, WS/RPC timeouts, Orion Heartbeat cron timing out, and strace showing repeated cron state + OpenClaw dist stat calls. That’s a real bug-shaped report.
This is a better clue. The dist/package.json ENOENT itself is normal-ish Node/package resolution probing, but seeing it repeatedly while the Gateway is hot is not normal.
This points to repeated OpenClaw package/plugin metadata resolution, especially:
.../openclaw/package.json
.../openclaw/dist/extensions
So I’d refine the “why” to:
Gateway is repeatedly rebuilding/resolving OpenClaw plugin/package metadata on a hot path, likely triggered by cron/heartbeat/agent tool setup. That saturates libuv workers, so WS/RPC handshakes and agent progress stall.
Let’s count exactly what’s looping:
PID=377477
sudo timeout 20s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat \
-s 240 2>&1 \
| sed -n 's/.*"\([^"]*\)".*/\1/p' \
| grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
| sort | uniq -c | sort -nr | head -50
If top entries are dist/extensions / openclaw.plugin.json / package metadata, that’s strong evidence for a plugin-discovery/tool-catalog hot loop.
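Since the Gateway will need a restart soon, it can help to save the raw strace output once and count paths offline as often as needed. A minimal sketch of the counting step on its own; `count_paths` is a hypothetical helper name, and it assumes each strace line carries the path as its only quoted string.

```shell
# count_paths: read raw strace lines on stdin and print "<count> <path>",
# most frequently touched path first.
count_paths() {
  sed -n 's/.*"\([^"]*\)".*/\1/p' \
    | sort | uniq -c | sort -k1,1nr \
    | awk '{ c = $1; sub(/^ *[0-9]+ /, ""); print c, $0 }'
}
```

Usage: save the strace output to a file such as /tmp/gw.strace first, then run count_paths < /tmp/gw.strace | head -50. Adding a grep between the file and `count_paths` narrows it to one subsystem at a time.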
For PID detection, use:
ps -eo pid,pcpu,pmem,etime,args | grep '[o]penclaw'
because pgrep can miss depending on process title/argv.
Orion Heartbeat does not have any special heartbeat that could cause this.
That’s fair — I shouldn’t overfit to the name. The task label being “Orion Heartbeat” may just be the visible victim, not the cause.
The stronger evidence is independent of that cron:
many hot libuv-worker threads
repeated stat/open of openclaw package + dist/extensions
WS/RPC timing out while /readyz works
So the likely cause is not “Orion Heartbeat has a special prompt”. It’s more likely a Gateway hot path that repeatedly resolves package/plugin metadata during some run/session/tool setup.
To prove/disprove, the best next command is the counted strace:
PID=377477
sudo timeout 20s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat \
-s 240 2>&1 \
| sed -n 's/.*"\([^"]*\)".*/\1/p' \
| grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
| sort | uniq -c | sort -nr | head -50
If that shows lots of dist/extensions / openclaw.plugin.json, then the root bug is plugin/package discovery being rebuilt too often, not any one heartbeat.
what components does the gateway have that could use cpu like that?
Good question. Since the hot threads are libuv-worker, I’d think “async native/background work” more than pure JS.
Gateway components that can plausibly burn CPU / worker pool:
- Plugin discovery / tool catalog building
Scans dist/extensions, reads openclaw.plugin.json, package metadata. Your strace already points here.
- Session/task/cron state stores
Reads/writes sessions.json, tasks, cron/jobs-state.json, cleanup/maintenance scans.
- Memory / Active Memory / embeddings / QMD
Index/search/bootstrap memory files, embeddings, recall prep. Active Memory was timing out, but may be a trigger rather than root.
- MCP tool startup/catalog
Starting stdio MCP servers (uvx minimax-coding-plan-mcp) and discovering their tools can block/timeout.
- Message/channel runtime
Telegram send/action retries, channel health checks, delivery queues.
- File watching / workspace scanning
Workspace bootstrap, AGENTS/skills/memory scanning, project file inspection.
- Compression/crypto/auth
Token/device auth, hashing, websocket compression, TLS/DNS-ish worker tasks.
- Node 24 / native runtime behavior
You’re on Node v24.14.1; if a regression is in Node/libuv/fs workers, it could amplify plugin/fs scans.
From your evidence, top suspects are currently:
- plugin/package metadata resolution (dist/extensions, package.json)
- cron/session state churn (jobs-state.json)
- Active Memory/MCP triggering repeated tool/plugin setup
The counted strace will tell us whether plugin discovery is truly dominating or just incidental.
how can we check one by one what they are doing?
Yes — do it like a controlled bisect: observe CPU + one signal, disable one subsystem, restart, compare. Don’t change five things at once.
0) Baseline
ps -eo pid,pcpu,pmem,etime,args | grep '[o]penclaw'
openclaw logs --plain --limit 300 | grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error' | tail -80
1) Plugin/package discovery
Evidence check:
PID=<gateway-pid>
sudo timeout 20s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat -s 240 2>&1 \
| sed -n 's/.*"\([^"]*\)".*/\1/p' \
| grep -E 'openclaw.plugin.json|dist/extensions|package.json' \
| sort | uniq -c | sort -nr | head -50
2) Cron
Check:
openclaw cron list
openclaw tasks list --status running
openclaw cron runs --id 4e71329a-aaf2-46bc-94d1-ee573b0ca0ed --limit 10
Temporary isolate:
openclaw config set cron.enabled false --strict-json
openclaw gateway restart
3) Heartbeats
Check:
openclaw system heartbeat last
grep -n "heartbeat" ~/.openclaw/openclaw.json | head -80
Temporary isolate:
openclaw config set agents.defaults.heartbeat.every "0m"
openclaw gateway restart
If per-agent overrides exist, they may need clearing too.
4) Active Memory
Check:
/active-memory status --global
Shell:
openclaw config get plugins.entries.active-memory
openclaw logs --plain --limit 500 | grep -i active-memory | tail -80
Temporary isolate:
/active-memory off --global
5) MCP
Check:
openclaw mcp list
openclaw mcp show minimax --json
openclaw logs --plain --limit 500 | grep -Ei 'bundle-mcp|mcp' | tail -80
Temporary isolate:
openclaw config set tools.deny '["bundle-mcp"]' --strict-json
openclaw gateway restart
6) Channels
Check:
openclaw channels status --probe
openclaw logs --plain --limit 500 | grep -Ei 'telegram|channel|sendChatAction|disconnect|timeout' | tail -80
Best order: MCP → Active Memory → Cron → Heartbeats → plugin discovery evidence. After each restart, wait 2–3 min and check ps CPU again.
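One caveat for the before/after comparison: the %CPU that `ps` prints is CPU time divided by the process lifetime, so on a long-running Gateway it reacts slowly to changes. A sketch that measures CPU over a short window from /proc instead (Linux only); `proc_ticks`, `cpu_percent` and `sample_cpu` are hypothetical helper names.

```shell
# proc_ticks PID: user + system CPU ticks (fields 14 and 15 of /proc/PID/stat).
proc_ticks() {
  # Drop everything up to the last ") " so a process name containing spaces
  # cannot shift the field positions; utime/stime are then fields 12 and 13.
  sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# cpu_percent TICKS_BEFORE TICKS_AFTER SECONDS [TICKS_PER_SECOND]
cpu_percent() {
  awk -v a="$1" -v b="$2" -v s="$3" -v hz="${4:-100}" \
    'BEGIN { printf "%.1f\n", (b - a) / hz / s * 100 }'
}

# sample_cpu PID [SECONDS]: CPU usage of PID over the next SECONDS (default 5).
sample_cpu() {
  secs=${2:-5}
  before=$(proc_ticks "$1")
  sleep "$secs"
  after=$(proc_ticks "$1")
  cpu_percent "$before" "$after" "$secs" "$(getconf CLK_TCK)"
}
```

Usage after each restart: sample_cpu "$PID" 10. Values above 100 mean more than one core is busy, same as in the `ps` output.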
najef@OpenClaw-Server:~/.openclaw$ ps -eo pid,pcpu,pmem,etime,args | grep '[o]penclaw'
346387 0.0 0.1 04:03:58 /usr/bin/pluma /home/najef/.openclaw/openclaw.json
377477 504 10.3 02:23:09 openclaw-gateway
377504 0.0 0.2 02:23:09 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/npm-run-all --parallel dev dev:server
377538 0.1 0.6 02:23:09 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/vite
377540 0.0 0.2 02:23:09 node /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/.bin/tsx watch server/index.ts
377558 0.0 0.0 02:23:08 /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/@esbuild/linux-x64/bin/esbuild --service=0.27.2 --ping
377568 0.2 0.5 02:23:08 /home/najef/.nvm/versions/node/v24.14.1/bin/node --require /home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/preflight.cjs --import file:///home/najef/.openclaw/CerAImic/Dev-Team/projects/Active/Nerve-CerAImic/node_modules/tsx/dist/loader.mjs server/index.ts
najef@OpenClaw-Server:~/.openclaw$ openclaw logs --plain --limit 300 | grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error' | tail -80
gateway connect failed: Error: gateway closed (1000):
Gateway not reachable. Is it running and accessible?
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: /home/najef/.openclaw/openclaw.json
Bind: loopback
Hint: run openclaw doctor.
At 504% CPU, the Gateway is wedged enough that Gateway-backed commands won’t be trustworthy. So openclaw logs failing is expected now.
Use file/process-level checks instead:
PID=377477
sudo timeout 20s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat \
-s 240 2>&1 \
| sed -n 's/.*"\([^"]*\)".*/\1/p' \
| grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
| sort | uniq -c | sort -nr | head -50
And raw logs, bypassing Gateway RPC:
ls -lt ~/.openclaw/logs /tmp/openclaw 2>/dev/null | head -30
tail -n 500 "$(ls -t ~/.openclaw/logs/*.log /tmp/openclaw/openclaw-*.log 2>/dev/null | head -1)" \
| grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error' \
| tail -100
If perf exists, this is the clearest “what is CPU doing” command:
sudo timeout 15s perf top -p 377477 --stdio | head -100
At this point the main fact is: Gateway is not merely slow; it’s CPU-saturated in libuv workers and WS/RPC is failing. Capture one strace count / perf sample, then restart Gateway — otherwise the system is too wedged to debug through OpenClaw commands.
PID gateway-pid
PID: command not found
Sorry — I meant replace <gateway-pid> with the actual number. In your case it’s:
PID=377477
Then run commands using $PID, for example:
PID=377477
sudo timeout 20s strace -f -p "$PID" \
-e trace=statx,openat,newfstatat \
-s 240 2>&1 \
| sed -n 's/.*"\([^"]*\)".*/\1/p' \
| grep -E 'openclaw|jobs-state|sessions.json|openclaw.plugin.json|package.json|dist/extensions' \
| sort | uniq -c | sort -nr | head -50
Or simply:
sudo timeout 15s perf top -p 377477 --stdio | head -100
ls -lt ~/.openclaw/logs /tmp/openclaw 2>/dev/null | head -30
/tmp/openclaw:
total 13272
-rw-rw-r-- 1 najef najef 10994156 Apr 29 14:12 openclaw-2026-04-29.log
drwx------ 2 najef najef 4096 Apr 29 11:14 openclaw-active-memory-YiXRVS
drwx------ 2 najef najef 4096 Apr 29 11:13 tts-73ZiJY
drwx------ 2 najef najef 4096 Apr 29 10:15 tts-RMy68Y
drwxrwxr-x 2 najef najef 4096 Apr 29 09:50 downloads
-rw-rw-r-- 1 najef najef 2545032 Apr 28 23:59 openclaw-2026-04-28.log
drwx------ 2 najef najef 4096 Apr 28 23:56 tts-ScNU2h
drwx------ 2 najef najef 4096 Apr 28 23:56 tts-THeNJm
drwx------ 2 najef najef 4096 Apr 28 23:26 tts-4ubDGh
drwx------ 2 najef najef 4096 Apr 28 23:24 tts-9XIqxu
.openclaw/logs:
total 280
-rw------- 1 najef najef 3289 Apr 29 11:28 config-health.json
-rw------- 1 najef najef 264115 Apr 29 00:00 config-audit.jsonl
-rw-rw-r-- 1 najef najef 918 Apr 28 22:40 gateway-restart.log
-rw-rw-r-- 1 najef najef 4936 Apr 22 23:19 commands.log
Good — use the raw log file directly:
tail -n 1000 /tmp/openclaw/openclaw-2026-04-29.log \
| grep -Ei 'active-memory|bundle-mcp|cron|heartbeat|plugin|timeout|error|liveness|diagnostic|worker|event-loop' \
| tail -150
Also check whether logs are being spammed by one message:
tail -n 5000 /tmp/openclaw/openclaw-2026-04-29.log \
| sed 's/[0-9a-f]\{8\}-[0-9a-f-]\{27,36\}/<uuid>/g' \
| sort | uniq -c | sort -nr | head -30
That openclaw-active-memory-* temp dir is another clue Active Memory was running recently. Not proof by itself, but it lines up with the earlier timeout logs.
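If the log file is JSON lines, every line differs by timestamp, so a plain `sort | uniq -c` may not collapse repeats. A sketch that counts only the message text with numbers masked out; `top_messages` is a hypothetical helper name, and it assumes each line carries a "message":"..." field without embedded double quotes.

```shell
# top_messages [N]: read JSON log lines on stdin, keep only the "message" text,
# mask every run of digits, and print the N most frequent messages (default 10).
top_messages() {
  sed -n 's/.*"message":"\([^"]*\)".*/\1/p' \
    | sed 's/[0-9][0-9]*/<n>/g' \
    | sort | uniq -c | sort -k1,1nr \
    | head -n "${1:-10}"
}
```

Usage: tail -n 5000 /tmp/openclaw/openclaw-2026-04-29.log | top_messages 30. A single message with a count far above the rest is the one to look at first.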
RN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:07:44.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"{"subsystem":"diagnostic"}","parentNames":["openclaw"],"date":"2026-04-29T12:08:14.720Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:08:14.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6793s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"{"subsystem":"diagnostic"}","parentNames":["openclaw"],"date":"2026-04-29T12:08:44.721Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:08:44.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6793s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6823s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"{"subsystem":"diagnostic"}","parentNames":["openclaw"],"date":"2026-04-29T12:09:14.721Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147:14","fileName":"subsystem-rHhUC6qs.js","fileNameWithLine":"subsystem-rHhUC6qs.js:147","fileColumn":"14","fileLine":"147","filePath":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js","filePathWithLine":".nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/subsystem-rHhUC6qs.js:147","method":"logToFile"}},"time":"2026-04-29T14:09:14.721+02:00","hostname":"OpenClaw-Server","message":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6823s queueDepth=0"}
{"0":"{"subsystem":"diagnostic"}","1":"stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6853s queueDepth=0","_meta":{"runtime":"node","runtimeVersion":"24.14.1","hostname":"unknown","name":"
This is a real smoking gun, but it’s a different one than Orion:
stuck session:
sessionKey=agent:arnold:main:heartbeat
state=processing
age=6800s+
queueDepth=0
Meaning: Gateway thinks Arnold’s heartbeat session has been processing for ~2 hours, with no queue behind it. That explains “agents stop mid-work”: at least one heartbeat/session lane is stuck in processing.
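To list every stuck lane at once instead of reading the repeated warnings by eye, a small filter along these lines may help. It is only a sketch: `stuck_summary` is a hypothetical helper name, and it assumes the warning text keeps the `sessionKey=... state=... age=...s` layout seen in the pasted lines.

```shell
# Summarize "stuck session" warnings read from stdin:
# one output line per sessionKey, with the largest age seen for it.
# Assumption: the message layout matches the warnings pasted above.
stuck_summary() {
  grep -oE 'stuck session: sessionId=[^ ]+ sessionKey=[^ ]+ state=[^ ]+ age=[0-9]+s' \
    | sed -E 's/.*sessionKey=([^ ]+) state=([^ ]+) age=([0-9]+)s/\1 \2 \3/' \
    | sort -u \
    | awk '{ if ($3 + 0 > max[$1] + 0) { max[$1] = $3 + 0; st[$1] = $2 } }
           END { for (k in max)
                   printf "%s state=%s max_age=%ds (~%d min)\n", k, st[k], max[k], int(max[k] / 60) }'
}
```

Usage: `stuck_summary < /tmp/openclaw/openclaw-2026-04-29.log`. Each JSON log line carries the message twice, which is why the matches are de-duplicated with `sort -u` before aggregating.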
Now we need the first cause, not the repeated warning. Run:
LOG=/tmp/openclaw/openclaw-2026-04-29.log
grep -nE 'agent:arnold:main:heartbeat|sessionId=arnold|stuck session recovery|heartbeat.*arnold|active-memory.*arnold|timeout|abort|error' "$LOG" | tail -250
Also specifically:
grep -n 'stuck session: sessionId=arnold' "$LOG" | head -1
Take the line number from that first result, then show ~200 lines before it:
LINE=<line-number>
# clamp the start so sed never gets a zero or negative line number
sed -n "$(( LINE > 200 ? LINE - 200 : 1 )),$((LINE+20))p" "$LOG"
What I’m looking for:
- stuck session recovery skipped ... active run
- active-memory ... arnold ... timeout
- model timeout / abort cleanup failure
- heartbeat run started but never ended
Current diagnosis: a scheduled heartbeat session (arnold:main:heartbeat) got stuck in processing; the Gateway keeps warning every 30s but is not recovering it. That’s a bug-shaped state.
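The ages in the pasted warnings advance by 30 each time (6733s, 6763s, 6793s), which is where the 30-second cadence comes from. A quick way to check that against the full log might look like the sketch below; `age_deltas` is a hypothetical helper name, and it assumes one `stuck session` warning per log line for a single session.

```shell
# Print the difference in seconds between consecutive "stuck session" ages.
# Reads log lines on stdin; feed it warnings for one session at a time.
age_deltas() {
  sed -nE 's/.*stuck session: .* age=([0-9]+)s.*/\1/p' \
    | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}
```

Usage: `grep 'sessionKey=agent:arnold:main:heartbeat' "$LOG" | age_deltas | sort -n | uniq -c`. A steady run of 30s suggests the same run has been stuck the whole time; a drop to a small or negative value would suggest the session restarted in between.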
I did the bootstrap check first: I don’t have a useful resume file/highlight trail for this thread, so I’m going off what you posted here.
This does look like a real bug on v2026.4.26. I can see several fixes landed on main right after April 26, 2026 that match “agent stops until I poke it again” pretty closely:
- stale session-lane recovery
- defer agent dispatch after accept
- yield after accepted ack
- heartbeat/cron busy-lane deferral
So my read: you’re probably hitting a stuck queue/session-lane problem, not just a bad prompt/config.
Two good paths:
- Fastest check: test current main / next build if you’re comfortable.
- If staying on 2026.4.26: grab one real stall capture:
  - openclaw status --all
  - openclaw logs --follow
  - reproduce once
  - paste the actual log lines around the stall (the ls output helps show where logs are, but not why it wedged)
If you’re using subagents, cron, or heartbeats, say so too — that matters here.
Docs:
- https://docs.openclaw.ai/help/faq
- https://docs.openclaw.ai/gateway/troubleshooting
- https://docs.openclaw.ai/tools/subagents
- https://docs.openclaw.ai/gateway/heartbeat
If you paste openclaw status --all plus ~100-200 log lines from the stall, I can help narrow which of those fixes you’re actually hitting.
Bootstrap check first: I don’t have a useful resume file/highlight trail for this thread, so I’m rebuilding context from what you pasted here. I may be missing earlier details, but this log is enough to say something concrete.
That log is the smoking gun:
stuck session: sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0
So yes — this looks like a real stuck-session / heartbeat-lane bug on v2026.4.26, not just “the model got lazy.”
What it means:
- the heartbeat session stayed in processing
- it sat there for ~2 hours
- queueDepth=0 means there was no queued work behind it, so it wasn’t busy draining a backlog
That lines up really well with fixes that landed after v2026.4.26:
- stale session-lane recovery
- defer agent dispatch after accept
- yield after accepted ack
- heartbeat busy-lane deferral
So my recommendation is:
- Best fix: test current main / next release if you can. Your version is just before the stuck-session recovery work.
- Best workaround on 2026.4.26: temporarily disable heartbeat for that agent and see if the “agents stop mid work” symptom disappears. Docs use:
agents: { defaults: { heartbeat: { every: "0m" } } }
then restart the gateway.
- If you want heartbeats back on a newer build, I’d also try:
isolatedSession: true
lightContext: true
skipWhenBusy: true
That last one is from the newer heartbeat fix.
If you want, paste:
- openclaw status --all
- the 20-50 log lines before the first stuck session warning
and I can help tell whether this is just heartbeat wedging or heartbeat + another lane-pressure issue.
Docs:
- https://docs.openclaw.ai/help/faq
- https://docs.openclaw.ai/gateway/troubleshooting
- https://docs.openclaw.ai/gateway/heartbeat
How do I give you the 20-50 log lines before the first stuck session warning?
Use this:
LOG=/tmp/openclaw/openclaw-2026-04-29.log
LINE=$(grep -n 'stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat' "$LOG" | head -1 | cut -d: -f1)
echo "first stuck-session line: $LINE"
# clamp the start so sed never gets a zero or negative line number
sed -n "$(( LINE > 50 ? LINE - 50 : 1 )),$((LINE+10))p" "$LOG"
If that’s too much/noisy, filter it:
sed -n "$(( LINE > 50 ? LINE - 50 : 1 )),$((LINE+10))p" "$LOG" \
| grep -Ei 'arnold|heartbeat|active-memory|timeout|abort|error|session|processing|recovery|model|failover'
Paste that output in a code block.
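If you expect to repeat this for other sessions, the same steps can be wrapped in a small function. This is a sketch only: `first_match_context` is a hypothetical name, and it treats the pattern as a fixed string rather than a regex.

```shell
# Print BEFORE lines ahead of, and AFTER lines past, the first line in FILE
# that contains PATTERN (fixed string).
# Usage: first_match_context PATTERN FILE [BEFORE] [AFTER]
first_match_context() {
  pattern=$1 file=$2 before=${3:-50} after=${4:-10}
  line=$(grep -n -m1 -F -- "$pattern" "$file" | cut -d: -f1)
  if [ -z "$line" ]; then
    echo "no match for: $pattern" >&2
    return 1
  fi
  # Clamp the start so sed never sees line 0 or a negative number.
  start=$(( line > before ? line - before : 1 ))
  sed -n "${start},$(( line + after ))p" "$file"
}
```

Usage: `first_match_context 'stuck session: sessionId=arnold' /tmp/openclaw/openclaw-2026-04-29.log 50 10`. Unlike the inline version, it reports a missing match instead of printing an unrelated slice of the log.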
How can I safely downgrade?
Yes — safe downgrade flow:
mkdir -p ~/openclaw-backups
openclaw backup create --verify --output ~/openclaw-backups
Then stop the wedged gateway:
openclaw gateway stop
If that can’t connect, terminate the PID you saw:
kill -TERM 377477
sleep 5
ps -p 377477 || true
Then install a previous version, for example:
npm install -g openclaw@2026.4.25
openclaw --version
openclaw gateway install --force
openclaw gateway start
If you want to go back to 2026.4.24 instead:
npm install -g openclaw@2026.4.24
After downgrade:
openclaw status
openclaw gateway status
One caution: don’t downgrade too far. OpenClaw has config-version guardrails, so an older binary may refuse some service/config mutations if openclaw.json was touched by a newer version. If that happens, stop and paste the exact error before forcing anything.
i want v23 or even v22
Yep, both exist: 2026.4.23 and 2026.4.22.
I’d try 2026.4.23 first:
mkdir -p ~/openclaw-backups
openclaw backup create --verify --output ~/openclaw-backups
openclaw gateway stop
npm install -g openclaw@2026.4.23
openclaw --version
openclaw gateway install --force
openclaw gateway start
openclaw status
If you want 2026.4.22 instead:
openclaw gateway stop
npm install -g openclaw@2026.4.22
openclaw --version
openclaw gateway install --force
openclaw gateway start
If gateway stop can’t reach the wedged gateway, use the PID:
kill -TERM 377477
sleep 5
If the older binary refuses because config was touched by newer OpenClaw, don’t force yet — paste that exact error.
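For the kill-by-PID fallback, a fixed `sleep 5` can be either too short or longer than needed. A hedged alternative is to poll until the process is actually gone. `term_and_wait` is a hypothetical helper, not an OpenClaw command, and it assumes you run it as the same user that owns the gateway process.

```shell
# Send SIGTERM to PID, then wait up to TIMEOUT seconds for it to exit.
# Returns 0 once the process is gone, 1 if it is still alive after TIMEOUT.
# It never sends SIGKILL; escalate by hand if it returns 1.
term_and_wait() {
  pid=$1 remaining=${2:-10}
  # If the signal cannot be delivered, treat the process as already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  while kill -0 "$pid" 2>/dev/null; do
    [ "$remaining" -gt 0 ] || return 1
    sleep 1
    remaining=$(( remaining - 1 ))
  done
  return 0
}
```

Usage: `term_and_wait 377477 10 && npm install -g openclaw@2026.4.23`, so the install only starts once the old gateway has really exited.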
Is v22 too far?
2026.4.22 is not crazy far, but I’d avoid jumping straight there unless 2026.4.23 still has the problem.
Reason: each step back increases the chance of config/state migrations or package/runtime dependency differences biting you. 2026.4.23 is a safer rollback target and likely far enough if this came from .24/.25/.26.
Recommended order:
npm install -g openclaw@2026.4.23
Test. If still broken:
npm install -g openclaw@2026.4.22
Definitely make the backup first:
openclaw backup create --verify --output ~/openclaw-backups
what should I write for issue I will open?
Use something like this:
## Summary
OpenClaw Gateway becomes CPU-saturated and agents stop progressing mid-work until I send another message. During the bad state, HTTP `/healthz` and sometimes `/readyz` can respond, but Gateway WS/RPC commands time out or close during connect.
## Environment
- OpenClaw: 2026.4.26 (be8c246)
- OS: Ubuntu 24.04.4 LTS
- Node: v24.14.1
- Install: npm global via nvm
- Gateway bind: loopback, port 18789
- systemd: not installed
## Symptoms
- Agents stop mid-work and only continue after I send a message.
- `openclaw-gateway` reaches very high CPU:
  - observed 322% CPU
  - later observed 504% CPU
- Gateway RPC becomes unreliable:
  - `openclaw gateway stability --json --limit 100` failed with `gateway timeout after 10000ms`
  - `openclaw cron show ...` failed with `gateway closed (1000 normal closure): no close reason`
  - `openclaw logs --plain ...` failed with `gateway closed (1000)`
- HTTP liveness still worked:
  - `curl http://127.0.0.1:18789/healthz` → `{"ok":true,"status":"live"}`
  - `curl http://[::1]:18789/healthz` → `{"ok":true,"status":"live"}`
  - `/readyz` returned `{"ready":true,"failing":[]}` at least once
## Evidence
Thread view while CPU was high:
```text
openclaw-gateway 504% CPU
```
Per-thread CPU showed many hot libuv worker threads:
```text
libuv-worker 78.4%
libuv-worker 63.9%
libuv-worker 46.5%
libuv-worker 45.1%
...
openclaw-gateway main thread ~21%
```
`strace -f -p <gateway-pid> -c` showed mostly:
```text
epoll_pwait
statx
futex
write/read
```
Path-level strace showed repeated OpenClaw package/plugin metadata resolution:
```text
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json = ENOENT
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/extensions
/home/najef/.openclaw/cron/jobs-state.json
```
Logs repeatedly showed a stuck heartbeat session:
```text
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0
...
```
There was also a cron task repeatedly timing out:
```json
{
  "runtime": "cron",
  "label": "Orion Heartbeat",
  "agentId": "neon",
  "childSessionKey": "agent:neon:telegram:direct:2101884310",
  "status": "timed_out",
  "error": "cron: job execution timed out"
}
```
Active Memory was enabled globally and configured for many agents. It was pinned to MiniMax M2.5:
```json
{
  "plugins.entries.active-memory.config.model": "minimax/MiniMax-M2.5",
  "timeoutMs": 15000,
  "thinking": "low"
}
```
Logs showed Active Memory timeouts:
```text
active-memory: agent=arnold session=agent:arnold:telegram:direct:2101884310 activeProvider=minimax activeModel=MiniMax-M2.5 start timeoutMs=15000
active-memory ... done status=timeout elapsedMs=16511 summaryChars=0
embedded_run_failover_decision ... failoverReason="timeout" ... timedOut=true aborted=true
```
MCP also had a MiniMax server configured:
```json
{
  "command": "uvx",
  "args": ["minimax-coding-plan-mcp", "-y"]
}
```
Logs showed:
```text
failed to start server "minimax" (uvx minimax-coding-plan-mcp -y): Error: MCP server connection timed out after 30000ms
```
## Current hypothesis
Gateway appears to enter CPU saturation / worker-pool saturation. It may be triggered by heartbeat/cron/active-memory/MCP setup, but the deeper symptom is many busy libuv-worker threads plus repeated package/plugin metadata resolution. Once saturated, WS/RPC commands fail while simple HTTP health endpoints may still respond.
## Expected behavior
Gateway should recover stuck heartbeat/session work, not remain in processing for hours, and WS/RPC should not become unusable under this condition.
## Workarounds considered
- Downgrade to 2026.4.23 or 2026.4.22
- Disable Active Memory globally
- Disable bundled MCP / remove `minimax` MCP
- Reduce heartbeat frequency
I have not yet confirmed which of these avoids the CPU saturation.
Maybe not include things we are not sure are an issue, like MiniMax M2.5?
Yes, agreed. Keep the issue focused on confirmed evidence, and put uncertain things under “possibly related” or omit them.
Use this cleaner version:
## Summary
OpenClaw Gateway becomes CPU-saturated and agents stop progressing mid-work until I send another message. During the bad state, simple HTTP health endpoints can respond, but Gateway WS/RPC commands time out or close during connect.
## Environment
- OpenClaw: 2026.4.26 (be8c246)
- OS: Ubuntu 24.04.4 LTS
- Node: v24.14.1
- Install: npm global via nvm
- Gateway bind: loopback, port 18789
- systemd: not installed
## Symptoms
- Agents stop mid-work and only continue after I send another message.
- `openclaw-gateway` reaches very high CPU:
  - observed 322% CPU
  - later observed 504% CPU
- Gateway RPC becomes unreliable:
  - `openclaw gateway stability --json --limit 100` failed with `gateway timeout after 10000ms`
  - `openclaw cron show ...` failed with `gateway closed (1000 normal closure): no close reason`
  - `openclaw logs --plain ...` failed with `gateway closed (1000)`
- HTTP liveness still worked:
  - `curl http://127.0.0.1:18789/healthz` → `{"ok":true,"status":"live"}`
  - `curl http://[::1]:18789/healthz` → `{"ok":true,"status":"live"}`
  - `/readyz` returned `{"ready":true,"failing":[]}` at least once
## Evidence
Process list while CPU was high:
```text
openclaw-gateway 504% CPU
```
Per-thread CPU showed many hot libuv-worker threads:
```text
libuv-worker 78.4%
libuv-worker 63.9%
libuv-worker 46.5%
libuv-worker 45.1%
...
openclaw-gateway main thread ~21%
```
`strace -f -p <gateway-pid> -c` showed activity including:
```text
epoll_pwait
statx
futex
write/read
```
Path-level strace showed repeated package/plugin/cron-state paths:
```text
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/package.json = ENOENT
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/package.json
/home/najef/.nvm/versions/node/v24.14.1/lib/node_modules/openclaw/dist/extensions
/home/najef/.openclaw/cron/jobs-state.json
```
Logs repeatedly showed a stuck heartbeat session:
```text
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6733s queueDepth=0
stuck session: sessionId=arnold sessionKey=agent:arnold:main:heartbeat state=processing age=6763s queueDepth=0
...
```
A cron task also timed out while this was happening:
```json
{
  "runtime": "cron",
  "label": "Orion Heartbeat",
  "agentId": "neon",
  "childSessionKey": "agent:neon:telegram:direct:2101884310",
  "status": "timed_out",
  "error": "cron: job execution timed out"
}
```
## Expected behavior
Gateway should recover or fail stuck heartbeat/session work, not remain in `processing` for hours, and WS/RPC should not become unusable while HTTP health endpoints still respond.
## Notes