I need help diagnosing a native subagent hang/regression.
Environment:
- OpenClaw 2026.5.7 stable
- Linux arm64, Node v25.9.0
- Gateway service: openclaw-gateway.service
- Main model: openai-codex/gpt-5.5
- Native subagent runtime
- Stable appears to use @mariozechner/pi-coding-agent 0.73.0
Problem:
Native subagents used to work, but since around May 10–11, a single subagent can hang the whole gateway. This is not just “too many subagents”.
Symptoms:
- Dashboard/WebChat request timeouts / WebSocket disconnects
- Discord gateway disconnects/timeouts
- Gateway becomes very slow/unresponsive
- Restart often needs SIGKILL
- Logs show event-loop starvation:
  - eventLoopUtilization=1
  - cpuCoreRatio≈1
  - large eventLoopDelay
  - Discord fetch timers delayed tens of seconds
  - WebSocket calls backed up for 100–200s+
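For anyone who wants to cross-check these metrics outside the gateway, here is a minimal sketch using Node's built-in `perf_hooks`. I do not know how OpenClaw computes its own values, so this only illustrates what utilization and delay mean in plain Node; `sampleEventLoop` is a hypothetical helper name, not an OpenClaw API.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical helper (not part of OpenClaw): samples event-loop health
// over a fixed window and reports utilization plus worst-case delay.
async function sampleEventLoop(
  windowMs: number,
): Promise<{ utilization: number; maxDelayMs: number }> {
  // Histogram of event-loop delay, sampled every 10 ms.
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();

  // Baseline reading; passing it back later yields the delta for the window.
  const before = performance.eventLoopUtilization();

  await new Promise<void>((resolve) => setTimeout(resolve, windowMs));

  const delta = performance.eventLoopUtilization(before);
  histogram.disable();

  return {
    // 0 = loop fully idle, 1 = loop busy for the whole window.
    utilization: delta.utilization,
    // Histogram values are reported in nanoseconds.
    maxDelayMs: histogram.max / 1e6,
  };
}
```

In an idle process, `sampleEventLoop(1000)` should report a utilization close to 0. A sustained value near 1 together with a large `maxDelayMs` would match the starvation pattern described above.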
Important test:
A minimal Discord-origin subagent reproduced the hang, so this does not seem Dashboard-specific.
Discord smoke test:
- runId: 30e325bf-a394-4328-bd3d-9f5aee80ed65
- childSessionKey: agent:coder:subagent:e6459bbb-fbc6-49f3-9c65-df56cfa1c2e5
- requesterOrigin.channel: discord
- context=isolated, lightContext=true, runTimeoutSeconds=60
- task only: reply DISCORD_SUBAGENT_SMOKE_OK
Logs:
- prep totalMs=143495
- session-resource-loader=141347ms
- startup totalMs=40595
- attempt-dispatch=39821ms
- subagent lane timed out after 90000ms
- Discord fetch timeout elapsed 46636ms; timer delayed 36636ms
- chat.history WS calls took 145–215s
- cleanup timed out at pi-trajectory-flush
- systemd SIGKILLed gateway during restart
- service reported 5.6G memory peak
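A quick sanity check of the timings above, as a sketch. The numbers are copied from the log lines; the helper and variable names are hypothetical. Reading "elapsed minus timer delay" as the configured fetch timeout is my interpretation of the log, not something I have confirmed.

```typescript
// Values copied from the log lines above (all in milliseconds).
const prepTotalMs = 143495;
const sessionResourceLoaderMs = 141347;
const startupTotalMs = 40595;
const attemptDispatchMs = 39821;
const laneTimeoutMs = 90000;
const fetchElapsedMs = 46636;
const fetchTimerDelayMs = 36636;

// Hypothetical helper: share of a phase within its parent phase,
// as a percentage rounded to one decimal place.
function sharePercent(partMs: number, totalMs: number): number {
  return Math.round((partMs / totalMs) * 1000) / 10;
}

// session-resource-loader accounts for nearly all of prep.
const loaderShare = sharePercent(sessionResourceLoaderMs, prepTotalMs); // 98.5

// attempt-dispatch accounts for nearly all of startup.
const dispatchShare = sharePercent(attemptDispatchMs, startupTotalMs); // 98.1

// Assumption: elapsed time minus timer delay is the configured timeout.
const impliedFetchTimeoutMs = fetchElapsedMs - fetchTimerDelayMs; // 10000

// prep alone already exceeds the 90000ms lane timeout.
const prepExceedsLaneTimeout = prepTotalMs > laneTimeoutMs; // true

console.log({ loaderShare, dispatchShare, impliedFetchTimeoutMs, prepExceedsLaneTimeout });
```

So almost all of the time in both phases sits in a single step (session-resource-loader and attempt-dispatch), and the prep phase by itself runs longer than the lane timeout for a task that only has to reply with a fixed string.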
Earlier run:
- runId: bb9da5a5-3c79-4924-870a-38ba1fa9ec89
- outcome: subagent run lost active execution context
- session-resource-loader=59705ms
- same event-loop starvation / Discord timeout pattern
Question:
Is this a known 2026.5.7 issue, a stale-state issue, or a pi-coding-agent regression? Should we upgrade to beta 2026.5.12-beta.6, clean the subagent/session state, or both?