#Session lockup on cold start — event loop blocking causing Slack disconnect loop


ruby nimbus

Problem: Session lockup on cold start — the main session hangs at startup with "processing" shown but no active run, causing Slack to disconnect and reconnect in a loop.

Root cause identified:

  • bundle-tools takes 5039ms on every cold session start and blocks the event loop for that time
  • While the loop is blocked, the Slack socket ping (30s default) goes unanswered and times out
  • Reconnect loop deadlocks with the stuck event loop
  • WAL checkpoint fails because the session is frozen mid-heartbeat

Evidence from logs:

  • eventLoopDelayMaxMs=32866 on cold startup (spike from 17s model boot + 34s bundle-tools)
  • Normal P99 is 32-110ms after warmup
  • Session counter: 91 sessions on main agent
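
For anyone who wants to reproduce a figure like eventLoopDelayMaxMs outside the gateway, here is a minimal sketch using Node's perf_hooks delay histogram. It assumes the logged metric is derived from something comparable, which is not confirmed from the source. blockFor is a hypothetical stand-in for a synchronous startup step such as bundle-tools, and maxDelayDuringBlock is an illustrative name.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Hypothetical stand-in for a synchronous startup step: spins without yielding.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait on purpose; nothing else can run on this thread meanwhile
  }
}

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the worst event loop delay (in ms) observed around a blocking step.
async function maxDelayDuringBlock(blockMs: number): Promise<number> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  await pause(50); // let the sampler take a few normal readings first
  blockFor(blockMs); // the event loop cannot service timers during this call
  await pause(50); // give the overdue sample a chance to be recorded
  histogram.disable();
  return histogram.max / 1e6; // histogram values are reported in nanoseconds
}
```

Calling maxDelayDuringBlock(200) should report a maximum close to the length of the block, which is the same pattern as a multi-second spike on cold start followed by a low P99 after warmup.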

Current workaround: Restart gateway from Control UI instead of from inside the session. Hot reload of browser plugin works fine.

Question: Is there a way to either:

  1. Reduce the bundle-tools blocking time on cold start, OR
  2. Increase the Slack socket ping timeout to be longer than the worst-case event loop delay, OR
  3. Add a WAL checkpoint before the cold-start bottleneck so recovery works after a crash?

Happy to provide more details or test any fixes.

regal parrot BOT

I checked the current source/docs. Short version: your workaround makes sense, and I don’t see a config knob for the specific Slack socket ping timeout today.

What I’d try first to reduce cold-start blocking:

  • Identify whether an enabled bundle is contributing MCP/LSP servers: run openclaw plugins list, then openclaw plugins inspect <id>.
  • If you don’t need those bundle MCP/LSP tools on the main agent, disable the offending bundle/plugin entry or disable plugins for that runtime. The cold path is creating/materializing bundle MCP tools and LSP runtimes before the run starts.
  • I would not rely on tools.deny: ["bundle-mcp"] as a latency fix without testing it. The docs say it opts out of bundle MCP exposure, but the code path I checked appears to build/materialize bundled tools before final policy filtering, so it may still pay the startup cost.
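
If the bundle cannot be disabled, a general way to shorten each stall is to split startup work into smaller steps and yield to the event loop between them, so timers and socket pings get a chance to run. A minimal sketch follows, assuming the setup can be broken into independent synchronous steps. runInSlices and the Step type are hypothetical and not OpenClaw APIs.

```typescript
import { setImmediate as yieldToLoop } from "node:timers/promises";

type Step = () => void;

// Runs synchronous steps in order and yields to the event loop whenever the
// current slice has used up its time budget. Returns how often it yielded.
async function runInSlices(steps: Step[], budgetMs = 50): Promise<number> {
  let yields = 0;
  let sliceStart = Date.now();
  for (const step of steps) {
    step();
    if (Date.now() - sliceStart >= budgetMs) {
      await yieldToLoop(); // due timers and I/O callbacks get a turn here
      yields += 1;
      sliceStart = Date.now();
    }
  }
  return yields;
}
```

This only helps when no single step is long on its own. A step that blocks for several seconds would still stall the loop and would need to move to a worker thread or a child process instead.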

For Slack:

  • Current Slack config exposes mode: "socket" | "http", but I don’t see a supported OpenClaw config field for the Socket Mode ping/pong timeout. The Bolt receiver is created with autoReconnectEnabled: false; the ping timeout behavior comes from Slack’s socket client internals.
  • If this is hurting production, switching Slack to HTTP/Event API mode may avoid the long-lived socket timeout failure mode, but it requires a reachable webhook/signing-secret setup.

For checkpoint/recovery:

  • Session store writes are atomic JSON writes, not a SQLite WAL path for sessions.json.
  • There is an early session-entry persist before the embedded Pi run/tool prep in the current path. If the frozen state is “processing with no active run”, though, the missing durable bit may be the active-run/heartbeat lifecycle state rather than the normal session entry. A pre-bundle-tools active-run checkpoint sounds like a reasonable bugfix area.
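
To make the two points above concrete, here is a minimal sketch of an atomic JSON write (temp file plus rename) used to persist an active-run marker before slow tool preparation begins. The ActiveRunMarker shape, the file name, and all function names are hypothetical illustrations, not OpenClaw's actual session store code.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical shape of the "durable bit" discussed above.
interface ActiveRunMarker {
  sessionId: string;
  phase: "preparing-tools" | "running" | "idle";
  updatedAt: number;
}

// Write to a temp file in the same directory, then rename it over the target.
// On POSIX filesystems the rename swaps the file in one step, so a reader
// sees either the old content or the new content, not a half-written file.
function writeJsonAtomic(path: string, value: unknown): void {
  const tmp = join(dirname(path), `.${process.pid}.${Date.now()}.tmp`);
  writeFileSync(tmp, JSON.stringify(value, null, 2), "utf8");
  renameSync(tmp, path);
}

// Record that a run is about to enter slow tool prep, so a later restart can
// tell "stuck during startup" apart from "idle".
function checkpointBeforeToolPrep(path: string, sessionId: string): ActiveRunMarker {
  const marker: ActiveRunMarker = {
    sessionId,
    phase: "preparing-tools",
    updatedAt: Date.now(),
  };
  writeJsonAtomic(path, marker);
  return marker;
}

// Returns the stored marker, or undefined if it is missing or unreadable.
function readMarker(path: string): ActiveRunMarker | undefined {
  try {
    return JSON.parse(readFileSync(path, "utf8")) as ActiveRunMarker;
  } catch {
    return undefined;
  }
}

// Usage: write a marker into a scratch directory and read it back.
const scratchDir = mkdtempSync(join(tmpdir(), "active-run-"));
const markerPath = join(scratchDir, "active-run.json");
checkpointBeforeToolPrep(markerPath, "main");
const restored = readMarker(markerPath);
```

Note that a rename gives atomicity, not durability: surviving a power loss would additionally need an fsync of the temp file before the rename, which this sketch leaves out.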

Please file this as a GitHub issue with openclaw --version, openclaw status --all, and the log slice around bundle-tools / eventLoopDelayMaxMs: