#WhatsApp stale socket

clear fulcrum

WhatsApp stale-socket disconnect every ~35 min (+ rapid drops after restart)

Env: OpenClaw 2026.3.12, macOS Intel x64, Node v25.4.0, local gateway, launchd KeepAlive

Problem: Health-monitor detects stale-socket, restarts WhatsApp channel every ~35 min. 470+ occurrences since Feb 21. Messages fail during reconnect windows.

Pattern (Mar 12-13 timestamps):
19:45, 20:20, 20:55, 21:30, 22:05, 23:00, 23:35, 00:40, 03:15, 03:50, 05:20, 05:55, 06:30, 07:05, 07:40
~35 min intervals. Too regular for network issues.

Additional: After gateway restart (config.patch → SIGUSR1), listener reconnects but dies within 2-4 min — much faster than the 35-min cycle. Restart destabilizes the Baileys session.

Root cause: makeWASocket in source doesn't set keepAliveIntervalMs (uses Baileys default 30s). Socket silently dies ~20 min after connecting. 15-min health check catches it → 35 min cycle.

Impact: Outbound messages fail silently. Deliveries exceed max retries (5/5) → moved to failed/. Agent can't reliably message users.

Logs:

[health-monitor] [whatsapp:default] restarting (reason: stale-socket)
[whatsapp] No messages received in 30m - restarting connection
[whatsapp] Web connection closed (status 499)

Config: channelHealthCheckMinutes: 15, reconnect.maxAttempts: 0

Related: #41339 (Discord same pattern), #16270 (reconnect loop bans), #19856 (messages lost after reconnect)

Request: Expose Baileys socket tuning (keepAliveIntervalMs, connectTimeoutMs) as config. Investigate silent socket death.

frank lanternBOT

Your analysis lines up with what the current source is doing — but there’s an extra “gotcha” that explains the super-regular ~35 min cadence even if the Baileys socket isn’t actually dying.

What’s happening (based on current code)

There are two independent restart mechanisms in play:

  1. Gateway channel health monitor → reason: stale-socket
    The gateway marks a channel unhealthy if:
    • connected === true
    • and lastEventAt is older than 30 minutes (DEFAULT_CHANNEL_STALE_EVENT_THRESHOLD_MS = 30 * 60_000)
    • and the channel isn’t Telegram long-polling / webhook mode

Source: src/gateway/channel-health-policy.ts

For WhatsApp Web, lastEventAt is set on connect via createConnectedChannelStatusPatch() (sets lastEventAt = now) and then only bumped again when a message is handled. So if you go > 30 minutes without an “event”, the gateway will classify it as stale and restart it.

Source: src/gateway/channel-status-patches.ts, src/web/auto-reply/monitor.ts

The “~35 min” part is consistent with 30 min threshold + health-check cadence alignment (you’re running checks every 15 min, so it often trips on the next tick after crossing 30).

  2. WhatsApp watchdog → status 499 + “No messages received in 30m”
    Separately, the WhatsApp web monitor has a watchdog that forces a reconnect if it has seen messages before and then sees 30m of silence (it emits status 499, “watchdog-timeout”).

Source: src/web/auto-reply/monitor.ts
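
To make the cadence argument concrete, here is a minimal sketch of how a 30-minute "no events" threshold and a 15-minute check interval can combine into a ~35-minute restart cycle on a quiet account. Only DEFAULT_CHANNEL_STALE_EVENT_THRESHOLD_MS comes from the source; the function names, the snapshot shape, and the 5-minute tick offset are assumptions made for illustration, not the gateway's actual code.

```typescript
const MINUTE_MS = 60_000;
const DEFAULT_CHANNEL_STALE_EVENT_THRESHOLD_MS = 30 * MINUTE_MS;

// Hypothetical snapshot shape: lastEventAt is set on connect and only bumped
// again when a message is handled.
interface ChannelSnapshot {
  connected: boolean;
  lastEventAt: number; // epoch ms
}

// A connected channel counts as stale once its last event is older than the threshold.
function isStaleSocket(
  snapshot: ChannelSnapshot,
  now: number,
  thresholdMs: number = DEFAULT_CHANNEL_STALE_EVENT_THRESHOLD_MS,
): boolean {
  return snapshot.connected && now - snapshot.lastEventAt > thresholdMs;
}

// Time of the first health-check tick that reports the channel as stale,
// assuming no events arrive after the connect.
function firstStaleTick(
  connectedAt: number,
  firstTickAt: number,
  checkIntervalMs: number,
): number {
  if (checkIntervalMs <= 0) throw new Error("checkIntervalMs must be positive");
  const snapshot: ChannelSnapshot = { connected: true, lastEventAt: connectedAt };
  let tick = firstTickAt;
  while (!isStaleSocket(snapshot, tick)) tick += checkIntervalMs;
  return tick;
}

// Quiet account connected at t=0, health checks at 5, 20, 35, ... minutes.
const restartAt = firstStaleTick(0, 5 * MINUTE_MS, 15 * MINUTE_MS);
console.log(restartAt / MINUTE_MS); // 35
```

The exact interval depends on where the connect lands relative to the health-check timer, so other offsets give other values between 30 and 45 minutes in this model.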

Re: keepAliveIntervalMs

You’re also correct that we currently call makeWASocket({...}) without passing Baileys socket tuning like keepAliveIntervalMs / connectTimeoutMs.

Source: src/web/session.ts

That said: the gateway’s stale-socket logic can cause the exact restart pattern you’re seeing even if the socket is perfectly healthy, because “no events for 30m” is treated as “socket stale”.
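
On the socket-tuning side, a small sketch of what passing these options through could look like. keepAliveIntervalMs and connectTimeoutMs are the Baileys option names mentioned above; the WhatsAppSocketTuning type, resolveSocketTuning, and the idea of reading them from user config are hypothetical and not part of the current source.

```typescript
// Hypothetical user-facing config shape for the two Baileys options.
interface WhatsAppSocketTuning {
  keepAliveIntervalMs?: number;
  connectTimeoutMs?: number;
}

// Accept only positive integers; anything else is treated as "not set".
function pickPositiveInt(value: number | undefined): number | undefined {
  return value !== undefined && Number.isInteger(value) && value > 0
    ? value
    : undefined;
}

// Returns only the keys the user actually set, so spreading the result into
// the makeWASocket options leaves the Baileys defaults in place otherwise.
function resolveSocketTuning(cfg: WhatsAppSocketTuning = {}): WhatsAppSocketTuning {
  const tuning: WhatsAppSocketTuning = {};
  const keepAlive = pickPositiveInt(cfg.keepAliveIntervalMs);
  const connectTimeout = pickPositiveInt(cfg.connectTimeoutMs);
  if (keepAlive !== undefined) tuning.keepAliveIntervalMs = keepAlive;
  if (connectTimeout !== undefined) tuning.connectTimeoutMs = connectTimeout;
  return tuning;
}

// Intended call-site shape (not executed here):
//   makeWASocket({ ...existingOptions, ...resolveSocketTuning(userConfig) })
console.log(resolveSocketTuning({ keepAliveIntervalMs: 15_000 }));
```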

Practical workarounds (until there’s a proper fix)

  • If you want to stop the periodic flapping right now, the blunt instrument is setting:
    • gateway.channelHealthCheckMinutes: 0 (disables the health monitor)
    • or set it much higher to reduce churn.
  • Also sanity check that your config uses the right path for reconnect tuning: it’s web.reconnect.* (not reconnect.*). And maxAttempts: 0 means “unlimited” in current logic.
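
For reference, the two settings above laid out by key path. This is written as a TypeScript object literal purely for illustration; the key paths are the ones named in this thread, while the surrounding structure and the actual config file syntax are assumptions.

```typescript
// Illustrative only: mirrors the key paths gateway.channelHealthCheckMinutes
// and web.reconnect.maxAttempts; the real config file format may differ.
const workaroundConfig = {
  gateway: {
    // 0 disables the channel health monitor, and with it the stale-socket restarts.
    channelHealthCheckMinutes: 0,
  },
  web: {
    reconnect: {
      // 0 means "unlimited" attempts in the current logic.
      maxAttempts: 0,
    },
  },
};

console.log(JSON.stringify(workaroundConfig, null, 2));
```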

What to file as an issue (this should be a GitHub issue)

This really wants a code fix / config surface. I’d open an issue proposing one (or both) of:

  • Bump lastEventAt periodically from the existing WhatsApp heartbeat timer (so health-monitor doesn’t false-positive on quiet accounts)
  • Expose Baileys socket options (keepAliveIntervalMs, connectTimeoutMs, maybe more) under config (likely channels.whatsapp.* or web.*)

If you file it, include the key detail that stale-socket is currently a hard-coded 30m “no events” threshold and WhatsApp’s lastEventAt isn’t updated by any heartbeat/ping.
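
A rough sketch of the first proposal, bumping lastEventAt from a heartbeat. Every name here (ChannelStatus, bumpIfAlive, startLivenessBump, isSocketOpen) is hypothetical, and the real heartbeat timer may be structured differently. The one design point worth keeping is that the bump only happens while the socket reports itself open, so a genuinely dead socket still goes stale.

```typescript
// Hypothetical status record tracked by the gateway for one channel.
interface ChannelStatus {
  connected: boolean;
  lastEventAt: number; // epoch ms
}

// Bump lastEventAt only when the channel is connected AND the underlying
// socket is open; otherwise leave it alone so the health monitor can still
// detect a dead connection. Returns true if a bump happened.
function bumpIfAlive(
  status: ChannelStatus,
  isSocketOpen: () => boolean,
  now: number,
): boolean {
  if (!status.connected || !isSocketOpen()) return false;
  status.lastEventAt = now;
  return true;
}

// Runs bumpIfAlive on a fixed interval and returns a function that stops it.
function startLivenessBump(
  status: ChannelStatus,
  isSocketOpen: () => boolean,
  intervalMs: number,
): () => void {
  const timer = setInterval(() => {
    bumpIfAlive(status, isSocketOpen, Date.now());
  }, intervalMs);
  return () => clearInterval(timer);
}

// Usage: a quiet but healthy channel keeps a fresh lastEventAt.
const status: ChannelStatus = { connected: true, lastEventAt: 0 };
const stop = startLivenessBump(status, () => true, 60_000);
stop(); // stop immediately in this example so the process can exit
```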

GitHub issues: https://github.com/openclaw/openclaw/issues

clear fulcrum

WhatsApp outbound fails while inbound + auto-reply work (state mismatch)

Env: OpenClaw 2026.3.12, macOS Intel x64, Node v25.4.0, local gateway

Problem: Agent's message tool send fails with "No active WhatsApp Web listener" while inbound messages arrive and auto-reply sends succeed — on the same connection, seconds apart.

Evidence (gateway.log, Mar 13):
20:38:28 [whatsapp] Inbound message received ← UP
20:38:37 [ws] res ✗ send "No active listener" ← FAIL (9s later)
20:38:45 [whatsapp] Auto-replied successfully ← WORKS (17s later)

20:40:48 [whatsapp] Inbound message received ← UP
20:41:20 [ws] res ✗ send "No active listener" ← FAIL
20:41:35 [whatsapp] Auto-replied successfully ← WORKS

20:46:12 [whatsapp] Inbound message received ← UP
20:46:32 [ws] res ✗ send "No active listener" ← FAIL
20:46:40 [whatsapp] Auto-replied successfully ← WORKS
Pattern repeats every message. Baileys socket is connected (inbound works, auto-reply sends work), but the agent RPC send path reports no listener.

Key detail: Failing sends are [ws] ⇄ res ✗ send (WebSocket RPC from agent → gateway). Successful sends are [whatsapp] Auto-replied (internal auto-reply path). Different code paths checking listener state differently.

Separate from stale-socket issue (30m no-event threshold false-positive — already resolved with channelHealthCheckMinutes: 0).

Impact: Agent can't reliably send outbound messages even on a healthy connection. LLM generates reply, delivery fails via RPC, then auto-reply succeeds seconds later causing duplicates.

Likely cause: Race condition after gateway restart — auto-reply monitor reconnects before RPC send handler's listener reference updates.

frank lanternBOT

That “inbound works + auto-reply can send, but message tool outbound says No active WhatsApp Web listener” pattern is almost certainly neither a Baileys problem nor a timing race. It’s a bundle split / duplicated-singleton bug in 2026.3.12.

Why it happens

There are two different outbound paths:

  • Auto-reply path replies via the inbound message object (msg.reply(...) / msg.sendMedia(...)). That does not require the global “active listener” singleton.
    • Source: src/web/auto-reply/deliver-reply.ts
  • Agent/tool outbound path (message tool / RPC send) goes through sendMessageWhatsApp() which calls requireActiveWebListener() from web/active-listener.ts.
    • Source: src/web/outbound.ts, src/web/active-listener.ts

In the published dist for 2026.3.12, the WhatsApp monitor and the outbound runtime end up importing different compiled copies of active-listener, so they don’t share the same listeners Map:

  • WhatsApp channel monitor (channel-web-…js) calls setActiveWebListener(...) from dist/active-listener-BKtjeNQA.js
  • Runtime outbound (runtime-whatsapp-outbound.runtime-…js, outbound-eow-Inac.js) calls requireActiveWebListener(...) from dist/active-listener-C1AyS2lR.js

So the listener is “set” in one module instance, but “required” in another → outbound tool sends always see “no active listener”, while auto-reply still works.
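
The failure mode can be reproduced in miniature. In the sketch below, each call to createModuleCopy stands in for one compiled copy of active-listener; the Listener type, the function signatures, and the registry key are hypothetical and only the error text and the set/require function names come from this thread. The second half shows one common way to make duplicated copies share state (a registry on globalThis keyed by Symbol.for), offered as a possible mitigation rather than the project's actual fix.

```typescript
// Hypothetical minimal listener interface.
type Listener = { send(to: string, text: string): string };

// One "compiled copy" of the module: its state lives in the Map it is given.
function createModuleCopy(listeners: Map<string, Listener>) {
  return {
    setActiveWebListener(accountId: string, listener: Listener): void {
      listeners.set(accountId, listener);
    },
    requireActiveWebListener(accountId: string): Listener {
      const listener = listeners.get(accountId);
      if (!listener) throw new Error("No active WhatsApp Web listener");
      return listener;
    },
  };
}

type ModuleCopy = ReturnType<typeof createModuleCopy>;

// Attempts a send through the given copy and reports the outcome as a string.
function trySend(mod: ModuleCopy, accountId: string): string {
  try {
    return mod.requireActiveWebListener(accountId).send("user", "hi");
  } catch (err) {
    return `error: ${(err as Error).message}`;
  }
}

const fakeListener: Listener = { send: (to, text) => `sent "${text}" to ${to}` };

// Bug: two copies, two private Maps. The monitor sets, the outbound path requires.
const monitorCopy = createModuleCopy(new Map());
const outboundCopy = createModuleCopy(new Map());
monitorCopy.setActiveWebListener("default", fakeListener);
const broken = trySend(outboundCopy, "default");

// Possible mitigation: both copies resolve the same Map via the global symbol registry.
const REGISTRY_KEY = Symbol.for("example.activeWebListeners"); // hypothetical key
function sharedListeners(): Map<string, Listener> {
  const g = globalThis as unknown as Record<symbol, Map<string, Listener> | undefined>;
  let map = g[REGISTRY_KEY];
  if (!map) {
    map = new Map<string, Listener>();
    g[REGISTRY_KEY] = map;
  }
  return map;
}

const monitorShared = createModuleCopy(sharedListeners());
const outboundShared = createModuleCopy(sharedListeners());
monitorShared.setActiveWebListener("default", fakeListener);
const fixed = trySend(outboundShared, "default");

console.log(broken); // error: No active WhatsApp Web listener
console.log(fixed); // sent "hi" to user
```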