# Bug #66920: recurrence after doctor --fix


tropic violet

We ran openclaw doctor --fix twice today (around 10h UTC and 15h UTC). Both times groups started receiving messages briefly, but then went silent again within a few hours — without any gateway restart in between.

So the fix resolves the symptom temporarily, but groups drop inbound again on their own. DMs continue working normally throughout.

Setup: OpenClaw 2026.4.21, 14 WhatsApp groups, all with requireMention: true, groupPolicy: "allowlist".

Is there a more permanent fix, or is this a known recurrence of #66920? Any workaround that doesn't require running doctor --fix every few hours?


deep blaze [BOT]

To help others find answers, you can mark your question as solved: right-click the solution message → Apps → ✅ Mark Solution.

queen wave [BOT]
tropic violet

Goal: Monitor WhatsApp groups using OpenClaw. Groups should receive inbound messages continuously.

Expected behaviour: After openclaw doctor --fix, groups receive inbound messages normally.

What happened: doctor --fix works temporarily — groups receive messages for 1-2 hours, then go silent again without any gateway restart. DMs continue working throughout. This happened twice today (fixed ~10h UTC, silent again ~18h UTC; fixed ~15h UTC, silent again ~21h UTC).

Commands run: openclaw doctor --fix (twice today)

Environment:

  • OpenClaw 2026.4.21
  • Ubuntu, systemd service
  • WhatsApp Web runtime
  • 14 groups, all requireMention: true, groupPolicy: "allowlist"

Relevant log snippet:

messagesHandled=26 | uptime=4200077ms   ← working
messagesHandled=0  | uptime=1800048ms   ← silent again (no restart)

Question: Is doctor --fix expected to be a permanent fix or only temporary? Is there a workaround to prevent groups from going silent without manual intervention?


tropic violet

@lone light

normal furnace

Did it work before 2026.4.21? I'm on 2026.4.20 and group messages are working for me.

tropic violet

I was on 2026.4.15 before today. Updated to 2026.4.21 and the problem started. Groups were stable on 2026.4.15 with the same config.

normal furnace

Looks like a 4.21 problem, then.

gloomy sleet

Check the live logs. I was losing my mind, but it turned out to be an OOM error.

The server had 8GB RAM, but apparently the Node heap is 2GB by default. I raised it to 4GB and it's all good now.
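Since the gateway in this thread runs as a systemd service, a V8 heap crash should be written to the unit's stderr, and therefore to its journal, as a line ending in "JavaScript heap out of memory" (the prefix before it varies by Node version). A minimal Python sketch for spotting those lines in a saved or piped log; `find_oom_lines` is a hypothetical helper, and a kernel-level OOM kill would show up in the kernel log instead (for example `journalctl -k`), which this doesn't cover:

```python
from typing import Iterable, List

# V8 ends its fatal heap-limit message with this text; the prefix
# (e.g. "FATAL ERROR: Reached heap limit ...") differs across Node versions.
OOM_MARKER = "JavaScript heap out of memory"


def find_oom_lines(lines: Iterable[str]) -> List[str]:
    """Return every log line that reports a V8 heap out-of-memory crash."""
    return [line.rstrip("\n") for line in lines if OOM_MARKER in line]


if __name__ == "__main__":
    sample = [
        "gateway started\n",
        "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory\n",
        "gateway started\n",
    ]
    for hit in find_oom_lines(sample):
        print(hit)
```

For a real check, save the output of `journalctl -u <your-openclaw-unit> --no-pager` to a file and pass its lines to `find_oom_lines`; the unit name is a placeholder for whatever your service is called.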

lone light [BOT]

Nice catch, that fits a lot better than a fresh #66920 recurrence.

If the live logs showed OOM, I’d treat this as memory pressure first, not a group-policy bug. doctor --fix probably just bought you a little time before the process hit the heap limit again.

For anyone finding this later, the useful checks are:

  • openclaw logs --follow
  • openclaw status --all

And if you’re on a constrained host, increasing Node heap via NODE_OPTIONS=--max-old-space-size=4096 can absolutely be the difference.
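If the gateway is started from a wrapper script rather than a fixed `Environment=` line in the systemd unit, it's easy to overwrite `NODE_OPTIONS` and silently drop flags that were already there. A minimal Python sketch that sets or replaces only the heap flag; `with_heap_limit` is a hypothetical helper, and it assumes the existing flags are plain whitespace-separated tokens with no quoted spaces:

```python
import os

# Node accepts V8 flags with dashes or underscores, so replace both spellings.
HEAP_FLAG_PREFIXES = ("--max-old-space-size=", "--max_old_space_size=")


def with_heap_limit(node_options: str, heap_mb: int) -> str:
    """Return NODE_OPTIONS with the V8 old-space limit set to heap_mb megabytes.

    Any existing heap flag is dropped; every other flag keeps its order.
    """
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    kept = [
        flag
        for flag in node_options.split()
        if not flag.startswith(HEAP_FLAG_PREFIXES)
    ]
    kept.append(f"--max-old-space-size={heap_mb}")
    return " ".join(kept)


if __name__ == "__main__":
    env = dict(os.environ)
    env["NODE_OPTIONS"] = with_heap_limit(env.get("NODE_OPTIONS", ""), 4096)
    print(env["NODE_OPTIONS"])  # pass `env` to whatever launches the gateway
```

Leave some headroom below physical RAM: the flag caps only V8's JavaScript heap, and the process uses additional memory outside it.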

Docs:

Glad you found it. 🎉 If it stays stable now, please right-click the solution message → Apps → ✅ Mark Solution so Answer Overflow can index it.

tropic violet

Thank you so much, that was exactly it.

After your suggestion, we increased the Node heap to 4GB and the gateway has been stable since. To confirm: before the fix, we had reduced the number of monitored groups from 12 to 3 and the disconnections stopped — which makes perfect sense now. Fewer groups = less memory pressure = no OOM.

Marking as solution. 🎉

gloomy sleet
tropic violet

Hi Krill! Another question after the OOM fix (which is working great, by the way 🎉).

We noticed the WhatsApp runtime has a 30-minute watchdog that restarts the connection when no messages are received:

No messages received in 30m - restarting connection

This is resetting messagesHandled to 0 every 30 minutes during quiet periods (overnight, early morning). Our monitoring agent (Julie) reads the OpenClaw logs to generate shift reports — but because of the reconnect, all group message history is lost between cycles.

Two questions:

  1. Is there a config option to disable or extend this 30-minute watchdog?
  2. Does the reconnect actually cause missed messages, or does it re-hydrate the group history on reconnect?

Thanks again for the OOM tip — 9+ hours stable since applying --max-old-space-size=4096. 🙏

lone light [BOT]

Glad the heap bump is holding, that’s a good sign. 🎉

Short answer:

  1. I don’t see a user config for that 30m no-message watchdog in 2026.4.21.
    The exposed WhatsApp runtime knobs are web.enabled, web.heartbeatSeconds, and web.reconnect.*. The message idle timeout itself looks hardcoded to 30 minutes in the WhatsApp monitor code right now.

  2. Reconnect is not a full history rehydrate.
    What it does today is:

    • create a new connection, so per-connection fields like messagesHandled reset to 0
    • keep running normally after reconnect
    • process some very recent catch-up events, but it does not replay full missed group history back into auto-reply context
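To make point 1 concrete, here's a minimal Python sketch of how a no-message idle watchdog of this kind generally behaves, with the configurable timeout and disable switch that 2026.4.21 doesn't currently expose. This is not OpenClaw's code; `IdleWatchdog`, its fields, and the simulated clock are illustrative assumptions. The key point is that the watchdog only sees time since the last inbound message, so it can't tell a quiet group from a stale connection:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleWatchdog:
    """Illustrative no-message watchdog; not OpenClaw's implementation.

    idle_timeout_s=None disables the check; 1800 s matches the 30-minute
    timeout discussed in this thread.
    """

    idle_timeout_s: Optional[float] = 1800.0
    last_message_at: float = 0.0
    reconnects: int = 0

    def on_message(self, now: float) -> None:
        # Any inbound message restarts the idle window.
        self.last_message_at = now

    def tick(self, now: float) -> bool:
        """Return True (and count a reconnect) when the idle window has elapsed."""
        if self.idle_timeout_s is None:
            return False
        if now - self.last_message_at >= self.idle_timeout_s:
            self.reconnects += 1
            # The new connection starts its own idle window.
            self.last_message_at = now
            return True
        return False


def reconnects_overnight(idle_timeout_s: Optional[float], hours: int = 8) -> int:
    """Simulate a night with zero inbound messages, checking once a minute."""
    dog = IdleWatchdog(idle_timeout_s=idle_timeout_s)
    for minute in range(hours * 60 + 1):
        dog.tick(minute * 60.0)
    return dog.reconnects


if __name__ == "__main__":
    print(reconnects_overnight(1800.0))  # 30-minute timeout
    print(reconnects_overnight(None))    # watchdog disabled
```

With the 30-minute timeout, a fully silent 8-hour night produces 16 forced reconnects (one every 30 minutes), each resetting per-connection counters; a longer timeout or a disable switch trades that churn for slower detection of a genuinely dead connection.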

A couple of useful nuances:

  • messagesHandled is effectively a per-connection counter, not a lifetime counter.
  • The in-memory group-history buffer survives a reconnect within the same process, but if Julie is reading logs, the reconnect will still make the metrics look discontinuous.
  • Older offline/history-sync messages are intentionally skipped for auto-reply, so I would not assume reconnect fully backfills anything missed during a longer outage.

So for Julie, I’d recommend:

  • treat connectionId as a session boundary
  • aggregate across reconnects externally
  • don’t rely on messagesHandled as a monotonic total
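A minimal Python sketch of "aggregate across reconnects externally": it assumes Julie has already pulled `connectionId` and `messagesHandled` out of each status sample (the tuple shape and the `lifetime_total` name are hypothetical), keeps the highest count seen per connection, and sums those:

```python
from typing import Dict, Iterable, Tuple


def lifetime_total(samples: Iterable[Tuple[str, int]]) -> int:
    """Sum per-connection counters into one monotonic total.

    Each sample is (connection_id, messages_handled). The counter restarts
    at 0 on every reconnect, so only the highest value seen for each
    connection counts, and the per-connection maxima are added together.
    """
    best_per_connection: Dict[str, int] = {}
    for connection_id, handled in samples:
        previous = best_per_connection.get(connection_id, 0)
        best_per_connection[connection_id] = max(previous, handled)
    return sum(best_per_connection.values())


if __name__ == "__main__":
    samples = [("c1", 10), ("c1", 26), ("c2", 0), ("c2", 3), ("c3", 0)]
    print(lifetime_total(samples))  # 29
```

Because it only sees periodic samples, messages handled after the last sample before a reconnect are missed, so the total can undercount slightly; counting each message as it arrives avoids that gap.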

Docs:

If you want, I can help you draft a tight GitHub issue for “make WhatsApp no-message watchdog configurable” since this feels like a real product gap, not operator error.

tropic violet

Thanks Krill, that's exactly what we needed to know.

We'll redesign Julie to process messages in real-time as they arrive (each message triggers a write to an accumulator file), and the shift reports will read from that file instead of logs. That way reconnects don't matter.
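A minimal Python sketch of that accumulator, assuming one JSON object per line (JSON Lines), a single writer process, and a hypothetical event shape with `group`, `sender`, and `ts` fields. How each message reaches Julie from OpenClaw isn't covered in this thread, so the hook that calls `record_message` is left out; both function names are illustrative:

```python
import json
import os
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def record_message(path: Path, group: str, sender: str, ts: datetime) -> None:
    """Append one message event; call once per inbound message (ts should be timezone-aware)."""
    event = {"group": group, "sender": sender, "ts": ts.astimezone(timezone.utc).isoformat()}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the event durable before returning


def shift_report(path: Path, start: datetime, end: datetime) -> Counter:
    """Count messages per group with start <= ts < end (both timezone-aware)."""
    counts: Counter = Counter()
    if not path.exists():
        return counts
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            ts = datetime.fromisoformat(event["ts"])
            if start <= ts < end:
                counts[event["group"]] += 1
    return counts
```

Since each event is durable as soon as it's written, a watchdog reconnect between two messages changes nothing in the report. One `fsync` per message is cheap at this traffic level but could be batched for busier groups.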

And yes please — a GitHub issue for a configurable idle watchdog would be great. Happy to test any PR on our setup (12 groups, 8GB VPS).

lone light [BOT]

That redesign sounds solid. Real-time accumulation is the right move here.

I can’t open the GitHub issue myself from here, but here’s a paste-ready draft for https://github.com/openclaw/openclaw/issues/new:

Title: WhatsApp web idle watchdog should be configurable (30m no-message reconnect)

### Summary

The WhatsApp web runtime currently forces a reconnect after ~30 minutes with no inbound messages:

`No messages received in 30m - restarting connection`

This seems to be a fixed idle watchdog, not a user-configurable setting.

### Environment

- OpenClaw 2026.4.21
- WhatsApp web runtime
- 8GB VPS
- 12-14 WhatsApp groups
- `requireMention: true`
- `groupPolicy: "allowlist"`

### Actual behavior

During quiet periods (overnight / early morning), the runtime reconnects every ~30 minutes even when nothing is wrong.

Effects:
- per-connection counters like `messagesHandled` reset to `0`
- `connectionId` changes every cycle
- log-based monitoring/reporting becomes discontinuous
- reconnect does not appear to fully rehydrate missed group history for auto-reply context

### Expected behavior

Please expose the idle watchdog as a config value, or allow disabling it.

Something alongside the existing WhatsApp `web.*` runtime settings would help, for example:
- configurable idle timeout
- optional disable switch
- possibly account-scoped override

### Why this matters

For low-traffic deployments, “no inbound messages for 30m” is normal and should not automatically imply a stale connection.

This creates false-positive reconnect churn and breaks monitoring pipelines that rely on connection continuity.

### Extra context

We initially thought this was related to a group-delivery issue, but that turned out to be separate OOM pressure. After increasing Node heap (`--max-old-space-size=4096`), the runtime became stable, and the remaining issue is this 30-minute idle reconnect behavior during quiet periods.

Happy to test a PR on our setup.

If you paste the issue link back here, I’m happy to help refine it.

tropic violet
lone light [BOT]

Perfect, thanks for filing it.

That issue is crisp, and the test offer helps a lot. Between the heap fix and Julie’s accumulator approach, you’ve got a pretty sane workaround in the meantime.

If the explanation above answered the watchdog question, please right-click the solution message → Apps → ✅ Mark Solution so Answer Overflow can index it for the next person who hits this. 🎉