#Whatsapp constantly breaking or getting routed wrong leading to agent not executing simple tasks

lyric snow
#

Ended up with a bloated architecture with too many steps. Deterministic scripts were not being used correctly.
Longer, more complex tasks ended up in conversational chat mode without OpenClaw actually handing them off for execution.

lyric snow
#

The bigger issues are:

  • routing/ownership
  • runtime drift
  • truth seams
  • execution orchestration

Desired end state remains:

  • thin router
  • thick executor
  • one owner per request
  • exact blocker instead of silence
  • verified success instead of optimistic ack
#

Most important open architecture findings:

  • Control plane exists but runtime is still hybrid.
  • Remaining pain comes from:
    • plugin hooks shelling directly into script owners
    • router and selector holding too much business logic
    • guard/tracker/control-plane truth seams
    • stdout/final-JSON truth on some supervised paths
    • /data runtime drift
    • some live proof surfaces bypassing true inbound hook paths
#

The gateway log now points at a more specific live problem: repeated gateway/ws ... closed before connect events after starting channels and sidecars..., with no actual inbound activity afterwards. I’m pulling the current openclaw channels status --json output directly to see whether the WhatsApp channel is failing to link/connect rather than just timing out.

#

The config claims whatsapp-passive-forwarder is enabled, but there is no corresponding live plugin directory under /data/.openclaw/plugins. That mismatch is likely why the passive-feed health surface says “route configured” while no inbound requests ever reach the feed. I’m tracing where that forwarder is actually supposed to live and wiring it back onto the loaded runtime path.

#

I’ve got a concrete config/runtime mismatch now: the hook config says whatsapp-passive-forwarder is enabled, but the loaded plugin surface only contains the five known WhatsApp plugins and there’s no visible forwarder implementation. I’m validating that against the runtime log, then I’ll wire the missing forwarder back in on the narrowest owner path.

rustic whaleBOT
#

Let me check the current docs/source tree and grab the exact version output so I can answer this without guessing.

#

Yep, I think your diagnosis is basically right. This smells like 4 seams, not 1 bug.

One very real gotcha if ACP is involved: current ACP defaults are read-only-ish. File writes / exec can fail in non-interactive mode unless you intentionally configure the ACPX permissions. That can look exactly like “it stayed in chat mode and never actually executed”.
See the ACP permission section here: https://docs.openclaw.ai/tools/acp-agents

#

First thing I’d want from your box:

openclaw --version
openclaw status --all
openclaw status --deep
openclaw tasks list
openclaw tasks audit
openclaw logs --follow

And if ACP/harnesses are part of the path, also run:

/acp doctor
#

If you paste:

  • your bindings[]
  • channels.whatsapp
  • ACP config
  • agents.list[].runtime
  • one failing log slice

…I can usually tell you pretty quickly whether the broken seam is routing, ACP permissions, or task ownership.

#

Yep, that narrows it down a lot.

I’d summarize it like this: the control plane looks unified, but execution truth is still fragmented.

The biggest red flags in your list are:

  • plugin hooks shelling directly into script owners
    That creates a second execution path. Once that exists, router ownership is mostly fictional.

  • router/selector holding business logic
    The router should answer only: who owns this request?
    If it also decides how the work happens, you get drift and duplicate logic.

  • guard/tracker/control-plane truth seams
    If guard says one thing, tracker says another, and chat says “done”, you don’t have a source of truth.

  • stdout/final-JSON truth on supervised paths
    That usually means success is inferred too late and too loosely. The executor should emit a single structured terminal result, not “maybe parse stdout, maybe trust final JSON”.

  • live proof surfaces bypassing inbound hook paths
    This one is nasty. If proofs/debug surfaces skip the real ingress path, you can get fake confidence while production routing is still wrong.
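
To make the "single structured terminal result" point concrete, here is a minimal sketch. The `TerminalResult` shape, the status names, and the `verify` callback are all hypothetical and not part of OpenClaw; the idea is only that the outcome comes from the exit code plus an independent check, never from parsing stdout.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class TerminalResult:
    """The one terminal record for an execution (hypothetical shape)."""
    status: str               # "succeeded" | "failed" | "blocked"
    exit_code: Optional[int]  # None when the process never finished
    blocker: Optional[str]    # exact reason whenever status != "succeeded"


def run_supervised(
    cmd: Sequence[str],
    verify: Callable[[], bool],
    timeout_s: float = 60.0,
) -> TerminalResult:
    """Run one leaf op; decide the outcome from exit code plus verification."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return TerminalResult("blocked", None, f"timed out after {timeout_s}s")
    except OSError as exc:
        return TerminalResult("blocked", None, f"could not start: {exc}")

    if proc.returncode != 0:
        return TerminalResult("failed", proc.returncode, f"exit code {proc.returncode}")
    # stdout is only useful for logs; success needs an independent check.
    if not verify():
        return TerminalResult("failed", 0, "exit 0 but verification failed")
    return TerminalResult("succeeded", 0, None)
```

With this shape, an "optimistic ack" (exit 0 but nothing actually changed) ends up as `failed` with an exact blocker string instead of a silent success.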

The shape I’d push toward is:

  1. Ingress chooses one owner
  2. Owner creates one task/execution record
  3. Executor runs everything through one supervised path
  4. Only the task/execution record can declare terminal success/failure
  5. Chat/UI renders that result, never invents it

So, concretely:

  • Keep the router thin: bindings, ownership, session selection, nothing else.
  • Move business logic into the executor/workflow layer.
  • Make Task Flow / task ledger the truth, not stdout or chat text.
  • Treat plugin scripts as leaf ops behind the executor, not independent runtimes.
  • Kill any “proof” or “debug” surface that does not enter through the same inbound hook chain as real traffic.
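
As a rough illustration of the shape above, here is a small sketch of a thin router plus a task ledger. Every name in it (`route`, `Ledger`, `TaskRecord`, the status strings) is made up for the example and is not an OpenClaw API; it only shows one owner per request and a single place that may declare terminal state.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

TERMINAL = {"succeeded", "failed", "blocked"}


def route(bindings: Dict[str, str], chat_id: str) -> Optional[str]:
    """Thin router: answers only 'who owns this request?'."""
    return bindings.get(chat_id)


@dataclass
class TaskRecord:
    request_id: str
    owner: str
    status: str = "pending"
    detail: Optional[str] = None


@dataclass
class Ledger:
    tasks: Dict[str, TaskRecord] = field(default_factory=dict)

    def claim(self, request_id: str, owner: str) -> TaskRecord:
        """Create the one record for a request; a second owner is rejected."""
        existing = self.tasks.get(request_id)
        if existing is not None:
            if existing.owner != owner:
                raise ValueError(f"{request_id} already owned by {existing.owner}")
            return existing
        record = TaskRecord(request_id, owner)
        self.tasks[request_id] = record
        return record

    def finish(self, request_id: str, status: str, detail: Optional[str] = None) -> None:
        """Only the ledger declares terminal state, and only once."""
        record = self.tasks[request_id]
        if status not in TERMINAL:
            raise ValueError(f"not a terminal status: {status}")
        if record.status in TERMINAL:
            raise ValueError(f"{request_id} is already {record.status}")
        record.status, record.detail = status, detail


def render(record: TaskRecord) -> str:
    """Chat/UI renders the record and never invents an outcome."""
    if record.status == "succeeded":
        return f"Done ({record.owner})."
    if record.status in TERMINAL:
        return f"{record.status.capitalize()}: {record.detail}"
    return "Working on it; no result recorded yet."
```

A plugin hook that shells straight into a script would show up here as a second `claim` with a different owner, which is rejected instead of silently creating a parallel execution path.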

If you want, I can turn your findings into a sharper remediation plan, like:

  • Phase 1: unify ownership + ingress
  • Phase 2: remove side-channel execution
  • Phase 3: make terminal state structured and authoritative
  • Phase 4: rewire proof/debug surfaces to the real path

That feels like the right next step here.

#

Yep, that sounds much more concrete now.

This no longer smells like “the agent chose the wrong mode”. It smells like dead ingress plus a false-green health surface.

What your findings imply:

  • route configured is not enough
    If whatsapp-passive-forwarder is enabled in config but absent from the loaded runtime plugin surface, then the route can look configured while nothing is actually able to receive inbound events.

  • gateway/ws ... closed before connect right after sidecars start
    That usually points to bootstrap / runtime path / plugin load failure, not “WhatsApp was just idle”.

  • /data mismatch is probably real
    If config, plugin inventory, and loaded runtime disagree, the gateway may be starting from a different root than the one you’re inspecting.

I’d check it in exactly this order:

  1. Confirm the running gateway’s actual config/runtime root
    openclaw gateway status and openclaw health --verbose are the fastest sanity checks here.

  2. Compare configured plugin names vs loaded plugin names
    If the forwarder is configured but not loaded, that’s the seam.

  3. Check channel state, not just route state
    openclaw channels status --json should tell you whether WhatsApp is actually linked/connected.

  4. Only after ingress is live, debug routing/executor ownership
    Until inbound is real, the orchestration layer is mostly noise.
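
Step 2 is just a set comparison once you have both lists. A minimal sketch, assuming you have already extracted the configured and loaded plugin names yourself (how you get them out of your config and the live runtime is not shown here, and `plugin_drift` is a made-up helper name):

```python
from typing import Dict, Iterable, List


def plugin_drift(configured: Iterable[str], loaded: Iterable[str]) -> Dict[str, List[str]]:
    """Report plugins that exist on only one side of config vs. runtime."""
    configured_set, loaded_set = set(configured), set(loaded)
    return {
        # Enabled in config but absent at runtime: the "route configured" trap.
        "configured_not_loaded": sorted(configured_set - loaded_set),
        # Present at runtime but not in config: possible wrong root or stale install.
        "loaded_not_configured": sorted(loaded_set - configured_set),
    }
```

If `whatsapp-passive-forwarder` lands in `configured_not_loaded`, the health surface is reporting config truth rather than execution truth.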

So I’d treat the likely root cause as:

config says passive forwarder exists → runtime did not load it → health surface reports config truth, not execution truth → no inbound feed ever arrives → agent never gets a real ownerable request

That’s a good catch.

#

One nuance: the missing directory under /data/.openclaw/plugins is strong evidence, but the decisive proof is still the loaded plugin inventory from the live runtime. If that inventory also lacks the forwarder, you’ve basically found the bug.

At that point the fix is probably one of these:

  • the forwarder was never installed into the active plugin root
  • the gateway is loading a different root/config than expected
  • the health surface is incorrectly treating “configured” as “live”

If it ends up being that third case, I’d file a GitHub issue with:

  • openclaw --version
  • openclaw status --all
  • openclaw channels status --json
  • the relevant gateway/ws ... closed before connect lines
  • configured plugin name vs loaded plugin list

That’s a solid repro.

lyric snow
#

The passive feed is not the fix for Testing Claw. Its own tests show it intentionally drops managed internal groups, including Testing Claw. That means the real bug is the internal-group fallback: sync_whatsapp_logs_from_sessions.py is still wired to a session directory that no longer exists in this OpenClaw build. I’m now locating the new internal transcript source so I can retarget that sync path instead of chasing the passive-feed route.

rustic whaleBOT
#

I’m going to check the current session/transcript path logic in the ref checkout so I can point you at the actual source instead of guessing.

#

Yep, that sounds like the real seam.

I checked the current ref/docs, and the old flat sessions/ assumption is stale in current OpenClaw.

Current source of truth:

  • Store: ~/.openclaw/agents/<agentId>/sessions/sessions.json
  • Transcripts: ~/.openclaw/agents/<agentId>/sessions/<sessionId>.jsonl

Important nuance: newer builds can persist an explicit sessionFile in sessions.json, so a sync script should not assume sessionId -> ~/.openclaw/sessions/<sessionId>.jsonl anymore.

So if sync_whatsapp_logs_from_sessions.py is still wired to an old single sessions dir, I’d retarget it like this:

  1. Enumerate stores via:
    openclaw sessions --all-agents --json
    
  2. For each agent store, read sessions.json
  3. For each entry:
    • prefer sessionFile if present
    • otherwise derive from that agent’s sessions/ dir + sessionId
  4. Only then map the relevant internal-group session keys

That should survive the newer per-agent layout and any explicit transcript-path overrides.
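
A hedged sketch of steps 2 and 3. It assumes sessions.json is a JSON object mapping session keys to entries that carry `sessionId` and optionally `sessionFile`; check that against your actual store before relying on it. It also globs the filesystem instead of calling the CLI in step 1, purely to keep the example self-contained, and the function names are made up.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple


def resolve_transcript(sessions_dir: Path, entry: Dict) -> Path:
    """Prefer an explicit sessionFile; otherwise derive <sessionId>.jsonl."""
    explicit = entry.get("sessionFile")
    if explicit:
        path = Path(explicit).expanduser()
        # Assumption: a relative sessionFile is relative to the agent's sessions/ dir.
        return path if path.is_absolute() else sessions_dir / path
    return sessions_dir / f"{entry['sessionId']}.jsonl"


def iter_transcripts(root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (agent_id, session_key, transcript_path) for every agent store."""
    for store in sorted(root.glob("agents/*/sessions/sessions.json")):
        sessions_dir = store.parent
        agent_id = sessions_dir.parent.name
        entries = json.loads(store.read_text(encoding="utf-8"))
        for key, entry in entries.items():
            # Skip anything that cannot be mapped to a transcript file.
            if isinstance(entry, dict) and (entry.get("sessionFile") or entry.get("sessionId")):
                yield agent_id, key, resolve_transcript(sessions_dir, entry)
```

Called as `iter_transcripts(Path.home() / ".openclaw")`, this walks every per-agent store; filtering down to the internal-group session keys (step 4) would happen on the yielded `session_key`.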

So, short version:

don’t chase passive feed for Testing Claw
because its tests say it intentionally drops managed internal groups.

#

Instead:

fix the fallback to read per-agent session stores/transcripts from agents/<agentId>/sessions/, honoring sessionFile.

If you want, paste the old path logic from sync_whatsapp_logs_from_sessions.py and I can help translate it to the current layout.

lyric snow
#

The practical meaning:

Earlier, one real issue was restart churn. That part was real and some of it was fixed.
But even after those restart fixes, the system still has a second problem: the runtime can be “gateway ready” without the WhatsApp receive path being actually healthy.

In other words, you have a false-ready runtime:

  • router loaded
  • plugins loaded
  • probe path works
  • but no real incoming WhatsApp events are reaching the feed/log/router chain

The strongest live signals for that are:

  • passive feed health endpoint is reachable, but stale since April 16
  • no fresh groupchat.jsonl ingress
  • no fresh Testing Claw task/control-plane evidence
  • gateway health call itself hangs
  • previous logs showed ws closed before connect, which is consistent with the channel/control path not actually becoming usable

So the real architecture bug now is:

runtime readiness != WhatsApp channel readiness

You’re currently declaring the runtime usable from:

  • gateway ready
  • hooks loaded
  • runtime-ready marker

But the thing you actually need for live chat replies is:

  • WhatsApp channel connected
  • inbound events flowing
  • feed/log/router path receiving real messages

That last part is not proven healthy.

What this means for Phase 5:

  • This is not primarily a Phase 5 lane-contract bug anymore
  • It is now a runtime/channel ingress bug
  • So continuing to tweak Testing Claw routing is low-yield until the transport/readiness layer is fixed

rustic whaleBOT
#

Yep, I think that’s the right cut now.

This is the classic liveness vs usefulness split:

  • Gateway/runtime ready = process is up, hooks loaded, probe path responds
  • WhatsApp ready = channel is connected and usable
  • Ingress healthy = real inbound events are actually flowing through feed/log/router

Right now OpenClaw is effectively treating the first (gateway/runtime ready) as if it proves the third (ingress healthy), and it doesn’t.

So I’d frame the bug exactly the way you did:

runtime readiness != WhatsApp channel readiness

And yes, that means Phase 5 routing tweaks are low-yield until the transport/ingress layer is fixed.

What I’d want the system to expose separately is:

  • coreReady
  • channelConnected
  • lastInboundAt
  • ingressFresh
  • chatUsable

Because “gateway ready + plugins loaded” should never greenlight live chat if:

  • lastInboundAt is stale
  • the health call hangs
  • the channel/control WS never reaches usable state
  • no fresh ingress artifacts are appearing
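
Sketched as code, the split could look like this. The field names mirror the list above in snake_case; the 30-minute freshness window is an arbitrary assumption for the example, and nothing here is an existing OpenClaw structure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Readiness:
    core_ready: bool                       # process up, hooks loaded
    channel_connected: bool                # WhatsApp link actually established
    last_inbound_at: Optional[datetime]    # None means no inbound event ever seen
    max_inbound_age: timedelta = timedelta(minutes=30)  # assumed threshold

    def ingress_fresh(self, now: datetime) -> bool:
        """True only if a real inbound event arrived recently enough."""
        if self.last_inbound_at is None:
            return False
        return now - self.last_inbound_at <= self.max_inbound_age

    def chat_usable(self, now: datetime) -> bool:
        """Live chat needs all three layers, not just a running gateway."""
        return self.core_ready and self.channel_connected and self.ingress_fresh(now)
```

A quiet group would also look stale under a pure time window, so in practice the threshold (or a synthetic inbound probe) needs tuning; the point is only that `core_ready` alone can never set `chat_usable`.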

That’s a real bug boundary, not just a Testing Claw quirk.

Practical next move, I think:

  1. Treat this as a channel ingress/readiness issue
  2. Stop spending cycles on router/lane logic for now
  3. Verify whether the runtime can ever prove:
    • WhatsApp connected
    • inbound event received
    • feed/log/router saw it
  4. If not, readiness should be degraded/false, not “ready”

The docs side that’s most relevant here is the deep/live status path:

If you can reproduce:

  • openclaw --version
  • openclaw status --deep
  • openclaw channels status --json
  • hanging openclaw health --verbose
  • stale passive feed timestamp
  • the ws closed before connect log lines

…that feels issue-worthy, because it’s not just one broken route. It’s a false-ready readiness contract.

lyric snow
#

WhatsApp channel/session audit

  • Connected: no decisive proof
    Startup logs in the container only show gateway bootstrap and starting channels and sidecars, then hooks loaded; they do not show any positive WhatsApp linked/connected/session-ready signal.
    Source: docker logs --timestamps --since 2026-04-17T15:56:39Z openclaw

  • Usable: no
    openclaw channels status --json hangs past 30s.
    openclaw gateway call health --json also failed to return usable status within 30s.
    Those are broken status surfaces, not health proof.

  • Inbound receive path active: no
    Passive feed listener is up, but /data/state/whatsapp_passive_feed_health.json shows:
      healthy=false
      reason="passive feed reachable but no ingress observed"
      requests_total=0
      accepted_total=0
      last_request_at=null

rustic whaleBOT
#

That audit is already strong enough to call this channel-down / false-ready, not a Testing Claw routing problem.

What it proves:

  • No positive WhatsApp ready signal
    Startup only shows bootstrap/hooks, not linked/connected/session-ready.

  • Status surfaces are not trustworthy right now
    If openclaw channels status --json and openclaw gateway call health --json both hang past 30s, they are not evidence of health.

  • Inbound path is dead
    requests_total=0, accepted_total=0, last_request_at=null is about as clear as it gets.
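
If you want that check to be mechanical rather than eyeballed, something like this works. The field names are the ones from your health file; the returned labels and the function name are invented for the sketch.

```python
import json
from typing import Any, Dict


def classify_feed_health(raw: str) -> str:
    """Turn the passive feed health JSON into one blunt verdict."""
    health: Dict[str, Any] = json.loads(raw)
    # A reachable listener that has never seen a request is not healthy ingress.
    if health.get("last_request_at") is None or not health.get("requests_total"):
        return "no-ingress-observed"
    # Requests arrive but none are accepted: the feed is filtering everything out.
    if not health.get("accepted_total"):
        return "ingress-rejected"
    if not health.get("healthy"):
        return "degraded"
    return "healthy"
```

Fed the values from your audit, this returns `no-ingress-observed`, which is the state a readiness gate should treat as not ready.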

So the operational state is basically:

  • gateway process started: yes
  • WhatsApp channel usable: no proof
  • inbound receive path flowing: no
  • live chat routing/debugging worth doing: also no

I’d describe the bug as:

Gateway can report or imply readiness while the WhatsApp channel never reaches a usable connected state, and the health/status surfaces can hang instead of reporting that clearly.

That’s a real runtime bug.
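
Until the status surfaces stop hanging, it helps to wrap them so a hang becomes an explicit answer. A minimal sketch using only the standard library; `probe` and its status labels are made up here, and the 30s default just matches the timeout from your audit.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def probe(cmd: Sequence[str], timeout_s: float = 30.0) -> Tuple[str, str]:
    """Run a status command; a hang is reported as 'unknown', never as healthy."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "unknown", f"no answer within {timeout_s:g}s"
    except OSError as exc:
        return "unknown", f"could not run probe: {exc}"
    if proc.returncode != 0:
        return "error", proc.stderr.strip() or f"exit code {proc.returncode}"
    return "answered", proc.stdout
```

For example, `probe(["openclaw", "channels", "status", "--json"])` would come back as `("unknown", "no answer within 30s")` in your current state, which is a usable data point for an issue report.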

At this point I’d stop chasing feed/router logic and bucket the remaining possibilities as:

  1. WhatsApp auth/session restore never actually completes
  2. Channel bootstrap deadlocks before ready
  3. Health/status calls are blocked behind the same broken init path
  4. Container state/credentials mount is not the one the runtime is actually using

And honestly, what you have now is already issue-quality evidence.