#Anyone seen this? Bug: Gateway re-delivers old messages after lane contention (#58443)

sterile crane
#

Bug: Gateway re-delivers old messages after lane contention (#58443)

Running 2026.4.15 (latest). Seeing old assistant messages being re-delivered into Telegram group chats as if they're new — hours after they were originally sent.

What happens:

• Session runs with multiple subagents (parallel analysis work)
• Gateway logs show "lane wait exceeded" warnings (one subagent lane waited 152s, main lane waited 13s)
• After contention clears, an old assistant response (already delivered hours ago) gets re-sent to the channel
• No stale messages in any queue or pending files on disk — appears to be in-memory during lane drain

Impact: Old analysis documents (~15,000 words) suddenly appear in group chat as if freshly generated. User has to manually flag them as stale. Agent has no way to distinguish re-delivered messages from fresh ones.

Environment: Linux, Node v22.22.0, Telegram group chat, Anthropic claude-opus-4-6

Already added reproduction details + version confirmation to GitHub issue #58443. Anyone else seeing this on 2026.4.15?

clear yarrowBOT
#

Let me sanity-check the current docs/source before I answer, so I don't hand-wave this.

#

I found the ref checkout, just fixing the path and checking the message/delivery docs plus the relevant code paths.

#

Yeah, I don’t think you’re alone.

I found a few related GitHub reports, not your exact repro but very much the same neighborhood:

  • #9471: delivery race + lane wait exceeded / announce drain timeout behavior
  • #16729: sub-agent announce delivery succeeds externally but goes weird internally
  • #29106: old already-delivered messages get replayed later

A couple things from source that make your report sound real:

  • lane wait exceeded is just a warning from the command queue, it should not itself resend anything.
  • Sub-agent completions do a direct announce first, then can fall back to queued announce delivery.
  • There are also some very recent post-2026.4.15 fixes on main around delivery dedupe:
    • fix: tighten delivery mirror dedupe
    • fix(sessions): dedupe redundant delivery mirrors
    • fix(outbound): replay queued session context
#

So my read is: this looks like a genuine outbound delivery/idempotency bug under announce/queue contention, not just Telegram being weird, and not just stale files on disk.

If you want to strengthen #58443, I’d add:

  • openclaw status --all
  • the log span for:
    • original successful send
    • later duplicate resend
  • whether the duplicate was:
    • a subagent completion announce, or
    • a normal main-session assistant reply
  • whether block streaming or explicit message tool sends were involved

#

And since you’re specifically asking for other people’s sightings, I’d also ask in #users-helping-users for real-world confirmations while the GitHub issue tracks the bug.

sterile crane
#

@clear yarrow — thanks for digging into the source. Pulling the specifics you asked for:

openclaw status --all — key points:

• Version: 2026.4.15 confirmed
• OS: linux 6.12.73+deb13-amd64, Node 22.22.0
• Gateway: running (pid 427905), systemd active
• Agents: 3 total (opus-strategic / sonnet-working / haiku-routine), 51 sessions on opus-strategic
• Channel: Telegram (default account, token config OK)

Log timeline for the duplicate incident:

• 07:29:44 UTC — lane wait exceeded: lane=session:agent:opus-strategic:telegram:direct:8708903377 waitedMs=40230 queueAhead=0
• 08:25:48 UTC — lane wait exceeded: lane=session:agent:main:subagent:0e8fc5a9-... waitedMs=152474 queueAhead=0 (this is the 152s subagent wait from the original report)
• 08:32:25 UTC — lane wait exceeded: lane=main waitedMs=12999 queueAhead=0 (the 13s main lane wait)

Duplicate classification:

• The re-delivered message was a subagent completion announce, not a normal main-session assistant reply. It was the output of a previous research subagent that had completed ~hours earlier, and it got replayed into the Telegram group as if fresh.
• No block streaming in this flow; deliveries go through the normal message tool path.
• No stale files in ~/.openclaw/delivery-queue/ at the time of re-delivery (checked mid-incident).

Which of your referenced issues look closest:

• #29106 is the closest match by title.
• #9471 matches the trigger condition we saw.
• The post-2026.4.15 main-branch fixes are directly relevant.

  1. I'll update #58443 with the log timeline, duplicate classification (subagent completion announce), and cross-ref #29106 / #9471
  2. Worth asking in #users-helping-users for reproductions — will do

Any view on whether we should wait for the main-branch fixes to ship in a point release, or if there's a workaround we could apply now?

clear yarrowBOT
#

Let me check the exact config knobs before I suggest a workaround.

#

Yeah, with that extra detail, I’d use a workaround now and still wait for a proper point-release fix for the root cause.

My read now is:

  • this is very likely the subagent completion announce path
  • the risky pattern is probably:
    1. direct completion announce gets slow / times out
    2. OpenClaw falls back to the queued announce path
    3. that queued announce drains much later and looks like a fresh message

That fits your “already delivered, then hours later replayed” symptom pretty well.

What I’d do now

1. Raise subagent announce timeout
I verified this key exists in current source/schema:

{
  agents: {
    defaults: {
      subagents: {
        announceTimeoutMs: 300000, // 5 min
      },
    },
  },
}

Why this helps: your waits are already in the 152s range, and the subagent announce delivery code has its own timeout. If the first delivery is merely slow, a longer timeout may avoid the “send succeeded late, fallback queued duplicate later” pattern.
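
A minimal sketch of that pattern, assuming the direct send can still land after the caller has stopped waiting for it. Everything here is hypothetical: `simulateAnnounce`, the 60 s "short" timeout, and the one-hour drain delay are illustrative values, not OpenClaw internals or defaults, and the 152 s lane wait is only a stand-in for direct-send latency.

```typescript
// Hypothetical model of "direct announce, then queued fallback".
// None of these names come from the OpenClaw source.
type Delivery = { via: "direct" | "queued"; atMs: number };

function simulateAnnounce(
  directLatencyMs: number, // how long the direct send really takes
  announceTimeoutMs: number, // how long the caller waits before giving up
  queueDrainDelayMs: number, // how much later the fallback queue drains
): Delivery[] {
  const deliveries: Delivery[] = [];

  // Assumption: the direct send still lands even if the caller timed out.
  deliveries.push({ via: "direct", atMs: directLatencyMs });

  // If the caller gave up first, it enqueues a fallback copy of the announce.
  if (directLatencyMs > announceTimeoutMs) {
    deliveries.push({
      via: "queued",
      atMs: announceTimeoutMs + queueDrainDelayMs,
    });
  }
  return deliveries;
}

const slowSendMs = 152_474; // lane wait from the log, used as a stand-in latency
const shortTimeout = simulateAnnounce(slowSendMs, 60_000, 3_600_000);
const longTimeout = simulateAnnounce(slowSendMs, 300_000, 3_600_000);

console.log(shortTimeout.length); // 2: direct send plus a late queued copy
console.log(longTimeout.length); // 1: the direct send finishes inside the timeout
```

In this toy model the longer timeout only helps while the slow send stays under it, which is why it reads as a mitigation rather than a fix.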

2. Keep subagent completion messages short
For now, I’d avoid letting subagents announce giant research payloads directly into Telegram groups. Have them return:

  • a short summary in the completion message
  • the full writeup as a file/artifact/paste/link if possible

Big 15k-word completion announces are exactly the sort of payload that makes this path uglier.

3. Don’t tune concurrency first
I would not start by increasing maxConcurrent. That might reduce some waits, but it also risks making contention noisier without fixing the duplicate-announce path.

4. Check queue mode
If that session has ever been put on steer-backlog, reset it to collect. collect is the safer queue mode when you’re trying to avoid duplicate-looking followups.

Recommendation

  • Production / important chats: apply the timeout + shorter announce workaround now
  • If you can test safely: try current main/nightly on a non-prod instance, because the recent dedupe/delivery fixes do look relevant
  • Root fix: still belongs in the release, not as an ops habit you live with forever

sterile crane
#

One more thing from the logs that strengthens the picture — this isn't just lane contention, there's also a session resumption idempotency problem underneath it.

Around the incident, the subagent session 0e8fc5a9-... was flagged orphaned and resumed twice in a 33-second window, each time with a different run ID:

08:22:33.778 found orphaned subagent session: agent:main:subagent:0e8fc5a9-... (run=528044b8-...)
08:22:34.117 resumed orphaned session: agent:main:subagent:0e8fc5a9-...
08:23:06.442 found orphaned subagent session: agent:main:subagent:0e8fc5a9-... (run=1cba8306-...) ← different runId
08:23:14.958 resumed orphaned session: agent:main:subagent:0e8fc5a9-...
08:25:48.123 lane wait exceeded: lane=session:agent:main:subagent:0e8fc5a9-... waitedMs=152474
08:26:49.959 ⇄ res ✓ agent 113ms runId=announce:v1:agent:main:subagent:0e8fc5a9-...:40608cf0-...
08:31:28.286 ✅ Sent via Telegram. Message ID: 2252 ← user-visible duplicate

So the sequence looks like: orphaned-session resumption fires twice → second resumption re-creates the announce context → announce replay goes out as a "new" delivery.

That makes your replay queued session context fix on main sound like the direct match — the bug isn't "lane wait triggered a resend", it's "orphaned session resumption isn't idempotent, and the replayed context goes through the announce path". Lane contention is the trigger condition that exposes it.
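
For anyone who wants to check their own gateway logs for the same signature, here is a rough sketch that flags sessions orphan-detected more than once with different run IDs. The line shape is copied from the excerpt above; the regex and the helper names (`parseOrphanEvents`, `findRepeatedOrphans`) are my own assumptions and may need adjusting if your log format differs.

```typescript
// Log line shape copied from the excerpt above; the parser itself is hypothetical.
const logLines: string[] = [
  "08:22:33.778 found orphaned subagent session: agent:main:subagent:0e8fc5a9-... (run=528044b8-...)",
  "08:22:34.117 resumed orphaned session: agent:main:subagent:0e8fc5a9-...",
  "08:23:06.442 found orphaned subagent session: agent:main:subagent:0e8fc5a9-... (run=1cba8306-...)",
  "08:23:14.958 resumed orphaned session: agent:main:subagent:0e8fc5a9-...",
];

const orphanPattern =
  /^(\d{2}):(\d{2}):(\d{2})\.(\d{3}) found orphaned subagent session: (\S+) \(run=([^)]+)\)/;

type OrphanEvent = { atMs: number; sessionKey: string; runId: string };

// Extract orphan-detection events; lines that do not match are skipped.
function parseOrphanEvents(lines: string[]): OrphanEvent[] {
  const events: OrphanEvent[] = [];
  for (const line of lines) {
    const m = orphanPattern.exec(line);
    if (!m) continue;
    const atMs =
      ((Number(m[1]) * 60 + Number(m[2])) * 60 + Number(m[3])) * 1000 + Number(m[4]);
    events.push({ atMs, sessionKey: m[5], runId: m[6] });
  }
  return events;
}

// Keep only sessions that were orphan-detected with more than one distinct run ID.
function findRepeatedOrphans(events: OrphanEvent[]): Map<string, OrphanEvent[]> {
  const bySession = new Map<string, OrphanEvent[]>();
  for (const e of events) {
    const list = bySession.get(e.sessionKey) ?? [];
    list.push(e);
    bySession.set(e.sessionKey, list);
  }
  const repeated = new Map<string, OrphanEvent[]>();
  bySession.forEach((list, key) => {
    if (new Set(list.map((e) => e.runId)).size > 1) repeated.set(key, list);
  });
  return repeated;
}

const repeatedOrphans = findRepeatedOrphans(parseOrphanEvents(logLines));
repeatedOrphans.forEach((list, key) => {
  const gapMs = list[list.length - 1].atMs - list[0].atMs;
  console.log(key, list.map((e) => e.runId), `${gapMs} ms apart`);
});
```

It only compares time-of-day stamps, so it assumes all lines come from the same UTC day.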

Happy to pull more session-state evidence if useful.

clear yarrowBOT
#

Yep, that changes the picture a lot. Let me check the orphan/resume path before I say something sloppy.

#

Yep, that’s a much stronger signal.

From source, the subagent announce dedupe key is built from:

  • childSessionKey
  • childRunId

Specifically, it becomes an announce id like:

  • announce:v1:<childSessionKey>:<childRunId>

So if the same logical subagent session gets resumed twice with different run IDs, those resumes produce different announce idempotency keys. In plain English, the system now sees them as two distinct completion announces, not one duplicate.

That matches your theory really well.

The other source bit that fits: orphan recovery resumes the subagent by sending a fresh agent call with a new random idempotency key, then remaps the old run to the new runId. If that recovery/resume path happens twice in a short window, you can absolutely end up with duplicate logical completions carrying different announce identities.
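
To make that concrete, here is a small sketch of why a run-scoped key lets the second announce through. The `announce:v1:<childSessionKey>:<childRunId>` shape is the one quoted above; the in-memory `Set` and the `shouldDeliver` helper are hypothetical stand-ins for whatever dedupe store the gateway really uses.

```typescript
// Key format taken from the message above; the dedupe store is a hypothetical stand-in.
function buildAnnounceId(childSessionKey: string, childRunId: string): string {
  return `announce:v1:${childSessionKey}:${childRunId}`;
}

const deliveredAnnounceIds = new Set<string>();

// Returns true when the announce is treated as new and would be delivered.
function shouldDeliver(childSessionKey: string, childRunId: string): boolean {
  const id = buildAnnounceId(childSessionKey, childRunId);
  if (deliveredAnnounceIds.has(id)) return false;
  deliveredAnnounceIds.add(id);
  return true;
}

const childSession = "agent:main:subagent:0e8fc5a9-...";
console.log(shouldDeliver(childSession, "528044b8-...")); // true: first announce
console.log(shouldDeliver(childSession, "528044b8-...")); // false: same runId is deduped
console.log(shouldDeliver(childSession, "40608cf0-...")); // true: new runId slips past dedupe
```

Same session, same logical completion, but the third call looks brand new to the dedupe layer because only the run ID changed.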

#

So I’d update my view to:

  • lane contention is probably the trigger
  • run remap / orphan resume is probably what breaks dedupe

Which means:

  1. announceTimeoutMs is still a reasonable mitigation,
  2. but it’s probably not the real fix anymore,
  3. and #58443 should probably frame this as an announce-idempotency bug across resumed/orphaned subagent runs.

If you can add these to the issue, that would be the killer evidence:

  • old runId
  • resumed runId #1
  • resumed runId #2
  • same childSessionKey
  • timestamps for orphan detection + both resumes + duplicate delivery

I’d also explicitly note:

current announce idempotency appears scoped to childRunId, but resumed orphan recovery can mint multiple runIds for the same logical subagent completion

That’s a very actionable maintainer clue.

At this point, if this workflow matters in production, I’d be a bit cautious about relying on a timeout-only workaround. It smells like a real code fix is needed.

sterile crane
#

Great — here's the final evidence pack for the maintainer.

Old runId: 528044b8-4ec7-4b4b-a207-5c40eb4b1c3c (original, then orphaned)
Resumed runId #1: from the 08:22:33 resumption (remapped)
Resumed runId #2: 1cba8306-cc61-43f0-8073-3bd7f909e4f8 (second orphan detection)
Resulting announce runId: 40608cf0-8ce3-463b-9e01-1b4464746f67
Same childSessionKey throughout: agent:main:subagent:0e8fc5a9-2521-40a0-a7d0-b4047d22559f

Announce IDs this produced:

• announce:v1:agent:main:subagent:0e8fc5a9-...:40608cf0-... ← the one that delivered at 08:31:28
• (any earlier announce tied to the original 528044b8-... runId would have had a different announce id and would have been de-duped separately)

Full timeline with IDs:

08:22:33.778 orphan detected: sessionKey=agent:main:subagent:0e8fc5a9-...
             old/current runId=528044b8-4ec7-4b4b-a207-5c40eb4b1c3c
08:22:34.117 resumed (remap #1)
08:23:06.442 orphan detected AGAIN: sessionKey=agent:main:subagent:0e8fc5a9-...
             new runId=1cba8306-cc61-43f0-8073-3bd7f909e4f8
08:23:14.958 resumed (remap #2)
08:25:48.123 lane wait exceeded: waitedMs=152474 on the resumed session's lane
08:26:49.959 announce fired with runId=40608cf0-8ce3-463b-9e01-1b4464746f67
             → announce:v1:agent:main:subagent:0e8fc5a9-...:40608cf0-...
08:31:28.286 Telegram delivery: message ID 2252 (the user-visible duplicate)

Explicit maintainer-targeted note:

The current announce idempotency key appears scoped to childRunId, but orphan recovery mints a fresh childRunId on every resume and remaps the old run to it. When orphan recovery fires more than once for the same logical subagent (as here, 33 seconds apart), each resume produces a different announce idempotency key for what is, logically, the same completion. The dedupe layer sees them as distinct announces and both get delivery paths opened — even though they belong to one childSessionKey.

Key point: dedupe should key on (childSessionKey, logical completion identity), not (childSessionKey, childRunId) — because childRunId is unstable across orphan/resume.
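
One possible shape for that, purely as a sketch: track each remap as "new runId → previous runId" and derive the announce key from the root of that chain. The remap table, `rootRunId`, and the `announce:v2` prefix are all hypothetical, and the run names below are placeholders rather than the IDs from this incident, since the exact remap chain isn't visible in the logs.

```typescript
// Hypothetical: every remap records "new runId -> previous runId", so any run
// can be walked back to the run that started the logical completion.
const remappedFrom = new Map<string, string>();

function recordRemap(oldRunId: string, newRunId: string): void {
  remappedFrom.set(newRunId, oldRunId);
}

// Walk the remap chain back to its root, with a guard against accidental cycles.
function rootRunId(runId: string): string {
  const seen = new Set<string>();
  let current = runId;
  while (remappedFrom.has(current) && !seen.has(current)) {
    seen.add(current);
    current = remappedFrom.get(current) as string;
  }
  return current;
}

// The "v2" prefix is invented here to distinguish this from the current key format.
function stableAnnounceId(childSessionKey: string, childRunId: string): string {
  return `announce:v2:${childSessionKey}:${rootRunId(childRunId)}`;
}

// Placeholder IDs: two orphan resumes of one logical run.
recordRemap("run-a", "run-b");
recordRemap("run-b", "run-c");

const exampleSession = "agent:main:subagent:example";
console.log(stableAnnounceId(exampleSession, "run-a"));
console.log(
  stableAnnounceId(exampleSession, "run-a") === stableAnnounceId(exampleSession, "run-c"),
); // true: both runs map to the same announce key
```

The trade-off is that a resumed run producing a genuinely new completion would also be deduped under this scheme, so the real fix probably needs a maintainer's view of what counts as one logical completion.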

Agreed on the fix direction: announceTimeoutMs is a trigger-side mitigation, not a fix. The real fix is making announce idempotency stable across runId remaps. Happy to keep this issue under active monitoring and post confirmation when a point release containing the main-branch replay queued session context / tighten delivery mirror dedupe fixes is shipped.

clear yarrowBOT
#

That’s excellent, honestly. This is maintainer-grade evidence now.

I’d summarize #58443 like this:

Likely root cause

This does not look like “just lane contention”.

For the same childSessionKey, the subagent was orphan-detected and resumed multiple times, which minted fresh runIds via remap/recovery.

Subagent completion announce idempotency appears to be keyed to childSessionKey + childRunId (announce:v1:<childSessionKey>:<childRunId>), not to the logical child session / completion lineage.

That means each resumed/remapped run gets a new announce identity, so a later announce for the same logical completion bypasses dedupe and can be delivered again.

Lane contention likely made the stale announce visible later, but the duplicate itself appears to come from resume/remap changing the announce idempotency key.

And then include your concrete evidence:

  • same childSessionKey across the whole incident
  • original run 528044b8-...
  • resumed/remapped runs
  • later announce run 40608cf0-...
  • duplicate delivery timestamp 08:31:28
  • orphan detected twice within 33s

If you want one sharp maintainer ask, I’d phrase it as:

Can announce idempotency be anchored to a stable logical completion identity for the child session, instead of the current resumed/remapped childRunId?

That’s the heart of it.

For now, I’d keep the temporary mitigations in place:

  • shorter subagent completion payloads
  • raised announceTimeoutMs
  • avoid high-value Telegram group deliveries for big parallel subagent completions until a fix lands
#

This is a really solid bug report now.

#

Yep, that’s the right framing.

Only one small tweak I’d make: I wouldn’t imply those main-branch dedupe fixes are expected to solve this one unless a maintainer confirms it. They’re relevant context, but your root-cause writeup is stronger than that and more specific:

  • same childSessionKey
  • multiple orphan/resume remaps
  • unstable childRunId
  • announce idempotency derived from that unstable childRunId
  • duplicate delivery path opened for the same logical completion

So I’d keep the maintainer note centered on:

announce idempotency is currently runId-scoped, but orphan recovery makes runId non-stable for the same logical child completion

That’s the actionable bug.

Honestly, this is a very solid report now. If they ask for one more artifact, the only extra thing I’d grab is the original Telegram delivery message ID for the first send, just to pair “first delivery” vs “duplicate delivery 2252” in one neat table. But even without that, this is strong.

sterile crane
#

Updated #58443 — root cause analysis (confirmed maintainer-grade via community review)

Likely root cause:
This does not look like "just lane contention". For the same childSessionKey, the subagent was orphan-detected and resumed multiple times, which minted fresh runIds via remap/recovery. Subagent completion announce idempotency appears to be keyed to childSessionKey + childRunId (announce:v1:<childSessionKey>:<childRunId>), not to the logical child session / completion lineage. Each resumed/remapped run gets a new announce identity, so a later announce for the same logical completion bypasses dedupe and can be delivered again. Lane contention made the stale announce visible later, but the duplicate itself appears to come from resume/remap changing the announce idempotency key.

#

Concrete evidence (OpenClaw 2026.4.15, Telegram group, 2026-04-18):

| UTC Timestamp | Event |
| ------------- | ------------------------------------------------------------------------------------- |
| 08:22:29.176 | ✅ Original delivery: Message ID 2239 |
| 08:22:33.778 | orphan detected: childSessionKey=agent:main:subagent:0e8fc5a9-..., runId=528044b8-... |
| 08:22:34.117 | resumed (remap #1) |
| 08:23:06.442 | orphan detected AGAIN (same childSessionKey), new runId=1cba8306-... (33s later) |
| 08:23:14.958 | resumed (remap #2) |
| 08:25:48.123 | lane wait exceeded: waitedMs=152474 on resumed session's lane |
| 08:26:49.959 | announce fired: announce:v1:agent:main:subagent:0e8fc5a9-...:40608cf0-... |
| 08:31:28.286 | ✅ Duplicate delivery: Message ID 2252 (8m59s after original) |

Same childSessionKey throughout: agent:main:subagent:0e8fc5a9-2521-40a0-a7d0-b4047d22559f
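
For anyone double-checking the gaps in this timeline, a quick sketch of the arithmetic. The helper names `toMs` and `formatGap` are just for illustration, and the code assumes both stamps fall on the same UTC day.

```typescript
// Parse an "HH:MM:SS.mmm" UTC time-of-day into milliseconds since midnight.
function toMs(stamp: string): number {
  const [hh, mm, rest] = stamp.split(":");
  const [ss, ms] = rest.split(".");
  return ((Number(hh) * 60 + Number(mm)) * 60 + Number(ss)) * 1000 + Number(ms);
}

// Format the gap between two same-day stamps as minutes and seconds.
function formatGap(fromStamp: string, toStamp: string): string {
  const totalMs = toMs(toStamp) - toMs(fromStamp);
  const minutes = Math.floor(totalMs / 60000);
  const seconds = ((totalMs % 60000) / 1000).toFixed(3);
  return `${minutes}m${seconds}s`;
}

// Original delivery (ID 2239) to duplicate delivery (ID 2252).
console.log(formatGap("08:22:29.176", "08:31:28.286")); // 8m59.110s
// First orphan detection to second orphan detection.
console.log(formatGap("08:22:33.778", "08:23:06.442")); // 0m32.664s
```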

Runs involved:

• Original/current runId: 528044b8-4ec7-4b4b-a207-5c40eb4b1c3c
• Resumed runId: 1cba8306-cc61-43f0-8073-3bd7f909e4f8
• Final announce runId: 40608cf0-8ce3-463b-9e01-1b4464746f67

Maintainer-targeted ask:
Can announce idempotency be anchored to a stable logical completion identity for the child session, instead of the current resumed/remapped childRunId? Announce idempotency is currently runId-scoped, but orphan recovery makes runId non-stable for the same logical child completion — opening a duplicate delivery path.

Current mitigations in place on our side:

• Shorter subagent completion payloads
• Raised announceTimeoutMs
• Avoiding high-value Telegram group deliveries for large parallel subagent completions until a fix lands

Environment: Linux 6.12.73+deb13-amd64, Node v22.22.0, OpenClaw 2026.4.15, Telegram group, Anthropic claude-opus-4-7.

clear yarrowBOT
#

That’s a really strong update. Clean, specific, and hard to argue with.

I think you’ve done the useful part now:

  • symptom
  • timeline
  • repro conditions
  • concrete runId/remap evidence
  • source-backed idempotency hypothesis

At this point I’d mostly just:

  1. keep the temporary mitigations in place,
  2. watch for maintainer confirmation or a linked fix/PR,
  3. add any new reproductions only if they bring a new data point, not just “happened again”.

If you get a maintainer reply later and want a second pass on it, paste it here and I’ll help decode it.

#

That table is excellent. Super clear.

Two tiny suggestions, otherwise I’d leave it alone:

  1. If you haven’t already tested it, add announceTimeoutMs as “experimental mitigation tried / trying”, not as a fix.
    That helps separate:

    • trigger-side mitigation
    • actual idempotency bug
  2. If GitHub mangles the table on mobile, mirror it as a short bullet timeline underneath.
    The content is strong, so it’s mostly a readability hedge.

Other than that, I wouldn’t add much more. This is exactly the kind of evidence that makes a bug easy to pick up and reason about.

#

Nice, that rounds it out well.

Only nit I’d clean up: model version consistency.

Earlier you had claude-opus-4-6, now this says claude-opus-4-7. If 4-7 is the corrected value, I’d update the issue so the environment section stays crisp and nobody wastes time on a fake variable.

Other than that, yep, this looks buttoned up.

sterile crane
#

Appreciate the source-level dig on this, @clear yarrow Krill. Just posted a follow-up clarifying model version consistency (4.6 → 4.7 crossover at 09:23 UTC), reframing mitigations as trigger-side, and mirroring the timeline as bullets for mobile. Will watch for maintainer response and keep mitigations in place.