#Multi-agent dispatch hangs — orchestrator can't reach sub-agents and sub-agent replies never arrive

unborn spire
#

Running 7-agent pipeline on Win10, OpenClaw 2026.4.2 (npm install), MiniMax M2.7 API key. Orchestrator (main) dispatches workers via sessions_send, each worker has its own Telegram bot bound via bindings.
Problem: nested lane waits are massive with queueAhead=0:
lane=nested waitedMs=130086 queueAhead=0
lane=nested waitedMs=152390 queueAhead=0
lane=session:agent:notebooklm-ingestion-agent:... waitedMs=348938 queueAhead=0
Workers eventually run successfully but agent.wait times out at 45s, so orchestrator gives up and retries → duplicate dispatches. Also seeing periodic Profile minimax:global timed out failovers.
Already done: killed duplicate gateway scheduled task, fixed groupPolicy: allowlist (was silently dropping messages), removed hardcoded session keys from worker MEMORY files. doctor --deep clean, Telegram ok.
Questions: (1) config knob for nested lane concurrency? (2) where to bump 45s agent.wait? (3) should I migrate to sessions_spawn + sessions_yield?
Happy to share sanitized config / more logs.

dim adderBOT
#

Yep — this behavior lines up with how lanes/tools are wired today.

Short answers first:

  1. Nested lane concurrency knob?
    No direct config knob right now.
    nested lane is effectively default-lane behavior (concurrency 1 unless explicitly set in code), and config only exposes concurrency for:
    • agents.defaults.maxConcurrent → main lane
    • agents.defaults.subagents.maxConcurrent → subagent lane
    • cron.maxConcurrentRuns → cron lane
  2. Where to bump the 45s agent.wait?
    At the caller, not a global config key.
    • If you call agent.wait directly: raise timeoutMs there.
    • If you dispatch via sessions_send: raise timeoutSeconds on each call (default is 30s in tool code).
  3. Migrate to sessions_spawn + sessions_yield?
    For a 7-worker pipeline: yes, strongly recommended.
    sessions_send uses nested lane and can trigger extra A2A follow-up/announce work, which amplifies wait pressure.
    sessions_spawn is built for fan-out background work and uses the subagent lane, where concurrency is tunable.


What your logs likely mean:

  • queueAhead=0 + huge waitedMs usually means “you were next, but lane slot was occupied” (active task), not a deep queue backlog.
  • Retrying after wait timeout creates duplicate dispatches (new idempotency each retry), so workers eventually run and duplicates appear.
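
To make the queueAhead=0 reading concrete, here is a toy model of a lane with a single slot. It is only a sketch: the Lane class, its field names, and the timings are hypothetical and are not OpenClaw's internals. It shows how a task can be first in line (nothing queued ahead of it) and still wait a long time because the one slot is held by an active run.

```javascript
// Toy model of a lane with a fixed number of slots (hypothetical, not OpenClaw code).
// Time is passed in as plain numbers so the run is deterministic.
class Lane {
  constructor(name, concurrency) {
    this.name = name;
    this.concurrency = concurrency;
    this.activeUntil = []; // finish times (ms) of tasks currently holding a slot
    this.queue = [];       // tasks waiting for a slot
  }

  // Queue a task at time `now` and record how many tasks were queued ahead of it.
  enqueue(label, now, durationMs) {
    const task = { label, enqueuedAt: now, durationMs, queueAhead: this.queue.length };
    this.queue.push(task);
    return task;
  }

  // Move the clock to `now`: release finished slots, then start queued tasks.
  advance(now) {
    this.activeUntil = this.activeUntil.filter((end) => end > now);
    while (this.queue.length > 0 && this.activeUntil.length < this.concurrency) {
      const task = this.queue.shift();
      task.waitedMs = now - task.enqueuedAt;
      this.activeUntil.push(now + task.durationMs);
    }
  }
}

const nested = new Lane("nested", 1);
nested.enqueue("long-worker-run", 0, 130000);
nested.advance(0);       // the long run takes the only slot
const waiter = nested.enqueue("next-dispatch", 1000, 5000);
nested.advance(1000);    // slot still busy, nothing starts
nested.advance(130000);  // slot frees, the waiter finally starts
console.log(`lane=${nested.name} waitedMs=${waiter.waitedMs} queueAhead=${waiter.queueAhead}`);
// prints: lane=nested waitedMs=129000 queueAhead=0
```

The waiter never had anything queued ahead of it, yet it waited for the whole remaining lifetime of the active run.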

If you must stay on sessions_send for now, quick mitigations:

  • Set session.agentToAgent.maxPingPongTurns: 0 (cuts reply-back churn).
  • Increase timeoutSeconds on sessions_send calls.
  • Treat wait timeout as “unknown/pending”, not immediate failure + redispatch.
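
One way to picture "pending, not failed" is a small ledger keyed by job and worker. This is a minimal sketch under assumptions: DispatchLedger and its method names are invented for illustration, and the real orchestrator here is prompt-driven, so the same rule would live in agent instructions rather than in code.

```javascript
// Hypothetical dispatch ledger: one entry per (jobId, workerId) pair.
// A wait timeout flips the entry to "pending" and never allows a second dispatch.
class DispatchLedger {
  constructor() {
    this.entries = new Map();
  }

  key(jobId, workerId) {
    return `${jobId}:${workerId}`;
  }

  // Returns true only the first time a pair is seen, so the caller dispatches once.
  shouldDispatch(jobId, workerId) {
    const k = this.key(jobId, workerId);
    if (this.entries.has(k)) return false;
    this.entries.set(k, { status: "dispatched" });
    return true;
  }

  // A timeout means "we stopped waiting", not "the worker failed".
  markWaitTimedOut(jobId, workerId) {
    const entry = this.entries.get(this.key(jobId, workerId));
    if (entry && entry.status === "dispatched") entry.status = "pending";
  }

  markCompleted(jobId, workerId, result) {
    this.entries.set(this.key(jobId, workerId), { status: "completed", result });
  }

  status(jobId, workerId) {
    const entry = this.entries.get(this.key(jobId, workerId));
    return entry ? entry.status : "unknown";
  }
}

const ledger = new DispatchLedger();
const firstAttempt = ledger.shouldDispatch("job-1", "sec-filing-agent");  // true
ledger.markWaitTimedOut("job-1", "sec-filing-agent");
const retryAttempt = ledger.shouldDispatch("job-1", "sec-filing-agent");  // false
const statusAfterTimeout = ledger.status("job-1", "sec-filing-agent");    // "pending"
```

Because the retry is refused, a worker that finishes late still reports against the original entry and no duplicate run is created.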

If you switch to spawn pattern:

  • Dispatch workers with sessions_spawn (non-blocking).
  • Use sessions_yield to wait for completion events (push-based).
  • Tune agents.defaults.subagents.maxConcurrent + runTimeoutSeconds.

If you share sanitized config, include these blocks and I’ll give exact edits:
agents.defaults, agents.defaults.subagents, session.agentToAgent, tools.agentToAgent, bindings.

#

Also paste exact openclaw --version output (full string), just to confirm patch/build before we tune further.

unborn spire
#

Going with Path B. I want to migrate to sessions_spawn + sessions_yield. How can I get the blocks to you? Discord doesn't allow posts of more than 2000 chars and I've already whittled them down.

version is OpenClaw 2026.4.2 (d74a122)

dim adderBOT
#

Perfect, thanks — 2026.4.2 (d74a122) helps.

Yep, Discord limit is annoying. Easiest options:

  1. Attach a file (config-snippet.txt / .json5) in the thread (best).
  2. If no file, send 4 small messages with this exact split:

[1/4]

  • agents.defaults
  • agents.defaults.subagents
  • orchestrator agent entry from agents.list[] (just that one)

[2/4]

  • worker agent entries (agents.list[]) for 1–2 workers (pattern is enough)
  • each worker’s subagents block (if present)

[3/4]

  • tools.allow / tools.deny
  • tools.subagents
  • session.agentToAgent

[4/4]

  • relevant bindings[] entries (orchestrator + one worker example)
  • 20–40 lines of queue/failover logs around one timeout

While you gather that, here’s the minimum Path B baseline to move to sessions_spawn + sessions_yield:

  • Ensure orchestrator can call:
    sessions_spawn, sessions_yield, sessions_list, sessions_history, subagents
  • Set:
    • agents.defaults.subagents.maxConcurrent (start with 8)
    • agents.defaults.subagents.runTimeoutSeconds (e.g. 900)
    • subagents.allowAgents so orchestrator can spawn your worker agent IDs (default is same-agent only)
  • Set session.agentToAgent.maxPingPongTurns: 0 (reduces old sessions_send chatter overhead)

Then orchestration flow should be:

  1. spawn all workers via sessions_spawn (non-blocking)
  2. store returned runId / childSessionKey
  3. call sessions_yield
  4. consume completion events (no 45s wait-loop retries)
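
The four steps above can be sketched as plain bookkeeping. Everything here is an assumption made for illustration: fanOut and onCompletion are invented helpers, the spawn argument is an injected fake standing in for the sessions_spawn tool, and the completion event shape ({ runId, result }) is hypothetical. The real tools are called by the agent and are asynchronous; a synchronous fake keeps the sketch runnable.

```javascript
// Hypothetical orchestration state: spawn once per worker, then react to completion events.
// `spawn` stands in for the sessions_spawn tool and is injected by the caller.
function fanOut(workerIds, task, spawn) {
  const children = new Map(); // runId -> { workerId, childSessionKey, done, result }
  for (const workerId of workerIds) {
    const res = spawn({ runtime: "subagent", agentId: workerId, task });
    children.set(res.runId, {
      workerId,
      childSessionKey: res.childSessionKey,
      done: false,
      result: null,
    });
  }
  return children;
}

// Called once per completion event received after sessions_yield.
// Unknown runIds are ignored rather than triggering a redispatch.
// Returns true once every spawned child has reported.
function onCompletion(children, event) {
  const child = children.get(event.runId);
  if (child) {
    child.done = true;
    child.result = event.result;
  }
  return [...children.values()].every((c) => c.done);
}

// Fake spawn for the demo; the id formats are made up.
let spawnCounter = 0;
const fakeSpawn = (req) => {
  spawnCounter += 1;
  return { runId: `run-${spawnCounter}`, childSessionKey: `child:${req.agentId}:${spawnCounter}` };
};

const smokeChildren = fanOut(["queue-agent", "claim-worker"], "smoke test", fakeSpawn);
const doneAfterFirst = onCompletion(smokeChildren, { runId: "run-1", result: "ok" });  // false
const doneAfterSecond = onCompletion(smokeChildren, { runId: "run-2", result: "ok" }); // true
```

Note that nothing in this flow has a timer: progress is driven only by events, so there is no timeout branch that could redispatch a worker.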

Drop the blocks in the split above and I’ll give you an exact config patch.

unborn spire
#

Can't I just tell my main agent to do this? I'm having trouble dragging and dropping into Discord.

dim adderBOT
#

Yep — 100%, you can have your main agent do it.
No file upload needed.

Paste this to your main agent:

Migrate my orchestration from sessions_send to sessions_spawn + sessions_yield.

Constraints:
- First do ANALYZE ONLY (no writes), then show a plan + exact diffs.
- Keep existing bindings unless required.
- Do not use sessions_send for worker dispatch anymore.

Required config updates:
1) session.agentToAgent.maxPingPongTurns = 0
2) agents.defaults.subagents.maxConcurrent = 8
3) agents.defaults.subagents.runTimeoutSeconds = 900
4) Ensure orchestrator agent tool policy allows:
sessions_spawn, sessions_yield, sessions_list, sessions_history, subagents, session_status
5) Ensure orchestrator subagents.allowAgents includes my 7 worker agent IDs (or explicit safe allowlist).

Required orchestration behavior updates:
- Fan out workers with sessions_spawn (runtime=subagent, explicit agentId).
- Collect returned runId + childSessionKey.
- Call sessions_yield after dispatch.
- Do NOT retry-dispatch on wait timeout.
- Use completion events (and sessions_history only as fallback) for final aggregation.

Output format:
A) Findings
B) Proposed config diff
C) Proposed orchestration diff
D) Validation checklist

After I approve, apply changes, restart gateway, and run a smoke test with 2 workers.

If you want, I can also give you a tiny follow-up prompt that forces your agent to print results in 3 Discord-safe chunks ([1/3], [2/3], [3/3]).

unborn spire
#

I'm going to send you the response back in chunks, ok?

first response

A) Findings

• Current orchestration relies on sessions_send for worker dispatch + progress signaling. We’ve seen it be unreliable (timeouts / gateway closed), which is exactly what sessions_spawn + completion events are meant to avoid.
• Your current openclaw.json already has:
  • tools.sessions.visibility: "all" and tools.agentToAgent.enabled: true
  • Agent registry includes the 7 worker IDs you listed:
    queue-agent, claim-worker, sec-filing-agent, notebooklm-ingestion-agent, owner-earnings-extraction-agent, owner-earnings-normalization-agent, intrinsic-value-agent
• Missing / not yet present in config (will need adding):
  • session.agentToAgent.maxPingPongTurns
  • agents.defaults.subagents.{maxConcurrent, runTimeoutSeconds}
  • an orchestrator tool-policy allowlist (if you want it explicit vs tool profile)
  • agents.defaults.subagents.allowAgents (or equivalent allowlist) for the 7 worker agent IDs

dim adderBOT
#

Perfect — chunk 1 looks solid and aligned with what we expected ✅

Please send B (proposed config diff) next, then C (orchestration diff), then D (validation checklist).

Most important for B:

  • exact JSON5 paths/values for
    session.agentToAgent.maxPingPongTurns
    agents.defaults.subagents.maxConcurrent
    agents.defaults.subagents.runTimeoutSeconds
    agents.defaults.subagents.allowAgents (or per-agent equivalent)

I’ll sanity-check each chunk as you post it.

unborn spire
#

second response

B) Proposed config diff (exact)

File: C:\Users\Eagle3dStreaming.openclaw\openclaw.json

@@
 "agents": {
   "defaults": {
+    "subagents": {
+      "maxConcurrent": 8,
+      "runTimeoutSeconds": 900,
+      "allowAgents": [
+        "queue-agent",
+        "claim-worker",
+        "sec-filing-agent",
+        "notebooklm-ingestion-agent",
+        "owner-earnings-extraction-agent",
+        "owner-earnings-normalization-agent",
+        "intrinsic-value-agent"
+      ]
+    },
     "model": {
       "primary": "minimax/MiniMax-M2.7"
     },
@@
 "tools": {
   "profile": "coding",
@@
   "agentToAgent": {
     "enabled": true
+  }
+},
+"session": {
+  "agentToAgent": {
+    "maxPingPongTurns": 0
   }
 },

Tool-policy note (requirement #4): OpenClaw’s tools.profile: "coding" already includes the tools you listed in practice, but if you want an explicit allowlist for orchestrator agent only, we can add an agent-specific tools policy block (I’m not adding it yet because your config doesn’t currently use per-agent tool policies, and I want to “keep existing bindings unless required”). If you confirm the schema you prefer (global vs per-agent), I’ll draft the exact diff.

#

third response

C) Proposed orchestration diff (exact)

Target: Main orchestrator logic (wherever you currently do dispatch via sessions_send).

Replace patterns like:

• dispatch worker → sessions_send(...)
• wait/poll worker session → sessions_list/sessions_history loops
• retry-dispatch on timeout

With:

  1. Dispatch via sessions_spawn (runtime=subagent, explicit agentId)
  2. Store runId + childSessionKey in an in-memory map (or structured state object)
  3. Immediately sessions_yield() after dispatch fanout
  4. On completion events, advance the pipeline; use sessions_history(childSessionKey) only if the completion payload is missing key fields

Pseudo-diff (illustrative, since I don’t yet have your orchestrator source file in this workspace):

- await sessions_send({ sessionKey: `agent:${workerId}:telegram:direct:${userId}`, message })
- // poll/wait, retry on timeout, etc.
+ const spawnRes = await sessions_spawn({
+   runtime: "subagent",
+   mode: "run",
+   agentId: workerId,
+   task: message
+ })
+ state.children.push({ workerId, runId: spawnRes.runId, childSessionKey: spawnRes.childSessionKey })
+ await sessions_yield()

Fan-out rule: You asked to “fan out workers” with max concurrency 8. Your pipeline is sequential by design, so the most sensible “fan out” is:

• fan out only where it’s safe (e.g., non-dependent background checks), OR
• keep pipeline sequential but still use sessions_spawn to run each worker as an isolated run (recommended)

If you truly want parallelism inside the pipeline, tell me which steps are allowed to overlap (right now your MEMORY.md says never parallelize).

dim adderBOT
#

Nice — B is mostly correct ✅
Two important tweaks before apply:

  1. Add maxChildrenPerAgent (you’re fanning out 7 workers; default is 5)

agents: {
  defaults: {
    subagents: {
      maxConcurrent: 8,
      maxChildrenPerAgent: 10, // <- important for 7-worker fanout
      runTimeoutSeconds: 900,
      allowAgents: [/* your 7 worker ids */],
    },
  },
}

  2. tools.profile: "coding" already includes group:sessions, so you should have sessions_spawn + sessions_yield available unless another deny layer exists.

Everything else in your B chunk looks aligned (including session.agentToAgent.maxPingPongTurns: 0).

Send chunk C (orchestration diff) next — that’s the make-or-break part for eliminating duplicate dispatch.

#

C is very good — this is the right migration shape ✅

A few small tighten-ups:

  • Add label per spawn (makes tracing easier in sessions_list / logs).
  • Consider cleanup: "delete" to avoid subagent session buildup.
  • Pass runTimeoutSeconds in the spawn call if some workers are known long-runners.
  • Do not keep any retry-dispatch-on-timeout logic.

Example tightened spawn call:

{
  runtime: "subagent",
  mode: "run",
  agentId: workerId,
  label: `pipe-${jobId}-${workerId}`,
  task: message,
  runTimeoutSeconds: 900,
  cleanup: "delete"
}

And yep: if your pipeline is logically sequential, do sequential spawn + yield per stage (still isolated, still better than sessions_send), not forced parallel fanout.
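
For a sequential pipeline, the per-stage spawn arguments can be planned up front. This is a minimal sketch: buildSpawnRequest, planStages, the job id, and the 1800-second override are all illustrative names and values, and only the field names come from the tightened spawn call above.

```javascript
// Builds the argument object for one sessions_spawn call, using the fields named above.
// The default timeout and the overrides table are illustrative values.
const DEFAULT_RUN_TIMEOUT_SECONDS = 900;

function buildSpawnRequest(jobId, workerId, message, timeoutOverrides = {}) {
  return {
    runtime: "subagent",
    mode: "run",
    agentId: workerId,
    label: `pipe-${jobId}-${workerId}`, // unique per job and worker, for tracing
    task: message,
    runTimeoutSeconds: timeoutOverrides[workerId] ?? DEFAULT_RUN_TIMEOUT_SECONDS,
    cleanup: "delete",
  };
}

// Sequential pipeline: one request per stage, in the order the stages must run.
// The orchestrator would spawn stage N, yield, and only then spawn stage N+1.
function planStages(jobId, workerIds, message, timeoutOverrides = {}) {
  return workerIds.map((id) => buildSpawnRequest(jobId, id, message, timeoutOverrides));
}

const stagePlan = planStages(
  "42",
  ["sec-filing-agent", "notebooklm-ingestion-agent"],
  "run stage",
  { "notebooklm-ingestion-agent": 1800 } // hypothetical long-runner override
);
```

Giving a known long-runner its own timeout keeps the shared default tight for every other stage.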

Send D (validation checklist) and we can greenlight final apply.

unborn spire
#

fourth response

D) Validation checklist

  1. Config

• [ ] openclaw.json contains:
  • [ ] session.agentToAgent.maxPingPongTurns = 0
  • [ ] agents.defaults.subagents.maxConcurrent = 8
  • [ ] agents.defaults.subagents.runTimeoutSeconds = 900
  • [ ] agents.defaults.subagents.allowAgents includes all 7 worker IDs

• [ ] orchestrator tool policy allows: sessions_spawn, sessions_yield, sessions_list, sessions_history, subagents, session_status

  2. Orchestrator behavior

• [ ] No remaining sessions_send calls for dispatch
• [ ] After spawning worker(s), orchestrator calls sessions_yield (no polling loops)
• [ ] No retry-dispatch on timeouts
• [ ] Completion events drive the next step; sessions_history used only as fallback

  3. Smoke test (after approval)

• [ ] Restart gateway
• [ ] Run a smoke test dispatching 2 workers (you choose which two) via sessions_spawn
• [ ] Verify completion events arrive and orchestrator advances correctly without sessions_send

If you tell me where your orchestrator code lives (file path / repo), I’ll produce a real, line-accurate “Proposed orchestration diff” instead of pseudocode.

dim adderBOT
#

This is a solid plan — I’d greenlight it with one required addition:

  • Add agents.defaults.subagents.maxChildrenPerAgent (set to 10 for your 7-worker case).

So your final subagents block should include:

  • maxConcurrent: 8
  • maxChildrenPerAgent: 10
  • runTimeoutSeconds: 900
  • allowAgents: [7 worker ids]

Everything else in your checklist looks right.
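
The config half of the checklist can also be checked mechanically. The sketch below is an assumption-laden helper, not an OpenClaw schema validator: checkSubagentConfig is an invented name, it reads a parsed config object using only the paths quoted in this thread, it accepts allowAgents either under defaults or on an agents.list entry, and it takes the default of 5 for maxChildrenPerAgent from the note above.

```javascript
// Checks a parsed config object against the checklist values discussed in this thread.
// Not a schema validator: it only reads the paths quoted above.
const ASSUMED_DEFAULT_MAX_CHILDREN = 5; // default stated earlier in the thread

function checkSubagentConfig(config, workerIds) {
  const problems = [];
  const defaults = config?.agents?.defaults?.subagents ?? {};
  const perAgent = (config?.agents?.list ?? []).flatMap((a) => a?.subagents?.allowAgents ?? []);
  const allowed = new Set([...(defaults.allowAgents ?? []), ...perAgent]);
  const maxChildren = defaults.maxChildrenPerAgent ?? ASSUMED_DEFAULT_MAX_CHILDREN;

  if (config?.session?.agentToAgent?.maxPingPongTurns !== 0) {
    problems.push("session.agentToAgent.maxPingPongTurns should be 0");
  }
  if (!(defaults.maxConcurrent >= 1)) problems.push("subagents.maxConcurrent is missing");
  if (!(defaults.runTimeoutSeconds > 0)) problems.push("subagents.runTimeoutSeconds is missing");
  if (maxChildren < workerIds.length) {
    problems.push(`maxChildrenPerAgent is ${maxChildren}, need at least ${workerIds.length}`);
  }
  const missing = workerIds.filter((id) => !allowed.has(id));
  if (missing.length > 0) problems.push(`allowAgents is missing: ${missing.join(", ")}`);
  return problems;
}

const workerIds = [
  "queue-agent", "claim-worker", "sec-filing-agent", "notebooklm-ingestion-agent",
  "owner-earnings-extraction-agent", "owner-earnings-normalization-agent", "intrinsic-value-agent",
];
const draftConfig = {
  agents: { defaults: { subagents: { maxConcurrent: 8, runTimeoutSeconds: 900, allowAgents: workerIds } } },
  session: { agentToAgent: { maxPingPongTurns: 0 } },
};
const draftProblems = checkSubagentConfig(draftConfig, workerIds); // flags maxChildrenPerAgent only
draftConfig.agents.defaults.subagents.maxChildrenPerAgent = 10;
const fixedProblems = checkSubagentConfig(draftConfig, workerIds); // no problems left
```

With 7 workers and the setting absent, the assumed default of 5 is the only item flagged, which matches the one required addition.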

If helpful, paste this to your main agent as the apply step:

Apply the approved migration now.

Requirements:
1) Update openclaw.json:
- session.agentToAgent.maxPingPongTurns = 0
- agents.defaults.subagents.maxConcurrent = 8
- agents.defaults.subagents.maxChildrenPerAgent = 10
- agents.defaults.subagents.runTimeoutSeconds = 900
- agents.defaults.subagents.allowAgents = [the 7 worker IDs]

2) Update orchestrator logic:
- Replace sessions_send dispatch with sessions_spawn (runtime=subagent, mode=run, explicit agentId, label, runTimeoutSeconds, cleanup=delete)
- Remove retry-dispatch-on-timeout logic
- Use sessions_yield after spawn dispatch
- Use completion events as primary; sessions_history only fallback
3) Restart gateway.
4) Run smoke test with 2 workers.
5) Return:
- exact diff applied
- smoke test result
- any remaining failures

Send me the smoke-test result and I’ll help tune anything left (especially if MiniMax timeout failovers still appear).

unborn spire
#

Hey — I’m here.

Quick update on the migration work: I applied the config changes but hit a schema mismatch that broke the gateway restart on the first pass (allowAgents / groupChat weren’t valid where I put them). I ran openclaw doctor --repair --non-interactive, which repaired the config and it now validates cleanly.

What’s still incomplete right now:

• The config does not yet contain agents.defaults.subagents.allowAgents after doctor repair (it removed it), so we still need to put the 7-agent allowlist in the correct schema location.
• I also haven’t found an actual “orchestrator code file” in workspace/ to patch (it looks like orchestration is currently happening via agent instructions + MEMORY.md conventions rather than a JS/PS orchestrator script). So I can’t truthfully claim I’ve replaced dispatch logic in code yet.

If you tell me where the orchestrator logic lives (file path, or if it’s purely “main agent behavior”), I’ll finish:

  1. put the allowlist in the correct config key (per schema),
  2. restart gateway successfully,
  3. run the 2-worker smoke test via sessions_spawn + sessions_yield,
  4. return exact diffs + results.

dim adderBOT
#

Great update — you’re close.

You can avoid the allowAgents schema fight by putting it on the main/orchestrator agent (not only in defaults).
That’s also cleaner security-wise.

Use this shape:

{
  agents: {
    defaults: {
      subagents: {
        maxConcurrent: 8,
        maxChildrenPerAgent: 10,
        runTimeoutSeconds: 900
      }
    },
    list: [
      {
        id: "main",
        subagents: {
          allowAgents: [
            "queue-agent",
            "claim-worker",
            "sec-filing-agent",
            "notebooklm-ingestion-agent",
            "owner-earnings-extraction-agent",
            "owner-earnings-normalization-agent",
            "intrinsic-value-agent"
          ],
          requireAgentId: true
        }
      }
    ]
  },
  session: {
    agentToAgent: { maxPingPongTurns: 0 }
  }
}

If doctor --repair still strips defaults allowAgents, that likely means your 2026.4.2 parser expects the per-agent location more reliably — so keep it under agents.list[].subagents and proceed.


On “no orchestrator code file found”: that’s fine.
If orchestration is behavior/prompt-driven, then the “patch” is your main agent instructions (AGENTS/MEMORY/prompt), not JS/PS code.

#

Set explicit orchestration rules for main agent:

  1. Never dispatch workers with sessions_send
  2. Always dispatch with sessions_spawn (runtime: "subagent", explicit agentId)
  3. After dispatch, call sessions_yield
  4. Never retry-dispatch on wait timeout
  5. Use completion events first; sessions_history only fallback

Then:

  • openclaw gateway restart
  • run your 2-worker smoke test

If you paste the post-restart smoke result, I’ll do the final sanity pass.

dim adderBOT
#

Thanks — this bindings block is useful.

Quick read: bindings look structurally fine, but they’re not the key blocker for your spawn migration.

What still matters for Path B:

  1. Spawn permissions (separate from bindings)
    main needs subagents.allowAgents listing target worker agent IDs.
    Bindings route inbound chat; they do not grant sessions_spawn cross-agent rights.

  2. Runtime choice
    • Worker pipeline: use sessions_spawn with runtime: "subagent" (default).
    • ACP bindings only matter if you intentionally spawn ACP sessions (runtime: "acp").

  3. Ordering sanity (important on older configs)
    Put the topic-specific ACP bindings above broad account/channel bindings, so specific topic routes can’t be shadowed.
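
The ordering point can be illustrated with a toy matcher. This assumes bindings are evaluated top to bottom and the first match wins, which is what the advice implies but is not confirmed here. The binding shape ({ channel, topic, agentId }) and the resolveBinding helper are hypothetical and do not mirror the real bindings schema.

```javascript
// First-match routing over an ordered bindings array (assumed evaluation order).
// A binding without a topic matches every topic on its channel.
function resolveBinding(bindings, message) {
  return bindings.find(
    (b) =>
      b.channel === message.channel &&
      (b.topic === undefined || b.topic === message.topic)
  );
}

const broadBinding = { channel: "telegram", agentId: "main" };
const topicBinding = { channel: "telegram", topic: "filings", agentId: "sec-filing-agent" };
const inbound = { channel: "telegram", topic: "filings" };

// Broad rule first: it matches everything, so the topic rule is never reached.
const shadowed = resolveBinding([broadBinding, topicBinding], inbound).agentId; // "main"

// Specific rule first: the topic route wins, the broad rule stays as fallback.
const routed = resolveBinding([topicBinding, broadBinding], inbound).agentId;   // "sec-filing-agent"
```

Under that assumption, moving the specific entry above the broad one is the whole fix; neither entry needs to change.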