#Multi agent setup

ebon wind
#

Hi, I’m testing a multi-agent OpenClaw setup and I’m seeing partial infrastructure instability. Short subagent pings work, but longer delegated tasks often fail or return partial results with gateway/tick timeout style errors, even though the gateway eventually comes up and wired agents can answer simple liveness checks. Could you help me figure out why deeper subagent/specialist coordination is unstable and what logs or diagnostics you want me to provide?

formal oarBOT
#

Yes, and tick timeout is a useful clue.

In current OpenClaw, that usually means the client stopped receiving gateway keepalive ticks, so I’d first suspect gateway stall/reload under heavier multi-agent load, not basic reachability. Short pings working but longer delegated runs failing often points to one of these:

  1. Gateway/event-loop stall or restart during child runs
  2. Queue or session-store contention under fan-out
  3. Subagent timeout / announce-back failure
  4. Model/backend instability that only shows up on larger, tool-heavy prompts

Please paste this first:

openclaw --version
openclaw status --all
openclaw gateway status --deep
openclaw health --verbose
openclaw models status

Then reproduce one failing run and grab:

openclaw logs --follow

What I want from that repro:

  • the exact timestamp of the failure
  • whether it fails during spawn, child execution, or announce back
  • your subagent fan-out at the time, for example 1 child vs 4-8 children
  • the relevant config snippet for:
    • agents.defaults.maxConcurrent
    • agents.defaults.subagents
    • acp.maxConcurrentSessions if this is ACP, not native subagents
    • the model/provider used by parent and child

Things I’ll specifically look for in the logs:

  • tick timeout
  • gateway restart/reload lines
  • timeout waiting for session store lock
  • queue delay / queued for ...ms
  • subagent timed out
  • provider 429 / 5xx
  • local OpenAI-compatible backend errors on larger prompts/tools
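
If it helps, here is a minimal Python sketch for scanning a saved copy of the logs for the messages listed above. It assumes you have captured the log output into a plain text file; the regexes are my guesses at how these messages appear, and `PATTERNS` and `scan` are hypothetical names, not part of OpenClaw.

```python
import re
from collections import Counter

# Labels follow the checklist above. The regexes are assumptions about the
# log wording, so adjust them to what your logs actually contain.
PATTERNS = {
    "tick timeout": re.compile(r"tick timeout", re.I),
    "gateway restart/reload": re.compile(r"gateway.*\b(restart|reload)", re.I),
    "session store lock": re.compile(r"timeout waiting for session store lock", re.I),
    "queue delay": re.compile(r"queued for \d+\s*ms", re.I),
    "subagent timeout": re.compile(r"subagent timed out", re.I),
    # Loose match on bare status codes; expect occasional false positives.
    "provider 429/5xx": re.compile(r"\b(429|5\d\d)\b"),
}


def scan(lines):
    """Return (counts per label, list of (line_no, label, text)) for matching lines."""
    counts = Counter()
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((line_no, label, line.rstrip()))
    return counts, hits
```

To use it, open the saved file and pass the file object to `scan`, then print the hits in order. Seeing which label shows up first before a failure is usually more informative than the totals.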

Quick isolation test, if you want: run the same deeper task with just 1 active child. If that becomes stable, it strongly suggests contention/resource pressure rather than routing.

ebon wind
#

21:47:49+00:00 error Subagent completion direct announce failed for run daf1c3c1-fe4a-43f7-928e-0274f118002e: gateway closed (4000): tick timeout
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Config: C:\Users\Openclaw.openclaw\openclaw.json
Bind: lan

formal oarBOT
#

That log is helpful.

A few important clues:

  • gateway closed (4000): tick timeout means the gateway client watchdog stopped receiving keepalive ticks.
  • Because the target is ws://127.0.0.1:18789 / local loopback, this looks more like the gateway becoming unresponsive or restarting locally than a Discord/network issue.
  • Subagent completion direct announce failed is specifically the direct completion handoff path. It does not necessarily mean the child run itself failed. OpenClaw should try a queue fallback after this, so the next log lines matter a lot.
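
To make the watchdog idea concrete, here is a generic Python sketch of a keepalive watchdog. This is not OpenClaw's implementation; `TickWatchdog` and its timeout are hypothetical and only illustrate why a busy or stalled gateway can trip a tick timeout even over local loopback.

```python
import time


class TickWatchdog:
    """Tracks the last keepalive tick and reports when too much time has passed."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # hypothetical value, not an OpenClaw default
        self._clock = clock             # injectable so the logic can be tested
        self._last_tick = clock()

    def on_tick(self):
        """Call whenever a keepalive tick arrives from the server."""
        self._last_tick = self._clock()

    def expired(self):
        """True once no tick has arrived for longer than timeout_s."""
        return self._clock() - self._last_tick > self.timeout_s
```

The point of the sketch: the client only sees that ticks stopped arriving. It cannot tell a crashed gateway from one that is alive but too busy to send them, which is why the surrounding log lines matter.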

What I’d like next:

openclaw --version
openclaw status --all
openclaw gateway status --deep
openclaw doctor

And please paste 60 to 120 seconds of logs before and after 21:47:49+00:00, not just the one line.

What I’m looking for in that window:

  • gateway restart / reload lines
  • drain timeout reached; proceeding with restart
  • timeout waiting for session store lock
  • long queued for ...ms delays
  • provider/model 429 / 5xx
  • whether the queue fallback announce succeeded after the direct path failed
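
If cutting that window out by hand is tedious, a small Python sketch like this can do it on a saved log file. It assumes each log entry starts with a time of day plus UTC offset, as in the line you pasted (`21:47:49+00:00`), and that lines without a timestamp belong to the entry above them. `window` and `parse_ts` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Matches a leading time of day with offset, e.g. "21:47:49+00:00".
TS = re.compile(r"^(\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})\b")


def parse_ts(text):
    """Parse 'HH:MM:SS+HH:MM' into an offset-aware datetime (date part is a dummy)."""
    return datetime.strptime(text, "%H:%M:%S%z")


def window(lines, center, before_s=120, after_s=120):
    """Return the lines whose timestamp falls within the window around `center`.

    Lines with no leading timestamp are kept or dropped together with the
    preceding timestamped line. Logs that cross midnight are not handled.
    """
    mid = parse_ts(center)
    lo = mid - timedelta(seconds=before_s)
    hi = mid + timedelta(seconds=after_s)
    keep = False
    out = []
    for line in lines:
        match = TS.match(line)
        if match:
            keep = lo <= parse_ts(match.group(1)) <= hi
        if keep:
            out.append(line.rstrip("\n"))
    return out
```

Because the timestamps carry their offset, entries logged as `+02:00` and `+00:00` are compared on the same clock, so mixed offsets in one file are fine.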

Also please include your relevant config values for:

  • agents.defaults.maxConcurrent
  • agents.defaults.subagents.maxConcurrent
  • agents.defaults.subagents.maxChildrenPerAgent
  • agents.defaults.subagents.runTimeoutSeconds
  • agents.defaults.subagents.announceTimeoutMs
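
If you would rather pull those values out of the config than hunt for them, here is a minimal Python sketch. It assumes the config file parses as plain JSON and that the dotted names above map to nested objects; both are assumptions on my side, and `lookup`, `report`, and `load` are hypothetical helpers.

```python
import json

# The dotted keys are the ones listed above.
KEYS = [
    "agents.defaults.maxConcurrent",
    "agents.defaults.subagents.maxConcurrent",
    "agents.defaults.subagents.maxChildrenPerAgent",
    "agents.defaults.subagents.runTimeoutSeconds",
    "agents.defaults.subagents.announceTimeoutMs",
]


def lookup(cfg, dotted):
    """Walk a nested dict by dotted path; return None if any part is missing."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def report(cfg):
    """Map each key of interest to its configured value (None means unset)."""
    return {key: lookup(cfg, key) for key in KEYS}


def load(path):
    """Read the config file, assuming it is plain JSON without comments."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A `None` in the report just means the key is not set explicitly, so whatever default applies is in effect. That is still worth mentioning when you paste the result.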

My current guess is: the gateway is stalling under heavier subagent load, and the first visible symptom is the completion handoff losing ticks.

Quick isolation test, if you want:

  • run the same deeper task with 1 child only
  • or temporarily set subagent concurrency very low

If that stabilizes it, we’re probably dealing with resource / queue / lock contention, not broken routing.

If you paste the surrounding log window, I think we can get much more concrete.

ebon wind
#

00:03:52+02:00 [diagnostic] lane wait exceeded: lane=session:agent:main:telegram:direct:846947265 waitedMs=4160 queueAhead=0

formal oarBOT
#

That one is interesting, but it’s more backpressure evidence than a root cause.

What it means:

  • lane=session:... = OpenClaw is serializing work for that specific session
  • waitedMs=4160 = this task sat ~4.2s before it could start
  • queueAhead=0 does not mean the lane was idle; it only means there was nothing else queued behind the currently active task

So this usually means:

  • one task was already active for that session
  • the next task had to wait for it to finish or hit a boundary
  • this is session-level contention, not necessarily a global crash

By itself, 4 seconds isn’t terrible. The question is whether you’re seeing lots of these, or much bigger waits, right before the tick timeout.
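
To answer that quickly, here is a small Python sketch that counts these diagnostics per lane and tracks the largest wait. The regex follows the exact line you pasted; `summarize` is a hypothetical helper, and it assumes the logs are saved as text.

```python
import re
from collections import defaultdict

# Field layout copied from the pasted diagnostic line.
LANE_WAIT = re.compile(
    r"lane wait exceeded: lane=(?P<lane>\S+) "
    r"waitedMs=(?P<waited>\d+) queueAhead=(?P<ahead>\d+)"
)


def summarize(lines):
    """Return {lane: {"count": n, "max_ms": longest wait}} for lane-wait diagnostics."""
    stats = defaultdict(lambda: {"count": 0, "max_ms": 0})
    for line in lines:
        match = LANE_WAIT.search(line)
        if match:
            entry = stats[match["lane"]]
            entry["count"] += 1
            entry["max_ms"] = max(entry["max_ms"], int(match["waited"]))
    return dict(stats)
```

A single lane with a handful of ~4s waits is ordinary serialization. Many lanes with waits climbing toward tens of seconds just before the tick timeout would point much more strongly at the gateway being starved.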

What I’d want now is the surrounding log window showing:

  • a few lines before and after this lane wait exceeded
  • any lane task done
  • any lane task error
  • any gateway restart / drain / reload lines
  • any timeout waiting for session store lock

So yeah, this supports the theory that heavier work is clogging the queue/session lane, but it doesn’t yet prove why the gateway later misses ticks.

Two concrete things that would help a lot:

  1. still paste:
openclaw --version
openclaw status --all
openclaw gateway status --deep