Hi, I’m testing a multi-agent OpenClaw setup and I’m seeing partial infrastructure instability. Short subagent pings work, but longer delegated tasks often fail or return partial results with gateway/tick timeout style errors, even though the gateway eventually comes up and wired agents can answer simple liveness checks. Could you help me figure out why deeper subagent/specialist coordination is unstable and what logs or diagnostics you want me to provide?
#Multi agent setup
Yes, and tick timeout is a useful clue.
In current OpenClaw, that usually means the client stopped receiving gateway keepalive ticks, so I’d first suspect gateway stall/reload under heavier multi-agent load, not basic reachability. Short pings working but longer delegated runs failing often points to one of these:
- Gateway/event-loop stall or restart during child runs
- Queue or session-store contention under fan-out
- Subagent timeout / announce-back failure
- Model/backend instability that only shows up on larger, tool-heavy prompts
Please paste this first:
openclaw --version
openclaw status --all
openclaw gateway status --deep
openclaw health --verbose
openclaw models status
Then reproduce one failing run and grab:
openclaw logs --follow
What I want from that repro:
- the exact timestamp of the failure
- whether it fails during spawn, child execution, or announce back
- your subagent fan-out at the time, for example 1 child vs 4-8 children
- the relevant config snippet for:
  - agents.defaults.maxConcurrent
  - agents.defaults.subagents
  - acp.maxConcurrentSessions if this is ACP, not native subagents
- the model/provider used by parent and child
Things I’ll specifically look for in the logs:
- tick timeout
- gateway restart/reload lines
- timeout waiting for session store lock
- queue delay / queued for ...ms
- subagent timed out
- provider 429/5xx
- local OpenAI-compatible backend errors on larger prompts/tools
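If you save the log output to a file first, a rough Python sketch like this can count how often each of those markers shows up. The label names and regexes are my own guesses built from the phrases above, not an official OpenClaw log schema, and the 429/5xx pattern in particular is crude and may match unrelated numbers.

```python
import re
from collections import Counter

# Hypothetical labels; the patterns are assumptions derived from the phrases
# listed above, not a documented OpenClaw log format.
PATTERNS = {
    "tick_timeout": re.compile(r"tick timeout"),
    "gateway_restart_or_reload": re.compile(r"gateway.*(restart|reload)", re.IGNORECASE),
    "session_store_lock": re.compile(r"timeout waiting for session store lock"),
    "queue_delay": re.compile(r"queued for \d+ ?ms"),
    "subagent_timed_out": re.compile(r"subagent.*timed out", re.IGNORECASE),
    # Crude: any standalone 429 or 5xx number counts as a provider error.
    "provider_429_5xx": re.compile(r"\b(429|5\d\d)\b"),
}


def classify_line(line):
    """Return the labels of every pattern that matches this log line."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count matches per label across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(classify_line(line))
    return counts
```

To use it, pass any iterable of lines, for example `summarize(open("gateway.log", encoding="utf-8", errors="replace"))`, where `gateway.log` is just a placeholder name for wherever you saved the output.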
Quick isolation test, if you want: run the same deeper task with just 1 active child. If that becomes stable, it strongly suggests contention/resource pressure rather than routing.
Docs:
- FAQ: https://docs.openclaw.ai/help/faq
- Troubleshooting: https://docs.openclaw.ai/gateway/troubleshooting
- Subagents: https://docs.openclaw.ai/tools/subagents
If you paste those outputs plus one failing log window, I can help narrow it down pretty quickly.
21:47:49+00:00 error Subagent completion direct announce failed for run daf1c3c1-fe4a-43f7-928e-0274f118002e: gateway closed (4000): tick timeout Gateway target: ws://127.0.0.1:18789 Source: local loopback Config: C:\Users\Openclaw.openclaw\openclaw.json Bind: lan
That log is helpful.
A few important clues:
- "gateway closed (4000): tick timeout" means the gateway client watchdog stopped receiving keepalive ticks.
- Because the target is ws://127.0.0.1:18789 / local loopback, this looks more like the gateway becoming unresponsive or restarting locally than a Discord/network issue.
- "Subagent completion direct announce failed" is specifically the direct completion handoff path. It does not necessarily mean the child run itself failed. OpenClaw should try a queue fallback after this, so the next log lines matter a lot.
What I’d like next:
openclaw --version
openclaw status --all
openclaw gateway status --deep
openclaw doctor
And please paste 60 to 120 seconds of logs before and after 21:47:49+00:00, not just the one line.
What I’m looking for in that window:
- gateway restart / reload lines
- drain timeout reached; proceeding with restart
- timeout waiting for session store lock
- long queued for ...ms delays
- provider/model 429/5xx
- whether the queue fallback announce succeeded after the direct path failed
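If trimming the log by hand is tedious, here is a minimal Python sketch for cutting out that window from a saved log. It assumes every entry starts with a time-of-day stamp shaped like the one in your line (HH:MM:SS plus an offset), that all entries share the same offset, and that lines without a stamp are continuations of the previous entry. The function names are made up for illustration.

```python
import re

# Matches a leading stamp such as "21:47:49+00:00" (time of day plus offset).
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})[+-]\d{2}:\d{2}")


def seconds_of_day(line):
    """Return seconds since midnight for a stamped line, or None if unstamped."""
    match = TIMESTAMP.match(line)
    if not match:
        return None
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def window(lines, center, before=120, after=120):
    """Keep entries within [center - before, center + after] seconds.

    Unstamped lines follow the decision made for the last stamped line.
    The difference is taken modulo 24h so a window across midnight works.
    """
    center_s = seconds_of_day(center)
    if center_s is None:
        raise ValueError("center must look like HH:MM:SS+HH:MM")
    kept, keep = [], False
    for line in lines:
        t = seconds_of_day(line)
        if t is not None:
            delta = (t - center_s + 43200) % 86400 - 43200
            keep = -before <= delta <= after
        if keep:
            kept.append(line)
    return kept
```

For this case the call would be something like `window(log_lines, "21:47:49+00:00")`, which keeps two minutes on either side of the failure.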
Also please include your relevant config values for:
- agents.defaults.maxConcurrent
- agents.defaults.subagents.maxConcurrent
- agents.defaults.subagents.maxChildrenPerAgent
- agents.defaults.subagents.runTimeoutSeconds
- agents.defaults.subagents.announceTimeoutMs
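If you would rather not paste the whole config, a small Python sketch like this pulls out just those keys. It assumes openclaw.json parses as plain JSON (if yours has comments or trailing commas, `json.load` will reject it) and that the dotted names map directly onto nested objects. The helper names are hypothetical.

```python
import json

# The dotted keys requested above.
KEYS = [
    "agents.defaults.maxConcurrent",
    "agents.defaults.subagents.maxConcurrent",
    "agents.defaults.subagents.maxChildrenPerAgent",
    "agents.defaults.subagents.runTimeoutSeconds",
    "agents.defaults.subagents.announceTimeoutMs",
]


def lookup(config, dotted):
    """Walk a nested dict by dotted path; return None if any part is missing."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def relevant_settings(config):
    """Map each requested key to its value, or None when it is not set."""
    return {key: lookup(config, key) for key in KEYS}


def load_settings(path):
    """Read a JSON config file and return only the requested keys."""
    with open(path, encoding="utf-8") as handle:
        return relevant_settings(json.load(handle))
```

A value of None here only means the key is absent from the file, so whatever default applies is in effect.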
My current guess is: the gateway is stalling under heavier subagent load, and the first visible symptom is the completion handoff losing ticks.
Quick isolation test, if you want:
- run the same deeper task with 1 child only
- or temporarily set subagent concurrency very low
If that stabilizes it, we’re probably dealing with resource / queue / lock contention, not broken routing.
Docs:
- https://docs.openclaw.ai/help/faq
- https://docs.openclaw.ai/gateway/troubleshooting
- https://docs.openclaw.ai/tools/subagents
If you paste the surrounding log window, I think we can get much more concrete.
00:03:52+02:00 [diagnostic] lane wait exceeded: lane=session:agent:main:telegram:direct:846947265 waitedMs=4160 queueAhead=0
That one is interesting, but it’s more backpressure evidence than a root cause.
What it means:
- lane=session:... = OpenClaw is serializing work for that specific session
- waitedMs=4160 = this task sat ~4.2s before it could start
- queueAhead=0 does not mean the lane was idle; it only means there was nothing else queued behind the currently active task
So this usually means:
- one task was already active for that session
- the next task had to wait for it to finish or hit a boundary
- this is session-level contention, not necessarily a global crash
By itself, 4 seconds isn’t terrible. The question is whether you’re seeing lots of these, or much bigger waits, right before the tick timeout.
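To answer that from a saved log, a rough Python sketch like this tallies the lane wait lines per lane. The regex is based only on the single line you pasted, so treat the field layout as an assumption; the function names are mine.

```python
import re

# Field layout inferred from one pasted log line; it may not cover every variant.
LANE_WAIT = re.compile(
    r"lane wait exceeded: lane=(?P<lane>\S+) "
    r"waitedMs=(?P<waited>\d+) queueAhead=(?P<ahead>\d+)"
)


def parse_lane_wait(line):
    """Extract lane, wait time, and queue depth from a lane wait line, else None."""
    match = LANE_WAIT.search(line)
    if not match:
        return None
    return {
        "lane": match["lane"],
        "waited_ms": int(match["waited"]),
        "queue_ahead": int(match["ahead"]),
    }


def lane_wait_stats(lines):
    """Per lane, count lane wait events and track the longest wait seen."""
    stats = {}
    for line in lines:
        event = parse_lane_wait(line)
        if event is None:
            continue
        entry = stats.setdefault(event["lane"], {"count": 0, "max_waited_ms": 0})
        entry["count"] += 1
        entry["max_waited_ms"] = max(entry["max_waited_ms"], event["waited_ms"])
    return stats
```

A lane with many events or a large max_waited_ms shortly before the tick timeout would be the one to look at first.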
What I’d want now is the surrounding log window showing:
- a few lines before and after this lane wait exceeded
- any lane task done
- any lane task error
- any gateway restart / drain / reload lines
- any timeout waiting for session store lock
So yeah, this supports the theory that heavier work is clogging the queue/session lane, but it doesn’t yet prove why the gateway later misses ticks.
Two concrete things that would help a lot:
- still paste:
openclaw --version
openclaw status --all
openclaw gateway status --deep