I have a director agent that delegates tasks to 3 other agents, but the problem is that I have to stay there just to say "keep going until it works", and I think that's stupid. Once they have the README with clear instructions, they should enter work mode and not stop until they have doubts or think they're done. Is that possible?
#How to keep pushing a project semi automated?
Yeah, mostly yes.
What you want is basically the orchestrator pattern, and OpenClaw supports it, but with a few switches.
If you're using native subagents, the big one is:
{
agents: {
defaults: {
subagents: {
maxSpawnDepth: 2,
maxChildrenPerAgent: 5,
runTimeoutSeconds: 0, // no default timeout
},
},
},
}
Why this matters:
- default maxSpawnDepth is 1, so a child can't manage more children
- maxSpawnDepth: 2 enables main → director → workers
If you also want them to stay attached to Discord threads/sessions:
{
channels: {
discord: {
threadBindings: {
enabled: true,
spawnSubagentSessions: true,
},
},
},
}
Then spawn with thread: true and mode: "session".
Small caveat, because this part matters:
- yes, they can keep working without you saying “keep going”
- no, there is not really a safe “retry forever no matter what” mode
- they still stop when they think they are done or blocked
So the task prompt/README should explicitly say something like:
Keep iterating until the task is complete. Run checks/tests, fix failures, and only stop if you are blocked by missing info or credentials.
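To picture what that instruction asks for, here is a minimal sketch of a bounded "keep going" loop. It is not how OpenClaw schedules agents internally; `Outcome`, `run_until_done`, and the `max_attempts` budget are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"        # checks passed
    FAILED = "failed"    # checks failed, worth another attempt
    BLOCKED = "blocked"  # missing info or credentials, needs a human


def run_until_done(step: Callable[[int], Outcome], max_attempts: int = 10) -> tuple[Outcome, int]:
    """Call `step` repeatedly until it reports DONE or BLOCKED.

    Returns the final outcome and the number of attempts used. The attempt
    budget is the safety net: this sketch has no "retry forever" mode.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    outcome = Outcome.FAILED
    for attempt in range(1, max_attempts + 1):
        outcome = step(attempt)
        if outcome in (Outcome.DONE, Outcome.BLOCKED):
            return outcome, attempt
    return outcome, max_attempts


if __name__ == "__main__":
    # Hypothetical step: the checks fail twice, then pass on the third attempt.
    print(run_until_done(lambda n: Outcome.DONE if n >= 3 else Outcome.FAILED))
```

The point of the budget is the same as the caveat above: the loop ends on success, on a real blocker, or when it has tried enough times.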
If you're using Codex / Claude Code / Gemini CLI style workers instead of native subagents, the same idea exists, but that's under ACP sessions, not subagents.
Docs:
- https://docs.openclaw.ai/tools/subagents
- https://docs.openclaw.ai/tools/acp-agents
- https://docs.openclaw.ai/gateway/configuration-reference
If you want, paste:
- openclaw --version
- your agents.defaults.subagents config
- whether these workers are native subagents or ACP/Codex-style sessions
and I’ll tell you the exact setup.
this is what I see in the agents part:
"agents": {
"defaults": {
"model": {
"primary": "openai-codex/gpt-5.4"
},
"models": {
"openai-codex/gpt-5.4": {}
},
"workspace": "C:\Users\Openclaw\.openclaw\workspace",
"compaction": {
"mode": "safeguard"
},
"maxConcurrent": 4,
"subagents": {
"maxConcurrent": 8
}
},
and this is the channels part:
"channels": {
"telegram": {
"enabled": true,
"groups": {
"*": {
"requireMention": true
}
},
"groupPolicy": "allowlist",
"streaming": {
"mode": "partial"
},
"accounts": {
"default": {
"botToken": "<REDACTED>",
"dmPolicy": "pairing"
},
"marketing": {
"botToken": "<REDACTED>",
"dmPolicy": "pairing"
},
"reverse": {
"botToken": "<REDACTED>",
"dmPolicy": "pairing"
},
"football": {
"botToken": "<REDACTED>",
"dmPolicy": "pairing"
},
"lowcode": {
"botToken": "<REDACTED>",
"dmPolicy": "pairing"
},
"tibia": {
"botToken": "<REDACTED>",
"dmPolicy": "pairing"
}
}
}
}
Two important things:
- Please rotate those Telegram bot tokens ASAP. You pasted real secrets in a public thread.
- In the config you showed, subagents only has maxConcurrent, so you have not enabled nested orchestration yet.
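For next time, a small helper can blank out secrets before a config is pasted anywhere public. This is only a sketch: `redact_secrets` is a hypothetical helper, `botToken` is the key from the config above, and the other key names in the pattern are assumptions you would adjust to your own file. It does not replace rotating tokens that have already leaked.

```python
import re

# Matches a JSON-style "key": "value" pair for a few secret-bearing keys and
# captures everything up to the value. Only "botToken" comes from the config
# in this thread; the other key names are assumptions.
_SECRET_PAIR = re.compile(r'("(?:botToken|apiKey|token|password)"\s*:\s*)"[^"]*"')


def redact_secrets(config_text: str, placeholder: str = "<REDACTED>") -> str:
    """Return config_text with the values of known secret keys replaced."""
    return _SECRET_PAIR.sub(lambda m: f'{m.group(1)}"{placeholder}"', config_text)


if __name__ == "__main__":
    sample = '{"default": {"botToken": "123:abc", "dmPolicy": "pairing"}}'
    print(redact_secrets(sample))
```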
If your flow is really:
main → director → 3 workers
then add this under agents.defaults.subagents:
{
agents: {
defaults: {
subagents: {
maxConcurrent: 8,
maxSpawnDepth: 2,
maxChildrenPerAgent: 5,
runTimeoutSeconds: 0
}
}
}
}
What each does:
- maxSpawnDepth: 2 = lets a subagent act as a director/orchestrator
- maxChildrenPerAgent: 5 = lets that director have several workers
- runTimeoutSeconds: 0 = no default timeout
A couple caveats:
- Yes, this can remove the need for you to keep saying “keep going”
- No, there is not a magic “work forever until success” switch
- the agent can still stop if it thinks it is done or blocked
So your README/task should be explicit, something like:
Keep iterating until the task is complete. Run checks/tests, fix failures, and only stop if blocked by missing info, permissions, or credentials. Do not ask for confirmation between normal steps.
Also, the channels block you pasted is Telegram-only, so thread-binding settings are not relevant here. Thread-bound persistent subagent sessions are currently a Discord thing, not Telegram.
One more nuance:
- if main agent is directly spawning the 3 workers, maxSpawnDepth is not needed
- if the director subagent is spawning the workers, then maxSpawnDepth: 2 is the key
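If it helps to see the two layouts side by side, here is a rough sketch that checks a planned spawn tree against the two limits. It is an illustration, not OpenClaw code: `Agent` and `violations` are hypothetical, and the depth convention (main at depth 0, its direct subagents at depth 1) is an assumption that matches the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    children: list["Agent"] = field(default_factory=list)


def violations(root: Agent, max_spawn_depth: int, max_children: int) -> list[str]:
    """List the limits a planned spawn tree would exceed.

    Assumed convention: the root (main) agent is depth 0, so a subagent
    spawned by main sits at depth 1 and its own children at depth 2.
    """
    problems: list[str] = []
    stack = [(root, 0)]
    while stack:
        agent, depth = stack.pop()
        if depth > max_spawn_depth:
            problems.append(f"{agent.name}: depth {depth} exceeds maxSpawnDepth {max_spawn_depth}")
        if len(agent.children) > max_children:
            problems.append(f"{agent.name}: {len(agent.children)} children exceeds maxChildrenPerAgent {max_children}")
        stack.extend((child, depth + 1) for child in agent.children)
    return problems


if __name__ == "__main__":
    workers = [Agent(f"worker{i}") for i in range(3)]
    nested = Agent("main", [Agent("director", workers)])
    print(violations(nested, max_spawn_depth=1, max_children=5))  # workers are too deep
    print(violations(nested, max_spawn_depth=2, max_children=5))  # fits
```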
If you paste openclaw --version, I can give you the exact final config shape for your setup.
after changing the agents.defaults.subagents should I restart something?
Usually, restart the Gateway, not the whole machine.
OpenClaw does support config hot-reload by default (gateway.reload.mode: "hybrid"), so some config changes apply without a manual restart, but for agents.defaults.subagents I would do the safe thing:
- save config
- run openclaw gateway restart
- stop any old director/workers
- start a fresh run
Why I’d restart anyway: subagent role/capabilities are written when the session is spawned, so an already-running director may keep the old behavior.
So short answer: not always required, but yes, for this change I’d restart the gateway and respawn the agents.
If you want, after you change it I can sanity-check the exact final agents.defaults.subagents block.
I've connected to openai because I have the 30$ plan but I am not sure which limits I have
Short version: the $30 ChatGPT/OpenAI subscription does not tell you the exact OpenClaw limits by itself.
The important split is:
- Codex OAuth / ChatGPT sign-in → subscription-style quota, OpenAI-managed
- OpenAI API key → pay-per-token Platform API billing
In OpenClaw those are treated as different auth paths, so:
- ChatGPT/Codex sign-in does not automatically mean you have direct API credits
- Codex OAuth limits can be different from the ChatGPT website/app
- the exact usable window is whatever OpenAI is exposing to that auth path
Best way to check what OpenClaw can currently see:
openclaw models status
and if you want more detail:
openclaw models status --json
If you want the most predictable limits/billing, use an OpenAI API key instead of subscription OAuth.
If you paste the output of openclaw models status and openclaw --version (with anything sensitive redacted), I can help interpret it.
14:14:30+02:00 [tools] exec failed: elevated is not available right now (runtime=direct). Failing gates: allowFrom (tools.elevated.allowFrom.<provider> / agents.list[].tools.elevated.allowFrom.<provider>) Context: provider=exec-event session=agent:main:telegram:direct:846947265 Fix-it keys: - tools.elevated.enabled - tools.elevated.allowFrom.<provider> - agents.list[].tools.elevated.enabled - agents.list[].tools.elevated.allowFrom.<provider> raw_params={"command":"Get-ChildItem C:\ProgramData\chocolatey\logs; Write-Output '---'; Get-Content C:\ProgramData\chocolatey\logs\chocolatey.log -Tail 200","workdir":"C:\Users\Openclaw\.openclaw\workspace","yieldMs":10000,"timeout":120,"elevated":true}
That log is actually pretty revealing:
- the tool call asked for elevated: true
- but the session says runtime=direct
So in this case, elevated mode is not needed. In direct runtime, exec is already on the host. The failure is only because the agent requested the elevated escape hatch and your config does not allow it.
First thing I’d try
Send this in that session:
/elevated off
Then retry. Or start a fresh session and make sure the agent is not forcing elevated exec.
If you really do want elevated mode
Only needed when you're running sandboxed and want to break out of the sandbox.
Add something like:
{
tools: {
elevated: {
enabled: true,
allowFrom: {
telegram: ["tg:YOUR_TELEGRAM_NUMERIC_ID"]
}
}
}
}
Then restart the gateway:
openclaw gateway restart
Important nuance
If you also have per-agent restrictions under agents.list[].tools.elevated, then both the global and per-agent allowlists must pass.
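Put as logic, "both must pass" is an AND across the allowlists. The sketch below only illustrates that rule and is not OpenClaw's actual gate: `elevated_allowed` and its parameters are hypothetical, and it leaves out the separate `enabled` flags.

```python
from typing import Optional


def elevated_allowed(
    sender: str,
    global_allow: list[str],
    agent_allow: Optional[list[str]] = None,
) -> bool:
    """Return True only if every configured allowlist contains the sender.

    `agent_allow=None` models an agent with no per-agent restriction, in
    which case only the global list is consulted.
    """
    if sender not in global_allow:
        return False
    if agent_allow is not None and sender not in agent_allow:
        return False
    return True


if __name__ == "__main__":
    # Allowed globally but missing from the per-agent list: still denied.
    print(elevated_allowed("tg:1", ["tg:1"], ["tg:2"]))
```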
So the practical answer is:
- direct runtime → don't use elevated unless you really mean to
- sandboxed runtime → configure tools.elevated.*
If you want, paste your tools block and I’ll point to the exact place to add it.
The problem is specialist coordination. Short tasks are OK, but longer ones end up in timeouts and desyncs. I try to have 1 agent give instructions to 3 other agents, but I keep running into trouble. It shouldn't matter that I am running in VMWARE virtual box with the network adapter in NAT mode, should it?
VMware NAT mode is probably not the main problem.
If all the agents are running inside the same Gateway/VM, then subagent coordination is mostly internal, so NAT usually does not break that by itself.
Where NAT can matter:
- if something outside the VM needs to connect into it
- if the VM has flaky outbound internet / DNS
- if the VM sleeps, pauses, or drops network for a moment
But for what you described, timeouts/desyncs on long specialist runs sound more like runtime limits or orchestration design, not NAT.
Most likely culprits:
- ACP runtime timeout
If these are ACP/Codex-style sessions, the bundled acpx runtime defaults to 120 seconds for embedded turns. Longer work can hit that.
Docs: https://docs.openclaw.ai/tools/acp-agents
- Subagent run timeout
For native subagents, check whether you or the agent are setting runTimeoutSeconds.
Docs: https://docs.openclaw.ai/tools/subagents
- Exec timeout
Long shell commands can also die on exec timeout. Default is 1800s.
Docs: https://docs.openclaw.ai/tools/exec
- Too much micro-coordination
1 director constantly steering 3 workers in tiny steps is honestly brittle. Bigger, self-contained tasks per worker are usually much more reliable.
So my answer is:
- NAT mode alone: probably no
- timeouts/desyncs from long orchestration: probably yes, but from config/runtime/prompt structure
Before going deeper, please paste:
openclaw --version
and ideally the exact timeout/desync message you're seeing. That’ll tell us whether this is:
- native subagent timeout
- ACP timeout
- exec timeout
- or a network/session issue inside the VM.
- exactly I am running Codex, I got runtime timeouts yes
- not sure how to check that
- there are some exec timeouts too
I got the latest openclaw version
Yep, that points much more to Codex ACP timeout, not VMware NAT.
If you’re running Codex, the first thing to check is:
/acp status
That shows the effective runtime options for that ACP session.
What to do
For a quick test, raise the session timeout:
/acp timeout 300
If 5 minutes is still too short, try 600.
If you want to change the default globally:
openclaw config set plugins.entries.acpx.config.timeoutSeconds 300
openclaw gateway restart
The docs say the bundled acpx runtime defaults embedded turns to 120 seconds, so long Codex turns can absolutely hit that.
About the exec timeouts
Those are a separate layer.
- exec default timeout is 1800s
- but if the ACP runtime turn dies first, the whole Codex turn can still fail earlier
So I’d fix the ACP timeout first, then worry about exec.
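The reason for that order is that with stacked timeouts, the shortest active one decides when the work gets cut off. A minimal sketch of that reasoning, using the 120 s and 1800 s defaults mentioned above; `first_timeout` and the layer names are hypothetical, and treating 0 as "no timeout" mirrors how runTimeoutSeconds: 0 is described in this thread.

```python
def first_timeout(layers: dict[str, float]) -> tuple[str, float]:
    """Return the (name, seconds) of the layer that expires first.

    A value of 0 is treated as "no timeout" and ignored.
    """
    active = {name: secs for name, secs in layers.items() if secs > 0}
    if not active:
        raise ValueError("no layer has a timeout configured")
    name = min(active, key=active.get)
    return name, active[name]


if __name__ == "__main__":
    # With the defaults quoted in this thread, the ACP turn expires long before exec.
    print(first_timeout({"acp_turn": 120, "exec": 1800}))
```

Raising only the exec timeout would change nothing here, because the 120-second layer still fires first.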
Practical advice
Honestly, 1 director + 3 Codex workers on one setup is pretty easy to make brittle.
I’d try:
- 1 director + 1 or 2 workers
- bigger worker tasks
- less constant back-and-forth steering
That tends to desync a lot less.
If you want one more useful check
Run:
/acp doctor
If you paste the output of /acp status, I can tell you exactly which timeout is biting you.
ACP error (ACP_SESSION_INIT_FAILED): Session is not ACP-enabled: agent:main:telegram:direct:846947265
next: If this session is stale, recreate it with /acp spawn and rebind the thread.
Yep, that error makes sense.
It means you ran an ACP command in a normal OpenClaw session, not in a Codex ACP session.
This part says it clearly:
Session is not ACP-enabled: agent:main:telegram:direct:...
So the fix is:
- In that Telegram conversation, spawn a Codex ACP session into the current chat:
/acp spawn codex --bind here
- Then check it:
/acp status
- Then raise the timeout:
/acp timeout 300
If you want to fully reset it first:
/acp close
/acp spawn codex --bind here
Important distinction
- agent:main:telegram:direct:... = normal OpenClaw chat session
- ACP commands only work after you bind/spawn an ACP session into that conversation
Since you’re on Telegram, --bind here is the simplest path. You do not need a Discord-style thread for that.
If later you want child topic/thread spawning on Telegram, that’s a separate config path, but I’d get --bind here working first.
If you want, I can give you the exact 3-command sequence I’d use to reset the chat cleanly and re-test Codex.
well I told you that I am using codex because I see this in the gateway console when starting: [gateway] agent model: openai-codex/gpt-5.4
Yep, you're right, and that confusion is on me.
[gateway] agent model: openai-codex/gpt-5.4 means:
- your normal OpenClaw agent is using the Codex/OpenAI auth+model route
- it does not mean the session is an ACP Codex harness session
Those are two different things.
So in your case:
- agent:main:telegram:direct:... = normal OpenClaw session
- model = openai-codex/gpt-5.4
- therefore /acp ... is the wrong tool path unless you explicitly spawned an ACP session
What to focus on instead
For your setup, the relevant part is native subagents, not ACP.
So I’d look at:
- maxSpawnDepth: 2
if your director subagent is spawning worker subagents
- /subagents info and /subagents log
to see which child is timing out / desyncing
- exec timeouts
because subagents themselves do not have a default timeout unless you set runTimeoutSeconds; exec still has its own timeout
Important implication
If you did not set:
agents.defaults.subagents.runTimeoutSeconds
then your “timeouts” are probably not subagent-run timeouts. They’re more likely:
- exec timeout
- model/provider stall
- orchestration getting brittle from too much back-and-forth
My recommendation
For now, ignore ACP and test this path:
- keep maxSpawnDepth: 2
- use 1 director + 1 or 2 workers
- give workers bigger chunks
- inspect with /subagents list, /subagents info, /subagents log
If you want, paste the exact subagents block you ended up with, and I’ll tell you whether it matches the “director + workers” setup correctly.
my director has 3 workers
Ah, then I’m very suspicious of your concurrency settings.
From the config you posted earlier, you had:
- agents.defaults.maxConcurrent: 4
- agents.defaults.subagents.maxConcurrent: 8
But with main + director + 3 workers, you can end up with 5 active runs.
And per docs:
- agents.defaults.maxConcurrent = max parallel agent runs across sessions
- subagents.maxConcurrent = subagent lane cap, not the overall global cap
So yes, 3 workers can already be bumping into your global limit.
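The arithmetic behind that, as a small sketch. `required_runs` and `has_headroom` are hypothetical helpers, and the model assumes every agent in the tree can be active at the same time, which is the worst case rather than a guarantee.

```python
def required_runs(directors: int, workers_per_director: int, include_main: bool = True) -> int:
    """Count agent runs that can be active at the same time."""
    return (1 if include_main else 0) + directors + directors * workers_per_director


def has_headroom(global_max_concurrent: int, directors: int, workers_per_director: int) -> bool:
    """True if the global cap covers main + directors + all workers."""
    return required_runs(directors, workers_per_director) <= global_max_concurrent


if __name__ == "__main__":
    print(required_runs(1, 3))    # main + 1 director + 3 workers
    print(has_headroom(4, 1, 3))  # the cap from the posted config
    print(has_headroom(6, 1, 3))  # a cap with some slack
```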
I’d change it to something like
{
agents: {
defaults: {
maxConcurrent: 6,
subagents: {
maxConcurrent: 8,
maxSpawnDepth: 2,
maxChildrenPerAgent: 5,
runTimeoutSeconds: 0
}
}
}
}
Why
- maxSpawnDepth: 2 = director can have workers
- maxChildrenPerAgent: 5 = 3 workers is comfortably allowed
- runTimeoutSeconds: 0 = no default subagent timeout
- maxConcurrent: 6 = gives headroom for main + director + 3 workers
Then restart the gateway.
My honest recommendation
Even if 3 workers is supported, I’d still test with 2 workers first.
Multi-agent setups get flaky fast when you add too much fan-out, especially on a VM.
If you want, send the exact final agents.defaults block before you restart and I’ll sanity-check it.
[13/04/2026 15:03] Adrian H.: /subagents info reverse
[13/04/2026 15:03] Director: ⚠️ Unknown subagent id: reverse
I don't know the ids, but if it's what's specified in openclaw.json then something is wrong, because I can see there is an agent called reverse with id reverse
I don't get what I should do with runTimeoutSeconds
the subagents list brings this:
active subagents:
(none)
recent subagents (last 30m):
(none)
can you help?
Yep. Since you're on native OpenClaw sessions using the Codex model route, this is the config I’d try first.
Keep your existing model/workspace settings, just merge in these parts
{
agents: {
defaults: {
maxConcurrent: 6,
subagents: {
maxConcurrent: 8,
maxSpawnDepth: 2,
maxChildrenPerAgent: 5,
runTimeoutSeconds: 0
}
}
},
tools: {
exec: {
timeoutSec: 3600
}
}
}
Why these changes
- maxConcurrent: 6
gives enough headroom for main + director + 3 workers
- maxSpawnDepth: 2
required for director → workers
- maxChildrenPerAgent: 5
comfortably allows 3 workers
- runTimeoutSeconds: 0
avoids default subagent timeouts
- tools.exec.timeoutSec: 3600
gives long shell/test steps more room
Then do this
- save config
- run openclaw gateway restart
- start a fresh run, don’t reuse the old one
- test with 2 workers first, then 3 if stable
If it still flakes
Use:
- /subagents list
- /subagents info <id>
- /subagents log <id>
If you paste your updated agents.defaults block, I’ll sanity-check it before you restart.
All three failed immediately with the same pattern:
• timeout to ws://127.0.0.1:18789
• local loopback path
• bind still reported as lan
So the bottleneck is clearly gateway/runtime health, not specialist capability anymore.
Yep, I agree, that’s gateway/runtime health now, not worker quality.
And one important nuance:
- if gateway.bind is "lan", the gateway should still be reachable on 127.0.0.1 from inside the VM
- so a timeout to ws://127.0.0.1:18789 usually means the gateway is not actually listening / is unhealthy / wrong port / wrong config path, not that LAN bind itself is the problem
Also, since you're on VirtualBox NAT:
- NAT can affect host ↔ guest access unless you forward ports
- but it should not break 127.0.0.1 inside the guest
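If you want a check that does not depend on the openclaw CLI at all, a plain TCP probe from inside the guest tells you whether anything is listening on that port. This is a generic sketch using Python's standard library; `can_connect` is a hypothetical helper, and it only tests the TCP connection, not the WebSocket handshake or gateway health.

```python
import socket


def can_connect(host: str = "127.0.0.1", port: int = 18789, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("gateway port reachable:", can_connect())
```

A False here, run inside the VM, points at the gateway process itself rather than at NAT.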
I’d check these next, in this order
openclaw gateway status --deep
openclaw health
If it looks stuck or unhealthy:
openclaw gateway restart
Things to verify
- Port really is 18789
docs say that’s the default WS port
- You are editing the active config file/profile
stale/wrong config path can make “bind says lan” misleading
- If gateway.bind: "lan" is set, gateway.auth is valid
non-loopback binds require a real auth path
Very likely causes from what you described
- gateway process wedged / not listening
- wrong config file loaded
- wrong port
- auth/bind mismatch after config changes
What to paste here
The output of:
openclaw gateway status --deep
That will probably tell us the next move immediately.
openclaw gateway status --deep
with gateway running or stopped?
With it running first.
That gives the useful signal.
So:
- leave the gateway running
- run:
openclaw gateway status --deep
If it says the gateway is unreachable / not listening / unhealthy, that’s already the clue.
If the gateway is currently stopped, status --deep is still safe to run, but it’ll mostly just confirm “not running”.
So the best sequence is:
openclaw gateway status --deep
and if needed after that:
openclaw gateway restart
openclaw gateway status --deep
Paste the output here and I’ll help read it.
Not yet, I can’t see your terminal from here.
I need you to paste the output of:
openclaw gateway status --deep
Then I can check it with you right away.
the output is here: https://pastebin.com/sVHbA59E
Yep, I checked it.
Good news: the final status is healthy now.
These are the key lines:
- RPC probe: ok
- Listening: 0.0.0.0:18789
So right now the gateway is up and answering.
What actually happened
The important part is the middle section:
- Found stale gateway process(es)
- Port 18789 is already in use
- restart timed out waiting for health
So the real issue was a stale/wedged gateway process, not “3 bad workers” and not really NAT.
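"Port already in use" is what a new process gets when an old one is still holding the socket. The sketch below reproduces that condition with the standard library so you can test for it yourself; `port_in_use` is a hypothetical helper and has nothing to do with how the gateway detects stale processes.

```python
import errno
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if binding host:port fails because it is already taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem, e.g. permissions or a bad address
    return False


if __name__ == "__main__":
    # Run this while the gateway is supposed to be stopped.
    print("18789 still held by something:", port_in_use("127.0.0.1", 18789))
```

If this reports True after you have stopped the gateway, a leftover process is still bound to the port.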
Also, this part is normal:
- bind=lan
- probe target ws://127.0.0.1:18789
A LAN bind can still be probed through loopback locally.
What I’d do next
Now that the gateway is healthy again:
- start a fresh director/worker run
- do not reuse the failed old sessions
- if it happens again, run:
openclaw doctor
If the stale-process thing keeps repeating
Use a cleaner bounce:
openclaw gateway stop
openclaw gateway start
instead of only restart.
So, short version: the gateway was sick, now it looks healthy. Retest from a fresh run.
If the next run still dies, paste that new error and we can narrow it down fast.
how do I use a fresh run?
Yep, easiest way:
- send /stop
this aborts the current run and stops spawned subagents too
- then send /new
that starts a fresh session
So the clean sequence is:
/stop
/new
Send them as standalone messages.
That gives you a fresh run without deleting old transcripts.
If you want, after /new I can help you word the next “director + workers” prompt so it’s less likely to stall again.
I saw there are interfaces to see your agents, and that you can even see how they meet in a room and work together, how is that GUI?? it takes extra tokens or something?
Yep, the built-in GUI is the Control UI / Dashboard.
Open it with:
openclaw dashboard
or directly in a browser at:
http://127.0.0.1:18789/
Does the GUI itself cost extra tokens?
Normally, no.
Just looking at:
- sessions
- logs
- agents/tools panels
- status/info
does not itself burn model tokens.
You spend tokens when the UI causes an actual agent/model run, like:
- sending a chat message
- spawning workers
- asking for a new reply
- running model-backed tools
So the GUI is mostly a control/inspection layer, not a separate billing layer.
One nuance
If you saw a really fancy “agents in a room working together” view, that may have been:
- a demo
- a custom client
- or a visualization on top of the same gateway/session data
The official built-in thing is the Control UI / Dashboard.
Since you’re on VirtualBox NAT
The easiest path is:
- open the dashboard inside the VM, or
- use openclaw dashboard there
If you want to open it from your host machine, NAT usually needs port forwarding first.
If you want, I can tell you the exact easiest way to open the dashboard from your current VM setup.