#How to keep pushing a project semi automated?

thorn talon
#

I have a director agent that assigns tasks to 3 other agents, but the problem is that I have to stay there just to say "keep going until it works", and I think that's stupid. Once they have the README with clear instructions, they should enter work mode and not stop until they have doubts or think they're done... is that possible??

placid waspBOT
#

Yeah, mostly yes.

What you want is basically the orchestrator pattern, and OpenClaw supports it, but with a few switches.

If you're using native subagents, the big one is:

{
  agents: {
    defaults: {
      subagents: {
        maxSpawnDepth: 2,
        maxChildrenPerAgent: 5,
        runTimeoutSeconds: 0, // no default timeout
      },
    },
  },
}

Why this matters:

  • default maxSpawnDepth is 1, so a child can't manage more children
  • maxSpawnDepth: 2 enables main → director → workers

If you also want them to stay attached to Discord threads/sessions:

{
  channels: {
    discord: {
      threadBindings: {
        enabled: true,
        spawnSubagentSessions: true,
      },
    },
  },
}

Then spawn with thread: true and mode: "session".

Small caveat, because this part matters:

  • yes, they can keep working without you saying “keep going”
  • no, there is not really a safe “retry forever no matter what” mode
  • they still stop when they think they are done or blocked

So the task prompt/README should explicitly say something like:

Keep iterating until the task is complete. Run checks/tests, fix failures, and only stop if you are blocked by missing info or credentials.
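
If it helps to see those stop conditions as code, here is a rough Python sketch of the loop that prompt describes. It is only an illustration: `run_checks`, `attempt_fix`, and the `max_rounds` cap are hypothetical names, not anything OpenClaw provides, and the cap is there because an unbounded retry loop is not safe.

```python
from typing import Callable, Optional


def iterate_until_done(
    run_checks: Callable[[], bool],
    attempt_fix: Callable[[], Optional[str]],
    max_rounds: int = 10,
) -> str:
    """Loop until checks pass, a blocker is reported, or the round cap is hit.

    run_checks  -- hypothetical callable; returns True when all checks pass.
    attempt_fix -- hypothetical callable; returns a blocker description
                   (e.g. "missing credentials") or None if it applied a fix.
    """
    for round_no in range(1, max_rounds + 1):
        if run_checks():
            return f"done after {round_no} check(s)"
        blocker = attempt_fix()
        if blocker is not None:
            return f"blocked: {blocker}"
    return f"gave up after {max_rounds} rounds"


if __name__ == "__main__":
    # Simulated task: checks fail twice, then pass on the third run.
    results = iter([False, False, True])
    print(iterate_until_done(lambda: next(results), lambda: None))
```

The three return paths map to the three outcomes in the prompt: finished, blocked, or out of budget.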

If you're using Codex / Claude Code / Gemini CLI style workers instead of native subagents, the same idea exists, but that's under ACP sessions, not subagents.

Docs:

If you want, paste:

  • openclaw --version
  • your agents.defaults.subagents config
  • whether these workers are native subagents or ACP/Codex-style sessions

and I’ll tell you the exact setup.

thorn talon
#

this is what I see in the agents part:
"agents": {
"defaults": {
"model": {
"primary": "openai-codex/gpt-5.4"
},
"models": {
"openai-codex/gpt-5.4": {}
},
"workspace": "C:\Users\Openclaw\.openclaw\workspace",
"compaction": {
"mode": "safeguard"
},
"maxConcurrent": 4,
"subagents": {
"maxConcurrent": 8
}
},

and this is the channels part:
"channels": {
"telegram": {
"enabled": true,
"groups": {
"*": {
"requireMention": true
}
},
"groupPolicy": "allowlist",
"streaming": {
"mode": "partial"
},
"accounts": {
"default": {
"botToken": "876701648_xSVDeI-6CDHPil9QlwOw8Q",
"dmPolicy": "pairing"
},
"marketing": {
"botToken": "8707578396:AAEeFpcPMYNCle2ov0wQtU",
"dmPolicy": "pairing"
},
"reverse": {
"botToken": "8532686362:Y6gttKRLW_ejtaoRAp6HIGBps",
"dmPolicy": "pairing"
},
"football": {
"botToken": "862415a73f3VP-kddiBV6f9CQM7w-So",
"dmPolicy": "pairing"
},
"lowcode": {
"botToken": "8039779443:AANEupJhcAFVqPnbcs",
"dmPolicy": "pairing"
},
"tibia": {
"botToken": "6205292813:AAHWIiDHTFR2_ffmpXGo",
"dmPolicy": "pairing"
}
}
}
}

placid waspBOT
#

Two important things:

  1. Please rotate those Telegram bot tokens ASAP. You pasted real secrets in a public thread.
  2. In the config you showed, subagents only has maxConcurrent, so you have not enabled nested orchestration yet.

If your flow is really:

main → director → 3 workers

then add this under agents.defaults.subagents:

{
  agents: {
    defaults: {
      subagents: {
        maxConcurrent: 8,
        maxSpawnDepth: 2,
        maxChildrenPerAgent: 5,
        runTimeoutSeconds: 0
      }
    }
  }
}

What each does:

  • maxSpawnDepth: 2 = lets a subagent act as a director/orchestrator
  • maxChildrenPerAgent: 5 = lets that director have several workers
  • runTimeoutSeconds: 0 = no default timeout

A couple caveats:

  • Yes, this can remove the need for you to keep saying “keep going”
  • No, there is not a magic “work forever until success” switch
  • the agent can still stop if it thinks it is done or blocked

So your README/task should be explicit, something like:

Keep iterating until the task is complete. Run checks/tests, fix failures, and only stop if blocked by missing info, permissions, or credentials. Do not ask for confirmation between normal steps.

Also, the channels block you pasted is Telegram-only, so thread-binding settings are not relevant here. Thread-bound persistent subagent sessions are currently a Discord thing, not Telegram.

One more nuance:

  • if main agent is directly spawning the 3 workers, maxSpawnDepth is not needed
  • if the director subagent is spawning the workers, then maxSpawnDepth: 2 is the key
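
To make the two topologies concrete, here is a small Python sketch that measures them. It assumes depth is counted as spawn levels below main, which matches the description above (1 = main spawns children, 2 = those children spawn their own). The `Agent` class and both helper functions are made up for illustration and are not OpenClaw code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    name: str
    children: List["Agent"] = field(default_factory=list)


def spawn_depth(agent: Agent) -> int:
    """Number of spawn levels below this agent (0 if it spawns nothing)."""
    if not agent.children:
        return 0
    return 1 + max(spawn_depth(child) for child in agent.children)


def max_children(agent: Agent) -> int:
    """Largest number of direct children held by any agent in the tree."""
    return max([len(agent.children)] + [max_children(c) for c in agent.children])


workers = [Agent(f"worker-{i}") for i in range(1, 4)]
flat = Agent("main", workers)                         # main -> 3 workers
nested = Agent("main", [Agent("director", workers)])  # main -> director -> 3 workers

if __name__ == "__main__":
    for label, tree in (("flat", flat), ("nested", nested)):
        print(label, "needs maxSpawnDepth >=", spawn_depth(tree),
              "and maxChildrenPerAgent >=", max_children(tree))
```

Under that assumption the flat layout needs depth 1 and the nested one needs depth 2, with at least 3 children per agent in both cases.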

Docs:

If you paste openclaw --version, I can give you the exact final config shape for your setup.

thorn talon
#

After changing agents.defaults.subagents, should I restart something?

placid waspBOT
#

Usually, restart the Gateway, not the whole machine.

OpenClaw does support config hot-reload by default (gateway.reload.mode: "hybrid"), so some config changes apply without a manual restart, but for agents.defaults.subagents I would do the safe thing:

  1. save config
  2. run openclaw gateway restart
  3. stop any old director/workers
  4. start a fresh run

Why I’d restart anyway: subagent role/capabilities are written when the session is spawned, so an already-running director may keep the old behavior.
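
A toy Python model of that "written at spawn" behavior, in case it helps. This is not how OpenClaw is implemented internally; the `Session` class and `spawn` function are hypothetical and only show why a session created before a config edit keeps the old value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Toy session whose limits are copied once, at spawn time."""
    name: str
    max_spawn_depth: int


def spawn(name: str, config: dict) -> Session:
    # The value is read here and never looked up again.
    return Session(name=name, max_spawn_depth=config["maxSpawnDepth"])


config = {"maxSpawnDepth": 1}
old_director = spawn("director", config)

config["maxSpawnDepth"] = 2  # edit the config after the session exists
new_director = spawn("director", config)

if __name__ == "__main__":
    print(old_director.max_spawn_depth, new_director.max_spawn_depth)
```

The old director still carries depth 1, and only the freshly spawned one sees depth 2, which is the reason to respawn after the change.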

So short answer: not always required, but yes, for this change I’d restart the gateway and respawn the agents.

Docs:

If you want, after you change it I can sanity-check the exact final agents.defaults.subagents block.

thorn talon
#

I've connected to OpenAI because I have the $30 plan, but I'm not sure which limits I have

placid waspBOT
#

Short version: the $30 ChatGPT/OpenAI subscription does not tell you the exact OpenClaw limits by itself.

The important split is:

  • Codex OAuth / ChatGPT sign-in → subscription-style quota, OpenAI-managed
  • OpenAI API key → pay-per-token Platform API billing

In OpenClaw those are treated as different auth paths, so:

  • ChatGPT/Codex sign-in does not automatically mean you have direct API credits
  • Codex OAuth limits can be different from the ChatGPT website/app
  • the exact usable window is whatever OpenAI is exposing to that auth path

Best way to check what OpenClaw can currently see:

If you paste the output of openclaw models status and openclaw --version (with anything sensitive redacted), I can help interpret it.

thorn talon
#

14:14:30+02:00 [tools] exec failed: elevated is not available right now (runtime=direct). Failing gates: allowFrom (tools.elevated.allowFrom.<provider> / agents.list[].tools.elevated.allowFrom.<provider>) Context: provider=exec-event session=agent:main:telegram:direct:846947265 Fix-it keys: - tools.elevated.enabled - tools.elevated.allowFrom.<provider> - agents.list[].tools.elevated.enabled - agents.list[].tools.elevated.allowFrom.<provider> raw_params={"command":"Get-ChildItem C:\ProgramData\chocolatey\logs; Write-Output '---'; Get-Content C:\ProgramData\chocolatey\logs\chocolatey.log -Tail 200","workdir":"C:\Users\Openclaw\.openclaw\workspace","yieldMs":10000,"timeout":120,"elevated":true}

placid waspBOT
#

That log is actually pretty revealing:

  • the tool call asked for elevated: true
  • but the session says runtime=direct

So in this case, elevated mode is not needed. In direct runtime, exec is already on the host. The failure is only because the agent requested the elevated escape hatch and your config does not allow it.
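
If you want to triage lines like this automatically, a rough Python sketch could look like the following. It only pattern-matches the two fields discussed above (`runtime=` and `"elevated":true`). The function name and the returned messages are my own, and the sample line is shortened from the log you pasted.

```python
import re


def diagnose_elevated_failure(log_line: str) -> str:
    """Classify an 'elevated is not available' log line by runtime and request flag."""
    runtime = re.search(r"runtime=(\w+)", log_line)
    asked_elevated = re.search(r'"elevated"\s*:\s*true', log_line) is not None
    if runtime is None:
        return "no runtime field found"
    if runtime.group(1) == "direct" and asked_elevated:
        return "direct runtime: elevated was requested but is unnecessary"
    if asked_elevated:
        return f"{runtime.group(1)} runtime: elevated requested, check allowFrom gates"
    return f"{runtime.group(1)} runtime: elevated not requested"


if __name__ == "__main__":
    # Shortened excerpt of the log line from this thread.
    sample = ('exec failed: elevated is not available right now (runtime=direct). '
              'raw_params={"timeout":120,"elevated":true}')
    print(diagnose_elevated_failure(sample))
```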

First thing I’d try

Send this in that session:

/elevated off

Then retry. Or start a fresh session and make sure the agent is not forcing elevated exec.

If you really do want elevated mode

Only needed when you're running sandboxed and want to break out of the sandbox.

Add something like:

{
  tools: {
    elevated: {
      enabled: true,
      allowFrom: {
        telegram: ["tg:YOUR_TELEGRAM_NUMERIC_ID"]
      }
    }
  }
}

If you want, paste your tools block and I’ll point to the exact place to add it.

thorn talon
#

The problem is specialist coordination: short tasks are OK, but longer ones end up in timeouts and desyncs. I'm trying to have 1 agent give instructions to 3 other agents, but I keep running into trouble... Should it matter that I'm running in a VMWARE virtual box with the network adapter in NAT mode??

placid waspBOT
#

VMware NAT mode is probably not the main problem.

If all the agents are running inside the same Gateway/VM, then subagent coordination is mostly internal, so NAT usually does not break that by itself.

Where NAT can matter:

  • if something outside the VM needs to connect into it
  • if the VM has flaky outbound internet / DNS
  • if the VM sleeps, pauses, or drops network for a moment

But for what you described, timeouts/desyncs on long specialist runs sound more like runtime limits or orchestration design, not NAT.

Most likely culprits:

  1. ACP runtime timeout
    If these are ACP/Codex-style sessions, the bundled acpx runtime defaults to 120 seconds for embedded turns. Longer work can hit that.
    Docs: https://docs.openclaw.ai/tools/acp-agents

  2. Subagent run timeout
    For native subagents, check whether you or the agent are setting runTimeoutSeconds.
    Docs: https://docs.openclaw.ai/tools/subagents

  3. Exec timeout
    Long shell commands can also die on exec timeout. Default is 1800s.
    Docs: https://docs.openclaw.ai/tools/exec

  4. Too much micro-coordination
    1 director constantly steering 3 workers in tiny steps is honestly brittle. Bigger, self-contained tasks per worker are usually much more reliable.

So my answer is:

  • NAT mode alone: probably no
  • timeouts/desyncs from long orchestration: probably yes, but from config/runtime/prompt structure

Before going deeper, please paste:

openclaw --version

and ideally the exact timeout/desync message you're seeing. That’ll tell us whether this is:

  • native subagent timeout
  • ACP timeout
  • exec timeout
  • or a network/session issue inside the VM.
thorn talon
#
  1. exactly I am running Codex, I got runtime timeouts yes
  2. not sure how to check that
  3. there are some exec timeouts too

I got the latest openclaw version

placid waspBOT
#

Yep, that points much more to Codex ACP timeout, not VMware NAT.

If you’re running Codex, the first thing to check is:

/acp status

That shows the effective runtime options for that ACP session.

What to do

For a quick test, raise the session timeout:

/acp timeout 300

If 5 minutes is still too short, try 600.

If you want to change the default globally:

openclaw config set plugins.entries.acpx.config.timeoutSeconds 300
openclaw gateway restart

The docs say the bundled acpx runtime defaults embedded turns to 120 seconds, so long Codex turns can absolutely hit that.

About the exec timeouts

Those are a separate layer.

  • exec default timeout is 1800s
  • but if the ACP runtime turn dies first, the whole Codex turn can still fail earlier
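
Here is a small Python sketch of that layering, using the defaults quoted in this thread. It is a simplification: it treats every layer as wrapping the same step and treats 0 as "no timeout". The layer names and the function are made up for illustration.

```python
from typing import Dict, Optional

# Defaults quoted in this thread; 0 is treated here as "no timeout".
LIMITS: Dict[str, int] = {
    "acp_turn": 120,    # bundled acpx runtime, embedded turns
    "exec": 1800,       # exec tool default
    "subagent_run": 0,  # runTimeoutSeconds: 0
}


def first_timeout(step_seconds: int, limits: Dict[str, int]) -> Optional[str]:
    """Return the layer with the smallest limit that the step exceeds, else None."""
    exceeded = {
        name: cap
        for name, cap in limits.items()
        if cap > 0 and step_seconds > cap
    }
    if not exceeded:
        return None
    return min(exceeded, key=exceeded.get)


if __name__ == "__main__":
    print(first_timeout(300, LIMITS))                        # 5-minute step, defaults
    print(first_timeout(300, {**LIMITS, "acp_turn": 600}))   # same step, raised ACP limit
```

With the defaults, a 5-minute step trips the ACP turn limit long before exec matters; raising the ACP limit to 600 lets the same step through.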

So I’d fix the ACP timeout first, then worry about exec.

Practical advice

Honestly, 1 director + 3 Codex workers on one setup gets brittle pretty easily.
I’d try:

  • 1 director + 1 or 2 workers
  • bigger worker tasks
  • less constant back-and-forth steering

That tends to desync a lot less.

If you want one more useful check

Run:

thorn talon
#

ACP error (ACP_SESSION_INIT_FAILED): Session is not ACP-enabled: agent:main:telegram:direct:846947265
next: If this session is stale, recreate it with /acp spawn and rebind the thread.

placid waspBOT
#

Yep, that error makes sense.

It means you ran an ACP command in a normal OpenClaw session, not in a Codex ACP session.

This part says it clearly:

Session is not ACP-enabled: agent:main:telegram:direct:...

So the fix is:

  1. In that Telegram conversation, spawn a Codex ACP session into the current chat:
/acp spawn codex --bind here
  2. Then check it:
/acp status
  3. Then raise the timeout:
/acp timeout 300

If you want to fully reset it first:

/acp close
/acp spawn codex --bind here

Important distinction

  • agent:main:telegram:direct:... = normal OpenClaw chat session
  • ACP commands only work after you bind/spawn an ACP session into that conversation

Since you’re on Telegram, --bind here is the simplest path. You do not need a Discord-style thread for that.

If later you want child topic/thread spawning on Telegram, that’s a separate config path, but I’d get --bind here working first.

Docs:

If you want, I can give you the exact 3-command sequence I’d use to reset the chat cleanly and re-test Codex.

thorn talon
#

well I told you that I am using codex because I see this in the gateway console when starting: [gateway] agent model: openai-codex/gpt-5.4

placid waspBOT
#

Yep, you're right, and that confusion is on me.

[gateway] agent model: openai-codex/gpt-5.4 means:

  • your normal OpenClaw agent is using the Codex/OpenAI auth+model route
  • it does not mean the session is an ACP Codex harness session

Those are two different things.

So in your case:

  • agent:main:telegram:direct:... = normal OpenClaw session
  • model = openai-codex/gpt-5.4
  • therefore /acp ... is the wrong tool path unless you explicitly spawned an ACP session

What to focus on instead

For your setup, the relevant part is native subagents, not ACP.

So I’d look at:

  1. maxSpawnDepth: 2
    if your director subagent is spawning worker subagents

  2. /subagents info and /subagents log
    to see which child is timing out / desyncing

  3. exec timeouts
    because subagents themselves do not have a default timeout unless you set runTimeoutSeconds; exec still has its own timeout

Important implication

If you did not set:

agents.defaults.subagents.runTimeoutSeconds

then your “timeouts” are probably not subagent-run timeouts. They’re more likely:

  • exec timeout
  • model/provider stall
  • orchestration getting brittle from too much back-and-forth

My recommendation

For now, ignore ACP and test this path:

  • keep maxSpawnDepth: 2
  • use 1 director + 1 or 2 workers
  • give workers bigger chunks
  • inspect with /subagents list, /subagents info, /subagents log

Docs:

If you want, paste the exact subagents block you ended up with, and I’ll tell you whether it matches the “director + workers” setup correctly.

thorn talon
#

my director has 3 workers

placid waspBOT
#

Ah, then I’m very suspicious of your concurrency settings.

From the config you posted earlier, you had:

  • agents.defaults.maxConcurrent: 4
  • agents.defaults.subagents.maxConcurrent: 8

But with main + director + 3 workers, you can end up with 5 active runs.

And per docs:

  • agents.defaults.maxConcurrent = max parallel agent runs across sessions
  • subagents.maxConcurrent = subagent lane cap, not the overall global cap

So yes, 3 workers can already be bumping into your global limit.
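
The arithmetic, as a quick Python sketch. It assumes the worst case where main, the director, and every worker are active at the same moment, and that everything except main counts against the subagent lane. Both function names are hypothetical.

```python
def active_runs(workers: int, has_director: bool = True) -> int:
    """Count runs that can be active at once: main, optional director, workers."""
    return 1 + (1 if has_director else 0) + workers


def fits(global_cap: int, subagent_cap: int, workers: int,
         has_director: bool = True) -> bool:
    """True if both the global cap and the subagent lane cap leave room."""
    total = active_runs(workers, has_director)
    subagent_runs = total - 1  # assumption: everything except main is a subagent
    return total <= global_cap and subagent_runs <= subagent_cap


if __name__ == "__main__":
    print(active_runs(3))                                  # main + director + 3 workers
    print(fits(global_cap=4, subagent_cap=8, workers=3))   # the config posted earlier
    print(fits(global_cap=6, subagent_cap=8, workers=3))   # with a raised global cap
```

With a global cap of 4 the five runs do not fit, while a cap of 6 leaves one slot spare.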

I’d change it to something like

{
  agents: {
    defaults: {
      maxConcurrent: 6,
      subagents: {
        maxConcurrent: 8,
        maxSpawnDepth: 2,
        maxChildrenPerAgent: 5,
        runTimeoutSeconds: 0
      }
    }
  }
}

Why

  • maxSpawnDepth: 2 = director can have workers
  • maxChildrenPerAgent: 5 = 3 workers is comfortably allowed
  • runTimeoutSeconds: 0 = no default subagent timeout
  • maxConcurrent: 6 = gives headroom for main + director + 3 workers

Then restart the gateway.

My honest recommendation

Even if 3 workers is supported, I’d still test with 2 workers first.
Multi-agent setups get flaky fast when you add too much fan-out, especially on a VM.

Docs:

If you want, send the exact final agents.defaults block before you restart and I’ll sanity-check it.

thorn talon
#

[13/04/2026 15:03] Adrian H.: /subagents info reverse
[13/04/2026 15:03] Director: ⚠️ Unknown subagent id: reverse

I don't know the ids, but if it's what's specified in openclaw.json then something is wrong, because I can see there is an agent called reverse with id reverse

I don't get what I should do with runTimeoutSeconds

the subagents list brings this:
active subagents:

(none)

recent subagents (last 30m):

(none)

thorn talon
#

can you help?

placid waspBOT
#

Yep. Since you're on native OpenClaw sessions using the Codex model route, this is the config I’d try first.

Keep your existing model/workspace settings, just merge in these parts

{
  agents: {
    defaults: {
      maxConcurrent: 6,
      subagents: {
        maxConcurrent: 8,
        maxSpawnDepth: 2,
        maxChildrenPerAgent: 5,
        runTimeoutSeconds: 0
      }
    }
  },
  tools: {
    exec: {
      timeoutSec: 3600
    }
  }
}

Why these changes

  • maxConcurrent: 6
    gives enough headroom for main + director + 3 workers
  • maxSpawnDepth: 2
    required for director → workers
  • maxChildrenPerAgent: 5
    comfortably allows 3 workers
  • runTimeoutSeconds: 0
    avoids default subagent timeouts
  • tools.exec.timeoutSec: 3600
    gives long shell/test steps more room
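
"Merge in" here means a key-by-key merge, not replacing the whole agents block. A minimal Python sketch of that idea, working on plain dicts (the `deep_merge` helper is my own and is not an OpenClaw tool; the `existing` dict is trimmed from the config you posted):

```python
import copy
import json
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with patch merged in; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


existing = {
    "agents": {"defaults": {
        "model": {"primary": "openai-codex/gpt-5.4"},
        "maxConcurrent": 4,
        "subagents": {"maxConcurrent": 8},
    }}
}
patch = {
    "agents": {"defaults": {
        "maxConcurrent": 6,
        "subagents": {"maxSpawnDepth": 2, "maxChildrenPerAgent": 5,
                      "runTimeoutSeconds": 0},
    }},
    "tools": {"exec": {"timeoutSec": 3600}},
}

if __name__ == "__main__":
    print(json.dumps(deep_merge(existing, patch), indent=2))
```

The model setting survives, the existing subagents.maxConcurrent stays at 8, and only the listed keys are added or changed.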

Then do this

  1. save config
  2. run openclaw gateway restart
  3. start a fresh run, don’t reuse the old one
  4. test with 2 workers first, then 3 if stable

If it still flakes

Use:

  • /subagents list
  • /subagents info <id>
  • /subagents log <id>
thorn talon
#

All three failed immediately with the same pattern:

• timeout to ws://127.0.0.1:18789
• local loopback path
• bind still reported as lan

So the bottleneck is clearly gateway/runtime health, not specialist capability anymore.

placid waspBOT
#

Yep, I agree, that’s gateway/runtime health now, not worker quality.

And one important nuance:

  • if gateway.bind is "lan", the gateway should still be reachable on 127.0.0.1 from inside the VM
  • so a timeout to ws://127.0.0.1:18789 usually means the gateway is not actually listening / is unhealthy / wrong port / wrong config path, not that LAN bind itself is the problem

Also, since you're on VirtualBox NAT:

  • NAT can affect host ↔ guest access unless you forward ports
  • but it should not break 127.0.0.1 inside the guest
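
A quick way to confirm whether anything is listening on that port from inside the guest is a plain TCP connect. This Python sketch uses only the standard library. It checks that the port accepts connections, not that the gateway's WebSocket layer is healthy, and the function name is my own.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 18789,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("gateway port reachable:", is_listening())
```

If this prints False while the gateway is supposedly running, the problem is the process or the port, not the bind mode.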

I’d check these next, in this order

openclaw gateway status --deep
openclaw health

If it looks stuck or unhealthy:

openclaw gateway restart

Things to verify

  1. Port really is 18789
    docs say that’s the default WS port

  2. You are editing the active config file/profile
    stale/wrong config path can make “bind says lan” misleading

  3. If gateway.bind: "lan" is set, gateway.auth is valid
    non-loopback binds require a real auth path

Very likely causes from what you described

thorn talon
#

openclaw gateway status --deep

with gateway running or stopped?

placid waspBOT
#

With it running first.

That gives the useful signal.

So:

  1. leave the gateway running
  2. run:
openclaw gateway status --deep

If it says the gateway is unreachable / not listening / unhealthy, that’s already the clue.

If the gateway is currently stopped, status --deep is still safe to run, but it’ll mostly just confirm “not running”.

So the best sequence is:

openclaw gateway status --deep

and if needed after that:

openclaw gateway restart
openclaw gateway status --deep

Paste the output here and I’ll help read it.

thorn talon
#

did you check?

placid waspBOT
#

Not yet, I can’t see your terminal from here.

I need you to paste the output of:

openclaw gateway status --deep

Then I can check it with you right away.

thorn talon
placid waspBOT
#

Yep, I checked it.

Good news: the final status is healthy now.

These are the key lines:

  • RPC probe: ok
  • Listening: 0.0.0.0:18789

So right now the gateway is up and answering.

What actually happened

The important part is the middle section:

  • Found stale gateway process(es)
  • Port 18789 is already in use
  • restart timed out waiting for health

So the real issue was a stale/wedged gateway process, not “3 bad workers” and not really NAT.

Also, this part is normal:

  • bind=lan
  • probe target ws://127.0.0.1:18789

A LAN bind can still be probed through loopback locally.

What I’d do next

Now that the gateway is healthy again:

  1. start a fresh director/worker run
  2. do not reuse the failed old sessions
  3. if it happens again, run:
openclaw doctor

If the stale-process thing keeps repeating

Use a cleaner bounce:

openclaw gateway stop
openclaw gateway start

instead of only restart.

So, short version: the gateway was sick, now it looks healthy. Retest from a fresh run.

If the next run still dies, paste that new error and we can narrow it down fast.

thorn talon
#

how do I use a fresh run?

placid waspBOT
#

Yep, easiest way:

  1. send /stop
    this aborts the current run and stops spawned subagents too

  2. then send /new
    that starts a fresh session

So the clean sequence is:

/stop
/new

Send them as standalone messages.

thorn talon
#

I saw there are interfaces to see your agents, and that you can even see how they meet in a room and work together. What is that GUI?? Does it take extra tokens or something?

placid waspBOT
#

Yep, the built-in GUI is the Control UI / Dashboard.

Open it with:

openclaw dashboard

or directly in a browser at:

http://127.0.0.1:18789/

Does the GUI itself cost extra tokens?

Normally, no.

Just looking at:

  • sessions
  • logs
  • agents/tools panels
  • status/info

does not itself burn model tokens.

You spend tokens when the UI causes an actual agent/model run, like:

  • sending a chat message
  • spawning workers
  • asking for a new reply
  • running model-backed tools

So the GUI is mostly a control/inspection layer, not a separate billing layer.

One nuance

If you saw a really fancy “agents in a room working together” view, that may have been:

  • a demo
  • a custom client
  • or a visualization on top of the same gateway/session data

The official built-in thing is the Control UI / Dashboard.

Since you’re on VirtualBox NAT

The easiest path is:

  • open the dashboard inside the VM, or
  • use openclaw dashboard there

If you want to open it from your host machine, NAT usually needs port forwarding first.

Docs: