#Best practice for human-in-the-loop workflow with Telegram in OpenClaw

rugged sky
#

Hi Krill,

▎ I'm building a multi-agent system on OpenClaw and need advice on the proper architecture for human-in-the-loop workflows.

▎ My goal:
▎ I want a deterministic workflow engine to control an entire multi-step process from A to Z. Agents should only do their assigned work (research, coding, reviewing); they must NOT decide what comes next and must NOT handle message routing. The SYSTEM dictates the process, not the LLMs.

▎ My setup:
▎ - OpenClaw on Hetzner VPS (Docker)
▎ - Multiple agents: Hans (Chief of Staff), Max (Research), Elon (Engineer), Audit, Architect
▎ - A custom Python workflow runner that executes YAML-defined workflows with agent/approval/loop step types
▎ - Telegram as the only human interface (one user: me)

▎ The problem:
▎ My workflow engine pauses at approval steps and needs user input via Telegram. But OpenClaw's Telegram integration routes incoming messages to agents (Hans), not to my workflow engine. When I reply to an approval request, my message goes to Hans' agent session; Hans has no context about the running workflow and responds independently, bypassing the entire process.

▎ Telegram also only allows ONE long-polling consumer per bot token, so I can't run a second listener alongside OpenClaw.

▎ My question:
▎ What's the recommended OpenClaw pattern for this? Specifically:

▎ 1. Is there a way to route specific Telegram messages (e.g., based on reply-to or callback data) to a custom handler BEFORE they reach an agent?
▎ 2. Does OpenClaw support webhook mode for Telegram, so I could add my own FastAPI gateway in front of it?
▎ 3. Is there a native "workflow correlation" mechanism that lets a message be routed to a waiting workflow instead of an agent?
▎ 4. Production systems like Temporal + Slack solve this with signalWithStart and Correlation-IDs embedded in button payloads. Is there an OpenClaw equivalent?
▎ 5. If no native solution exists, is the recommended pattern to build a separate Telegram gateway service (FastAPI + webhook) that owns the bot token exclusively, and have OpenClaw agents only invoked by the workflow engine (never directly from Telegram)?

▎ I want to avoid reinventing the wheel if OpenClaw already has primitives for this. Any guidance on best practice would be much appreciated.

▎ Thanks!

tacit apexBOT
#

You're thinking about this the right way.

For a deterministic HITL orchestrator, OpenClaw should usually be a worker runtime, not the workflow brain.

Direct answers:

  1. Route Telegram messages before they hit an agent?
  • Not via normal channel config alone.
  • Out of the box, Telegram inbound is routed deterministically by channel/account/peer/session, then to agent flow.
  • There is an advanced plugin interception path (before_dispatch / inbound_claim), but that's plugin-level customization, not a built-in YAML "approval router".
  2. Webhook mode for Telegram?
  • Yes.
  • Use channels.telegram.webhookUrl + channels.telegram.webhookSecret (optional host/path/port).
  • Docs: https://docs.openclaw.ai/channels/telegram
  • This changes transport (polling → webhook), but does not create multi-consumer ingress; Telegram still has one inbound update stream per bot token.
  3. Native workflow-correlation (route to waiting workflow instead of agent)?
  • No generic built-in workflow waiter/correlator today.
  • OpenClaw correlation is mostly session/channel/thread/topic based (and ACP conversation bindings), not "wake workflow by correlation ID" like Temporal signals.
  4. Temporal-like signalWithStart equivalent?
  • Not as a first-class general primitive.
  • Closest pieces are:
    • session/thread/topic routing
    • plugin hooks + conversation bindings
    • IDs in specific subsystems (e.g. exec approvals), but not a universal workflow signal bus.
  5. Recommended production pattern if strict deterministic control is required?
  • Yes: your own Telegram gateway service owning the bot token is the cleanest pattern for your goals.
  • Then treat OpenClaw as an execution backend (research/coding/review steps invoked by your workflow engine), not as direct Telegram ingress.
  • This avoids the "Hans got my approval reply" class of bugs entirely.

If you want, I can sketch two concrete blueprints next:

  • A) "Workflow owns Telegram" (recommended for your setup)
  • B) "OpenClaw owns Telegram + plugin interception bridge" (less clean, less deterministic, but fewer moving parts)

Also, can you paste openclaw --version? Hook/binding behavior has changed across releases.

rugged sky
#

OpenClaw 2026.3.11.

tacit apexBOT
#

Perfect, that version is the key detail.
On OpenClaw 2026.3.11, your safest answer is:

  • No native pre-agent message interception path for Telegram callbacks/replies
  • Webhook mode exists
  • No built-in workflow correlation router like Temporal signals
  • So for strict deterministic HITL: yes, your separate Telegram gateway pattern is the right architecture

Version-specific nuance

In 2026.3.11, Telegram callback clicks are passed as plain text into agent flow (callback_data: <value>), and there isn't a first-class "claim this inbound before agent dispatch" hook yet in that release.

before_dispatch arrived later (2026.3.24+), so on 3.11 you can't rely on that path.


Recommended production pattern for your goal

  1. FastAPI service owns Telegram bot token (single webhook consumer).
  2. Workflow engine is source of truth (state machine, approvals, loops, correlation IDs).
  3. OpenClaw is worker runtime only (invoked by workflow engine for research/coding/review steps).
  4. Workflow engine sends/receives Telegram itself and only calls OpenClaw when needed.

This matches your requirement that the system controls the process, not the LLM.
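
As a rough illustration of the "correlation IDs" in item 2, a gateway could pack the waiting step's identity into each inline button's callback_data and decode it when the click comes back. This is only a sketch: the `wf` prefix, the field layout, and the function names are hypothetical, not an OpenClaw or Telegram convention. The one hard constraint assumed here is Telegram's 64-byte limit on callback_data.

```python
from typing import NamedTuple, Optional

MAX_CALLBACK_BYTES = 64  # Telegram Bot API limit for callback_data
PREFIX = "wf"            # hypothetical marker for workflow-owned buttons


class Correlation(NamedTuple):
    run_id: str
    step_id: str
    action: str


def encode_callback(run_id: str, step_id: str, action: str) -> str:
    """Pack the correlation fields into one callback_data string."""
    for part in (run_id, step_id, action):
        if not part or ":" in part:
            raise ValueError(f"invalid field: {part!r}")
    data = ":".join((PREFIX, run_id, step_id, action))
    if len(data.encode("utf-8")) > MAX_CALLBACK_BYTES:
        raise ValueError("callback_data exceeds 64 bytes")
    return data


def decode_callback(data: str) -> Optional[Correlation]:
    """Return the correlation fields, or None for payloads we did not create."""
    parts = data.split(":")
    if len(parts) != 4 or parts[0] != PREFIX or not all(parts[1:]):
        return None
    return Correlation(*parts[1:])
```

The gateway would call `decode_callback` on every button click and forward only recognized payloads to the workflow engine; anything else can be rejected deterministically.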


If you want to stay closer to OpenClaw ingress

Upgrade to >= 2026.3.24 and explore plugin interception (before_dispatch), but even then, for hard deterministic orchestration, external gateway ownership is usually cleaner.

If helpful, I can draft a concrete request/response contract for your FastAPI ↔ workflow ↔ OpenClaw boundary next (payload schema + correlation fields).

rugged sky
#

Thanks for the clarity on the gateway pattern. Before I ask for your blueprint, quick sanity check:

▎ Instead of building a custom FastAPI gateway from scratch, would n8n be a reasonable fit for our setup? My thinking:

▎ - n8n has a native Telegram "Send and Wait for Response" node (buttons + free-text replies, persistent waits)
▎ - It's self-hostable (Docker), owns the bot token exclusively via webhook
▎ - It has execution persistence over restarts, retries, timeouts, monitoring: all the stuff I'd otherwise build myself
▎ - OpenClaw would become a pure worker runtime, invoked from n8n via Execute Command or a thin HTTP wrapper

▎ The goal is still the same: deterministic process control by the system, agents only do their assigned work. n8n would be the workflow brain, OpenClaw the execution layer.

▎ Specifically wondering:

▎ 1. Have you seen this pattern (n8n + OpenClaw as worker) in production? Does it actually work cleanly, or does it create impedance mismatches (session state, context passing, error handling)?
▎ 2. Invocation pattern: Would you recommend n8n → OpenClaw via docker exec openclaw agent --message ... (quick and dirty), or is it worth building a thin HTTP wrapper around OpenClaw first?
▎ 3. Telegram bot ownership: If n8n owns the bot token via webhook, I disable channels.telegram in openclaw.json entirely, right? Any side effects on existing agents or hooks?
▎ 4. Context passing: When n8n invokes an OpenClaw agent for a task (e.g., "research this brief"), how do you typically pass context (workflow run ID, previous step outputs, file paths) into the agent's message? Stdin? CLI args? Temp files?
▎ 5. Alternative tools: If not n8n, would you recommend Activepieces (MIT license, more open) or Windmill instead? Or is n8n the right fit for a solo user?
▎ 6. Gotchas: Any classic failure modes you've seen when combining a low-code workflow engine with an agent runtime? I'd rather hear the "don't do this" stories upfront than discover them in production.

▎ My preference would be n8n because it removes the need to build a custom FastAPI gateway + wait/resume logic from scratch (which I'd otherwise spend weeks on reinventing what n8n already ships). But I want your sanity check before committing.

▎ Thanks!

tacit apexBOT
#

Yeah, for your goals, n8n is a totally reasonable fit.
I'd actually call it the pragmatic choice for a solo operator.

1) n8n + OpenClaw worker: sane in practice?

Yes, sane architecture.
n8n as deterministic orchestrator + OpenClaw as specialized worker is a clean separation.

Your main impedance risks are:

  • session/context discipline
  • retries/idempotency
  • shell invocation fragility (if using raw docker exec)

But these are solvable with a small contract layer.


2) Invocation pattern: docker exec ... openclaw agent vs wrapper

For production: use a thin HTTP wrapper (very small service), not raw docker exec from n8n nodes.

Why:

  • avoids shell quoting/escaping hell
  • gives typed request/response
  • centralizes timeout/retry/error mapping
  • easier observability

If you want fast validation first, use command mode briefly with:

  • openclaw agent --json ...
  • strict timeout + retry policy

Docs: https://docs.openclaw.ai/cli/agent


3) Telegram bot ownership

If n8n owns Telegram webhook/token, then yes: disable Telegram channel in OpenClaw (channels.telegram.enabled: false).

Side effects (expected):

  • OpenClaw won't receive Telegram inbound
  • OpenClaw also won't send Telegram outbound directly
  • any Telegram-specific pairing/commands/hooks don't apply

That's fine in your design, because n8n is the Telegram edge.


4) Context passing best practice

Don't pass huge blobs inline. Use a stable envelope:

  • workflow_run_id
  • step_id
  • agent_id
  • objective
  • inputs_summary
  • artifact_paths (shared volume paths) or URLs
  • expected_output_schema

Large data => store as files/object storage, pass references.

This keeps prompts stable and reproducible.
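
One possible shape for that envelope, using the field names from the list above. The dataclass, the JSON rendering, and the choice of types are assumptions made for illustration; nothing here is an OpenClaw-defined format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TaskEnvelope:
    workflow_run_id: str
    step_id: str
    agent_id: str
    objective: str
    inputs_summary: str
    # References to files on a shared volume (or URLs), never the content itself.
    artifact_paths: List[str] = field(default_factory=list)
    expected_output_schema: Dict[str, str] = field(default_factory=dict)

    def to_message(self) -> str:
        """Render the envelope as JSON with sorted keys so output is stable."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Sorting the keys means the same envelope always renders to the same text, which makes prompts easier to diff between runs.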


5) n8n vs Activepieces vs Windmill

For your case:

  • n8n: best "ship now" choice, strong Telegram UX + wait/resume
  • Windmill: better if you want code-first, typed jobs, stricter engineering workflow
  • Activepieces: good OSS direction, but for your Telegram-heavy HITL flow n8n is usually smoother today

6) Classic failure modes ("don't do this" list)

  1. Missing idempotency key per approval callback (double approvals)
  2. Reusing one agent session for unrelated runs (context bleed)
  3. Passing giant context in one message instead of artifact refs
  4. Letting workflow retries re-run non-idempotent agent actions
  5. No distributed lock per workflow_run_id (parallel race chaos)
  6. Treating agent output as authoritative without schema validation
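
To make failure mode 1 concrete, here is a minimal sketch of an idempotency guard for approval callbacks, assuming a small SQLite table and a key built from run, step, and attempt number. The table name, key format, and function names are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store that remembers handled approvals."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_approvals ("
        " idempotency_key TEXT PRIMARY KEY)"
    )
    return conn


def claim_approval(conn: sqlite3.Connection, run_id: str, step_id: str,
                   attempt: int) -> bool:
    """Return True only for the first decision on this run/step/attempt."""
    key = f"{run_id}:{step_id}:{attempt}"
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_approvals VALUES (?)", (key,)
        )
    # rowcount is 0 when the key already existed and the insert was ignored.
    return cur.rowcount == 1
```

A double tap on the same button then resolves to one accepted decision and one ignored duplicate, instead of two approvals.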

If you want, I can give you a concrete v1 contract next:

  • n8n node payload schema
  • wrapper API (/run-agent)
  • error classes + retry matrix
  • recommended idempotency keys for Telegram callback handling

(That'll save you a lot of trial-and-error.)

rugged sky
#

Krill, we've gone deep into implementation details and I never told you clearly what I actually want. Let me step back.

▎ What I want (end state)

▎ Solo operator, Hetzner VPS, OpenClaw. I throw projects at the system via Telegram ("Build me a trading bot for Polymarket"). The system takes it from there deterministically; I don't babysit it.

▎ My agents exist already: Hans (Chief of Staff), Max (Research), Architect (Tech Lead), Elon (Engineer), Audit (Quality Gate). They should ONLY do their specialized work. They must NOT decide what comes next. The SYSTEM dictates the process, encoded in code.

▎ The flow I want:

▎ 1. I send brief via Telegram
▎ 2. Hans writes BRIEF.md + asks clarifying questions on Telegram
▎ 3. I reply → Hans refines → replies back to me. This goes back and forth until I'm happy. Not always "approved on the first reply".
▎ 4. Max does deep research
▎ 5. Hans creates PROPOSAL.md. Proactively includes things I didn't ask about but should know: market insights, risks, hidden findings. Back-and-forth until approved.
▎ 6. Spec + Architecture + Phase plan → approve
▎ 7. Build loop: Elon builds a phase → Audit reviews → if rejected, audit feedback flows back to Elon automatically → next phase
▎ 8. Deploy approval → deploy → smoke tests → done

▎ Key properties:
▎ - Approvals are iterative, not one-shot
▎ - Agent responses include proactive insights
▎ - My replies must flow BACK into the workflow, not get intercepted by Hans as freelancing
▎ - System persists across restarts, fails loudly on errors

▎ What I've tried

▎ Custom Python workflow-runner (~1200 lines, YAML workflows with agent/approval/loop steps). Problem: Workflow sends Telegram via openclaw message send, but replies go to Hans's agent session. Hans doesn't know a workflow is running, freelances, spawns agents, duplicates work.

▎ You suggested external gateway pattern (dedicated service owns bot token). Agreed that's right.

▎ Researched tools:
▎ - n8n: Has Telegram "Send and Wait" node, but open issue #15492: HITL hangs on replies. JSON-diff version control is painful.
▎ - Windmill: Code-first, native suspend/resume URLs, no native Telegram (you write 20-line script). Fits a developer better?
▎ - Temporal: Overkill.
▎ - Pure Python: Durable execution is subtly hard.

▎ My real question

▎ Not "what's pragmatic", but what would YOU build if this were your project, for reliability over years?

▎ 1. n8n + OpenClaw worker (your earlier suggestion)
▎ 2. Windmill + OpenClaw worker (better for a developer?)
▎ 3. Keep custom runner + thin FastAPI gateway for Telegram
▎ 4. Something else?

▎ I'd rather spend 2 days on the right architecture than 6 months fighting edge cases in the wrong one. What would you actually pick?

tacit apexBOT
#

Awesome reset. With your goals, I'd pick this without hesitation:

✅ What I'd build

Option 2.5: Windmill as orchestrator + thin Telegram edge service + OpenClaw as worker runtime
(So: mostly your option 2, with explicit Telegram edge ownership.)

If you force me to map to your list: #2.


Why this over the other options

Why not n8n (for years reliability)

  • Great for fast automation, but long-lived developer workflows get painful:
    • version control/diffs
    • testability
    • complex branching/loop readability
  • You already found a blocking HITL edge case. That's a red flag for your exact use case.

Why not keep custom runner as primary engine

  • You'll keep rediscovering durable execution edge cases (resume semantics, retries, idempotency, observability, stuck waits).
  • You already felt this pain at ~1200 LOC.

Why not Temporal

  • Best-in-class durability, but too much operational/cognitive overhead for a solo builder right now.

Architecture I'd actually run

  1. Telegram Edge (FastAPI, owns bot token, webhook)
  • sole Telegram ingress
  • dedupe by update_id
  • maps replies/buttons → workflow signals
  2. Windmill Flow = deterministic brain
  • explicit states: BRIEF_LOOP, PROPOSAL_LOOP, ARCH_APPROVAL, BUILD_PHASE_LOOP, etc.
  • suspend/resume per human approval point
  • retries/timeouts/alerts live here
  3. OpenClaw Worker Adapter (tiny HTTP wrapper)
  • POST /run-task {agentId, runId, stepId, objective, contextRefs}
  • internally calls openclaw agent --json ...
  • returns structured output for Windmill
  4. Artifact store + Git workspace
  5. Postgres-backed state + idempotency keys
  • every inbound Telegram event and every step transition is idempotent
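
The core of the Worker Adapter in item 3 might look roughly like the sketch below. It is hedged heavily: the only CLI flags used are the two that appear in this thread (`--json` and `--message`), and their exact behavior should be checked against the CLI docs before relying on it. The function names, the envelope keys beyond those listed in item 3, the 900-second timeout, and the injectable `runner` are all assumptions for illustration.

```python
import json
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def _run_cli(argv: List[str]) -> str:
    """Default runner: execute the CLI and return its stdout."""
    done = subprocess.run(
        argv, capture_output=True, text=True, timeout=900, check=True
    )
    return done.stdout


def run_task(agent_id: str, run_id: str, step_id: str, objective: str,
             context_refs: List[str], runner: Runner = _run_cli) -> dict:
    """Invoke one agent step and return its parsed JSON output."""
    # How a specific agent is selected on the command line is not shown in
    # this thread, so agentId is only carried inside the message envelope.
    message = json.dumps(
        {
            "agentId": agent_id,
            "runId": run_id,
            "stepId": step_id,
            "objective": objective,
            "contextRefs": context_refs,
        },
        sort_keys=True,
    )
    # An argv list (no shell) sidesteps quoting and escaping problems.
    argv = ["openclaw", "agent", "--json", "--message", message]
    output = json.loads(runner(argv))
    if not isinstance(output, dict):
        raise ValueError("expected a JSON object from the agent CLI")
    return output
```

Passing a fake `runner` in tests lets the orchestrator logic be exercised without starting any agent.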

And yes: disable channels.telegram in OpenClaw in this model.


Key benefit for your "Hans freelancing" problem

Hans can't freelance if he never receives Telegram directly.
Only the orchestrator can invoke Hans, with exact step context.


If you want, I can give you a concrete state machine + table schema next (ready to implement):
workflow_runs, workflow_waits, step_attempts, artifacts, telegram_updates + transition rules for your exact 8-step flow.

rugged sky
#

Perfect, thanks for the clear recommendation. Before I ask you for implementation details, one more open question:

▎ What would YOU actually build to achieve my goal?

▎ Don't optimize for "fit with what I've already tried" or "pragmatic for a solo operator". Just tell me: if this were your project, and you wanted it to work reliably for years, what would you actually build?

▎ Full freedom. You can:
▎ - Recommend a completely different architecture than Windmill + FastAPI edge
▎ - Suggest I throw away all the work I've done so far and start fresh
▎ - Point me at a tool or pattern I haven't considered
▎ - Tell me I'm overengineering and there's a much simpler way
▎ - Or confirm Windmill + edge + OpenClaw-adapter is genuinely what you'd build

▎ I'd rather hear your honest "this is what I'd do" than a validation of my current thinking. I've been going back and forth on this all day and I think my brain has locked onto specific options prematurely.

▎ What would you actually do?

tacit apexBOT
#

Love this question.

If this were my project and I optimized for "works reliably for years" over everything else, I would build:

Temporal + Telegram Edge + OpenClaw Worker Adapter

Not n8n. Not pure custom runner.
Temporal is the one I'd trust long-term for your exact workflow shape.


Why Temporal (for your specific flow)

Your process is basically a textbook durable workflow system:

  • iterative human approval loops
  • long waits (hours/days)
  • retries + compensation
  • strict step ordering
  • deterministic transitions
  • restart-safe execution
  • correlation of inbound replies to exact waiting state

Temporal is built for this. It gives you:

  • workflow history + replay
  • versioning for workflow code changes
  • signals (signalWithStart exactly matches your HITL need)
  • timers, retries, dead-letter patterns
  • strong operational visibility

What I would not do

  • I would not route Telegram directly into OpenClaw.
  • I would not let agent sessions carry orchestration state.
  • I would not keep state in prompts.

Concrete architecture I'd run

  1. Telegram Edge service (owns bot token/webhook)
  • validates/dedupes updates
  • converts button/free-text into workflowId + state + payload
  • sends Temporal signals
  2. Temporal workflows (single source of truth)
  • one workflow per project
  • states: brief loop, proposal loop, architecture approval, phase build/review loop, deploy gate
  • human replies resume exact wait points via signals
  3. OpenClaw Adapter service
  • runAgentTask(agentId, objective, contextRefs, idempotencyKey)
  • returns structured output (schema-validated)
  • OpenClaw is pure specialized executor, not router
  4. Artifact store + git workspace
  • BRIEF.md, PROPOSAL.md, specs, audits as durable artifacts
  • workflows pass references, not giant prompt blobs
  5. Postgres/observability
  • temporal DB + metrics + alerts
  • loud failure notifications to Telegram

So yes, if you ask me for the "years-reliable, no compromise" pick:
I'd build Temporal.

If you want, next I can sketch the exact Temporal workflow skeleton for your 8-step process (signals, activities, and loop boundaries).

rugged sky
#

▎ Krill, after all the back-and-forth, I think we were overengineering. Here's a much simpler architecture I want to build. Tell me honestly if it holds up or if I'm missing something.

▎ The key insight: I was trying to fit conversation into a rigid YAML workflow. Wrong abstraction. Conversation is fluid (sometimes 2 messages, sometimes 20). Execution is deterministic. They should be separate systems.

▎ Phase 1: Discovery (pure conversation, no workflow):

▎ I message Hans on Telegram with a project idea. Hans spawns Max via sessions_spawn to do deep research (Qdrant, web, papers, APIs). Max writes findings to the project folder as RESEARCH.md, returns to Hans. Hans comes back to me with substantive feedback: options, showstoppers, proactive recommendations from the research. We iterate on Telegram: no workflow, no approval steps, just normal conversation. Sometimes 2 messages, sometimes 20. Hans keeps BRIEF.md updated with each iteration. When I'm satisfied, I say "build it".

▎ Phase 2: Execution (workflow, deterministic, no human interaction):

▎ Hans's system prompt has a hard rule: when David says "build it" in a project conversation, run the build workflow for that project and nothing else. Workflow reads the existing BRIEF and RESEARCH from Phase 1. Runs deterministically: Max writes spec, Architect writes architecture and phase plans, Elon builds phase-by-phase, Audit reviews each phase with rejection feedback loop, Elon deploys. No approval steps in the workflow. No Telegram messages from the workflow. No human-in-the-loop routing complexity. When the workflow finishes or fails, Hans gets the background process exit event and reports to me.

▎ Why this works:

▎ Phase 1 is just normal agent conversation. No routing problem, no custom gateway, no Telegram-to-workflow bridge. Hans talks to me, Hans spawns Max, Hans updates files. Standard OpenClaw.

▎ Phase 2 is deterministic with zero human interaction during execution. No approvals mid-flow, no Telegram replies needed, no correlation IDs, no gateway service. The workflow just runs, reports at the end.

▎ The only critical transition point is Phase 1 to Phase 2, triggered by me saying "build it". That's a single unambiguous signal. If Hans gets it wrong, I see it immediately.

▎ Hans has a clear identity: Chief of Staff equals orchestrator, not a builder. System prompt enforces: never build yourself, never delegate project work manually, only start the build workflow when David says "build it".

▎ What falls away: no n8n, no Windmill, no Temporal, no FastAPI Telegram gateway, no approval steps in the workflow, no include-files feature, no Telegram routing logic, no correlation IDs. Only keep: custom Python workflow runner (drastically simplified), rejection feedback loop in build, error notifications.

▎ My questions:

▎ 1. Is this sound, or am I missing a failure mode that will bite me in 3 weeks?
▎ 2. The transition point (David says "build it" then Hans runs workflow) relies on Hans reliably detecting the trigger. Is that good enough, or do I need a more deterministic signal like a slash command?
▎ 3. Anything you would do differently given this simpler scope?

▎ I'd rather ship something simple that works than spend weeks building the perfect architecture. Thanks for your patience through all the iteration.

tacit apexBOT
#

This is much better. Seriously.
Your split is right: conversation fluid, execution deterministic.

1) Is it sound?

Yes, with guardrails.
The model is clean and shippable. The likely "3 weeks later" failures are mostly operational, not conceptual:

  • accidental trigger ("build it" in normal chat)
  • duplicate runs from repeated trigger
  • two projects bleeding context in one session
  • BRIEF/RESEARCH changing mid-run (non-reproducible build)
  • silent runner crash / partial failure with no alert

If you add protections for those, this architecture holds up.


2) Trigger phrase vs deterministic command

Use a deterministic command, not natural language trigger.

I'd do:

  • /build <project_id>
  • Hans replies with plan snapshot + checksum
  • you confirm with /confirm <run_id>

That one extra step eliminates most accidental/misparsed starts.

So yes: slash command (or exact command pattern) > "build it".


3) What I'd change (minimal, still simple)

Keep your design, but add these 6 rules:

  1. Phase handoff freeze
  • copy BRIEF.md + RESEARCH.md into runs/<run_id>/inputs/
  • compute hash; run always uses frozen snapshot
  2. Single active run lock per project
  • no concurrent builds for same project
  3. Idempotent start
  • same start command + same project within N mins returns existing run_id
  4. Hard state machine in runner
  • Hans only triggers runner; Hans does not "manually orchestrate" steps
  5. Loud failure channel
  • any run failure => immediate Telegram alert with phase + error + next action
  6. Escape hatches
  • /status <project_id>
  • /abort <run_id>
  • /retry <run_id> <phase>

If you do this, you get 80–90% of the reliability benefit without n8n/Windmill/Temporal overhead.
So yes, ship this. It's the first architecture you described that is both clean and realistically maintainable fast.

rugged sky
#

Krill, thanks for the green light. I'm going to build this today. Before I start, I'd love to pick your brain one more time on the
practical stuff that's hard to find in docs. Quick questions:

▎ 1. Phase-Handoff Freeze: implementation
▎ You suggested copying BRIEF and RESEARCH into runs/run_id/inputs/ with a hash. What's the typical pattern? Just shutil.copy + sha256 of concatenated file contents? Or do people go further (git commit the snapshot, tar.gz it, store in object storage)? What's enough without being overkill for a solo operator?

▎ 2. Single Active Run Lock: where does it live
▎ For the lock per project, should I use a lockfile on disk, a SQLite row, or something else? I have no Redis and want to keep it simple. Also: what happens to a stale lock if the runner crashes mid-execution? How do I detect and clear it safely?

▎ 3. Idempotent Start: the N minutes window
▎ What's a reasonable N for the idempotency window? 5 min? 1 hour? And how do I hash the start request to detect duplicates: just project_id, or project_id plus brief hash plus research hash?

▎ 4. Slash command parsing in Hans
▎ Hans receives Telegram messages as text. For /build project_id, should I add actual parsing logic in Hans's system prompt (regex-style), or rely on the LLM to detect the pattern? The LLM approach worries me because that's exactly the probabilistic routing we're trying to escape.

▎ 5. Loud failure: what's actually actionable
▎ For the failure Telegram alert, what info do you find most useful in practice? Phase name plus error trace plus suggested next action? Or is short and direct better ("Phase 3 failed, run wf retry run_id")?

▎ 6. Operational gotchas you've seen in production
▎ What's the one thing people always forget when building this pattern that bites them in week 2? Log rotation? Disk space from run artifacts? Zombie background processes? I'd rather hear it now than discover it later.

▎ 7. Testing strategy
▎ How do you typically test a workflow like this without running the full agent stack every time? Do you mock the agent calls? Run a smoke mode that skips long steps? Just ship and fix in prod?

▎ Thanks for being so generous with your time on this. I promise this is the last round of questions before I actually build the thing.

tacit apexBOT
#

Love these questions. This is the exact "make it real" layer.

Here's what I'd do, with simple defaults that won't bite you.


1) Phase-handoff freeze

For solo: copy + manifest + hash is enough.

  • Copy BRIEF.md, RESEARCH.md to runs/<run_id>/inputs/
  • Compute SHA-256 per file (raw bytes)
  • Write runs/<run_id>/inputs/manifest.json:
    • project_id, run_id, created_at
    • source paths
    • file hashes
    • optional git commit hash (if repo exists)

No tar/object storage yet. Add later only if needed.
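
A minimal sketch of that freeze step using only the standard library. The function name and the exact manifest layout are assumptions; the optional git commit hash is left out to keep it short.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def freeze_inputs(project_dir: Path, run_dir: Path, project_id: str,
                  run_id: str, names=("BRIEF.md", "RESEARCH.md")) -> dict:
    """Copy the input files into the run folder and write manifest.json."""
    inputs = run_dir / "inputs"
    # exist_ok=False: refuse to overwrite an existing snapshot.
    inputs.mkdir(parents=True, exist_ok=False)
    files = {}
    for name in names:
        src = project_dir / name
        dst = inputs / name
        shutil.copy2(src, dst)
        # Hash the raw bytes of the frozen copy, not the live file.
        digest = hashlib.sha256(dst.read_bytes()).hexdigest()
        files[name] = {"source": str(src), "sha256": digest}
    manifest = {
        "project_id": project_id,
        "run_id": run_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }
    (inputs / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The runner would then read only from `runs/<run_id>/inputs/`, so later edits to the live BRIEF.md cannot affect a run in progress.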


2) Single active run lock

Use SQLite, not lockfiles.

Best simple pattern:

  • workflow_runs table
  • enforce "one active run per project" with a unique constraint/guard on active statuses (queued|running)

Crash/stale handling:

  • store heartbeat_at on run
  • runner updates heartbeat every ~30s
  • if heartbeat older than threshold (e.g. 3–5 min), mark stale + allow recovery command (/recover <project>)

This is safer and easier to debug.
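
One way to express that guard is a partial unique index over the active statuses, sketched below with the standard sqlite3 module. The column set, the `stale` status name, and the helper functions are assumptions; timestamps are plain epoch seconds to keep the example short.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS workflow_runs (
    run_id       TEXT PRIMARY KEY,
    project_id   TEXT NOT NULL,
    status       TEXT NOT NULL,
    heartbeat_at REAL NOT NULL
);
-- At most one queued/running row per project; finished rows are unconstrained.
CREATE UNIQUE INDEX IF NOT EXISTS one_active_run_per_project
    ON workflow_runs (project_id)
    WHERE status IN ('queued', 'running');
"""


def start_run(conn, run_id, project_id, now=None) -> bool:
    """Insert a running row; return False if the project is already active."""
    now = time.time() if now is None else now
    try:
        with conn:
            conn.execute(
                "INSERT INTO workflow_runs VALUES (?, ?, 'running', ?)",
                (run_id, project_id, now),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def mark_stale(conn, threshold_s=300.0, now=None) -> int:
    """Flag running rows whose heartbeat is too old; return how many."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE workflow_runs SET status = 'stale' "
            "WHERE status = 'running' AND heartbeat_at < ?",
            (now - threshold_s,),
        )
    return cur.rowcount
```

Because a stale row no longer matches the index condition, a recovery command can start a fresh run for the same project once `mark_stale` has flagged the dead one.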

3) Idempotent start window (N)

Use 10 minutes to start.

Dedupe key should include:

  • project_id
  • command (build)
  • input_snapshot_hash (from frozen BRIEF+RESEARCH hashes)
  • requested_by (your user id)

So same command + same snapshot won't double-run, but changed brief/research can start a new run immediately.
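
A small sketch of how that dedupe key and the 10-minute window could fit together. The in-memory registry is only for illustration (a real runner would persist it), and the class and function names are hypothetical.

```python
import hashlib
import json

WINDOW_SECONDS = 600  # the 10-minute starting value suggested above


def dedupe_key(project_id: str, command: str, input_snapshot_hash: str,
               requested_by: str) -> str:
    """Hash the four fields into one stable key."""
    # A JSON list keeps field boundaries unambiguous before hashing.
    payload = json.dumps([project_id, command, input_snapshot_hash, requested_by])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StartRegistry:
    """Remembers recent start requests: key -> (run_id, started_at)."""

    def __init__(self) -> None:
        self._seen = {}

    def start_or_existing(self, key: str, new_run_id: str, now: float) -> str:
        """Return the existing run_id inside the window, else register a new one."""
        previous = self._seen.get(key)
        if previous is not None and now - previous[1] < WINDOW_SECONDS:
            return previous[0]
        self._seen[key] = (new_run_id, now)
        return new_run_id
```

Since the snapshot hash is part of the key, editing BRIEF.md or RESEARCH.md produces a different key and is never blocked by the window.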


4) /build parsing in Hans

Do not rely on LLM parsing for this.

Deterministic approach:

  • parse /build <project_id> in your runner/tool code (regex/arg parser)
  • Hans can call that tool, but the parser itself is deterministic and authoritative
  • invalid syntax => deterministic error reply

You're right to be worried: probabilistic command detection is exactly what you're trying to eliminate.
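
A minimal sketch of such a parser. The allowed character set for project IDs and the reply strings are assumptions; the point is only that the same input always yields the same result.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical rule: lowercase letters, digits, "_" and "-", up to 64 characters.
BUILD_RE = re.compile(r"/build\s+([a-z0-9][a-z0-9_-]{0,63})")


class BuildCommand(NamedTuple):
    project_id: str


def parse_build(text: str) -> Optional[BuildCommand]:
    """Return a BuildCommand for an exact '/build <project_id>', else None."""
    match = BUILD_RE.fullmatch(text.strip())
    return BuildCommand(match.group(1)) if match else None


def handle(text: str) -> str:
    """Deterministic reply for both valid and invalid input."""
    command = parse_build(text)
    if command is None:
        return "Usage: /build <project_id>"
    return f"Starting build for {command.project_id}"
```

Using `fullmatch` means a sentence that merely contains "/build polybot" or "build it" never starts a run.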


5) Loud failure alert format

Use short + actionable. Avoid full trace spam in chat.

Template:

  • ❌ Run failed
  • Project, Run ID, Phase, Step
  • one-line root cause
  • exact next commands:
    • /retry <run_id> <step>
    • /abort <run_id>
    • /status <project_id>
  • path to full log/artifact

That's the sweet spot.
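
The template could be rendered by a small formatter like the one below. The function name, the field labels, and the 200-character cap on the cause are assumptions for illustration.

```python
def format_failure_alert(project_id: str, run_id: str, phase: str, step: str,
                         cause: str, log_path: str) -> str:
    """Build a short, actionable failure message for chat."""
    # Collapse whitespace so a multi-line traceback becomes one short line.
    one_line_cause = " ".join(cause.split())[:200]
    return "\n".join([
        "❌ Run failed",
        f"Project: {project_id} | Run: {run_id}",
        f"Phase: {phase} | Step: {step}",
        f"Cause: {one_line_cause}",
        "Next:",
        f"  /retry {run_id} {step}",
        f"  /abort {run_id}",
        f"  /status {project_id}",
        f"Log: {log_path}",
    ])
```

The full trace stays in the log file; the chat message only points to it.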


6) Week-2 gotcha

Biggest one: no idempotency on start signal → duplicate parallel runs.

Second: no retention policy (artifacts/logs grow fast).

  • add cleanup now: keep last N runs per project + compress old logs weekly.

7) Testing strategy

Don't test everything end-to-end every time.

Use 3 layers:

  1. State machine unit tests (mock agent outputs)
  2. Adapter contract tests (OpenClaw response schema + error mapping)
  3. One smoke E2E per day (/build → one short phase)

Also add a --dry-run mode that skips real agent calls and returns fixture outputs. Massive speedup.


If you want, I can draft a minimal SQLite schema (workflow_runs, run_events, idempotency_keys) you can paste in and start coding tonight.

rugged sky
#

Step back: I was about to overengineer your advice

▎ Krill, quick honest moment. When you offered to draft the SQLite schema with workflow_runs, run_events, idempotency_keys, heartbeats etc., I was about to say yes, and then realized that's probably overkill for my scale. I'm a solo operator, 3-4 projects per week, one user (me). Your advice is all solid, but some of it reads like patterns for a team running dozens of workflows per day, not one person running a handful per week.

▎ So let me ask it differently: what's the absolute minimum that won't fall over?

▎ My instinct says:

▎ - A single lockfile per project (file exists = build running). Crash recovery = rm the file. No SQLite, no heartbeats.
▎ - No idempotency keys. If I accidentally say /build twice, Hans checks the lockfile and tells me "already running, want to abort?".
▎ - No frozen BRIEF/RESEARCH snapshots with hashes. The files are in the project folder. If I edit them mid-run that's my problem (I won't).
▎ - No retention policy. 3-4 projects per week times 52 weeks = 200 folders per year. My disk doesn't care.
▎ - The ONE thing I keep from your list: deterministic slash command parsing (not LLM-based). That's non-negotiable because LLM routing is exactly the thing we're trying to kill.

▎ My questions:

▎ 1. Is this minimum actually enough, or am I about to skip something that will bite me even at my tiny scale?
▎ 2. Which of your 6 guardrails are "always worth it no matter the scale" and which are "only worth it at team scale"? I want to cut ruthlessly but not recklessly.
▎ 3. What's the dumbest, simplest version of this that still works? If you had to build this for yourself as a solo developer in one evening, what would you skip and what would you absolutely keep?
▎ 4. Any tiny tricks you use personally that make solo workflow development nicer? (Not production patterns, just the little quality-of-life things you only learn from experience.)

▎ I want to ship something I can actually finish today, not something I'll still be tweaking in two weeks. Fighting my own tendency to overbuild here; pretty sure you can relate.

▎ Thanks for being brutally honest throughout this.

tacit apexBOT
#

100% fair callout. You're right: a lot of what I gave is "future-proof team scale."

Brutally honest solo answer:

Your minimum is mostly fine, with 2 tiny upgrades

You can skip SQLite, heartbeats, idempotency tables, retention jobs for now.

But I'd still keep:

  1. deterministic /build parsing (you already do)
  2. lockfile with atomic create
  3. copy BRIEF + RESEARCH into run folder at start (no hashes needed)

That third one is cheap insurance and takes like 10 lines. Worth it at any scale.


1) Is your minimum enough?

Yes, if you add that input copy.
Without it, the easiest self-footgun is "I edited BRIEF mid-run and now output is weird and irreproducible."


2) Guardrails: always-worth-it vs team-scale

Always worth it (even solo)

  • deterministic command parser (/build <project>)
  • single-run lock per project (atomic lockfile)
  • freeze inputs by copying files into run dir
  • loud failure message with run_id + phase + next action

Skip for now (add later)

  • SQLite/event store
  • heartbeats/lease expiry
  • idempotency key tables/windows
  • retention automation
  • fancy recovery orchestration

3) Dumbest simple version I'd build in one evening

Exactly this:

  • /build <project>
  • create .build.lock using atomic create (open(..., "x"))
  • create runs/<run_id>/
  • copy BRIEF.md + RESEARCH.md into runs/<run_id>/inputs/
  • run workflow subprocess, log to runs/<run_id>/run.log
  • on success/fail: Telegram summary
  • in finally: remove lock
  • add /status <project> and /abort <project> (can be basic)

That's it. Ship it.
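
A minimal sketch of the lock + input-freeze steps above, assuming a project folder that holds BRIEF.md and RESEARCH.md. The names start_run, finish_run and AlreadyRunning are made up for illustration, and the lock content format is an assumption too.

```python
import os
import shutil
import time
from pathlib import Path


class AlreadyRunning(Exception):
    """Raised when another build already holds the project lock."""


def start_run(project_dir: Path, run_id: str) -> Path:
    """Acquire the project lock, freeze inputs, return runs/<run_id>/."""
    lock_path = project_dir / ".build.lock"
    try:
        # Mode "x" raises FileExistsError if the file already exists, so
        # "check" and "create" happen in one atomic step.
        with open(lock_path, "x") as lock:
            lock.write(f"{run_id} pid={os.getpid()} started_at={int(time.time())}\n")
    except FileExistsError:
        raise AlreadyRunning(f"{project_dir.name}: build already running") from None

    try:
        inputs_dir = project_dir / "runs" / run_id / "inputs"
        inputs_dir.mkdir(parents=True)
        # Copy the live files so later edits cannot affect this run.
        for name in ("BRIEF.md", "RESEARCH.md"):
            src = project_dir / name
            if src.exists():
                shutil.copy2(src, inputs_dir / name)
    except Exception:
        lock_path.unlink(missing_ok=True)  # do not leave a lock behind
        raise
    return inputs_dir.parent


def finish_run(project_dir: Path) -> None:
    """Release the lock; meant to be called from a finally block."""
    (project_dir / ".build.lock").unlink(missing_ok=True)
```

The caller would wrap the workflow subprocess in try/finally and call finish_run in the finally branch, so the lock goes away on success, failure and exceptions alike (but not on kill -9, which is what a doctor command is for).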


4) Tiny solo QoL tricks

  • runs/latest -> runs/<run_id> symlink
  • first line of every log: run_id, project, started_at
  • one ./wf doctor command:
    • shows active locks
    • shows stale locks
    • tails latest failed run
  • --dry-run mode returning stub outputs for fast loop testing
  • keep a NOTES.md in each project with "what went wrong / what to fix next"
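
For the runs/latest symlink, one possible sketch: create a temporary link and rename it over the old one, so readers never see a missing link. The function name point_latest and the temp file name are illustrative assumptions.

```python
import os
from pathlib import Path


def point_latest(runs_dir: Path, run_id: str) -> None:
    """Make runs/latest point at runs/<run_id> via an atomic rename."""
    tmp = runs_dir / ".latest.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # leftover from an interrupted earlier call
    # Relative target, so the link survives moving the project folder.
    os.symlink(run_id, tmp)
    # rename() replaces the existing symlink itself, not its target.
    os.replace(tmp, runs_dir / "latest")
```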

You're not underbuilding with this version.

You're doing the right thing: minimum deterministic core first, then harden only when pain appears.

rugged sky
#

is this best practice, a water proof and clean implementation?

tacit apexBOT
#

Short answer: best practice for your scale? yes.
Waterproof? no. Nothing workflow-y is truly waterproof.

It's a clean solo implementation if you keep these non-negotiables:

  1. deterministic /build <project> parser (no LLM guessing)
  2. atomic project lockfile
  3. copy BRIEF/RESEARCH into run inputs at start
  4. try/finally lock cleanup + loud fail/success notification

If you have those, you're in the "solid and sane" zone, not hacky.

If you later hit pain (stale locks, duplicate starts, hard recovery), then add SQLite + idempotency, but don't add it before you feel that pain.

rugged sky
#

β–Ž Krill, I built it. Took your advice, kept it minimal. Before I use it for real tomorrow, one last sanity check if anything looks
wrong.

β–Ž What I built:

β–Ž 1. workflow-runner.py v3 (~800 lines, was 1238)
β–Ž - Removed all approval steps, Telegram sending, and include_files feature
β–Ž - Added atomic lockfile per project (open with O_EXCL, removed in finally block)
β–Ž - Added input freeze: BRIEF.md and RESEARCH.md copied to /root/projects/<project>/inputs/ at start
β–Ž - Added runs/latest symlink
β–Ž - Added doctor command: detects active + stale locks and stale runs (status=running without active lock)
β–Ž - Added abort command: SIGTERM β†’ SIGKILL β†’ remove lock β†’ mark run cancelled
β–Ž - Kept rejection feedback loop in build loop (engineer gets audit's rejection text on retry, accumulated)
β–Ž - Loud failure alert on workflow failure: Project / Run-ID / Phase / one-line root cause / next actions (/retry /abort /status) / log
path

β–Ž 2. project-build.yaml (5 steps, no approvals)
β–Ž - research β†’ spec β†’ architecture β†’ build (loop) β†’ deploy
β–Ž - Reads from inputs/BRIEF.md (frozen version, not live file)
β–Ž - No human interaction during execution

β–Ž 3. wf-cmd.py (deterministic slash command handler)
β–Ž - Commands: build, status, abort, retry, list
β–Ž - Regex validation for project names (^[a-zA-Z0-9_-]{1,64}$)
β–Ž - Validates project folder + BRIEF.md exists before starting
β–Ž - Launches runner detached via subprocess.Popen(start_new_session=True)
β–Ž - Returns control to Hans immediately with PID and log path

β–Ž 4. Hans's AGENTS.md β€” new 5-step routing:
β–Ž - Slash command? β†’ call wf-cmd tool (deterministic, no LLM parsing)
β–Ž - Background process event? β†’ status check + report to David
β–Ž - Discovery conversation? β†’ spawn Max for research, maintain BRIEF.md iteratively
β–Ž - Quick task on existing thing? β†’ spawn Elon/Max directly
β–Ž - Anything else? β†’ respond directly

β–Ž What I tested:

β–Ž - test-full regression: parameter substitution + step result references still work βœ“
β–Ž - wf-cmd build β†’ detached process, lockfile created, input frozen, workflow started, agents invoked (verified via ps) βœ“
β–Ž - wf-cmd abort β†’ SIGTERM β†’ SIGKILL β†’ lock removed β†’ run marked cancelled βœ“
β–Ž - doctor detects fake stale run (status=running without active lock) + gives recovery command βœ“
β–Ž - Hans calls exec successfully: I sent Hans a message asking him to run wf-cmd list via exec, he did, returned the output βœ“
β–Ž - The critical test: I sent Hans just /list as a message. He autonomously recognized the slash command, called wf-cmd.py list, and
returned the output. End-to-end, no LLM interpretation of the command itself. This is the thing I was most worried about. βœ“
β–Ž - Verified via systemPromptReport that AGENTS.md is freshly injected at each session (8654 chars, untruncated) β€” no cache staleness
issue

β–Ž What I did NOT test yet (deliberately):

β–Ž - A full real project build end-to-end (research β†’ deploy). Would burn tokens and take time β€” planned for tomorrow with a real project
idea.
β–Ž - Failure alert actually arriving on Telegram. The code path is there, template matches your recommendation, but I didn't force a real
failure.
β–Ž - Rejection feedback loop with a real audit reject. Code reviewed, logic looks clean (rejection_history = [] per phase, appended on
check failure, injected only to loop_steps[0] on next iteration), but not exercised.

β–Ž Questions:

β–Ž 1. Anything in this setup that looks obviously wrong or missing?
β–Ž 2. Any failure mode I should stress-test before trusting it with a real project tomorrow?
β–Ž 3. The one thing I'm still a tiny bit unsure about: the slash command detection relies on the LLM to recognize a message starting with
/ as special. In my test with /list, it worked. But is there a more deterministic way to route slash commands that I'm missing, or is
"LLM sees /build in AGENTS.md's routing table and matches" actually good enough for solo scale?

tacit apexBOT
#

This is a really solid build. No fluff: you did the right things.

1) Anything obviously wrong/missing?

Nothing glaring. I'd only add 3 tiny safety rails:

  • Loop cap: max retries for audit rejection loop (prevents infinite churn)
  • Process-group kill on abort: since you use start_new_session=True, abort should kill the whole group (avoids orphaned children that keep running)
  • Lockfile metadata: store run_id, pid, started_at inside lock file (makes doctor/recovery safer)

If you already do these, you're in very good shape.
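
A rough sketch of the process-group kill, assuming the runner was launched with start_new_session=True (so its pgid equals its pid) and that the pgid is read back from the lockfile. abort_group and the return strings are hypothetical names.

```python
import os
import signal
import time


def abort_group(pgid: int, grace_seconds: float = 2.0) -> str:
    """SIGTERM a whole process group, then SIGKILL it after a grace period."""
    if pgid <= 1:
        # killpg(0, ...) would signal our own group; never allow that.
        raise ValueError(f"refusing to signal pgid {pgid}")
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return "already-gone"
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            # Signal 0 only probes for existence. Note that unreaped
            # zombies still count as existing here.
            os.killpg(pgid, 0)
        except ProcessLookupError:
            return "terminated"
        time.sleep(0.1)
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"
    return "killed"
```

Signalling the group rather than the single pid also reaches any agent subprocesses the runner spawned, which is the point of the second safety rail.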


2) Stress tests before tomorrow

If you run just 4 tests tonight, run these:

  1. Forced failure alert (must hit Telegram end-to-end)
  2. Double-trigger race (/build twice quickly): only one run survives
  3. Crash recovery (kill -9 runner): doctor detects stale state correctly
  4. Real audit rejection path (at least one reject -> retry -> continue)

That gives high confidence.


3) Slash-command detection determinism

You're right: right now it's still "LLM notices /... and decides to call wf-cmd".
It worked in test, and for solo scale it's usually fine.

If you want stronger determinism without re-architecting:

  • require exact syntax (/build <project>)
  • require confirmation (/confirm <run_id>) before execution
  • instruct Hans: all messages starting with / must call wf-cmd first

If you want true hard determinism, that only comes from parsing commands before OpenClaw/LLM (external edge).
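
As an illustration of what "exact syntax" parsing can look like, here is a small sketch using the command list and project-name regex mentioned earlier in the thread. Which commands take a project argument is an assumption, and parse_command is a made-up name.

```python
import re

COMMANDS = {"build", "status", "abort", "retry", "list"}
NEEDS_PROJECT = COMMANDS - {"list"}  # assumption: only /list takes no argument
PROJECT_RE = re.compile(r"^[a-zA-Z0-9_-]{1,64}$")


def parse_command(text: str):
    """Return (command, project) for a well-formed slash command, else None."""
    parts = text.strip().split()
    if not parts or not parts[0].startswith("/"):
        return None
    command = parts[0][1:]
    if command not in COMMANDS:
        return None
    if command in NEEDS_PROJECT:
        # Exactly one argument, and it must pass the name regex.
        if len(parts) != 2 or not PROJECT_RE.match(parts[1]):
            return None
        return command, parts[1]
    return (command, None) if len(parts) == 1 else None
```

Anything that does not parse returns None, so there is no fuzzy matching: "/build log-watcher" is a command, "please build log-watcher" is not.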


Net: this is good enough to run real work tomorrow.
You're no longer in overengineering land. This is clean and practical.

rugged sky
#

Follow-up β€” implemented your safety rails + stress tests

β–Ž Krill, did the work. Final sanity check before I trust it with real projects tomorrow.

β–Ž Safety rails implemented:

β–Ž 1. Process-group kill on abort β€” lockfile now stores pgid in addition to pid. abort_build calls os.killpg(pgid, SIGTERM) β†’ SIGKILL after 2s. Falls back
to single-pid kill if pgid missing. Verified in test: output shows Process group 9475 SIGTERM/SIGKILL instead of single-pid.
β–Ž 2. Run_id in lockfile β€” lockfile now contains pid, pgid, run_id, started_at, project. Error messages on lock conflict show all of them. Lock is
acquired AFTER run_id generation so it goes in from the start.
β–Ž 3. Audit rejection loop cap β€” already had this: max_rejections_per_phase: 3 and max_iterations: 50 in the YAML, enforced by the runner. Confirmed via
code review.

β–Ž Stress tests run:

▎ Test: Double-trigger race (/build twice quickly)
▎ First call: process started, lockfile created. Second call: runner exits immediately with code 2 because lockfile exists. Initially wf-cmd didn't see this and falsely reported "Build gestartet" ("build started"). Fixed: wf-cmd now does time.sleep(1) then proc.poll(); if the exit code is non-zero it reads the log and surfaces the error. Second test: first call succeeds, second shows ERROR: Build fuer 'race-test' konnte nicht gestartet werden (exit 2) (in English: "build for 'race-test' could not be started") + the lock conflict details. Clean.

▎ Test: Crash recovery (kill -9 on runner)
▎ Killed the runner process hard. doctor initially reported [ALIVE] because os.kill(pid, 0) succeeds without raising for zombie processes (state Z in /proc). Fixed: added _is_process_alive(pid) helper that reads /proc/<pid>/status and checks State: field, treating Z and X as dead. Now doctor correctly reports [STALE] with clean-up command. Tested with actual defunct process, works.

β–Ž Tests deferred to real project run tomorrow (will see them naturally):

β–Ž - Forced failure alert hitting Telegram end-to-end (will trigger if anything fails in real build)
β–Ž - Real audit rejection loop (will trigger if Audit actually rejects a phase)

β–Ž What the stack looks like now:

β–Ž - workflow-runner.py v3 (~800 lines) with lockfile, input freeze, doctor, process-group kill, zombie detection
β–Ž - project-build.yaml (5 steps, no approvals, reads frozen inputs)
β–Ž - wf-cmd.py (deterministic slash handler, regex validation, exit code propagation)
β–Ž - Hans AGENTS.md with 5-step routing, verified he recognizes /list end-to-end autonomously

β–Ž Questions:

β–Ž 1. The /proc/<pid>/status check β€” is there a more portable way, or is that the right pragmatic approach on Linux?
β–Ž 2. My lockfile uses O_EXCL atomic create. If the runner process survives but somehow loses write access to the lockfile mid-run (disk full, fs
readonly), the finally block fails and the lockfile stays. Worth handling, or edge case not worth the code?
β–Ž 3. Anything else you'd verify before treating this as "ready for real work"?

β–Ž That's it. If you say ship, I'm shipping tomorrow.

tacit apexBOT
#

Ship it. 🦐
This is genuinely solid now.

You closed the exact gaps that usually hurt people (process-group kill, lock metadata, race handling, zombie detection). Nice work.

1) /proc/<pid>/status check

For Linux VPS: that's the right pragmatic approach.
Portable alternative would be psutil, but honestly /proc is simpler + fewer deps for your setup.
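
For reference, a sketch of what such a /proc-based check can look like (Linux only). This is not the author's actual helper, just one way to write it under the description given above.

```python
from pathlib import Path


def is_process_alive(pid: int) -> bool:
    """True if pid exists and is neither a zombie (Z) nor dead (X)."""
    try:
        status = Path(f"/proc/{pid}/status").read_text()
    except (FileNotFoundError, ProcessLookupError):
        return False  # no such process (or it vanished mid-read)
    for line in status.splitlines():
        if line.startswith("State:"):
            # Line looks like "State:\tS (sleeping)"; field 1 is the code.
            fields = line.split()
            return len(fields) > 1 and fields[1] not in ("Z", "X")
    return False
```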

2) Lockfile cleanup failure in finally

Edge case, but worth a tiny handler (small code, high clarity):

  • if unlink fails, log loudly + include errno
  • emit one alert: "lock cleanup failed, manual wf doctor needed"
  • keep doctor as source of truth for cleanup

I would not overbuild this further.
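
That tiny handler could be sketched roughly like this. release_lock and the alert callback are illustrative names; in the real setup the alert would go to Telegram instead of print.

```python
import errno
import logging
from pathlib import Path

log = logging.getLogger("wf")


def release_lock(lock_path: Path, alert=print) -> bool:
    """Remove the lockfile; on failure log the errno and emit one alert."""
    try:
        lock_path.unlink()
        return True
    except FileNotFoundError:
        return True  # already gone, nothing to clean up
    except OSError as exc:
        code = errno.errorcode.get(exc.errno, str(exc.errno))
        log.error("lock cleanup failed for %s: %s (%s)", lock_path, code, exc.strerror)
        alert(f"lock cleanup failed ({code}), manual wf doctor needed: {lock_path}")
        return False
```

It deliberately does not retry or try to repair anything; doctor stays the single place where stale locks are resolved.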


3) Last verification before real work

Do one quick preflight tonight:

  1. Happy-path mini run (small fake project)
  2. Forced-fail run (confirm Telegram failure alert actually arrives)
  3. Abort mid-phase (confirm all child processes die)

If those pass, you're ready for real projects tomorrow.

So yes: go live.
If this fixed it, please right-click the solution message → Apps → ✅ Mark Solution so others can find it on Answer Overflow.

rugged sky
#

Follow-up β€” implemented your safety rails + stress tests

β–Ž Krill, did the work. Final sanity check.

β–Ž Safety rails implemented:

β–Ž 1. Process-group kill on abort β€” lockfile stores pgid, abort calls os.killpg(pgid, SIGTERM) β†’ SIGKILL after 2s. Verified: output shows Process group X
SIGTERM/SIGKILL.
β–Ž 2. Run_id in lockfile β€” lockfile has pid, pgid, run_id, started_at, project. Error messages show all on conflict.
β–Ž 3. Rejection loop cap β€” already had max_rejections_per_phase: 3 + max_iterations: 50.

β–Ž Stress tests:

β–Ž Double-trigger race: First /build starts, second exits code 2 on lock conflict. Initially wf-cmd didn't see the exit code β€” fixed by adding
time.sleep(1) + proc.poll() check, surfaces the error with log excerpt.

β–Ž Crash recovery (kill -9): Killed runner, lockfile stayed. doctor initially reported [ALIVE] because os.kill(pid, 0) returns True for zombies. Fixed by
reading /proc/<pid>/status and checking State: field β€” treats Z/X as dead. Now correctly reports [STALE] with clean-up command.

β–Ž Deferred to real run tomorrow:
β–Ž - Forced failure alert on Telegram (will trigger on any real failure)
β–Ž - Real audit rejection (will trigger if Audit rejects a phase)

β–Ž One question: Anything else you'd verify before treating this as ready for real work? Or ship it?

tacit apexBOT
#

You're good. I'd call this ready to ship.

Only one thing before real work: do a 10-minute canary tonight instead of waiting for tomorrow's "natural failure".

  1. Force one step to fail intentionally → confirm Telegram failure alert arrives with actionable text.
  2. Force one audit reject intentionally → confirm rejection feedback loop behaves exactly as expected.

If those two pass, ship with confidence tomorrow.
You've done the hard parts right.

rugged sky
#

Canary done: ship report

▎ Krill, ran the canary. One passed cleanly, one hit an interesting snag.

▎ Test 1: Forced step failure + Telegram alert: PASSED

▎ Created a workflow with a check that can never match (contains: "NEVER_MATCH_XYZ123"). First run I placed it under workflows/test/ and the alert didn't fire, because my runner has is_project_build = project and not workflow_name.startswith("test/") to keep test runs silent. Good safety, wrong place for the canary. Moved it to workflows/canary-fail.yaml (top level) and re-ran. Step failed, workflow status FAILED, and "Telegram-Nachricht gesendet" ("Telegram message sent") logged at the end. Alert went out with the template: Project, Run ID, Phase, one-line root cause, next actions, log path. Clean.

▎ Test 2: Forced audit rejection: not reproducible by design

▎ Tried to force Audit to reject by giving it a task prompt saying "respond with exactly: REJECTED: test text". Audit ignored it and returned APPROVED on the first attempt. Makes sense: Audit is configured as an independent quality gate that reviews the actual engineer output, not a parrot that follows task prompts verbatim. I can't force a fake reject without compromising Audit's integrity, which would defeat the point of having it.

▎ So the rejection feedback loop is verified by code review only:
▎ - rejection_history = [] per phase (no leak between phases)
▎ - Appended on check_failed with up to 2000 chars of audit output
▎ - Injected only into loop_steps[0] (engineer) on next iteration, formatted as "VORHERIGES REVIEW-FEEDBACK / Versuch N — REJECTED: ..." (in English: "previous review feedback / attempt N")
▎ - Quality gate stays clean (no feedback injection, reviews fresh)
▎ - Cap enforced via max_rejections_per_phase

▎ It will trigger naturally when Audit actually finds something wrong. I'll watch it on the first real run.

▎ Shipping tomorrow. Thanks for walking this all the way through. Going from "should I use n8n" to "ship the minimal thing" in one evening was exactly the right trajectory. 🦐

tacit apexBOT
#

This is a great ship report. ✅
You handled this exactly right.

  • Failure alert path: proven end-to-end
  • Locking/race/crash recovery: proven
  • Rejection loop logic: sane + scoped cleanly

You're good to go tomorrow.

If you want one optional extra confidence test later: inject a deliberate tiny defect in one build phase (not by prompting Audit, but by making engineer output include a known bad marker) so Audit naturally rejects and the loop executes once. But it's optional, not a blocker for launch.

Ship it. 🚀
And nice work grinding this from architecture confusion to a clean operational system in one evening.

If this thread helped, please right-click the solution message → Apps → ✅ Mark Solution so others can find it via Answer Overflow.

rugged sky
#

Live report β€” discovery phase worked, about to press /build

β–Ž Krill, quick update before I press /build log-watcher. Want your read on one small behavior I noticed.

β–Ž What happened so far:

β–Ž Restarted the gateway to reload Hans's new AGENTS.md. Verified /list still works autonomously (he calls wf-cmd via exec without any prompting). Then I
sent him a real project idea on Telegram:

β–Ž "Hans, project idea: a log-watcher for our openclaw-logs under /root/openclaw-logs/. Daily cron, sends a summary report to Telegram. Keep it simple.
What do you think?"

β–Ž Hans's response β€” mostly perfect:

β–Ž 1. Didn't jump to build. Instead said "let me check what's actually in there" and exec'd ls -la /root/openclaw-logs/ + cat'd several log files.
β–Ž 2. Came back with substantive findings: "Yesterday API costs $29.45 (hans alone $28.68), here are the tiers of logs, here's what's worth watching
daily".
β–Ž 3. Proposed a concrete report format with emojis.
β–Ž 4. In parallel, Max (spawned at some point) wrote a structured RESEARCH.md (16KB) with log inventory, formats, rotation patterns β€” deep research, not
fluff.

β–Ž Small concern:

β–Ž At the end of his first response, Hans said "When you say go β€” I'll delegate it straight to Elon." That's the old pre-workflow behavior. He didn't
actually do it β€” just signaled intent. I responded by asking 5 more Discovery questions (name, thresholds, format, fallback, deployment). He answered all
cleanly, then I asked him to write BRIEF.md and wait. He wrote a structured 2.4KB BRIEF.md, replied "Ready for /build log-watcher" β€” no more push to
delegate directly.

β–Ž So the behavior corrected itself after one extra Discovery turn, but the initial impulse was still there.

β–Ž My questions:

β–Ž 1. Is this worth tightening in AGENTS.md (e.g. explicit rule "never offer to delegate, only suggest /build when Discovery is complete"), or is this
normal LLM drift that I should accept?
β–Ž 2. Anything specific I should watch for when I press /build in a moment? The runner launches detached, I have to monitor via ssh/logs.
β–Ž 3. After this run: what's the most valuable thing you'd want to hear about in the post-run report β€” so I collect the right signals while it's running?

β–Ž Pressing /build in about 2 minutes unless you flag something.

tacit apexBOT
#

No blocker. Press /build. You're in good shape.

Quick answers:

  1. Yes, tighten AGENTS.md.
    That "I'll delegate to Elon" line is harmless drift, but worth fixing. Add one hard rule:
    • Never promise delegation/build unless /build was executed and a run_id exists.
    • Preferred wording before build: "I'm ready when you run /build <project>."
  2. What to watch right now (live run):
    • lock acquired + run_id printed
    • frozen inputs path logged (inputs/BRIEF.md, inputs/RESEARCH.md)
    • phase transitions in order: research → spec → architecture → build loop → deploy
    • any audit rejects + retries (count)
    • lock removed at end (success/fail/abort)
    • final Telegram status message sent
  3. Best post-run report signals to collect:
    • per-phase duration
    • total retries/rejections in build loop
    • any manual intervention needed (yes/no)
    • final artifact paths + quality of output
    • one thing that surprised you (good or bad)

Go for it. This is exactly the right moment to run the first real one.

rugged sky
#

First real run β€” completed, but with surprises

β–Ž Krill, the run finished. 40 minutes, 1.1M tokens, 6 phases built, 2 natural audit rejects with successful retries. But two serious behavioral issues I
need your read on.

β–Ž The good:

β–Ž - Discovery phase worked: Hans spawned Max for real research (16KB RESEARCH.md), proposed a concrete report format, iterated with me, wrote BRIEF.md on
request
β–Ž - /build log-watcher was recognized autonomously, wf-cmd launched detached, lock acquired with pid/pgid/run_id
β–Ž - Input freeze worked (inputs/BRIEF.md + inputs/RESEARCH.md copied)
β–Ž - Phases ran in order: research (194s) β†’ spec (78s) β†’ architecture (298s) β†’ build loop (6 phases) β†’ deploy (50s)
β–Ž - Rejection feedback loop triggered naturally: Audit rejected phase-01 after 372s engineer work, engineer retry with feedback succeeded in 43s. Same
for phase-02 (113s reject β†’ 22s successful retry). Phases 3-6 first-try approved. That's the canary test we couldn't force last night, now proven live.
β–Ž - Lock released cleanly at end, exit 0, no zombies

▎ The ugly:

▎ Issue 1: Hans didn't stay silent during the build.
▎ After /build, Hans correctly replied "Workflow läuft. Meld mich wenn's fertig ist." ("Workflow is running. I'll report back when it's done."), which is good. But then, while the workflow was still running the build loop, Hans started running the unfinished log_watcher.py himself and sending test reports to me on Telegram. Between 02:06 and 02:28 he sent 8+ test reports, most of them duplicates, all while the workflow was still building. My AGENTS.md says "after /build, the workflow runs deterministic in background, you wait for exit event". He ignored that.

▎ Issue 2: Duplicate + corrupted messages.
▎ The same Telegram message was sent 7-8 times with minor variations. All reports had 💰 API Kosten: ? ("API costs: ?", an unfilled placeholder). Some contained leaked markdown from audit/spec notes like ### Critical Findings and Pure bash script (~2–3h Elon build time)..., content that clearly belongs to build-phase briefs, not a daily report. Looks like Hans ran the scaffold code, got an error/empty output, retried, got the same thing, kept spamming.

▎ Issue 3: 1.1M tokens for a simple log watcher feels way too high.
▎ Architecture created 6 phases for what should be a single-file Python script + systemd timer. Phase count might need a cap, or the system prompt should push for simplicity more aggressively.

β–Ž Questions:

β–Ž 1. Issue 1 is the critical one: how do I make Hans actually stay passive between /build and the workflow-exit event? My AGENTS.md has "never execute,
never test during build" but he did it anyway. Do I need a harder gate, or is this an LLM discipline problem that can't be fully fixed by prompts?
β–Ž 2. Issue 2 β€” duplicate message spam is concerning for production. Any guess whether this is Hans retrying an exec on a failing script, or something
worse (session loop)?
β–Ž 3. Issue 3 β€” phase count / scope bloat: is there a known pattern for keeping Architect honest about small projects? "Max 3 phases for projects under X
scope"?

β–Ž Architecturally the system worked. Behaviorally Hans went off-script during the wait. Would love your read.

tacit apexBOT
#

Great report. The architecture is working, and you found the real weak point: policy enforcement is still prompt-level.

1) Keeping Hans passive during build

You're exactly right: prompt rules alone are not enough.
Use a hard gate:

  • When /build starts, set project state = BUILDING
  • While BUILDING, allow Hans only:
    • /status
    • /abort
    • final completion/failure relay
  • Block everything else (exec, manual tests, proactive sends)

So: yes, this needs deterministic gating, not more prompt text.



2) Duplicate/corrupted Telegram spam

Most likely this is Hans repeatedly executing/testing during build (not a core session loop bug).

Add two protections now:

  1. Outbound dedupe throttle
    • hash(message_text + project)
    • if same hash sent in last N minutes, drop
  2. Build-mode outbound guard
    • during BUILDING, non-runner messages are denied

That should eliminate most of this class of problem right away.
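
A minimal in-memory sketch of the dedupe throttle. The class name and the 10-minute default window are assumptions, and an in-memory dict does not survive a restart, so a persistent store would be needed if that matters.

```python
import hashlib
import time
from typing import Dict, Optional


class OutboundDedupe:
    """Drop a message if the same text was sent for the same project recently."""

    def __init__(self, window_seconds: float = 600.0):
        self.window = window_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, project: str, text: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        key = hashlib.sha256(f"{project}\0{text}".encode("utf-8")).hexdigest()
        # Forget entries that have aged out of the window.
        self._last_sent = {
            k: t for k, t in self._last_sent.items() if now - t < self.window
        }
        if key in self._last_sent:
            return False
        self._last_sent[key] = now
        return True
```

Note this only catches exact repeats. The "minor variations" described above would slip through unless the text is normalized before hashing, which is why the build-mode guard is the stronger of the two protections.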


3) Phase/scope bloat (1.1M tokens)

Add a planning constraint + enforcement:

  • classify scope (small|medium|large) at build start
  • for small:
    • max phases: 2–3
    • single process, no framework sprawl
    • target LOC budget
  • if architecture proposes > cap, auto-fail planning step with "simplify and resubmit"

Prompt guidance helps, but enforce with a deterministic check in runner.
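
The deterministic check could be as small as the sketch below, assuming the architecture step writes one phase-*.md brief per phase into a briefs folder. The cap for small comes from the list above; the medium and large caps are placeholder values, and check_phase_cap is a hypothetical name.

```python
from pathlib import Path

# "small" follows the 2-3 phase guidance; the other caps are placeholders.
PHASE_CAPS = {"small": 3, "medium": 6, "large": 12}


def check_phase_cap(briefs_dir: Path, scope: str):
    """Return (ok, feedback); feedback is fed back to the architecture step."""
    cap = PHASE_CAPS[scope]
    count = len(list(briefs_dir.glob("phase-*.md")))
    if count > cap:
        return False, (
            f"Architecture proposed {count} phases; cap for {scope} scope "
            f"is {cap}. Simplify and resubmit."
        )
    return True, ""
```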


TL;DR

  • ✅ System works
  • ❌ Hans behavior during build must be tool/path gated
  • ✅ Add build-mode guard + dedupe + phase cap and you're in a much safer production posture

If you want, I can help you define the exact minimal state machine for this guard (IDLE -> DISCOVERY -> BUILDING -> DONE/FAILED) in 10 lines of logic.

rugged sky
#

β–Ž Yes please β€” the minimal state machine in 10 lines would be exactly what I need. I was about to ask for it.

β–Ž A few clarifications so your sketch matches my reality:

β–Ž Where the state lives: Currently I have no central state store. Workflow runs live in runs/.json, locks in locks/.lock, Hans's AGENTS.md is a prompt.
Should the project state (IDLE/DISCOVERY/BUILDING/DONE/FAILED) live in a new file (e.g. state/<project>.json), in the lockfile itself, or derived from
run status + lock presence?

β–Ž Where the guard runs: Hans is an LLM β€” he can't enforce his own restrictions. So the guard has to live in a tool/wrapper he calls. Options I see:
β–Ž a) A "gatekeeper" wrapper that Hans must call before any exec (can_i_send_message, can_i_run_exec) β€” slow but deterministic
β–Ž b) A pre-hook on OpenClaw's Telegram outbound (not sure if 2026.3.11 supports this)
β–Ž c) Post-hoc filtering of Hans's output against the state store, dropping non-whitelisted messages during BUILDING

β–Ž Which of these matches your mental model?

β–Ž Outbound dedupe: Where does the dedupe table live? In-memory in a service (requires a process), SQLite (simple), or a flat file with timestamp pruning?

β–Ž Phase cap enforcement: I like the "auto-fail planning step if phase count > cap" idea. That's a pure runner change β€” after the architecture step writes
briefs/phase-*.md, count them, if > cap for classified scope β†’ fail the step with "simplify and resubmit" and re-run architecture (with feedback, like
our rejection loop).

β–Ž Would you pick SQLite or flat files for the state? Keep in mind: solo operator, Hetzner VPS, Docker, no Redis, no Postgres currently.

β–Ž Once I see your 10-line state machine I'll implement it and run the log-watcher test again to see the delta.

tacit apexBOT
#

Perfect. This is exactly where a tiny deterministic core helps.

My picks for your setup (solo, Docker, no Redis/Postgres)

  • State store: SQLite (state.db) ✅
  • Guard location: (a) gatekeeper wrapper ✅
  • Pre-hook on Telegram outbound (b): not the path on your current version
  • Post-hoc filtering (c): optional safety net, but not primary control
  • Dedupe store: same SQLite file ✅

Flat files can work, but once you need state + dedupe + atomic transitions, SQLite is actually simpler than rolling your own file locking logic.


Minimal state machine (10-line version)

states = {IDLE, DISCOVERY, BUILDING, DONE, FAILED}

on user_message(project): if state in {IDLE, DONE, FAILED} -> DISCOVERY
on /build(project): require state in {DISCOVERY, DONE, FAILED} and no active lock; start run; state=BUILDING
on run_started(run_id): state=BUILDING, current_run_id=run_id
on run_succeeded(run_id): if current_run_id==run_id -> state=DONE
on run_failed(run_id): if current_run_id==run_id -> state=FAILED
on /abort(project): if state==BUILDING -> cancel run; state=FAILED

guard exec/send:
if state==BUILDING: allow only {/status, /abort, runner-final-report}
else: allow normal discovery actions

That's enough.
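
The same transitions, written out as a small in-memory Python sketch. The function and class names are illustrative, and the lock check is passed in as a flag rather than read from disk.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, DISCOVERY, BUILDING, DONE, FAILED = "IDLE", "DISCOVERY", "BUILDING", "DONE", "FAILED"
ALLOWED_WHILE_BUILDING = {"/status", "/abort", "runner-final-report"}


@dataclass
class Project:
    state: str = IDLE
    current_run_id: Optional[str] = None


def on_user_message(p: Project) -> None:
    if p.state in (IDLE, DONE, FAILED):
        p.state = DISCOVERY


def on_build(p: Project, run_id: str, lock_active: bool = False) -> bool:
    """Start a run only from DISCOVERY/DONE/FAILED and with no active lock."""
    if p.state not in (DISCOVERY, DONE, FAILED) or lock_active:
        return False
    p.state, p.current_run_id = BUILDING, run_id
    return True


def on_run_finished(p: Project, run_id: str, succeeded: bool) -> None:
    # Events for a run that is not the current one are ignored.
    if p.current_run_id == run_id:
        p.state = DONE if succeeded else FAILED


def on_abort(p: Project) -> bool:
    if p.state != BUILDING:
        return False
    p.state = FAILED
    return True


def guard(p: Project, action: str) -> bool:
    """Gate for exec/send: while BUILDING only the whitelist passes."""
    return p.state != BUILDING or action in ALLOWED_WHILE_BUILDING
```

A gatekeeper wrapper would call guard() before every exec or outbound send and refuse when it returns False.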


Tiny SQLite schema (minimal)

  • project_state(project_id PRIMARY KEY, state, current_run_id, updated_at)
  • outbound_dedupe(project_id, msg_hash, sent_at, PRIMARY KEY(project_id, msg_hash))

Use one helper function for atomic state transition:
UPDATE ... WHERE state IN (allowed_from_states) and check rows affected.
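
A sketch of that schema and helper with Python's built-in sqlite3 module. Column types and the function names open_db and transition are assumptions; the table and column names follow the schema above.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS project_state (
    project_id     TEXT PRIMARY KEY,
    state          TEXT NOT NULL,
    current_run_id TEXT,
    updated_at     REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS outbound_dedupe (
    project_id TEXT NOT NULL,
    msg_hash   TEXT NOT NULL,
    sent_at    REAL NOT NULL,
    PRIMARY KEY (project_id, msg_hash)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def transition(conn, project_id, allowed_from, new_state, run_id=None) -> bool:
    """Move to new_state only if the current state is in allowed_from."""
    placeholders = ",".join("?" for _ in allowed_from)
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE project_state "
            "SET state = ?, current_run_id = COALESCE(?, current_run_id), updated_at = ? "
            f"WHERE project_id = ? AND state IN ({placeholders})",
            (new_state, run_id, time.time(), project_id, *allowed_from),
        )
    # Zero rows affected means the project was not in an allowed state.
    return cur.rowcount == 1
```

Because the state check and the write are one UPDATE statement, two near-simultaneous /build calls cannot both succeed: the second one finds state = BUILDING and affects zero rows.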


Your options mapping

  • You asked a/b/c: my model is a (+ optional c).
  • b (pre-hook outbound) is not the thing to bet on here.