#Multi agents working together


nimble phoenix

Hi Krill!

I'm running 9 agents on an Ubuntu VPS using OpenClaw v2026.3.7+. Before I go deeper on config, I want to understand the right architecture for what I'm building.
What I want:

Agents coordinate autonomously — passing work, observations, and status without me in the loop
When Agent A finishes something that affects Agent B, B finds out and acts immediately — not at the next 30-minute heartbeat
Every agent has standing orders it runs on a schedule, plus reactive work it picks up as it arrives
I can verify after the fact that each agent did what it was supposed to do — not just that the process stayed alive
Blocked work gets picked up without me intervening
The core requirements:

100% reliability on agent-to-agent message delivery — no silent drops
Near-real-time handoffs — work moves between agents in seconds or minutes, not heartbeat cycles
Full traceability — I can see what each agent read, wrote, and acted on in any given window
Accountability: "agent ran standing order and found nothing" is distinguishable from "agent didn't run it at all"

My question: What's the right OpenClaw pattern for this? What software should be included? And what does auditability on standing order execution actually look like in this framework? Is that built in, or do I need to wire it myself? I want to build to what the framework is designed to do, not hack around its limits.

If the answer is that OpenClaw doesn't fully support this natively, I want to know that too — I'd rather hear it now than find out after I've built on top of it.

warm elmBOT

Short honest answer: OpenClaw is good for the agent runtime, but it is not by itself a hard-guarantee multi-agent message bus.

What I’d recommend:


For agent-to-agent handoffs:

My blunt take on your requirements:

  • 100% reliable delivery: not natively guaranteed
  • Near-real-time handoffs: yes
  • Full traceability: mostly yes
  • “Ran and found nothing” vs “didn’t run”: you need to wire part of this yourself
  • Blocked work auto-pickup: better handled by an external queue/retry layer

Why I’m saying that plainly:

  • OpenClaw docs explicitly describe some completion paths as best-effort
  • background tasks can become lost
  • subagent announce-back can be lost on gateway restart

So the clean architecture is:

  1. OpenClaw = cognition/runtime
  2. Postgres/Redis queue = durable work ledger
  3. Cron = standing-order scheduler
  4. Tasks/transcripts = operator audit trail

For 9 agents on one VPS, I’d use:


https://docs.openclaw.ai/tools/lobster

My recommendation:

  • If “pretty reliable” is enough, pure OpenClaw is fine.
  • If “no silent drops, prove every run, auto-requeue blocked work” is the bar, go hybrid.

If you want, I can sketch the exact 9-agent pattern I’d use next, including which parts should be cron, which should be queue-driven, and which OpenClaw config knobs to turn on.

nimble phoenix

Yes, sketch the exact 9-agent pattern. I want to know: which standing orders belong in cron vs heartbeat, how the Postgres jobs table should be structured for 9 agents, where Lobster fits vs TaskFlow, and what the dispatcher looks like. Assume I'm comfortable with Postgres and Python.

warm elmBOT

Yep. If it were my VPS, I’d do hub-and-spoke, not 9 agents free-chatting with each other.

1) The 9-agent shape I’d actually run

Control layer

  1. coordinator
    Breaks goals into jobs, emits follow-up work, does re-planning.
  2. auditor
    Watches cron/task health, expired leases, stuck jobs, missing standing-order runs.

Worker layer

  3-8. six specialists
    One domain each. They do real work, but they do not hand work directly to each other as the source of truth.

Delivery layer

  9. publisher
    Turns finished work into human-facing summaries, reports, outbound messages.


Important rule:
Specialists communicate through Postgres jobs/events, not ad hoc peer-to-peer prompting.
Use sessions_send for steering/debugging/inspection, not as the durable work bus.

With 9 agents, all-to-all handoffs become spaghetti fast.


2) Cron vs heartbeat

Put in cron

Anything that must be:

  • provably executed
  • on an exact schedule
  • retried/reconciled
  • distinguishable as no_work vs did_not_run

That means:

  • specialist standing orders
  • coordinator backlog rebalance
  • auditor sweeps
  • publisher digests/reports

Put in heartbeat

Only cheap, non-authoritative awareness:

  • “anything weird happened?”
  • “any completed tasks need a human summary?”
  • soft liveness/context checks

My recommendation for 9 agents

  • coordinator: optional heartbeat every 15m, human-facing only
  • auditor: no heartbeat, use cron
  • all specialists: heartbeat off
  • publisher: heartbeat off

create type job_status as enum (
  'queued',
  'leased',
  'running',
  'blocked',
  'succeeded',
  'failed',
  'dead'
);

create table jobs (
  id uuid primary key default gen_random_uuid(),
  parent_job_id uuid references jobs(id),
  source_agent text,
  target_agent text not null,
  job_type text not null,          -- scheduled|reactive|recovery|delivery
  subject text not null,           -- your business operation name
  priority int not null default 100,
  dedupe_key text,
  payload jsonb not null default '{}',
  status job_status not null default 'queued',
  run_after timestamptz not null default now(),
  leased_until timestamptz,
  lease_owner text,
  attempt_count int not null default 0,
  max_attempts int not null default 5,
  blocked_reason text,
  last_error text,
  openclaw_task_ref text,
  openclaw_session_key text,
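  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  completed_at timestamptz
);

-- A table-level UNIQUE constraint cannot take a WHERE clause,
-- so the conditional dedupe rule is a partial unique index.
create unique index jobs_dedupe_idx
  on jobs (target_agent, dedupe_key)
  where dedupe_key is not null;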

create index jobs_ready_idx
  on jobs (target_agent, run_after, priority desc, created_at)
  where status = 'queued';

job_dependencies

If job B must wait for job A.

create table job_dependencies (
  job_id uuid not null references jobs(id) on delete cascade,
  depends_on_job_id uuid not null references jobs(id) on delete cascade,
  primary key (job_id, depends_on_job_id)
);
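
A minimal sketch of the readiness rule this table implies, assuming a job may run only after every job it depends on has reached 'succeeded'. The function name and the in-memory inputs are illustrative only; in Postgres the same rule would more likely be a NOT EXISTS subquery added to the claim query.

```python
from collections.abc import Iterable, Mapping


def runnable_jobs(
    status_by_job: Mapping[str, str],
    dependencies: Iterable[tuple[str, str]],
) -> list[str]:
    """Return queued job ids whose dependencies have all succeeded.

    `dependencies` holds (job_id, depends_on_job_id) pairs, mirroring the
    rows of the job_dependencies table.
    """
    blocked_by: dict[str, set[str]] = {}
    for job_id, depends_on in dependencies:
        blocked_by.setdefault(job_id, set()).add(depends_on)

    ready = []
    for job_id, status in status_by_job.items():
        if status != "queued":
            continue
        deps = blocked_by.get(job_id, set())
        # A dependency that is missing or not yet succeeded keeps the job waiting.
        if all(status_by_job.get(dep) == "succeeded" for dep in deps):
            ready.append(job_id)
    return ready
```

With statuses `{"a": "succeeded", "b": "queued", "c": "queued", "d": "running"}` and dependencies `[("b", "a"), ("c", "d")]`, only job `b` is runnable, because `c` still waits on the running job `d`.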

events

Append-only facts. This is how agents “announce” things durably.

create table events (
  id bigserial primary key,
  topic text not null,
  source_agent text,
  source_job_id uuid references jobs(id),
  dedupe_key text,
  payload jsonb not null,
  created_at timestamptz not null default now()
);

standing_order_runs

This is the table that gives you:

  • no_work
  • completed
  • blocked
  • failed
  • and, by absence, did_not_run

create type standing_order_status as enum (
  'started',
  'no_work',
  'completed',
  'blocked',
  'failed'
);

create table standing_order_runs (
  id uuid primary key default gen_random_uuid(),
  agent_id text not null,
  order_name text not null,
  scheduled_for timestamptz not null,
  started_at timestamptz,
  finished_at timestamptz,
  status standing_order_status not null,
  summary text,
  evidence jsonb not null default '{}',
  openclaw_task_ref text,
  openclaw_session_key text,
  unique (agent_id, order_name, scheduled_for)
);

That last table is the missing accountability layer OpenClaw does not fully give you semantically by itself.
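
To make the "by absence" idea concrete, here is a small sketch of a gap check. It assumes a fixed-interval schedule and that the recorded rows for one (agent_id, order_name) pair have already been read into a dict keyed by scheduled_for; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def expected_slots(start: datetime, end: datetime, every: timedelta) -> list[datetime]:
    """Scheduled times in [start, end) for a fixed-interval standing order."""
    slots = []
    current = start
    while current < end:
        slots.append(current)
        current += every
    return slots


def classify_window(
    slots: list[datetime],
    recorded: dict[datetime, str],
) -> dict[datetime, str]:
    """Map each scheduled slot to its recorded status, or 'did_not_run'.

    `recorded` maps scheduled_for to status, as it would be read from
    standing_order_runs for one agent and one order.
    """
    return {slot: recorded.get(slot, "did_not_run") for slot in slots}
```

A slot that still shows 'started' long after its scheduled time is a third case worth flagging separately: the run began but never reported an outcome.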


4) What the dispatcher should do

Make the dispatcher a small Python service.


Use Postgres for truth

  • LISTEN/NOTIFY is only a wake-up hint
  • the table is truth
  • missed notifications are okay because the sweeper loop re-checks
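
One way to sketch the "hint, not truth" loop in plain asyncio. This assumes some listener task (not shown) sets an asyncio.Event whenever a NOTIFY arrives; the function name is made up for illustration.

```python
import asyncio


async def wait_for_wakeup(wakeup: asyncio.Event, timeout: float) -> bool:
    """Wait for a wake-up hint, but never longer than `timeout` seconds.

    Returns True if the hint arrived and False if the timeout fired. Either
    way the caller re-reads the jobs table, so a lost hint only adds latency.
    """
    try:
        await asyncio.wait_for(wakeup.wait(), timeout)
    except asyncio.TimeoutError:
        return False
    # Reset the event so the next call waits for a fresh hint.
    wakeup.clear()
    return True
```

The dispatcher loop would call this at the end of each pass and then query the table regardless of the return value.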

Claim pattern

Use FOR UPDATE SKIP LOCKED and leases.
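
A hedged sketch of what that can look like as a single statement, so the row locks are still held when the status flips to 'leased'. Column names follow the jobs table above; the function name, its parameters, and the assumption that `conn` is a psycopg 3 AsyncConnection are mine.

```python
from typing import Any

# Select and lease in one statement so the row locks taken by
# FOR UPDATE SKIP LOCKED are still held when the status changes.
CLAIM_AND_LEASE_SQL = """
with candidate as (
  select id
  from jobs
  where status = 'queued'
    and run_after <= now()
  order by priority desc, created_at
  limit %(limit)s
  for update skip locked
)
update jobs j
set status = 'leased',
    lease_owner = %(lease_owner)s,
    leased_until = now() + %(lease_seconds)s * interval '1 second',
    attempt_count = j.attempt_count + 1,
    updated_at = now()
from candidate c
where j.id = c.id
returning j.*;
"""


async def claim_and_lease(
    conn: Any,
    lease_owner: str,
    limit: int = 20,
    lease_seconds: int = 600,
) -> list[Any]:
    """Claim up to `limit` ready jobs; `conn` is assumed to be a psycopg 3 AsyncConnection."""
    cur = await conn.execute(
        CLAIM_AND_LEASE_SQL,
        {"limit": limit, "lease_owner": lease_owner, "lease_seconds": lease_seconds},
    )
    rows = await cur.fetchall()
    await conn.commit()
    return rows
```

Two dispatchers running this at the same time skip each other's locked rows instead of blocking, so each job is leased by at most one of them.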

Dispatcher responsibilities

  1. claim ready jobs
  2. enforce per-agent concurrency
  3. wake the target OpenClaw agent
  4. renew/reap leases
  5. requeue blocked/stuck jobs when policy says so
  6. write OpenClaw refs back onto the job row
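
For responsibility 2, one possible shape is a semaphore per agent. This is only a sketch; the class name and the limits dict are illustrative, and nothing here is an OpenClaw API.

```python
import asyncio
from collections.abc import Awaitable, Callable


class AgentSlots:
    """Caps how many jobs run at once for each agent."""

    def __init__(self, limits: dict[str, int], default: int = 1) -> None:
        self._limits = limits
        self._default = default
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def _semaphore(self, agent: str) -> asyncio.Semaphore:
        # Create the semaphore lazily so unknown agents get the default limit.
        if agent not in self._semaphores:
            limit = self._limits.get(agent, self._default)
            self._semaphores[agent] = asyncio.Semaphore(limit)
        return self._semaphores[agent]

    async def run(self, agent: str, work: Callable[[], Awaitable[None]]) -> None:
        """Run `work` once a slot for `agent` is free."""
        async with self._semaphore(agent):
            await work()
```

A semaphore makes extra jobs wait rather than skip, so it is probably best to acquire the slot before leasing the job; otherwise the lease clock runs while the job sits in line.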

Wake mechanism


Simplest path: call OpenClaw’s /hooks/agent endpoint with deliver: false.
That gives you immediate isolated runs without waiting for heartbeat.
https://docs.openclaw.ai/automation/cron-jobs

Pseudocode

while True:
    jobs = claim_ready_jobs(limit=20)

    for job in jobs:
        mark_leased(job, lease_owner=HOSTNAME, leased_until=now()+timedelta(minutes=10))

        prompt = build_agent_prompt(job)

        result = call_openclaw_hook_agent(
            agent_id=job.target_agent,
            message=prompt,
            deliver=False,
            timeout_seconds=300,
        )

        if result.ok:
            mark_running_or_complete(job, result)
        else:
            mark_failed_or_requeue(job, result)

    reap_expired_leases()
    sweep_blocked_jobs()
    wait_for_notify_or_timeout()

Prompt contract for every reactive run

Every worker run should be told to do exactly this:

  1. read the job payload
  2. do the work
  3. emit follow-up events/jobs if needed
  4. mark the job completed|blocked|failed
  5. return a short summary

Do not let agents “just reply in chat” and assume that equals completion.
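
A small sketch of enforcing that on the dispatcher side. It assumes a hypothetical convention where the worker's final reply is a JSON object with "status" and "summary" keys; that convention is not something OpenClaw defines.

```python
import json

TERMINAL_STATUSES = {"completed", "blocked", "failed"}


def parse_worker_result(reply: str) -> tuple[str, str]:
    """Return (status, summary) from a worker reply.

    Anything that is not a JSON object with a recognised status counts as
    failed, so a chatty free-text answer is never mistaken for completion.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return "failed", "reply was not valid JSON"
    if not isinstance(data, dict):
        return "failed", "reply was not a JSON object"
    status = data.get("status")
    if not isinstance(status, str) or status not in TERMINAL_STATUSES:
        return "failed", "reply did not include a valid status"
    return status, str(data.get("summary", ""))
```

Treating every unrecognised reply as failed is deliberately strict: the job gets retried or flagged instead of silently closed.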


5) Where Lobster fits vs TaskFlow

Lobster

Use Lobster inside a worker run when the steps are deterministic.


Good fit:

  • fetch data
  • normalize
  • validate
  • write artifact
  • emit event
  • maybe approval gate

So Lobster is the playbook runner inside one job.
https://docs.openclaw.ai/tools/lobster

TaskFlow

Use TaskFlow above tasks when you want flow-level visibility across steps/runs.
For your setup, I would start with Postgres as primary orchestration state, and treat TaskFlow as:

  • secondary orchestration visibility
  • or mirrored flow state
  • not the only source of truth on day 1

That lines up with the docs: TaskFlow is durable flow state above tasks, but your queue can still be primary.
https://docs.openclaw.ai/automation/taskflow

My practical split:

  • Postgres jobs/events = durable coordination
  • Lobster = deterministic per-job execution
  • TaskFlow = cross-job/flow observability, later

6) Exact cron layout I’d use

coordinator

  • every 15m: backlog rebalance
  • daily: strategic replan
  • optional heartbeat 15m for human-facing anomaly summaries

auditor

  • every 5m: expired lease reaper
  • every 10m: standing-order gap check
  • every 15m: openclaw tasks audit style reconciliation
  • daily: audit summary

specialists (each)

  • cron for their scheduled standing orders
  • reactive jobs from dispatcher
  • no heartbeat

publisher

  • reactive on deliverable_ready
  • daily/weekly digest cron
  • no heartbeat

7) OpenClaw config stance

I’d keep:

  • sessions_send, sessions_history, session_status on coordinator and auditor
  • workers mostly without session tools
  • cross-agent messaging enabled, but not as the durable bus

That keeps OpenClaw good at what it’s good at: runtime, isolated runs, cron, transcripts, tasks.


8) The audit story, end to end

For any standing order window, you should be able to answer:

  • Was it scheduled?
    Cron config + openclaw cron runs
  • Did it start?
    standing_order_runs(status='started')
  • What did it do?
    session transcript + tool calls
  • What durable side effects happened?
    events + downstream jobs
  • Did it finish?
    standing_order_runs(status in ...)
  • If not, why not?
    jobs.last_error, task audit, missing cron run

That gives you the distinction you asked for:

  • no_work = it ran, found nothing
  • missing standing_order_runs row = it did not run

If you want, next I can turn this into:

  1. a sample openclaw.json5 multi-agent config, and
  2. a minimal Python dispatcher skeleton with psycopg + LISTEN/NOTIFY + SKIP LOCKED.
nimble phoenix

Yes to both. My 9 agents map to your shape like this:

1 coordinator (strategic routing + replanning)
1 auditor (cron health, lease reaping, standing order gaps)
1 project tracker (coordination across specialists, debrief collection) — is this a second coordinator or does it sit in the specialist layer?
1 knowledge/memory specialist
1 devops specialist
1 product/metrics specialist
1 design specialist
1 security specialist
1 research specialist

I also want a publisher role (human-facing summaries, digests, reports). Should that be a 10th agent or does it fold into an existing role — auditor or coordinator?
Postgres on the same VPS. Python 3.11+. Generate the openclaw.json5 and dispatcher skeleton mapped to these roles.

warm elmBOT

I’d make publisher a 10th agent.

Two direct answers first:

  • Project tracker is not a second coordinator. It’s a program-management specialist with broader read access.
    Coordinator decides routing/replanning. Project tracker assembles status, debriefs, and cross-specialist rollups.
  • Publisher should not fold into auditor. If you must keep 9, merge it into project-tracker, not coordinator/auditor.
    But my real recommendation is 10 agents.

Also, one important note: cron jobs themselves are not usually declared inside openclaw.json5. The JSON5 config enables/configures cron, but the actual schedules are added with openclaw cron add ....

Recommended role map

  1. coordinator
  2. auditor
  3. project-tracker
  4. knowledge
  5. devops
  6. product-metrics
  7. design
  8. security
  9. research
  10. publisher

openclaw.json5 skeleton

{
  agents: {
    defaults: {
      // Keep heartbeat off by default. Only coordinator gets one below.
      heartbeat: {
        every: "0m",
      },

      // Optional global model/defaults go here.
      // model: "YOUR_DEFAULT_MODEL",
    },

    list: [
      {
        id: "coordinator",
        default: true,
        name: "Coordinator",
        workspace: "~/.openclaw/workspace-coordinator",
        agentDir: "~/.openclaw/agents/coordinator/agent",
        heartbeat: {
          every: "15m",
          target: "none",
        },
        tools: {
          allow: [
            "read",
            "web_search",
            "web_fetch",
            "sessions_list",
            "sessions_history",
            "sessions_send",
            "session_status",
          ],
          deny: ["cron", "gateway"],
        },
      },

      {
        id: "auditor",
        name: "Auditor",
        workspace: "~/.openclaw/workspace-auditor",
        agentDir: "~/.openclaw/agents/auditor/agent",
        tools: {
          allow: [
            "read",
            "exec",
            "process",
            "sessions_list",
            "sessions_history",
            "session_status",
          ],
          deny: ["sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "project-tracker",
        name: "Project Tracker",
        workspace: "~/.openclaw/workspace-project-tracker",
        agentDir: "~/.openclaw/agents/project-tracker/agent",
        tools: {
          allow: [
            "read",
            "write",
            "edit",
            "sessions_list",
            "sessions_history",
            "session_status",
          ],
          deny: ["sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "knowledge",
        name: "Knowledge",
        workspace: "~/.openclaw/workspace-knowledge",
        agentDir: "~/.openclaw/agents/knowledge/agent",
        tools: {
          allow: ["read", "write", "edit", "web_search", "web_fetch"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "devops",
        name: "DevOps",
        workspace: "~/.openclaw/workspace-devops",
        agentDir: "~/.openclaw/agents/devops/agent",
        tools: {
          allow: ["read", "write", "edit", "exec", "process", "web_fetch"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "product-metrics",
        name: "Product Metrics",
        workspace: "~/.openclaw/workspace-product-metrics",
        agentDir: "~/.openclaw/agents/product-metrics/agent",
        tools: {
          allow: ["read", "write", "edit", "web_search", "web_fetch"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "design",
        name: "Design",
        workspace: "~/.openclaw/workspace-design",
        agentDir: "~/.openclaw/agents/design/agent",
        tools: {
          allow: ["read", "write", "edit", "web_search", "web_fetch", "image_generate"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "security",
        name: "Security",
        workspace: "~/.openclaw/workspace-security",
        agentDir: "~/.openclaw/agents/security/agent",
        tools: {
          allow: ["read", "write", "edit", "exec", "process", "web_search", "web_fetch"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "research",
        name: "Research",
        workspace: "~/.openclaw/workspace-research",
        agentDir: "~/.openclaw/agents/research/agent",
        tools: {
          allow: ["read", "write", "edit", "web_search", "web_fetch"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },

      {
        id: "publisher",
        name: "Publisher",
        workspace: "~/.openclaw/workspace-publisher",
        agentDir: "~/.openclaw/agents/publisher/agent",
        tools: {
          allow: ["read", "write", "edit", "message", "web_fetch"],
          deny: ["sessions_send", "sessions_spawn", "cron", "gateway"],
        },
      },
    ],
  },

  // Route all human chat to coordinator unless you add explicit bindings.
  bindings: [
    // Example:
    // { agentId: "publisher", match: { channel: "discord", peer: { kind: "channel", id: "1234567890" } } },
  ],

  tools: {
    agentToAgent: {
      enabled: true,
      allow: [
        "coordinator",
        "auditor",
        "project-tracker",
        "knowledge",
        "devops",
        "product-metrics",
        "design",
        "security",
        "research",
        "publisher",
      ],
    },

    sessions: {
      visibility: "agent",
    },
  },

  hooks: {
    enabled: true,
    token: "REPLACE_WITH_DEDICATED_HOOK_TOKEN",
    path: "/hooks",
    maxBodyBytes: 262144,
    defaultSessionKey: "hook:dispatcher",
    allowRequestSessionKey: false,
    allowedSessionKeyPrefixes: ["hook:"],
    allowedAgentIds: [
      "coordinator",
      "auditor",
      "project-tracker",
      "knowledge",
      "devops",
      "product-metrics",
      "design",
      "security",
      "research",
      "publisher",
    ],
  },

  cron: {
    enabled: true,
    maxConcurrentRuns: 4,
    retry: {
      maxAttempts: 3,
      backoffMs: [60000, 120000, 300000],
      retryOn: ["rate_limit", "overloaded", "network", "server_error"],
    },
    sessionRetention: "24h",
    runLog: {
      maxBytes: "2mb",
      keepLines: 2000,
    },
  },
}

If you insist on staying at 9

Delete the publisher agent, and give project-tracker the message tool.


Sample cron jobs mapped to your roles

openclaw cron add \
  --name "Auditor lease reaper" \
  --cron "*/5 * * * *" \
  --session isolated \
  --agent auditor \
  --message "Run auditor standing order: reconcile cron health, detect standing-order gaps, and flag expired leases for requeue." \
  --no-deliver

openclaw cron add \
  --name "Coordinator backlog rebalance" \
  --cron "*/15 * * * *" \
  --session isolated \
  --agent coordinator \
  --message "Run coordinator standing order: review open jobs, blocked work, and rebalance priorities across specialists." \
  --no-deliver

openclaw cron add \
  --name "Project tracker debrief sweep" \
  --cron "0 * * * *" \
  --session isolated \
  --agent project-tracker \
  --message "Run project-tracker standing order: collect specialist debriefs, synthesize status, and update project-level summaries." \
  --no-deliver

openclaw cron add \
  --name "Knowledge consolidation" \
  --cron "0 */6 * * *" \
  --session isolated \
  --agent knowledge \
  --message "Run knowledge standing order: consolidate durable notes, dedupe memory, and summarize newly important facts." \
  --no-deliver

openclaw cron add \
  --name "Publisher daily digest" \
  --cron "0 18 * * *" \
  --session isolated \
  --agent publisher \
  --message "Run publisher standing order: compile the daily digest from completed work and produce the human-facing summary." \
  --no-deliver

Python dispatcher skeleton

This is intentionally small, but it has the right shape: Postgres truth, hook-based wakeups, leases, requeue sweep.

# dispatcher.py
from __future__ import annotations

import asyncio
import json
import os
import socket
from collections import defaultdict
from dataclasses import dataclass
from datetime import timedelta
from typing import Any

import httpx
import psycopg
from psycopg.rows import dict_row
from psycopg_pool import AsyncConnectionPool

DB_DSN = os.environ["DATABASE_URL"]
OPENCLAW_BASE_URL = os.environ.get("OPENCLAW_BASE_URL", "http://127.0.0.1:18789")
OPENCLAW_HOOK_TOKEN = os.environ["OPENCLAW_HOOK_TOKEN"]
LEASE_SECONDS = int(os.environ.get("LEASE_SECONDS", "600"))
POLL_SECONDS = float(os.environ.get("POLL_SECONDS", "2.0"))
HOSTNAME = socket.gethostname()

AGENT_LIMITS = {
    "coordinator": 1,
    "auditor": 1,
    "project-tracker": 1,
    "knowledge": 1,
    "devops": 1,
    "product-metrics": 1,
    "design": 1,
    "security": 1,
    "research": 1,
    "publisher": 1,
}

busy: dict[str, int] = defaultdict(int)


@dataclass
class Job:
    id: str
    target_agent: str
    subject: str
    payload: dict[str, Any]
    attempt_count: int


CLAIM_SQL = """
with candidate as (
  select id
  from jobs
  where status = 'queued'
    and run_after <= now()
  order by priority desc, created_at
  for update skip locked
  limit 25
)
select *
from jobs
where id in (select id from candidate)
order by priority desc, created_at;
"""

LEASE_SQL = """
update jobs
set status = 'leased',
    lease_owner = %(lease_owner)s,
    leased_until = now() + (%(lease_seconds)s || ' seconds')::interval,
    attempt_count = attempt_count + 1,
    updated_at = now()
where id = %(job_id)s
  and status = 'queued'
returning *;
"""

RUNNING_SQL = """
update jobs
set status = 'running',
    updated_at = now()
where id = %(job_id)s;
"""

COMPLETE_SQL = """
update jobs
set status = 'succeeded',
    completed_at = now(),
    updated_at = now()
where id = %(job_id)s;
"""

FAIL_SQL = """
update jobs
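-- The casts are required: a CASE over bare string literals resolves to text,
-- which Postgres will not assign to the job_status enum column.
set status = case
      when attempt_count >= max_attempts then 'dead'::job_status
      else 'queued'::job_status
    end,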
    leased_until = null,
    lease_owner = null,
    last_error = %(error)s,
    run_after = case
      when attempt_count >= max_attempts then run_after
      else now() + interval '60 seconds'
    end,
    updated_at = now()
where id = %(job_id)s;
"""

REAP_SQL = """
update jobs
set status = 'queued',
    leased_until = null,
    lease_owner = null,
    updated_at = now(),
    run_after = now()
where status in ('leased', 'running')
  and leased_until is not null
  and leased_until < now();
"""


def build_prompt(job: Job) -> str:
    payload = json.dumps(job.payload, indent=2, sort_keys=True)
    return f"""
You are processing a dispatcher-owned job.

Job ID: {job.id}
Subject: {job.subject}
Attempt: {job.attempt_count}

Payload:
{payload}

Execution contract:
1. Do the work for this job only.
2. If follow-up work is needed, emit it through the local helper CLI/tooling, not by informal chat.
3. Return a concise completion summary.
4. If blocked, say exactly what is blocking it.
""".strip()


async def fetch_candidates(pool: AsyncConnectionPool) -> list[dict[str, Any]]:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(CLAIM_SQL)
            return await cur.fetchall()


async def try_lease_job(pool: AsyncConnectionPool, job_id: str) -> dict[str, Any] | None:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(
                LEASE_SQL,
                {
                    "job_id": job_id,
                    "lease_owner": HOSTNAME,
                    "lease_seconds": LEASE_SECONDS,
                },
            )
            row = await cur.fetchone()
            await conn.commit()
            return row


async def mark_running(pool: AsyncConnectionPool, job_id: str) -> None:
    async with pool.connection() as conn:
        await conn.execute(RUNNING_SQL, {"job_id": job_id})
        await conn.commit()


async def mark_complete(pool: AsyncConnectionPool, job_id: str) -> None:
    async with pool.connection() as conn:
        await conn.execute(COMPLETE_SQL, {"job_id": job_id})
        await conn.commit()


async def mark_failed(pool: AsyncConnectionPool, job_id: str, error: str) -> None:
    async with pool.connection() as conn:
        await conn.execute(FAIL_SQL, {"job_id": job_id, "error": error[:4000]})
        await conn.commit()


async def reap_expired_leases(pool: AsyncConnectionPool) -> None:
    async with pool.connection() as conn:
        await conn.execute(REAP_SQL)
        await conn.commit()


async def invoke_agent(job: Job) -> dict[str, Any]:
    payload = {
        "agentId": job.target_agent,
        "name": f"dispatcher:{job.subject}",
        "message": build_prompt(job),
        "wakeMode": "now",
        "deliver": False,
        "timeoutSeconds": 300,
    }

    headers = {
        "Authorization": f"Bearer {OPENCLAW_HOOK_TOKEN}",
        "Content-Type": "application/json",
    }

    async with httpx.AsyncClient(timeout=360.0) as client:
        resp = await client.post(
            f"{OPENCLAW_BASE_URL}/hooks/agent",
            headers=headers,
            json=payload,
        )
        resp.raise_for_status()
        if resp.content:
            return resp.json()
        return {"ok": True}


async def dispatch_one(pool: AsyncConnectionPool, raw_job: dict[str, Any]) -> None:
    agent = raw_job["target_agent"]
    if busy[agent] >= AGENT_LIMITS.get(agent, 1):
        return

    # Reserve the slot before the first await. dispatch_one runs as concurrent
    # tasks, so incrementing only after the lease call would let several jobs
    # for the same agent pass the check above.
    busy[agent] += 1
    try:
        leased = await try_lease_job(pool, raw_job["id"])
        if not leased:
            return

        job = Job(
            id=str(leased["id"]),  # psycopg returns uuid.UUID; Job.id is a str
            target_agent=leased["target_agent"],
            subject=leased["subject"],
            payload=leased.get("payload") or {},
            attempt_count=leased["attempt_count"],
        )

        try:
            await mark_running(pool, job.id)
            result = await invoke_agent(job)
            # Optional: parse structured result here and write follow-up jobs/events.
            await mark_complete(pool, job.id)
            print(f"[ok] {job.id} -> {job.target_agent} :: {result}")
        except Exception as exc:
            await mark_failed(pool, job.id, str(exc))
            print(f"[fail] {job.id} -> {job.target_agent} :: {exc}")
    finally:
        # Release the slot whether the lease was lost, the run failed, or it succeeded.
        busy[agent] -= 1


async def main() -> None:
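    # open=False: the pool is opened explicitly on the next line; opening an
    # async pool from the constructor is deprecated in psycopg_pool.
    pool = AsyncConnectionPool(conninfo=DB_DSN, min_size=1, max_size=10, open=False)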
    await pool.open()

    try:
        while True:
            await reap_expired_leases(pool)
            candidates = await fetch_candidates(pool)

            tasks = []
            for raw_job in candidates:
                tasks.append(asyncio.create_task(dispatch_one(pool, raw_job)))

            if tasks:
                await asyncio.gather(*tasks, return_exceptions=True)

            await asyncio.sleep(POLL_SECONDS)
    finally:
        await pool.close()


if __name__ == "__main__":
    asyncio.run(main())

Two implementation notes that matter

1) Add a tiny local helper CLI

I’d absolutely add a tiny Python CLI like:

  • agentbus emit-event
  • agentbus enqueue-job
  • agentbus complete-job
  • agentbus block-job
  • agentbus record-standing-order-run

Then let specialists call that through exec.

That’s cleaner than trying to make the dispatcher infer all follow-up work from freeform text.
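
As a rough idea of the command surface, here is an argparse sketch. The subcommand names come from the list above; every flag name is a placeholder I made up, and the database writes are left out entirely.

```python
import argparse
import json

RUN_STATUSES = ["started", "no_work", "completed", "blocked", "failed"]


def build_parser() -> argparse.ArgumentParser:
    """Build the agentbus command line (flags are illustrative placeholders)."""
    parser = argparse.ArgumentParser(prog="agentbus")
    sub = parser.add_subparsers(dest="command", required=True)

    emit = sub.add_parser("emit-event")
    emit.add_argument("--topic", required=True)
    emit.add_argument("--payload", type=json.loads, default={})

    enqueue = sub.add_parser("enqueue-job")
    enqueue.add_argument("--target-agent", required=True)
    enqueue.add_argument("--subject", required=True)
    enqueue.add_argument("--payload", type=json.loads, default={})

    complete = sub.add_parser("complete-job")
    complete.add_argument("--job-id", required=True)
    complete.add_argument("--summary", default="")

    block = sub.add_parser("block-job")
    block.add_argument("--job-id", required=True)
    block.add_argument("--reason", required=True)

    record = sub.add_parser("record-standing-order-run")
    record.add_argument("--agent-id", required=True)
    record.add_argument("--order-name", required=True)
    record.add_argument("--scheduled-for", required=True)
    record.add_argument("--status", required=True, choices=RUN_STATUSES)

    return parser
```

Requiring `--reason` on block-job lines up with the idea that a blocked job should always say what is blocking it.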


2) Keep sessions_send out of the critical path

Use it for:

  • steering
  • debugging
  • auditor/coordinator introspection

Do not use it as the only durable handoff path between specialists.


If you want, next I can generate the matching:

  1. Postgres DDL for jobs, events, job_dependencies, standing_order_runs, and
  2. a tiny agentbus Python CLI so the agents can safely write events/jobs back into Postgres.
nimble phoenix

Yes, generate the full Postgres DDL and the agentbus CLI. Agent count doesn't matter; design it to scale.

warm elmBOT

Nice. Here’s the DB-first contract I’d use.

1) Full Postgres DDL

begin;

create extension if not exists pgcrypto;

create type job_status as enum (
  'queued',
  'leased',
  'running',
  'blocked',
  'succeeded',
  'failed',
  'dead'
);

create type attempt_status as enum (
  'leased',
  'running',
  'succeeded',
  'blocked',
  'failed',
  'timed_out',
  'cancelled'
);

create type standing_order_status as enum (
  'started',
  'no_work',
  'completed',
  'blocked',
  'failed',
  'skipped',
  'missed'
);

create or replace function set_updated_at()
returns trigger
language plpgsql
as $$
begin
  new.updated_at := now();
  return new;
end;
$$;

create table jobs (
  id uuid primary key default gen_random_uuid(),
  namespace text not null default 'default',

  parent_job_id uuid references jobs(id) on delete set null,
  flow_key text,
  correlation_key text,

  source_agent text,
  target_agent text not null,

  job_type text not null,
  subject text not null,

  priority int not null default 100 check (priority between 1 and 1000),
  dedupe_key text,

  payload jsonb not null default '{}'::jsonb,
  state jsonb not null default '{}'::jsonb,
  result jsonb not null default '{}'::jsonb,
  summary text,

  status job_status not null default 'queued',
  run_after timestamptz not null default now(),

  lease_owner text,
  leased_until timestamptz,

  attempt_count int not null default 0 check (attempt_count >= 0),
  max_attempts int not null default 5 check (max_attempts > 0),

  blocked_reason text,
  last_error text,

  openclaw_task_ref text,
  openclaw_session_key text,
  openclaw_meta jsonb not null default '{}'::jsonb,

  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  completed_at timestamptz,
  check ((status <> 'blocked') or blocked_reason is not null)
);

create trigger jobs_set_updated_at
before update on jobs
for each row
execute function set_updated_at();

create table job_dependencies (
  job_id uuid not null references jobs(id) on delete cascade,
  depends_on_job_id uuid not null references jobs(id) on delete cascade,
  primary key (job_id, depends_on_job_id),
  check (job_id <> depends_on_job_id)
);

create table job_attempts (
  id bigserial primary key,
  job_id uuid not null references jobs(id) on delete cascade,
  attempt_no int not null check (attempt_no > 0),
  dispatcher_id text,
  status attempt_status not null default 'leased',
  summary text,
  error text,
  openclaw_task_ref text,
  openclaw_session_key text,
  response jsonb not null default '{}'::jsonb,
  started_at timestamptz not null default now(),
  finished_at timestamptz,
  unique (job_id, attempt_no)
);

create table events (
  id bigserial primary key,
  namespace text not null default 'default',
  topic text not null,
  source_agent text,
  source_job_id uuid references jobs(id) on delete set null,
  dedupe_key text,
  payload jsonb not null default '{}'::jsonb,
  created_at timestamptz not null default now()
);

create table standing_order_runs (
  id uuid primary key default gen_random_uuid(),
  namespace text not null default 'default',
  agent_id text not null,
  order_name text not null,
  trigger_source text not null default 'cron',
  cron_job_id text,
  scheduled_for timestamptz not null,
  status standing_order_status not null default 'started',
  summary text,
  evidence jsonb not null default '{}'::jsonb,
  openclaw_task_ref text,
  openclaw_session_key text,
  created_at timestamptz not null default now(),
  started_at timestamptz not null default now(),
  finished_at timestamptz,
  updated_at timestamptz not null default now(),
  unique (namespace, agent_id, order_name, scheduled_for)
);

create trigger standing_order_runs_set_updated_at
before update on standing_order_runs
for each row
execute function set_updated_at();

create index jobs_ready_idx
  on jobs (namespace, target_agent, priority desc, run_after, created_at)
  where status = 'queued';

create index jobs_lease_idx
  on jobs (namespace, status, leased_until)
  where status in ('leased', 'running');
create index jobs_target_status_idx
  on jobs (namespace, target_agent, status, run_after);

create index jobs_flow_idx
  on jobs (namespace, flow_key, created_at desc);

create index jobs_parent_idx
  on jobs (parent_job_id);

create unique index jobs_active_dedupe_idx
  on jobs (namespace, target_agent, dedupe_key)
  where dedupe_key is not null
    and status in ('queued', 'leased', 'running', 'blocked');

create index job_attempts_job_started_idx
  on job_attempts (job_id, started_at desc);

create index events_topic_created_idx
  on events (namespace, topic, created_at desc);

create index events_source_job_idx
  on events (source_job_id, created_at desc);

create unique index events_dedupe_idx
  on events (namespace, topic, dedupe_key)
  where dedupe_key is not null;

create index standing_order_runs_lookup_idx
  on standing_order_runs (namespace, agent_id, order_name, scheduled_for desc);
create index standing_order_runs_open_idx
  on standing_order_runs (namespace, status, scheduled_for)
  where status = 'started';

create or replace view ready_jobs_v as
select *
from jobs
where status = 'queued'
  and run_after <= now();

create or replace view stuck_jobs_v as
select *
from jobs
where status in ('leased', 'running')
  and leased_until is not null
  and leased_until < now();

create or replace function notify_jobs_ready()
returns trigger
language plpgsql
as $$
begin
  if new.status = 'queued' then
    perform pg_notify(
      'jobs_ready',
      json_build_object(
        'namespace', new.namespace,
        'target_agent', new.target_agent,
        'job_id', new.id
      )::text
    );
  end if;
  return new;
end;
$$;

create trigger jobs_notify_insert
after insert on jobs
for each row
execute function notify_jobs_ready();

create trigger jobs_notify_update
after update of status, run_after on jobs
for each row
when (
  new.status = 'queued'
  and (
    old.status is distinct from new.status
    or old.run_after is distinct from new.run_after
  )
)
execute function notify_jobs_ready();

create or replace function notify_events_inserted()
returns trigger
language plpgsql
as $$
begin
  perform pg_notify(
    'events_inserted',
    json_build_object(
      'namespace', new.namespace,
      'topic', new.topic,
      'event_id', new.id
    )::text
  );
  return new;
end;
$$;

create trigger events_notify_insert
after insert on events
for each row
execute function notify_events_inserted();
create or replace function claim_jobs(
  p_namespace text,
  p_lease_owner text,
  p_lease_seconds integer,
  p_limit integer default 1,
  p_target_agent text default null
)
returns setof jobs
language sql
as $$
  with candidate as (
    select j.id
    from jobs j
    where j.namespace = p_namespace
      and j.status = 'queued'
      and j.run_after <= now()
      and (p_target_agent is null or j.target_agent = p_target_agent)
    order by j.priority desc, j.run_after asc, j.created_at asc
    for update skip locked
    limit p_limit
  ),
  updated as (
    update jobs j
    set status = 'leased',
        lease_owner = p_lease_owner,
        leased_until = now() + make_interval(secs => p_lease_seconds),
        attempt_count = j.attempt_count + 1,
        updated_at = now()
    from candidate c
    where j.id = c.id
    returning j.*
  )
  select * from updated;
$$;

create or replace function requeue_expired_jobs(
  p_namespace text,
  p_limit integer default 1000
)
returns setof jobs
language sql
as $$
  with candidate as (
    select j.id
    from jobs j
    where j.namespace = p_namespace
      and j.status in ('leased', 'running')
      and j.leased_until is not null
      and j.leased_until < now()
    order by j.leased_until asc
    limit p_limit
    for update skip locked
  ),
  updated as (
    update jobs j
    set status = 'queued',
        lease_owner = null,
        leased_until = null,
        run_after = now(),
        updated_at = now()
    from candidate c
    where j.id = c.id
    returning j.*
  )
  select * from updated;
$$;

commit;
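
The two pg_notify triggers above only send a small JSON hint. Postgres does not hold notifications for a listener that is disconnected, so a consumer would normally treat them as a wake-up signal and still poll claim_jobs on a timer. Here is a minimal sketch of the consumer side, assuming psycopg 3. The names `parse_jobs_ready` and `listen_for_jobs` are hypothetical and not part of the schema or of OpenClaw:

```python
import asyncio
import json


def parse_jobs_ready(payload: str) -> tuple:
    """Decode the JSON text that notify_jobs_ready() sends on the jobs_ready channel."""
    data = json.loads(payload)
    return data["namespace"], data["target_agent"], data["job_id"]


async def listen_for_jobs(dsn: str, wake: asyncio.Event) -> None:
    """Set `wake` whenever a jobs_ready notification arrives (hypothetical helper)."""
    import psycopg  # imported lazily so the parser above works without the driver

    # autocommit keeps the session out of a transaction so notifications are delivered promptly
    conn = await psycopg.AsyncConnection.connect(dsn, autocommit=True)
    async with conn:
        await conn.execute("LISTEN jobs_ready")
        async for note in conn.notifies():
            parse_jobs_ready(note.payload)  # validate the hint; the job row is the real data
            wake.set()


# The payload shape mirrors the json_build_object call in notify_jobs_ready().
print(parse_jobs_ready('{"namespace": "default", "target_agent": "security", "job_id": "abc"}'))
```

The notification only says that something is ready. The dispatcher still has to call claim_jobs to actually take ownership of the row.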

2) agentbus CLI

This is a single-file CLI, standard library plus psycopg.


Install:

pip install "psycopg[binary]>=3.2"

Save as agentbus.py:

#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import os
import sys
from datetime import datetime
from typing import Any, NoReturn

import psycopg
from psycopg.rows import dict_row
from psycopg.types.json import Jsonb

DB_DSN = os.environ.get("DATABASE_URL")


def die(msg: str, code: int = 1) -> NoReturn:
    # Always raises, so type checkers can treat code after a die() call as unreachable.
    print(msg, file=sys.stderr)
    raise SystemExit(code)

def require_dsn() -> str:
    if not DB_DSN:
        die("DATABASE_URL is required")
    return DB_DSN


def parse_dt(value: str | None) -> datetime | None:
    if value is None:
        return None
    value = value.strip().replace("Z", "+00:00")
    return datetime.fromisoformat(value)


def parse_json_input(value: str | None) -> Any:
    if value is None:
        return {}
    if value == "-":
        return json.load(sys.stdin)
    if value.startswith("@"):
        with open(value[1:], "r", encoding="utf-8") as f:
            return json.load(f)
    return json.loads(value)


def connect():
    return psycopg.connect(require_dsn(), row_factory=dict_row)


def print_json(obj: Any) -> None:
    print(json.dumps(obj, indent=2, default=str, sort_keys=True))


def cmd_enqueue_job(args: argparse.Namespace) -> None:
    payload = parse_json_input(args.payload)
    state = parse_json_input(args.state)

    sql = """
    insert into jobs (
      namespace,
      parent_job_id,
      flow_key,
      correlation_key,
      source_agent,
      target_agent,
      job_type,
      subject,
      priority,
      dedupe_key,
      payload,
      state,
      run_after,
      max_attempts
    )
    values (
      %(namespace)s,
      %(parent_job_id)s,
      %(flow_key)s,
      %(correlation_key)s,
      %(source_agent)s,
      %(target_agent)s,
      %(job_type)s,
      %(subject)s,
      %(priority)s,
      %(dedupe_key)s,
      %(payload)s,
      %(state)s,
      coalesce(%(run_after)s, now()),
      %(max_attempts)s
    )
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "namespace": args.namespace,
                "parent_job_id": args.parent_job_id,
                "flow_key": args.flow_key,
                "correlation_key": args.correlation_key,
                "source_agent": args.source_agent,
                "target_agent": args.target_agent,
                "job_type": args.job_type,
                "subject": args.subject,
                "priority": args.priority,
                "dedupe_key": args.dedupe_key,
                "payload": Jsonb(payload),
                "state": Jsonb(state),
                "run_after": parse_dt(args.run_after),
                "max_attempts": args.max_attempts,
            },
        )
        job = cur.fetchone()

        for dep in args.depends_on:
            cur.execute(
                """
                insert into job_dependencies (job_id, depends_on_job_id)
                values (%s, %s)
                on conflict do nothing;
                """,
                (job["id"], dep),
            )

        conn.commit()
        print_json(job)


def cmd_emit_event(args: argparse.Namespace) -> None:
    payload = parse_json_input(args.payload)

    sql = """
    insert into events (
      namespace,
      topic,
      source_agent,
      source_job_id,
      dedupe_key,
      payload
    )
    values (
      %(namespace)s,
      %(topic)s,
      %(source_agent)s,
      %(source_job_id)s,
      %(dedupe_key)s,
      %(payload)s
    )
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "namespace": args.namespace,
                "topic": args.topic,
                "source_agent": args.source_agent,
                "source_job_id": args.source_job_id,
                "dedupe_key": args.dedupe_key,
                "payload": Jsonb(payload),
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_mark_running_job(args: argparse.Namespace) -> None:
    sql = """
    update jobs
    set status = 'running',
        openclaw_task_ref = coalesce(%(openclaw_task_ref)s, openclaw_task_ref),
        openclaw_session_key = coalesce(%(openclaw_session_key)s, openclaw_session_key),
        openclaw_meta = case
          when %(openclaw_meta)s is null then openclaw_meta
          else %(openclaw_meta)s
        end,
        summary = coalesce(%(summary)s, summary),
        updated_at = now()
    where id = %(job_id)s
    returning *;
    """

    meta = parse_json_input(args.openclaw_meta) if args.openclaw_meta else None

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "job_id": args.job_id,
                "openclaw_task_ref": args.openclaw_task_ref,
                "openclaw_session_key": args.openclaw_session_key,
                "openclaw_meta": Jsonb(meta) if meta is not None else None,
                "summary": args.summary,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_complete_job(args: argparse.Namespace) -> None:
    result = parse_json_input(args.result)

    sql = """
    update jobs
    set status = 'succeeded',
        result = %(result)s,
        summary = %(summary)s,
        blocked_reason = null,
        last_error = null,
        lease_owner = null,
        leased_until = null,
        completed_at = now(),
        openclaw_task_ref = coalesce(%(openclaw_task_ref)s, openclaw_task_ref),
        openclaw_session_key = coalesce(%(openclaw_session_key)s, openclaw_session_key),
        updated_at = now()
    where id = %(job_id)s
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "job_id": args.job_id,
                "result": Jsonb(result),
                "summary": args.summary,
                "openclaw_task_ref": args.openclaw_task_ref,
                "openclaw_session_key": args.openclaw_session_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_block_job(args: argparse.Namespace) -> None:
    result = parse_json_input(args.result)

    sql = """
    update jobs
    set status = 'blocked',
        blocked_reason = %(blocked_reason)s,
        result = %(result)s,
        summary = %(summary)s,
        lease_owner = null,
        leased_until = null,
        openclaw_task_ref = coalesce(%(openclaw_task_ref)s, openclaw_task_ref),
        openclaw_session_key = coalesce(%(openclaw_session_key)s, openclaw_session_key),
        updated_at = now()
    where id = %(job_id)s
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "job_id": args.job_id,
                "blocked_reason": args.blocked_reason,
                "result": Jsonb(result),
                "summary": args.summary,
                "openclaw_task_ref": args.openclaw_task_ref,
                "openclaw_session_key": args.openclaw_session_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_fail_job(args: argparse.Namespace) -> None:
    result = parse_json_input(args.result)

    if args.retry_seconds is not None:
        sql = """
        update jobs
        set status = 'queued',
            last_error = %(last_error)s,
            result = %(result)s,
            summary = %(summary)s,
            lease_owner = null,
            leased_until = null,
            run_after = now() + make_interval(secs => %(retry_seconds)s),
            updated_at = now()
        where id = %(job_id)s
        returning *;
        """
    else:
        sql = """
        update jobs
        set status = 'dead',
            last_error = %(last_error)s,
            result = %(result)s,
            summary = %(summary)s,
            lease_owner = null,
            leased_until = null,
            completed_at = now(),
            updated_at = now()
        where id = %(job_id)s
        returning *;
        """
    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "job_id": args.job_id,
                "last_error": args.last_error,
                "result": Jsonb(result),
                "summary": args.summary,
                "retry_seconds": args.retry_seconds,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)

def cmd_start_standing_order(args: argparse.Namespace) -> None:
    sql = """
    insert into standing_order_runs (
      namespace,
      agent_id,
      order_name,
      trigger_source,
      cron_job_id,
      scheduled_for,
      status,
      openclaw_task_ref,
      openclaw_session_key
    )
    values (
      %(namespace)s,
      %(agent_id)s,
      %(order_name)s,
      %(trigger_source)s,
      %(cron_job_id)s,
      %(scheduled_for)s,
      'started',
      %(openclaw_task_ref)s,
      %(openclaw_session_key)s
    )
    on conflict (namespace, agent_id, order_name, scheduled_for)
    do update set
      updated_at = now()
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "namespace": args.namespace,
                "agent_id": args.agent_id,
                "order_name": args.order_name,
                "trigger_source": args.trigger_source,
                "cron_job_id": args.cron_job_id,
                "scheduled_for": parse_dt(args.scheduled_for),
                "openclaw_task_ref": args.openclaw_task_ref,
                "openclaw_session_key": args.openclaw_session_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_finish_standing_order(args: argparse.Namespace) -> None:
    evidence = parse_json_input(args.evidence)

    sql = """
    update standing_order_runs
    set status = %(status)s,
        summary = %(summary)s,
        evidence = %(evidence)s,
        finished_at = now(),
        openclaw_task_ref = coalesce(%(openclaw_task_ref)s, openclaw_task_ref),
        openclaw_session_key = coalesce(%(openclaw_session_key)s, openclaw_session_key),
        updated_at = now()
    where id = %(run_id)s
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "run_id": args.run_id,
                "status": args.status,
                "summary": args.summary,
                "evidence": Jsonb(evidence),
                "openclaw_task_ref": args.openclaw_task_ref,
                "openclaw_session_key": args.openclaw_session_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_claim_jobs(args: argparse.Namespace) -> None:
    sql = "select * from claim_jobs(%s, %s, %s, %s, %s);"

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            (
                args.namespace,
                args.lease_owner,
                args.lease_seconds,
                args.limit,
                args.target_agent,
            ),
        )
        rows = cur.fetchall()
        conn.commit()
        print_json(rows)


def cmd_requeue_expired(args: argparse.Namespace) -> None:
    sql = "select * from requeue_expired_jobs(%s, %s);"
    with connect() as conn, conn.cursor() as cur:
        cur.execute(sql, (args.namespace, args.limit))
        rows = cur.fetchall()
        conn.commit()
        print_json(rows)


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="agentbus")
    sub = p.add_subparsers(dest="cmd", required=True)

    enqueue = sub.add_parser("enqueue-job")
    enqueue.add_argument("--namespace", default="default")
    enqueue.add_argument("--parent-job-id")
    enqueue.add_argument("--flow-key")
    enqueue.add_argument("--correlation-key")
    enqueue.add_argument("--source-agent")
    enqueue.add_argument("--target-agent", required=True)
    enqueue.add_argument("--job-type", default="reactive")
    enqueue.add_argument("--subject", required=True)
    enqueue.add_argument("--priority", type=int, default=100)
    enqueue.add_argument("--dedupe-key")
    enqueue.add_argument("--payload", default="{}")
    enqueue.add_argument("--state", default="{}")
    enqueue.add_argument("--run-after")
    enqueue.add_argument("--max-attempts", type=int, default=5)
    enqueue.add_argument("--depends-on", action="append", default=[])
    enqueue.set_defaults(func=cmd_enqueue_job)

    emit = sub.add_parser("emit-event")
    emit.add_argument("--namespace", default="default")
    emit.add_argument("--topic", required=True)
    emit.add_argument("--source-agent")
    emit.add_argument("--source-job-id")
    emit.add_argument("--dedupe-key")
    emit.add_argument("--payload", default="{}")
    emit.set_defaults(func=cmd_emit_event)

    mark_running = sub.add_parser("mark-running-job")
    mark_running.add_argument("job_id")
    mark_running.add_argument("--openclaw-task-ref")
    mark_running.add_argument("--openclaw-session-key")
    mark_running.add_argument("--openclaw-meta")
    mark_running.add_argument("--summary")
    mark_running.set_defaults(func=cmd_mark_running_job)

    complete = sub.add_parser("complete-job")
    complete.add_argument("job_id")
    complete.add_argument("--summary")
    complete.add_argument("--result", default="{}")
    complete.add_argument("--openclaw-task-ref")
    complete.add_argument("--openclaw-session-key")
    complete.set_defaults(func=cmd_complete_job)

    block = sub.add_parser("block-job")
    block.add_argument("job_id")
    block.add_argument("--blocked-reason", required=True)
    block.add_argument("--summary")
    block.add_argument("--result", default="{}")
    block.add_argument("--openclaw-task-ref")
    block.add_argument("--openclaw-session-key")
    block.set_defaults(func=cmd_block_job)

    fail = sub.add_parser("fail-job")
    fail.add_argument("job_id")
    fail.add_argument("--last-error", required=True)
    fail.add_argument("--summary")
    fail.add_argument("--result", default="{}")
    fail.add_argument("--retry-seconds", type=int)
    fail.set_defaults(func=cmd_fail_job)

    so_start = sub.add_parser("start-standing-order")
    so_start.add_argument("--namespace", default="default")
    so_start.add_argument("--agent-id", required=True)
    so_start.add_argument("--order-name", required=True)
    so_start.add_argument("--trigger-source", default="cron")
    so_start.add_argument("--cron-job-id")
    so_start.add_argument("--scheduled-for", required=True)
    so_start.add_argument("--openclaw-task-ref")
    so_start.add_argument("--openclaw-session-key")
    so_start.set_defaults(func=cmd_start_standing_order)

    so_finish = sub.add_parser("finish-standing-order")
    so_finish.add_argument("run_id")
    so_finish.add_argument(
        "--status",
        required=True,
        choices=["no_work", "completed", "blocked", "failed", "skipped", "missed"],
    )
    so_finish.add_argument("--summary")
    so_finish.add_argument("--evidence", default="{}")
    so_finish.add_argument("--openclaw-task-ref")
    so_finish.add_argument("--openclaw-session-key")
    so_finish.set_defaults(func=cmd_finish_standing_order)

    claim = sub.add_parser("claim-jobs")
    claim.add_argument("--namespace", default="default")
    claim.add_argument("--lease-owner", required=True)
    claim.add_argument("--lease-seconds", type=int, default=600)
    claim.add_argument("--limit", type=int, default=1)
    claim.add_argument("--target-agent")
    claim.set_defaults(func=cmd_claim_jobs)

    requeue = sub.add_parser("requeue-expired")
    requeue.add_argument("--namespace", default="default")
    requeue.add_argument("--limit", type=int, default=1000)
    requeue.set_defaults(func=cmd_requeue_expired)

    return p


def main() -> None:
    parser = build_parser()
    args = parser.parse_args()
    args.func(args)


if __name__ == "__main__":
    main()

Make it executable:

chmod +x agentbus.py

3) Minimal usage examples

Enqueue work for the security agent

./agentbus.py enqueue-job \
  --target-agent security \
  --job-type reactive \
  --subject "review_new_dependency" \
  --payload '{"package":"fastapi","version":"0.116.0"}' \
  --dedupe-key "dep:fastapi:0.116.0"
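
If an agent prefers to enqueue from Python rather than from a shell, the same call could be wrapped roughly like this. This is a sketch only: `enqueue_argv` and `enqueue` are hypothetical helper names, and the CLI path is assumed to be `./agentbus.py` as in the command above.

```python
import json
import subprocess


def enqueue_argv(target_agent, subject, payload, dedupe_key=None, cli="./agentbus.py"):
    """Build the argument list for `agentbus.py enqueue-job`."""
    argv = [
        cli, "enqueue-job",
        "--target-agent", target_agent,
        "--subject", subject,
        "--payload", json.dumps(payload, sort_keys=True),
    ]
    if dedupe_key is not None:
        argv += ["--dedupe-key", dedupe_key]
    return argv


def enqueue(target_agent, subject, payload, dedupe_key=None):
    """Run the CLI and return the inserted job row that it prints as JSON."""
    done = subprocess.run(
        enqueue_argv(target_agent, subject, payload, dedupe_key),
        check=True, capture_output=True, text=True,
    )
    return json.loads(done.stdout)


# Only builds the command line here; call enqueue() once the database is reachable.
print(enqueue_argv("security", "review_new_dependency",
                   {"package": "fastapi", "version": "0.116.0"},
                   "dep:fastapi:0.116.0"))
```

Because jobs_active_dedupe_idx is unique and the insert has no ON CONFLICT clause, a second enqueue with the same active dedupe key should make the CLI exit non-zero. With `check=True` that surfaces as `subprocess.CalledProcessError`, which the caller can treat as "already queued".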

Emit an event after completion

./agentbus.py emit-event \
  --topic "security.review.completed" \
  --source-agent security \
  --source-job-id 11111111-1111-1111-1111-111111111111 \
  --payload '{"risk":"low","summary":"No blocking issues found."}'

Record a standing order run

RUN_ID=$(
  ./agentbus.py start-standing-order \
    --agent-id auditor \
    --order-name lease_reaper \
    --scheduled-for "2026-04-15T18:00:00Z" \
  | jq -r .id
)

./agentbus.py finish-standing-order "$RUN_ID" \
  --status completed \
  --summary "Requeued 2 expired jobs, found 0 cron gaps." \
  --evidence '{"requeued":2,"gaps":0}'
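
Because standing_order_runs is unique on (namespace, agent_id, order_name, scheduled_for), each scheduled slot has at most one row. An audit can therefore compare the slots that should exist against the rows that do. A minimal sketch of that comparison in plain Python follows. The names `expected_slots` and `audit_slots` are hypothetical, and the `runs` list stands in for rows selected from standing_order_runs:

```python
from datetime import datetime, timedelta, timezone


def expected_slots(start, end, every):
    """Yield each scheduled_for timestamp in the half-open window [start, end)."""
    slot = start
    while slot < end:
        yield slot
        slot += every


def audit_slots(slots, runs):
    """Map each expected slot to its recorded status, or 'missing' if no row exists."""
    by_slot = {run["scheduled_for"]: run["status"] for run in runs}
    return {slot: by_slot.get(slot, "missing") for slot in slots}


start = datetime(2026, 4, 15, 18, 0, tzinfo=timezone.utc)
runs = [
    {"scheduled_for": start, "status": "no_work"},                       # ran, found nothing
    {"scheduled_for": start + timedelta(hours=2), "status": "started"},  # began, never finished
]
report = audit_slots(
    expected_slots(start, start + timedelta(hours=3), timedelta(hours=1)),
    runs,
)
for slot, status in report.items():
    print(slot.isoformat(), status)
```

A slot reported as `missing` means no run row was ever written for that slot. `no_work` means the agent ran and found nothing, and a lingering `started` row means the run began but never closed.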

Dispatcher claim

./agentbus.py claim-jobs \
  --lease-owner dispatcher-1 \
  --lease-seconds 600 \
  --limit 10

4) One design opinion that matters

For scale, I’d keep the contract as:

  • Postgres = source of truth
  • OpenClaw = worker runtime
  • agentbus = the only write path agents use for jobs/events/audit
  • dispatcher = lease owner + wake-up service

That keeps the model out of the “invent your own queue semantics” business.
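
To make the lease contract concrete, here is a small in-memory model of what claim_jobs and requeue_expired_jobs do. It is an illustration only: it ignores run_after, target-agent filtering, and the SKIP LOCKED concurrency that the SQL functions handle, and the `Job`, `claim` and `requeue_expired` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: str
    priority: int = 100
    status: str = "queued"
    lease_owner: Optional[str] = None
    leased_until: Optional[float] = None
    attempt_count: int = 0


def claim(jobs, owner, now, lease_seconds, limit=1):
    """Lease up to `limit` queued jobs, highest priority first."""
    ready = sorted((j for j in jobs if j.status == "queued"), key=lambda j: -j.priority)
    claimed = ready[:limit]
    for job in claimed:
        job.status = "leased"
        job.lease_owner = owner
        job.leased_until = now + lease_seconds
        job.attempt_count += 1
    return claimed


def requeue_expired(jobs, now):
    """Put leased or running jobs whose lease has lapsed back on the queue."""
    expired = [
        j for j in jobs
        if j.status in ("leased", "running")
        and j.leased_until is not None
        and j.leased_until < now
    ]
    for job in expired:
        job.status, job.lease_owner, job.leased_until = "queued", None, None
    return expired


jobs = [Job("a", priority=100), Job("b", priority=200)]
first = claim(jobs, "dispatcher-1", now=0, lease_seconds=600)
requeued = requeue_expired(jobs, now=601)
second = claim(jobs, "dispatcher-2", now=601, lease_seconds=600)
print(first[0].id, [j.id for j in requeued], second[0].id, second[0].attempt_count)
```

The point of the model is that a lease gives at-least-once execution. If an agent is still working when its lease lapses, the job can be handed out again, so handlers should be safe to run twice.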

If you want, I can next generate:

  1. a systemd unit for the dispatcher, and
  2. a dispatcher v2 that uses LISTEN/NOTIFY plus job_attempts writes.
nimble phoenix
#

Yes — systemd unit and dispatcher v2 with LISTEN/NOTIFY and job_attempts writes.

warm elmBOT
#

Perfect. This version assumes your agents use agentbus to close the loop on jobs, and the dispatcher validates that they actually did.

agentbus-dispatcher.service

[Unit]
Description=AgentBus Dispatcher v2
After=network-online.target
Wants=network-online.target

[Service]
Type=simple

# Run this as the same user that owns your OpenClaw + dispatcher files.
User=openclaw
Group=openclaw

WorkingDirectory=/opt/agentbus
EnvironmentFile=/etc/agentbus/dispatcher.env

ExecStart=/opt/agentbus/.venv/bin/python /opt/agentbus/dispatcher_v2.py

Restart=always
RestartSec=5
TimeoutStopSec=30

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=read-only
ReadWritePaths=/opt/agentbus

# Uncomment if you use local Postgres unix sockets:
# ReadWritePaths=/opt/agentbus /var/run/postgresql

[Install]
WantedBy=multi-user.target

Example env file, /etc/agentbus/dispatcher.env

DATABASE_URL=postgresql://agentbus:change-me@127.0.0.1:5432/agentbus
OPENCLAW_BASE_URL=http://127.0.0.1:18789
OPENCLAW_HOOK_TOKEN=replace-with-your-hook-token
DISPATCHER_NAMESPACE=default
DISPATCHER_ID=dispatcher-1

LEASE_SECONDS=600
POLL_SECONDS=10
CLAIM_BATCH=10
RETRY_SECONDS=60

# Per-agent concurrency. Omit agents to use DEFAULT_AGENT_CONCURRENCY.
DEFAULT_AGENT_CONCURRENCY=1
AGENT_LIMITS_JSON={"coordinator":1,"auditor":1,"project-tracker":1,"knowledge":1,"devops":1,"product-metrics":1,"design":1,"security":1,"research":1,"publisher":1}
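
The two concurrency settings above could be applied roughly as follows. This is a sketch of the idea rather than the dispatcher's actual code, and `load_limits`, `try_reserve` and `release` are hypothetical names:

```python
import json
from collections import defaultdict


def load_limits(raw):
    """Parse AGENT_LIMITS_JSON text into {agent_id: max concurrent jobs}."""
    return {agent: int(limit) for agent, limit in json.loads(raw or "{}").items()}


def try_reserve(agent, busy, limits, default=1):
    """Reserve one slot for `agent` if it is below its limit; report success."""
    if busy[agent] >= limits.get(agent, default):
        return False
    busy[agent] += 1
    return True


def release(agent, busy):
    """Give the slot back once the job reaches a terminal state."""
    busy[agent] = max(0, busy[agent] - 1)


limits = load_limits('{"security": 1, "research": 2}')
busy = defaultdict(int)
results = [
    try_reserve(agent, busy, limits)
    for agent in ["security", "security", "research", "research", "devops"]
]
release("security", busy)
results.append(try_reserve("security", busy, limits))
print(results)
```

An agent that is not listed in the JSON (`devops` here) falls back to the default of 1, which matches the comment on DEFAULT_AGENT_CONCURRENCY.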

Install/start:

python3.11 -m venv /opt/agentbus/.venv
/opt/agentbus/.venv/bin/pip install "psycopg[binary]>=3.2" psycopg-pool httpx

sudo mkdir -p /etc/agentbus
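sudo cp dispatcher.env /etc/agentbus/dispatcher.env
# The env file holds the database password and hook token, so keep it root-only.
sudo chmod 600 /etc/agentbus/dispatcher.env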
sudo cp agentbus-dispatcher.service /etc/systemd/system/

sudo systemctl daemon-reload
sudo systemctl enable --now agentbus-dispatcher.service
sudo systemctl status agentbus-dispatcher.service

dispatcher_v2.py

#!/usr/bin/env python3
from __future__ import annotations

import asyncio
import json
import os
import socket
from collections import defaultdict
from typing import Any

import httpx
import psycopg
from psycopg.rows import dict_row
from psycopg.types.json import Jsonb
from psycopg_pool import AsyncConnectionPool

DATABASE_URL = os.environ["DATABASE_URL"]
OPENCLAW_BASE_URL = os.environ.get("OPENCLAW_BASE_URL", "http://127.0.0.1:18789").rstrip("/")
OPENCLAW_HOOK_TOKEN = os.environ["OPENCLAW_HOOK_TOKEN"]

DISPATCHER_NAMESPACE = os.environ.get("DISPATCHER_NAMESPACE", "default")
DISPATCHER_ID = os.environ.get("DISPATCHER_ID", socket.gethostname())

LEASE_SECONDS = int(os.environ.get("LEASE_SECONDS", "600"))
POLL_SECONDS = float(os.environ.get("POLL_SECONDS", "10"))
CLAIM_BATCH = int(os.environ.get("CLAIM_BATCH", "10"))
RETRY_SECONDS = int(os.environ.get("RETRY_SECONDS", "60"))
DEFAULT_AGENT_CONCURRENCY = int(os.environ.get("DEFAULT_AGENT_CONCURRENCY", "1"))
AGENT_LIMITS = json.loads(os.environ.get("AGENT_LIMITS_JSON", "{}"))
busy: dict[str, int] = defaultdict(int)
wake_event = asyncio.Event()


def agent_limit(agent_id: str) -> int:
    return int(AGENT_LIMITS.get(agent_id, DEFAULT_AGENT_CONCURRENCY))


def build_prompt(job: dict[str, Any]) -> str:
    payload = json.dumps(job.get("payload") or {}, indent=2, sort_keys=True)
    return f"""
You are processing one dispatcher-owned job.

Namespace: {job["namespace"]}
Job ID: {job["id"]}
Target agent: {job["target_agent"]}
Subject: {job["subject"]}
Type: {job["job_type"]}
Attempt: {job["attempt_count"]}

Payload:
{payload}

Rules:
1. Handle this job only.
2. If you need follow-up work, create it via the local agentbus CLI, not via informal chat.
3. You must close the loop before finishing:
   - success: agentbus complete-job
   - blocked: agentbus block-job
   - retry/dead: agentbus fail-job
4. If this is a scheduled standing order, also update standing_order_runs via agentbus.
5. Return a short summary after state is durably written.
""".strip()


async def fetch_job(pool: AsyncConnectionPool, job_id: str) -> dict[str, Any] | None:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute("select * from jobs where id = %s", (job_id,))
            return await cur.fetchone()


async def claim_jobs(
    pool: AsyncConnectionPool,
    namespace: str,
    lease_owner: str,
    lease_seconds: int,
    limit: int,
    target_agent: str | None = None,
) -> list[dict[str, Any]]:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(
                "select * from claim_jobs(%s, %s, %s, %s, %s)",
                (namespace, lease_owner, lease_seconds, limit, target_agent),
            )
            rows = await cur.fetchall()
            await conn.commit()
            return rows

async def requeue_expired_jobs(pool: AsyncConnectionPool, namespace: str, limit: int = 1000) -> list[dict[str, Any]]:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute("select * from requeue_expired_jobs(%s, %s)", (namespace, limit))
            rows = await cur.fetchall()
            await conn.commit()
            return rows


async def ensure_attempt_row(pool: AsyncConnectionPool, job: dict[str, Any]) -> None:
    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                """
                insert into job_attempts (
                  job_id, attempt_no, dispatcher_id, status
                )
                values (%s, %s, %s, 'leased')
                on conflict (job_id, attempt_no)
                do update set
                  dispatcher_id = excluded.dispatcher_id,
                  status = 'leased';
                """,
                (job["id"], job["attempt_count"], DISPATCHER_ID),
            )
            await conn.commit()


async def mark_attempt_running(pool: AsyncConnectionPool, job: dict[str, Any]) -> None:
    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                """
                update job_attempts
                set status = 'running',
                    started_at = coalesce(started_at, now())
                where job_id = %s
                  and attempt_no = %s;
                """,
                (job["id"], job["attempt_count"]),
            )
            await conn.commit()

async def finalize_attempt(
    pool: AsyncConnectionPool,
    job: dict[str, Any],
    status: str,
    summary: str | None,
    error: str | None,
    response: dict[str, Any] | None,
) -> None:
    task_ref = job.get("openclaw_task_ref")
    session_key = job.get("openclaw_session_key")
    resp = response or {}

    if not task_ref:
        task_ref = resp.get("taskId") or resp.get("task_id") or resp.get("runId") or resp.get("run_id")
    if not session_key:
        session_key = resp.get("sessionKey") or resp.get("session_key")

    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                """
                update job_attempts
                set status = %s,
                    summary = %s,
                    error = %s,
                    openclaw_task_ref = coalesce(%s, openclaw_task_ref),
                    openclaw_session_key = coalesce(%s, openclaw_session_key),
                    response = %s,
                    finished_at = now()
                where job_id = %s
                  and attempt_no = %s;
                """,
                (
                    status,
                    summary,
                    error,
                    task_ref,
                    session_key,
                    Jsonb(resp),
                    job["id"],
                    job["attempt_count"],
                ),
            )
            await conn.commit()

async def mark_job_running(pool: AsyncConnectionPool, job_id: str) -> None:
    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                """
                update jobs
                set status = 'running',
                    updated_at = now()
                where id = %s
                  and status = 'leased';
                """,
                (job_id,),
            )
            await conn.commit()


async def fail_job_for_missing_terminal_update(pool: AsyncConnectionPool, job_id: str, error: str) -> dict[str, Any] | None:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(
                """
                update jobs
                set status = 'queued',
                    last_error = %s,
                    lease_owner = null,
                    leased_until = null,
                    run_after = now() + make_interval(secs => %s),
                    updated_at = now()
                where id = %s
                  and status in ('leased', 'running')
                returning *;
                """,
                (error, RETRY_SECONDS, job_id),
            )
            row = await cur.fetchone()
            await conn.commit()
            return row


async def invoke_agent(job: dict[str, Any]) -> dict[str, Any]:
    payload = {
        "agentId": job["target_agent"],
        "name": f"dispatcher:{job['subject']}",
        "message": build_prompt(job),
        "wakeMode": "now",
        "deliver": False,
        "timeoutSeconds": 300,
    }

    headers = {
        "Authorization": f"Bearer {OPENCLAW_HOOK_TOKEN}",
        "Content-Type": "application/json",
    }

    async with httpx.AsyncClient(timeout=360.0) as client:
        resp = await client.post(f"{OPENCLAW_BASE_URL}/hooks/agent", headers=headers, json=payload)
        resp.raise_for_status()
        if resp.content:
            return resp.json()
        return {"ok": True}


def map_job_status_to_attempt(job_status: str) -> str:
    if job_status == "succeeded":
        return "succeeded"
    if job_status == "blocked":
        return "blocked"
    return "failed"


async def process_job(pool: AsyncConnectionPool, job: dict[str, Any]) -> None:
    agent_id = job["target_agent"]
    busy[agent_id] += 1
    try:
        await ensure_attempt_row(pool, job)
        await mark_job_running(pool, job["id"])
        await mark_attempt_running(pool, job)

        response: dict[str, Any] | None = None
        invoke_error: str | None = None

        try:
            response = await invoke_agent(job)
        except Exception as exc:
            invoke_error = str(exc)

        current = await fetch_job(pool, job["id"])
        if current is None:
            await finalize_attempt(pool, job, "failed", None, "job disappeared", response)
            return

        status = current["status"]

        if status in ("succeeded", "blocked", "failed", "dead"):
            await finalize_attempt(
                pool,
                current,
                map_job_status_to_attempt(status),
                current.get("summary"),
                current.get("last_error"),
                response,
            )
            return

        if status == "queued":
            await finalize_attempt(
                pool,
                current,
                "failed",
                current.get("summary"),
                current.get("last_error") or invoke_error or "job requeued",
                response,
            )
            return

        # If the agent returned but never wrote a terminal state, requeue.
        fallback_error = invoke_error or "agent returned without durable terminal job update"
        repaired = await fail_job_for_missing_terminal_update(pool, job["id"], fallback_error)
        repaired = repaired or await fetch_job(pool, job["id"])

        await finalize_attempt(
            pool,
            repaired or current,
            "failed",
            (repaired or current).get("summary"),
            fallback_error,
            response,
        )

    finally:
        busy[agent_id] -= 1

async def dispatch_once(pool: AsyncConnectionPool) -> bool:
    claimed_any = False
    tasks: list[asyncio.Task[Any]] = []

    for agent_id, limit in sorted(AGENT_LIMITS.items()):
        available = max(0, int(limit) - busy[agent_id])
        if available <= 0:
            continue

        rows = await claim_jobs(
            pool=pool,
            namespace=DISPATCHER_NAMESPACE,
            lease_owner=DISPATCHER_ID,
            lease_seconds=LEASE_SECONDS,
            limit=min(available, CLAIM_BATCH),
            target_agent=agent_id,
        )
        if not rows:
            continue

        claimed_any = True
        for row in rows:
            tasks.append(asyncio.create_task(process_job(pool, row)))

    # Also drain agents not explicitly listed, using the default concurrency.
    rows = await claim_jobs(
        pool=pool,
        namespace=DISPATCHER_NAMESPACE,
        lease_owner=DISPATCHER_ID,
        lease_seconds=LEASE_SECONDS,
        limit=CLAIM_BATCH,
        target_agent=None,
    )
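    # NOTE: claim_jobs has already leased every row returned here. Rows skipped
    # below stay leased until the lease expires and sweep_loop requeues them.
    pending: dict[str, int] = {}  # tasks created in this pass, per unlisted agent
    for row in rows:
        agent_id = row["target_agent"]
        if agent_id in AGENT_LIMITS:
            continue
        if busy[agent_id] + pending.get(agent_id, 0) >= DEFAULT_AGENT_CONCURRENCY:
            continue
        pending[agent_id] = pending.get(agent_id, 0) + 1
        claimed_any = True
        tasks.append(asyncio.create_task(process_job(pool, row)))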

    if tasks:
        await asyncio.gather(*tasks, return_exceptions=True)
    return claimed_any


async def listener_loop() -> None:
    while True:
        try:
            # Context manager closes the connection before each reconnect attempt.
            async with await psycopg.AsyncConnection.connect(DATABASE_URL, autocommit=True) as conn:
                await conn.execute("listen jobs_ready")
                await conn.execute("listen events_inserted")
                async for notify in conn.notifies():
                    if notify.channel in ("jobs_ready", "events_inserted"):
                        wake_event.set()
        except Exception as exc:
            print(f"[listener] error: {exc}", file=sys.stderr)
            await asyncio.sleep(2)


async def sweep_loop(pool: AsyncConnectionPool) -> None:
    while True:
        try:
            rows = await requeue_expired_jobs(pool, DISPATCHER_NAMESPACE, 1000)
            if rows:
                print(f"[sweep] requeued {len(rows)} expired jobs")
                wake_event.set()
        except Exception as exc:
            print(f"[sweep] error: {exc}", file=sys.stderr)

        await asyncio.sleep(POLL_SECONDS)

async def drain_loop(pool: AsyncConnectionPool) -> None:
    wake_event.set()  # initial drain on boot

    while True:
        try:
            try:
                await asyncio.wait_for(wake_event.wait(), timeout=POLL_SECONDS)
            except asyncio.TimeoutError:
                pass
            wake_event.clear()

            while True:
                claimed = await dispatch_once(pool)
                if not claimed:
                    break
        except Exception as exc:
            print(f"[drain] error: {exc}", file=sys.stderr)
            await asyncio.sleep(2)


async def main() -> None:
    pool = AsyncConnectionPool(conninfo=DATABASE_URL, min_size=1, max_size=10)
    await pool.open()

    listener = asyncio.create_task(listener_loop(), name="listener")
    sweeper = asyncio.create_task(sweep_loop(pool), name="sweeper")
    drainer = asyncio.create_task(drain_loop(pool), name="drainer")

    try:
        await asyncio.gather(listener, sweeper, drainer)
    finally:
        await pool.close()


if __name__ == "__main__":
    import sys
    asyncio.run(main())

A few practical notes

  • This dispatcher does not assume hook success = job success.
    It checks the job row after the agent returns.
  • If the agent forgot to call agentbus complete-job / block-job / fail-job, the dispatcher requeues it and writes a failed job_attempts record.
  • job_attempts now gives you a real audit trail for:
    • when a job was leased
    • when it ran
    • what the hook returned
    • whether the agent closed the loop correctly
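
The "check the job row after the agent returns" rule can be sketched as a pure function. This is a simplified illustration, not the dispatcher's actual code: `resolve_attempt` and `TERMINAL_JOB_STATES` are hypothetical names, and unlike `process_job` it does not carry over the row's `summary` or `last_error`.

```python
from __future__ import annotations

# Job states the agent may write itself to close the loop (taken from process_job).
TERMINAL_JOB_STATES = {"succeeded", "blocked", "failed", "dead"}


def resolve_attempt(job_status: str | None, invoke_error: str | None) -> tuple[str, str | None, bool]:
    """Return (attempt_status, error, needs_requeue) for one finished hook call.

    job_status is the status read back from the jobs row after the hook
    returned, or None if the row no longer exists.
    """
    if job_status is None:
        return "failed", "job disappeared", False
    if job_status in TERMINAL_JOB_STATES:
        # The agent closed the loop; mirror its state onto the attempt record.
        attempt = job_status if job_status in ("succeeded", "blocked") else "failed"
        return attempt, None, False
    if job_status == "queued":
        # The job was already requeued, for example by fail-job with a retry.
        return "failed", invoke_error or "job requeued", False
    # Still leased or running: the agent never wrote a terminal state.
    return "failed", invoke_error or "agent returned without durable terminal job update", True


print(resolve_attempt("running", None))
```

The hook's HTTP result never appears in the success path: only the status read back from the row can produce a `succeeded` attempt.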

If you want, next I can generate the matching dispatcher_v2.py prompt contract block for each of your 10 agents, so each role gets the right “what to do when invoked by dispatcher” instructions.

nimble phoenix
#

Yes — generate the prompt contract block for all 10 roles.

warm elmBOT
#

Yep. I’d structure it as one common dispatcher contract, then one role block per agent.

Put the Common Dispatcher Contract into all 10 agent workspaces, then append the matching role block.

Common Dispatcher Contract, add to every agent

## Dispatcher Contract

When you are invoked for a dispatcher-owned job, treat the dispatcher payload as the source of truth for this run.

### Your obligations

1. Process exactly one job.
2. Do not treat conversation text as durable state.
3. Do not use cross-agent chat as the primary handoff path.
4. All durable state changes must go through `agentbus`:
   - create follow-up work: `agentbus enqueue-job`
   - emit durable facts: `agentbus emit-event`
   - close the current job:
     - success: `agentbus complete-job`
     - blocked: `agentbus block-job`
     - retry or terminal failure: `agentbus fail-job`
5. If this job corresponds to a standing order, update `standing_order_runs`:
   - start: `agentbus start-standing-order`
   - finish: `agentbus finish-standing-order`
6. Return a short human-readable summary after durable writes are complete.

### Handoff rules

- Prefer `agentbus enqueue-job` over `sessions_send`.
- Use `sessions_send` only for narrow steering/debugging when explicitly authorized.
- If another specialist is needed, create a new job for that specialist unless your role block says otherwise.

### Blocking rules

If you cannot finish:
- use `agentbus block-job --blocked-reason "..."`
- state exactly what is missing
- name the next role or action needed

### Failure rules

If the work should retry later:
- use `agentbus fail-job --retry-seconds <n>`

If the work is terminally invalid:
- use `agentbus fail-job` without retry

### Summary contract

End every successful run with a summary that includes:
- what you did
- what durable state you wrote
- what follow-up jobs or events you created
- whether anything remains blocked

### Forbidden behavior

- Do not leave a dispatcher-owned job without a terminal durable update.
- Do not invent that another agent was notified unless you wrote a job/event for it.
- Do not claim completion based only on “I replied”.
- Do not route around the dispatcher for core coordination.

1) Coordinator

## Role Contract: Coordinator

### Authority
You own strategic routing, priority changes, replanning, and specialist assignment.

### You should accept jobs like
- `plan.new`
- `plan.rebalance`
- `blocked.replan`
- `priority.review`
- `cross_specialist.route`

### You may enqueue jobs to
- any specialist
- `project-tracker`
- `publisher`

### You should emit events like
- `plan.updated`
- `work.assigned`
- `priority.changed`
- `blocker.rerouted`

### You should not
- do deep specialist execution unless the payload explicitly says so
- become the project tracker
- publish human-facing output directly unless explicitly instructed

### Success means
- the right next jobs exist
- priorities are set
- the work graph is coherent

2) Auditor

## Role Contract: Auditor

### Authority
You own cron health, lease reaping, standing-order gap detection, stuck work detection, and operational anomalies.

### You should accept jobs like
- `audit.leases`
- `audit.cron`
- `audit.gaps`
- `audit.stuck_jobs`
- `audit.reconcile`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher`
- recovery work for the same affected specialist when policy already allows it

### You should emit events like
- `ops.anomaly`
- `jobs.requeued`
- `standing_order.missed`
- `cron.health.changed`

### You should not
- rewrite strategic priorities unless explicitly instructed
- act as publisher
- silently repair systemic issues without recording them

### Success means
- gaps/anomalies are durably recorded
- expired/stuck work is requeued or escalated
- the audit trail is clearer after your run than before it

3) Project Tracker

## Role Contract: Project Tracker

### Authority
You own cross-specialist status synthesis, milestone tracking, debrief collection, and project-level rollups.

### You should accept jobs like
- `project.debrief.collect`
- `project.status.rollup`
- `project.milestone.check`
- `project.followup.request`

### You may enqueue jobs to
- `coordinator`
- `publisher`
- specialists only for debrief/status follow-ups, not new strategic work

### You should emit events like
- `project.status.updated`
- `project.debrief.missing`
- `project.deliverable.ready`
- `project.risk.changed`

### You should not
- take over routing authority from the coordinator
- invent status for specialists who did not produce durable updates
- directly assign strategic work across the system

### Success means
- project state is legible
- missing debriefs are explicit
- ready-to-publish material is handed to publisher

4) Knowledge / Memory Specialist

## Role Contract: Knowledge

### Authority
You own durable notes, memory consolidation, retrieval structure, and long-lived factual organization.

### You should accept jobs like
- `knowledge.consolidate`
- `knowledge.promote_fact`
- `knowledge.dedupe`
- `knowledge.index_refresh`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher` only when a clear deliverable is ready

### You should emit events like
- `knowledge.updated`
- `memory.promoted`
- `memory.conflict.detected`
- `knowledge.packet.ready`

### You should not
- perform strategic routing
- publish directly
- overwrite source facts without preserving provenance

### Success means
- durable memory is improved
- provenance is preserved
- future retrieval will be better

5) DevOps Specialist

## Role Contract: DevOps

### Authority
You own infrastructure checks, service status, deployment review, runbook execution, and environment diagnosis.

### You should accept jobs like
- `infra.check`
- `deploy.review`
- `ops.runbook`
- `service.diagnose`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher` only for operator-ready summaries

### You should emit events like
- `infra.checked`
- `service.state.changed`
- `deploy.review.completed`
- `incident.note.recorded`

### You should not
- silently perform destructive changes unless your standing orders explicitly allow it
- act as security unless the job explicitly includes security review
- publish human-facing output directly except via publisher

### Success means
- system state is measured or changed deliberately
- evidence is recorded
- operational next steps are explicit

6) Product / Metrics Specialist

## Role Contract: Product Metrics

### Authority
You own KPI analysis, experiment readouts, funnel/usage interpretation, and product health summaries.

### You should accept jobs like
- `metrics.analyze`
- `experiment.review`
- `funnel.check`
- `product.health.rollup`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher`

### You should emit events like
- `metrics.anomaly`
- `experiment.summary`
- `product.health.updated`
- `decision.support.ready`

### You should not
- invent certainty where data is weak
- do design or research work beyond what is needed to interpret metrics
- publish directly unless explicitly told

### Success means
- the data is interpreted clearly
- uncertainty is explicit
- the next product decision is better informed

7) Design Specialist

## Role Contract: Design

### Authority
You own UX/UI/system design options, critiques, flows, and usability-focused recommendations.

### You should accept jobs like
- `design.explore`
- `design.review`
- `ux.critique`
- `flow.propose`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher` when a human-facing design brief is ready

### You should emit events like
- `design.option.generated`
- `design.review.completed`
- `ux.issue.found`
- `design.brief.ready`

### You should not
- act as product metrics or research unless the job explicitly requires it
- claim feasibility for engineering/security without flagging assumptions
- publish directly unless explicitly instructed

### Success means
- options are concrete
- tradeoffs are visible
- design consequences are clear

8) Security Specialist

## Role Contract: Security

### Authority
You own threat review, vulnerability triage, risk classification, and security gating.

### You should accept jobs like
- `security.review`
- `dependency.risk.check`
- `threat.model`
- `security.blocker.assess`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher` only for approved human-facing summaries

### You should emit events like
- `security.finding`
- `risk.acceptable`
- `risk.blocker`
- `security.review.completed`

### You should not
- soften real risk to keep work moving
- approve changes without evidence
- take over devops ownership unless explicitly instructed

### Success means
- risk is named precisely
- required mitigations are explicit
- downstream roles know whether work may proceed

9) Research Specialist

## Role Contract: Research

### Authority
You own external scanning, source gathering, synthesis packets, and uncertainty-aware backgrounding.

### You should accept jobs like
- `research.scan`
- `research.packet`
- `source.compare`
- `background.brief`

### You may enqueue jobs to
- `coordinator`
- `project-tracker`
- `publisher` when a publishable brief is ready

### You should emit events like
- `research.packet.ready`
- `source.conflict.detected`
- `background.updated`
- `evidence.bundle.ready`

### You should not
- present unverified claims as settled
- do product/design/security decisions yourself
- publish directly unless explicitly told

### Success means
- evidence is stronger
- uncertainty is tagged
- downstream specialists can act with better context

10) Publisher

## Role Contract: Publisher

### Authority
You own human-facing summaries, digests, reports, debrief packets, and polished delivery.

### You should accept jobs like
- `digest.publish`
- `report.compose`
- `summary.prepare`
- `deliverable.format`

### You may enqueue jobs to
- `coordinator` only when source material is incomplete or contradictory
- `project-tracker` for missing debrief/status inputs

### You should emit events like
- `digest.ready`
- `report.published`
- `summary.delivered`
- `publication.blocked`

### You should not
- create new operational or strategic work on your own
- invent facts that were not durably recorded upstream
- act as auditor

### Success means
- the output is accurate, readable, and properly scoped to the audience
- all claims trace back to durable upstream facts
- missing inputs are made explicit, not papered over
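
The "You may enqueue jobs to" lists across the ten role blocks amount to a handoff allow-list. Below is a minimal sketch of that matrix as data, assuming the agent ids match the workspace names (`coordinator`, `auditor`, `project-tracker`, `knowledge`, `devops`, `product-metrics`, `design`, `security`, `research`, `publisher`). `ALLOWED_HANDOFFS` and `may_enqueue` are hypothetical names, and the conditional cases (auditor recovery jobs, project-tracker debrief follow-ups to specialists) are deliberately left out.

```python
from __future__ import annotations

SPECIALISTS = {"knowledge", "devops", "product-metrics", "design", "security", "research"}
CORE = {"coordinator", "project-tracker", "publisher"}

# Static part of the "You may enqueue jobs to" lists, one entry per role.
ALLOWED_HANDOFFS: dict[str, set[str]] = {
    "coordinator": SPECIALISTS | {"project-tracker", "publisher"},
    "auditor": set(CORE),
    "project-tracker": {"coordinator", "publisher"},
    "publisher": {"coordinator", "project-tracker"},
    # Every specialist hands off to the three core roles only, never to a peer.
    **{role: set(CORE) for role in SPECIALISTS},
}


def may_enqueue(source: str, target: str) -> bool:
    """True if the role contracts let `source` create a job for `target`."""
    return target in ALLOWED_HANDOFFS.get(source, set())


print(may_enqueue("coordinator", "security"))
print(may_enqueue("design", "security"))
```

A check like this could sit in front of whatever writes the job row, so a specialist-to-specialist handoff is rejected at enqueue time rather than discovered in an audit.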

Small opinion that’ll save you pain

For your setup, I’d give:

  • Coordinator: sessions_list, sessions_history, sessions_send, session_status
  • Auditor: sessions_list, sessions_history, session_status
  • Project-tracker: sessions_list, sessions_history, session_status
  • Everyone else: no session tools

That keeps coordination tight and stops specialist-to-specialist spaghetti.
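
The same allocation written as a lookup table, in case you want to lint agent configs against it. This is only a sketch: `SESSION_TOOLS` and the helper names are hypothetical, and how the per-agent tool allow-list is actually configured in OpenClaw is not shown here.

```python
from __future__ import annotations

# Session tools per agent id; agents not listed get none.
SESSION_TOOLS: dict[str, tuple[str, ...]] = {
    "coordinator": ("sessions_list", "sessions_history", "sessions_send", "session_status"),
    "auditor": ("sessions_list", "sessions_history", "session_status"),
    "project-tracker": ("sessions_list", "sessions_history", "session_status"),
}


def session_tools_for(agent_id: str) -> tuple[str, ...]:
    """Return the session tools an agent should have (empty for specialists)."""
    return SESSION_TOOLS.get(agent_id, ())


def can_send_to_sessions(agent_id: str) -> bool:
    """Only agents holding sessions_send can message another session directly."""
    return "sessions_send" in session_tools_for(agent_id)


print([a for a in ("coordinator", "auditor", "design") if can_send_to_sessions(a)])
```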

If you want, I can do one more pass and turn these into final copy-paste AGENTS.md snippets, with the role block + standing-order section + dispatcher section already merged per agent.

nimble phoenix
#

Yes — final copy-paste AGENTS.md snippets for all 10 agents, role block + standing orders + dispatcher section merged.

warm elmBOT
#

Yep, here are the final copy-paste AGENTS.md snippets for all 10 agents.

They’re intentionally dispatcher-first, Postgres-first, and anti-spaghetti.


1) coordinator/AGENTS.md

# AGENTS.md

## Mission
You are the Coordinator.
You own strategic routing, replanning, priority changes, and cross-specialist assignment.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending:
  - success: `agentbus complete-job`
  - blocked: `agentbus block-job`
  - retry or terminal failure: `agentbus fail-job`
- Prefer `agentbus enqueue-job` over `sessions_send` for work handoff.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Every 15 minutes, review queued, blocked, and stale work and rebalance priorities.
- When a specialist blocks, decide reroute, retry, or escalation.
- When work crosses specialties, split it into smaller jobs with clear owners.
- Keep the global work graph coherent.
- When a deliverable is ready, hand it to `project-tracker` or `publisher`.

## Allowed Handoffs
- Any specialist
- `project-tracker`
- `publisher`

## Tools Discipline
- You may inspect other sessions for debugging, orchestration, and status synthesis.
- Do not use cross-agent chat as the primary durable work bus.

## Escalate When
- Two or more roles disagree on next action.
- Work is blocked by missing policy or external decision.
- Priority conflicts cannot be resolved from standing orders.

## Do Not
- Do deep specialist work unless the payload explicitly asks.
- Invent that another agent was informed without writing a job/event.
- Publish human-facing output directly unless explicitly instructed.

2) auditor/AGENTS.md

# AGENTS.md

## Mission
You are the Auditor.
You own cron health, lease reaping, standing-order gap detection, stuck work detection, and operational anomaly reporting.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- If you find a systemic issue, record it durably before summarizing it.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Every 5 minutes, check expired leases and requeue eligible work.
- Every 10 minutes, check for standing-order gaps and missing run records.
- Every 15 minutes, inspect for stuck jobs, dead-letter growth, and repeating failures.
- Produce audit findings that are legible and actionable.
- Escalate systemic patterns, not just single symptoms.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher`
- Recovery work for the affected specialist when policy already allows it

## Tools Discipline
- You may inspect sessions and system state for audit purposes.
- Favor evidence over guesswork.
- Requeue or escalate, do not silently “fix and forget”.

## Escalate When
- The same job fails repeatedly across attempts.
- Cron is not firing or standing-order evidence is missing.
- A specialist keeps returning without durable terminal updates.

## Do Not
- Rewrite global priorities unless explicitly instructed.
- Act as publisher.
- Hide anomalies because they seem inconvenient.
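
The auditor's "standing-order gaps and missing run records" check can be sketched as slot arithmetic over run timestamps. This assumes each standing-order run leaves a start timestamp in `standing_order_runs`; the table's actual columns are not shown here, so the function takes plain `datetime` values, and `find_missed_windows` and its `grace` parameter are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def find_missed_windows(
    run_starts: list[datetime],
    interval: timedelta,
    window_start: datetime,
    window_end: datetime,
    grace: timedelta = timedelta(0),
) -> list[datetime]:
    """Return the start of every complete schedule slot that has no recorded run."""
    missed: list[datetime] = []
    slot = window_start
    while slot + interval <= window_end:
        slot_end = slot + interval + grace
        # A run that found nothing still has a row, so it is not a gap.
        if not any(slot <= ts < slot_end for ts in run_starts):
            missed.append(slot)
        slot += interval
    return missed


start = datetime(2026, 3, 7, 12, 0, tzinfo=timezone.utc)
runs = [start + timedelta(seconds=10), start + timedelta(minutes=5, seconds=3),
        start + timedelta(minutes=15, seconds=1)]
print(find_missed_windows(runs, timedelta(minutes=5), start, start + timedelta(minutes=20)))
```

Each returned slot is a window where the order did not run at all, which is the case a `standing_order.missed` event is meant to record.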

3) project-tracker/AGENTS.md

# AGENTS.md

## Mission
You are the Project Tracker.
You own cross-specialist status synthesis, milestone tracking, debrief collection, and project-level rollups.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- If debriefs or evidence are missing, record that explicitly.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Every hour, collect specialist debriefs and refresh project-level status.
- Track milestone movement, blockers, ownership gaps, and deliverable readiness.
- Convert scattered specialist output into coherent project state.
- When a human-facing brief is ready, hand it to `publisher`.
- When missing information blocks status accuracy, request follow-up work.

## Allowed Handoffs
- `coordinator`
- `publisher`
- Specialists only for debrief/status follow-ups, not strategic routing

## Tools Discipline
- You may inspect session history/status to build accurate rollups.
- Do not invent status for work that has no durable evidence.

## Escalate When
- A project is blocked by unresolved specialist disagreement.
- A deliverable is claimed ready but lacks evidence.
- Required debriefs are missing and the gap materially affects reporting.

## Do Not
- Take over routing authority from the coordinator.
- Invent completion.
- Publish directly unless explicitly instructed.

4) knowledge/AGENTS.md

# AGENTS.md

## Mission
You are the Knowledge specialist.
You own durable notes, memory consolidation, knowledge packets, deduplication, and fact organization.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- Preserve provenance for important facts.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Every 6 hours, consolidate new notes and promote durable facts.
- Dedupe overlapping knowledge without losing source context.
- Build structured packets for downstream agents when knowledge is ready to use.
- Surface conflicts in the knowledge base instead of smoothing them over.
- Hand project-facing rollups to `project-tracker` and human-facing briefs to `publisher`.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher`

## Tools Discipline
- Favor clear structure, provenance, and retrieval quality.
- Make memory more useful, not just larger.

## Escalate When
- Two sources conflict materially.
- A source is missing for a high-impact claim.
- Knowledge structure is blocking downstream execution.

## Do Not
- Rewrite source truth without evidence.
- Route work strategically.
- Publish directly unless explicitly instructed.

5) devops/AGENTS.md

# AGENTS.md

## Mission
You are the DevOps specialist.
You own infrastructure checks, service status, deployment review, runbook execution, and environment diagnosis.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- Record operational evidence before claiming success.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Run scheduled infra and service health checks.
- Validate deployment state, config drift, and environment readiness.
- Use runbooks for recurring operational work when available.
- Hand verified operator summaries to `publisher` when useful.
- Surface anything that needs strategic replanning back to `coordinator`.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher`

## Tools Discipline
- Be explicit about what you observed, changed, and verified.
- Prefer reversible, auditable actions.
- If destructive action is not clearly authorized, block and escalate.

## Escalate When
- A service change would be risky or irreversible.
- Environment state is ambiguous.
- Operational recovery requires policy or product judgment.

## Do Not
- Silently perform destructive changes unless standing orders clearly allow it.
- Pretend infra is healthy without evidence.
- Take over security ownership.

6) product-metrics/AGENTS.md

# AGENTS.md

## Mission
You are the Product/Metrics specialist.
You own KPI analysis, experiment readouts, funnel interpretation, and product health summaries.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- Make uncertainty explicit when data is incomplete.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Run scheduled KPI and anomaly checks.
- Produce experiment summaries with clear confidence and caveats.
- Surface product-health changes that should affect priorities.
- Hand project implications to `project-tracker`.
- Hand polished metric summaries to `publisher`.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher`

## Tools Discipline
- Separate observation from recommendation.
- Do not hide weak evidence behind strong wording.

## Escalate When
- Metrics point in conflicting directions.
- A decision depends on data you do not have.
- A product anomaly likely needs cross-specialist response.

## Do Not
- Invent certainty.
- Drift into design/security/research ownership.
- Publish directly unless explicitly instructed.

7) design/AGENTS.md

# AGENTS.md

## Mission
You are the Design specialist.
You own UX/UI/system design options, critiques, user flows, and usability-focused recommendations.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- If tradeoffs are unresolved, record them explicitly.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Produce design options with clear tradeoffs.
- Review flows for clarity, friction, and consistency.
- Convert rough asks into concrete briefs when possible.
- Hand project-ready design status to `project-tracker`.
- Hand human-facing design summaries to `publisher`.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher`

## Tools Discipline
- Make tradeoffs concrete.
- Flag assumptions about engineering, data, or security.

## Escalate When
- Design depends on missing product or technical decisions.
- A recommendation is blocked by unclear constraints.
- Competing design goals cannot be reconciled locally.

## Do Not
- Claim feasibility you did not validate.
- Take over product metrics or research ownership.
- Publish directly unless explicitly instructed.

8) security/AGENTS.md

# AGENTS.md

## Mission
You are the Security specialist.
You own threat review, vulnerability triage, risk classification, and security gating.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- If risk is real, say so plainly and durably.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Run scheduled dependency and risk checks.
- Triage findings into acceptable risk, mitigated risk, and blockers.
- Record evidence and recommended mitigations.
- Hand strategic implications to `coordinator`.
- Hand project-visible risk summaries to `project-tracker`.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher` only for approved human-facing summaries

## Tools Discipline
- Be precise about severity, likelihood, and impact.
- Prefer explicit mitigation over vague caution.

## Escalate When
- Work should be blocked on risk grounds.
- Evidence is incomplete but impact could be high.
- Security concerns require broader strategic decision-making.

## Do Not
- Downplay real risk to keep work moving.
- Approve changes without evidence.
- Take over devops ownership unless explicitly instructed.

9) research/AGENTS.md

# AGENTS.md

## Mission
You are the Research specialist.
You own external scanning, source gathering, synthesis packets, and uncertainty-aware backgrounding.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- Distinguish clearly between evidence, interpretation, and unknowns.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Run scheduled scans where standing orders call for them.
- Build concise research packets that downstream agents can act on.
- Surface source conflicts instead of flattening them.
- Hand project implications to `project-tracker`.
- Hand publishable briefs to `publisher`.

## Allowed Handoffs
- `coordinator`
- `project-tracker`
- `publisher`

## Tools Discipline
- Prefer source quality over source quantity.
- Preserve uncertainty when it matters.

## Escalate When
- Sources materially disagree.
- The question cannot be answered from the available evidence.
- Research findings imply strategic reprioritization.

## Do Not
- Present unverified claims as settled.
- Drift into product/design/security decision ownership.
- Publish directly unless explicitly instructed.

10) publisher/AGENTS.md

# AGENTS.md

## Mission
You are the Publisher.
You own human-facing summaries, digests, reports, debrief packets, and polished delivery.

## Dispatcher Contract
- Treat the dispatcher payload as the source of truth for this run.
- Process exactly one dispatcher-owned job at a time.
- Use `agentbus` for all durable state changes.
- Close every dispatcher-owned job before ending.
- Only publish from durable upstream facts.
- Return a short summary only after durable writes are complete.

## Standing Orders
- Produce daily/weekly digests from completed and verified work.
- Turn project rollups into clear audience-appropriate reports.
- Make uncertainty and missing inputs explicit in final output.
- If source material is incomplete, send the gap back to `project-tracker` or `coordinator`.
- Keep published language accurate, legible, and non-hallucinatory.

## Allowed Handoffs
- `coordinator`
- `project-tracker`

## Tools Discipline
- Publication is formatting plus synthesis, not invention.
- Trace all important claims back to durable upstream state.

## Escalate When
- Source material is contradictory.
- A claim cannot be supported from durable evidence.
- The requested audience/context is unclear.

## Do Not
- Create new operational or strategic work on your own.
- Invent facts to make the summary cleaner.
- Act as auditor.

If you want, I can do one last packaging pass and give you:

  1. a workspace tree for all 10 agents, and
  2. the exact cron schedule matrix I’d pair with these AGENTS.md files.

nimble phoenix
#

Yes — workspace tree and cron schedule matrix for all 10 agents.

warm elmBOT
#

Yep, here’s the layout I’d actually run.

Assumptions:

• Gateway timezone is UTC
• All cron jobs are **--no-deliver**
• Publisher handles human-facing delivery through normal agent behavior, not cron announce
• Only coordinator gets a non-empty HEARTBEAT.md

  1. Workspace tree

Use this pattern under ~/.openclaw/:

~/.openclaw/
├─ workspace-coordinator/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md
│  ├─ MEMORY.md
│  ├─ plans/
│  ├─ queue-views/
│  ├─ escalations/
│  └─ outputs/
│
├─ workspace-auditor/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ audits/
│  ├─ lease-reports/
│  ├─ cron-health/
│  └─ incident-rollups/
│
├─ workspace-project-tracker/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ projects/
│  ├─ debriefs/
│  ├─ milestones/
│  └─ rollups/
│
├─ workspace-knowledge/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ notes/
│  ├─ packets/
│  ├─ indexes/
│  └─ consolidations/
│
├─ workspace-devops/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ runbooks/
│  ├─ service-checks/
│  ├─ environment-notes/
│  └─ incidents/
│
├─ workspace-product-metrics/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ dashboards/
│  ├─ analyses/
│  ├─ experiments/
│  └─ daily-rollups/
│
├─ workspace-design/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ briefs/
│  ├─ critiques/
│  ├─ flows/
│  └─ deliverables/
│
├─ workspace-security/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ findings/
│  ├─ reviews/
│  ├─ threat-models/
│  └─ policy-notes/
│
├─ workspace-research/
│  ├─ AGENTS.md
│  ├─ HEARTBEAT.md        # empty
│  ├─ MEMORY.md
│  ├─ scans/
│  ├─ packets/
│  ├─ sources/
│  └─ briefs/
│
└─ workspace-publisher/
   ├─ AGENTS.md
   ├─ HEARTBEAT.md        # empty
   ├─ MEMORY.md
   ├─ digests/
   ├─ reports/
   ├─ outgoing/
   └─ templates/
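If it helps, the tree above could be scaffolded with a short shell function. This is only a sketch: `scaffold_workspaces` is a hypothetical helper name (not an OpenClaw command), the directory names are copied from the tree, and it uses `touch` so existing files are never overwritten.

```shell
# Sketch: create the per-agent workspace tree shown above.
# scaffold_workspaces is a hypothetical helper, not an OpenClaw command.
scaffold_workspaces() {
  local base="${1:-$HOME/.openclaw}"
  local agent subdirs ws d
  # One line per agent: name, then its subdirectories from the tree.
  while IFS=: read -r agent subdirs; do
    ws="$base/workspace-$agent"
    mkdir -p "$ws"
    # touch creates missing files but never truncates existing ones.
    touch "$ws/AGENTS.md" "$ws/HEARTBEAT.md" "$ws/MEMORY.md"
    for d in $(printf '%s' "$subdirs" | tr ',' ' '); do
      mkdir -p "$ws/$d"
    done
  done <<'EOF'
coordinator:plans,queue-views,escalations,outputs
auditor:audits,lease-reports,cron-health,incident-rollups
project-tracker:projects,debriefs,milestones,rollups
knowledge:notes,packets,indexes,consolidations
devops:runbooks,service-checks,environment-notes,incidents
product-metrics:dashboards,analyses,experiments,daily-rollups
design:briefs,critiques,flows,deliverables
security:findings,reviews,threat-models,policy-notes
research:scans,packets,sources,briefs
publisher:digests,reports,outgoing,templates
EOF
}
```

Calling `scaffold_workspaces` with no argument targets ~/.openclaw; passing another path lets you try it in a scratch directory first.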

What goes in the shared files

**AGENTS.md**: the role/dispatcher/standing-order contract you already asked for
**HEARTBEAT.md**: empty for 9 agents, tiny checklist only for coordinator
**MEMORY.md**: small durable facts per role, not transcripts, not logs

───

  2. Coordinator HEARTBEAT.md

Only the coordinator should have a real heartbeat file.

- Review newly completed jobs that may require replanning.
- Surface blocked work that has sat too long without reroute.
- Check whether any cross-specialist dependency changed priority.
- If nothing needs attention, reply HEARTBEAT_OK.

Everyone else: leave HEARTBEAT.md empty.

───

  3. Cron schedule matrix

Here’s the schedule I’d start with.

| Agent           | Cron name                  | Cron (UTC)   | Session                        | Purpose                                                            |
| --------------- | -------------------------- | ------------ | ------------------------------ | ------------------------------------------------------------------ |
| coordinator     | coord-backlog-rebalance    | */15 * * * * | session:coordinator-planning   | Re-rank queued/blocked work, reroute cross-specialist dependencies |
| coordinator     | coord-daily-replan         | 0 6 * * *    | session:coordinator-planning   | Daily strategic replan and cleanup of stale priorities             |
| auditor         | audit-lease-reaper         | */5 * * * *  | isolated                       | Requeue expired leases, detect workers that never closed the loop  |
| auditor         | audit-standing-order-gaps  | */10 * * * * | isolated                       | Detect missing standing_order_runs and cron gaps                   |
| auditor         | audit-failure-clusters     | */15 * * * * | isolated                       | Detect repeated failure patterns and dead-letter growth            |
| auditor         | audit-daily-summary        | 0 7 * * *    | session:auditor-daily          | Produce daily audit rollup                                         |
| project-tracker | tracker-debrief-sweep      | 0 * * * *    | session:project-tracker-rollup | Collect specialist debriefs and refresh project status             |
| project-tracker | tracker-daily-rollup       | 0 18 * * *   | session:project-tracker-rollup | Daily milestone and deliverable rollup                             |
| knowledge       | knowledge-consolidate      | 0 */6 * * *  | session:knowledge-maintenance  | Promote durable facts, dedupe notes, refresh packets               |
| knowledge       | knowledge-hygiene          | 30 2 * * *   | session:knowledge-maintenance  | Daily cleanup/index hygiene                                        |
| devops          | devops-health-check        | */10 * * * * | isolated                       | Service and environment health sweep                               |
| devops          | devops-drift-review        | 0 * * * *    | isolated                       | Hourly config/state drift review                                   |
| product-metrics | metrics-anomaly-check      | 15 * * * *   | isolated                       | KPI/funnel anomaly detection                                       |
| product-metrics | metrics-daily-rollup       | 30 6 * * *   | session:metrics-daily          | Daily product health summary                                       |
| design          | design-queue-triage        | 0 9 * * 1-5  | session:design-backlog         | Triage design backlog and unresolved UX work                       |
| design          | design-weekly-review       | 0 10 * * 1   | session:design-backlog         | Weekly design debt and critique review                             |
| security        | security-dependency-review | 0 */6 * * *  | isolated                       | Dependency and risk review sweep                                   |
| security        | security-daily-risk-rollup | 45 6 * * *   | session:security-daily         | Daily risk rollup and blockers                                     |
| research        | research-watchlist-scan    | 30 */6 * * * | isolated                       | Run scheduled research scans/watchlists                            |
| research        | research-daily-brief       | 15 7 * * *   | session:research-briefs        | Daily synthesis brief                                              |
| publisher       | publisher-daily-digest     | 0 18 * * *   | session:publisher-editorial    | Build the daily human-facing digest                                |
| publisher       | publisher-weekly-report    | 0 17 * * 5   | session:publisher-editorial    | Build the weekly report                                            |
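One thing worth checking in a matrix like this is which jobs fire on exactly the same schedule. Below is a rough sketch of such a check. `find_shared_schedules` is a hypothetical helper, the `name|cron` input format is an assumption, and it only catches identical cron strings, not every possible overlap.

```shell
# Sketch: list cron expressions that are used by more than one job.
# Input on stdin: one "job-name|cron expression" pair per line (assumed format).
find_shared_schedules() {
  awk -F'|' '
    {
      # Collect job names per cron expression, in input order.
      if ($2 in names) names[$2] = names[$2] ", " $1
      else names[$2] = $1
      count[$2]++
    }
    END {
      for (expr in count)
        if (count[expr] > 1) print expr " -> " names[expr]
    }
  ' | sort
}
```

Fed the matrix above, this would flag, for example, that tracker-daily-rollup and publisher-daily-digest both use `0 18 * * *`, which may matter if the digest is meant to read the finished rollup.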

───

  4. Which agents should be mostly reactive vs mostly scheduled

Mostly reactive

These should spend most of their time on dispatcher-owned jobs:

devops
design
security
research

Mixed

These need both scheduled and reactive work:

knowledge
product-metrics
project-tracker
publisher

Mostly control-plane

These are heavy on scheduled housekeeping/replanning:

coordinator
auditor

───

  5. Practical cron command pattern

I’d use this shape for every job:

openclaw cron add \
  --name "coord-backlog-rebalance" \
  --cron "*/15 * * * *" \
  --session session:coordinator-planning \
  --agent coordinator \
  --message "Run coordinator standing order: review queued, blocked, and stale work; rebalance priorities; emit follow-up jobs/events; close all durable updates before finishing." \
  --no-deliver

Change:

--name
--cron
--session
--agent
--message

Everything else stays basically the same.

───

  6. One opinionated note

If you want the system to stay understandable six weeks from now:

• keep heartbeat tiny
• keep cron explicit
• keep specialists mostly reactive
• keep coordinator + auditor as the only real control-plane agents

That’s the difference between “multi-agent” and “distributed confusion” 😅

If you want, I can do one final pass and generate the exact openclaw cron add ... commands for all 22 scheduled jobs in copy-paste form.

nimble phoenix
#

Yes — all 22 cron add commands in copy-paste form.

warm elmBOT
#

Yep. Assuming UTC, here are all 22 in copy-paste form.

# coordinator

openclaw cron add \
  --name "coord-backlog-rebalance" \
  --cron "*/15 * * * *" \
  --tz "UTC" \
  --session session:coordinator-planning \
  --agent coordinator \
  --message "Run coordinator standing order: review queued, blocked, and stale work; rebalance priorities; reroute cross-specialist dependencies; emit follow-up jobs/events; close all durable state before finishing." \
  --no-deliver

openclaw cron add \
  --name "coord-daily-replan" \
  --cron "0 6 * * *" \
  --tz "UTC" \
  --session session:coordinator-planning \
  --agent coordinator \
  --message "Run coordinator standing order: perform the daily strategic replan; clean up stale priorities; identify blocked themes; emit updated assignments; close all durable state before finishing." \
  --no-deliver


# auditor

openclaw cron add \
  --name "audit-lease-reaper" \
  --cron "*/5 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor standing order: detect expired leases, requeue eligible jobs, record findings, and flag any worker that returned without durable terminal job updates." \
  --no-deliver

openclaw cron add \
  --name "audit-standing-order-gaps" \
  --cron "*/10 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor standing order: detect standing-order gaps, missing run records, cron health issues, and record durable audit events for anything missing or degraded." \
  --no-deliver

openclaw cron add \
  --name "audit-failure-clusters" \
  --cron "*/15 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor standing order: inspect repeated failures, dead-letter growth, retry storms, and clustered job breakdowns; record anomalies and enqueue escalations if needed." \
  --no-deliver

openclaw cron add \
  --name "audit-daily-summary" \
  --cron "0 7 * * *" \
  --tz "UTC" \
  --session session:auditor-daily \
  --agent auditor \
  --message "Run auditor standing order: compile the daily audit summary covering cron health, lease reaping, standing-order gaps, repeat failures, and unresolved operational anomalies." \
  --no-deliver


# project-tracker

openclaw cron add \
  --name "tracker-debrief-sweep" \
  --cron "0 * * * *" \
  --tz "UTC" \
  --session session:project-tracker-rollup \
  --agent project-tracker \
  --message "Run project-tracker standing order: collect specialist debriefs, refresh project-level status, identify missing ownership or missing evidence, and emit follow-up jobs where needed." \
  --no-deliver

openclaw cron add \
  --name "tracker-daily-rollup" \
  --cron "0 18 * * *" \
  --tz "UTC" \
  --session session:project-tracker-rollup \
  --agent project-tracker \
  --message "Run project-tracker standing order: produce the daily milestone and deliverable rollup, record blockers, identify publication-ready outputs, and hand them to publisher when appropriate." \
  --no-deliver


# knowledge

openclaw cron add \
  --name "knowledge-consolidate" \
  --cron "0 */6 * * *" \
  --tz "UTC" \
  --session session:knowledge-maintenance \
  --agent knowledge \
  --message "Run knowledge standing order: consolidate new notes, promote durable facts, dedupe overlapping knowledge, refresh retrieval structure, and emit knowledge-ready packets for downstream use." \
  --no-deliver

openclaw cron add \
  --name "knowledge-hygiene" \
  --cron "30 2 * * *" \
  --tz "UTC" \
  --session session:knowledge-maintenance \
  --agent knowledge \
  --message "Run knowledge standing order: perform daily hygiene on notes, packets, and indexes; remove duplication; surface conflicts; and improve future retrieval quality." \
  --no-deliver


# devops

openclaw cron add \
  --name "devops-health-check" \
  --cron "*/10 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent devops \
  --message "Run devops standing order: check service health, environment readiness, and operational status; record evidence; and escalate anything that needs recovery or replanning." \
  --no-deliver

openclaw cron add \
  --name "devops-drift-review" \
  --cron "0 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent devops \
  --message "Run devops standing order: review configuration and environment drift, capture differences that matter, and emit follow-up work or operator notes when drift needs attention." \
  --no-deliver


# product-metrics

openclaw cron add \
  --name "metrics-anomaly-check" \
  --cron "15 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent product-metrics \
  --message "Run product-metrics standing order: check for KPI, funnel, or experiment anomalies; record findings with confidence and caveats; and emit follow-up work if priorities should change." \
  --no-deliver

openclaw cron add \
  --name "metrics-daily-rollup" \
  --cron "30 6 * * *" \
  --tz "UTC" \
  --session session:metrics-daily \
  --agent product-metrics \
  --message "Run product-metrics standing order: produce the daily product-health summary, explain meaningful changes, preserve uncertainty, and hand ready summaries to project-tracker or publisher." \
  --no-deliver


# design

openclaw cron add \
  --name "design-queue-triage" \
  --cron "0 9 * * 1-5" \
  --tz "UTC" \
  --session session:design-backlog \
  --agent design \
  --message "Run design standing order: triage the design backlog, surface unresolved UX work, convert vague asks into concrete briefs when possible, and flag missing constraints." \
  --no-deliver

openclaw cron add \
  --name "design-weekly-review" \
  --cron "0 10 * * 1" \
  --tz "UTC" \
  --session session:design-backlog \
  --agent design \
  --message "Run design standing order: perform the weekly design review, revisit design debt, critique unresolved flows, and prepare project-ready design summaries." \
  --no-deliver


# security

openclaw cron add \
  --name "security-dependency-review" \
  --cron "0 */6 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent security \
  --message "Run security standing order: review dependencies and known risks, classify findings, record mitigations or blockers, and emit security events that affect downstream work." \
  --no-deliver

openclaw cron add \
  --name "security-daily-risk-rollup" \
  --cron "45 6 * * *" \
  --tz "UTC" \
  --session session:security-daily \
  --agent security \
  --message "Run security standing order: produce the daily risk rollup, summarize active blockers and acceptable risks, and hand project-visible summaries to project-tracker or coordinator." \
  --no-deliver


# research

openclaw cron add \
  --name "research-watchlist-scan" \
  --cron "30 */6 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent research \
  --message "Run research standing order: execute scheduled scans and watchlist checks, gather source material, preserve uncertainty, and produce durable research packets for downstream roles." \
  --no-deliver

openclaw cron add \
  --name "research-daily-brief" \
  --cron "15 7 * * *" \
  --tz "UTC" \
  --session session:research-briefs \
  --agent research \
  --message "Run research standing order: synthesize the daily brief from recent scans, identify source conflicts or gaps, and hand project-ready or publishable briefs to the appropriate role." \
  --no-deliver


# publisher

openclaw cron add \
  --name "publisher-daily-digest" \
  --cron "0 18 * * *" \
  --tz "UTC" \
  --session session:publisher-editorial \
  --agent publisher \
  --message "Run publisher standing order: build the daily human-facing digest from completed and verified work, preserve uncertainty where needed, and ensure every important claim traces back to durable upstream facts." \
  --no-deliver

openclaw cron add \
  --name "publisher-weekly-report" \
  --cron "0 17 * * 5" \
  --tz "UTC" \
  --session session:publisher-editorial \
  --agent publisher \
  --message "Run publisher standing order: build the weekly report from verified project rollups, specialist outputs, and audit-ready summaries, and make missing inputs explicit instead of inventing them." \
  --no-deliver

If you want, I can also bundle these into a single shell script with comments + set -euo pipefail, so you can run the whole setup in one shot.

nimble phoenix
#

please

warm elmBOT
#

Yep, here’s the bundle. One important note: this is a create-once script, not idempotent. If you rerun it, you’ll create duplicates.

#!/usr/bin/env bash
set -euo pipefail

add_job() {
  local name="$1"
  local cron_expr="$2"
  local session_target="$3"
  local agent_id="$4"
  local message="$5"

  openclaw cron add \
    --name "$name" \
    --cron "$cron_expr" \
    --tz "UTC" \
    --session "$session_target" \
    --agent "$agent_id" \
    --message "$message" \
    --no-deliver
}

# coordinator
add_job "coord-backlog-rebalance" "*/15 * * * *" "session:coordinator-planning" "coordinator" \
  "Run coordinator standing order: review queued, blocked, and stale work; rebalance priorities; reroute cross-specialist dependencies; emit follow-up jobs/events; close all durable state before finishing."

add_job "coord-daily-replan" "0 6 * * *" "session:coordinator-planning" "coordinator" \
  "Run coordinator standing order: perform the daily strategic replan; clean up stale priorities; identify blocked themes; emit updated assignments; close all durable state before finishing."

# auditor
add_job "audit-lease-reaper" "*/5 * * * *" "isolated" "auditor" \
  "Run auditor standing order: detect expired leases, requeue eligible jobs, record findings, and flag any worker that returned without durable terminal job updates."

add_job "audit-standing-order-gaps" "*/10 * * * *" "isolated" "auditor" \
  "Run auditor standing order: detect standing-order gaps, missing run records, cron health issues, and record durable audit events for anything missing or degraded."

add_job "audit-failure-clusters" "*/15 * * * *" "isolated" "auditor" \
  "Run auditor standing order: inspect repeated failures, dead-letter growth, retry storms, and clustered job breakdowns; record anomalies and enqueue escalations if needed."

add_job "audit-daily-summary" "0 7 * * *" "session:auditor-daily" "auditor" \
  "Run auditor standing order: compile the daily audit summary covering cron health, lease reaping, standing-order gaps, repeat failures, and unresolved operational anomalies."

# project-tracker
add_job "tracker-debrief-sweep" "0 * * * *" "session:project-tracker-rollup" "project-tracker" \
  "Run project-tracker standing order: collect specialist debriefs, refresh project-level status, identify missing ownership or missing evidence, and emit follow-up jobs where needed."

add_job "tracker-daily-rollup" "0 18 * * *" "session:project-tracker-rollup" "project-tracker" \
  "Run project-tracker standing order: produce the daily milestone and deliverable rollup, record blockers, identify publication-ready outputs, and hand them to publisher when appropriate."

# knowledge
add_job "knowledge-consolidate" "0 */6 * * *" "session:knowledge-maintenance" "knowledge" \
  "Run knowledge standing order: consolidate new notes, promote durable facts, dedupe overlapping knowledge, refresh retrieval structure, and emit knowledge-ready packets for downstream use."

add_job "knowledge-hygiene" "30 2 * * *" "session:knowledge-maintenance" "knowledge" \
  "Run knowledge standing order: perform daily hygiene on notes, packets, and indexes; remove duplication; surface conflicts; and improve future retrieval quality."

# devops
add_job "devops-health-check" "*/10 * * * *" "isolated" "devops" \
  "Run devops standing order: check service health, environment readiness, and operational status; record evidence; and escalate anything that needs recovery or replanning."
add_job "devops-drift-review" "0 * * * *" "isolated" "devops" \
  "Run devops standing order: review configuration and environment drift, capture differences that matter, and emit follow-up work or operator notes when drift needs attention."

# product-metrics
add_job "metrics-anomaly-check" "15 * * * *" "isolated" "product-metrics" \
  "Run product-metrics standing order: check for KPI, funnel, or experiment anomalies; record findings with confidence and caveats; and emit follow-up work if priorities should change."

add_job "metrics-daily-rollup" "30 6 * * *" "session:metrics-daily" "product-metrics" \
  "Run product-metrics standing order: produce the daily product-health summary, explain meaningful changes, preserve uncertainty, and hand ready summaries to project-tracker or publisher."

# design
add_job "design-queue-triage" "0 9 * * 1-5" "session:design-backlog" "design" \
  "Run design standing order: triage the design backlog, surface unresolved UX work, convert vague asks into concrete briefs when possible, and flag missing constraints."

add_job "design-weekly-review" "0 10 * * 1" "session:design-backlog" "design" \
  "Run design standing order: perform the weekly design review, revisit design debt, critique unresolved flows, and prepare project-ready design summaries."

# security
add_job "security-dependency-review" "0 */6 * * *" "isolated" "security" \
  "Run security standing order: review dependencies and known risks, classify findings, record mitigations or blockers, and emit security events that affect downstream work."

add_job "security-daily-risk-rollup" "45 6 * * *" "session:security-daily" "security" \
  "Run security standing order: produce the daily risk rollup, summarize active blockers and acceptable risks, and hand project-visible summaries to project-tracker or coordinator."

# research
add_job "research-watchlist-scan" "30 */6 * * *" "isolated" "research" \
  "Run research standing order: execute scheduled scans and watchlist checks, gather source material, preserve uncertainty, and produce durable research packets for downstream roles."

add_job "research-daily-brief" "15 7 * * *" "session:research-briefs" "research" \
  "Run research standing order: synthesize the daily brief from recent scans, identify source conflicts or gaps, and hand project-ready or publishable briefs to the appropriate role."

# publisher
add_job "publisher-daily-digest" "0 18 * * *" "session:publisher-editorial" "publisher" \
  "Run publisher standing order: build the daily human-facing digest from completed and verified work, preserve uncertainty where needed, and ensure every important claim traces back to durable upstream facts."

add_job "publisher-weekly-report" "0 17 * * 5" "session:publisher-editorial" "publisher" \
  "Run publisher standing order: build the weekly report from verified project rollups, specialist outputs, and audit-ready summaries, and make missing inputs explicit instead of inventing them."

echo "Done. Added all 22 cron jobs."

Save as something like setup-openclaw-crons.sh, then:

chmod +x setup-openclaw-crons.sh
./setup-openclaw-crons.sh
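Since the script is create-once, one way to make reruns safer is to wrap `add_job` in a guard that skips names that already exist. This is a sketch under assumptions: `existing_job_names` and `EXISTING_JOBS_FILE` are hypothetical, and how you actually list registered job names depends on your OpenClaw CLI, so that part is left as a hook for you to fill in.

```shell
# Hypothetical hook: print already-registered cron job names, one per line.
# Here it reads a plain file; swap in whatever your setup uses to list jobs.
existing_job_names() {
  cat "${EXISTING_JOBS_FILE:-/dev/null}"
}

# Guarded wrapper around the add_job function from the script above.
# Takes the same five arguments; skips the call if the name is already present.
add_job_once() {
  local name="$1"
  # -F: literal match, -x: whole line, -q: quiet
  if existing_job_names | grep -Fxq -- "$name"; then
    echo "skip: $name already exists"
    return 0
  fi
  add_job "$@"
}
```

With this in place, each `add_job "..."` call in the script could become `add_job_once "..."` with the same arguments.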

Glad this came together 👍

nimble phoenix
#

Question for you... how can we take this and make the agents more proactive? I'd like them to come up with their own ideas and bring them to me, rather than only being reactive.

warm elmBOT
#

Yep, you can do that, but I’d do it as a separate proactivity loop, not by making every agent freestyle all day.

The clean pattern

Keep your current system for:

  • reactive work
  • durable handoffs
  • auditability

Then add a second lane:

  1. Detect opportunities
  2. Score them
  3. Route them
  4. Bundle them to you

That gives you proactive agents without turning the whole system into noise.

The role split I’d use

Scouts, they find ideas

  • research
  • product-metrics
  • design
  • security
  • devops
  • knowledge

Synthesizers, they shape ideas

  • project-tracker
  • coordinator

Skeptic

  • auditor

Narrator

  • publisher

That’s the clean split.

What “proactive” should mean

Not “random brainstorming”.

It should mean:

  • spotting leverage
  • spotting risk early
  • spotting missing work
  • spotting automation candidates
  • spotting experiments worth running
  • bringing you ranked proposals, not raw thoughts

Add an opportunities lane

You can do this with:

  • a new table, opportunities
  • or just a new event topic family

I’d use a real table.

Minimum fields

  • id
  • source_agent
  • theme
  • title
  • why_now
  • evidence
  • impact_score
  • confidence_score
  • urgency_score
  • reversibility
  • owner_candidate
  • suggested_next_step
  • status (new, triaged, accepted, rejected, parked, converted_to_job)
  • created_at

New standing order for every scout

Add something like:

Part of your job is to proactively detect improvements, risks, experiments, and automation opportunities in your domain.
If you find one with credible evidence, record it durably as an opportunity.
Do not spam. Prefer fewer, higher-quality proposals.
If urgency is low, batch for review. If urgency is high, escalate immediately.

The scoring rule

Require every proactive idea to answer:

  • What is the idea?
  • Why now?
  • What evidence supports it?
  • What happens if we do nothing?
  • Who should own it?
  • What’s the smallest next step?

And score it on:

  • impact, 1-5
  • confidence, 1-5
  • reversibility, high/medium/low
  • urgency, 1-5

If it can’t answer those, it’s not ready to come to you.
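The readiness rule above can be sketched as a simple gate. This is illustrative only: `opportunity_ready` and its argument order are hypothetical, and it checks nothing beyond "all six answers are present and the scores are in range".

```shell
# Sketch: return 0 only when a proposal answers all six questions
# and every score is in range. Argument order is a made-up convention:
#   idea, why_now, evidence, cost_of_doing_nothing, owner, next_step,
#   impact, confidence, urgency, reversibility
opportunity_ready() {
  local idea="$1" why_now="$2" evidence="$3" do_nothing="$4" owner="$5" next_step="$6"
  local impact="$7" confidence="$8" urgency="$9" reversibility="${10}"
  local field score
  # Every question needs a non-empty answer.
  for field in "$idea" "$why_now" "$evidence" "$do_nothing" "$owner" "$next_step"; do
    [ -n "$field" ] || return 1
  done
  # impact, confidence, urgency must be a single digit from 1 to 5.
  for score in "$impact" "$confidence" "$urgency"; do
    case "$score" in
      [1-5]) ;;
      *) return 1 ;;
    esac
  done
  # reversibility must be one of the three allowed labels.
  case "$reversibility" in
    high|medium|low) return 0 ;;
    *) return 1 ;;
  esac
}
```

A scout could run a check like this before recording anything, so half-formed ideas never reach the review queue.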

How ideas reach you

I’d make two outbound lanes:

1. Alerts

For:

  • urgent risk
  • major anomaly
  • blocked critical path
  • clear high-confidence opportunity

These can come to you fast.

2. Ideas digest

For:

  • experiments
  • improvements
  • automation proposals
  • cleanup opportunities
  • design/product suggestions

These should be bundled daily or weekly by publisher.

That keeps signal high.
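As a rough illustration of the two lanes, routing could be as small as the function below. `route_opportunity` is a hypothetical name and the thresholds are assumptions, not anything OpenClaw defines; tune them to your own tolerance for interruptions.

```shell
# Sketch: choose an outbound lane from urgency and confidence (both 1-5).
# Thresholds are hypothetical: urgent AND at least moderately confident => alert.
route_opportunity() {
  local urgency="$1" confidence="$2"
  if [ "$urgency" -ge 4 ] && [ "$confidence" -ge 3 ]; then
    echo "alert"
  else
    # Everything else, including low-confidence ideas, waits for the digest.
    echo "digest"
  fi
}
```

For example, `route_opportunity 5 4` prints `alert`, while `route_opportunity 5 2` prints `digest` because low-confidence items get batched.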

Cron I’d add for proactivity

Daily / frequent scans

  • product-metrics: opportunity scan, daily
  • research: external opportunity/risk scan, daily
  • security: preventive hardening review, daily
  • devops: automation candidate scan, daily
  • knowledge: knowledge-gap / reusable asset scan, daily
  • design: UX friction / improvement scan, weekdays

Triage / synthesis

  • project-tracker: hourly or daily opportunity triage
  • coordinator: daily rank-and-convert review

Delivery

  • publisher: daily proactive ideas digest
  • publisher: weekly “top 5 bets” memo

Very important: give them a budget

Otherwise they’ll drown you.

Example:

  • each scout may open at most 3 new opportunities per day
  • only 1 urgent escalation unless severity is high
  • low-confidence ideas must be batched, not pushed instantly

That one rule makes a huge difference.
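A budget like this is easy to check mechanically. Here is a minimal sketch, assuming you can produce one line per opportunity opened today containing just the source agent's name. `agents_over_budget` is a hypothetical helper and the default cap of 3 mirrors the example above.

```shell
# Sketch: report scouts that exceeded their daily opportunity budget.
# Input on stdin: one source agent name per opportunity opened today.
# Output: "<agent> <count>" for each agent strictly over the cap.
agents_over_budget() {
  local cap="${1:-3}"
  sort | uniq -c | awk -v cap="$cap" '$1 > cap { print $2 " " $1 }'
}
```

The auditor or coordinator could run something like this during triage and park the overflow instead of forwarding it.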

My recommendation

If you want them more proactive, add:

  • an opportunity table
  • scout standing orders
  • coordinator/project-tracker triage
  • publisher digests
  • a hard cap on unsolicited ideas

So the system becomes:

  • reactive lane = jobs
  • proactive lane = opportunities
  • human-facing lane = digest + alerts

That’s the version I’d trust.

If you want, I can generate the next pieces too:

  1. opportunities Postgres DDL
  2. agentbus commands for opportunity create/triage/convert
  3. the extra proactive cron jobs for all 10 agents

nimble phoenix
#

Yes — opportunities DDL, agentbus commands, and all proactive cron jobs for all 10 agents.

warm elmBOT
#

Yep. I’d add this as a separate proactive lane, not mixed into the core jobs lane.

1) Opportunities DDL

begin;

create type opportunity_status as enum (
  'new',
  'triaged',
  'accepted',
  'rejected',
  'parked',
  'converted_to_job'
);
create type opportunity_reversibility as enum (
  'high',
  'medium',
  'low'
);

create table opportunities (
  id uuid primary key default gen_random_uuid(),
  namespace text not null default 'default',

  source_agent text not null,
  theme text not null,
  title text not null,
  why_now text not null,
  evidence jsonb not null default '{}'::jsonb,

  impact_score int not null check (impact_score between 1 and 5),
  confidence_score int not null check (confidence_score between 1 and 5),
  urgency_score int not null check (urgency_score between 1 and 5),
  reversibility opportunity_reversibility not null,

  owner_candidate text,
  suggested_next_step text not null,

  status opportunity_status not null default 'new',
  triaged_by text,
  triage_notes text,
  triaged_at timestamptz,

  converted_job_id uuid references jobs(id) on delete set null,
  dedupe_key text,

  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create trigger opportunities_set_updated_at
before update on opportunities
for each row
execute function set_updated_at();

create table opportunity_reviews (
  id bigserial primary key,
  opportunity_id uuid not null references opportunities(id) on delete cascade,
  reviewer text not null,
  decision opportunity_status not null,
  notes text,
  created_at timestamptz not null default now()
);

create index opportunities_status_idx
  on opportunities (namespace, status, created_at desc);

create index opportunities_source_idx
  on opportunities (namespace, source_agent, created_at desc);

create index opportunities_theme_idx
  on opportunities (namespace, theme, created_at desc);

create unique index opportunities_active_dedupe_idx
  on opportunities (namespace, source_agent, dedupe_key)
  where dedupe_key is not null
    and status in ('new', 'triaged', 'accepted', 'parked');

create or replace view open_opportunities_v as
select *
from opportunities
where status in ('new', 'triaged', 'accepted', 'parked');

create or replace function notify_opportunities_ready()
returns trigger
language plpgsql
as $$
begin
  if new.status = 'new' then
    perform pg_notify(
      'opportunities_ready',
      json_build_object(
        'namespace', new.namespace,
        'source_agent', new.source_agent,
        'opportunity_id', new.id
      )::text
    );
  end if;
  return new;
end;
$$;

create trigger opportunities_notify_insert
after insert on opportunities
for each row
execute function notify_opportunities_ready();

create trigger opportunities_notify_update
after update of status on opportunities
for each row
when (
  new.status = 'new'
  and old.status is distinct from new.status
)
execute function notify_opportunities_ready();

commit;

#

2) agentbus opportunity commands

I’d add three first-class commands. That’s enough to make this lane work.

create-opportunity

agentbus create-opportunity \
  --source-agent research \
  --theme "market-shift" \
  --title "Competitor is leaving a gap in onboarding" \
  --why-now "Three recent sources point to churn around setup complexity." \
  --evidence '{"sources":["a","b","c"],"signals":["trend up","complaints rising"]}' \
  --impact-score 4 \
  --confidence-score 3 \
  --urgency-score 3 \
  --reversibility high \
  --owner-candidate coordinator \
  --suggested-next-step "Have coordinator create a scoped experiment-design job." \
  --dedupe-key "market-shift:onboarding-gap:2026-04"
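
A small helper can keep dedupe keys consistent across agents. This is only a sketch: the `theme:topic:YYYY-MM` shape is inferred from the example key above, and `make_dedupe_key` is a hypothetical name, not something agentbus already has.

```python
import re
from datetime import date


def make_dedupe_key(theme: str, topic: str, when: date) -> str:
    """Build a key shaped like 'market-shift:onboarding-gap:2026-04'."""

    def slug(text: str) -> str:
        # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # Month granularity means the same idea can be re-raised next month.
    return f"{slug(theme)}:{slug(topic)}:{when:%Y-%m}"


print(make_dedupe_key("market-shift", "Onboarding gap", date(2026, 4, 17)))
# prints market-shift:onboarding-gap:2026-04
```

Because the unique index only covers active statuses, a second insert with the same key fails while the first opportunity is still open, so agents should treat that error as "already recorded".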

triage-opportunity

agentbus triage-opportunity \
  --opportunity-id 11111111-1111-1111-1111-111111111111 \
  --reviewer coordinator \
  --decision accepted \
  --notes "Strong signal, low-risk next step, convert to proactive job."
#

Allowed decisions:

  • triaged
  • accepted
  • rejected
  • parked

convert-opportunity

This should do one transaction:

  1. create a new jobs row
  2. mark opportunities.status = 'converted_to_job'
  3. set converted_job_id

agentbus convert-opportunity \
  --opportunity-id 11111111-1111-1111-1111-111111111111 \
  --converted-by coordinator \
  --target-agent coordinator \
  --job-type proactive \
  --subject "opportunity.followup:onboarding-gap" \
  --priority 120 \
  --payload '{"origin":"opportunity","kind":"experiment-proposal"}'

3) Proactive cron jobs for all 10 agents

These are the extra proactive jobs, separate from the operational cron jobs you already have.

# coordinator
#

openclaw cron add \
  --name "coord-opportunity-ranking" \
  --cron "0 8 * * *" \
  --tz "UTC" \
  --session session:coordinator-planning \
  --agent coordinator \
  --message "Run coordinator proactive standing order: review new and triaged opportunities, rank them by impact, confidence, urgency, and reversibility, and convert the strongest items into proactive jobs where appropriate." \
  --no-deliver

openclaw cron add \
  --name "coord-weekly-bets-review" \
  --cron "30 8 * * 1" \
  --tz "UTC" \
  --session session:coordinator-planning \
  --agent coordinator \
  --message "Run coordinator proactive standing order: choose the top strategic bets for the week from accepted opportunities, reject weak items, park low-value items, and prepare a clear shortlist for publisher." \
  --no-deliver


# auditor

openclaw cron add \
  --name "audit-opportunity-hygiene" \
  --cron "45 8 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor proactive standing order: inspect the opportunities lane for spam, stale items, duplicate ideas, repeated low-confidence proposals, and budget violations; record findings and escalate only meaningful issues." \
  --no-deliver
#


# project-tracker

openclaw cron add \
  --name "tracker-opportunity-triage" \
  --cron "0 */2 * * *" \
  --tz "UTC" \
  --session session:project-tracker-rollup \
  --agent project-tracker \
  --message "Run project-tracker proactive standing order: group related opportunities, attach missing project context, identify owner candidates, and hand publication-ready ideas to publisher while routing strategic items to coordinator." \
  --no-deliver


# knowledge
#

openclaw cron add \
  --name "knowledge-gap-opportunities" \
  --cron "0 3 * * *" \
  --tz "UTC" \
  --session session:knowledge-maintenance \
  --agent knowledge \
  --message "Run knowledge proactive standing order: identify reusable patterns, missing durable knowledge, consolidation opportunities, and knowledge gaps that would reduce repeated future work; record only high-quality proposals." \
  --no-deliver


# devops

openclaw cron add \
  --name "devops-automation-opportunities" \
  --cron "0 4 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent devops \
  --message "Run devops proactive standing order: identify automation candidates, reliability improvements, observability gaps, and recurring manual toil worth eliminating; record evidence-backed proposals only." \
  --no-deliver


# product-metrics

openclaw cron add \
  --name "metrics-growth-opportunities" \
  --cron "0 5 * * *" \
  --tz "UTC" \
  --session session:metrics-daily \
  --agent product-metrics \
  --message "Run product-metrics proactive standing order: identify growth, retention, funnel, or experiment opportunities from recent data; score them honestly; and record only proposals with evidence and a concrete next step." \
  --no-deliver


# design

openclaw cron add \
  --name "design-ux-opportunities" \
  --cron "30 9 * * 1-5" \
  --tz "UTC" \
  --session session:design-backlog \
  --agent design \
  --message "Run design proactive standing order: identify UX friction, clarity improvements, flow simplifications, and design debt worth addressing; record practical opportunities with explicit tradeoffs and suggested owners." \
  --no-deliver
#


# security

openclaw cron add \
  --name "security-hardening-opportunities" \
  --cron "15 6 * * *" \
  --tz "UTC" \
  --session session:security-daily \
  --agent security \
  --message "Run security proactive standing order: identify preventive hardening, policy tightening, dependency risk reduction, and security process improvements; record only credible, evidence-backed opportunities." \
  --no-deliver


# research
#

openclaw cron add \
  --name "research-external-opportunities" \
  --cron "15 7 * * *" \
  --tz "UTC" \
  --session session:research-briefs \
  --agent research \
  --message "Run research proactive standing order: scan for external trends, market shifts, ecosystem changes, new tools, and emerging risks that create actionable opportunities; preserve uncertainty and record only useful signals." \
  --no-deliver


# publisher

openclaw cron add \
  --name "publisher-opportunity-digest" \
  --cron "30 18 * * *" \
  --tz "UTC" \
  --session session:publisher-editorial \
  --agent publisher \
  --message "Run publisher proactive standing order: compile the daily ideas digest from accepted and high-quality triaged opportunities, keep the list short, preserve evidence, and make the next suggested step explicit." \
  --no-deliver

openclaw cron add \
  --name "publisher-weekly-bets-memo" \
  --cron "45 18 * * 5" \
  --tz "UTC" \
  --session session:publisher-editorial \
  --agent publisher \
  --message "Run publisher proactive standing order: compile the weekly top-bets memo from the strongest accepted opportunities, group them by theme, and keep the output decision-useful rather than exhaustive." \
  --no-deliver
#

4) The one rule I’d add to every scout

Add this to the relevant AGENTS.md files:

You are allowed to be proactive, but not noisy.
Record fewer, stronger opportunities.
Every opportunity must include evidence, a score, an owner candidate, and the smallest useful next step.
If it cannot clear that bar, do not send it upward.

That one rule will save you a lot of junk.
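
If you want the bar enforced in code as well as in AGENTS.md, a pre-flight check along these lines could run before an agent calls create-opportunity. It is a sketch, not part of agentbus: `missing_fields` is a hypothetical helper, and the field names are borrowed from the create-opportunity flags.

```python
# Fields the rule above requires on every opportunity.
REQUIRED_FIELDS = (
    "evidence",
    "impact_score",
    "confidence_score",
    "urgency_score",
    "owner_candidate",
    "suggested_next_step",
)


def missing_fields(opportunity: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = opportunity.get(name)
        # None, "", {} and [] all count as "not provided"; a score of 0 does not.
        if value is None or value == "" or value == {} or value == []:
            missing.append(name)
    return missing


draft = {"evidence": {}, "impact_score": 4, "suggested_next_step": "Scope an experiment"}
print(missing_fields(draft))
```

An empty result means the draft clears the bar; anything else is better filed as a lower-stakes note than pushed upward.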

If you want, I can do one more packaging pass and give you:

  1. the agentbus code additions for create-opportunity, triage-opportunity, and convert-opportunity, and
  2. the AGENTS.md insert block for adding proactive behavior to all 10 agents.
nimble phoenix
#

Yes — agentbus code additions and the AGENTS.md proactive behavior insert block for all 10 agents.

Also: what's the recommended Postgres backup strategy for this setup? What's the best way to get system-wide health visibility without querying the jobs table manually every time? And what happens to in-flight jobs if the dispatcher process dies mid-run — are leases enough?

warm elmBOT
#

Yep. Short version first:

  • Leases are enough for at-least-once recovery
  • Leases are not enough for exactly-once
  • For this setup, I’d add:
    • opportunity commands in agentbus
    • a proactive insert block in every AGENTS.md
    • pgBackRest or WAL-G + off-host storage
    • a system health snapshot view + dashboard
    • optional lease renewal for long jobs

1) agentbus code additions

Add these handlers to your existing agentbus.py.

#
def cmd_create_opportunity(args: argparse.Namespace) -> None:
    evidence = parse_json_input(args.evidence)

    sql = """
    insert into opportunities (
      namespace,
      source_agent,
      theme,
      title,
      why_now,
      evidence,
      impact_score,
      confidence_score,
      urgency_score,
      reversibility,
      owner_candidate,
      suggested_next_step,
      dedupe_key
    )
    values (
      %(namespace)s,
      %(source_agent)s,
      %(theme)s,
      %(title)s,
      %(why_now)s,
      %(evidence)s,
      %(impact_score)s,
      %(confidence_score)s,
      %(urgency_score)s,
      %(reversibility)s,
      %(owner_candidate)s,
      %(suggested_next_step)s,
      %(dedupe_key)s
    )
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "namespace": args.namespace,
                "source_agent": args.source_agent,
                "theme": args.theme,
                "title": args.title,
                "why_now": args.why_now,
                "evidence": Jsonb(evidence),
                "impact_score": args.impact_score,
                "confidence_score": args.confidence_score,
                "urgency_score": args.urgency_score,
                "reversibility": args.reversibility,
                "owner_candidate": args.owner_candidate,
                "suggested_next_step": args.suggested_next_step,
                "dedupe_key": args.dedupe_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)
#

def cmd_triage_opportunity(args: argparse.Namespace) -> None:
    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            """
            insert into opportunity_reviews (
              opportunity_id, reviewer, decision, notes
            )
            values (%s, %s, %s, %s)
            """,
            (args.opportunity_id, args.reviewer, args.decision, args.notes),
        )

        cur.execute(
            """
            update opportunities
            set status = %s,
                triaged_by = %s,
                triage_notes = %s,
                triaged_at = now(),
                updated_at = now()
            where id = %s
            returning *;
            """,
            (args.decision, args.reviewer, args.notes, args.opportunity_id),
        )

        row = cur.fetchone()
        conn.commit()
        print_json(row)
#


def cmd_convert_opportunity(args: argparse.Namespace) -> None:
    extra_payload = parse_json_input(args.payload)

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            "select * from opportunities where id = %s for update",
            (args.opportunity_id,),
        )
        opp = cur.fetchone()
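        if not opp:
            die(f"Opportunity not found: {args.opportunity_id}")
        # Guard against double conversion, which would create a duplicate job.
        if opp["status"] == "converted_to_job":
            die(f"Opportunity already converted: {args.opportunity_id}")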

        payload = {
            "origin": "opportunity",
            "opportunity_id": str(opp["id"]),
            "title": opp["title"],
            "theme": opp["theme"],
            "why_now": opp["why_now"],
            "suggested_next_step": opp["suggested_next_step"],
            **extra_payload,
        }

        cur.execute(
            """
            insert into jobs (
              namespace,
              source_agent,
              target_agent,
              job_type,
              subject,
              priority,
              payload,
              correlation_key
            )
            values (
              %(namespace)s,
              %(source_agent)s,
              %(target_agent)s,
              %(job_type)s,
              %(subject)s,
              %(priority)s,
              %(payload)s,
              %(correlation_key)s
            )
            returning *;
            """,
            {
                "namespace": opp["namespace"],
                "source_agent": args.converted_by,
                "target_agent": args.target_agent,
                "job_type": args.job_type,
                "subject": args.subject,
                "priority": args.priority,
                "payload": Jsonb(payload),
                "correlation_key": f"opportunity:{opp['id']}",
            },
        )
        job = cur.fetchone()
#

        cur.execute(
            """
            insert into opportunity_reviews (
              opportunity_id, reviewer, decision, notes
            )
            values (%s, %s, 'converted_to_job', %s)
            """,
            (
                args.opportunity_id,
                args.converted_by,
                args.notes or f"Converted to job {job['id']}",
            ),
        )
        cur.execute(
            """
            update opportunities
            set status = 'converted_to_job',
                converted_job_id = %s,
                triaged_by = %s,
                triage_notes = %s,
                triaged_at = now(),
                updated_at = now()
            where id = %s
            returning *;
            """,
            (
                job["id"],
                args.converted_by,
                args.notes or f"Converted to job {job['id']}",
                args.opportunity_id,
            ),
        )
        opp_after = cur.fetchone()

        conn.commit()
        print_json({"job": job, "opportunity": opp_after})


def cmd_renew_job_lease(args: argparse.Namespace) -> None:
    sql = """
    update jobs
    set leased_until = now() + make_interval(secs => %(lease_seconds)s),
        updated_at = now()
    where id = %(job_id)s
      and status in ('leased', 'running')
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "job_id": args.job_id,
                "lease_seconds": args.lease_seconds,
            },
        )
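        row = cur.fetchone()
        conn.commit()
        # No row means the job is missing or no longer leased/running, so nothing was renewed.
        if not row:
            die(f"Job not found or not currently leased: {args.job_id}")
        print_json(row)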

Add these parser stanzas:

    create_opp = sub.add_parser("create-opportunity")
    create_opp.add_argument("--namespace", default="default")
    create_opp.add_argument("--source-agent", required=True)
    create_opp.add_argument("--theme", required=True)
    create_opp.add_argument("--title", required=True)
    create_opp.add_argument("--why-now", required=True)
    create_opp.add_argument("--evidence", default="{}")
    create_opp.add_argument("--impact-score", type=int, required=True)
    create_opp.add_argument("--confidence-score", type=int, required=True)
    create_opp.add_argument("--urgency-score", type=int, required=True)
    create_opp.add_argument("--reversibility", required=True, choices=["high", "medium", "low"])
    create_opp.add_argument("--owner-candidate")
    create_opp.add_argument("--suggested-next-step", required=True)
    create_opp.add_argument("--dedupe-key")
    create_opp.set_defaults(func=cmd_create_opportunity)

    triage_opp = sub.add_parser("triage-opportunity")
    triage_opp.add_argument("--opportunity-id", required=True)
    triage_opp.add_argument("--reviewer", required=True)
    triage_opp.add_argument(
        "--decision",
        required=True,
        choices=["triaged", "accepted", "rejected", "parked"],
    )
    triage_opp.add_argument("--notes")
    triage_opp.set_defaults(func=cmd_triage_opportunity)

    convert_opp = sub.add_parser("convert-opportunity")
    convert_opp.add_argument("--opportunity-id", required=True)
    convert_opp.add_argument("--converted-by", required=True)
    convert_opp.add_argument("--target-agent", required=True)
    convert_opp.add_argument("--job-type", default="proactive")
    convert_opp.add_argument("--subject", required=True)
    convert_opp.add_argument("--priority", type=int, default=100)
    convert_opp.add_argument("--payload", default="{}")
    convert_opp.add_argument("--notes")
    convert_opp.set_defaults(func=cmd_convert_opportunity)

    renew_lease = sub.add_parser("renew-job-lease")
    renew_lease.add_argument("job_id")
    renew_lease.add_argument("--lease-seconds", type=int, default=600)
    renew_lease.set_defaults(func=cmd_renew_job_lease)
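
For long jobs, something has to call renew-job-lease on a timer while the work runs. Here is a rough sketch of that loop. `start_lease_renewer` is a hypothetical helper, and the `renew` callable is whatever you wire in, for example a `subprocess.run` of `agentbus renew-job-lease <job_id> --lease-seconds 600`.

```python
import threading
from typing import Callable


def start_lease_renewer(
    renew: Callable[[], None],
    interval_seconds: float,
) -> Callable[[], None]:
    """Call renew() every interval_seconds in a daemon thread until stopped.

    Returns a stop() function; call it when the job finishes.
    """
    stop_event = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout (keep renewing)
        # and True once stop() has set the event.
        while not stop_event.wait(interval_seconds):
            renew()

    thread = threading.Thread(target=loop, name="lease-renewer", daemon=True)
    thread.start()

    def stop() -> None:
        stop_event.set()
        thread.join()

    return stop
```

I’d keep the interval well under the lease length, roughly a third of it, so one slow renewal doesn’t let the lease lapse. Note that if `renew` raises, the thread simply ends and the lease expires as usual, which is the safe failure mode.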

2) AGENTS.md proactive insert block

I’d add this block to all 10 agents, then customize the “your charter” lines below it.

## Proactive Behavior

Part of your job is to originate high-quality opportunities in your domain.

### Proactive rules

- You may propose ideas, risks, experiments, automation candidates, and improvements.
- You must record proactive ideas durably through `agentbus create-opportunity`.
- Do not use chat as the primary durable path for ideas.
- Every opportunity must include:
  - why now
  - evidence
  - impact score
  - confidence score
  - urgency score
  - reversibility
  - owner candidate
  - smallest useful next step
- Prefer fewer, stronger opportunities.
- If an idea is weak, speculative, or unsupported, do not escalate it.

### Proactive budget

- Maximum 3 new opportunities per day unless your standing orders say otherwise.
- Maximum 1 urgent proactive escalation per day unless severity is genuinely high.
- Batch low-confidence ideas for digest review instead of pushing them immediately.

### Conversion rule

- You may create opportunities directly.
- You do not convert them into proactive jobs unless your role explicitly allows it.
- Coordinator, and sometimes project-tracker, own most conversion decisions.

### Your charter

REPLACE_THIS_WITH_ROLE_SPECIFIC_CHARTER

Role-specific charter lines

Coordinator

- Rank, merge, reject, park, or convert opportunities.
- Prefer turning only the strongest opportunities into actual jobs.
- Keep the proactive lane aligned with system priorities.

Auditor

- Find hygiene issues in the opportunities lane itself.
- Detect spam, duplication, stale items, and budget abuse.
- Raise process problems, not random ideas.

Project-tracker

- Group related opportunities across specialists.
- Attach project context, milestone impact, and missing inputs.
- Hand strategic items upward and publishable items outward.

Knowledge

- Look for reusable patterns, missing memory, and knowledge gaps.
- Prefer opportunities that reduce repeated future work.
- Surface conflicts in knowledge as opportunities for cleanup.

DevOps

- Look for automation, reliability, observability, and toil-reduction opportunities.
- Prefer reversible improvements with measurable operational value.
- Record evidence, not vibes.

Product-metrics

- Look for growth, retention, funnel, and experiment opportunities.
- Prefer opportunities supported by real behavior changes or anomalies.
- Distinguish signal from noise.

Design

- Look for UX friction, clarity gaps, and design debt with visible user cost.
- Prefer opportunities that simplify, clarify, or unblock delivery.
- State tradeoffs explicitly.

Security

- Look for preventative hardening, dependency-risk reduction, and policy improvement opportunities.
- Prefer credible, evidence-backed improvements over generic caution.
- Escalate real risk plainly.
#

Research

- Look for external shifts, new tools, market changes, and strategic signals.
- Preserve uncertainty.
- Prefer actionable synthesis over endless scanning.

Publisher

- You do not originate many raw opportunities.
- Your proactive role is to package high-quality accepted ideas into clear human-facing digests.
- Flag missing evidence instead of inventing polish.

3) Recommended Postgres backup strategy

#

For this setup, I would not rely on pg_dump alone.

Recommended

  • Primary: pgBackRest or WAL-G with WAL archiving to off-host storage
  • Secondary: nightly pg_dump -Fc logical backup as belt-and-suspenders
  • Retention:
    • 7 daily
    • 4 weekly
    • 3 monthly
  • Restore drill: test a restore at least monthly
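
pgBackRest and WAL-G manage retention themselves, but if you keep plain pg_dump files you need your own pruning. A minimal sketch of the 7 daily / 4 weekly / 3 monthly selection, assuming one dump per calendar day and a hypothetical `backups_to_keep` helper:

```python
from datetime import date


def backups_to_keep(
    days: list[date], daily: int = 7, weekly: int = 4, monthly: int = 3
) -> set[date]:
    """Pick which dated backups survive a daily/weekly/monthly policy."""
    newest_first = sorted(set(days), reverse=True)
    keep = set(newest_first[:daily])

    # Walking newest-first, setdefault records the newest backup
    # seen in each ISO week and in each calendar month.
    weeks: dict[tuple[int, int], date] = {}
    months: dict[tuple[int, int], date] = {}
    for day in newest_first:
        iso = day.isocalendar()
        weeks.setdefault((iso[0], iso[1]), day)
        months.setdefault((day.year, day.month), day)

    # Dicts preserve insertion order, so the first entries are the most recent periods.
    keep.update(list(weeks.values())[:weekly])
    keep.update(list(months.values())[:monthly])
    return keep
```

Everything not in the returned set is a deletion candidate. I’d run the prune only after the newest dump has been verified and copied off-host.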

Why

This system isn’t just content, it’s operational state:

  • jobs
  • attempts
  • events
  • standing order runs
  • opportunities

Losing 8 hours of that state is painful. So I’d want point-in-time recovery, not just “last night’s dump”.

My actual recommendation

  • If you want serious durability: pgBackRest
  • If you want simple but still good: nightly pg_dump -Fc + off-host copy, but accept more data-loss risk

4) System-wide health visibility

Best answer: one summary view + one dashboard + one auditor digest.

I’d build:

  • queue_health_summary_v
  • standing_order_health_summary_v
  • opportunity_health_summary_v
  • system_health_snapshot_v

Put on the dashboard:

  • queued / running / blocked / dead counts
  • oldest queued age
  • oldest blocked age
  • expired lease count
  • jobs retried > N times
  • standing-order gaps in last 24h
  • opportunities new / accepted / parked
  • last dispatcher heartbeat
  • last successful cron sweep by auditor

Tooling

  • Fastest: Metabase on top of Postgres views
  • Ops-grade: Prometheus + Grafana + Postgres exporter + dispatcher metrics
#

If you want the least friction, I’d start with Metabase or Grafana backed by Postgres views.

One extra thing I’d add

Have the dispatcher write a tiny heartbeat row every 30-60 seconds, or expose a metrics endpoint. That makes it obvious whether the control plane is alive without hand-querying tables.


5) If the dispatcher dies mid-run, are leases enough?

Leases are enough for recovery, not enough for exactly-once.

What happens

Case 1, dispatcher dies, agent keeps running

If the agent can still call agentbus complete-job / block-job / fail-job, the job can finish normally.

Case 2, dispatcher dies, agent also hangs or disappears

#

The job stays leased / running until lease expiry, then the sweeper requeues it.

So yes, leases are the core safety net.
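
To make the recovery path concrete, here is an in-memory sketch of the lease-then-sweep cycle. It is not the dispatcher’s actual code: `Job`, `lease`, and `sweep` are illustrative stand-ins for rows in the jobs table and the sweeper’s requeue query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Job:
    id: str
    status: str = "queued"
    leased_until: Optional[datetime] = None
    attempt_count: int = 0


def lease(job: Job, now: datetime, lease_seconds: int = 600) -> None:
    """Hand the job to a worker for a bounded time."""
    job.status = "leased"
    job.leased_until = now + timedelta(seconds=lease_seconds)
    job.attempt_count += 1


def sweep(jobs: list[Job], now: datetime) -> list[str]:
    """Requeue jobs whose lease has expired; return the ids that were requeued."""
    requeued = []
    for job in jobs:
        expired = job.leased_until is not None and job.leased_until < now
        if job.status in ("leased", "running") and expired:
            job.status = "queued"
            job.leased_until = None
            requeued.append(job.id)
    return requeued


start = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
job = Job(id="job-1")
lease(job, start)
print(sweep([job], start + timedelta(minutes=5)))   # lease still valid: []
print(sweep([job], start + timedelta(minutes=11)))  # lease expired: ['job-1']
```

The worker that disappeared never has to report anything. The clock does the recovery, which is also why the second attempt can overlap with a worker that was only slow, not dead.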

But here’s the catch

If the job has non-idempotent side effects, a retry can duplicate them.

That’s why the real pattern is:

  • lease
  • retry
  • dedupe keys
  • idempotent handlers
  • attempt records
  • optional lease renewal for long jobs

My practical rule

#
  • Short jobs: leases are enough
  • Long jobs: add renew-job-lease
  • Side-effectful jobs: require idempotency keys / dedupe

So the real answer is:

Leases are enough for at-least-once execution with crash recovery.
They are not enough for exactly-once guarantees.

That’s the right mental model.
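
Here is what an idempotent handler looks like in miniature. This is a sketch with an in-memory store; `IdempotentRunner` is a hypothetical name, and in this setup the key store would more plausibly be a Postgres table with a unique constraint so it survives a crash.

```python
from typing import Callable


class IdempotentRunner:
    """Run a side effect at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def run(self, key: str, effect: Callable[[], str]) -> str:
        # A retry with a known key returns the stored result
        # instead of repeating the side effect.
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


sent: list[str] = []


def send_digest() -> str:
    sent.append("digest")
    return "sent"


runner = IdempotentRunner()
runner.run("job-42:send-digest", send_digest)
runner.run("job-42:send-digest", send_digest)  # simulated retry after lease expiry
print(len(sent))  # prints 1
```

There is still a gap if the process dies after the effect but before the key is recorded, so for external systems I’d also pass the key along where the receiving API supports it.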

If you want, I can generate the next two missing pieces:

  1. the summary-view SQL for system health, and
  2. a dispatcher heartbeat table + writer loop so the dashboard can show control-plane liveness.
nimble phoenix
#

Yes — summary view SQL and dispatcher heartbeat table + writer loop.

Also: I want a separate ideas table alongside opportunities. Lower bar — no scoring required, agents dump anything worth noting here. Other agents read it during proactive scans for cross-pollination. If an idea doesn't clear the opportunity bar, it goes to ideas instead of being discarded. Can you add the DDL and agentbus commands for it?

warm elmBOT
#

Yep. This is a good addition.

I like the split:

  • ideas = low-friction raw signal
  • opportunities = scored, decision-ready proposals

That gives you cross-pollination without forcing every half-baked thought to clear the higher bar.

1) Dispatcher heartbeat table + summary view SQL

DDL

begin;
create table dispatcher_heartbeats (
  namespace text not null default 'default',
  dispatcher_id text not null,
  host text,
  pid int,
  last_seen timestamptz not null default now(),
  leases_owned int not null default 0,
  busy_agents int not null default 0,
  metrics jsonb not null default '{}'::jsonb,
  primary key (namespace, dispatcher_id)
);

create index dispatcher_heartbeats_last_seen_idx
  on dispatcher_heartbeats (namespace, last_seen desc);
create or replace view queue_health_summary_v as
select
  namespace,
  count(*) filter (where status = 'queued') as queued_count,
  count(*) filter (where status = 'leased') as leased_count,
  count(*) filter (where status = 'running') as running_count,
  count(*) filter (where status = 'blocked') as blocked_count,
  count(*) filter (where status = 'succeeded') as succeeded_count,
  count(*) filter (where status = 'failed') as failed_count,
  count(*) filter (where status = 'dead') as dead_count,
  count(*) filter (
    where status in ('leased', 'running')
      and leased_until is not null
      and leased_until < now()
  ) as expired_lease_count,
  count(*) filter (where attempt_count >= 3 and status not in ('succeeded')) as retry_hotspot_count,
  extract(epoch from (now() - min(created_at) filter (where status = 'queued'))) as oldest_queued_age_seconds,
  extract(epoch from (now() - min(updated_at) filter (where status = 'blocked'))) as oldest_blocked_age_seconds
from jobs
group by namespace;

create or replace view standing_order_health_summary_v as
select
  namespace,
  count(*) filter (where created_at >= now() - interval '24 hours') as runs_last_24h,
  count(*) filter (where status = 'completed' and created_at >= now() - interval '24 hours') as completed_last_24h,
  count(*) filter (where status = 'no_work' and created_at >= now() - interval '24 hours') as no_work_last_24h,
  count(*) filter (where status = 'blocked' and created_at >= now() - interval '24 hours') as blocked_last_24h,
  count(*) filter (where status = 'failed' and created_at >= now() - interval '24 hours') as failed_last_24h,
  count(*) filter (where status = 'missed' and created_at >= now() - interval '24 hours') as missed_last_24h
from standing_order_runs
group by namespace;

create or replace view opportunity_health_summary_v as
select
  namespace,
  count(*) filter (where status = 'new') as new_count,
  count(*) filter (where status = 'triaged') as triaged_count,
  count(*) filter (where status = 'accepted') as accepted_count,
  count(*) filter (where status = 'rejected') as rejected_count,
  count(*) filter (where status = 'parked') as parked_count,
  count(*) filter (where status = 'converted_to_job') as converted_count,
  extract(epoch from (now() - min(created_at) filter (where status = 'new'))) as oldest_new_age_seconds
from opportunities
group by namespace;

create or replace view dispatcher_health_summary_v as
select
  namespace,
  count(*) as dispatcher_count,
  count(*) filter (where last_seen >= now() - interval '90 seconds') as healthy_dispatcher_count,
  count(*) filter (where last_seen < now() - interval '90 seconds') as stale_dispatcher_count,
  max(last_seen) as last_dispatcher_seen_at,
  extract(epoch from (now() - max(last_seen))) as seconds_since_last_dispatcher_seen,
  sum(leases_owned) as leases_owned_total,
  sum(busy_agents) as busy_agents_total
from dispatcher_heartbeats
group by namespace;

create or replace view system_health_snapshot_v as
select
  coalesce(q.namespace, s.namespace, o.namespace, d.namespace) as namespace,

  coalesce(q.queued_count, 0) as queued_count,
  coalesce(q.leased_count, 0) as leased_count,
  coalesce(q.running_count, 0) as running_count,
  coalesce(q.blocked_count, 0) as blocked_count,
  coalesce(q.failed_count, 0) as failed_count,
  coalesce(q.dead_count, 0) as dead_count,
  coalesce(q.expired_lease_count, 0) as expired_lease_count,
  coalesce(q.retry_hotspot_count, 0) as retry_hotspot_count,
  q.oldest_queued_age_seconds,
  q.oldest_blocked_age_seconds,

  coalesce(s.runs_last_24h, 0) as standing_runs_last_24h,
  coalesce(s.completed_last_24h, 0) as standing_completed_last_24h,
  coalesce(s.no_work_last_24h, 0) as standing_no_work_last_24h,
  coalesce(s.blocked_last_24h, 0) as standing_blocked_last_24h,
  coalesce(s.failed_last_24h, 0) as standing_failed_last_24h,
  coalesce(s.missed_last_24h, 0) as standing_missed_last_24h,

  coalesce(o.new_count, 0) as opp_new_count,
  coalesce(o.triaged_count, 0) as opp_triaged_count,
  coalesce(o.accepted_count, 0) as opp_accepted_count,
  coalesce(o.parked_count, 0) as opp_parked_count,
  coalesce(o.converted_count, 0) as opp_converted_count,
  o.oldest_new_age_seconds as opp_oldest_new_age_seconds,

  coalesce(d.dispatcher_count, 0) as dispatcher_count,
  coalesce(d.healthy_dispatcher_count, 0) as healthy_dispatcher_count,
  coalesce(d.stale_dispatcher_count, 0) as stale_dispatcher_count,
  d.last_dispatcher_seen_at,
  d.seconds_since_last_dispatcher_seen,
  coalesce(d.leases_owned_total, 0) as leases_owned_total,
  coalesce(d.busy_agents_total, 0) as busy_agents_total

from queue_health_summary_v q
full outer join standing_order_health_summary_v s using (namespace)
full outer join opportunity_health_summary_v o using (namespace)
full outer join dispatcher_health_summary_v d using (namespace);

commit;

One important caveat

This gives you a health summary, but it can’t tell you on its own whether a standing order was skipped.

True “missed run” detection still lives in the auditor logic, because schedule expectations live outside the DB tables.
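
For a sense of what that auditor logic has to do, here is a sketch that handles only simple once-a-day schedules such as "0 8 * * *". The function names and the 15-minute grace window are my assumptions; anything beyond a fixed daily time needs a real cron parser.

```python
from datetime import datetime, timedelta, timezone


def expected_daily_runs(
    hour: int, minute: int, start: datetime, end: datetime
) -> list[datetime]:
    """List the daily run times that fall in [start, end)."""
    due = start.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if due < start:
        due += timedelta(days=1)
    runs = []
    while due < end:
        runs.append(due)
        due += timedelta(days=1)
    return runs


def missed_runs(
    expected: list[datetime],
    recorded: list[datetime],
    grace: timedelta = timedelta(minutes=15),
) -> list[datetime]:
    """Return expected times with no recorded run inside the grace window."""
    return [
        due
        for due in expected
        if not any(due <= ran <= due + grace for ran in recorded)
    ]


utc = timezone.utc
window = expected_daily_runs(
    8, 0, datetime(2026, 4, 1, tzinfo=utc), datetime(2026, 4, 4, tzinfo=utc)
)
recorded = [
    datetime(2026, 4, 1, 8, 1, tzinfo=utc),
    datetime(2026, 4, 3, 8, 5, tzinfo=utc),
]
print(missed_runs(window, recorded))  # only the 2026-04-02 08:00 slot is missing
```

The recorded timestamps would come from standing_order_runs, and each gap found this way is what the auditor would write back as a missed row.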


2) Dispatcher heartbeat writer loop

Add these functions to dispatcher_v2.py.

async def collect_dispatcher_metrics(pool: AsyncConnectionPool) -> dict[str, Any]:
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(
                """
                select
                  count(*) filter (where status = 'queued') as queued_count,
                  count(*) filter (where status = 'leased') as leased_count,
                  count(*) filter (where status = 'running') as running_count,
                  count(*) filter (where status = 'blocked') as blocked_count,
                  count(*) filter (where status = 'dead') as dead_count,
                  count(*) filter (
                    where status in ('leased', 'running')
                      and lease_owner = %s
                  ) as leases_owned
                from jobs
                where namespace = %s
                """,
                (DISPATCHER_ID, DISPATCHER_NAMESPACE),
            )
            row = await cur.fetchone()
            return dict(row or {})
#


async def write_dispatcher_heartbeat_once(pool: AsyncConnectionPool) -> None:
    metrics = await collect_dispatcher_metrics(pool)

    async with pool.connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                """
                insert into dispatcher_heartbeats (
                  namespace,
                  dispatcher_id,
                  host,
                  pid,
                  last_seen,
                  leases_owned,
                  busy_agents,
                  metrics
                )
                values (
                  %s, %s, %s, %s, now(), %s, %s, %s
                )
                on conflict (namespace, dispatcher_id)
                do update set
                  host = excluded.host,
                  pid = excluded.pid,
                  last_seen = excluded.last_seen,
                  leases_owned = excluded.leases_owned,
                  busy_agents = excluded.busy_agents,
                  metrics = excluded.metrics
                """,
                (
                    DISPATCHER_NAMESPACE,
                    DISPATCHER_ID,
                    socket.gethostname(),
                    os.getpid(),
                    int(metrics.get("leases_owned", 0)),
                    # `busy` is assumed to be the dispatcher's existing {agent: in-flight count} map.
                    sum(1 for _, count in busy.items() if count > 0),
                    Jsonb(metrics),
                ),
            )
            await conn.commit()


async def heartbeat_loop(pool: AsyncConnectionPool) -> None:
    while True:
        try:
            await write_dispatcher_heartbeat_once(pool)
        except Exception as exc:
            print(f"[heartbeat] error: {exc}", file=sys.stderr)
        await asyncio.sleep(30)

And add it to main():

    heartbeat = asyncio.create_task(heartbeat_loop(pool), name="heartbeat")
    await asyncio.gather(listener, sweeper, drainer, heartbeat)

That’s enough for dashboards and stale-dispatcher detection.
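
If an alerting script needs the same check outside SQL, the staleness rule is small enough to restate. This sketch reuses the 90-second threshold from dispatcher_health_summary_v; `stale_dispatchers` is a hypothetical helper fed with rows read from dispatcher_heartbeats.

```python
from datetime import datetime, timedelta, timezone

# Same cutoff as dispatcher_health_summary_v: three missed 30-second beats.
STALE_AFTER = timedelta(seconds=90)


def stale_dispatchers(
    last_seen_by_id: dict[str, datetime], now: datetime
) -> list[str]:
    """Return dispatcher ids whose last heartbeat is older than STALE_AFTER."""
    return sorted(
        dispatcher_id
        for dispatcher_id, seen in last_seen_by_id.items()
        if now - seen > STALE_AFTER
    )


now = datetime(2026, 4, 1, 12, 0, 0, tzinfo=timezone.utc)
heartbeats = {
    "dispatcher-a": now - timedelta(seconds=20),
    "dispatcher-b": now - timedelta(seconds=300),
}
print(stale_dispatchers(heartbeats, now))  # prints ['dispatcher-b']
```

An empty heartbeat table also deserves an alert, since "no dispatchers at all" produces no stale rows.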

#

3) Ideas DDL

I’d keep ideas intentionally lightweight.

begin;

create type idea_status as enum (
  'new',
  'reviewed',
  'promoted_to_opportunity',
  'archived'
);
create table ideas (
  id uuid primary key default gen_random_uuid(),
  namespace text not null default 'default',

  source_agent text not null,
  theme text not null,
  title text not null,
  body text not null,

  evidence jsonb not null default '{}'::jsonb,
  tags text[] not null default '{}',

  status idea_status not null default 'new',
  promoted_opportunity_id uuid references opportunities(id) on delete set null,
  dedupe_key text,

  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create trigger ideas_set_updated_at
before update on ideas
for each row
execute function set_updated_at();

create index ideas_status_idx
  on ideas (namespace, status, created_at desc);

create index ideas_source_idx
  on ideas (namespace, source_agent, created_at desc);

create index ideas_theme_idx
  on ideas (namespace, theme, created_at desc);

create unique index ideas_active_dedupe_idx
  on ideas (namespace, source_agent, dedupe_key)
  where dedupe_key is not null
    and status in ('new', 'reviewed');

create or replace function notify_ideas_ready()
returns trigger
language plpgsql
as $$
begin
  if new.status = 'new' then
    perform pg_notify(
      'ideas_ready',
      json_build_object(
        'namespace', new.namespace,
        'source_agent', new.source_agent,
        'idea_id', new.id
      )::text
    );
  end if;
  return new;
end;
$$;

create trigger ideas_notify_insert
after insert on ideas
for each row
execute function notify_ideas_ready();

create trigger ideas_notify_update
after update of status on ideas
for each row
when (
  new.status = 'new'
  and old.status is distinct from new.status
)
execute function notify_ideas_ready();

create or replace view idea_health_summary_v as
select
  namespace,
  count(*) filter (where status = 'new') as new_count,
  count(*) filter (where status = 'reviewed') as reviewed_count,
  count(*) filter (where status = 'promoted_to_opportunity') as promoted_count,
  count(*) filter (where status = 'archived') as archived_count,
  extract(epoch from (now() - min(created_at) filter (where status = 'new'))) as oldest_new_age_seconds
from ideas
group by namespace;

commit;

If you want the system snapshot to include ideas, update the final view like this:

create or replace view system_health_snapshot_v as
select
  coalesce(q.namespace, s.namespace, o.namespace, i.namespace, d.namespace) as namespace,

  coalesce(q.queued_count, 0) as queued_count,
  coalesce(q.leased_count, 0) as leased_count,
  coalesce(q.running_count, 0) as running_count,
  coalesce(q.blocked_count, 0) as blocked_count,
  coalesce(q.failed_count, 0) as failed_count,
  coalesce(q.dead_count, 0) as dead_count,
  coalesce(q.expired_lease_count, 0) as expired_lease_count,
  coalesce(q.retry_hotspot_count, 0) as retry_hotspot_count,
  q.oldest_queued_age_seconds,
  q.oldest_blocked_age_seconds,
  coalesce(s.runs_last_24h, 0) as standing_runs_last_24h,
  coalesce(s.completed_last_24h, 0) as standing_completed_last_24h,
  coalesce(s.no_work_last_24h, 0) as standing_no_work_last_24h,
  coalesce(s.blocked_last_24h, 0) as standing_blocked_last_24h,
  coalesce(s.failed_last_24h, 0) as standing_failed_last_24h,
  coalesce(s.missed_last_24h, 0) as standing_missed_last_24h,

  coalesce(o.new_count, 0) as opp_new_count,
  coalesce(o.triaged_count, 0) as opp_triaged_count,
  coalesce(o.accepted_count, 0) as opp_accepted_count,
  coalesce(o.parked_count, 0) as opp_parked_count,
  coalesce(o.converted_count, 0) as opp_converted_count,
  o.oldest_new_age_seconds as opp_oldest_new_age_seconds,

  coalesce(i.new_count, 0) as idea_new_count,
  coalesce(i.reviewed_count, 0) as idea_reviewed_count,
  coalesce(i.promoted_count, 0) as idea_promoted_count,
  coalesce(i.archived_count, 0) as idea_archived_count,
  i.oldest_new_age_seconds as idea_oldest_new_age_seconds,

  coalesce(d.dispatcher_count, 0) as dispatcher_count,
  coalesce(d.healthy_dispatcher_count, 0) as healthy_dispatcher_count,
  coalesce(d.stale_dispatcher_count, 0) as stale_dispatcher_count,
  d.last_dispatcher_seen_at,
  d.seconds_since_last_dispatcher_seen,
  coalesce(d.leases_owned_total, 0) as leases_owned_total,
  coalesce(d.busy_agents_total, 0) as busy_agents_total

from queue_health_summary_v q
full outer join standing_order_health_summary_v s using (namespace)
full outer join opportunity_health_summary_v o using (namespace)
full outer join idea_health_summary_v i using (namespace)
full outer join dispatcher_health_summary_v d using (namespace);

4) agentbus commands for ideas

I’d add these three.

create-idea

agentbus create-idea \
  --source-agent design \
  --theme onboarding \
  --title "Setup could feel collaborative instead of procedural" \
  --body "Maybe a guided paired-setup pattern would reduce friction. Not enough evidence for an opportunity yet." \
  --tags '["ux","activation"]' \
  --evidence '{}' \
  --dedupe-key "onboarding:paired-setup-v1"

review-idea

Use this when an agent looked at it during a proactive scan.

agentbus review-idea \
  --idea-id 11111111-1111-1111-1111-111111111111

promote-idea-to-opportunity

When an idea matures and now clears the bar.

agentbus promote-idea-to-opportunity \
  --idea-id 11111111-1111-1111-1111-111111111111 \
  --promoted-by coordinator \
  --theme onboarding \
  --title "Test paired setup flow for activation lift" \
  --why-now "Multiple idea reads plus recent funnel weakness justify a scored opportunity." \
  --evidence '{"origin":"idea-promotion"}' \
  --impact-score 4 \
  --confidence-score 3 \
  --urgency-score 3 \
  --reversibility high \
  --owner-candidate product-metrics \
  --suggested-next-step "Create experiment design + metrics validation jobs." \
  --notes "Promoted from idea after cross-pollination review."

5) Why I like ideas here

Because it lets agents do:

  • “this might matter”
  • without pretending
  • “this is strong enough to ask the human about right now”

That’s a really useful distinction.

If you want, I can do one more pass and give you the actual agentbus.py code additions for ideas, not just the command shapes.

nimble phoenix
#

Yes — agentbus code additions for ideas. Also separate question: what's the best memory architecture for long-term sustained agent memory in this setup? I want agents that actually get smarter over time — remembering decisions, patterns, mistakes, preferences, and context across weeks and months. Not just session notes. What's the right pattern given the Postgres-first architecture we've built?

warm elmBOT
#

Yep. I’d do both.

1) agentbus code additions for ideas

Add these handlers to agentbus.py.

def cmd_create_idea(args: argparse.Namespace) -> None:
    evidence = parse_json_input(args.evidence)
    tags = parse_json_input(args.tags)

    sql = """
    insert into ideas (
      namespace,
      source_agent,
      theme,
      title,
      body,
      evidence,
      tags,
      dedupe_key
    )
    values (
      %(namespace)s,
      %(source_agent)s,
      %(theme)s,
      %(title)s,
      %(body)s,
      %(evidence)s,
      %(tags)s,
      %(dedupe_key)s
    )
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "namespace": args.namespace,
                "source_agent": args.source_agent,
                "theme": args.theme,
                "title": args.title,
                "body": args.body,
                "evidence": Jsonb(evidence),
                "tags": tags,
                "dedupe_key": args.dedupe_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_review_idea(args: argparse.Namespace) -> None:
    sql = """
    update ideas
    set status = 'reviewed',
        updated_at = now()
    where id = %(idea_id)s
      and status = 'new'
    returning *;
    """

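    with connect() as conn, conn.cursor() as cur:
        cur.execute(sql, {"idea_id": args.idea_id})
        row = cur.fetchone()
        if not row:
            # No row means the idea is missing or is no longer in status 'new'.
            die(f"Idea not found or not in status 'new': {args.idea_id}")
        conn.commit()
        print_json(row)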


def cmd_promote_idea_to_opportunity(args: argparse.Namespace) -> None:
    evidence = parse_json_input(args.evidence)

    with connect() as conn, conn.cursor() as cur:
        cur.execute("select * from ideas where id = %s for update", (args.idea_id,))
        idea = cur.fetchone()
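        if not idea:
            die(f"Idea not found: {args.idea_id}")
        if idea["status"] == "promoted_to_opportunity":
            # Guard against creating a second opportunity from the same idea.
            die(f"Idea already promoted: {args.idea_id}")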

        cur.execute(
            """
            insert into opportunities (
              namespace,
              source_agent,
              theme,
              title,
              why_now,
              evidence,
              impact_score,
              confidence_score,
              urgency_score,
              reversibility,
              owner_candidate,
              suggested_next_step,
              triaged_by,
              triage_notes,
              triaged_at,
              status,
              dedupe_key
            )
            values (
              %(namespace)s,
              %(source_agent)s,
              %(theme)s,
              %(title)s,
              %(why_now)s,
              %(evidence)s,
              %(impact_score)s,
              %(confidence_score)s,
              %(urgency_score)s,
              %(reversibility)s,
              %(owner_candidate)s,
              %(suggested_next_step)s,
              %(triaged_by)s,
              %(triage_notes)s,
              now(),
              'triaged',
              %(dedupe_key)s
            )
            returning *;
            """,
            {
                "namespace": idea["namespace"],
                "source_agent": args.promoted_by,
                "theme": args.theme or idea["theme"],
                "title": args.title or idea["title"],
                "why_now": args.why_now,
                "evidence": Jsonb(
                    {
                        "idea_origin_id": str(idea["id"]),
                        "idea_body": idea["body"],
                        "idea_evidence": idea["evidence"],
                        "promotion_evidence": evidence,
                    }
                ),
                "impact_score": args.impact_score,
                "confidence_score": args.confidence_score,
                "urgency_score": args.urgency_score,
                "reversibility": args.reversibility,
                "owner_candidate": args.owner_candidate,
                "suggested_next_step": args.suggested_next_step,
                "triaged_by": args.promoted_by,
                "triage_notes": args.notes or "Promoted from idea",
                "dedupe_key": args.dedupe_key,
            },
        )
        opportunity = cur.fetchone()

        cur.execute(
            """
            update ideas
            set status = 'promoted_to_opportunity',
                promoted_opportunity_id = %s,
                updated_at = now()
            where id = %s
            returning *;
            """,
            (opportunity["id"], args.idea_id),
        )
        idea_after = cur.fetchone()

        conn.commit()
        print_json({"idea": idea_after, "opportunity": opportunity})

Add these parser stanzas:

    create_idea = sub.add_parser("create-idea")
    create_idea.add_argument("--namespace", default="default")
    create_idea.add_argument("--source-agent", required=True)
    create_idea.add_argument("--theme", required=True)
    create_idea.add_argument("--title", required=True)
    create_idea.add_argument("--body", required=True)
    create_idea.add_argument("--evidence", default="{}")
    create_idea.add_argument("--tags", default="[]")
    create_idea.add_argument("--dedupe-key")
    create_idea.set_defaults(func=cmd_create_idea)

    review_idea = sub.add_parser("review-idea")
    review_idea.add_argument("--idea-id", required=True)
    review_idea.set_defaults(func=cmd_review_idea)
    promote_idea = sub.add_parser("promote-idea-to-opportunity")
    promote_idea.add_argument("--idea-id", required=True)
    promote_idea.add_argument("--promoted-by", required=True)
    promote_idea.add_argument("--theme")
    promote_idea.add_argument("--title")
    promote_idea.add_argument("--why-now", required=True)
    promote_idea.add_argument("--evidence", default="{}")
    promote_idea.add_argument("--impact-score", type=int, required=True)
    promote_idea.add_argument("--confidence-score", type=int, required=True)
    promote_idea.add_argument("--urgency-score", type=int, required=True)
    promote_idea.add_argument("--reversibility", required=True, choices=["high", "medium", "low"])
    promote_idea.add_argument("--owner-candidate")
    promote_idea.add_argument("--suggested-next-step", required=True)
    promote_idea.add_argument("--notes")
    promote_idea.add_argument("--dedupe-key")
    promote_idea.set_defaults(func=cmd_promote_idea_to_opportunity)

2) Best memory architecture for long-term agent memory

This is the important part.

If you want agents to get smarter over weeks and months, I would not use session notes as primary memory.

I’d use a 5-layer memory stack.

Layer 1, immutable evidence

This is your source of truth:

  • jobs
  • job attempts
  • events
  • standing order runs
  • opportunities
  • ideas
  • transcripts

These are the raw receipts.

Layer 2, episodic memory

Short summaries of what happened:

  • “devops tried X, it failed because Y”
  • “security blocked deployment due to Z”
  • “project tracker noticed design + product were misaligned”

This is not yet durable truth, it’s compressed history.

#

Layer 3, semantic memory

Only promoted memories live here:

  • decisions
  • preferences
  • patterns
  • mistakes / lessons
  • playbooks
  • stable context
  • known constraints

This is the layer that should make agents smarter over time.

Layer 4, current-state projections

Views of what is active now:

  • current preferences
  • active decisions
  • current known blockers
  • accepted playbooks
  • open lessons not yet resolved

This is what the agent should usually read first.

Layer 5, retrieval layer

For each new job, retrieve a small, relevant memory set:

  • top decisions for this project/theme
  • recent lessons for this role
  • active preferences for the human/system
  • similar past failures
  • relevant playbooks

That gives you memory without prompt bloat.


#

3) The pattern I’d use in Postgres

I’d keep structured memory in Postgres as the primary long-term memory system, and treat transcript recall/search as secondary.

Core tables I’d add

  • memory_candidates
  • memories
  • memory_links
  • maybe memory_reviews

memories should hold

  • id
  • namespace
  • kind
    • decision
    • preference
    • pattern
    • mistake
    • lesson
    • playbook
    • constraint
    • entity_fact
  • scope
    • global
    • agent
    • project
    • person
    • service
  • subject_key
  • body
  • confidence
  • stability
  • status
  • valid_from
  • expires_at
  • source_refs (jobs/events/transcripts/opportunities)
  • created_at
  • updated_at

Why this shape

Because a “memory” is not just text. It needs:

  • kind
  • scope
  • provenance
  • confidence
  • freshness

Otherwise agents just accumulate sludge.
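To make that concrete, here's a rough sketch of the shape as a Python record with a single usability check. The Memory class, is_usable, and the default confidence floor of 3 are illustrative assumptions on my part, not part of agentbus.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Memory:
    kind: str                    # e.g. "decision", "lesson", "playbook"
    scope: str                   # e.g. "global", "agent", "project"
    subject_key: str
    body: str
    confidence: int              # 1 (weak) to 5 (strong)
    source_refs: dict = field(default_factory=dict)  # provenance
    expires_at: Optional[datetime] = None            # freshness bound

    def is_usable(self, now: datetime, min_confidence: int = 3) -> bool:
        """A memory is worth injecting only if it has provenance,
        has not expired, and clears the confidence floor (assumed: 3)."""
        has_provenance = bool(self.source_refs)
        is_fresh = self.expires_at is None or self.expires_at > now
        return has_provenance and is_fresh and self.confidence >= min_confidence


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    m = Memory(
        kind="lesson",
        scope="agent",
        subject_key="devops:deploys",
        body="Rolling restarts need a drain step first.",
        confidence=4,
        source_refs={"jobs": ["example-job-id"]},
        expires_at=now + timedelta(days=90),
    )
    print(m.is_usable(now))
```

A text-only note fails every one of those checks by construction, which is the point.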


#

4) The learning loop

This is the loop that actually makes them smarter:

Step 1

Agents do work and write durable evidence.

Step 2

A memory curator, usually knowledge, extracts memory candidates from:

  • repeated failures
  • repeated successes
  • decisions with rationale
  • recurring preferences
  • stable patterns
  • confirmed constraints

Step 3

#

Candidates are reviewed/promoted into memories.

Step 4

Future jobs retrieve relevant memories before execution.

Step 5

Old memories are reviewed, refreshed, or retired.

That’s the difference between:

  • “more notes”
    and
  • “actual learning”

5) What each agent should remember

#

Coordinator

  • routing patterns
  • escalation patterns
  • priority mistakes
  • what types of work tend to block

Auditor

  • recurring failure modes
  • cron gap patterns
  • lease-loss patterns
  • noise / spam sources

Project-tracker

  • milestone patterns
  • common debrief gaps
  • ownership failure patterns
#

Knowledge

  • reusable facts
  • taxonomy
  • retrieval structure
  • consolidation lessons

DevOps

  • runbooks
  • infra failure patterns
  • stable recovery sequences
  • environment quirks

Product-metrics

  • experiment outcomes
  • metric caveats
  • common misreads
  • stable KPI definitions
#

Design

  • recurring UX friction
  • accepted design principles
  • user confusion patterns

Security

  • recurring risk classes
  • mitigation patterns
  • approved exceptions
  • dependency risk history

Research

  • source quality judgments
  • recurring external themes
  • market/tool trend patterns

Publisher

  • audience preferences
  • preferred report shapes
  • recurring missing-input failure modes

6) Retrieval rule I’d actually use

For any new job, retrieve at most:

  • 2-3 active decisions
  • 2-3 relevant lessons / mistakes
  • 1-2 active preferences
  • 1 relevant playbook
  • a few recent same-theme events

That’s enough.

#

Do not dump the full memory store into prompts.
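A minimal sketch of that cap, assuming memories arrive as dicts with kind, confidence_score, and updated_at keys (the column names from the memories shape above). The select_for_job name, the ranking by confidence then recency, and treating lessons and mistakes as one bucket are my assumptions. Same-theme events are not memories, so they're left out here.

```python
from collections import defaultdict

# Upper bounds per bucket, taken from the retrieval rule above.
BUCKET_CAPS = {"decision": 3, "lesson": 3, "preference": 2, "playbook": 1}

# Lessons and mistakes share one bucket (assumption).
BUCKET_OF = {
    "decision": "decision",
    "lesson": "lesson",
    "mistake": "lesson",
    "preference": "preference",
    "playbook": "playbook",
}


def select_for_job(memories: list[dict]) -> list[dict]:
    """Pick a small, capped memory set: highest confidence first,
    most recently updated as the tie-breaker."""
    ranked = sorted(
        memories,
        key=lambda m: (m["confidence_score"], m["updated_at"]),
        reverse=True,
    )
    taken = defaultdict(int)
    selected = []
    for memory in ranked:
        bucket = BUCKET_OF.get(memory["kind"])
        if bucket is None:
            continue  # kinds outside the rule are never injected
        if taken[bucket] >= BUCKET_CAPS[bucket]:
            continue
        taken[bucket] += 1
        selected.append(memory)
    return selected
```

With these caps the injected set can never exceed nine memories, no matter how large the store grows.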


7) My strongest opinion

If you want sustained memory, promote memories only from evidence.

Not from:

  • guesses
  • vibes
  • one-off thoughts
  • raw ideas

Good promotion sources

  • repeated outcomes
  • explicit decisions
  • human feedback
  • accepted opportunities
  • confirmed patterns
  • repeated errors

Bad promotion sources

  • speculative ideas
  • unreviewed opinions
  • single noisy incidents

So:

  • ideas are not memory
  • opportunities are not memory
  • accepted decisions, lessons, and stable patterns become memory

That’s the clean line.
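If you want that line enforced in code rather than by convention, a gate along these lines would do it. The source-type labels, may_promote, and the minimum of two occurrences for "repeated" sources are hypothetical names and thresholds I'm picking for illustration.

```python
# Source types that may feed long-term memory (labels are illustrative).
GOOD_SOURCES = {
    "repeated_outcome",
    "explicit_decision",
    "human_feedback",
    "accepted_opportunity",
    "confirmed_pattern",
    "repeated_error",
}

# Sources that only count once they have actually repeated.
REPEAT_SOURCES = {"repeated_outcome", "repeated_error"}
MIN_REPEATS = 2  # assumption: a single incident is noise, two is a signal


def may_promote(source_type: str, occurrences: int = 1) -> bool:
    """Return True only for evidence-backed sources.

    Speculative ideas, unreviewed opinions, and single noisy incidents
    are rejected because they are not in GOOD_SOURCES or have not repeated.
    """
    if source_type not in GOOD_SOURCES:
        return False
    if source_type in REPEAT_SOURCES:
        return occurrences >= MIN_REPEATS
    return True


if __name__ == "__main__":
    print(may_promote("speculative_idea"))        # rejected: not a good source
    print(may_promote("repeated_error", 3))       # allowed: repeated evidence
```

The curator can still reject a candidate that passes this gate; the gate only stops the obviously weak ones early.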

#

8) Given your Postgres-first architecture, my recommendation is

  • Postgres = authoritative long-term structured memory
  • OpenClaw transcripts/search = recall helper
  • Knowledge agent = memory curator
  • Coordinator/project-tracker = memory consumers
  • Auditor = memory hygiene checker

That’s the pattern I’d trust over months.

If you want, I can generate the next concrete piece too:

  1. DDL for memory_candidates + memories, and
  2. the memory extraction/promote cron jobs for the 10-agent setup.
nimble phoenix
#

Yes — DDL for memory_candidates and memories, and the memory extraction/promote cron jobs for all 10 agents.

warm elmBOT
#

Yep. This is the version I’d run.

1) DDL for memory_candidates and memories

begin;

create type memory_kind as enum (
  'decision',
  'preference',
  'pattern',
  'mistake',
  'lesson',
  'playbook',
  'constraint',
  'entity_fact',
  'context'
);

create type memory_scope as enum (
  'global',
  'agent',
  'project',
  'person',
  'service'
);

create type memory_stability as enum (
  'volatile',
  'medium',
  'stable'
);

create type memory_candidate_status as enum (
  'new',
  'reviewed',
  'promoted',
  'rejected',
  'archived'
);

create type memory_status as enum (
  'active',
  'superseded',
  'expired',
  'archived'
);

create table memory_candidates (
  id uuid primary key default gen_random_uuid(),
  namespace text not null default 'default',

  source_agent text not null,
  candidate_kind memory_kind not null,
  scope memory_scope not null,
  subject_key text not null,

  title text not null,
  body text not null,

  evidence jsonb not null default '{}'::jsonb,
  source_refs jsonb not null default '{}'::jsonb,

  confidence_score int not null check (confidence_score between 1 and 5),
  stability memory_stability not null default 'medium',

  status memory_candidate_status not null default 'new',
  reviewer text,
  review_notes text,

  promoted_memory_id uuid,

  dedupe_key text,

  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create trigger memory_candidates_set_updated_at
before update on memory_candidates
for each row
execute function set_updated_at();

create table memories (
  id uuid primary key default gen_random_uuid(),
  namespace text not null default 'default',

  kind memory_kind not null,
  scope memory_scope not null,
  subject_key text not null,
  title text not null,
  body text not null,

  source_candidate_id uuid references memory_candidates(id) on delete set null,
  source_refs jsonb not null default '{}'::jsonb,

  confidence_score int not null check (confidence_score between 1 and 5),
  stability memory_stability not null default 'medium',

  status memory_status not null default 'active',

  created_by text,
  updated_by text,

  valid_from timestamptz not null default now(),
  expires_at timestamptz,
  next_review_at timestamptz,

  supersedes_memory_id uuid references memories(id) on delete set null,

  dedupe_key text,

  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create trigger memories_set_updated_at
before update on memories
for each row
execute function set_updated_at();

alter table memory_candidates
  add constraint memory_candidates_promoted_memory_fk
  foreign key (promoted_memory_id) references memories(id) on delete set null;

create index memory_candidates_status_idx
  on memory_candidates (namespace, status, created_at desc);

create index memory_candidates_source_idx
  on memory_candidates (namespace, source_agent, created_at desc);

create index memory_candidates_subject_idx
  on memory_candidates (namespace, candidate_kind, scope, subject_key, created_at desc);

create unique index memory_candidates_active_dedupe_idx
  on memory_candidates (namespace, source_agent, dedupe_key)
  where dedupe_key is not null
    and status in ('new', 'reviewed');

create index memories_status_idx
  on memories (namespace, status, created_at desc);

create index memories_subject_idx
  on memories (namespace, kind, scope, subject_key, created_at desc);

create index memories_review_idx
  on memories (namespace, next_review_at)
  where status = 'active' and next_review_at is not null;

create unique index memories_active_dedupe_idx
  on memories (namespace, dedupe_key)
  where dedupe_key is not null
    and status = 'active';

create or replace function notify_memory_candidates_ready()
returns trigger
language plpgsql
as $$
begin
  if new.status = 'new' then
    perform pg_notify(
      'memory_candidates_ready',
      json_build_object(
        'namespace', new.namespace,
        'source_agent', new.source_agent,
        'candidate_id', new.id
      )::text
    );
  end if;
  return new;
end;
$$;

create trigger memory_candidates_notify_insert
after insert on memory_candidates
for each row
execute function notify_memory_candidates_ready();

create trigger memory_candidates_notify_update
after update of status on memory_candidates
for each row
when (
  new.status = 'new'
  and old.status is distinct from new.status
)
execute function notify_memory_candidates_ready();

create or replace view active_memories_v as
select *
from memories
where status = 'active'
  and (expires_at is null or expires_at > now());

commit;

2) Memory extraction / promote cron jobs for all 10 agents

Design principle

  • Extraction is distributed, every role extracts from its own evidence
  • Promotion is centralized, mainly knowledge
  • That keeps memory quality high

Assuming UTC:

# coordinator memory extraction
openclaw cron add \
  --name "coord-memory-extract" \
  --cron "0 20 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent coordinator \
  --message "Run coordinator memory extraction standing order: review the last 24 hours of routing decisions, reprioritizations, blocked work outcomes, and repeated coordination failures; extract durable memory candidates only for decisions, patterns, mistakes, lessons, and playbooks with real evidence." \
  --no-deliver

# auditor memory extraction
openclaw cron add \
  --name "audit-memory-extract" \
  --cron "10 20 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor memory extraction standing order: review the last 24 hours of lease losses, cron gaps, retry storms, dead jobs, and operational anomalies; extract durable memory candidates only for recurring failure modes, lessons, constraints, and audit-relevant patterns." \
  --no-deliver

# project-tracker memory extraction
openclaw cron add \
  --name "tracker-memory-extract" \
  --cron "20 20 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent project-tracker \
  --message "Run project-tracker memory extraction standing order: review the last 24 hours of milestone movement, missing debriefs, ownership failures, and successful coordination patterns; extract durable memory candidates for project patterns, lessons, and recurring blockers." \
  --no-deliver

# knowledge memory extraction
openclaw cron add \
  --name "knowledge-memory-extract" \
  --cron "30 20 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent knowledge \
  --message "Run knowledge memory extraction standing order: review the last 24 hours of consolidations, repeated retrieval hits, durable facts, conflicts, and taxonomy changes; extract memory candidates only for stable facts, patterns, lessons, and reusable playbooks." \
  --no-deliver

# devops memory extraction
openclaw cron add \
  --name "devops-memory-extract" \
  --cron "40 20 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent devops \
  --message "Run devops memory extraction standing order: review the last 24 hours of incidents, runbook executions, environment quirks, drift findings, and recovery attempts; extract durable memory candidates for runbooks, recurring failure patterns, constraints, and lessons." \
  --no-deliver

# product-metrics memory extraction
openclaw cron add \
  --name "metrics-memory-extract" \
  --cron "50 20 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent product-metrics \
  --message "Run product-metrics memory extraction standing order: review the last 24 hours of KPI interpretation, experiment outcomes, metric caveats, and repeated analysis pitfalls; extract durable memory candidates for definitions, lessons, patterns, and preferences." \
  --no-deliver

# design memory extraction
openclaw cron add \
  --name "design-memory-extract" \
  --cron "0 21 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent design \
  --message "Run design memory extraction standing order: review the last 24 hours of critiques, accepted patterns, rejected options, recurring UX friction, and design tradeoffs; extract durable memory candidates for principles, lessons, patterns, and constraints." \
  --no-deliver

# security memory extraction
openclaw cron add \
  --name "security-memory-extract" \
  --cron "10 21 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent security \
  --message "Run security memory extraction standing order: review the last 24 hours of findings, accepted risks, mitigations, dependency patterns, and security blockers; extract durable memory candidates for risk classes, lessons, constraints, and approved playbooks." \
  --no-deliver

# research memory extraction
openclaw cron add \
  --name "research-memory-extract" \
  --cron "20 21 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent research \
  --message "Run research memory extraction standing order: review the last 24 hours of source quality judgments, recurring external themes, tool shifts, and repeated research signals; extract durable memory candidates for patterns, lessons, and source preferences." \
  --no-deliver

# publisher memory extraction
openclaw cron add \
  --name "publisher-memory-extract" \
  --cron "30 21 * * *" \
  --tz "UTC" \
  --session isolated \
  --agent publisher \
  --message "Run publisher memory extraction standing order: review the last 24 hours of digest/report production, audience preferences, repeated missing-input failures, and formatting wins; extract durable memory candidates for preferences, patterns, and lessons." \
  --no-deliver

# knowledge memory promotion sweep
openclaw cron add \
  --name "knowledge-memory-promote" \
  --cron "45 */6 * * *" \
  --tz "UTC" \
  --session session:knowledge-memory-curation \
  --agent knowledge \
  --message "Run knowledge memory promotion standing order: review new memory candidates across all roles, reject weak or speculative ones, promote only evidence-backed stable candidates into active memories, set review windows and expirations, and link promoted memories back to their sources." \
  --no-deliver

# knowledge memory review / expiry sweep
openclaw cron add \
  --name "knowledge-memory-review" \
  --cron "0 22 * * *" \
  --tz "UTC" \
  --session session:knowledge-memory-curation \
  --agent knowledge \
  --message "Run knowledge memory review standing order: review active memories due for refresh, expire or supersede stale memories, preserve still-valid ones, and keep the active memory set small, relevant, and trustworthy." \
  --no-deliver
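
The ten extraction jobs above are staggered 10 minutes apart starting at 20:00 UTC so they don't all hit the database at once. If you add or reorder agents, a small helper like this sketch can regenerate the cron expressions. staggered_cron is a hypothetical helper, not part of openclaw or agentbus; it only builds the schedule strings.

```python
EXTRACTION_AGENTS = [
    "coordinator",
    "auditor",
    "project-tracker",
    "knowledge",
    "devops",
    "product-metrics",
    "design",
    "security",
    "research",
    "publisher",
]


def staggered_cron(
    agents: list[str], start_hour: int = 20, step_minutes: int = 10
) -> dict[str, str]:
    """Map each agent to a daily cron expression, offset by step_minutes
    from the previous agent, starting at start_hour:00."""
    schedule = {}
    for index, agent in enumerate(agents):
        total_minutes = start_hour * 60 + index * step_minutes
        hour, minute = divmod(total_minutes, 60)
        schedule[agent] = f"{minute} {hour % 24} * * *"
    return schedule


if __name__ == "__main__":
    for agent, expr in staggered_cron(EXTRACTION_AGENTS).items():
        print(f"{agent:16s} {expr}")
```

With the defaults this reproduces the schedule above, from "0 20 * * *" for coordinator through "30 21 * * *" for publisher.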

3) How I’d think about this memory lane

  • ideas = raw weak signal
  • opportunities = scored proposal
  • memory_candidates = candidate long-term learning
  • memories = promoted durable learning

That separation is really good. It keeps the system from confusing:

  • speculation
  • proposals
  • and long-term truth

4) One opinionated tweak

I would make knowledge-memory-promote the only routine allowed to promote into memories by default.

Everyone else can:

  • create evidence
  • create ideas
  • create opportunities
  • create memory candidates

But promotion into actual long-term memory should be rare and curated.

That’s how the system gets smarter instead of just noisier.

If you want, I can do the next concrete piece too:

  1. agentbus code additions for create-memory-candidate, promote-memory-candidate, expire-memory, and
  2. the retrieval query pattern I’d use to inject the right memories into each new job.
nimble phoenix
#

Yes — agentbus code additions for memory candidates and the retrieval query pattern for job injection. Also: these agents run 24/7 on a VPS. I don't want them idle. When there's no assigned work in the queue, I want them thinking — exploring their domain, forming opinions, connecting dots, dreaming up new angles. Not busywork, not generating noise. Actual thinking that compounds over time and surfaces something worth reading. What's the right pattern for genuine idle-time cognition? How do you give an agent unstructured time that produces real value without drowning me in output?

Also, a two-part question on memory architecture. First, should we split shared memory from solo/personal agent memory as a first-class design decision, and what's the right scope model for each? Second, is there value in a markdown layer alongside Postgres (something human-browsable like Obsidian) where agents write their working knowledge in a format I can actually read and edit? Or does that create sync problems that aren't worth it?

warm elmBOT
#

Yep. This is the right next step.

And I really like the way you’re thinking about it now:

  • ideas = weak signal
  • opportunities = scored proposal
  • memory_candidates = candidate long-term learning
  • memories = promoted durable learning

That’s a clean ladder.

1) agentbus code additions for memory candidates

Add these handlers to agentbus.py.

def cmd_create_memory_candidate(args: argparse.Namespace) -> None:
    evidence = parse_json_input(args.evidence)
    source_refs = parse_json_input(args.source_refs)

    sql = """
    insert into memory_candidates (
      namespace,
      source_agent,
      candidate_kind,
      scope,
      subject_key,
      title,
      body,
      evidence,
      source_refs,
      confidence_score,
      stability,
      dedupe_key
    )
    values (
      %(namespace)s,
      %(source_agent)s,
      %(candidate_kind)s,
      %(scope)s,
      %(subject_key)s,
      %(title)s,
      %(body)s,
      %(evidence)s,
      %(source_refs)s,
      %(confidence_score)s,
      %(stability)s,
      %(dedupe_key)s
    )
    returning *;
    """

    with connect() as conn, conn.cursor() as cur:
        cur.execute(
            sql,
            {
                "namespace": args.namespace,
                "source_agent": args.source_agent,
                "candidate_kind": args.candidate_kind,
                "scope": args.scope,
                "subject_key": args.subject_key,
                "title": args.title,
                "body": args.body,
                "evidence": Jsonb(evidence),
                "source_refs": Jsonb(source_refs),
                "confidence_score": args.confidence_score,
                "stability": args.stability,
                "dedupe_key": args.dedupe_key,
            },
        )
        row = cur.fetchone()
        conn.commit()
        print_json(row)


def cmd_promote_memory_candidate(args: argparse.Namespace) -> None:
    with connect() as conn, conn.cursor() as cur:
        cur.execute("select * from memory_candidates where id = %s for update", (args.candidate_id,))
        cand = cur.fetchone()
        if not cand:
            die(f"Memory candidate not found: {args.candidate_id}")

        cur.execute(
            """
            insert into memories (
              namespace,
              kind,
              scope,
              subject_key,
              title,
              body,
              source_candidate_id,
              source_refs,
              confidence_score,
              stability,
              status,
              created_by,
              updated_by,
              valid_from,
              expires_at,
              next_review_at,
              supersedes_memory_id,
              dedupe_key
            )
            values (
              %(namespace)s,
              %(kind)s,
              %(scope)s,
              %(subject_key)s,
              %(title)s,
              %(body)s,
              %(source_candidate_id)s,
              %(source_refs)s,
              %(confidence_score)s,
              %(stability)s,
              'active',
              %(created_by)s,
              %(updated_by)s,
              now(),
              %(expires_at)s,
              %(next_review_at)s,
              %(supersedes_memory_id)s,
              %(dedupe_key)s
            )
            returning *;
            """,
            {
                "namespace": cand["namespace"],
                "kind": args.kind or cand["candidate_kind"],
                "scope": args.scope or cand["scope"],
                "subject_key": args.subject_key or cand["subject_key"],
                "title": args.title or cand["title"],
                "body": args.body or cand["body"],
                "source_candidate_id": cand["id"],
                "source_refs": Jsonb(cand["source_refs"]),
                # `or` would discard an explicit score of 0, so test for None instead
                "confidence_score": (
                    args.confidence_score
                    if args.confidence_score is not None
                    else cand["confidence_score"]
                ),
                "stability": args.stability or cand["stability"],
                "created_by": args.promoted_by,
                "updated_by": args.promoted_by,
                "expires_at": parse_dt(args.expires_at),
                "next_review_at": parse_dt(args.next_review_at),
                "supersedes_memory_id": args.supersedes_memory_id,
                "dedupe_key": args.dedupe_key,
            },
        )
        memory = cur.fetchone()

        cur.execute(
            """
            update memory_candidates
            set status = 'promoted',
                reviewer = %s,
                review_notes = %s,
                promoted_memory_id = %s,
                updated_at = now()
            where id = %s
            returning *;
            """,
            (
                args.promoted_by,
                args.notes or f"Promoted to memory {memory['id']}",
                memory["id"],
                args.candidate_id,
            ),
        )
        candidate_after = cur.fetchone()

        conn.commit()
        print_json({"candidate": candidate_after, "memory": memory})

Add these parser stanzas:

    create_mem_cand = sub.add_parser("create-memory-candidate")
    create_mem_cand.add_argument("--namespace", default="default")
    create_mem_cand.add_argument("--source-agent", required=True)
    create_mem_cand.add_argument("--candidate-kind", required=True, choices=[
        "decision", "preference", "pattern", "mistake", "lesson", "playbook", "constraint", "entity_fact", "context"
    ])
    create_mem_cand.add_argument("--scope", required=True, choices=[
        "global", "agent", "project", "person", "service"
    ])
    create_mem_cand.add_argument("--subject-key", required=True)
    create_mem_cand.add_argument("--title", required=True)
    create_mem_cand.add_argument("--body", required=True)
    create_mem_cand.add_argument("--evidence", default="{}")
    create_mem_cand.add_argument("--source-refs", default="{}")
    create_mem_cand.add_argument("--confidence-score", type=int, required=True)
    create_mem_cand.add_argument("--stability", required=True, choices=["volatile", "medium", "stable"])
    create_mem_cand.add_argument("--dedupe-key")
    create_mem_cand.set_defaults(func=cmd_create_memory_candidate)

    promote_mem_cand = sub.add_parser("promote-memory-candidate")
    promote_mem_cand.add_argument("--candidate-id", required=True)
    promote_mem_cand.add_argument("--promoted-by", required=True)
    promote_mem_cand.add_argument("--kind", choices=[
        "decision", "preference", "pattern", "mistake", "lesson", "playbook", "constraint", "entity_fact", "context"
    ])
    promote_mem_cand.add_argument("--scope", choices=[
        "global", "agent", "project", "person", "service"
    ])
    promote_mem_cand.add_argument("--subject-key")
    promote_mem_cand.add_argument("--title")
    promote_mem_cand.add_argument("--body")
    promote_mem_cand.add_argument("--confidence-score", type=int)
    promote_mem_cand.add_argument("--stability", choices=["volatile", "medium", "stable"])
    promote_mem_cand.add_argument("--expires-at")
    promote_mem_cand.add_argument("--next-review-at")
    promote_mem_cand.add_argument("--supersedes-memory-id")
    promote_mem_cand.add_argument("--dedupe-key")
    promote_mem_cand.add_argument("--notes")
    promote_mem_cand.set_defaults(func=cmd_promote_memory_candidate)

2) Retrieval query pattern for job injection

I would not do one giant semantic dump.

I’d inject memory in 3 buckets:

Bucket A, exact context memory

Pull memories that match the current job’s:

  • project
  • agent
  • person
  • service

Bucket B, lessons / mistakes / playbooks

Pull recent active lessons that are relevant to the same role/theme.

Bucket C, global constraints / preferences

Pull small, high-confidence, long-lived memories.

Retrieval shape

Use parameters like:

  • :namespace
  • :target_agent
  • :project_key
  • :person_key
  • :service_key

Example query pattern:

with exact_context as (
  select
    1 as bucket,
    m.*
  from active_memories_v m
  where m.namespace = :namespace
    and (
      (m.scope = 'project' and m.subject_key = :project_key)
      or (m.scope = 'agent' and m.subject_key = :target_agent)
      or (m.scope = 'person' and m.subject_key = :person_key)
      or (m.scope = 'service' and m.subject_key = :service_key)
    )
  order by
    m.confidence_score desc,
    m.created_at desc
  limit 4
),
lessons_and_playbooks as (
  select
    2 as bucket,
    m.*
  from active_memories_v m
  where m.namespace = :namespace
    and m.scope in ('agent', 'global', 'service')
    and m.kind in ('lesson', 'mistake', 'playbook', 'constraint')
    and (
      m.subject_key = :target_agent
      or m.subject_key = :service_key
      or m.scope = 'global'
    )
  order by
    m.confidence_score desc,
    m.created_at desc
  limit 4
),
global_guidance as (
  select
    3 as bucket,
    m.*
  from active_memories_v m
  where m.namespace = :namespace
    and m.scope = 'global'
    and m.kind in ('decision', 'preference', 'constraint')
  order by
    m.confidence_score desc,
    m.created_at desc
  limit 4
)
select *
from (
  select * from exact_context
  union all
  select * from lessons_and_playbooks
  union all
  select * from global_guidance
) x
order by bucket, confidence_score desc, created_at desc
limit 10;

Why this works

It gives the agent:

  • immediate context
  • recent hard-won lessons
  • stable system guidance

…without flooding the prompt.
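
One thing worth handling in application code: a global constraint satisfies both the lessons_and_playbooks and the global_guidance filters, so the same memory can come back in two buckets. Here is a rough sketch of a post-processing step, assuming each row arrives as a dict with the id, bucket, confidence_score, kind, scope, title, and body columns. The helper names pick_memories and render_memory_block are hypothetical, not part of agentbus.

```python
from typing import Any


def pick_memories(rows: list[dict[str, Any]], max_total: int = 10) -> list[dict[str, Any]]:
    """Keep one copy of each memory (lowest bucket wins) and cap the total."""
    ordered = sorted(rows, key=lambda r: (r["bucket"], -r["confidence_score"]))
    seen: set[Any] = set()
    picked: list[dict[str, Any]] = []
    for row in ordered:
        if row["id"] in seen:
            continue  # already injected from a more specific bucket
        seen.add(row["id"])
        picked.append(row)
        if len(picked) == max_total:
            break
    return picked


def render_memory_block(rows: list[dict[str, Any]]) -> str:
    """Format the picked memories as a compact bullet list for the job prompt."""
    return "\n".join(
        f"- [{r['kind']}/{r['scope']}] {r['title']}: {r['body']}" for r in rows
    )
```

Deduplicating after the query keeps the SQL simple; the cap still applies, so the prompt stays small.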


3) The right pattern for genuine idle-time cognition

This is the fun part.

My blunt take:
Do not leave them “thinking” 24/7 in an unbounded way.
That turns into mush, not insight.

The right pattern is bounded idle cognition.

What I’d do instead

Give them “cognitive windows”

When queue depth is low and no urgent work exists, run short idle-think jobs like:

  • 20-30 minutes
  • 2-4 times per day
  • role-specific questions
  • outputs go to ideas, opportunities, or memory_candidates
  • not directly to you

That’s how you get compounding thought without noise.

Idle cognition should mostly produce:

  • ideas
  • memory_candidates
  • sometimes opportunities

Not messages to you.

Human output should stay filtered

Only these should surface:

  • accepted opportunities
  • repeated patterns
  • top-ranked ideas digest
  • clear decision memos

Everything else stays in the system.

The pattern I’d trust

Step 1, idle scan

Agent explores its domain when not needed elsewhere.

Step 2, provisional output

It writes:

  • raw idea
  • possible pattern
  • candidate lesson
  • draft opportunity

Step 3, cross-pollination

Other agents read those during their proactive scans.

Step 4, promotion

Only strong items move upward:

  • idea → opportunity
  • repeated signal → memory_candidate
  • reviewed candidate → memory

Step 5, digest

Publisher/coordinator send you the best things, not the full stream.

That’s the real compounding loop.

What makes idle cognition valuable

Give each role open-ended but bounded prompts like:

  • “What changed that nobody has acted on yet?”
  • “What pattern is repeating that we keep treating as isolated?”
  • “What belief do we keep acting on that might now be wrong?”
  • “What automation or simplification would pay off repeatedly?”
  • “What decision should probably be revisited?”

That is much better than “go think”.

How to avoid drowning in output

Use these rules:

  • max 3 new ideas per day per scout
  • max 1 urgent unsolicited escalation per day
  • anything below the opportunity bar goes to ideas
  • anything below the memory bar stays out of long-term memory
  • you get only:
    • daily ideas digest
    • weekly bets memo
    • urgent alerts

That keeps it readable.
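
The two numeric caps above can be enforced with a small gate in front of whatever writes ideas and escalations. This is only a sketch: it keeps counts in memory, whereas a real setup would count rows in Postgres, and it assumes both caps apply per agent per day. OutputBudget and try_spend are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date

# Caps taken from the rules above; the per-agent reading is an assumption.
LIMITS = {"idea": 3, "urgent": 1}


@dataclass
class OutputBudget:
    counts: Counter = field(default_factory=Counter)

    def try_spend(self, agent: str, kind: str, day: date) -> bool:
        """Return True and record the output if the agent is still under its cap."""
        key = (agent, kind, day)
        if self.counts[key] >= LIMITS[kind]:
            return False  # over budget: caller should drop or defer the output
        self.counts[key] += 1
        return True
```

A rejected spend does not have to discard the thought. The agent can hold it and re-rank it against tomorrow's candidates.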


4) Shared vs solo memory: yes, make that first-class

I strongly recommend yes.

You want both:

Shared memory

Things multiple agents should know:

  • system-wide decisions
  • project facts
  • common playbooks
  • shared preferences
  • accepted risk/constraints

Solo / agent-private memory

Things that improve one role’s judgment:

  • devops quirks
  • research source opinions
  • design taste notes
  • publisher audience instincts
  • coordinator routing heuristics

That split matters a lot.

The right scope model

I’d use scope + visibility.

Scope = what the memory is about

  • global
  • agent
  • project
  • person
  • service

Visibility = who can read it

  • shared
  • role-limited
  • agent-private
  • human-editable

Even if you don’t model visibility as a column yet, think that way.

Because:

  • a memory about Peter’s preference is not the same as
  • a memory about how security triages CVEs

My recommendation

  • shared memories should be curated harder
  • solo memories can be looser
  • promotion to shared memory should usually go through knowledge
  • promotion to solo memory can be more permissive

That gives specialization without chaos.
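
If visibility does become a real column, the read check is small. A minimal sketch, assuming hypothetical visibility, owner_agent, and allowed_roles fields that are not in the schema shown so far, and assuming human-editable memories are readable by every agent:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    scope: str            # what the memory is about
    subject_key: str
    visibility: str       # who can read it
    owner_agent: str
    allowed_roles: frozenset = frozenset()


def can_read(memory: Memory, agent: str, role: str) -> bool:
    """Decide whether an agent may have this memory injected into its job."""
    if memory.visibility in ("shared", "human-editable"):
        return True
    if memory.visibility == "role-limited":
        return role in memory.allowed_roles
    if memory.visibility == "agent-private":
        return agent == memory.owner_agent
    return False  # unknown visibility values are denied by default
```

Keeping scope and visibility as separate fields means a memory about one service can still be private to the agent that learned it.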


5) Markdown layer alongside Postgres: yes, but as a projection layer

Short answer: yes, there is real value, but don’t make it a second live source of truth for operational state.

What should stay authoritative in Postgres

  • jobs
  • attempts
  • events
  • standing order runs
  • ideas
  • opportunities
  • memory candidates
  • active memories

What works well in Markdown / Obsidian

  • decision logs
  • playbooks
  • weekly briefs
  • role handbooks
  • curated knowledge packets
  • human-facing notes
  • longform context

That’s where markdown shines.

The right pattern

Use Markdown as a human-browsable projection and editing layer, not as the live queue/memory database.

So:

  • Postgres = source of truth
  • Markdown = readable working knowledge / published memory surface

Avoid sync pain by using this rule

One-way by default

Postgres → Markdown

Generate:

  • decision notes
  • memory packet exports
  • weekly rollups
  • playbook drafts

Human edits are deliberate

If you edit markdown and want it back in the system:

  • ingest it intentionally
  • do not do magical live bidirectional sync

That’s where sync nightmares start.

My recommendation

Yes, add an Obsidian/markdown layer, but make it:

  • curated
  • slow-moving
  • human-readable
  • not the live operational backbone

That gives you readability without corrupting the system.


My strongest recommendation overall

For your setup:

  • Postgres = operational + structured memory truth
  • Markdown = curated human-readable knowledge layer
  • Idle cognition = bounded, scheduled, filtered
  • Shared vs solo memory = first-class
  • Memory promotion = evidence-based and curated

That’s the version that gets smarter over time without turning into sludge.

If you want, I can generate the next concrete pieces too:

  1. the idle-cognition cron jobs for all 10 agents, and
  2. a markdown export pattern for memories / opportunities / ideas into an Obsidian vault.
nimble phoenix
#

Yes — idle cognition cron jobs for all 10 agents and the markdown export pattern for Obsidian.

Also: what's the recommended deployment order for everything we've built — what breaks if something goes up in the wrong sequence? And what does week 1 actually look like — what's the minimal viable first deployment that proves the stack works before layering in memory, opportunities, and idle cognition?

warm elmBOT
#

Yep. This is the right moment to be strict about sequencing, otherwise the “smart” layer turns into chaos fast.

1) Idle cognition cron jobs for all 10 agents

Rule for every idle cognition job:
First check whether there is real queued, blocked, or urgent work that should take priority. If yes, do nothing, record no_work, and exit.
If idle time is genuinely available, use the session to produce only:

  • ideas
  • opportunities
  • memory_candidates

Not direct human output.
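
In code terms the rule is a two-part gate: decide whether to think at all, then restrict where the output may go. A minimal sketch, assuming the queue counts come from a query you already have; idle_gate and keep_allowed are hypothetical helpers, and the "think" label is made up for illustration.

```python
ALLOWED_OUTPUTS = {"ideas", "opportunities", "memory_candidates"}


def idle_gate(queued: int, blocked: int, urgent: int) -> str:
    """Return 'no_work' if real work should take priority, else 'think'."""
    if queued > 0 or blocked > 0 or urgent > 0:
        return "no_work"
    return "think"


def keep_allowed(outputs: list[dict]) -> list[dict]:
    """Drop anything that is not an idea, opportunity, or memory candidate."""
    return [o for o in outputs if o["table"] in ALLOWED_OUTPUTS]
```

Either outcome should still be recorded, so a skipped session is visible as a run rather than as a gap.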

# coordinator
openclaw cron add \
  --name "coord-idle-cognition" \
  --cron "0 1 * * *" \
  --tz "UTC" \
  --session session:idle-coordinator \
  --agent coordinator \
  --message "Run coordinator idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session looking for cross-system blind spots, stale priorities, missing strategic bets, and hidden coordination opportunities. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# auditor
openclaw cron add \
  --name "audit-idle-cognition" \
  --cron "15 1 * * *" \
  --tz "UTC" \
  --session session:idle-auditor \
  --agent auditor \
  --message "Run auditor idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session looking for systemic fragility, silent failure modes, blind spots in audit coverage, and process anti-patterns. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# project-tracker
openclaw cron add \
  --name "tracker-idle-cognition" \
  --cron "30 1 * * *" \
  --tz "UTC" \
  --session session:idle-project-tracker \
  --agent project-tracker \
  --message "Run project-tracker idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session connecting dots across projects, milestones, missing debriefs, and latent dependencies. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# knowledge
openclaw cron add \
  --name "knowledge-idle-cognition" \
  --cron "45 1 * * *" \
  --tz "UTC" \
  --session session:idle-knowledge \
  --agent knowledge \
  --message "Run knowledge idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session connecting fragments of knowledge, finding underused patterns, spotting gaps worth filling, and surfacing reusable abstractions. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# devops
openclaw cron add \
  --name "devops-idle-cognition" \
  --cron "0 2 * * *" \
  --tz "UTC" \
  --session session:idle-devops \
  --agent devops \
  --message "Run devops idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session identifying automation candidates, reliability wins, observability gaps, and recurring toil that should become proactive work. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# product-metrics
openclaw cron add \
  --name "metrics-idle-cognition" \
  --cron "15 2 * * *" \
  --tz "UTC" \
  --session session:idle-product-metrics \
  --agent product-metrics \
  --message "Run product-metrics idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session looking for underexplored product bets, experiment ideas, data contradictions, and changing user patterns. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# design
openclaw cron add \
  --name "design-idle-cognition" \
  --cron "30 2 * * *" \
  --tz "UTC" \
  --session session:idle-design \
  --agent design \
  --message "Run design idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session exploring new angles on clarity, flow simplification, user friction, and design coherence. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# security
openclaw cron add \
  --name "security-idle-cognition" \
  --cron "45 2 * * *" \
  --tz "UTC" \
  --session session:idle-security \
  --agent security \
  --message "Run security idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session exploring preventative hardening ideas, latent risk chains, policy improvements, and assumptions that should be challenged. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# research
openclaw cron add \
  --name "research-idle-cognition" \
  --cron "0 3 * * *" \
  --tz "UTC" \
  --session session:idle-research \
  --agent research \
  --message "Run research idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session connecting external signals, surfacing overlooked trends, and exploring angles that may matter later but do not yet warrant direct action. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

# publisher
openclaw cron add \
  --name "publisher-idle-cognition" \
  --cron "15 3 * * *" \
  --tz "UTC" \
  --session session:idle-publisher \
  --agent publisher \
  --message "Run publisher idle cognition. First, if there is meaningful queued, blocked, or urgent work, exit with no_work. Otherwise, spend one bounded idle-thinking session looking for themes, narratives, missing frames, and human-facing summaries that would make future communication more useful. Record only high-signal outputs as ideas, opportunities, or memory candidates." \
  --no-deliver

2) Markdown export pattern for Obsidian

My recommendation: Postgres is authoritative, Markdown is a projection layer.

Vault structure

Obsidian/
├─ 00 Inbox/
├─ 01 Daily/
├─ 02 Ideas/
├─ 03 Opportunities/
├─ 04 Memories/
│  ├─ Decisions/
│  ├─ Preferences/
│  ├─ Lessons/
│  ├─ Patterns/
│  └─ Playbooks/
├─ 05 Reports/
├─ 06 Agents/
│  ├─ coordinator/
│  ├─ auditor/
│  ├─ project-tracker/
│  ├─ knowledge/
│  ├─ devops/
│  ├─ product-metrics/
│  ├─ design/
│  ├─ security/
│  ├─ research/
│  └─ publisher/
└─ 07 Snapshots/

What gets exported

  • ideas → 02 Ideas/
  • opportunities → 03 Opportunities/
  • active memories → 04 Memories/...
  • daily system health snapshots → 07 Snapshots/
  • publisher digests / weekly bets → 05 Reports/

File naming

Use stable IDs in filenames so rewrites are deterministic:

2026-04-15-opportunity-onboarding-gap-7f3c.md
2026-04-15-idea-paired-setup-2ab1.md
2026-04-15-memory-lesson-retry-storms-91de.md

Frontmatter shape

---
id: 7f3c
kind: opportunity
status: accepted
source_agent: product-metrics
theme: onboarding
created_at: 2026-04-15T18:00:00Z
updated_at: 2026-04-15T18:20:00Z
namespace: default
owner_candidate: coordinator
source_refs:
  - job: 123
  - event: 456
---

Body pattern

  • short summary
  • evidence
  • why it matters
  • next step
  • source refs

Sync rule

  • Default direction: Postgres → Markdown
  • Human edits: either
    • treat markdown as commentary only, or
    • re-ingest intentionally through a separate import path

Do not try to live-sync operational state both ways.
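
Here is roughly what the Postgres → Markdown direction could look like for a single record. It is a sketch under assumptions: the record is a dict with the frontmatter fields shown above plus hypothetical title and summary keys, and export_filename and render_note are made-up helper names. Values are written as double-quoted strings so an id such as 7e3 is not read as a number by a YAML parser.

```python
import json
import re
from datetime import datetime, timezone


def slugify(text: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def export_filename(created_at: datetime, kind: str, title: str, short_id: str) -> str:
    """Build a stable filename: date, kind, slug, then the record's short id."""
    return f"{created_at:%Y-%m-%d}-{kind}-{slugify(title)}-{short_id}.md"


def render_note(record: dict) -> str:
    """Render frontmatter plus a short body; same record in, same text out."""
    keys = ("id", "kind", "status", "source_agent", "namespace")
    front = [f"{k}: {json.dumps(record[k])}" for k in keys]
    front.append(f"created_at: {json.dumps(record['created_at'].isoformat())}")
    return "---\n" + "\n".join(front) + "\n---\n\n" + record["summary"] + "\n"
```

Because both functions are pure, re-running the export over unchanged rows produces identical files, which keeps the vault quiet under version control.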

3) Recommended deployment order

This matters a lot.

Order I’d use

  1. Postgres core schema

    • jobs
    • attempts
    • events
    • standing_order_runs
  2. agentbus

    • enough commands to create/close jobs and record events
  3. OpenClaw agents + workspaces

    • AGENTS files
    • minimal cron jobs
    • no ideas/opportunities/memory yet
  4. Dispatcher v2

    • LISTEN/NOTIFY
    • leases
    • attempt writes
  5. Health views + dashboard

    • system snapshot
    • dispatcher heartbeat
  6. Operational cron jobs

    • coordinator
    • auditor
    • specialists
    • publisher
  7. Ideas + opportunities

    • tables
    • commands
    • proactive scans
    • digests
  8. Memories

    • candidates
    • promotion
    • retrieval injection
  9. Idle cognition

    • only after ideas/opportunities/memory are already working
  10. Markdown / Obsidian projection

    • after schemas stabilize

4) What breaks if you bring things up in the wrong order

Dispatcher before schema

It either crashes, or worse, runs against missing tables/functions.

Agents before agentbus

They can think, but they can’t close the loop durably. That’s fake automation.

Cron before dispatcher

Jobs get created but nothing actually processes them.

Ideas / opportunities before digests + budgets

You get noise and no good way to surface or prune it.

Memory before evidence discipline

You create durable sludge instead of durable learning.

Idle cognition before ideas/opportunities/memory exist

You get a firehose with nowhere useful to go.

Obsidian before stable IDs / schemas

You create sync churn and broken links.


5) What week 1 should actually look like

Do not launch the whole system at once.

Minimal viable first deployment

I’d prove the stack with just:

  • coordinator
  • auditor
  • one real specialist you actually care about, maybe devops
  • publisher

And only these database pieces:

  • jobs
  • job_attempts
  • events
  • standing_order_runs
  • dispatcher_heartbeats
  • system snapshot views

And only these behaviors:

  • enqueue job
  • process job
  • succeed / block / retry
  • cron run with no_work
  • daily publisher summary
  • auditor lease recovery

Week 1 success criteria

You want proof of these six things:

  1. A cron job runs and writes standing_order_runs
  2. Dispatcher claims a job and writes job_attempts
  3. Agent closes the job durably via agentbus
  4. Blocked jobs get requeued / rerouted correctly
  5. Dispatcher crash recovery works via leases
  6. You can see system health in one dashboard/view

If those six work, the base is real.

Only then layer in

  • ideas
  • opportunities
  • memory candidates / memories
  • proactive scans
  • idle cognition
  • markdown export

That’s the safe order.


My recommendation for week 1, literally

Day 1

  • Postgres schema
  • agentbus
  • dispatcher
  • 4 agents

Day 2

  • core cron jobs
  • health views
  • dashboard

Day 3

  • force failures / crash tests
  • verify lease recovery
  • verify blocked-job routing

Day 4-5

  • publisher digest
  • one end-to-end human-readable report
  • only then decide whether to add ideas/opportunities

That’s the MVP that proves the stack is alive.


If you want, I can turn this into one final practical package:

  1. a day-by-day week 1 checklist, and
  2. a single “phase 1 bootstrap script” that sets up the minimum viable deployment only.
nimble phoenix
#

Yes — week 1 day-by-day checklist and the phase 1 bootstrap script

warm elmBOT
#

Yep. Here’s the practical week 1 I’d actually run, and then a phase 1 bootstrap script that only brings up the minimum viable stack.

Week 1, day-by-day checklist

Day 1, database + schema

Goal: prove the state layer exists and is queryable.

Do:

  • create Postgres DB/user
  • apply phase 1 core schema only:
    • jobs
    • job_attempts
    • events
    • standing_order_runs
    • dispatcher_heartbeats
    • health summary views
  • verify you can query system_health_snapshot_v

Exit criteria:

  • schema applies cleanly
  • select * from system_health_snapshot_v; returns one row or empty-safe output
  • backups are configured at least minimally

Day 2, agentbus + dispatcher

Goal: prove work can be claimed and tracked.

Do:

  • install agentbus.py
  • install dispatcher_v2.py
  • install systemd unit
  • start dispatcher
  • verify heartbeat table updates every 30s

Exit criteria:

  • systemctl status agentbus-dispatcher is healthy
  • dispatcher_heartbeats shows fresh last_seen
  • dispatcher survives restart

Day 3, 4 pilot agents only

Goal: keep scope tight.

Bring up only:

  • coordinator
  • auditor
  • devops
  • publisher

Do:

  • create their workspaces
  • drop in their AGENTS.md
  • ensure OpenClaw routes/recognizes them
  • no ideas/opportunities/memory yet

Exit criteria:

  • each pilot agent can be targeted
  • each workspace loads cleanly
  • no tool policy surprises

Day 4, minimal cron layer

Goal: prove scheduled work and auditability.

Add only:

  • coord-backlog-rebalance
  • audit-lease-reaper
  • audit-standing-order-gaps
  • devops-health-check
  • publisher-daily-digest

Exit criteria:

  • cron jobs appear in openclaw cron list
  • standing_order_runs shows completed or no_work
  • publisher generates at least one digest or explicit no-op
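
The check behind audit-standing-order-gaps is where "ran and found nothing" gets separated from "didn't run at all": a no_work row counts as a run, while a scheduled slot with no row is a gap. A minimal sketch, assuming you have already pulled one job's run timestamps from standing_order_runs; missing_slots and the two-minute grace window are hypothetical.

```python
from datetime import datetime, timedelta


def missing_slots(
    run_times: list[datetime],
    start: datetime,
    end: datetime,
    every: timedelta,
    grace: timedelta = timedelta(minutes=2),
) -> list[datetime]:
    """Return scheduled slots in [start, end) with no run recorded within `grace`."""
    missing = []
    slot = start
    while slot < end:
        if not any(slot <= t <= slot + grace for t in run_times):
            missing.append(slot)  # nothing recorded: the job did not run
        slot += every
    return missing
```

Each returned slot is something the auditor can write as a durable event.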

Day 5, end-to-end queue test

Goal: prove the stack works when real jobs move.

Do:

  • enqueue one real devops job through agentbus
  • watch dispatcher claim it
  • verify job_attempts
  • verify terminal durable update
  • verify publisher can summarize outcome

Exit criteria:

  • job lifecycle goes queued -> leased/running -> succeeded
  • at least one blocked or retry case is tested too
  • events and standing order logs are written

Day 6, failure drill

Goal: prove recovery.

Do:

  • start a long-ish job
  • kill dispatcher mid-run
  • restart dispatcher
  • verify:
    • if agent finished and wrote terminal state, job is fine
    • if not, lease expiry + requeue recovers it

Exit criteria:

  • no silent loss
  • recovery behavior is understood
  • retry path is acceptable
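
The two recovery branches in this drill reduce to a small decision per job. A rough sketch, assuming hypothetical status values and a lease_expires_at field; the real column names depend on your phase 1 schema.

```python
from datetime import datetime, timedelta, timezone

# Assumed terminal statuses; adjust to whatever the jobs table actually uses.
TERMINAL = {"succeeded", "failed", "blocked"}


def recovery_action(job: dict, now: datetime) -> str:
    """Decide what the lease reaper should do with one job after a crash."""
    if job["status"] in TERMINAL:
        return "none"  # the agent wrote terminal state, so the job is fine
    if job["status"] in ("leased", "running") and job["lease_expires_at"] <= now:
        return "requeue"  # lease expired without a terminal write
    return "wait"  # still queued, or the lease has not expired yet
```

During the drill, every job touched by the kill should land in exactly one of these three outcomes. Anything else is a silent loss.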

Day 7, review and freeze

Goal: decide whether phase 1 is stable enough for phase 2.

Review:

  • queue health
  • retry hotspot count
  • blocked job age
  • standing-order completeness
  • publisher digest usefulness
  • operator trust level

Decision gate:

  • if phase 1 is noisy or fragile, do not add ideas/opportunities/memory yet
  • if phase 1 is stable, add ideas/opportunities next, then memory, then idle cognition last

Phase 1 bootstrap script

This assumes you already have these local files prepared from the earlier steps:

  • agentbus.py
  • dispatcher_v2.py
  • agentbus-dispatcher.service
  • dispatcher.env
  • phase1_core.sql

It also assumes your OpenClaw config/workspaces already define:

  • coordinator
  • auditor
  • devops
  • publisher
#!/usr/bin/env bash
set -euo pipefail
#
# Required:
#   DATABASE_URL=postgresql://...
#
# Optional:
#   APP_ROOT=/opt/agentbus
#   ETC_ROOT=/etc/agentbus
#   PYTHON_BIN=python3.11

: "${DATABASE_URL:?Set DATABASE_URL first}"

APP_ROOT="${APP_ROOT:-/opt/agentbus}"
ETC_ROOT="${ETC_ROOT:-/etc/agentbus}"
PYTHON_BIN="${PYTHON_BIN:-python3.11}"

echo "==> Creating directories"
sudo mkdir -p "$APP_ROOT" "$APP_ROOT/sql" "$ETC_ROOT"

echo "==> Copying application files"
sudo cp agentbus.py "$APP_ROOT/agentbus.py"
sudo cp dispatcher_v2.py "$APP_ROOT/dispatcher_v2.py"
sudo cp phase1_core.sql "$APP_ROOT/sql/phase1_core.sql"
sudo cp dispatcher.env "$ETC_ROOT/dispatcher.env"
sudo cp agentbus-dispatcher.service /etc/systemd/system/agentbus-dispatcher.service

echo "==> Setting permissions"
sudo chmod +x "$APP_ROOT/agentbus.py" "$APP_ROOT/dispatcher_v2.py"

echo "==> Creating virtualenv"
sudo "$PYTHON_BIN" -m venv "$APP_ROOT/.venv"
sudo "$APP_ROOT/.venv/bin/pip" install --upgrade pip
sudo "$APP_ROOT/.venv/bin/pip" install "psycopg[binary]>=3.2" psycopg-pool httpx

echo "==> Applying phase 1 schema"
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$APP_ROOT/sql/phase1_core.sql"

echo "==> Enabling dispatcher service"
sudo systemctl daemon-reload
sudo systemctl enable --now agentbus-dispatcher.service

echo "==> Service status"
sudo systemctl --no-pager --full status agentbus-dispatcher.service || true

echo "==> OpenClaw status"
openclaw status || true
openclaw gateway status || true

echo "==> Adding minimal phase 1 cron jobs"
openclaw cron add \
  --name "coord-backlog-rebalance" \
  --cron "*/15 * * * *" \
  --tz "UTC" \
  --session session:coordinator-planning \
  --agent coordinator \
  --message "Run coordinator standing order: review queued, blocked, and stale work; rebalance priorities; reroute cross-specialist dependencies; emit follow-up jobs/events; close all durable state before finishing." \
  --no-deliver

openclaw cron add \
  --name "audit-lease-reaper" \
  --cron "*/5 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor standing order: detect expired leases, requeue eligible jobs, record findings, and flag any worker that returned without durable terminal job updates." \
  --no-deliver

openclaw cron add \
  --name "audit-standing-order-gaps" \
  --cron "*/10 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent auditor \
  --message "Run auditor standing order: detect standing-order gaps, missing run records, cron health issues, and record durable audit events for anything missing or degraded." \
  --no-deliver

openclaw cron add \
  --name "devops-health-check" \
  --cron "*/10 * * * *" \
  --tz "UTC" \
  --session isolated \
  --agent devops \
  --message "Run devops standing order: check service health, environment readiness, and operational status; record evidence; and escalate anything that needs recovery or replanning." \
  --no-deliver

openclaw cron add \
  --name "publisher-daily-digest" \
  --cron "0 18 * * *" \
  --tz "UTC" \
  --session session:publisher-editorial \
  --agent publisher \
  --message "Run publisher standing order: build the daily human-facing digest from completed and verified work, preserve uncertainty where needed, and ensure every important claim traces back to durable upstream facts." \
  --no-deliver

echo "==> Cron list"
openclaw cron list || true

echo "==> Initial health snapshot"
psql "$DATABASE_URL" -c "select * from system_health_snapshot_v;" || true

cat <<'EOF'

Phase 1 bootstrap complete.

Next manual smoke tests:
1. Enqueue one real job:
   ./agentbus.py enqueue-job \
     --target-agent devops \
     --job-type reactive \
     --subject "phase1.smoke-test" \
     --payload '{"check":"service-health"}'

2. Watch dispatcher logs:
   sudo journalctl -u agentbus-dispatcher -f

3. Verify DB state:
   psql "$DATABASE_URL" -c "select * from jobs order by created_at desc limit 10;"
   psql "$DATABASE_URL" -c "select * from job_attempts order by started_at desc limit 10;"
   psql "$DATABASE_URL" -c "select * from standing_order_runs order by created_at desc limit 10;"
   psql "$DATABASE_URL" -c "select * from dispatcher_heartbeats;"

4. Kill dispatcher once during a running job and verify lease recovery.

Do not add ideas, opportunities, memory, or idle cognition until phase 1 is boringly reliable.
EOF
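
Smoke test 4 depends on lease expiry: a job claimed by a dispatcher that died should become claimable again once its lease runs out. Here is a minimal sketch of that rule, assuming each running job carries a `lease_expires_at` timestamp and an attempt counter. Both field names, the `Job` class, and the `max_attempts` cutoff are illustrative, not taken from `phase1_core.sql`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Job:
    job_id: int
    status: str                      # "queued", "running", "completed", "dead"
    attempts: int
    lease_expires_at: Optional[datetime] = None  # hypothetical column name


def reap_expired_leases(jobs: List[Job], now: datetime, max_attempts: int = 3) -> List[Job]:
    """Requeue running jobs whose lease has expired; mark them dead once attempts are exhausted."""
    touched = []
    for job in jobs:
        if job.status != "running" or job.lease_expires_at is None:
            continue
        if job.lease_expires_at > now:
            continue  # lease still valid, the worker may be alive
        job.status = "dead" if job.attempts >= max_attempts else "queued"
        job.lease_expires_at = None
        touched.append(job)
    return touched


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    Job(1, "running", attempts=1, lease_expires_at=now - timedelta(minutes=2)),
    Job(2, "running", attempts=1, lease_expires_at=now + timedelta(minutes=5)),
    Job(3, "running", attempts=3, lease_expires_at=now - timedelta(minutes=1)),
]
reaped = reap_expired_leases(jobs, now)
print([(j.job_id, j.status) for j in reaped])  # [(1, 'queued'), (3, 'dead')]
```

Job 2 is left alone because its lease is still live. That is the behavior to confirm after killing the dispatcher: only work whose lease has actually expired gets requeued.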

What phase 1 intentionally does not include

Leave these out until phase 1 is stable:

  • ideas
  • opportunities
  • memories
  • idle cognition
  • markdown export
  • the other 6 agents

That’s the whole point. You want one reliable spine first.

#

If you want, I can package the next exact phase the same way:

  • phase 2 bootstrap, which adds ideas + opportunities + proactive scans, but still not memory or idle cognition yet.
nimble phoenix
#

Can you build out bootstrap scripts for all remaining phases — phase 2 through final — using the same format as phase 1? Each phase self-contained, minimal viable increment, with its own exit criteria and a clear 'do not proceed until this is stable' gate before the next phase.

warm elmBOT
#

Yep. I’d break the rest into 4 remaining phases:

  • Phase 2: full reactive fleet
  • Phase 3: ideas + opportunities + proactive scans
  • Phase 4: memory candidates + memories + retrieval
  • Phase 5 (final): idle cognition + Obsidian export

And I’d keep the same hard rule for every phase:

Do not proceed until this phase is boringly stable for 48-72 hours.

Also, same caveat as phase 1: these are create-once bootstrap scripts, not idempotent.
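
Because the scripts are create-once, re-running one would register duplicate cron jobs. One possible guard is sketched below. It assumes you already have the names of the registered jobs; parsing them out of `openclaw cron list` is deliberately left out because its output format is not covered here, so `existing_names` is a stand-in input.

```python
from typing import Dict, List, Set


def jobs_to_add(desired: List[Dict[str, str]], existing_names: Set[str]) -> List[Dict[str, str]]:
    """Return only the desired cron jobs whose name is not registered yet."""
    seen = set(existing_names)
    pending = []
    for job in desired:
        if job["name"] in seen:
            continue  # already registered, or listed twice in `desired`
        seen.add(job["name"])
        pending.append(job)
    return pending


desired = [
    {"name": "coord-daily-replan", "cron": "0 6 * * *"},
    {"name": "audit-failure-clusters", "cron": "*/15 * * * *"},
]
pending = jobs_to_add(desired, existing_names={"coord-daily-replan"})
print([job["name"] for job in pending])  # ['audit-failure-clusters']
```

Filtering by name before calling `add_job` would let a phase script be re-run after a partial failure without doubling up schedules.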


Phase 2, full reactive fleet

#

Goal

Expand from the 4-agent pilot to the full 10-agent reactive/control fleet, but still no ideas/opportunities/memory/idle cognition.

Do not proceed until phase 1 is stable

Required before phase 2:

  • dispatcher crash recovery tested
  • blocked-job recovery tested
  • publisher digest is readable
  • no unexplained dead jobs in phase 1

Exit criteria

  • all 10 agents are active
  • all remaining operational cron jobs have run at least once
  • standing_order_runs shows completed or no_work for every agent
  • no unexpected tool policy failures for 48h
  • queue health remains understandable
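
The `standing_order_runs` criterion is what separates "ran and found nothing" from "did not run at all". A sketch of that check follows, assuming each run row can be reduced to an `(agent, status)` pair. The tuple shape and the `started` status are hypothetical; `completed` and `no_work` come from the criteria above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

EVIDENCE_STATUSES = {"completed", "no_work"}


def classify_agents(expected_agents: List[str],
                    run_rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Label each expected agent as ok, missing, or lacking terminal evidence."""
    statuses = defaultdict(set)
    for agent, status in run_rows:
        statuses[agent].add(status)

    report = {}
    for agent in expected_agents:
        seen = statuses.get(agent, set())
        if not seen:
            report[agent] = "missing"               # no row at all: the order never ran
        elif seen & EVIDENCE_STATUSES:
            report[agent] = "ok"                    # ran and closed the loop
        else:
            report[agent] = "no_terminal_evidence"  # started but never recorded an outcome
    return report


rows = [("coordinator", "completed"), ("auditor", "no_work"), ("devops", "started")]
report = classify_agents(["coordinator", "auditor", "devops", "publisher"], rows)
print(report)
```

Here `publisher` comes back as `missing` and `devops` as `no_terminal_evidence`, which are exactly the two cases the auditor's gap check needs to tell apart.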
#

phase2_bootstrap.sh

#!/usr/bin/env bash
set -euo pipefail

add_job() {
  local name="$1"
  local cron_expr="$2"
  local session_target="$3"
  local agent_id="$4"
  local message="$5"

  openclaw cron add \
    --name "$name" \
    --cron "$cron_expr" \
    --tz "UTC" \
    --session "$session_target" \
    --agent "$agent_id" \
    --message "$message" \
    --no-deliver
}

echo "==> Verifying remaining workspaces exist"
for d in \
  "$HOME/.openclaw/workspace-project-tracker" \
  "$HOME/.openclaw/workspace-knowledge" \
  "$HOME/.openclaw/workspace-product-metrics" \
  "$HOME/.openclaw/workspace-design" \
  "$HOME/.openclaw/workspace-security" \
  "$HOME/.openclaw/workspace-research"
do
  [[ -d "$d" ]] || { echo "Missing workspace: $d"; exit 1; }
done

echo "==> Adding remaining operational cron jobs"

add_job "coord-daily-replan" "0 6 * * *" "session:coordinator-planning" "coordinator" \
  "Run coordinator daily replan and stale-priority cleanup."

add_job "audit-failure-clusters" "*/15 * * * *" "isolated" "auditor" \
  "Run auditor failure-cluster review for repeat failures and dead-letter growth."

add_job "audit-daily-summary" "0 7 * * *" "session:auditor-daily" "auditor" \
  "Run auditor daily summary of cron health, lease recovery, gaps, and anomalies."
add_job "tracker-debrief-sweep" "0 * * * *" "session:project-tracker-rollup" "project-tracker" \
  "Run project-tracker debrief sweep and refresh project-level status."

add_job "tracker-daily-rollup" "0 18 * * *" "session:project-tracker-rollup" "project-tracker" \
  "Run project-tracker daily milestone and deliverable rollup."

add_job "knowledge-consolidate" "0 */6 * * *" "session:knowledge-maintenance" "knowledge" \
  "Run knowledge consolidation, promotion of durable facts, and packet refresh."

add_job "knowledge-hygiene" "30 2 * * *" "session:knowledge-maintenance" "knowledge" \
  "Run knowledge hygiene and index cleanup."

add_job "devops-drift-review" "0 * * * *" "isolated" "devops" \
  "Run devops drift review and operator-note emission."
add_job "metrics-anomaly-check" "15 * * * *" "isolated" "product-metrics" \
  "Run product-metrics anomaly check."

add_job "metrics-daily-rollup" "30 6 * * *" "session:metrics-daily" "product-metrics" \
  "Run product-metrics daily product health rollup."

add_job "design-queue-triage" "0 9 * * 1-5" "session:design-backlog" "design" \
  "Run design backlog triage."

add_job "design-weekly-review" "0 10 * * 1" "session:design-backlog" "design" \
  "Run weekly design review and debt sweep."

add_job "security-dependency-review" "0 */6 * * *" "isolated" "security" \
  "Run security dependency and risk review."
add_job "security-daily-risk-rollup" "45 6 * * *" "session:security-daily" "security" \
  "Run security daily risk rollup."

add_job "research-watchlist-scan" "30 */6 * * *" "isolated" "research" \
  "Run research watchlist and scheduled scan sweep."

add_job "research-daily-brief" "15 7 * * *" "session:research-briefs" "research" \
  "Run research daily brief synthesis."

add_job "publisher-weekly-report" "0 17 * * 5" "session:publisher-editorial" "publisher" \
  "Run publisher weekly report build."

echo "==> Cron list"
openclaw cron list || true
echo "==> Status"
openclaw status || true
openclaw gateway status || true

cat <<'EOF'

Phase 2 manual checks:
1. Confirm every new agent shows at least one standing-order run within 24h.
2. Enqueue one manual job each for:
   - project-tracker
   - knowledge
   - product-metrics
   - design
   - security
   - research
3. Confirm queue remains healthy and blocked jobs still recover correctly.

DO NOT PROCEED TO PHASE 3 UNTIL:
- all 10 agents have completed/no_work standing-order evidence
- no unexplained dead jobs for 48-72h
- publisher output is still readable
EOF

Phase 3, ideas + opportunities + proactive scans

Goal

Add the proactive lane, but still no long-term memory promotion and no idle cognition.

#

Do not proceed until phase 2 is stable

Required before phase 3:

  • full reactive fleet stable
  • no queue confusion
  • publisher not overloaded
  • auditors not seeing persistent cron/lease churn

Exit criteria

  • ideas/opportunities schema works
  • at least 3 agents create usable ideas
  • at least 1 opportunity converts into a real job
  • daily ideas/opportunity digest is readable and not spammy
  • opportunity noise is under control for 48h

phase3_bootstrap.sh

#!/usr/bin/env bash
set -euo pipefail

: "${DATABASE_URL:?Set DATABASE_URL first}"
APP_ROOT="${APP_ROOT:-/opt/agentbus}"

add_job() {
  local name="$1"
  local cron_expr="$2"
  local session_target="$3"
  local agent_id="$4"
  local message="$5"

  openclaw cron add \
    --name "$name" \
    --cron "$cron_expr" \
    --tz "UTC" \
    --session "$session_target" \
    --agent "$agent_id" \
    --message "$message" \
    --no-deliver
}

echo "==> Applying ideas/opportunities schema"
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$APP_ROOT/sql/ideas_opportunities.sql"

echo "==> Installing updated agentbus"
sudo cp agentbus.py "$APP_ROOT/agentbus.py"
sudo chmod +x "$APP_ROOT/agentbus.py"
echo "==> Adding proactive cron jobs"

add_job "coord-opportunity-ranking" "0 8 * * *" "session:coordinator-planning" "coordinator" \
  "Review new and triaged opportunities, rank them, and convert only the strongest into jobs."

add_job "coord-weekly-bets-review" "30 8 * * 1" "session:coordinator-planning" "coordinator" \
  "Choose the weekly top bets from accepted opportunities."

add_job "audit-opportunity-hygiene" "45 8 * * *" "isolated" "auditor" \
  "Inspect the opportunities lane for spam, stale items, duplicate ideas, and budget violations."

add_job "tracker-opportunity-triage" "0 */2 * * *" "session:project-tracker-rollup" "project-tracker" \
  "Group related opportunities and attach missing project context."

add_job "knowledge-gap-opportunities" "0 3 * * *" "session:knowledge-maintenance" "knowledge" \
  "Find reusable knowledge gaps and underexploited abstractions."

add_job "devops-automation-opportunities" "0 4 * * *" "isolated" "devops" \
  "Identify automation, reliability, and observability opportunities."

add_job "metrics-growth-opportunities" "0 5 * * *" "session:metrics-daily" "product-metrics" \
  "Identify growth, retention, funnel, and experiment opportunities."

add_job "design-ux-opportunities" "30 9 * * 1-5" "session:design-backlog" "design" \
  "Identify UX friction and simplification opportunities."

add_job "security-hardening-opportunities" "15 6 * * *" "session:security-daily" "security" \
  "Identify preventive hardening and policy-improvement opportunities."

add_job "research-external-opportunities" "15 7 * * *" "session:research-briefs" "research" \
  "Identify external shifts, tool changes, and market opportunities."

add_job "publisher-opportunity-digest" "30 18 * * *" "session:publisher-editorial" "publisher" \
  "Compile the daily ideas and opportunities digest."

add_job "publisher-weekly-bets-memo" "45 18 * * 5" "session:publisher-editorial" "publisher" \
  "Compile the weekly top-bets memo."

echo "==> Initial smoke test hints"
cat <<'EOF'
Try:
  ./agentbus.py create-idea ...
  ./agentbus.py create-opportunity ...
  ./agentbus.py convert-opportunity ...
DO NOT PROCEED TO PHASE 4 UNTIL:
- at least 1 opportunity has been converted into a real job
- daily digest is useful, not noisy
- auditors are not flagging opportunity-lane hygiene issues for 48-72h
EOF

Phase 4, memory lane

Goal

Add memory candidates, promoted memories, and retrieval-aware job injection.

Do not proceed until phase 3 is stable

Required before phase 4:

#
  • ideas/opportunities useful
  • opportunity spam under control
  • publisher digest is trusted
  • no one is tempted to treat ideas as memory

Exit criteria

  • memory candidate schema works
  • at least 1 promoted memory from each of at least 3 different roles
  • retrieval injects relevant small memory sets
  • no obvious stale/contradictory memory pollution for 48h
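
"Small memory sets" implies a hard cap on what gets injected into a job prompt. One way to sketch that cap is shown below, assuming each candidate memory already has a relevance score and a rough token count. The scoring itself is out of scope, and the field names and default limits are illustrative rather than anything the retrieval-aware dispatcher is known to use.

```python
from typing import Dict, List


def select_memories(candidates: List[Dict], max_items: int = 5,
                    max_tokens: int = 400, min_score: float = 0.5) -> List[Dict]:
    """Pick the highest-scoring memories that fit the item and token budgets."""
    chosen, used = [], 0
    for mem in sorted(candidates, key=lambda m: m["score"], reverse=True):
        if mem["score"] < min_score:
            break  # sorted descending, so everything after this is weaker
        if len(chosen) >= max_items:
            break
        if used + mem["tokens"] > max_tokens:
            continue  # too large for the remaining budget, try smaller ones
        chosen.append(mem)
        used += mem["tokens"]
    return chosen


candidates = [
    {"id": "a", "score": 0.9, "tokens": 300},
    {"id": "b", "score": 0.8, "tokens": 200},
    {"id": "c", "score": 0.7, "tokens": 80},
    {"id": "d", "score": 0.2, "tokens": 10},
]
picked = select_memories(candidates)
print([m["id"] for m in picked])  # ['a', 'c']
```

Memory `b` is skipped because it would exceed the token budget, and `d` is dropped for low relevance. Logging what was skipped and why would make the "not bloating prompts" gate easier to audit.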

phase4_bootstrap.sh

#!/usr/bin/env bash
set -euo pipefail

: "${DATABASE_URL:?Set DATABASE_URL first}"
APP_ROOT="${APP_ROOT:-/opt/agentbus}"

add_job() {
  local name="$1"
  local cron_expr="$2"
  local session_target="$3"
  local agent_id="$4"
  local message="$5"

  openclaw cron add \
    --name "$name" \
    --cron "$cron_expr" \
    --tz "UTC" \
    --session "$session_target" \
    --agent "$agent_id" \
    --message "$message" \
    --no-deliver
}

echo "==> Applying memory schema"
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$APP_ROOT/sql/memory.sql"

echo "==> Installing updated agentbus"
sudo cp agentbus.py "$APP_ROOT/agentbus.py"
sudo chmod +x "$APP_ROOT/agentbus.py"

echo "==> Installing retrieval-aware dispatcher"
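# Keep a copy of the non-retrieval dispatcher so the phase 4 rollback can restore it.
# (The ".pre-memory" suffix is only a suggested name.)
sudo cp "$APP_ROOT/dispatcher_v2.py" "$APP_ROOT/dispatcher_v2.py.pre-memory"
sudo cp dispatcher_with_memory.py "$APP_ROOT/dispatcher_v2.py"
sudo chmod +x "$APP_ROOT/dispatcher_v2.py"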
sudo systemctl restart agentbus-dispatcher.service

echo "==> Adding memory extraction/promote cron jobs"

add_job "coord-memory-extract" "0 20 * * *" "isolated" "coordinator" \
  "Extract coordinator memory candidates from decisions, patterns, mistakes, and playbooks."

add_job "audit-memory-extract" "10 20 * * *" "isolated" "auditor" \
  "Extract auditor memory candidates from repeated failures, lease issues, and audit patterns."

add_job "tracker-memory-extract" "20 20 * * *" "isolated" "project-tracker" \
  "Extract project-tracker memory candidates from milestone patterns, debrief gaps, and recurring blockers."

add_job "knowledge-memory-extract" "30 20 * * *" "isolated" "knowledge" \
  "Extract knowledge memory candidates from durable facts, conflicts, and reusable abstractions."

add_job "devops-memory-extract" "40 20 * * *" "isolated" "devops" \
  "Extract devops memory candidates from incidents, runbooks, and recurring failure patterns."

add_job "metrics-memory-extract" "50 20 * * *" "isolated" "product-metrics" \
  "Extract product-metrics memory candidates from experiment lessons, KPI caveats, and recurring misreads."

add_job "design-memory-extract" "0 21 * * *" "isolated" "design" \
  "Extract design memory candidates from accepted patterns, rejected options, and recurring UX friction."

add_job "security-memory-extract" "10 21 * * *" "isolated" "security" \
  "Extract security memory candidates from risk classes, mitigations, blockers, and approved exceptions."

add_job "research-memory-extract" "20 21 * * *" "isolated" "research" \
  "Extract research memory candidates from source quality judgments, signal patterns, and recurring themes."

add_job "publisher-memory-extract" "30 21 * * *" "isolated" "publisher" \
  "Extract publisher memory candidates from audience preferences, report patterns, and missing-input failures."

add_job "knowledge-memory-promote" "45 */6 * * *" "session:knowledge-memory-curation" "knowledge" \
  "Review new memory candidates, reject weak ones, and promote evidence-backed durable memories."

add_job "knowledge-memory-review" "0 22 * * *" "session:knowledge-memory-curation" "knowledge" \
  "Review active memories due for refresh, expire stale ones, and supersede outdated memories."

echo "==> Quick checks"
psql "$DATABASE_URL" -c "select * from active_memories_v limit 10;" || true

cat <<'EOF'
DO NOT PROCEED TO PHASE 5 UNTIL:
- promoted memories are showing up in retrieval
- they are improving jobs, not bloating prompts
- contradictory/stale memories are being caught and reviewed
- this stays true for 48-72h
EOF

Phase 5 (final), idle cognition + Obsidian projection

Goal

Add bounded unstructured thinking and a human-browsable markdown layer.

Do not proceed until phase 4 is stable

#

Required before phase 5:

  • memory lane is trustworthy
  • retrieval is small and useful
  • publisher digest remains readable
  • you trust the system enough to let it think when idle

Exit criteria

  • idle cognition produces useful outputs without flooding you
  • ideas/opportunities/memory candidates from idle cognition stay under budget
  • Obsidian export matches DB truth
  • human trust remains high for 2+ weeks
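
"Obsidian export matches DB truth" can be checked mechanically if each exported note carries a fingerprint of the record it was generated from. The sketch below works under that assumption. The `source_hash` front-matter key, the record shape, and `render_note` are hypothetical and not part of `obsidian_export.py` as described here.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the DB record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def render_note(record: dict) -> str:
    """Render a markdown note whose front matter embeds the record fingerprint."""
    return (
        "---\n"
        f"id: {record['id']}\n"
        f"source_hash: {record_fingerprint(record)}\n"
        "---\n"
        f"# {record['title']}\n\n{record['body']}\n"
    )


def note_matches_record(note_text: str, record: dict) -> bool:
    """True if the note was generated from the record's current state."""
    for line in note_text.splitlines():
        if line.startswith("source_hash: "):
            return line.split(": ", 1)[1] == record_fingerprint(record)
    return False  # no fingerprint means the note cannot be verified


record = {"id": 42, "title": "Retry budget", "body": "Cap retries at 3."}
note = render_note(record)
print(note_matches_record(note, record))                            # True
print(note_matches_record(note, dict(record, body="Cap at 5.")))    # False
```

This only detects notes that have gone stale relative to Postgres. Catching hand edits to the note body would need a second hash over the rendered text.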

phase5_bootstrap.sh

#!/usr/bin/env bash
set -euo pipefail

APP_ROOT="${APP_ROOT:-/opt/agentbus}"
OBSIDIAN_ROOT="${OBSIDIAN_ROOT:-$HOME/Obsidian/AgentVault}"

add_job() {
  local name="$1"
  local cron_expr="$2"
  local session_target="$3"
  local agent_id="$4"
  local message="$5"

  openclaw cron add \
    --name "$name" \
    --cron "$cron_expr" \
    --tz "UTC" \
    --session "$session_target" \
    --agent "$agent_id" \
    --message "$message" \
    --no-deliver
}

echo "==> Creating Obsidian vault structure"
mkdir -p \
  "$OBSIDIAN_ROOT/00 Inbox" \
  "$OBSIDIAN_ROOT/01 Daily" \
  "$OBSIDIAN_ROOT/02 Ideas" \
  "$OBSIDIAN_ROOT/03 Opportunities" \
  "$OBSIDIAN_ROOT/04 Memories/Decisions" \
  "$OBSIDIAN_ROOT/04 Memories/Preferences" \
  "$OBSIDIAN_ROOT/04 Memories/Lessons" \
  "$OBSIDIAN_ROOT/04 Memories/Patterns" \
  "$OBSIDIAN_ROOT/04 Memories/Playbooks" \
  "$OBSIDIAN_ROOT/05 Reports" \
  "$OBSIDIAN_ROOT/06 Agents" \
  "$OBSIDIAN_ROOT/07 Snapshots"

echo "==> Installing Obsidian export service"
sudo cp obsidian_export.py "$APP_ROOT/obsidian_export.py"
sudo chmod +x "$APP_ROOT/obsidian_export.py"
sudo cp obsidian-export.service /etc/systemd/system/obsidian-export.service
sudo cp obsidian-export.timer /etc/systemd/system/obsidian-export.timer
sudo systemctl daemon-reload
sudo systemctl enable --now obsidian-export.timer
echo "==> Adding idle cognition jobs"

add_job "coord-idle-cognition" "0 1 * * *" "session:idle-coordinator" "coordinator" \
  "If queue is truly quiet, do bounded coordinator idle thinking and record only high-signal ideas, opportunities, or memory candidates."

add_job "audit-idle-cognition" "15 1 * * *" "session:idle-auditor" "auditor" \
  "If queue is truly quiet, do bounded auditor idle thinking about systemic fragility and hidden failure modes."

add_job "tracker-idle-cognition" "30 1 * * *" "session:idle-project-tracker" "project-tracker" \
  "If queue is truly quiet, do bounded project-tracker idle thinking about latent dependencies and missing links."

add_job "knowledge-idle-cognition" "45 1 * * *" "session:idle-knowledge" "knowledge" \
  "If queue is truly quiet, do bounded knowledge idle thinking that connects fragments into reusable abstractions."

add_job "devops-idle-cognition" "0 2 * * *" "session:idle-devops" "devops" \
  "If queue is truly quiet, do bounded devops idle thinking about automation and reliability wins."

add_job "metrics-idle-cognition" "15 2 * * *" "session:idle-product-metrics" "product-metrics" \
  "If queue is truly quiet, do bounded product-metrics idle thinking about experiments and underexplored bets."

add_job "design-idle-cognition" "30 2 * * *" "session:idle-design" "design" \
  "If queue is truly quiet, do bounded design idle thinking about clarity, simplification, and UX leverage."

add_job "security-idle-cognition" "45 2 * * *" "session:idle-security" "security" \
  "If queue is truly quiet, do bounded security idle thinking about preventative hardening and hidden assumptions."

add_job "research-idle-cognition" "0 3 * * *" "session:idle-research" "research" \
  "If queue is truly quiet, do bounded research idle thinking about external shifts and underexplored angles."

add_job "publisher-idle-cognition" "15 3 * * *" "session:idle-publisher" "publisher" \
  "If queue is truly quiet, do bounded publisher idle thinking about themes and future human-facing synthesis."

echo "==> Final checks"
systemctl --no-pager --full status agentbus-dispatcher.service || true
systemctl --no-pager --full status obsidian-export.timer || true

cat <<'EOF'
FINAL PHASE GATE:
Do not call this 'done' until:
- idle cognition stays under budget
- outputs are genuinely worth reading
- Obsidian exports match DB state
- the human-facing digests remain concise and high-signal
- this remains true for at least 2 weeks
EOF
#

Recommended gating summary

Phase 2 gate

Do not proceed until:

  • all 10 agents operational
  • no unexplained dead jobs
  • all standing orders leave evidence

Phase 3 gate

Do not proceed until:

  • ideas and opportunities are useful
  • publisher digest is not noisy
  • opportunity hygiene is stable

Phase 4 gate

#

Do not proceed until:

  • promoted memories are clearly improving work
  • stale/contradictory memories are being handled
  • retrieval stays small and relevant

Phase 5 gate

Do not call it done until:

  • idle cognition is productive, not spammy
  • Obsidian is helpful, not divergent
  • human trust stays high
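
Each gate boils down to "no bad signal inside the stability window". A sketch of that evaluation follows, assuming you can pull the timestamps of unexplained dead jobs from Postgres. The query is omitted, the function name is made up, and the 48-hour default simply mirrors the lower bound of the 48-72h rule.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def gate_passes(phase_started_at: datetime,
                dead_job_times: Iterable[datetime],
                now: datetime,
                window: timedelta = timedelta(hours=48)) -> bool:
    """True only if the phase has run a full window with no dead jobs inside it."""
    if now - phase_started_at < window:
        return False  # not enough observation time yet
    window_start = now - window
    return not any(t >= window_start for t in dead_job_times)


now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
started = now - timedelta(hours=72)

print(gate_passes(started, [now - timedelta(hours=60)], now))  # True: failure predates the window
print(gate_passes(started, [now - timedelta(hours=10)], now))  # False: failure inside the window
print(gate_passes(now - timedelta(hours=24), [], now))         # False: phase too young
```

For the final phase the same check would run with `window=timedelta(weeks=2)`.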

My honest recommendation

If I were doing this for real, I would stop after each phase and ask:

#

“Did this make the system more capable, or just more elaborate?”

That one question keeps the stack honest.

If you want, I can do one last packaging pass and turn all 5 phases into:

  1. a single roadmap.md
  2. a single bootstrap/ directory layout
  3. a master runbook for rollout, rollback, and smoke tests.
nimble phoenix
#

Please do all 3

warm elmBOT
#

Absolutely. Here’s the repo-ready package shape I’d use.


1) roadmap.md

# Multi-Agent Stack Roadmap

## Goal
Build a durable, auditable, proactive multi-agent system in controlled phases, with a hard stability gate between each phase.

## Non-negotiables
- Postgres is the source of truth for operational state
- Dispatcher + leases provide at-least-once recovery
- Agents must durably close the loop through agentbus
- No phase proceeds until the prior phase is stable
- Human-facing output is filtered and curated

---

## Phase 1: Core Reactive Spine

### Scope
- Core schema:
  - jobs
  - job_attempts
  - events
  - standing_order_runs
  - dispatcher_heartbeats
  - health views
- Dispatcher v2
- agentbus
- 4 pilot agents:
  - coordinator
  - auditor
  - devops
  - publisher
- Minimal operational cron jobs

### Success criteria
- queue works end-to-end
- cron produces `completed` or `no_work`
- blocked jobs recover
- dispatcher crash recovery works
- health snapshot is queryable
- no unexplained dead jobs

### Gate
Stable for 48-72h before phase 2.

---

## Phase 2: Full Reactive Fleet

### Scope
- Bring all 10 agents online
- Add remaining operational cron jobs
- Keep system reactive only
- No ideas/opportunities/memory/idle cognition yet
### Success criteria
- all 10 agents have standing-order evidence
- no new systemic failures from fleet expansion
- publisher output remains readable
- queue remains understandable

### Gate
Stable for 48-72h before phase 3.

---

## Phase 3: Proactive Lane

### Scope
- ideas
- opportunities
- proactive scans
- daily / weekly proactive digests

### Success criteria
- at least one useful idea/opportunity conversion
- opportunity hygiene is acceptable
- no flood of low-quality proposals
- publisher digest remains useful

### Gate
Stable for 48-72h before phase 4.

---
## Phase 4: Memory Lane

### Scope
- memory_candidates
- memories
- retrieval-aware job injection
- memory extraction and promotion jobs

### Success criteria
- promoted memories are helpful
- retrieval is small and relevant
- stale or conflicting memories are reviewable
- memory improves work quality without prompt bloat

### Gate
Stable for 48-72h before phase 5.

---

## Phase 5: Final Layer

### Scope
- bounded idle cognition
- Obsidian / markdown export
- human-browsable knowledge layer
- final operator polish

### Success criteria
- idle cognition produces signal, not spam
- Obsidian reflects DB truth
- human trust stays high
- the system is more capable, not just more elaborate

### Gate
Stable for 2 weeks before calling the stack “done”.

---

## Operating principle
At the end of every phase, ask:

> Did this make the system more capable, or just more complex?

If the answer is “more complex”, stop and simplify before proceeding.
#

2) bootstrap/ directory layout

bootstrap/
├─ README.md
├─ roadmap.md
├─ RUNBOOK.md
│
├─ shared/
│  ├─ env/
│  │  ├─ dispatcher.env.example
│  │  ├─ postgres.env.example
│  │  └─ obsidian.env.example
│  ├─ sql/
│  │  ├─ health_views.sql
│  │  ├─ dispatcher_heartbeats.sql
│  │  └─ backup_checks.sql
│  ├─ smoke/
│  │  ├─ enqueue_test_job.sh
│  │  ├─ blocked_job_test.sh
│  │  ├─ dispatcher_crash_test.sh
│  │  ├─ cron_evidence_check.sh
│  │  └─ retrieval_smoke_test.sh
│  ├─ rollback/
│  │  ├─ stop_phase.sh
│  │  ├─ disable_crons_by_prefix.sh
│  │  ├─ rollback_dispatcher.sh
│  │  └─ rollback_schema_notes.md
│  └─ dashboards/
│     ├─ metabase_notes.md
│     ├─ grafana_panels.md
│     └─ system_health_queries.sql
│
├─ phase1/
│  ├─ phase1_bootstrap.sh
│  ├─ phase1_core.sql
│  ├─ agentbus.py
│  ├─ dispatcher_v2.py
│  ├─ agentbus-dispatcher.service
│  ├─ pilot_workspaces/
│  │  ├─ coordinator/AGENTS.md
│  │  ├─ auditor/AGENTS.md
│  │  ├─ devops/AGENTS.md
│  │  └─ publisher/AGENTS.md
│  └─ notes.md
│
├─ phase2/
│  ├─ phase2_bootstrap.sh
│  ├─ remaining_agents/
│  │  ├─ project-tracker/AGENTS.md
│  │  ├─ knowledge/AGENTS.md
│  │  ├─ product-metrics/AGENTS.md
│  │  ├─ design/AGENTS.md
│  │  ├─ security/AGENTS.md
│  │  └─ research/AGENTS.md
│  ├─ phase2_cron_delta.sh
│  └─ notes.md
│
├─ phase3/
│  ├─ phase3_bootstrap.sh
│  ├─ ideas_opportunities.sql
│  ├─ agentbus.py
│  ├─ phase3_proactive_crons.sh
│  └─ notes.md
│
├─ phase4/
│  ├─ phase4_bootstrap.sh
│  ├─ memory.sql
│  ├─ agentbus.py
│  ├─ dispatcher_with_memory.py
│  ├─ phase4_memory_crons.sh
│  └─ notes.md
│
├─ phase5/
│  ├─ phase5_bootstrap.sh
│  ├─ obsidian_export.py
│  ├─ obsidian-export.service
│  ├─ obsidian-export.timer
│  ├─ phase5_idle_cognition_crons.sh
│  ├─ vault_templates/
│  │  ├─ idea.md
│  │  ├─ opportunity.md
│  │  ├─ memory.md
│  │  ├─ daily_snapshot.md
│  │  └─ weekly_bets.md
│  └─ notes.md
│
└─ final/
   ├─ release_checklist.md
   ├─ backup_restore_drill.md
   ├─ final_acceptance.md
   └─ operator_training.md

bootstrap/README.md should say

# Bootstrap Layout

Each phase is self-contained and must be rolled out only after the previous phase is stable.

## Rule
Never skip a phase gate.
## Expected flow
1. Read roadmap.md
2. Follow RUNBOOK.md
3. Run phaseN/phaseN_bootstrap.sh
4. Execute smoke tests
5. Hold the phase until the gate is satisfied
6. Only then move to the next phase

3) RUNBOOK.md

# RUNBOOK.md

## Purpose
This runbook is the operational guide for rollout, rollback, smoke tests, and daily checks for the multi-agent stack.

---

## Global rollout rules

1. Roll out one phase at a time.
2. Each phase must be stable before the next begins.
3. If a phase adds noise faster than value, stop and revert.
4. Postgres is authoritative. Markdown is a projection layer.
5. All durable work state must be visible in the database.
6. If you cannot observe it, you do not have it.
---

## Pre-flight checks before any phase

- Postgres reachable
- recent backup exists
- restore path tested at least once recently
- OpenClaw gateway healthy
- dispatcher service healthy
- health snapshot query returns sane output
- no current critical incident in the queue

Command checklist:
- `openclaw status`
- `openclaw gateway status`
- `systemctl status agentbus-dispatcher`
- `psql "$DATABASE_URL" -c "select * from system_health_snapshot_v;"`

---

## Phase 1 rollout

### Rollout steps
1. Apply core schema
2. Install agentbus
3. Install dispatcher
4. Start dispatcher
5. Bring up 4 pilot agents
6. Add minimal cron jobs
7. Run smoke tests

### Smoke tests
- enqueue a devops job
- verify `job_attempts`
- verify terminal durable update
- verify `standing_order_runs`
- kill dispatcher during a run and verify lease recovery

### Rollback
- disable phase 1 cron jobs
- stop dispatcher
- keep schema, but stop active use
- revert agent workspaces if needed

### Gate
Do not leave phase 1 until:
- queue is boringly reliable
- blocked jobs recover
- standing orders leave evidence
- no unexplained dead jobs for 48-72h

---

## Phase 2 rollout

### Rollout steps
1. Verify phase 1 gate
2. Add remaining 6 agents
3. Add remaining operational cron jobs
4. Run one manual job through each new role
5. Watch health snapshot + publisher digest

### Smoke tests
- one job per newly added agent
- one cron run per newly added agent
- verify no new queue pathology

### Rollback
- disable phase 2 cron delta
- route traffic back to pilot roles only
- keep inactive workspaces present but unused

### Gate
Do not leave phase 2 until:
- all 10 roles produce standing-order evidence
- no unexplained dead jobs
- fleet expansion did not degrade readability or control

---

## Phase 3 rollout

### Rollout steps
1. Verify phase 2 gate
2. Apply ideas/opportunities schema
3. Install updated agentbus
4. Add proactive scan cron jobs
5. Create one idea, one opportunity, one opportunity-to-job conversion
6. Verify publisher digest
### Smoke tests
- create idea
- create opportunity
- convert opportunity to job
- verify accepted ideas reach digest
- verify the auditor flags opportunity spam when it appears

### Rollback
- disable proactive cron jobs
- stop using ideas/opportunities
- leave tables in place if desired, but freeze writes

### Gate
Do not leave phase 3 until:
- ideas/opportunities are useful
- noise budget is respected
- daily/weekly digests are worth reading

---

## Phase 4 rollout

### Rollout steps
1. Verify phase 3 gate
2. Apply memory schema
3. Install updated agentbus
4. Install retrieval-aware dispatcher
5. Add memory extract/promote/review cron jobs
6. Verify one candidate → promoted memory → retrieval path
### Smoke tests
- create memory candidate
- promote memory candidate
- query active memories
- verify retrieval injection on one real job
- verify stale memory can be reviewed/expired

### Rollback
- disable memory cron jobs
- revert dispatcher to non-retrieval version
- freeze new memory writes
- keep memories read-only until diagnosis is complete

### Gate
Do not leave phase 4 until:
- memories improve jobs
- prompt bloat is controlled
- stale/conflicting memories are manageable

---

## Phase 5 rollout

### Rollout steps
1. Verify phase 4 gate
2. Create Obsidian vault structure
3. Install export service/timer
4. Add idle cognition cron jobs
5. Verify outputs flow to ideas/opportunities/memory_candidates, not directly to humans
6. Verify exported markdown matches DB truth

### Smoke tests
- one idle cognition run produces an idea or candidate
- no direct human spam
- export timer writes vault files
- compare one DB record to one exported markdown note

### Rollback
- disable idle cognition cron jobs
- disable export timer
- keep DB state, but stop projection and free-thought generation

### Gate
Do not call the final phase done until:
- idle cognition is useful
- Obsidian is helpful, not divergent
- human trust remains high for 2 weeks

---

## Daily operator checks

- queue backlog sane
- no stale dispatcher heartbeat
- no surge in dead jobs
- standing-order runs present
- digest quality acceptable

Suggested checks:
- `select * from system_health_snapshot_v;`
- `select * from dispatcher_heartbeats;`
- `openclaw cron list`
- `openclaw tasks audit`

---

## Weekly operator checks

- backup health
- restore drill status
- opportunity noise
- memory staleness
- idle cognition usefulness
- Obsidian sync quality
---

## Incident handling

### Dispatcher down
- restart service
- verify heartbeat row updates
- verify leases are requeued as expected

### Queue spike
- inspect dead / blocked / retry hotspots
- pause nonessential proactive/idle phases first

### Digest becomes noisy
- disable proactive or idle cognition cron jobs before touching core reactive work

### Memory pollution
- freeze memory promotion
- revert to non-retrieval dispatcher if needed
- prune/expire bad memories

### Obsidian divergence
- stop export timer
- treat Postgres as truth
- rebuild projection from DB

---

## Golden rule
If forced to choose, preserve:
1. queue integrity
2. auditability
3. operator clarity

Before:
4. proactivity
5. memory
6. idle cognition
7. markdown polish
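
The golden rule amounts to a shedding order: under pressure, switch off the lowest-priority layer that is still enabled and never touch the top three. A sketch follows, with layer names taken from the list above. The `enabled` set and the function name are assumed for illustration.

```python
from typing import Optional, Set

# Highest priority first, as listed in the golden rule.
PRIORITY = [
    "queue integrity",
    "auditability",
    "operator clarity",
    "proactivity",
    "memory",
    "idle cognition",
    "markdown polish",
]
PROTECTED = set(PRIORITY[:3])  # never shed these


def next_layer_to_disable(enabled: Set[str]) -> Optional[str]:
    """Return the lowest-priority enabled layer that is safe to switch off."""
    for layer in reversed(PRIORITY):
        if layer in enabled and layer not in PROTECTED:
            return layer
    return None  # only protected layers remain


enabled = set(PRIORITY)
print(next_layer_to_disable(enabled))                                          # markdown polish
print(next_layer_to_disable(enabled - {"markdown polish", "idle cognition"}))  # memory
print(next_layer_to_disable(PROTECTED))                                        # None
```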

If you want, I can do one last packaging pass and generate the actual contents for:

  • bootstrap/README.md
  • roadmap.md
  • RUNBOOK.md

as fully polished copy-paste files instead of outlines.

nimble phoenix
#

Yes, please for bootstrap/README.md

roadmap.md
RUNBOOK.md
warm elmBOT
#

Absolutely. Here are polished copy-paste versions.

bootstrap/README.md

# Bootstrap

This directory contains the rollout assets for a phased multi-agent system built on:

- OpenClaw for agent runtime, cron, and tool execution
- Postgres for durable operational state
- `agentbus` for durable writes from agents
- a dispatcher process for queue claiming, leases, and wakeups
- optional markdown/Obsidian export as a human-readable projection layer

## Core principle

Roll out **one phase at a time**.

Do not skip phases.  
Do not enable the next layer until the current layer is boringly stable.

This stack is designed so that each phase adds one new capability:

1. **Phase 1**: core reactive spine
2. **Phase 2**: full reactive fleet
3. **Phase 3**: proactive ideas/opportunities lane
4. **Phase 4**: long-term memory lane
5. **Phase 5**: idle cognition + markdown projection

## Source of truth

Operational truth lives in **Postgres**, not in chat, not in markdown, and not in session notes alone.

Examples of authoritative state:

- jobs
- job attempts
- events
- standing order runs
- dispatcher heartbeats
- ideas
- opportunities
- memory candidates
- memories

Markdown is a **projection layer**. It is meant to be readable and editable by humans, but it is not the live operational backbone.

## Layout

- `phase1/` — minimum viable deployment
- `phase2/` — bring all 10 agents online for reactive work
- `phase3/` — add ideas, opportunities, and proactive scans
- `phase4/` — add memory candidates, promoted memories, and retrieval
- `phase5/` — add bounded idle cognition and Obsidian export
- `shared/` — reusable smoke tests, rollback helpers, example env files, dashboard notes

## How to use this directory

1. Read `../roadmap.md`
2. Read `../RUNBOOK.md`
3. Confirm the current phase gate is satisfied
4. Run that phase’s bootstrap script
5. Execute that phase’s smoke tests
6. Hold the phase for 48-72 hours, or 2 weeks for the final phase
7. Only then proceed

## Stability gates

Every phase has a hard “do not proceed” gate.

If a phase adds more noise than value, stop there and simplify.  
The goal is not maximum sophistication. The goal is reliable compounding capability.

## Recommended operating posture
Prefer this order of priorities:

1. queue integrity
2. auditability
3. operator clarity
4. proactivity
5. memory
6. idle cognition
7. markdown polish

If a lower-priority layer threatens a higher-priority layer, disable it.

## What not to do

- Do not enable ideas, opportunities, memory, and idle cognition all at once
- Do not treat ideas as memories
- Do not use cross-agent chat as the durable work bus
- Do not trust markdown as the sole source of truth
- Do not proceed to the next phase because you are impatient

## What “done” means

This system is only “done” when:

- reactive work is reliable
- recovery is predictable
- publisher output is worth reading
- proactive output is high-signal
- memory improves results without bloat
- idle cognition creates value without spamming you
- humans can still understand what the system is doing

If that is not true, the stack is not done yet.

roadmap.md

# Multi-Agent Roadmap

## Objective

Build a multi-agent system that is:

- durable
- auditable
- proactive
- memory-bearing
- operator-readable

without becoming noisy, brittle, or impossible to reason about.

The stack is intentionally phased so each new capability is added only after the previous layer is stable.

---

## Principles
### 1. Durable state first
If it matters operationally, it belongs in Postgres.

### 2. Reactivity before proactivity
The system must be reliable before it is clever.

### 3. Promotion before persistence
Ideas are cheap. Opportunities are higher bar. Memories are highest bar.

### 4. Curation over accumulation
More output is not better output. The system should get sharper over time, not just larger.

### 5. Human trust is the real acceptance test
If the operator stops trusting the output, the stack is failing, even if the code is “working”.

---

## Phase 1: Core Reactive Spine

### Purpose
Prove that the system can durably accept, process, retry, and audit work.

### Includes
- core Postgres schema:
  - jobs
  - job_attempts
  - events
  - standing_order_runs
  - dispatcher_heartbeats
  - system health views
- `agentbus`
- dispatcher service
- 4 pilot roles:
  - coordinator
  - auditor
  - devops
  - publisher
- minimum cron set

### Questions this phase answers
- Can the queue be trusted?
- Can standing orders be proven?
- Does lease recovery work?
- Is the operator able to see system health quickly?
### Exit criteria
- queue lifecycle works end-to-end
- blocked jobs recover
- standing orders leave evidence
- dispatcher crash test passes
- no unexplained dead jobs for 48-72h

### Gate
Do not proceed to phase 2 until phase 1 is boringly reliable.

---

## Phase 2: Full Reactive Fleet

### Purpose
Bring the full 10-agent architecture online, but keep it strictly reactive/control-plane only.

### Includes
- all 10 agent workspaces
- remaining operational cron jobs
- full role coverage
- no ideas/opportunities yet
- no memory yet
- no idle cognition yet

### Questions this phase answers
- Does the full role topology hold up?
- Can the publisher still produce readable output with the full fleet active?
- Does the operator still understand the system?
### Exit criteria
- all 10 roles have standing-order evidence
- each role has processed at least one real job
- queue remains understandable
- no new systemic failure mode appears for 48-72h

### Gate
Do not proceed to phase 3 until the full reactive fleet is stable.

---

## Phase 3: Proactive Lane

### Purpose
Allow the system to notice useful things before being asked, without turning into noise.

### Includes
- ideas
- opportunities
- idea/opportunity commands in `agentbus`
- proactive scan cron jobs
- publisher digests for ideas/opportunities

### Model
- `ideas` = low-bar, low-friction, weak signal
- `opportunities` = scored, decision-ready proposals

### Questions this phase answers
- Can agents surface useful proactive observations?
- Can the system keep weak and strong signals separated?
- Can publisher keep the output readable?

### Exit criteria
- at least one useful idea is surfaced
- at least one opportunity converts into a job
- daily/weekly digests are useful, not spammy
- auditors do not see chronic idea/opportunity hygiene issues

### Gate
Do not proceed to phase 4 until the proactive lane is high-signal for 48-72h.

---

## Phase 4: Memory Lane
### Purpose
Make the system learn from repeated outcomes, decisions, mistakes, and patterns over weeks and months.

### Includes
- memory candidates
- promoted memories
- memory extraction cron jobs
- centralized memory promotion
- retrieval-aware job injection

### Model
- `memory_candidates` = candidate long-term learning
- `memories` = promoted durable learning

### Questions this phase answers
- Does memory improve execution quality?
- Are agents remembering the right things?
- Can memory stay relevant without turning into sludge?

### Exit criteria
- promoted memories are clearly useful
- retrieval is small and contextually relevant
- stale/conflicting memories can be reviewed and retired
- prompts are not bloated by memory injection

### Gate
Do not proceed to phase 5 until memory is helping more than it is complicating.

---
## Phase 5: Idle Cognition + Markdown Projection

### Purpose
Give agents bounded unstructured thinking time and a human-readable knowledge surface.

### Includes
- bounded idle cognition cron jobs
- Obsidian / markdown export
- curated projection of:
  - ideas
  - opportunities
  - memories
  - reports
  - daily snapshots
### Model
Idle cognition does **not** talk directly to humans by default.  
It feeds:
- ideas
- opportunities
- memory candidates

Markdown does **not** replace Postgres.  
It is a readable projection layer.

### Questions this phase answers
- Can unstructured thinking produce real value?
- Can the system think without drowning the operator?
- Does markdown improve oversight without creating sync chaos?
### Exit criteria
- idle cognition stays under budget
- generated output is worth reading
- Obsidian reflects database truth
- operator trust remains high for 2 weeks

### Gate
Do not call the system done until phase 5 stays useful and stable for at least 2 weeks.

---

## Final system shape

When fully rolled out, the stack should have:
### Reactive lane
- durable jobs
- retries
- standing orders
- audits
- publisher summaries

### Proactive lane
- ideas
- opportunities
- triage
- daily/weekly digests

### Memory lane
- memory candidates
- memories
- retrieval
- review / expiry

### Reflection lane
- bounded idle cognition
- cross-pollination
- compounding insight

### Human-readable layer
- markdown export
- playbooks
- decisions
- reports
- snapshots

---

## Success condition

The stack is successful when it:

- reliably does assigned work
- notices important things early
- learns from repeated outcomes
- improves over time
- remains legible to the operator

If it becomes more complex without becoming more useful, stop and simplify.

RUNBOOK.md

# RUNBOOK

## Purpose

This runbook is the operator guide for:

- rollout
- smoke tests
- rollback
- incident handling
- phase gates
- daily/weekly checks

It exists so the system can be operated deliberately, not just assembled.

---

## Global operating rules

1. Roll out one phase at a time.
2. Never skip a gate.
3. If a phase adds more noise than value, roll it back or pause it.
4. Postgres is authoritative.
5. Markdown is a projection layer.
6. If you cannot observe it, you do not have it.
7. If humans stop trusting it, treat that as a production problem.

---

## Pre-flight checklist before any rollout

Before running any bootstrap script, confirm:

- Postgres is reachable
- recent backup exists
- restore path is documented
- OpenClaw gateway is healthy
- dispatcher service is healthy
- current health snapshot is sane
- there is no active critical incident in the queue

Recommended checks:

```bash
openclaw status
openclaw gateway status
systemctl status agentbus-dispatcher
psql "$DATABASE_URL" -c "select * from system_health_snapshot_v;"
```

If any of these fail, do not roll forward.

---

## Phase 1 runbook

### Goal
Establish the minimum viable reactive spine.

### Rollout steps
1. Apply core schema
2. Install `agentbus`
3. Install dispatcher service
4. Start dispatcher
5. Bring up pilot workspaces
6. Add minimum cron jobs
7. Execute smoke tests

### Smoke tests
- enqueue one real job to `devops`
- verify `job_attempts` row exists
- verify job reaches a terminal state
- verify one standing order writes `completed` or `no_work`
- kill dispatcher during a running job and verify lease recovery
- verify publisher can summarize a real result

### Rollback plan
- disable phase 1 cron jobs
- stop dispatcher
- stop routing new work into the stack
- retain the schema, but pause usage
- restore from backup only if data corruption is suspected

### Gate
Do not leave phase 1 until:
- the queue is boringly reliable
- standing orders always leave evidence
- blocked work recovers
- there are no unexplained dead jobs for 48-72h

---

## Phase 2 runbook

### Goal
Expand to the full 10-agent reactive fleet.

### Rollout steps
1. Verify phase 1 gate
2. Create/verify all remaining workspaces
3. Install remaining `AGENTS.md` files
4. Add remaining operational cron jobs
5. Run one manual job through each new role
6. Verify queue health + publisher digest

### Smoke tests
- each new role receives one real job
- each new role has at least one standing-order run
- no new unexpected tool policy failures
- digest readability remains acceptable

### Rollback plan
- disable phase 2 cron delta
- keep the 4 pilot roles live
- keep the other 6 roles present but inactive
- revert to pilot-only routing if necessary

### Gate
Do not leave phase 2 until:
- all 10 roles have standing-order evidence
- all 10 roles have successfully processed work
- there are no new systemic queue pathologies for 48-72h

---

## Phase 3 runbook

### Goal
Introduce proactive sensing without creating noise.

### Rollout steps
1. Verify phase 2 gate
2. Apply ideas/opportunities schema
3. Install updated `agentbus`
4. Add proactive scan cron jobs
5. Create one idea manually
6. Create one opportunity manually
7. Convert one opportunity into a real job
8. Verify digest output

### Smoke tests
- `ideas` row created
- `opportunities` row created
- one opportunity converts into a real job
- publisher digest includes at least one good item
- auditors can see hygiene issues if noise appears

### Rollback plan
- disable proactive cron jobs
- freeze new ideas/opportunities
- keep tables in place, but stop active use
- revert publisher to reactive-only summaries

### Gate
Do not leave phase 3 until:
- the proactive lane is useful
- noise budgets are respected
- digests are worth reading
- the system stays calm for 48-72h

---

## Phase 4 runbook

### Goal
Add long-term learning through curated memory.

### Rollout steps
1. Verify phase 3 gate
2. Apply memory schema
3. Install updated `agentbus`
4. Install retrieval-aware dispatcher
5. Restart dispatcher
6. Add memory extract/promote/review cron jobs
7. Create one memory candidate manually
8. Promote one memory candidate manually
9. Verify retrieval on one real job

### Smoke tests
- memory candidate creation works
- memory promotion works
- active memory view looks sane
- retrieval injects a small, relevant set
- expiring/superseding memory is possible

### Rollback plan
- disable memory cron jobs
- revert dispatcher to non-retrieval version
- freeze new memory writes
- keep old memories read-only until reviewed
### Gate
Do not leave phase 4 until:
- promoted memories clearly help work quality
- retrieval does not bloat prompts
- stale/conflicting memories are managed
- memory remains useful for 48-72h

---

## Phase 5 runbook

### Goal
Add bounded idle-time cognition and the markdown projection layer.

### Rollout steps
1. Verify phase 4 gate
2. Create Obsidian vault structure
3. Install export service/timer
4. Add idle cognition cron jobs
5. Verify idle outputs land in ideas/opportunities/memory_candidates
6. Verify exported markdown matches DB truth

### Smoke tests
- one idle cognition run produces an idea or candidate
- no direct human spam from idle jobs
- export timer writes expected files
- one exported note matches its DB row closely enough to trust the projection

### Rollback plan
- disable idle cognition cron jobs
- disable export timer
- treat Postgres as sole truth again
- leave exported markdown as stale reference only if needed

### Gate
Do not call the system “done” until:
- idle cognition stays under budget
- outputs are worth reading
- Obsidian helps rather than confuses
- trust stays high for 2 weeks

---

## Daily operator checks
### Goal
Catch drift before it becomes chaos.

Check:
- queued / blocked / dead counts
- dispatcher heartbeat freshness
- retry hotspots
- standing-order evidence
- digest quality
- unexpected spike in ideas/opportunities

Suggested queries:
```sql
select * from system_health_snapshot_v;
select * from dispatcher_heartbeats;
select * from standing_order_runs order by created_at desc limit 20;
```

Suggested commands:
```bash
openclaw status
openclaw gateway status
openclaw cron list
openclaw tasks audit
```

---

## Weekly operator checks

### Goal
Keep the system healthy over time.

Review:
- backup status
- restore drill status
- dead-letter history
- opportunity noise
- memory staleness
- idle cognition usefulness
- Obsidian sync quality
- whether any phase should actually be simplified

---

## Incident handling
### Dispatcher down
1. restart service
2. verify heartbeat row freshness
3. verify leases requeue as expected
4. confirm no silent job loss

### Queue spike
1. inspect dead / blocked / retry hotspot counts
2. disable lower-priority cron lanes first:
   - idle cognition
   - proactive scans
   - memory extraction
3. preserve reactive core

### Digest becomes noisy
1. disable proactive or idle cron jobs first
2. do not touch the core queue first
3. reduce proposal budgets before reducing reactive work

### Memory pollution
1. freeze memory promotion
2. revert dispatcher if retrieval is hurting execution
3. review / expire bad memories
4. restore a smaller trusted memory set

### Obsidian divergence
1. stop export timer
2. keep Postgres authoritative
3. rebuild projection from DB once stable
### Backup uncertainty
1. pause rollout
2. verify latest backup
3. verify restore path
4. do not add another phase until backup confidence is restored

---

## Rollout philosophy

When in doubt, preserve:

1. queue integrity
2. auditability
3. operator clarity

before:

4. proactivity
5. memory
6. idle cognition
7. markdown polish

---

## Final acceptance condition

This stack is ready for sustained use only when:

- the queue is reliable
- the agents are useful
- the digests are readable
- memory is helpful
- idle cognition creates signal
- the operator still feels in control

If that is not true, the right move is not “more features”.
The right move is simplification.

If you want, I can also turn the phase scripts + these docs into a single final repo scaffold with filenames and file contents grouped exactly the way you’d drop them into Git.

nimble phoenix
#

Yes — final repo scaffold with filenames and file contents ready to drop into Git.

warm elmBOT
#

Yep. Best way to package this in one Discord message is:

  • I give you a Git-ready scaffold
  • the scaffold uses the exact filenames from everything we built
  • the large SQL/Python implementation files keep the same names, so you can drop in the code from the earlier snippets without renaming anything

That way you get a clean repo shape that actually hangs together.


Repo tree

multi-agent-stack/
├─ README.md
├─ roadmap.md
├─ RUNBOOK.md
├─ .gitignore
│
├─ bootstrap/
│  ├─ README.md
│  │
│  ├─ shared/
│  │  ├─ env/
│  │  │  ├─ dispatcher.env.example
│  │  │  ├─ postgres.env.example
│  │  │  └─ obsidian.env.example
│  │  ├─ smoke/
│  │  │  ├─ enqueue_test_job.sh
│  │  │  ├─ blocked_job_test.sh
│  │  │  ├─ dispatcher_crash_test.sh
│  │  │  ├─ cron_evidence_check.sh
│  │  │  └─ retrieval_smoke_test.sh
│  │  ├─ rollback/
│  │  │  ├─ stop_phase.sh
│  │  │  ├─ disable_crons_by_prefix.sh
│  │  │  ├─ rollback_dispatcher.sh
│  │  │  └─ rollback_schema_notes.md
│  │  └─ dashboards/
│  │     ├─ metabase_notes.md
│  │     ├─ grafana_panels.md
│  │     └─ system_health_queries.sql
│  │
│  ├─ phase1/
│  │  ├─ phase1_bootstrap.sh
│  │  ├─ phase1_core.sql
│  │  ├─ agentbus.py
│  │  ├─ dispatcher_v2.py
│  │  ├─ agentbus-dispatcher.service
│  │  ├─ dispatcher.env
│  │  └─ notes.md
│  │
│  ├─ phase2/
│  │  ├─ phase2_bootstrap.sh
│  │  ├─ phase2_cron_delta.sh
│  │  └─ notes.md
│  │
│  ├─ phase3/
│  │  ├─ phase3_bootstrap.sh
│  │  ├─ ideas_opportunities.sql
│  │  ├─ agentbus.py
│  │  ├─ phase3_proactive_crons.sh
│  │  └─ notes.md
│  │
│  ├─ phase4/
│  │  ├─ phase4_bootstrap.sh
│  │  ├─ memory.sql
│  │  ├─ agentbus.py
│  │  ├─ dispatcher_with_memory.py
│  │  ├─ phase4_memory_crons.sh
│  │  └─ notes.md
│  │
│  └─ phase5/
│     ├─ phase5_bootstrap.sh
│     ├─ obsidian_export.py
│     ├─ obsidian-export.service
│     ├─ obsidian-export.timer
│     ├─ phase5_idle_cognition_crons.sh
│     ├─ vault_templates/
│     │  ├─ idea.md
│     │  ├─ opportunity.md
│     │  ├─ memory.md
│     │  ├─ daily_snapshot.md
│     │  └─ weekly_bets.md
│     └─ notes.md
│
└─ workspaces/
   ├─ coordinator/AGENTS.md
   ├─ auditor/AGENTS.md
   ├─ project-tracker/AGENTS.md
   ├─ knowledge/AGENTS.md
   ├─ devops/AGENTS.md
   ├─ product-metrics/AGENTS.md
   ├─ design/AGENTS.md
   ├─ security/AGENTS.md
   ├─ research/AGENTS.md
   └─ publisher/AGENTS.md

README.md

# Multi-Agent Stack
A phased, Postgres-first multi-agent architecture built around:

- OpenClaw for runtime, tools, and cron
- Postgres for durable state
- `agentbus` for durable writes from agents
- a dispatcher for queue claiming, leases, retries, and wakeups
- optional markdown/Obsidian export as a human-readable projection layer

## Design philosophy

This stack is intentionally rolled out in phases:

1. reactive spine
2. full reactive fleet
3. proactive lane
4. memory lane
5. idle cognition + markdown export

The rule is simple:

> Do not add a smarter layer until the simpler layer is boringly reliable.

## Source of truth

Postgres is the authority for:
- jobs
- job attempts
- events
- standing order runs
- dispatcher heartbeats
- ideas
- opportunities
- memory candidates
- memories

Markdown is a projection layer, not the live operational truth.

## Quick start

1. Read `roadmap.md`
2. Read `RUNBOOK.md`
3. Start with `bootstrap/phase1/phase1_bootstrap.sh`
4. Do not proceed to the next phase until the phase gate is satisfied

## Repo conventions

- `bootstrap/phaseN/` contains phase-specific rollout assets
- `bootstrap/shared/` contains smoke tests, rollback helpers, env examples, and dashboard notes
- `workspaces/` contains the agent `AGENTS.md` files
- top-level docs describe rollout order, gates, and operations

## Final outcome

When complete, the system should be:

- durable
- auditable
- proactive
- memory-bearing
- operator-readable

If it becomes more complex without becoming more useful, stop and simplify.

roadmap.md

Use the polished version from earlier in the thread. File path stays exactly:

roadmap.md

If you want the short version instead, use this:

# Roadmap

## Phase 1
Core reactive spine:
- core schema
- agentbus
- dispatcher
- 4 pilot agents
- minimal cron

## Phase 2
Full reactive fleet:
- all 10 agents
- remaining operational cron jobs
## Phase 3
Proactive lane:
- ideas
- opportunities
- proactive scans
- digests

## Phase 4
Memory lane:
- memory candidates
- memories
- retrieval-aware job injection

## Phase 5
Final layer:
- bounded idle cognition
- Obsidian export
- long-term operator polish

## Golden rule
Do not proceed until the current phase is boringly stable.

RUNBOOK.md

Use the polished version from earlier in the thread. File path stays exactly:

RUNBOOK.md

If you want the compact version in Git first, use this:

# RUNBOOK

## Global rules
- one phase at a time
- no skipped gates
- Postgres is authoritative
- Markdown is a projection layer
- if humans stop trusting it, treat that as a production problem
## Phase 1
Bring up:
- core schema
- dispatcher
- agentbus
- 4 pilot agents
- minimum cron
Gate:
- queue reliable
- blocked jobs recover
- no unexplained dead jobs for 48-72h

## Phase 2
Bring up:
- remaining 6 agents
- remaining operational cron jobs
Gate:
- all 10 roles leave evidence
- no new systemic failures for 48-72h

## Phase 3
Bring up:
- ideas
- opportunities
- proactive scans
- digests
Gate:
- useful, not noisy, for 48-72h

## Phase 4
Bring up:
- memory candidates
- memories
- retrieval
Gate:
- memories improve outcomes without bloat for 48-72h

## Phase 5
Bring up:
- idle cognition
- Obsidian export
Gate:
- useful, trusted, and readable for 2 weeks

## Daily checks
- queue health
- standing order evidence
- dispatcher heartbeat freshness
- digest readability

## Weekly checks
- backup health
- restore drill confidence
- memory staleness
- opportunity noise
- idle cognition usefulness

bootstrap/README.md

# Bootstrap

This directory contains the rollout assets for each phase of the stack.

## How to use it

1. Read `../roadmap.md`
2. Read `../RUNBOOK.md`
3. Verify the current phase gate is satisfied
4. Run the next phase bootstrap script
5. Execute smoke tests
6. Hold the phase until stable
7. Only then continue

## Important

These bootstrap scripts are create-once helpers.
They are not intended to be fully idempotent.

## Directory roles

- `shared/` — common helpers
- `phase1/` — minimum viable deployment
- `phase2/` — expand to full reactive fleet
- `phase3/` — add ideas/opportunities
- `phase4/` — add memory
- `phase5/` — add idle cognition and Obsidian export

## Deployment order
Phase order is mandatory.

Do not:
- enable ideas before the reactive lane is stable
- enable memory before proactive output is under control
- enable idle cognition before memory and digest quality are trustworthy

.gitignore

# Python
__pycache__/
*.pyc
.venv/

# Environment
.env
*.env
dispatcher.env

# Obsidian / local exports
obsidian/
AgentVault/
*.db

# Logs
logs/
*.log

# OS
.DS_Store
Thumbs.db

bootstrap/shared/env/dispatcher.env.example

DATABASE_URL=postgresql://agentbus:change-me@127.0.0.1:5432/agentbus
OPENCLAW_BASE_URL=http://127.0.0.1:18789
OPENCLAW_HOOK_TOKEN=replace-me

DISPATCHER_NAMESPACE=default
DISPATCHER_ID=dispatcher-1

LEASE_SECONDS=600
POLL_SECONDS=10
CLAIM_BATCH=10

DEFAULT_AGENT_CONCURRENCY=1
AGENT_LIMITS_JSON={"coordinator":1,"auditor":1,"project-tracker":1,"knowledge":1,"devops":1,"product-metrics":1,"design":1,"security":1,"research":1,"publisher":1}
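
A minimal sketch of how a dispatcher might read the concurrency settings above. It assumes the values arrive as environment variables with exactly these names. `load_agent_limits` and `limit_for` are hypothetical helpers for illustration, not functions from `dispatcher_v2.py`.

```python
import json
import os


def load_agent_limits(env=None):
    """Return (default_limit, per_agent_limits) parsed from an env mapping."""
    env = os.environ if env is None else env
    default_limit = int(env.get("DEFAULT_AGENT_CONCURRENCY", "1"))
    limits = json.loads(env.get("AGENT_LIMITS_JSON", "{}"))
    if not isinstance(limits, dict):
        raise ValueError("AGENT_LIMITS_JSON must be a JSON object")
    # Coerce values to int so a quoted number such as "2" still works.
    return default_limit, {str(name): int(value) for name, value in limits.items()}


def limit_for(agent, default_limit, limits):
    """Concurrency cap for one agent, falling back to the default."""
    return limits.get(agent, default_limit)


if __name__ == "__main__":
    sample = {
        "DEFAULT_AGENT_CONCURRENCY": "1",
        "AGENT_LIMITS_JSON": '{"devops": 2, "publisher": 1}',
    }
    default_limit, limits = load_agent_limits(sample)
    for agent in ("devops", "research"):
        print(agent, limit_for(agent, default_limit, limits))
```

Failing loudly on a malformed `AGENT_LIMITS_JSON` at startup is usually preferable to silently running every agent at the default.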

bootstrap/shared/env/postgres.env.example

POSTGRES_DB=agentbus
POSTGRES_USER=agentbus
POSTGRES_PASSWORD=change-me

bootstrap/shared/env/obsidian.env.example

OBSIDIAN_ROOT=/srv/obsidian/AgentVault
DATABASE_URL=postgresql://agentbus:change-me@127.0.0.1:5432/agentbus


bootstrap/shared/smoke/enqueue_test_job.sh

#!/usr/bin/env bash
set -euo pipefail

APP_ROOT="${APP_ROOT:-/opt/agentbus}"

"$APP_ROOT/agentbus.py" enqueue-job \
  --target-agent devops \
  --job-type reactive \
  --subject "smoke.test" \
  --payload '{"check":"service-health"}'


bootstrap/shared/smoke/blocked_job_test.sh

#!/usr/bin/env bash
set -euo pipefail

echo "Create a job that will intentionally block, then verify:"
echo "- blocked row appears in jobs"
echo "- blocked_reason is populated"
echo "- reroute/requeue policy is visible"

bootstrap/shared/smoke/dispatcher_crash_test.sh

#!/usr/bin/env bash
set -euo pipefail

echo "1. Start a longer-running job"
echo "2. Kill dispatcher"
echo "3. Restart dispatcher"
echo "4. Verify either:"
echo "   - agent completed and terminal state is written"
echo "   - or lease expiry requeues the job"
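
The lease-expiry path this test exercises can be sketched in memory as follows. This is an illustration under stated assumptions, not the dispatcher's real logic: the `Job` shape, the `done` status, and `max_attempts` are hypothetical, and the real system would run the equivalent update against Postgres.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: int
    status: str                              # e.g. "queued", "running", "done", "dead"
    lease_expires_at: Optional[float] = None  # epoch seconds, None when unleased
    attempts: int = 0


def requeue_expired(jobs: List[Job], now: float, max_attempts: int = 3) -> List[int]:
    """Requeue running jobs whose lease has expired; return the touched ids."""
    touched = []
    for job in jobs:
        if job.status != "running" or job.lease_expires_at is None:
            continue
        if job.lease_expires_at > now:
            continue  # lease still valid, the worker may still be alive
        job.lease_expires_at = None
        # Out of attempts: dead-letter the job instead of looping forever.
        job.status = "dead" if job.attempts >= max_attempts else "queued"
        touched.append(job.job_id)
    return touched


if __name__ == "__main__":
    jobs = [Job(1, "running", lease_expires_at=100.0, attempts=1)]
    print(requeue_expired(jobs, now=700.0), jobs[0].status)
```

With `LEASE_SECONDS=600`, a job claimed just before the dispatcher is killed should become claimable again roughly ten minutes later if nothing renews the lease.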

bootstrap/shared/smoke/cron_evidence_check.sh

#!/usr/bin/env bash
set -euo pipefail

: "${DATABASE_URL:?Set DATABASE_URL first}"

psql "$DATABASE_URL" -c "select * from standing_order_runs order by created_at desc limit 20;"
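
Eyeballing the latest rows shows what ran, but not what failed to run. A rough sketch of the comparison that separates "ran and found nothing" from "did not run at all" is below. The order names and the `(name, status)` row shape are assumptions; only the `completed` and `no_work` statuses come from the runbook.

```python
from typing import Dict, Iterable, List, Tuple


def classify_standing_orders(
    expected: Iterable[str],
    runs: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group expected standing orders by the evidence found in a time window.

    `runs` holds (order_name, status) pairs already filtered to the window,
    ordered oldest-first so the latest status for each order wins.
    """
    latest: Dict[str, str] = {}
    for name, status in runs:
        latest[name] = status

    report: Dict[str, List[str]] = {"completed": [], "no_work": [], "other": [], "missing": []}
    for name in expected:
        status = latest.get(name)
        if status is None:
            report["missing"].append(name)   # no row at all: the order never ran
        elif status in ("completed", "no_work"):
            report[status].append(name)
        else:
            report["other"].append(name)     # ran, but ended in some other state
    return report


if __name__ == "__main__":
    expected = ["devops.health", "auditor.sweep", "publisher.digest"]
    runs = [("devops.health", "completed"), ("auditor.sweep", "no_work")]
    print(classify_standing_orders(expected, runs))
```

The key design point is that the expected list comes from the cron definitions, not from the table being checked, so a missing row shows up as an explicit `missing` entry.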

bootstrap/shared/smoke/retrieval_smoke_test.sh

#!/usr/bin/env bash
set -euo pipefail

: "${DATABASE_URL:?Set DATABASE_URL first}"

psql "$DATABASE_URL" -c "select * from active_memories_v limit 20;"

bootstrap/shared/rollback/stop_phase.sh

#!/usr/bin/env bash
set -euo pipefail

PHASE="${1:?usage: stop_phase.sh <phase-name>}"
echo "Stopping phase: $PHASE"
echo "Disable the cron jobs for this phase and stop any phase-specific services."

bootstrap/shared/rollback/disable_crons_by_prefix.sh

#!/usr/bin/env bash
set -euo pipefail

PREFIX="${1:?usage: disable_crons_by_prefix.sh <prefix>}"
echo "List cron jobs, find names starting with: $PREFIX, then disable/remove them manually."
echo "This helper is intentionally conservative."

bootstrap/shared/rollback/rollback_dispatcher.sh

#!/usr/bin/env bash
set -euo pipefail

sudo systemctl stop agentbus-dispatcher.service
echo "Dispatcher stopped. Replace binary/script, then start it again when ready."

bootstrap/shared/rollback/rollback_schema_notes.md

# Rollback schema notes

Prefer feature freeze over destructive schema rollback.

Recommended order:
1. disable the phase’s cron jobs
2. stop writes into the new tables
3. keep old data in place
4. revert application code
5. verify health
6. only perform destructive rollback if absolutely necessary

bootstrap/shared/dashboards/metabase_notes.md

Use Metabase if you want the fastest operator dashboard.

Suggested dashboards:
- queue health
- standing order health
- opportunities
- memories
- dispatcher liveness

bootstrap/shared/dashboards/grafana_panels.md

Suggested Grafana panels:
- queued / running / blocked / dead counts
- oldest queued age
- oldest blocked age
- dispatcher last_seen
- standing order runs by status
- opportunities by status
- active memories count

bootstrap/shared/dashboards/system_health_queries.sql

select * from system_health_snapshot_v;
select * from dispatcher_heartbeats order by last_seen desc;
select * from standing_order_runs order by created_at desc limit 20;

bootstrap/phase1/notes.md

Phase 1 is the minimum viable deployment.

Only bring up:
- coordinator
- auditor
- devops
- publisher

Do not add:
- ideas
- opportunities
- memories
- idle cognition
- markdown export

bootstrap/phase2/notes.md

Phase 2 expands to the full reactive fleet.

Goal:
- all 10 agents online
- operational cron coverage complete

Still no:
- ideas
- opportunities
- memories
- idle cognition

bootstrap/phase3/notes.md

Phase 3 adds the proactive lane.

Goal:
- ideas
- opportunities
- digests

Still no:
- promoted memories
- idle cognition

bootstrap/phase4/notes.md

Phase 4 adds long-term memory.

Goal:
- memory candidates
- memories
- retrieval-aware job injection

Still no:
- idle cognition
- markdown export

bootstrap/phase5/notes.md

Phase 5 adds:
- bounded idle cognition
- Obsidian / markdown projection

This is the final layer, not the foundation.

bootstrap/phase5/obsidian-export.service

[Unit]
Description=Obsidian Export

[Service]
Type=oneshot
EnvironmentFile=/etc/agentbus/dispatcher.env
ExecStart=/opt/agentbus/.venv/bin/python /opt/agentbus/obsidian_export.py

bootstrap/phase5/obsidian-export.timer

[Unit]
Description=Run Obsidian export periodically

[Timer]
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target

bootstrap/phase5/vault_templates/idea.md

---
kind: idea
---

# {{title}}

## Body
{{body}}

## Evidence
{{evidence}}

bootstrap/phase5/vault_templates/opportunity.md

---
kind: opportunity
---

# {{title}}

## Why now
{{why_now}}

## Evidence
{{evidence}}

## Next step
{{suggested_next_step}}

bootstrap/phase5/vault_templates/memory.md

---
kind: memory
---

# {{title}}

## Body
{{body}}

## Scope
{{scope}}

## Subject
{{subject_key}}

bootstrap/phase5/vault_templates/daily_snapshot.md

# Daily Snapshot

- queued:
- blocked:
- dead:
- standing order misses:
- top opportunities:
- memory notes:

bootstrap/phase5/vault_templates/weekly_bets.md

# Weekly Bets

## Top items
- 
## Why they matter
- 

## Recommended next moves
- 

Important note about the large implementation files

These filenames are now fixed and Git-ready:

  • bootstrap/phase1/phase1_core.sql
  • bootstrap/phase1/agentbus.py
  • bootstrap/phase1/dispatcher_v2.py
  • bootstrap/phase1/phase1_bootstrap.sh
  • bootstrap/phase2/phase2_bootstrap.sh
  • bootstrap/phase3/ideas_opportunities.sql
  • bootstrap/phase3/agentbus.py
  • bootstrap/phase3/phase3_bootstrap.sh
  • bootstrap/phase4/memory.sql
  • bootstrap/phase4/agentbus.py
  • bootstrap/phase4/dispatcher_with_memory.py
  • bootstrap/phase4/phase4_bootstrap.sh
  • bootstrap/phase5/obsidian_export.py
  • bootstrap/phase5/phase5_bootstrap.sh

Those should contain the exact code/SQL/scripts from the earlier snippets in this thread. I kept the names aligned on purpose so you can drop them straight in without having to rename anything.

If you want, I can do one final pass and give you the contents for the top-level README.md, roadmap.md, and RUNBOOK.md as one concatenated repo seed block, so you can paste them into files in a single shot.

nimble phoenix
#

Two final questions before I close this out:
First: we haven't talked about software development workflows. I use Claude Code and a ralph-loop pattern for agentic coding tasks. How should that integrate with this architecture — when should agents hand work to Claude Code vs handle it themselves, and what does a development job look like in the jobs table? Is there a pattern for agents initiating ralph-loop style coding runs, tracking them, and ingesting the results back into the system?
Second: looking at everything we've covered today — the full stack, memory, idle cognition, proactivity, deployment sequence — is there anything important I should have asked but didn't? What are the gaps, blind spots, or failure modes I haven't thought about yet?

warm elmBOT
#

Yep, two big ones, and both matter.

1) How Claude Code / ralph-loop should fit

Short version:

  • Agents should not do serious coding inline by default
  • They should decide, prepare, dispatch, review, and ingest
  • Claude Code / ralph-loop should be the execution engine
  • Postgres should still be the durable record

So the pattern is:

agent decides work exists → durable dev job created → coding runner launched → results ingested back → follow-up review / merge / deploy job created

When agents should hand work to Claude Code

Use Claude Code / ralph-loop when the task is:

  • multi-file
  • repo-mutating
  • test-running
  • long-running
  • iterative
  • likely to need retries / review loops
  • risky enough that you want a clear artifact trail

Let the agent handle it itself when it’s:

  • triage
  • classification
  • planning
  • spec writing
  • backlog grooming
  • code review summarization
  • failure analysis
  • “should we do this?”
  • small read-only repo inspection

A good rule:

  • thinking jobs stay in-agent
  • mutation jobs go to coding runners
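
If it helps, here is a minimal sketch of that rule as a routing function. The `Task` fields, the kind names, and the three return labels are all hypothetical; they just mirror the two lists above and are not an OpenClaw API.

```python
from dataclasses import dataclass

# Hypothetical task kinds, lifted from the "let the agent handle it" list.
THINKING_KINDS = {
    "triage", "classification", "planning", "spec_writing",
    "backlog_grooming", "review_summary", "failure_analysis",
    "repo_inspection",
}


@dataclass(frozen=True)
class Task:
    kind: str
    mutates_repo: bool = False
    runs_tests: bool = False
    multi_file: bool = False


def route(task: Task) -> str:
    """Return where a task should run: coding runner, in-agent, or triage."""
    # Anything that changes the repo or runs tests is a mutation job.
    if task.mutates_repo or task.runs_tests or task.multi_file:
        return "coding-runner"
    # Pure thinking work stays inside the agent.
    if task.kind in THINKING_KINDS:
        return "in-agent"
    # Unknown kinds are not guessed at; a human or planner decides.
    return "needs-triage"
```

The mutation check runs first on purpose: a "planning" task that ends up running tests should still go to a runner.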

What a development job should look like

I would keep the main jobs row as the business-level unit of work, for example:

  • job_type = "dev.code_change"
  • target_agent = "devops" or "project-tracker" or "coordinator"
  • subject = "fix flaky dispatcher lease recovery test"
  • payload contains repo/task details

Example shape:

{
  "repo": "git@github.com:yourorg/openclaw-infra.git",
  "worktree_ref": "main",
  "task_kind": "bugfix",
  "summary": "Fix lease recovery race in dispatcher",
  "problem_statement": "Expired lease can be reclaimed while prior worker is still writing closeout state",
  "acceptance_criteria": [
    "race is reproduced by test",
    "fix added",
    "tests pass",
    "result summarized"
  ],
  "constraints": {
    "allow_merge": false,
    "allow_deploy": false,
    "max_changed_files": 12
  },
  "handoff": {
    "runner": "claude-code",
    "mode": "ralph-loop"
  }
}
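
A small validation sketch for that payload, assuming you want to reject malformed dev jobs before they reach a runner. The function name and the exact set of required keys are my assumptions based on the example shape above, not a fixed schema.

```python
# Keys the example payload above carries; treat this set as an assumption.
REQUIRED_KEYS = {
    "repo", "worktree_ref", "task_kind", "summary",
    "acceptance_criteria", "constraints", "handoff",
}


def validate_dev_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - payload.keys())]

    # A dev job without acceptance criteria cannot be judged afterwards.
    if "acceptance_criteria" in payload and not payload["acceptance_criteria"]:
        problems.append("acceptance_criteria must be non-empty")

    # Merge/deploy permissions must be stated explicitly, never implied.
    constraints = payload.get("constraints", {})
    for flag in ("allow_merge", "allow_deploy"):
        if not isinstance(constraints.get(flag), bool):
            problems.append(f"constraints.{flag} must be an explicit boolean")

    return problems
```

Run it at enqueue time and write the problems into the job's blocked reason, so a bad payload fails loudly instead of confusing the coding runner.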

Add a separate coding_runs table

I would not overload jobs to hold the whole coding loop.

Use:

  • jobs = durable intent / business work item
  • coding_runs = actual Claude Code / ralph-loop execution record

Something like:

create table coding_runs (
  id bigserial primary key,
  job_id bigint not null references jobs(id) on delete cascade,
  runner text not null,                    -- claude-code / codex / etc
  run_mode text not null,                  -- single-shot / ralph-loop
  status text not null,                    -- queued/running/review/completed/failed
  repo text not null,
  branch text,
  worktree_path text,
  external_session_id text,                -- Claude Code session / ACP session id
  attempt_no integer not null default 1,
  prompt_input jsonb not null default '{}'::jsonb,
  result_summary text,
  patch_ref text,
  test_summary jsonb,
  artifact_refs jsonb not null default '[]'::jsonb,
  started_at timestamptz,
  finished_at timestamptz,
  created_at timestamptz not null default now()
);

That gives you proper traceability without stuffing everything into jobs.payload.
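
A rough sketch of how an executor could record a run against that table. To keep it self-contained I am using Python's built-in sqlite3 with a cut-down copy of the schema; in your stack this would be Postgres through whatever driver you already use. `start_run` and `finish_run` are hypothetical helper names.

```python
import json
import sqlite3

# Simplified SQLite stand-in for the jobs / coding_runs tables (assumption).
SCHEMA = """
create table jobs (
  id integer primary key,
  subject text not null
);
create table coding_runs (
  id integer primary key,
  job_id integer not null references jobs(id) on delete cascade,
  runner text not null,
  run_mode text not null,
  status text not null,
  repo text not null,
  attempt_no integer not null default 1,
  result_summary text,
  test_summary text,
  started_at text,
  finished_at text
);
"""


def start_run(conn, job_id, runner, run_mode, repo):
    """Insert a 'running' row; attempt_no counts prior runs for the same job."""
    (attempt_no,) = conn.execute(
        "select coalesce(max(attempt_no), 0) + 1 from coding_runs where job_id = ?",
        (job_id,),
    ).fetchone()
    cur = conn.execute(
        "insert into coding_runs "
        "(job_id, runner, run_mode, status, repo, attempt_no, started_at) "
        "values (?, ?, ?, 'running', ?, ?, datetime('now'))",
        (job_id, runner, run_mode, repo, attempt_no),
    )
    conn.commit()
    return cur.lastrowid


def finish_run(conn, run_id, ok, summary, tests):
    """Close out a run with its outcome, summary, and test results."""
    conn.execute(
        "update coding_runs set status = ?, result_summary = ?, "
        "test_summary = ?, finished_at = datetime('now') where id = ?",
        ("completed" if ok else "failed", summary, json.dumps(tests), run_id),
    )
    conn.commit()
```

The point of the two-step write is that a row exists from the moment the runner launches, so a crashed run shows up as stuck in `running` rather than as nothing at all.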

Best integration pattern

I’d use three layers:

  1. Planner agent

    • decides the work is real
    • writes the durable job
    • sets acceptance criteria
    • picks the runner
  2. Coding executor

    • launches Claude Code / ralph-loop
    • updates coding_runs
    • stores artifacts, test output, diff summary
  3. Ingest/review agent

    • reads results
    • marks job completed, failed, blocked, or needs_review
    • creates follow-up jobs
    • emits memory candidates if there was a reusable lesson

Ralph-loop specific pattern

For ralph-loop style coding, I’d model it as a stateful sub-run under one durable job:

  • queued
  • running
  • waiting_for_tests
  • waiting_for_review
  • revision_requested
  • completed
  • failed

That way the loop can do multiple code/test/revise passes without pretending each pass is a brand-new business job.
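
One way to keep that sub-run honest is an explicit transition table. The state names are the ones listed above; which transitions are legal is my assumption, so adjust the table to your actual loop.

```python
# Allowed next states per current state (the edges are an assumption).
ALLOWED = {
    "queued": {"running", "failed"},
    "running": {"waiting_for_tests", "failed"},
    "waiting_for_tests": {"waiting_for_review", "revision_requested", "failed"},
    "waiting_for_review": {"completed", "revision_requested", "failed"},
    "revision_requested": {"running", "failed"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}


class IllegalTransition(ValueError):
    """Raised when a run tries to jump to a state it cannot reach."""


def advance(current: str, new: str) -> str:
    """Validate a state change and return the new state."""
    if new not in ALLOWED.get(current, set()):
        raise IllegalTransition(f"{current} -> {new} is not allowed")
    return new
```

The `revision_requested -> running` edge is what lets one run make several code/test/revise passes while staying under the same durable job.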

How results come back into the system

When the coding run finishes, ingest back:

  • diff / patch ref
  • changed files
  • test results
  • lint results
  • summary
  • follow-up recommendations
  • whether human review is required

Then one of these happens:

  • mark parent dev job completed
  • mark it blocked with reason
  • enqueue dev.review
  • enqueue deploy.prepare
  • create memory_candidate if the lesson is durable
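
As a sketch, the ingest step can be a small pure function that turns a run result into follow-up actions. The result keys (`tests_passed`, `human_review_required`, `durable_lesson`) and the action strings are hypothetical; only `dev.review` and `memory_candidate` come from the lists above.

```python
def plan_followups(result: dict) -> list[str]:
    """Map an ingested coding-run result to follow-up actions."""
    # Failing or missing test results never advance the parent job.
    if not result.get("tests_passed", False):
        return ["mark_blocked"]

    actions = []
    # Default to human review unless the run explicitly says otherwise.
    if result.get("human_review_required", True):
        actions.append("enqueue:dev.review")
    else:
        actions.append("mark_completed")

    # Lessons become candidates, not instant memories.
    if result.get("durable_lesson"):
        actions.append("create:memory_candidate")
    return actions
```

Both defaults fail closed: an empty result blocks the job, and a result that says nothing about review still goes to review.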

Important boundary

I would keep these as separate approval boundaries:

  • code generation
  • merge
  • deploy
  • prod config change

Those should not all live in one autonomous step.

OpenClaw-native fit

If you want this aligned with OpenClaw’s direction, the cleanest fit is:

  • OpenClaw agents do orchestration, planning, and ingestion
  • Claude Code runs as an external coding harness / ACP-style executor
  • Postgres remains the durable system of record
  • do not let a Claude Code session be the only record of what happened

That’s the important bit.


2) What you didn’t ask, but probably should

Honestly, you covered the architecture really well. The main blind spots now are operational, not conceptual.

The biggest gaps I think remain

A. Merge / deploy policy

You covered coding, memory, proactivity, audit, but not the final boundary:

  • who can merge?
  • who can deploy?
  • under what approval rule?
  • what is auto-merge vs human-reviewed only?

If you don’t define this, the coding layer will get weird fast.

B. Worktree / branch isolation

If multiple coding runs touch the same repo:

  • how are worktrees allocated?
  • how do you prevent branch collisions?
  • how do you avoid one run poisoning another?

This needs a real policy, not vibes.
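
One simple policy, sketched here, is to derive the branch and worktree path from the job and run ids so two runs can never share either. The naming scheme and the `/srv/worktrees` root are assumptions.

```python
import re
from pathlib import PurePosixPath


def allocate(repo_name: str, job_id: int, run_id: int,
             root: str = "/srv/worktrees") -> tuple[str, str]:
    """Return a (branch, worktree_path) pair unique to this coding run."""
    # Normalise the repo name so it is safe as a directory component.
    slug = re.sub(r"[^a-z0-9]+", "-", repo_name.lower()).strip("-")
    branch = f"agent/job-{job_id}/run-{run_id}"
    path = PurePosixPath(root) / slug / f"job-{job_id}-run-{run_id}"
    return branch, str(path)
```

The executor would then create the checkout with something like `git worktree add -b <branch> <path> <ref>` and store both values in `coding_runs.branch` and `coding_runs.worktree_path`.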

C. Evaluation gates

You talked about traceability, which is great, but not enough about:

  • what counts as success?
  • what tests are mandatory?
  • what blocks promotion from opportunity → job?
  • what blocks memory promotion?
  • what blocks merge?

You need explicit eval gates or the system will reward output volume.

D. Backpressure and WIP limits

You need hard limits on:

  • concurrent coding runs
  • concurrent opportunities
  • memory promotion rate
  • idle cognition budget
  • publisher digest size

Otherwise the stack will look “alive” while quietly drowning.
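
A minimal admission check, assuming you count in-flight work per lane before enqueueing more. The lane names and the numbers are placeholders, not recommendations.

```python
# Hypothetical hard caps per lane; tune these to your own budget.
LIMITS = {
    "coding_runs": 2,
    "opportunities": 20,
    "memory_promotions_per_day": 10,
}


def admit(lane: str, in_flight: dict, limits: dict = LIMITS) -> bool:
    """Return True only if the lane is known and still under its cap."""
    # Unknown lanes get a cap of 0, so they are rejected rather than unlimited.
    return in_flight.get(lane, 0) < limits.get(lane, 0)
```

When `admit` returns False the work should stay queued with a visible reason, so backpressure shows up in the audit trail instead of as silence.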

E. Prompt / policy versioning

For real auditability, you’ll eventually want to know:

  • which prompt version ran?
  • which agent instructions were active?
  • which memory set was injected?
  • which model/harness version produced the result?

Without that, some failures will be impossible to reproduce.
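
A cheap way to start is to store one fingerprint per run that covers all four of those inputs. This sketch hashes a canonical JSON record; the field names are assumptions, and you would still keep the full inputs somewhere if you want to replay a run.

```python
import hashlib
import json


def fingerprint(prompt: str, instructions: str, memory_ids: list[int],
                model: str, harness_version: str) -> str:
    """Return a stable SHA-256 hex digest of everything that shaped a run."""
    record = {
        "prompt": prompt,
        "instructions": instructions,
        # Sorted so retrieval order does not change the fingerprint.
        "memory_ids": sorted(memory_ids),
        "model": model,
        "harness_version": harness_version,
    }
    # sort_keys plus fixed separators gives a canonical byte string.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same fingerprint but different outcomes tell you the difference came from the repo or the environment, not from the prompt or memory set.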

F. Secret exposure and repo trust zones

For coding agents especially:

  • which repos are safe to expose?
  • which directories are off-limits?
  • what secrets may appear in code or env?
  • can a coding runner see prod configs?

You probably want trust zones, not one flat workspace policy.

G. Artifact retention

You’ll want a clear rule for how long to keep:

  • diffs
  • test logs
  • build logs
  • coding transcripts
  • prompt inputs
  • retrieved memories

This matters for both debugging and cost.

H. Human override model

When the system is wrong, what is the operator’s exact move?

You want explicit controls for:

  • pause lane
  • freeze promotions
  • disable an agent
  • drain a queue
  • cancel coding runs
  • mark memory stale
  • force retry / force dead-letter

That’s the “oh shit” layer. It matters a lot.

My top 5 failure modes to watch

  1. The proactive lane becomes spammy

    • feels smart
    • destroys trust
  2. Memory becomes sludge

    • too much stored
    • too little curated
    • retrieval quality collapses
  3. Coding runs become unbounded

    • too many loops
    • too much cost
    • too much branch/worktree mess
  4. Auditability becomes performative

    • lots of logs
    • not enough real state transitions
    • hard to answer simple questions
  5. Publisher becomes a trash compactor

    • instead of curating
    • it just summarizes chaos

If I were you, the next 3 design questions I’d answer are

  1. What is the merge/deploy approval model?
  2. What is the coding-run isolation model? (repo, branch, worktree, concurrency)
  3. What exact eval gates promote or reject
    • opportunities
    • memories
    • code changes

If you nail those three, you’ll avoid a lot of pain.


My opinionated recommendation

Treat Claude Code / ralph-loop as a specialized mutation worker, not as another freeform chatting agent.

That means:

  • durable job comes first
  • coding run is a tracked execution under that job
  • results get ingested back into Postgres
  • review/merge/deploy stay separate
  • reusable lessons become memory candidates, not instant memories

That keeps the whole stack clean.

And yeah, I’m glad you asked this before building it. This is exactly where a lot of multi-agent systems quietly go off the rails.

Relevant docs if you want to anchor this back to OpenClaw primitives:
