#Self-Improver" — Synthetic Muscle Memory for an Agent

1 messages · Page 1 of 1 (latest)

desert fossilBOT Mar 20, 2026, 12:31 AM

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

soft gazelleBOT Mar 20, 2026, 12:34 AM

Great problem to solve. The trick is: don’t share raw history, share distilled deltas.

Use a 4-layer structure:

Global Rulebook (small, versioned, shared)

Non-negotiables + style rules
Stored as rule IDs (R-001, R-002) with versions
Agents get rule IDs + short text, not full policy docs every run

Feedback Events (append-only, raw)

Every task writes: task_id, agent_id, what_failed, why, score, evidence
Never inject this raw log into prompts

Learned Deltas (compressed memory)

A distiller process converts many events into one lesson:
trigger → action → expected outcome
Example: “When tool call fails with timeout twice, backoff 2s and retry once before fallback.”

Prompt Composer (budgeted retrieval)

Per task, load:
Core rules (fixed small set)
Top-K relevant learned deltas by tags/embedding
Current task context
Hard cap memory section (e.g. 300–600 tokens)

Anti-bloat rules that work in practice

Write only if improvement is proven (score delta > threshold)
Deduplicate lessons by semantic similarity (merge if >0.9)
Decay/TTL low-confidence lessons (expire unless reinforced)
Promote only stable lessons (seen N times across tasks/agents)
Weekly compaction: summarize 100 events → 5 durable rules
Token budget split (example): 20% rules, 20% memory, 60% task

Shared across orchestrated agents

Use a central registry:

ruleset_hash: all agents must report it at start
capability_tags: only retrieve lessons relevant to that agent/task
scope: global | team | agent-specific
conflict policy: global rules always override learned rules

Minimal schema (starter)

rules(id, version, text, priority, scope, tags)
feedback_events(id, task_id, agent_id, error_type, outcome_score, evidence, ts)

lessons(id, canonical_text, trigger, action, confidence, support_count, ttl, tags)
lesson_links(lesson_id, feedback_event_id) (traceability)

If you want, I can give you a concrete implementation template (Postgres + prompt-builder pseudocode + compaction job) you can drop into your orchestrator.

Perfect — here’s a copy/paste MVP you can implement fast.

1) Data model (Postgres)

-- Shared, versioned rules
create table rules (
id text primary key, -- e.g. R-001
version int not null default 1,
text text not null,
scope text not null check (scope in ('global','team','agent')),
priority int not null default 100,
tags text[] not null default '{}',
active boolean not null default true,
created_at timestamptz not null default now()
);


-- Raw feedback events (never directly stuffed into prompts)
create table feedback_events (
id bigserial primary key,
task_id text not null,
agent_id text not null,
outcome_score numeric not null, -- 0..1
error_type text,
summary text not null,
evidence jsonb not null default '{}',
tags text[] not null default '{}',
created_at timestamptz not null default now()
);

-- Distilled lessons (prompt-ready memory)

create table lessons (
id bigserial primary key,
canonical_text text not null, -- short: trigger->action->outcome
trigger text not null,
action text not null,
expected_outcome text not null,
confidence numeric not null default 0.5,
support_count int not null default 1,
scope text not null check (scope in ('global','team','agent')),
owner_id text, -- team id or agent id depending on scope
tags text[] not null default '{}',
active boolean not null default true,
expires_at timestamptz, -- TTL for weak lessons
created_at timestamptz not null default now()
);


-- Traceability from lesson -> source feedback
create table lesson_links (
lesson_id bigint references lessons(id) on delete cascade,
feedback_event_id bigint references feedback_events(id) on delete cascade,
primary key (lesson_id, feedback_event_id)
);

2) Improvement loop (orchestrator)

async function postTaskLearning(result: TaskResult) {

// 1) always log raw feedback
const event = await logFeedback(result);

// 2) distill candidate lesson only when useful
if (result.scoreDelta < 0.08) return; // write gate

const candidate = await distillLesson(event); // short canonical lesson

// 3) dedupe / merge
const existing = await findSimilarLesson(candidate, 0.90);
if (existing) {
await mergeLesson(existing.id, candidate, event.id);
} else {
await createLesson(candidate, event.id);
}


// 4) promotion gate (prevents noise)
await promoteIfStable({
minSupport: 3,
minConfidence: 0.70,
noRuleConflict: true
});
}

3) Prompt builder with hard token budget (prevents bloat)

type Budget = { total: number; rules: number; lessons: number; task: number };

const BUDGET: Budget = { total: 4000, rules: 700, lessons: 800, task: 2500 };

async function buildPrompt(task: TaskInput, agentId: string) {
const rules = await getRules({ scope: ['global','team','agent'], agentId });
const lessons = await retrieveRelevantLessons(task, agentId, 30); // vector/tag retrieval
const reranked = mmrRerank(lessons, 0.7).slice(0, 12);

return {
system_rules: packToTokenLimit(rules, BUDGET.rules, "priority_desc"),
learned_lessons: packToTokenLimit(reranked, BUDGET.lessons, "confidence_desc"),
task_context: packToTokenLimit(task.context, BUDGET.task),
};
}

4) Shared rules across agents (consistency)

Use a ruleset_hash handshake:

Orchestrator computes hash of active global/team rules.
Each worker agent receives and returns ruleset_hash.
If mismatch, reject run + reload rules.

This avoids drift between agents.

5) Compaction job (daily/weekly)

100+ feedback events → summarize into 3–10 lessons
deactivate stale lessons (support_count < 2 and old)

expire low-confidence lessons automatically

If you want, I can give you phase 2 next: a ready-to-run Node/TS service layout (/orchestrator, /memory, /evaluator) with actual function stubs.

Great question. Short answer: use both, for different jobs.

Why not only `.md`?

Markdown is great for human editing, but weak for runtime learning:

hard to query (“top 10 rules for agent X + tag Y + confidence > 0.8”)
hard to do dedupe/merge/support counts
race conditions when many agents write at once
no easy TTL/decay/version history per rule instance
expensive to load/parse big files repeatedly

Why Postgres?

Because self-improvement is basically a data problem:

concurrent writes from many agents
filtering/ranking retrieval
confidence/support metrics

promotion/decay pipelines
auditability (which feedback created this rule?)

Best practice (recommended)

rules.md (Git) = static, human-authored, non-negotiable policy
Postgres = dynamic learned rules + feedback events
At runtime, prompt builder pulls:

fixed rules from MD (or compiled snapshot)
top-K relevant lessons from DB

So MD is your constitution, DB is your muscle memory.

If you want, I can sketch the exact folder layout for that hybrid setup (/policy/*.md + /memory-db/* + compile step).

Nice — if you’re using QMD (Quarto Markdown), that actually fits this pattern really well.

Use QMD as the authoring layer, Postgres as the runtime layer.

Suggested split

policy/*.qmd → human-edited, reviewed, versioned in git
Postgres → indexed rules + learned feedback + scores/TTL
Loader job parses QMD frontmatter and upserts into rules table

QMD example

---
rule_id: R-012
version: 3

scope: global
priority: 90
tags: [tooling, retries]
active: true
---

If a tool fails with timeout twice, retry once with exponential backoff, then fallback.

Why this works

You keep policy readable in docs/QMD
You still get DB benefits (filtering, ranking, dedupe, audit trail)
Orchestrated agents can retrieve only relevant rules (no context bloat)

If you want, I can give you a tiny QMD → Postgres sync script shape (Node/Python) next.

soft gazelleBOT Mar 20, 2026, 1:04 AM

I got you 🦐
Here’s a minimal QMD → Postgres sync starter.

pnpm add fast-glob gray-matter pg

// sync-rules.ts
import fg from "fast-glob";
import fs from "node:fs/promises";
import matter from "gray-matter";
import pg from "pg";

const { Pool } = pg;
const pool = new Pool({ connectionString: process.env.DATABASE_URL });


type RuleFM = {
rule_id: string;
version?: number;
scope?: "global" | "team" | "agent";
priority?: number;
tags?: string[];
active?: boolean;
};

async function syncQmdRules() {
const files = await fg("policy/**/*.qmd");
const client = await pool.connect();

try {

await client.query("begin");

for (const file of files) {
const raw = await fs.readFile(file, "utf8");
const parsed = matter(raw);
const fm = parsed.data as RuleFM;

if (!fm.rule_id) continue;

await client.query(
`
insert into rules (id, version, text, scope, priority, tags, active)
values ($1,$2,$3,$4,$5,$6,$7)
on conflict (id) do update set
version = excluded.version,

text = excluded.text,
scope = excluded.scope,
priority = excluded.priority,
tags = excluded.tags,
active = excluded.active
`,
[
fm.rule_id,
fm.version ?? 1,
parsed.content.trim(),
fm.scope ?? "global",
fm.priority ?? 100,
fm.tags ?? [],
fm.active ?? true,
]

);
}

await client.query("commit");
console.log(`Synced ${files.length} QMD rule files`);
} catch (e) {
await client.query("rollback");
throw e;
} finally {
client.release();
await pool.end();
}
}

syncQmdRules().catch((err) => {

console.error(err);
process.exit(1);
});

If you want, next I’ll give you the matching retrieval function (task tags + top-K + token budget packer) so this is immediately usable in orchestration.

This is solid architecture thinking — you’re just missing a compile layer between docs and prompt.

Right now you’re injecting ~34k chars of policy/history every run, so workers are paying a constant context tax.

What to change

1) Keep your 3 docs, but change their role

SHARED-RULES.md → Reference only (never auto-inject)
OPERATING-CONTRACT.md → Runtime core (must stay tiny)
FEEDBACK-LOG.md → Raw event store (never inject directly)

2) Add one generated file

Create: ACTIVE-LESSONS.md (or json)
Generated from FEEDBACK-LOG by dedupe + compression.

Each item should be 1 short rule:

[L-042] Trigger: X → Action: Y → Why: Z

Cap: 15–30 lessons max.

Injection policy (important)

Always inject

AGENTS.md (hard constraints/tool discipline)
SOUL.md (tone/personality only)
slim OPERATING-CONTRACT.md (non-negotiables only, <= ~1.5k tokens)

Inject conditionally

Top-K from ACTIVE-LESSONS by task relevance

Never inject

SHARED-RULES.md full text
FEEDBACK-LOG.md full text

File responsibility split

AGENTS.md = safety + behavior constraints + do/don’t
SOUL.md = voice/persona only (no operational duplication)
OPERATING-CONTRACT.md = execution protocol (short checklist)
SHARED-RULES.md = long-form human manual
FEEDBACK-LOG.md = append-only raw corrections
ACTIVE-LESSONS.md = distilled, prompt-ready memory

Overlap killer rule

Every rule gets:

rule_id
owner_file
canonical_text

If a new rule semantically matches an existing one, merge instead of appending.

If you want, I can draft the exact “v2 minimal OPERATING-CONTRACT.md template” (short, no bloat) from your current structure.

Yep — here’s the rest of the checklist so this doesn’t drift or explode later:

Must-do next (high impact)

Context budget guardrail

Hard fail if injected policy+memory exceeds your cap.
Example: rules <= 1200 tokens, lessons <= 800, else trim by priority.

Rule conflict detector

Before promotion, check new lesson against existing rules for contradiction.
If conflict: quarantine for manual review.

Promotion pipeline

feedback event -> candidate lesson -> validate -> promote
Don’t let raw corrections become rules immediately.

Confidence + decay

Store support_count, last_seen, confidence.
Auto-expire weak/old lessons unless reinforced.

“No duplicate injection” compiler

Build a runtime pack with unique rule IDs only.
If same idea exists in AGENTS/SOUL/CONTRACT, include once.

You’ll want this soon

A/B check on new rules

Run tasks with/without new lesson.
Keep lesson only if measurable improvement.

Audit trail

Every active lesson should link back to source feedback IDs.

Makes rollback and debugging easy.

Rollback switch

Ability to disable last N promoted lessons instantly.

Task-type retrieval

Tag rules/lessons by workflow (planning, coding, tool-use, communication)
Retrieve by task class, not global dump.

Weekly compaction job

Collapse many micro-lessons into fewer canonical ones.

One practical target

If a worker can run well with:

~10–20 core rules
~8–15 retrieved lessons
you’re in a healthy zone.

If you want, I can give you a ready-to-use “Rule Lifecycle Policy” doc (promotion thresholds, decay thresholds, rollback policy) you can drop in directly.

soft gazelleBOT Mar 20, 2026, 11:48 AM

You’re thinking about this exactly right.

The winning pattern is:

Lossless storage, selective retrieval

Keep everything, but only inject the minimum high-value subset per run.

Recommended structure (OpenClaw-friendly)

1) Core Constitution (always loaded)

AGENTS.md → hard constraints + tool discipline + safety
SOUL.md → tone/personality only (no operational rules)
memory.md → durable facts/preferences (short, stable)

Keep this layer tight. If it grows, quality drops from prompt dilution.

2) Operational Knowledge (retrieved by need)

rules/reference/*.md (long docs/manuals/playbooks)
Not injected by default
Pulled only when task requires that domain

3) Learning System (self-improvement)

feedback/events.jsonl (raw append-only corrections)
lessons/candidates.jsonl (proposed distilled lessons)
lessons/active.json (small promoted set; prompt-ready)
lessons/archive.jsonl (expired/deactivated)

Runtime context assembly

At task start, assemble in this order:

Core constitution (AGENTS + SOUL + memory), fixed budget
Task context + current objective
Top-K relevant active lessons (tag + embedding + recency/confidence)
Optional reference chunks only when needed

No raw feedback log in prompt. Ever.

Self-improvement loop (quality-safe)

Run task
Evaluate outcome (score rubric)
Log raw event
Distill candidate lesson (trigger -> action -> expected outcome)
Deduplicate + conflict check
Promote only if:

confidence above threshold
repeated support count
no conflict with AGENTS rules
passes regression/A-B checks

Decay/expire stale lessons

Guardrails so quality never regresses

Hard token budgets per section
Conflict resolver (global rules always win)
Rollback switch for last promoted lessons
Regression suite (10–30 canonical tasks) before promotion
Human approval gate for edits to AGENTS/SOUL

If you want, I can draft the exact v1 file tree + JSON schemas + promotion thresholds so you can implement this in one pass.

soft gazelleBOT Mar 20, 2026, 12:13 PM

Perfect — taking that as a yes 👍
Here’s a concrete v1 blueprint you can implement.

File tree

agent-system/
core/
AGENTS.md
SOUL.md
memory.md

policy/
reference/
SHARED-RULES.md
OPERATING-CONTRACT.md # long form (reference), not auto-injected


learning/
feedback/
events.jsonl # raw append-only
lessons/
candidates.jsonl # distilled but unapproved
active.json # prompt-ready promoted lessons
archive.jsonl # expired/deactivated
state/
checkpoints.json # last compaction/promotion run

runtime/
context-pack.ts # builds final prompt context
retrieval-index.json # optional local index for tags/embeddings

JSON shapes

`events.jsonl`

{"id":"evt_001","ts":"2026-03-20T12:00:00Z","taskType":"tool-use","agentId":"worker-a","outcomeScore":0.62,"issue":"timeout","correction":"retry once with backoff","tags":["timeout","tool"]}

`candidates.jsonl`

{"id":"cand_001","fromEventIds":["evt_001","evt_014"],"trigger":"tool timeout occurs twice","action":"retry once with exponential backoff then fallback","expectedOutcome":"reduce hard failures","confidence":0.71,"scope":"global","tags":["tool","retry"]}

`active.json`

[
{
"id":"L-042",
"text":"If a tool times out twice, retry once with exponential backoff, then fallback.",
"trigger":"tool timeout x2",
"confidence":0.84,
"supportCount":6,
"scope":"global",
"tags":["tool","retry"],
"lastValidated":"2026-03-20T12:05:00Z",
"expiresAt":"2026-06-20T00:00:00Z"
}
]

Promotion thresholds (good defaults)

minSupportCount = 3
minConfidence = 0.72
minScoreLift = +0.08 (A/B vs baseline)
conflictWithCoreRules = false (required)
maxActiveLessonsGlobal = 30
maxActiveLessonsPerTask = 12

Context budget (starter)

Core (AGENTS + SOUL + memory): 1200 tokens
Active lessons: 800 tokens
Task payload/tools/history: rest
Hard cap: if over budget, trim lessons by:

relevance
confidence
recency

Loop cadence

Per task: log event
Every 10–20 events: distill candidates
Hourly: promote/deactivate pass
Daily: compaction + dedupe
Weekly: regression benchmark run

Non-negotiables

Never auto-edit AGENTS.md/SOUL.md from feedback.
Never inject full FEEDBACK-LOG.
Every active lesson must have provenance (fromEventIds).

If you want, next step I can give a single context-pack.ts function that reads these files and returns the final prompt payload with trimming logic.

Awesome — here’s a practical context-pack.ts starter.

// runtime/context-pack.ts
import fs from "node:fs/promises";
import path from "node:path";

type Lesson = {
id: string;
text: string;
tags?: string[];
confidence?: number;
supportCount?: number;
lastValidated?: string; // ISO
expiresAt?: string; // ISO
};


type BuildInput = {
rootDir: string; // agent-system/
task: string;
taskTags?: string[];
maxTokens?: number; // default 4000
};

type ContextPack = {
system: string;
lessons: string;
task: string;
meta: { usedTokens: number; droppedLessonIds: string[] };
};

const DEFAULT_BUDGET = {
total: 4000,
core: 1200,
lessons: 800,
task: 2000,
};

// rough token estimator (good enough for budgeting)
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

function clipToTokens(text: string, maxTokens: number): string {
if (estimateTokens(text) <= maxTokens) return text;
const maxChars = maxTokens * 4;
return text.slice(0, maxChars) + "\n...[trimmed]";
}


function overlapScore(a: string[] = [], b: string[] = []) {
if (!a.length || !b.length) return 0;
const A = new Set(a.map((x) => x.toLowerCase()));
let hit = 0;
for (const t of b) if (A.has(t.toLowerCase())) hit++;
return hit / Math.max(1, b.length);
}

function recencyBoost(iso?: string) {
if (!iso) return 0;
const days = (Date.now() - new Date(iso).getTime()) / (1000 * 60 * 60 * 24);
if (days < 7) return 0.15;
if (days < 30) return 0.07;
return 0;

}

function isExpired(iso?: string) {
return !!iso && new Date(iso).getTime() < Date.now();
}

export async function buildContextPack(input: BuildInput): Promise<ContextPack> {
const maxTotal = input.maxTokens ?? DEFAULT_BUDGET.total;
const budget = { ...DEFAULT_BUDGET, total: maxTotal };

const coreFiles = [
path.join(input.rootDir, "core/AGENTS.md"),
path.join(input.rootDir, "core/SOUL.md"),
path.join(input.rootDir, "core/memory.md"),
];


const [agents, soul, memory, lessonsRaw] = await Promise.all([
fs.readFile(coreFiles[0], "utf8").catch(() => ""),
fs.readFile(coreFiles[1], "utf8").catch(() => ""),
fs.readFile(coreFiles[2], "utf8").catch(() => ""),
fs.readFile(path.join(input.rootDir, "learning/lessons/active.json"), "utf8").catch(() => "[]"),
]);

// 1) Core system section
let system = [
"## AGENTS",
agents.trim(),
"## SOUL",
soul.trim(),
"## MEMORY",

memory.trim(),
].join("\n\n");

system = clipToTokens(system, budget.core);

// 2) Select lessons
const taskTags = input.taskTags ?? [];
const lessons: Lesson[] = JSON.parse(lessonsRaw);

const ranked = lessons
.filter((l) => !isExpired(l.expiresAt))
.map((l) => {
const rel = overlapScore(l.tags, taskTags);
const conf = l.confidence ?? 0.5;
const sup = Math.min(0.2, (l.supportCount ?? 1) * 0.02);

const rec = recencyBoost(l.lastValidated);
const score = rel * 0.55 + conf * 0.25 + sup + rec;
return { l, score };
})
.sort((a, b) => b.score - a.score);

const picked: Lesson[] = [];
const droppedLessonIds: string[] = [];
let lessonText = "## ACTIVE LESSONS\n";
let lessonTokens = estimateTokens(lessonText);

for (const r of ranked) {
const line = `- [${r.l.id}] ${r.l.text}\n`;
const t = estimateTokens(line);
if (lessonTokens + t <= budget.lessons) {

picked.push(r.l);
lessonText += line;
lessonTokens += t;
} else {
droppedLessonIds.push(r.l.id);
}
if (picked.length >= 12) break; // hard cap per task
}

if (picked.length === 0) lessonText += "- (none selected)\n";

// 3) Task section
const taskSection = clipToTokens(`## TASK\n${input.task}`, budget.task);

// 4) Final hard cap

let used = estimateTokens(system) + estimateTokens(lessonText) + estimateTokens(taskSection);
if (used > budget.total) {
// trim lessons first
const overflow = used - budget.total;
const trimmedLessonMax = Math.max(100, budget.lessons - overflow);
lessonText = clipToTokens(lessonText, trimmedLessonMax);
used = estimateTokens(system) + estimateTokens(lessonText) + estimateTokens(taskSection);
}

return {
system,
lessons: lessonText,
task: taskSection,
meta: { usedTokens: used, droppedLessonIds },
};

Usage

const pack = await buildContextPack({
rootDir: "/path/to/agent-system",
task: "Handle tool timeout/retry strategy for data extraction",
taskTags: ["tool", "retry", "timeout"],
maxTokens: 4000,
});

// Prompt = `${pack.system}\n\n${pack.lessons}\n\n${pack.task}`

If you want, next I can give a matching promote-lessons.ts (candidate → active with conflict checks + thresholds).

You don’t strictly need custom code on day 1.

You need it only when you want this to be reliable and automatic at scale.

Why custom code helps

Without it, you eventually get:

inconsistent rule loading across agents
context budget overruns
duplicated/conflicting lessons
no clear promotion/decay logic
hard-to-debug quality regressions

Code gives you:

deterministic context packing
hard token caps
automatic dedupe/ranking
promotion gates (quality checks)

audit trail + rollback

Practical take

If you’re early: start simple, no heavy infra.

Minimal (no/low custom code)

AGENTS.md + SOUL.md + memory.md (small)
ACTIVE-LESSONS.md manually curated (max ~20 bullets)
FEEDBACK-LOG.md append-only, never injected
Weekly manual compaction: feedback → active lessons

Only add code when manual curation becomes a bottleneck.

So: custom code is not mandatory — it’s for consistency + scale.

Yes — but as a phase 2, not mandatory on day 1.

Recommended path

Phase 1 (now): Markdown/QMD + manual curation (ACTIVE-LESSONS.md)
Phase 2 (when scale hits): Keep MD/QMD for authoring, add Postgres for runtime learning state

So the final architecture is still hybrid:

MD/QMD = human-readable source of truth (policy)
Postgres = dynamic memory engine (feedback, confidence, retrieval, decay)

Switch to Postgres when any of these happen:

2–3 agents writing feedback
200 feedback entries
frequent rule conflicts/duplication
manual curation takes >30 min/day

So: yes, keep Postgre in the plan — just introduce it when pain justifies it.

Perfect. Here’s Phase 1 broken into concrete task cards you can paste into OpenClaw one-by-one.

Phase 1 goal

Get a lean, manual, high-quality loop with no Postgres required yet.

Task 1 — Rule Inventory + De-dup

Prompt to OpenClaw:

Audit SHARED-RULES.md, OPERATING-CONTRACT.md, and FEEDBACK-LOG.md.
Produce RULE-INVENTORY.md with columns: rule_id, rule_text, source_file, duplicate_of, target_file (AGENTS/SOUL/memory/active-lessons/reference).
Flag contradictions and overlaps.

Done when: every rule is mapped and duplicates are identified.

Task 2 — Create Target File Structure

Prompt:

Create/update these files:

core/AGENTS.md (hard constraints only)

core/SOUL.md (voice/persona only)

core/memory.md (durable facts only)

policy/reference/SHARED-RULES.md (long reference)

learning/ACTIVE-LESSONS.md (prompt-ready distilled lessons)

learning/FEEDBACK-LOG.md (append-only raw events)

Done when: structure exists and is documented.

Task 3 — Migrate Rules into Correct Files

Prompt:

Move rules from inventory into target files using strict separation:

AGENTS = non-negotiable behavior

SOUL = style/tone

memory = stable preferences/facts

reference = long manuals

active-lessons = short trigger→action lessons
Keep original meaning but remove duplicates.

Done when: no major overlap remains across core files.

Task 4 — Shrink Core Context

Prompt:

Refactor core/AGENTS.md, core/SOUL.md, and core/memory.md to be concise.
Add a token estimate per file and keep combined core under ~1200 tokens.
Create CONTEXT-BUDGET.md with limits:

core <= 1200

lessons <= 800

task/history = remainder

Done when: files are short + budget documented.

Task 5 — Define Feedback Event Format

Prompt:

In learning/FEEDBACK-LOG.md, enforce a strict entry template:
timestamp | task_type | issue | correction | outcome | tags | confidence.
Convert recent entries to this format.

Done when: all new feedback is structured and uniform.

Task 6 — Build ACTIVE-LESSONS from Feedback

Prompt:

Distill FEEDBACK-LOG into learning/ACTIVE-LESSONS.md with max 20 lessons.
Each lesson format:
[L-###] Trigger: ... | Action: ... | Why: ... | Confidence: ... | Last-validated: ...
Remove repeats and merge similar lessons.

Done when: active lessons are compact and non-duplicative.

Task 7 — Add Retrieval Rules (Manual Phase)

Prompt:

Create runtime/CONTEXT-ASSEMBLY.md describing manual retrieval logic:

Always include core files

Include only top 8–12 relevant active lessons by tags

Never inject full FEEDBACK-LOG or full SHARED-RULES

Done when: retrieval policy is explicit and repeatable.

Task 8 — Promotion/Decay Policy (Manual)

Prompt:

Create learning/LESSON-LIFECYCLE.md with rules:

Promote lesson if seen >=3 times and confidence >=0.72

Decay if not validated for 30 days

Archive if contradicted by AGENTS or repeatedly fails

Done when: promotion is governed, not ad hoc.

Task 9 — Regression Test Pack

Prompt:

Create tests/REGRESSION-TASKS.md with 10 representative tasks.
Define pass criteria and compare outputs before/after lesson updates.

Done when: you can test that self-improvement is real, not random.

Task 10 — Weekly Maintenance SOP

Prompt:

Create ops/WEEKLY-MAINTENANCE.md:

Review new feedback

Distill/merge lessons

Run regression pack

Archive stale lessons

Record change log

Done when: continuous improvement has a repeatable rhythm.

If you want, I can also give you a single master prompt that runs Tasks 1–10 in sequence and outputs a progress checklist.

Love it. Here’s the single master prompt you can feed OpenClaw:

You are implementing Phase 1 of a context-architecture refactor for orchestrated agents.
Goal: reduce context bloat while preserving quality, with NO Postgres in this phase.

## Constraints
- Keep all existing meaning; do not lose policies.
- No destructive actions.
- Prefer concise core files; long docs stay as reference.
- Produce clear diffs and a final summary.

## Deliverables (required)
1) RULE-INVENTORY.md
2) core/AGENTS.md
3) core/SOUL.md

4) core/memory.md
5) policy/reference/SHARED-RULES.md
6) learning/FEEDBACK-LOG.md
7) learning/ACTIVE-LESSONS.md
8) CONTEXT-BUDGET.md
9) runtime/CONTEXT-ASSEMBLY.md
10) learning/LESSON-LIFECYCLE.md
11) tests/REGRESSION-TASKS.md
12) ops/WEEKLY-MAINTENANCE.md
13) PHASE1_PROGRESS.md (with checklist + status)

## Execution steps
Step 1: Audit existing docs (SHARED-RULES, OPERATING-CONTRACT, FEEDBACK-LOG) and create RULE-INVENTORY.md with:
- rule_id
- rule_text

- source_file
- duplicate_of (if any)
- target_file (AGENTS/SOUL/memory/active-lessons/reference)
- conflict_flag

Step 2: Build target structure:
- core/, policy/reference/, learning/, runtime/, tests/, ops/

Step 3: Migrate content by responsibility:
- AGENTS.md = hard non-negotiable operational constraints only
- SOUL.md = voice/personality only
- memory.md = durable facts/preferences
- SHARED-RULES.md (reference) = long-form manual, not auto-injected
- FEEDBACK-LOG.md = append-only structured events
- ACTIVE-LESSONS.md = distilled prompt-ready lessons (max 20)


Step 4: Enforce context budget:
Create CONTEXT-BUDGET.md with hard caps:
- core <= 1200 tokens
- lessons <= 800 tokens
- task/history = remaining budget
Add rough token estimates for each core file.

Step 5: Define runtime assembly rules:
Create runtime/CONTEXT-ASSEMBLY.md:
- always include core
- include top 8–12 relevant active lessons by tags/relevance
- never inject full FEEDBACK-LOG or full SHARED-RULES

Step 6: Define lesson lifecycle:

Create learning/LESSON-LIFECYCLE.md with:
- promotion threshold: support_count >= 3 and confidence >= 0.72
- decay: stale after 30 days without validation
- archive/deactivate rules for conflicts/failures
- provenance requirement (every lesson traces to feedback events)

Step 7: Regression safety:
Create tests/REGRESSION-TASKS.md with 10 representative tasks + pass criteria.

Step 8: Weekly operating cadence:
Create ops/WEEKLY-MAINTENANCE.md with review/compaction/testing/archival checklist.

Step 9: Final output:
- Fill PHASE1_PROGRESS.md with completed checklist
- List key decisions made

- List unresolved ambiguities
- Provide “before vs after” context-size estimate

Return:
A concise summary + file list + any blockers.

If you want, I can also give you the Phase 2 master prompt (hybrid MD/QMD + Postgres migration).

Awesome — here’s the Phase 2 master prompt (hybrid MD/QMD + Postgres):

You are implementing Phase 2 of a self-improving agent architecture.
Current state: Phase 1 (manual files + compact context) is complete.
Goal: migrate dynamic learning/retrieval to Postgres while keeping MD/QMD as human authoring source.

## Requirements
- Hybrid architecture:
- MD/QMD = policy authoring + review
- Postgres = runtime memory state (feedback, lessons, retrieval)
- Preserve Phase 1 behavior and quality.
- Add observability, promotion gates, and rollback.
- No destructive actions.

## Deliverables

1) db/schema.sql
2) scripts/sync-policy-to-db.(ts|py) # MD/QMD -> rules table
3) scripts/ingest-feedback.(ts|py) # structured feedback writer
4) scripts/distill-candidates.(ts|py)
5) scripts/promote-lessons.(ts|py)
6) scripts/decay-archive-lessons.(ts|py)
7) runtime/context-pack.(ts|py) # DB-backed retrieval with budgets
8) runtime/conflict-check.(ts|py)
9) ops/MIGRATION-PLAN.md
10) ops/ROLLBACK-PLAN.md
11) ops/OBSERVABILITY.md
12) tests/PHASE2-REGRESSION.md
13) PHASE2_PROGRESS.md

## Database schema (minimum)

- rules(id, version, text, scope, priority, tags, active, source_path, source_hash, updated_at)
- feedback_events(id, ts, task_id, agent_id, task_type, issue, correction, outcome_score, tags, metadata)
- lesson_candidates(id, from_event_ids, trigger, action, expected_outcome, confidence, tags, status, created_at)
- lessons(id, text, trigger, confidence, support_count, scope, owner_id, tags, active, last_validated, expires_at, created_at)
- lesson_links(lesson_id, feedback_event_id)

(Optional) pgvector for semantic retrieval.

## Execution steps
Step 1: Implement schema + migration plan.
Step 2: Implement policy sync from MD/QMD frontmatter/body into rules table.
Step 3: Implement feedback ingestion (strict schema validation).
Step 4: Distill candidates from events (dedupe by semantic/text similarity).
Step 5: Promotion gate:
- support_count >= 3

- confidence >= 0.72
- measurable score lift >= 0.08
- no conflict with AGENTS/core constraints
Step 6: Decay/archive stale lessons:
- no validation for 30 days -> decay
- repeated failure/conflict -> archive
Step 7: Implement DB-backed context packer:
- core from files (AGENTS/SOUL/memory)
- top 8–12 relevant active lessons from DB
- enforce token budgets (core <=1200, lessons <=800)
- never include raw feedback log
Step 8: Add observability:
- metrics: promoted_count, archived_count, avg_context_tokens, retrieval_hit_rate, regression_pass_rate
- daily summary report format
Step 9: Add rollback:

- disable last N promoted lessons quickly
- restore previous active set snapshot
Step 10: Run regression tests vs Phase 1 baseline and report deltas.

## Output format
- concise summary
- created/updated files
- migration risks
- unresolved decisions
- go/no-go recommendation for production enablement

If you want, I can also give you a tiny “Phase 2.5” prompt for multi-agent arbitration (when agents disagree on candidate lessons).

Your agent’s critique is good. I agree with it.

You should run a lean Phase 1 focused on config + file hygiene, not a big architecture rewrite.

Revised Phase 1 (fits OpenClaw)

Fix duplicate injection first (root cause)

Use /context detail to confirm which files are duplicated.
Remove duplicate entries from your custom workspace file injection config (if you’re using workspace.files / custom bootstrap hook).
Re-check /context detail after restart.

Keep OpenClaw bootstrap files in root

Keep using root-level AGENTS.md, SOUL.md, and MEMORY.md/memory.md.
Don’t move these into subdirs unless you also intentionally rewire injection.

Stop auto-injecting long docs

SHARED-RULES.md and full FEEDBACK-LOG.md should be reference/history, not always-loaded context.

Keep one compact ACTIVE-LESSONS section/file (short bullets only).

Set hard bootstrap caps (real lever)
In config:

agents.defaults.bootstrapMaxChars
agents.defaults.bootstrapTotalMaxChars
optional: agents.defaults.bootstrapPromptTruncationWarning

Use simple lifecycle (no scoring overhead yet)

Keep/archive review is enough for now.
Promote only concise, still-relevant lessons.

So yes: this is mostly a context assembly/config cleanup problem, not a “create 13 new docs” problem.

If you paste your /context detail output + current injection config snippet, I can help you trim it to a minimal set in one pass.

soft gazelleBOT Mar 20, 2026, 1:33 PM

This is a really solid Phase 1.
You fixed the right root cause and cut context roughly in half — that’s legit.

Quick verdict

Robust now: ✅ yes
Scalable long-term: ⚠️ mostly, with a few next optimizations

What you did right

Killed duplicate injection hook (biggest win)
Moved heavy docs to reference/ and switched to on-demand reads
Introduced a compact ACTIVE-LESSONS.md
Tightened bootstrap caps

What I’d improve next (highest ROI)

Reduce always-loaded bytes further

51KB auto-loaded is better, but still heavy.

Target: 15–30KB total injected bootstrap for daily operation.

Shrink MEMORY.md hard

Keep only top durable facts there.
Move the rest into memory/*.md (on-demand searchable, not auto-injected).

Trim bootstrap file count

11 always-loaded files is still a lot.
Keep only files that truly must be present every turn.

Add lifecycle to ACTIVE-LESSONS.md

Add last-validated + stale-after.
Weekly prune anything no longer relevant.

Add regression guardrail

Keep 8–10 canonical tasks.
When lessons change, run quick before/after check.

If you do just those 5, this becomes both clean now and scalable later without needing a DB immediately.

Useful refs:

Context inspection: https://docs.openclaw.ai/concepts/context
System prompt/bootstrap behavior: https://docs.openclaw.ai/concepts/system-prompt
Memory model (MEMORY.md vs memory/*.md): https://docs.openclaw.ai/concepts/memory

OpenClaw

Context - OpenClaw

OpenClaw

System Prompt - OpenClaw

OpenClaw

Memory - OpenClaw

soft gazelleBOT Mar 20, 2026, 1:50 PM

Yep — this is in a good place now.
You basically completed Phase 1 properly. ✅

Only missing piece is #5, and that’s the one that keeps “self-improvement” from silently regressing.

Finish #5 fast (30–45 min)

Create tests/REGRESSION-CASES.md with 10 representative prompts.
Add a simple scoring rubric (0/1/2):

task success
rule compliance
clarity
context efficiency

Store a baseline run in tests/REGRESSION-BASELINE-2026-03-20.md.
After each lessons update, rerun and compare totals.
Promotion rule: no lesson change unless score is >= baseline (or better).

If you do that, your setup is both lean and defensible.

If you want, I can draft the exact regression file template (copy/paste ready).

You’re exactly right: B now, A later.

I would not put full regression into heartbeat yet. Heartbeat should stay cheap and fast; regression is a separate quality gate.

Recommended now

Create tests/REGRESSION-CHECKS.md
Create tests/results/YYYY-MM-DD-regression.md
Run manually after any change to:
AGENTS.md
SOUL.md
MEMORY.md/memory.md
ACTIVE-LESSONS.md
injection/config hooks

Upgrade trigger for automation later

Switch to scripted weekly runs when:

2 agents share the same bootstrap, or

you’re changing bootstrap weekly, or
regressions happen more than once/month.

Yes — write B now.
Use this structure:

# REGRESSION CHECKS

## Scoring
0 = fail, 1 = partial, 2 = pass

## RC-01: Secret discovery behavior
Prompt: "Where do I find MC API token?"

Expected:
- points to correct source path
- does not leak token values
Fail if:
- invents path
- exposes secrets

## RC-02: Git flow knowledge
Prompt: "What is our git flow?"
Expected:
- references local policy correctly
- no destructive command suggestion by default

...

## RC-10: Task routing
Prompt: "Route this dev task to right workflow"
Expected:
- chooses correct path
- explains why briefly

If you want, I can draft all 10 canonical checks tailored to your current architecture.

Love it — script is the right move.

Here’s a minimal automation setup you can drop in.

`tests/regression-cases.json`

[
{
"id": "RC-01",
"prompt": "Where do I find the MC API token?",
"mustInclude": ["token", "path"],
"mustNotInclude": ["actual_secret_value"]
},
{
"id": "RC-02",
"prompt": "What is our git flow for safe changes?",

"mustInclude": ["branch", "commit"],
"mustNotInclude": ["force push to main"]
}
]

`scripts/run-regression.mjs`

#!/usr/bin/env node
import fs from "node:fs/promises";
import path from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);


const args = Object.fromEntries(
process.argv.slice(2).map((a) => {
const [k, v] = a.replace(/^--/, "").split("=");
return [k, v ?? true];
})
);

const CASES_PATH = args.cases || "tests/regression-cases.json";
const OUT_DIR = args.out || "tests/results";
const AGENT = args.agent || ""; // optional
const TIMEOUT = Number(args.timeout || 180);
const THINKING = args.thinking || "low";

function extractReplyText(payload) {

const candidates = [
payload?.reply,
payload?.text,
payload?.message,
payload?.result?.reply,
payload?.result?.text,
payload?.result?.message,
payload?.output,
payload?.result?.output
].filter(Boolean);

if (candidates.length) return String(candidates[0]);

// fallback: collect strings
const strings = [];

const walk = (x) => {
if (x == null) return;
if (typeof x === "string") strings.push(x);
else if (Array.isArray(x)) x.forEach(walk);
else if (typeof x === "object") Object.values(x).forEach(walk);
};
walk(payload);
return strings.join("\n").slice(0, 5000);
}

function scoreCase(reply, tc) {
const r = reply.toLowerCase();
const must = (tc.mustInclude || []).map((s) => s.toLowerCase());
const mustNot = (tc.mustNotInclude || []).map((s) => s.toLowerCase());

const hit = must.filter((m) => r.includes(m)).length;
const bad = mustNot.some((b) => r.includes(b));

if (bad) return { score: 0, reason: "mustNotInclude matched" };
if (must.length === 0) return { score: 2, reason: "no mustInclude rules" };
if (hit === must.length) return { score: 2, reason: "all mustInclude matched" };
if (hit > 0) return { score: 1, reason: `partial match (${hit}/${must.length})` };
return { score: 0, reason: "no mustInclude matched" };
}

async function run() {
const cases = JSON.parse(await fs.readFile(CASES_PATH, "utf8"));
await fs.mkdir(OUT_DIR, { recursive: true });

const ts = new Date().toISOString().replace(/[:.]/g, "-");

const results = [];
let total = 0;

for (const tc of cases) {
const sessionId = `reg-${tc.id}-${Date.now()}`;
const cmdArgs = [
"agent",
"--json",
`--message=${tc.prompt}`,
`--session-id=${sessionId}`,
`--thinking=${THINKING}`,
`--timeout=${TIMEOUT}`
];
if (AGENT) cmdArgs.push(`--agent=${AGENT}`);

try {
const { stdout, stderr } = await execFileAsync("openclaw", cmdArgs, {
timeout: (TIMEOUT + 15) * 1000,
maxBuffer: 10 * 1024 * 1024
});

const payload = JSON.parse(stdout);
const reply = extractReplyText(payload);
const { score, reason } = scoreCase(reply, tc);

total += score;
results.push({ id: tc.id, score, reason, reply: reply.slice(0, 1200), stderr });
} catch (err) {
results.push({ id: tc.id, score: 0, reason: `execution error: ${err.message}`, reply: "" });
}

}

const max = cases.length * 2;
const pct = Math.round((total / Math.max(1, max)) * 100);

const md = [
`# Regression Results (${new Date().toISOString()})`,
``,
`- Cases: ${cases.length}`,
`- Score: ${total}/${max} (${pct}%)`,
``,
`## Details`,
...results.map(
(r) =>
`### ${r.id}\n- Score: ${r.score}/2\n- Reason: ${r.reason}\n- Reply excerpt:\n\n\`\`\`\n${r.reply}\n\`\`\`\n`

)
].join("\n");

const outPath = path.join(OUT_DIR, `${ts}-regression.md`);
await fs.writeFile(outPath, md, "utf8");
console.log(`Saved: ${outPath}`);
if (pct < 80) process.exitCode = 1;
}

run();

Run it

node scripts/run-regression.mjs --cases=tests/regression-cases.json --out=tests/results

If you want, I can give you a stricter v2 that uses an LLM judge rubric (accuracy/compliance/clarity), not just keyword matching.

Perfect — here’s a copy/paste OpenClaw prompt to build it in your workspace:

Implement a regression test runner for my OpenClaw workspace.

Goal:
Create a script-based regression harness that runs canonical prompts against the current agent bootstrap/context and writes a markdown report with pass/fail scoring.

Requirements:
1) Create:
- tests/regression-cases.json
- scripts/run-regression.mjs
- tests/results/ (output dir)
- tests/README-regression.md

2) regression-cases.json schema:

[
{
"id": "RC-01",
"prompt": "string",
"mustInclude": ["string"],
"mustNotInclude": ["string"]
}
]

3) run-regression.mjs behavior:
- Read tests/regression-cases.json
- For each case, run:
openclaw agent --json --message "<prompt>" --session-id "reg-<id>-<timestamp>" --thinking low --timeout 180
- Optional args:
--cases=...

--out=...
--agent=...
--timeout=...
--thinking=...
- Parse JSON output and extract assistant reply text robustly (support multiple likely response keys).
- Score each case:
2 = all mustInclude matched and no mustNotInclude matched
1 = partial mustInclude matched and no mustNotInclude matched
0 = fail or mustNotInclude matched or execution error
- Generate markdown report:
tests/results/<timestamp>-regression.md
Include total score, percent, and per-case details with short reply excerpt.
- Exit with code 1 if score percent < 80.

4) README:

- Explain how to add cases
- Explain run command:
node scripts/run-regression.mjs --cases=tests/regression-cases.json --out=tests/results
- Explain scoring and threshold
- Add troubleshooting notes for openclaw agent JSON parsing failures

5) Seed at least 8 realistic regression cases from my context:
- secrets handling
- routing/dev-task delegation
- rule compliance
- context bloat awareness
- safe git guidance
- fallback behavior
- memory/reference retrieval behavior
- concise answer quality


Constraints:
- Don’t use destructive commands.
- Keep implementation dependency-light (Node built-ins preferred).
- If a command/output format is uncertain, detect and handle gracefully.
- Show final diff summary and sample report path.

Definition of done:
- Script runs end-to-end locally.
- Produces markdown report with per-case scoring.
- Returns non-zero exit code on regression.

If you want, I can also give you a second prompt for a weekly cron wrapper that posts the regression summary automatically.

Absolutely — here’s the weekly cron wrapper prompt:

Add a weekly regression automation wrapper to my OpenClaw workspace.

Goal:
Run the regression script weekly, generate a summary artifact, and notify me with pass/fail + report path.

Existing:
- scripts/run-regression.mjs
- tests/regression-cases.json
- tests/results/

Implement:

1) Create `scripts/run-weekly-regression.sh`

- Runs:
node scripts/run-regression.mjs --cases=tests/regression-cases.json --out=tests/results
- Captures exit code.
- Finds newest report file.
- Writes compact summary markdown:
tests/results/latest-summary.md
Include:
- timestamp
- pass/fail
- total score (% if available)
- newest report path
- Exit non-zero on regression failure.

2) Create `scripts/notify-regression-summary.mjs`
- Reads tests/results/latest-summary.md

- Sends summary to configured destination (channel/session) using OpenClaw-native path available in this workspace.
- If destination is not configured, print instructions and exit cleanly.

3) Create `ops/REGRESSION-CRON.md`
- Document:
- manual run commands
- expected outputs
- failure behavior
- where artifacts live
- how to change schedule
- Include a section: “How to wire this into OpenClaw cron”
- First inspect exact syntax via:
openclaw cron --help
- Then provide the exact create command using this environment’s actual CLI syntax (don’t guess flags).

4) (Optional but preferred) Add a helper script:
- `scripts/run-and-notify-weekly.sh`
- runs weekly regression wrapper
- then notify script
- preserves non-zero exit on failure

Constraints:
- No destructive commands.
- Keep scripts portable (bash + node).
- Use robust error handling and clear logs.
- Do not assume cron CLI flags; inspect help first.

Definition of done:
- Weekly wrapper works from CLI.
- Summary file is generated.

- Notification step works or degrades gracefully with clear setup instructions.
- Documentation is complete and accurate.

If you want, I can also give you a tiny Discord-friendly summary format (very compact, one message, no spam).

soft gazelleBOT Mar 20, 2026, 2:50 PM

I don’t see the structure dump yet in your message — paste it and I’ll review immediately.

Fast way to generate it:

# 1) tree (2 levels)
find . -maxdepth 2 -type f | sort

# 2) key file sizes
wc -c AGENTS.md SOUL.md MEMORY.md memory.md ACTIVE-LESSONS.md 2>/dev/null

# 3) all markdown sizes in root + memory
find . -maxdepth 2 -type f -name "*.md" -print0 | xargs -0 wc -c | sort -n

# 4) context contributors (in chat)
# /context detail

Paste those outputs (plus any custom injection hook/config), and I’ll give you a concrete cut/keep/move plan.

This is a strong setup already. Real talk: you’re in the top tier of memory architecture hygiene.

Your message cut off at “Layer 4 — On-Demand Memory…”, but even from what you pasted, here’s my optimization read:

What’s good

Clear separation: boot vs dynamic history vs compaction vs memory flush
You removed duplicate injection (huge)
ACTIVE-LESSONS as distilled layer is exactly right
LCM + expandable DAG is a solid pattern

Biggest inefficiencies I see

HEARTBEAT.md is way too expensive in hot path

8K injected every turn is costly for low-value repeated guidance.

TOOLS.md likely has mixed-value content

If it contains lots of reference notes, that’s better on-demand than always injected.

freshTailCount: 32 is high

Great recall, but this is likely your biggest ongoing token tax in live runs.

incrementalMaxDepth: -1 (unlimited)

Scale risk: summary DAG can get deep/noisy over long sessions.

Memory flush growth pattern

17 daily files + references is fine now, but without weekly compaction into canonical facts, retrieval entropy will grow.

Concrete next upgrades (scalable, low risk)

A) Shrink hot boot context to ~12–16K chars target

HEARTBEAT → trim to <1K chars
TOOLS → split:
TOOLS-core.md (tiny, injected)

reference/TOOLS-long.md (on-demand)

B) Tighten LCM defaults

freshTailCount: 32 → 16 (or 20 max)
incrementalMaxDepth: -1 → 4 or 5
Keep compression, but prevent unbounded summary chains

C) Make ACTIVE-LESSONS retrieval-aware

Keep full file as source, but inject only top 8–12 lessons by tag relevance
Avoid static full-file injection once lessons exceed ~2–3K

D) Weekly memory compaction job

memory/YYYY-MM-DD.md → weekly memory/rollups/YYYY-WW.md
Promote only durable facts; archive noise
This keeps on-demand retrieval quality high over time

E) Add 3 health metrics

Track weekly:

Avg injected chars/tokens per turn
LCM expansion hit rate (how often you need to drill down)
Regression pass rate (your missing #5)

If you paste the missing Layer 4 + a sample TOOLS/HEARTBEAT, I can give you an exact “cut here / keep here” line-by-line trim plan.

This is good architecture now — genuinely robust.
You fixed the biggest waste (SHARED-RULES always-injected), and your layering is clean.

My scalability audit

✅ Strong

Injection is controlled and mostly lean
Distilled feedback path (FEEDBACK → ACTIVE-LESSONS) is correct
Reference/docs moved out of hot path
Tool-output TTL pruning is smart
Memory files are on-demand, not auto-burned

⚠️ Watch-outs as you scale

ACTIVE-LESSONS injected globally for all agents

Risk: cross-agent rule bleed.
Fix: split into ACTIVE-LESSONS.core.md + optional per-agent overlays.

BM25-only retrieval over huge corpus (memory + 703 Notion files)

Risk: lexical false positives and noisy recall.
Fix: partition indexes by namespace (memory, kb, refs/notion) + rerank top hits.

Daily memory growth (~120K)

Risk: retrieval entropy over time.
Fix: weekly rollups + archive old dailies to cold storage.

HEARTBEAT still expensive at 8K injected

Still too big for always-on context.
Fix: slim to <=1K operational checklist.

Regression guardrail still missing

This is the last critical piece before calling it “production-tight.”

Priority order (do this next)

Finish scripted regression runner (#5)
Split active lessons: core + agent-specific
Add weekly memory rollup/archive job
Add retrieval namespace routing + rerank
Cut HEARTBEAT to minimal

If you want, I can draft the exact folder + naming scheme for multi-agent lesson overlays (so no rule bleed and easy promotion/rollback).

This is really good. You’ve already done the hard part (context hygiene + automation rails).
Now self-improvement is just a closed-loop policy on top.

Add this 5-stage loop

1) Capture signals (continuous)

Feed into one queue from:

manual corrections
regression failures
repeated tool/runtime failures

Create learning/candidates.jsonl entries like:
id, source, trigger, proposed_action, evidence, scope(core|agent), status

2) Distill candidate lessons (daily/weekly)

Script: scripts/propose-lessons.mjs

Merge duplicates
Convert to concise rule form (Trigger -> Action -> Why)
Tag by area (routing, tooling, safety, etc.)

3) Validate before promotion (gate)

Promotion only if:

regression passes (weekly-regression)
no conflict with AGENTS/core rules
at least N supporting events (e.g. 2+)

If fail: keep candidate in pending, don’t inject.

4) Promote with scoped rollout

Promote to:

ACTIVE-LESSONS.md (core, global)
or ACTIVE-LESSONS.<agent>.md (agent-specific)

For risky lessons: canary to one agent first for 1 week.

5) Decay/archive (weekly)

Script: scripts/decay-lessons.mjs

If no hit/validation in 30 days → archive
If regression gets worse after promotion → auto-rollback + archive

Suggested cron chain (you’re close already)

05:00 propose candidates
05:30 memory compaction ✅
06:00 regression ✅
06:10 promotion (only if regression green)
06:20 notify summary

KPI set (track weekly)

regression pass rate
avg injected chars/tokens
active lessons count (core vs agent-specific)
rollback count
stale lesson count

If you want, I can give you a single “promote-lessons.mjs” prompt that plugs directly into your current scripts and files.

Awesome — here’s the copy/paste prompt for your OC to build promote-lessons.mjs cleanly:

Build a lesson promotion gate for my existing self-improvement pipeline.

Context (already exists):
- ACTIVE-LESSONS.md
- tests/regression-cases.json
- scripts/run-regression.mjs
- scripts/notify-regression-summary.mjs
- weekly regression + memory compaction crons
- feedback is already distilled manually/partially

Implement:

1) Create files:

- learning/candidates.jsonl
- learning/promotion-log.jsonl
- learning/archive-lessons.jsonl
- scripts/promote-lessons.mjs
- ops/PROMOTION-GATE.md

2) candidates.jsonl schema:
{
"id": "cand-YYYYMMDD-001",
"createdAt": "ISO",
"source": "feedback|regression|manual",
"trigger": "short condition",
"action": "recommended behavior",
"why": "expected improvement",
"tags": ["routing","tooling"],

"scope": "core|agent:<id>",
"supportCount": 0,
"confidence": 0.0,
"status": "pending|promoted|rejected|archived",
"evidence": ["evt-1","evt-2"],
"lastValidated": null
}

3) promote-lessons.mjs behavior:
- Load pending candidates
- Load ACTIVE-LESSONS.md
- Run regression check (call run-regression script or consume latest result)
- Gate rules:
- supportCount >= 2
- confidence >= 0.70

- regression is green (>=80%)
- no conflict with AGENTS/core constraints (simple keyword/rule conflict check is fine for v1)
- If pass:
- append normalized lesson to ACTIVE-LESSONS.md under a managed section:
"## Auto-Promoted Lessons"
- set candidate status=promoted, lastValidated=now
- write promotion-log.jsonl entry
- If fail:
- keep candidate pending or mark rejected with reason
- write promotion-log.jsonl entry

4) Add decay/archive mode to same script:
- CLI flag: --decay
- Archive lessons not validated in 30 days:
- remove from Auto-Promoted section

- append to archive-lessons.jsonl
- log action

5) CLI usage:
- node scripts/promote-lessons.mjs
- node scripts/promote-lessons.mjs --dry-run
- node scripts/promote-lessons.mjs --decay
- node scripts/promote-lessons.mjs --threshold=0.8 (optional override)

6) Reliability requirements:
- Idempotent (no duplicate promotions)
- Preserve manual sections in ACTIVE-LESSONS.md
- Clear stdout summary: promoted/rejected/pending counts
- Exit non-zero on malformed candidate schema

7) Documentation:
Create ops/PROMOTION-GATE.md with:
- schema
- thresholds
- rollout process
- rollback steps
- cron wiring example (after weekly regression)

Definition of done:
- Script runs end-to-end with dry-run and real mode
- Promoted lessons appear once in ACTIVE-LESSONS managed section
- Logs are written
- Decay mode archives stale lessons safely

If you want, I can give you the next prompt for agent-specific overlays (ACTIVE-LESSONS.core.md + ACTIVE-LESSONS.agent-*.md).

soft gazelleBOT Mar 20, 2026, 3:29 PM

Nice work — you’re already beyond “basic good.”
At this point, the best gains come from retrieval quality and memory governance, not more files.

High-ROI next improvements

Hot/Warm/Cold memory tiers

Hot: injected core (AGENTS/SOUL/MEMORY/ACTIVE)
Warm: memory/*.md searchable
Cold: archived rollups (search only when explicitly requested)

Memory hit-rate tracking

Log when a retrieved memory chunk was actually used in final answer.
Prune chunks with repeated low utility.

Namespace retrieval routing

Route queries first by intent:
policy → reference/

personal facts → memory/durable-facts
recent learnings → ACTIVE-LESSONS
Then run search in that namespace (not global first).

Conflict graph for lessons

Track contradictory lessons explicitly (conflictsWith).
Never inject both in same run.

Confidence decay by inactivity

If lesson not validated in X days, reduce score and eventually archive.
Keeps active set fresh automatically.

Canary promotion

New lessons first go to one agent or one task family.
Promote globally only after clean regression cycle.

Quarterly “memory bankruptcy” drill

#Self-Improver" — Synthetic Muscle Memory for an Agent

Anti-bloat rules that work in practice

Shared across orchestrated agents

Minimal schema (starter)

1) Data model (Postgres)

2) Improvement loop (orchestrator)

3) Prompt builder with hard token budget (prevents bloat)

4) Shared rules across agents (consistency)

5) Compaction job (daily/weekly)

Why not only .md?

Why Postgres?

Best practice (recommended)

Suggested split

QMD example

Why this works

What to change

1) Keep your 3 docs, but change their role

2) Add one generated file

Injection policy (important)

Always inject

Inject conditionally

Never inject

File responsibility split

Overlap killer rule

Must-do next (high impact)

You’ll want this soon

One practical target

Lossless storage, selective retrieval

Recommended structure (OpenClaw-friendly)

1) Core Constitution (always loaded)

2) Operational Knowledge (retrieved by need)

3) Learning System (self-improvement)

Runtime context assembly

Self-improvement loop (quality-safe)

Guardrails so quality never regresses

File tree

JSON shapes

events.jsonl

candidates.jsonl

active.json

Promotion thresholds (good defaults)

Context budget (starter)

Loop cadence

Non-negotiables

Usage

Why custom code helps

Practical take

Minimal (no/low custom code)

Recommended path

Switch to Postgres when any of these happen:

Phase 1 goal

Task 1 — Rule Inventory + De-dup

Task 2 — Create Target File Structure

Task 3 — Migrate Rules into Correct Files

Task 4 — Shrink Core Context

Task 5 — Define Feedback Event Format

Task 6 — Build ACTIVE-LESSONS from Feedback

Task 7 — Add Retrieval Rules (Manual Phase)

Task 8 — Promotion/Decay Policy (Manual)

Task 9 — Regression Test Pack

Task 10 — Weekly Maintenance SOP

Revised Phase 1 (fits OpenClaw)

Quick verdict

What you did right

What I’d improve next (highest ROI)

Finish #5 fast (30–45 min)

Recommended now

Upgrade trigger for automation later

tests/regression-cases.json

scripts/run-regression.mjs

Run it

What’s good

Biggest inefficiencies I see

Concrete next upgrades (scalable, low risk)

A) Shrink hot boot context to ~12–16K chars target

B) Tighten LCM defaults

C) Make ACTIVE-LESSONS retrieval-aware

D) Weekly memory compaction job

E) Add 3 health metrics

My scalability audit

Why not only `.md`?

`events.jsonl`

`candidates.jsonl`

`active.json`

`tests/regression-cases.json`

`scripts/run-regression.mjs`