Mule Post Mortem | Friends of the Crustacean 🦞🤝 | Page 1

green socket Mar 23, 2026, 9:33 AM

So I'm at the point where I think I need to reset my agent. Running on a Pi 4 with Gemini 3.1 Flash Lite being the main LLM.
How did I get here, and why?
It all began with API call errors, as I am using the free Google API, which just wasn't able to handle the spikes in traffic.
But no problem, I could just run my calls through a LiteLLM proxy with long enough retry times to help smooth out the spikes.
That mostly worked; occasional API errors persisted, but mainly because of requests-per-day limits.
At this point I noticed heartbeat wasn't firing.
This was solved by setting the gateway timeout interval much higher.
So far so good. Given that I still had intermittent API issues and really wanted my agent to labour away at issues asynchronously, I tried to set up a persistent task architecture where heartbeat would trigger main to continue in-progress tasks. I never got that to work.
In the meantime I also tried to solve the API limit by adding some free OpenRouter models. They were slower and dumber than Gemini, but they should have given me more API calls to work with.
Setting up the models worked great. I then had the idea to have my main agent work as an orchestrator, calling sub-agents that used those cheaper models to solve the problems.
In retrospect that was a terrible idea.
It led to both still being vulnerable to Gemini API failures (for some reason even more so) while having the output be of the quality of the worse models.
My attempts to doctor together some heartbeat task-resumption logic had by now culminated in my heartbeat file being a complicated checklist that somehow did none of the things I wanted it to.
Hallucinations, context bloat, and the fragments of all these projects being implemented concurrently without a holistic plan ended in a crash-out where the agent started looping the same output over and over in a channel in which it was supposed to deliver files to me.

green socket Mar 23, 2026, 9:35 AM

All this leads me to wonder: was this maybe a little too much, a little too early? Was I asking too much of the limited models I was using? Or should I start again, this time with a holistic approach from the start? Probably not right.

real crater Mar 23, 2026, 9:36 AM

@ancient dock

ancient dockBOT Mar 23, 2026, 9:40 AM

You’re not crazy — this is a classic stacked-failure spiral.

Why this blew up

What you described usually happens when these combine:

Free-tier quota + retries → latency spikes + periodic hard failures.
Heartbeat doing “workflow orchestration” (instead of lightweight checks) → context bloat.
Mixed model quality in orchestration → a stronger planner delegating to weaker executors = output at the quality of the weaker models.
No isolation/idempotency on delivery → same task state gets replayed and re-sent in loops.

So yeah — “reset and simplify” is the right move.

Recovery plan (minimal + stable)

Before changing anything, please paste:

pnpm openclaw --version
pnpm openclaw status --all
pnpm openclaw models status
pnpm openclaw cron status
pnpm openclaw system heartbeat last

Then do a soft reset architecture:

Keep one primary model path first (no orchestrator/sub-agent fanout yet).
Keep heartbeat tiny; use it for awareness, not long-running task continuation.
Move precise async jobs to cron isolated sessions (not heartbeat).
Set heartbeat conservatively:
  target: "none" (initially)
  lightContext: true
  isolatedSession: true
Replace HEARTBEAT.md with a tiny checklist (3–5 bullets max).

Example heartbeat intent: “check urgent errors / pending critical item; otherwise HEARTBEAT_OK.”

Key docs (worth following exactly)

FAQ: https://docs.openclaw.ai/help/faq
Troubleshooting ladder: https://docs.openclaw.ai/gateway/troubleshooting
Heartbeat config: https://docs.openclaw.ai/gateway/heartbeat
Cron vs heartbeat: https://docs.openclaw.ai/automation/cron-vs-heartbeat

If you paste those command outputs, I can help you produce a clean “v2 minimal config” from your current setup without nuking everything.

green socket Mar 23, 2026, 9:56 AM

Great. But my agent's memory and agent files are also all fucked up, so no. I'm doing a hard reset.
What I am wondering about is whether I can get OpenClaw to work the way I want it to:
using different APIs for different things and having a main agent that is capable of working independently on tasks set by me, being respawned by the heartbeat if necessary.

@ancient dock

ancient dockBOT Mar 23, 2026, 9:58 AM

Yeah — you can absolutely get that behavior, but with one key design change:

Heartbeat should be a watchdog, not the worker.

If heartbeat tries to do long task continuation itself, it eventually turns into the loopy mess you just hit.

Architecture that actually works

Main agent = planner + operator interface

Takes your instructions
Writes/updates task state
Decides priority

Cron isolated runs = worker

Runs on schedule
Processes one task step at a time
Stops (no infinite “keep going” behavior)

Heartbeat = supervisor

Checks “stuck/failed/needs attention?”
Sends alert or HEARTBEAT_OK
Optionally nudges worker, but does not do the full job

Multi-API / multi-model part

Also yes — OpenClaw supports this natively via:

agents.defaults.model.primary
agents.defaults.model.fallbacks
provider auth profile rotation + cooldowns

So you can have one main path and controlled failover paths without hand-rolled orchestration chaos.
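
To make the primary-plus-fallbacks idea concrete, here is a minimal Python sketch of ordered failover with a per-model cooldown. It is not OpenClaw's implementation: `call_with_failover`, `ProviderError`, the cooldown length, and the model names in the usage note are all hypothetical, and the real behaviour is whatever the failover docs describe.

```python
import time
from typing import Callable, Dict, List, Optional


class ProviderError(Exception):
    """Hypothetical error raised when a model call fails (quota, timeout, ...)."""


def call_with_failover(
    models: List[str],
    call: Callable[[str], str],
    cooldown_until: Dict[str, float],
    cooldown_seconds: float = 60.0,
    now: Optional[float] = None,
) -> str:
    """Try models in order (primary first), skipping any still cooling down."""
    current = time.time() if now is None else now
    last_error: Optional[Exception] = None
    for model in models:
        if cooldown_until.get(model, 0.0) > current:
            continue  # failed recently; do not hammer it again yet
        try:
            return call(model)
        except ProviderError as exc:
            # Park the failing model and move on to the next one in the list.
            cooldown_until[model] = current + cooldown_seconds
            last_error = exc
    raise ProviderError(f"all models failed or are cooling down: {last_error}")
```

Called with something like `["primary", "fallback"]`, the second model is only used when the first raises or is cooling down, which is the controlled failover described above rather than hand-rolled orchestration.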

Docs:

Models/fallbacks: https://docs.openclaw.ai/concepts/models
Failover behavior: https://docs.openclaw.ai/concepts/model-failover
Heartbeat: https://docs.openclaw.ai/gateway/heartbeat
Cron vs heartbeat: https://docs.openclaw.ai/automation/cron-vs-heartbeat

Hard reset guidance (practical)

After reset, start with this “boring” baseline for a week:

1 primary + 1 fallback only
heartbeat: lightContext: true, isolatedSession: true, tiny checklist
heartbeat target initially none while testing
one isolated cron worker for task continuation
strict rule: each run handles one task step max

If you want, I can help you sketch an exact minimal config pattern for this setup once you paste:

pnpm openclaw --version
pnpm openclaw models status
pnpm openclaw cron list
pnpm openclaw system heartbeat last

Yep — it can work that way. Here’s a clean pattern you can rebuild around after hard reset.

1) Minimal config shape (JSON5)

{
  agents: {
    defaults: {
      model: {
        primary: "provider/high-quality-model",
        fallbacks: ["provider/reliable-cheaper-model"]
      },
      heartbeat: {
        every: "30m",
        target: "none", // turn on delivery later
        lightContext: true,
        isolatedSession: true,
        prompt: "Read HEARTBEAT.md if it exists (workspace context). Follow it strictly. Do not infer or repeat old tasks from prior chats. If nothing needs attention, reply HEARTBEAT_OK."
      }
    }
  }
}

2) Tiny `HEARTBEAT.md`

# Heartbeat checklist
- Check if task worker has failed or stalled.
- If stalled, report ONE alert with the task id.
- Do not execute long tasks here.
- If nothing needs attention, reply HEARTBEAT_OK.

3) Use cron as the worker (not heartbeat)

Create one isolated worker job that processes one step per run, then exits.
(That gives you resumable async behavior without loops.)

4) Multi-API behavior

Yes — use:

main model path for quality
fallbacks for resilience
optional per-cron --model override for cheap worker runs

So: main thinks, cron works, heartbeat supervises.

If you want, next message I can draft:

a one-job cron command template, and
a simple task-state file format (queued → running → done/failed) that avoids duplicate sends.

green socket Mar 23, 2026, 10:03 AM

I have three questions. First, wouldn't the cron job always trigger regardless of whether it's needed or not, burning API calls?
Secondly, wouldn't the cron job need some way to manage memory, as it would lose its context each step?
And wouldn't cron and main profit from a predefined communication protocol?

@ancient dock

ancient dockBOT Mar 23, 2026, 10:04 AM

Exactly the right questions.

1) “Won’t cron always burn API calls?”

If you run it naively, yes.
How to avoid that:

Guard clause first: worker checks task queue; if empty, exit immediately.
Use one-shot jobs (--at + --delete-after-run) instead of permanent frequent polling where possible.
If queue is usually empty, use a slower cadence + active hours.
Keep heartbeat cheap (isolatedSession + lightContext) and let it only detect “work exists”, not do work.

So cron can be very low-cost if it’s mostly “check + noop”.
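
A rough Python sketch of that "check + noop" guard, assuming task state lives in a JSON file with a `tasks` list whose entries carry `id`, `status` and `priority`. The function names are made up for illustration, and whether your cron setup can run a plain script before any model is invoked is also an assumption.

```python
import json
from pathlib import Path
from typing import Optional


def next_runnable_task(path: Path) -> Optional[dict]:
    """Return the highest-priority queued task, or None if there is nothing to do."""
    if not path.exists():
        return None
    tasks = json.loads(path.read_text(encoding="utf-8")).get("tasks", [])
    queued = [t for t in tasks if t.get("status") == "queued"]
    if not queued:
        return None
    # Assumption: a lower number means higher priority.
    return min(queued, key=lambda t: t.get("priority", 0))


def run_worker(path: Path) -> str:
    task = next_runnable_task(path)
    if task is None:
        return "noop"  # exit before any model call is made
    return f"would process {task['id']}"
```

The point of the design is that the empty-queue path touches only the local file, so an idle run costs no API quota.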

2) “How does cron keep memory if isolated?”

By design, isolated cron should not depend on chat memory.

Use durable task state (file/db) as memory:

task id
objective
current state (queued/running/waiting/done/failed)
last output summary
next action
retry count / cooldown

Each run reads state, advances one step, writes state back, exits.
That gives resumability without context bloat.
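
As a sketch of that read, advance, write, exit cycle in Python: the field names (`progress.step`, `lastSummary`, `nextAction`) and the helper names are assumptions for illustration, not a fixed OpenClaw format. The write goes through a temp file and a rename so a crash mid-write should not leave a half-written state file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the original."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def advance_one_step(path: Path, task_id: str, summary: str, next_action: str) -> dict:
    """Record the outcome of exactly one step for one task, then return it."""
    state = load_state(path)
    task = next(t for t in state["tasks"] if t["id"] == task_id)
    task["progress"]["step"] += 1
    task["progress"]["lastSummary"] = summary
    task["progress"]["nextAction"] = next_action
    save_state(path, state)
    return task
```

Everything the next run needs is in the file, so the worker can start with an empty context every time.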

3) “Should main and cron use a predefined protocol?”

100% yes. That’s the missing piece in most setups.

A minimal protocol:

Main can: create, reprioritize, cancel tasks.
Worker can: move task one state forward + append result.
State transitions only (no freeform rewriting).
Idempotency key per step to prevent duplicate sends.
Single writer rule for each field (avoid racy corruption).

This is what stops loops/hallucinated “progress”.
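
One way the single-writer rule could be enforced, sketched in Python. The `FIELD_OWNER` map and the role names are hypothetical; the idea is only that every field has exactly one role allowed to write it, and any patch touching someone else's field is rejected outright.

```python
from typing import Dict, Set

# Hypothetical ownership map: each top-level task field has exactly one writer.
FIELD_OWNER: Dict[str, str] = {
    "title": "main",
    "goal": "main",
    "priority": "main",
    "progress": "worker",
    "result": "worker",
}


def rejected_fields(role: str, changed: Set[str]) -> Set[str]:
    """Return the fields this role tried to change but does not own."""
    return {f for f in changed if FIELD_OWNER.get(f) != role}


def apply_patch(task: dict, role: str, patch: dict) -> dict:
    """Apply a patch only if every touched field belongs to the caller."""
    bad = rejected_fields(role, set(patch))
    if bad:
        raise PermissionError(f"{role} may not write: {sorted(bad)}")
    return {**task, **patch}
```

With this in place, a worker that hallucinates a new goal fails loudly instead of quietly rewriting the task.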

If you want, I can give you a concrete TASKS.json schema + transition table in next message (small enough to paste directly into your reset).
Docs: https://docs.openclaw.ai/automation/cron-jobs and https://docs.openclaw.ai/automation/cron-vs-heartbeat

Perfect — here’s the concrete protocol I’d use.

`TASKS.json` (durable memory between cron runs)

{
  "version": 1,
  "tasks": [
    {
      "id": "task-001",
      "title": "example task",
      "status": "queued",
      "priority": 2,
      "createdAt": "2026-03-23T10:00:00Z",
      "updatedAt": "2026-03-23T10:00:00Z",

      "goal": "What success looks like",
      "constraints": ["max cost low", "one step per run"],
      "inputs": ["optional refs/links"],

      "progress": {
        "step": 0,
        "lastSummary": "",
        "nextAction": "start",
        "lastOutputHash": ""
      },

      "retry": {
        "count": 0,
        "nextEligibleAt": null
      },

      "lease": {
        "owner": null,
        "runId": null,
        "expiresAt": null
      },

      "result": {
        "outcome": null,
        "artifacts": [],
        "error": null
      }
    }
  ]
}

State machine (strict)

queued -> running (worker claims lease)
running -> waiting (blocked on external dependency)
running -> done (success)
running -> failed (max retries / unrecoverable)
waiting -> queued (dependency resolved)
failed -> queued (manual retry by main only)

No other transitions allowed.
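
The transition table above is small enough to check mechanically. A minimal Python sketch, where the `transition` helper and the `actor` argument are illustrative names rather than anything OpenClaw provides:

```python
from typing import Dict, Set

# The six transitions listed above; anything else is rejected.
ALLOWED: Dict[str, Set[str]] = {
    "queued": {"running"},
    "running": {"waiting", "done", "failed"},
    "waiting": {"queued"},
    "failed": {"queued"},
    "done": set(),
}


def transition(task: dict, new_status: str, actor: str = "worker") -> dict:
    """Return a copy of the task in its new status, or raise if the move is illegal."""
    old = task["status"]
    if new_status not in ALLOWED.get(old, set()):
        raise ValueError(f"illegal transition {old} -> {new_status}")
    if old == "failed" and actor != "main":
        raise ValueError("failed -> queued is a manual retry by main only")
    return {**task, "status": new_status}
```

Because `done` has no outgoing transitions, a finished task cannot be pulled back into the queue by a confused run.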

Ownership rules (prevents corruption)

Main agent can: create/cancel/reprioritize tasks; edit goal/constraints.
Cron worker can: claim lease, advance one step, update progress/retry/result.
Heartbeat can: detect stale lease and raise alert (or requeue only if lease expired).
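
A sketch of how the lease could work on both sides, in Python. The helper names and the 10-minute TTL are assumptions, and timestamps are written with `isoformat()` offsets (`+00:00`) rather than the `Z` suffix, because older Python versions do not parse `Z`. This is not a cross-process lock: it assumes runs rarely race and that state writes are atomic.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def _parse(ts: Optional[str]) -> Optional[datetime]:
    return datetime.fromisoformat(ts) if ts else None


def lease_is_free(task: dict, now: datetime) -> bool:
    expires = _parse(task["lease"]["expiresAt"])
    return task["lease"]["owner"] is None or (expires is not None and expires <= now)


def claim_lease(task: dict, run_id: str, now: datetime, ttl_minutes: int = 10) -> bool:
    """Worker side: take the lease if free; return False if another run holds it."""
    if not lease_is_free(task, now):
        return False
    task["lease"] = {
        "owner": "worker",
        "runId": run_id,
        "expiresAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }
    return True


def is_stale(task: dict, now: datetime) -> bool:
    """Heartbeat side: a running task whose lease has expired needs attention."""
    expires = _parse(task["lease"]["expiresAt"])
    return task["status"] == "running" and expires is not None and expires <= now
```

A second run that starts while the first still holds an unexpired lease gets `False` from `claim_lease` and should simply exit.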

Anti-loop guardrails (important)

One step per run, then exit.
Idempotency hash: don’t resend same output if hash == lastOutputHash.
Lease timeout: if worker dies, task is recoverable.
Retry cooldown: avoid hammering provider when quota/rate-limit hits.
Terminal-state silence: no repeated “done” messages.
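
Two of these guardrails, the idempotency hash and the retry cooldown, sketched in Python against the `progress.lastOutputHash` and `retry` fields from `TASKS.json`. The helper names, the 60-second base delay, and the one-hour cap are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def output_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_send(task: dict, output: str) -> bool:
    """Send only if this exact output was not the last thing delivered."""
    digest = output_hash(output)
    if digest == task["progress"]["lastOutputHash"]:
        return False
    # Recorded before delivery; a stricter version would record after a confirmed send.
    task["progress"]["lastOutputHash"] = digest
    return True


def schedule_retry(task: dict, now: datetime, base_seconds: int = 60, cap_seconds: int = 3600) -> None:
    """Exponential cooldown: 60s, 120s, 240s, ... capped at one hour."""
    task["retry"]["count"] += 1
    delay = min(base_seconds * 2 ** (task["retry"]["count"] - 1), cap_seconds)
    task["retry"]["nextEligibleAt"] = (now + timedelta(seconds=delay)).isoformat()
```

A looping agent that produces the same file delivery twice in a row would be stopped by `should_send` on the second attempt.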

About API burn

If you want near-zero idle cost: don’t keep a frequent recurring worker cron.
Use one-shot jobs only when queue is non-empty, and have worker schedule next one-shot only if work remains.

That gives you resumable async execution without constant polling calls.
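
The "schedule the next one-shot only if work remains" decision can be sketched in Python like this. `next_one_shot_delay` and the 5-minute default are hypothetical, and the actual scheduling call is deliberately left out, since the exact command depends on your setup.

```python
from typing import List, Optional

# Statuses that still need a worker run; "waiting" tasks are blocked until requeued.
ACTIVE = {"queued", "running"}


def next_one_shot_delay(tasks: List[dict], step_delay_minutes: int = 5) -> Optional[int]:
    """Return minutes until the next one-shot run, or None when no work remains."""
    if any(t["status"] in ACTIVE for t in tasks):
        return step_delay_minutes
    return None  # queue drained: schedule nothing, so idle cost stays at zero
```

When this returns `None` the chain simply ends, and the next chain starts only when main queues a new task.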

If you want, I can give you the exact worker prompt text next (short, deterministic, no hallucination wiggle room).

green socket Mar 23, 2026, 10:13 AM

If we just have a worker cron, wouldn't that always fire, potentially before the last cron run has finished?
And if we have the workers daisy-chain each other, wouldn't that mean an API error kills the entire chain, as the last worker didn't finish and therefore couldn't set up the next one?
