#problems after OAuth


rare brook
#

Subject: Lost captain/worker resilience — model fallback shipped garbage to a user

Context: Earlier this year we ran a captain + workers swarm. Captain on Opus decomposed tasks and dispatched them to workers on Opus; if a worker timed out, captain reassigned. Resilience lived in the swarm. Heartbeat was 30min.

What changed: Two shifts landed close together — workers got tiered down from Opus to Sonnet to stay inside the Claude Max weekly cap, and the OAuth migration forced heartbeat to 1h because of tighter rate-limit behavior. The combination made delegation less rewarding than just doing the work in the main agent. Over weeks the swarm went dormant; all traffic ended up on a single agent, and resilience collapsed onto a plain model fallback chain: Opus → Sonnet → local 7B.

Incident: The primary timed out on a long-form content task in a user's native (non-English) language. The chain cascaded past both Claude models and landed on the local 7B, which cheerfully produced fluent-looking gibberish for several paragraphs before the user noticed. I've removed the 7B from the chain as a stopgap, but that just trades garbage output for silent failure — both Claude models still go through the same CLI transport, so a subprocess hang takes the whole chain down at once.

What's actually broken:
  • Single transport in the chain: one CLI hang = total outage
  • No retry/backoff on the primary: first timeout jumps straight to fallback
  • No queue-and-retry on full-chain failure: non-interactive hooks have zero recovery
  • No quality gate: a tiny local model was allowed to answer serious user-facing content
  • No alerting when a fallback lands below tier: I found out from the user, not ops
  • The swarm, which was the resilience layer, is dormant

rare brook
#

Priorities:

P0 (config-only stopgap, ready today): diversify the chain so fallbacks span two transports (CLI + direct HTTP API) instead of only CLI, and drop the local 7B entirely. Buys us outage tolerance, not a real fix.

P1 (preferred long-term): revive the captain/worker pattern. Route heavy / long-running work through workers so captain can reassign on failure. Budget question: can we keep workers on Sonnet under the current cap? Sonnet workers beat any local fallback by orders of magnitude.

P2 (gateway code): retry with exponential backoff on the primary before any fallback fires — most hiccups self-resolve in seconds.

P3 (gateway code): queue-and-retry on full-chain failure — ack "retrying in 2 min", re-enqueue, hard-fail only after N attempts.

P4 (gateway code + schema): per-model capability tier; fallback refuses to cross tiers unless explicitly opted in; any cross-tier fallback posts to ops immediately.

P5: heartbeat back toward 30min if the OAuth rate limit has loosened, or tell us the real current ceiling so we can compromise. Faster heartbeat is load-bearing if we revive the swarm.

What can you land, and by when? If the swarm revival is in-scope this week I'll hold P0 and push for the real fix; otherwise I ship P0 as stopgap and we plan the rest.
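
To make P2 through P4 concrete, here is roughly the behavior I mean, as a Python sketch. Everything in it (Model, the tier numbers, call_with_policy) is hypothetical and not an OpenClaw API; it only shows the order of operations: retry the primary with backoff, refuse fallbacks below a tier floor and report them, and return nothing rather than garbage so the caller can re-enqueue.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Model:
    name: str  # label only, e.g. "opus", "sonnet", "local-7b"
    tier: int  # hypothetical capability tier: higher is stronger


def call_with_policy(
    chain: Sequence[Model],
    call: Callable[[Model], str],
    *,
    min_tier: int,
    primary_retries: int = 2,
    base_delay_s: float = 1.0,
    alert: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry the primary with exponential backoff, then walk the fallbacks.

    Fallbacks below min_tier are skipped and reported instead of answering.
    Returns None when nothing acceptable succeeded, so the caller can queue
    the request for a later retry.
    """
    primary, fallbacks = chain[0], chain[1:]

    # P2: give the primary a few chances before any fallback fires.
    for attempt in range(primary_retries + 1):
        try:
            return call(primary)
        except TimeoutError:
            if attempt < primary_retries:
                sleep(base_delay_s * (2 ** attempt))

    # P4: never cross below the tier floor silently.
    for model in fallbacks:
        if model.tier < min_tier:
            alert(f"refused fallback to {model.name}: tier {model.tier} < floor {min_tier}")
            continue
        try:
            result = call(model)
        except TimeoutError:
            continue
        alert(f"fallback landed on {model.name}")
        return result

    return None  # P3: caller acks "retrying" and re-enqueues
```

With a floor of Sonnet tier, the incident chain would have stopped at Sonnet or returned None; the 7B would only have produced an alert.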

#

@thick spoke please help solve this, thanks

#

did you give up

#

@thick spoke

thick spokeBOT
#

I’m on it, just verifying the last bit around hook recovery so I don’t hand-wave. So far the code/docs mostly support your diagnosis, with a couple of caveats.

#

I checked the current docs/code, and your diagnosis is mostly right.

Verified

On the swarm

  • Current subagents still support an orchestrator pattern (maxSpawnDepth: 2), but I did not find an automatic “captain reassigns timed-out worker” layer in the normal main-agent reply path. If the swarm went dormant, OpenClaw does not automatically preserve that resilience. https://docs.openclaw.ai/tools/subagents

What I’d do now

  1. Keep weak local models out of user-facing fallback chains.
  2. Make fallbacks cross a real transport/provider boundary, not just a model boundary on the same CLI/backend.
  3. For critical non-interactive automation, use actual one-shot cron jobs with cron.retry + cron.failureAlert, or route hooks into that path instead of relying on direct /hooks/agent.
  4. If you want captain/worker resilience back, rebuild it explicitly with subagents/orchestrator sessions.
  5. Add ops monitoring on the existing fallback logs until native alerting exists.

If you want, I can turn this into a concrete “safe config” recommendation next, with:

  • a safer fallback chain,
  • cron retry/failureAlert settings,
  • and a no-garbage local-model policy.
rare brook
#

Yes, do the safe config recommendation. Specifically want:

  • Fallback chain that crosses a real transport boundary (API key → OAuth, or SDK → CLI), not just model swap on same CLI backend. No weak local models in user-facing chains at all.
  • Hook→cron routing for Salesforge + leads follow-up + morning brief, so cron.retry + cron.failureAlert actually kick in. Direct /hooks/agent path is too fragile for these.
  • Heartbeat tuning: confirm 1h default holds for our anthropic:manual profile, and whether we can tighten it for interactive sessions specifically.
  • Ops monitoring: concrete grep patterns or jq queries against the fallback decision logs so I can wire a cron that alerts #alerts when fallback lands below Sonnet tier.
  • Captain/worker resilience: if we rebuild it explicitly with subagents, what's the minimal pattern that gets auto-reassignment on worker timeout.

#

@thick spoke stopped again?

#

ayo

#

@thick spoke

thick spokeBOT
#

Yep. Here’s the safe version I’d run.

1) Fallback chain, make the boundary real

Recommendation

  • Keep all local models out of user-facing chains.
  • Make the fallback cross a provider/backend boundary you can verify, not just a model change on the same Claude CLI path.
  • In practice, the safest config is usually:
    • primary = your best Anthropic path
    • fallback = a different provider/backend entirely
  • I would not count “same provider, different auth profile” as enough unless you’ve verified it really uses a different transport in your install.
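
If you want to lint a chain before deploying it, a minimal sketch of the two rules above could look like this in Python. It is not part of OpenClaw; check_chain, provider_of and the LOCAL_PROVIDERS set are names I made up, and it uses the provider prefix of a "provider/model" string as a stand-in for the transport, which is only a proxy and still needs verifying in your install.

```python
LOCAL_PROVIDERS = {"ollama"}  # assumption: providers you treat as "weak local"


def provider_of(model_ref: str) -> str:
    """Return the provider part of a 'provider/model' reference."""
    provider, sep, model = model_ref.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_ref!r}")
    return provider


def check_chain(primary: str, fallbacks: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the chain passes."""
    problems = []

    # Rule 1: no local models anywhere in a user-facing chain.
    for ref in [primary, *fallbacks]:
        if provider_of(ref) in LOCAL_PROVIDERS:
            problems.append(f"local model in user-facing chain: {ref}")

    # Rule 2: at least one remote fallback must sit behind a different provider.
    primary_provider = provider_of(primary)
    remote = [f for f in fallbacks if provider_of(f) not in LOCAL_PROVIDERS]
    if not any(provider_of(f) != primary_provider for f in remote):
        problems.append("no remote fallback crosses a provider boundary")

    return problems
```

A chain shaped like the one from the incident (same provider twice, then a local model) fails both rules, while a primary plus one fallback on a different remote provider passes.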

Important caveat

#

Suggested shape

{
  agents: {
    defaults: {
      heartbeat: {
        every: "1h", // set explicitly, do not rely on auto-detection
        target: "none",
      },
    },
    list: [
      {
        id: "interactive",
        default: true,
        model: {
          primary: "anthropic/<best-user-facing-primary>",
          fallbacks: [
            "openai/<independent-user-facing-fallback>"
          ],
        },
        heartbeat: {
          every: "30m", // tighter for interactive traffic
          target: "last",
        },
      },
      {
        id: "automation",
        model: {
          primary: "anthropic/<best-automation-primary>",
          fallbacks: [
            "openai/<independent-automation-fallback>"
          ],
        },
        heartbeat: {
          every: "1h",
          target: "none",
          isolatedSession: true,
          lightContext: true,
        },
      },

      // keep locals off the user-facing path entirely
      {
        id: "local-lab",
        model: {
          primary: "ollama/<local-experiment-only>",
          fallbacks: [],
        },
      },
    ],
  },
}

Policy

  • interactive and automation are allowed to talk to users.
  • local-lab is for experiments only, not reply delivery.

2) Hook → cron routing for Salesforge / leads / morning brief

This part is the annoying one: I did not find a native “/hooks/agent enqueues a real cron job” bridge.

#

Current behavior is still basically:

  • /hooks/agent → immediate isolated run via runCronIsolatedAgentTurn(...)
  • not timer-managed cron
  • so no cron.retry loop

Docs/code point that way:

Safe pattern

For Salesforge and leads follow-up:

  • Put a tiny ingress wrapper in front of OpenClaw.
  • External webhook hits your wrapper
  • Wrapper creates a one-shot isolated cron job
  • That gives you cron.retry + cron.failureAlert
#

Flow:
Salesforge webhook -> wrapper -> openclaw cron add --at <now> --session isolated ...

Example one-shot job:

openclaw cron add \
  --name "salesforge-${EVENT_ID}" \
  --at "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --agent automation \
  --session isolated \
  --message "Process Salesforge event ${EVENT_ID}: ${PAYLOAD_SUMMARY}" \
  --no-deliver

Then set global retry/alerts:

#
{
  cron: {
    retry: {
      maxAttempts: 3,
      backoffMs: [30000, 60000, 300000],
      retryOn: ["rate_limit", "overloaded", "network", "timeout", "server_error"],
    },
    failureAlert: {
      enabled: true,
      after: 1,
      cooldownMs: 3600000,
      mode: "announce",
    },
    failureDestination: {
      mode: "announce",
#
      channel: "discord",
      to: "channel:<alerts-channel-id>",
    },
  },
}

For morning brief:

  • make it a normal recurring isolated cron
  • note: recurring jobs do not use cron.retry
  • they do get their own consecutive-failure backoff (30s -> 1m -> 5m -> 15m -> 60m) plus failureAlert
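
To compare the two schedules side by side, here is a small Python sketch. It assumes, and this is my assumption rather than something I verified in the code, that a schedule is indexed by consecutive-failure count and that the last entry repeats once the list runs out. delay_ms and both constants are illustrative names.

```python
def delay_ms(schedule_ms: list[int], failures: int) -> int:
    """Delay before the next attempt after `failures` consecutive failures.

    Assumption: entry 0 applies after the first failure, and the last
    entry is reused for every failure beyond the end of the schedule.
    """
    if failures < 1 or not schedule_ms:
        return 0
    return schedule_ms[min(failures, len(schedule_ms)) - 1]


# cron.retry.backoffMs from the one-shot config above
ONE_SHOT_RETRY_MS = [30_000, 60_000, 300_000]

# recurring-job backoff: 30s -> 1m -> 5m -> 15m -> 60m
RECURRING_MS = [30_000, 60_000, 300_000, 900_000, 3_600_000]
```

Under that assumption, a morning brief that keeps failing settles at one attempt per hour, which is why failureAlert matters more than the backoff for that job.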

Example:

openclaw cron add \
#

3) Heartbeat tuning

What I can confirm

For your anthropic:manual profile:

  • if that is your Anthropic token/bearer-style profile, it should be in that bucket
  • but I would still set every: "1h" explicitly so there’s no ambiguity

Can you tighten heartbeat for interactive sessions specifically?

  • I did not find a built-in “interactive session gets shorter heartbeat, everything else stays at 1h” switch.
  • What is supported is per-agent heartbeat config.

So the clean split is:

  • interactive agent → heartbeat.every: "30m" (or 15m if you really want it)
  • automation / ops agent → heartbeat.every: "1h"
#

That is the supported way to tighten only the lanes you care about.
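
As a mental model for the per-agent split, a minimal Python sketch, assuming an agent's heartbeat block simply overrides the defaults key by key (the real merge rules may differ, and effective_heartbeat is a name I made up):

```python
def effective_heartbeat(defaults: dict, agent: dict) -> dict:
    """Shallow-merge an agent's heartbeat block over the defaults block."""
    return {**defaults.get("heartbeat", {}), **agent.get("heartbeat", {})}


# Mirrors the shape suggested earlier in this thread.
DEFAULTS = {"heartbeat": {"every": "1h", "target": "none"}}
AGENTS = [
    {"id": "interactive", "heartbeat": {"every": "30m", "target": "last"}},
    {"id": "automation"},  # no override, so it inherits the defaults
]

# Effective heartbeat settings keyed by agent id.
resolved = {a["id"]: effective_heartbeat(DEFAULTS, a) for a in AGENTS}
```

Here only the interactive agent ends up on the tighter interval; anything without its own heartbeat block stays at the explicit 1h default.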

4) Ops monitoring, concrete log filters

Use openclaw logs --json so you don’t have to guess file paths.
Docs: https://docs.openclaw.ai/cli/logs

A. Show every successful fallback landing off primary

openclaw logs --json --limit 2000 | jq -rc '
  select((.event // .meta.event) == "model_fallback_decision")
  | select((.decision // .meta.decision) == "candidate_succeeded")
  | select((.isPrimary // .meta.isPrimary) != true)
  | {
      ts: (.time // .timestamp // .ts),
      requested: ((.requestedProvider // .meta.requestedProvider) + "/" + (.requestedModel // .meta.requestedModel)),
      landed: ((.candidateProvider // .meta.candidateProvider) + "/" + (.candidateModel // .meta.candidateModel)),
      reason: (.reason // .meta.reason)
    }'

B. Alert only when fallback lands below your floor

I’d make this allowlist-based, not “tiny-model regex”-based.

Example:

export SAFE_FLOOR_REGEX='^(anthropic/.*(sonnet|opus)|openai/.*gpt-5)'

openclaw logs --json --limit 2000 | jq -rc '
  select((.event // .meta.event) == "model_fallback_decision")
  | select((.decision // .meta.decision) == "candidate_succeeded")
  | select((.isPrimary // .meta.isPrimary) != true)
  | . as $row
  | ((.candidateProvider // .meta.candidateProvider) + "/" + (.candidateModel // .meta.candidateModel)) as $landed
  | select(($landed | test(env.SAFE_FLOOR_REGEX; "i")) | not)
  | {
      ts: (.time // .timestamp // .ts),
      requested: ((.requestedProvider // .meta.requestedProvider) + "/" + (.requestedModel // .meta.requestedModel)),
      landed: $landed,
      reason: (.reason // .meta.reason)
    }'
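
If you would rather run the floor check from a script than from jq, the same filter could be sketched in Python as below. The field names are the ones used in the jq filters above; the allowlist regex, field() and below_floor() are illustrative and should be adapted to the models you actually allow.

```python
import json
import re
from typing import Iterable, Iterator

# Illustrative allowlist: anything that does not match counts as below the floor.
SAFE_FLOOR = re.compile(r"^(anthropic/.*(sonnet|opus)|openai/.*gpt-5)", re.IGNORECASE)


def field(row: dict, key: str):
    """Read a key from the top level, falling back to row['meta']."""
    value = row.get(key)
    if value is None:
        value = (row.get("meta") or {}).get(key)
    return value


def below_floor(lines: Iterable[str]) -> Iterator[dict]:
    """Yield successful non-primary fallback landings outside the allowlist."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(row, dict):
            continue
        if field(row, "event") != "model_fallback_decision":
            continue
        if field(row, "decision") != "candidate_succeeded":
            continue
        if field(row, "isPrimary") is True:
            continue
        landed = f"{field(row, 'candidateProvider')}/{field(row, 'candidateModel')}"
        if SAFE_FLOOR.search(landed):
            continue
        yield {"landed": landed, "reason": field(row, "reason")}
```

To use it, pipe openclaw logs --json into a script that iterates below_floor(sys.stdin) and posts each hit to #alerts.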

C. Show failover decisions before the final landing

openclaw logs --json --limit 2000 | jq -rc '
  select((.event // .meta.event) == "embedded_run_failover_decision")
  | select((.decision // .meta.decision) == "fallback_model")
  | {
      ts: (.time // .timestamp // .ts),
      stage: (.stage // .meta.stage),
      from: ((.sourceProvider // .meta.sourceProvider) + "/" + (.sourceModel // .meta.sourceModel)),
      to: ((.provider // .meta.provider) + "/" + (.model // .meta.model)),
      reason: (.failoverReason // .meta.failoverReason)
    }'

If you want the simplest grep fallback:

openclaw logs --plain --limit 2000 | grep -E 'model fallback decision|embedded run failover decision'
#

5) Captain/worker resilience, minimal explicit rebuild

There is no built-in auto-reassign anymore. The smallest explicit version is:

Config

{
  agents: {
    defaults: {
      subagents: {
        maxSpawnDepth: 2,
        maxChildrenPerAgent: 4,
        runTimeoutSeconds: 600
      }
    }
  }
}

Docs: https://docs.openclaw.ai/tools/subagents

Pattern

  • Main session spawns one orchestrator.
  • Orchestrator spawns worker subagents.
  • Each worker gets an explicit runTimeoutSeconds.
  • When a worker completion announce comes back with status=timeout or status=error, the orchestrator respawns that task once on:
    • a different worker agent, or
    • a different transport/provider path
  • Orchestrator only synthesizes after:
    • first good result, or
    • quorum, or
    • retry budget exhausted
#

Minimal policy

  • 1 retry on timeout
  • 0 retries on obvious bad-content / wrong-language / nonsense outputs
  • never reassign onto a weak local model
  • no polling loops, just react to child completion events

So the actual resilience rule lives in the orchestrator prompt/logic, not in core fallback config.
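
As a rough illustration of that rule, here is a Python sketch of the reassignment logic. It is not OpenClaw code: Completion, run_with_reassign, the "ok" status value and the spawn/looks_sane callables are all hypothetical, and the blocking spawn call stands in for reacting to a child completion event.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Completion:
    status: str  # "ok", "timeout" or "error" ("ok" is an assumed value)
    output: Optional[str] = None


def run_with_reassign(
    task: str,
    workers: Sequence[str],
    spawn: Callable[[str, str], Completion],
    looks_sane: Callable[[str], bool],
    max_retries: int = 1,
) -> Optional[str]:
    """Run a task on successive workers, retrying only on timeout or error."""
    # Never reassign onto a local model (here: any worker id starting with "local").
    eligible = [w for w in workers if not w.startswith("local")]

    for worker in eligible[: max_retries + 1]:
        done = spawn(worker, task)
        if done.status == "ok":
            # Bad content gets zero retries: return None so it is escalated.
            if done.output is not None and looks_sane(done.output):
                return done.output
            return None
        # timeout or error: move on to the next worker within the retry budget

    return None  # retry budget exhausted
```

The looks_sane hook is where a wrong-language or nonsense check would go; the sketch deliberately leaves that check to the caller.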


If you want, next I can turn this into:

  1. one exact JSON5 config block, and
  2. one tiny wrapper script for Salesforge webhook -> one-shot cron enqueue -> #alerts on failure.

Also, if you want release-specific drift checked, paste openclaw --version.

#

Nope, still here, just slow because I was verifying the hook/cron edge instead of guessing 😅

Short version, I did answer:

  • No weak locals in user-facing fallback chains
  • Use a fallback that crosses a real provider/backend boundary
  • Do not rely on /hooks/agent for Salesforge/leads, put a tiny wrapper in front that creates one-shot isolated cron jobs so cron.retry and cron.failureAlert actually apply
  • Set heartbeat explicitly, don’t rely on implicit detection
  • Rebuild captain/worker explicitly with subagents, because auto-reassignment is not built in anymore

If you want, I can do the next useful step right now:

  1. one exact JSON5 config block, and
  2. one tiny hook -> cron enqueue script for Salesforge/leads.
#

I’m here 😄

If you want, I’ll paste the exact JSON5 block + the tiny hook→cron wrapper script next, so you can drop it in instead of translating my summary by hand.

#

Yep, here’s the paste. Replace only the model strings and the Discord channel id placeholders.

1) JSON5 config template

{
  agents: {
    defaults: {
      heartbeat: {
        every: "1h", // make it explicit
        target: "none",
      },
      subagents: {
        maxSpawnDepth: 2,
        maxChildrenPerAgent: 4,
        maxConcurrent: 8,
        runTimeoutSeconds: 600,
      },
    },

    list: [
      {
        id: "interactive",
        default: true,
        model: {
          primary: "anthropic/<interactive-primary>",
          fallbacks: [
            "openai/<interactive-fallback>"
          ],
        },
        heartbeat: {
          every: "30m",
          target: "last",
        },
      },

      {
        id: "automation",
        model: {
          primary: "anthropic/<automation-primary>",
          fallbacks: [
            "openai/<automation-fallback>"
          ],
        },
        heartbeat: {
          every: "1h",
          target: "none",
          isolatedSession: true,
          lightContext: true,
        },
      },

      // locals stay off user-facing delivery paths
      {
        id: "local-lab",
        model: {
          primary: "ollama/<local-only>",
          fallbacks: [],
        },
        heartbeat: {
          every: "0m",
        },
      },
    ],
  },

  cron: {
    retry: {
      maxAttempts: 3,
      backoffMs: [30000, 60000, 300000],
      retryOn: ["rate_limit", "overloaded", "network", "timeout", "server_error"],
    },

    failureAlert: {
      enabled: true,
      after: 1,
      cooldownMs: 3600000,
      mode: "announce",
    },

    failureDestination: {
      mode: "announce",
      channel: "discord",
      to: "channel:<alerts-channel-id>",
    },
  },
}

Intent

  • interactive and automation only use strong remote models
  • fallback crosses a real provider/backend boundary
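#

2) Hook → cron wrapper script

#!/usr/bin/env bash
# Assumption: the webhook JSON arrives on stdin; adjust if your wrapper receives it differently.
set -euo pipefail

payload="$(cat)"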
event_id="$(jq -r '.id // .eventId // .event_id // empty' <<<"$payload")"
event_type="$(jq -r '.type // .eventType // .kind // "external-event"' <<<"$payload")"
summary="$(jq -r '.summary // .message // .title // "No summary provided"' <<<"$payload")"

if [[ -z "$event_id" ]]; then
  echo "missing event id" >&2
  exit 1
fi

# tiny dedupe, optional but useful
stamp_dir="${XDG_STATE_HOME:-$HOME/.local/state}/openclaw-hook-dedupe"
mkdir -p "$stamp_dir"
# sanitize the id so characters like "/" cannot escape the stamp directory
stamp_file="$stamp_dir/${event_id//[^A-Za-z0-9._-]/_}"
if [[ -f "$stamp_file" ]]; then
  echo "already queued: $event_id"
  exit 0
fi

job_name="hook-${event_type}-${event_id}"
now_utc="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

prompt=$(cat <<EOF
Process this ${event_type} event.

Event ID: ${event_id}
Summary: ${summary}

Return a plain-text outcome summary suitable for cron delivery/logging.
If follow-up is needed, say exactly what needs doing.
EOF
)

openclaw cron add \
  --name "$job_name" \
  --at "$now_utc" \
  --agent automation \
  --session isolated \
  --message "$prompt" \
  --timeout-seconds 900 \
  --no-deliver \
  --json

touch "$stamp_file"
echo "queued: $job_name"
#

Notes

  • This gets you onto the real cron path, so cron.retry and global cron.failureAlert apply.
  • I’d use this for Salesforge and lead follow-up events.
  • I would not use /hooks/agent for these if you care about recovery.

3) Morning brief, make it a normal recurring cron

openclaw cron add \
  --name "Morning brief" \
  --cron "0 7 * * *" \
  --tz "Europe/Vienna" \
  --agent automation \
  --session isolated \
  --message "Prepare the morning brief. Return a plain-text summary." \
  --announce \
  --channel discord \
  --to "channel:<brief-destination-channel-id>"

Caveat:

  • recurring jobs do not use cron.retry
  • they do get recurring-job backoff + failureAlert

Docs: https://docs.openclaw.ai/automation/cron-jobs


4) Minimal captain/worker pattern

#

If you rebuild the swarm, the smallest useful rule is:

  • main spawns one orchestrator
  • orchestrator spawns workers with runTimeoutSeconds
  • if worker finishes with timeout or error, orchestrator respawns that task once on a different worker/model path
  • never reassign onto local-lab

That resilience lives in the orchestrator behavior, not in plain model fallback.

If you want, next I can paste the minimal orchestrator prompt too.

rare brook
#

can you give it as pdf?

#

@thick spoke
