Subject: Lost captain/worker resilience — model fallback shipped garbage to a user
Context: Earlier this year we ran a captain + workers swarm. Captain on Opus decomposed tasks and dispatched them to workers on Opus; if a worker timed out, captain reassigned. Resilience lived in the swarm. Heartbeat was 30min.
What changed: Two shifts landed close together — workers got tiered down from Opus to Sonnet to stay inside the Claude Max weekly cap, and the OAuth migration forced heartbeat to 1h because of tighter rate-limit behavior. The combination made delegation less rewarding than just doing the work in the main agent. Over weeks the swarm went dormant; all traffic ended up on a single agent, and resilience collapsed onto a plain model fallback chain: Opus → Sonnet → local 7B.
Incident: The primary timed out on a long-form content task in a user's native (non-English) language. The chain cascaded past both Claude models and landed on the local 7B, which cheerfully produced fluent-looking gibberish for several paragraphs before the user noticed. I've removed the 7B from the chain as a stopgap, but that just trades garbage output for silent failure — both Claude models still go through the same CLI transport, so a subprocess hang takes the whole chain down at once.
What's actually broken:
Single transport in the chain — one CLI hang = total outage
No retry/backoff on the primary — first timeout jumps straight to fallback
No queue-and-retry on full-chain failure — non-interactive hooks have zero recovery
No quality gate — a tiny local model was allowed to answer serious user-facing content
No alerting when a fallback lands below tier — I found out from the user, not ops
The swarm, which was the resilience layer, is dormant