Here's every recurring pattern we keep hitting:
1. Stale sessionKeys after gateway crash/restart
Isolated cron jobs get sessionKeys stuck from pre-crash sessions. Gateway routes them to main session context, causing silent skips (lastStatus=skipped). Health checks didn't catch it because skipped ≠error. We added P4/P5 detection but the root cause is: why do isolated jobs get sessionKeys assigned at all?
2. Cron job timeouts
Jobs regularly hit their timeoutSeconds limit. Agent startup + tool calls + model latency often exceeds the configured timeout, especially for Sonnet jobs doing multi-step work. We just added P6 auto-bump but the question is: why is agent startup in isolated sessions so slow?
3. lastDeliveryError "Write failed" on completed jobs
Jobs show lastStatus: "error" with lastDeliveryError: "⚠️ ✍️ Write: to /path (N chars) failed" — but Write tool works fine in isolated sessions now. These errors appear to be stale from the March 10 crash but they don't auto-clear. consecutiveErrors shows 0 but lastStatus still shows error.
4. lastStatus=error with empty lastError string
Multiple jobs show lastStatus: "error" and lastError: "" — no actual error message. Makes diagnosis impossible. These seem to come from the ENOENT session startup failures during crashes. Example: Email Digest, LinkedIn Digest, Weekly Ops Review, Weekly Analytics.
5. P2 startup failures (lastDurationMs < 200ms)
Jobs error in under 200ms — they never actually ran. Appears to be session startup failures. Usually clear on next run but sometimes persist across multiple cycles.