#Common errors I get


drifting wasp

Here's every recurring pattern we keep hitting:

1. Stale sessionKeys after gateway crash/restart
Isolated cron jobs get sessionKeys stuck from pre-crash sessions. Gateway routes them to main session context, causing silent skips (lastStatus=skipped). Health checks didn't catch it because skipped ≠ error. We added P4/P5 detection but the root cause is: why do isolated jobs get sessionKeys assigned at all?

2. Cron job timeouts
Jobs regularly hit their timeoutSeconds limit. Agent startup + tool calls + model latency often exceeds the configured timeout, especially for Sonnet jobs doing multi-step work. We just added P6 auto-bump but the question is: why is agent startup in isolated sessions so slow?

3. lastDeliveryError "Write failed" on completed jobs
Jobs show lastStatus: "error" with lastDeliveryError: "⚠️ ✍️ Write: to /path (N chars) failed" — but Write tool works fine in isolated sessions now. These errors appear to be stale from the March 10 crash but they don't auto-clear. consecutiveErrors shows 0 but lastStatus still shows error.

4. lastStatus=error with empty lastError string
Multiple jobs show lastStatus: "error" and lastError: "" — no actual error message. Makes diagnosis impossible. These seem to come from the ENOENT session startup failures during crashes. Example: Email Digest, LinkedIn Digest, Weekly Ops Review, Weekly Analytics.

5. P2 startup failures (lastDurationMs < 200ms)
Jobs error in under 200ms — they never actually ran. Appears to be session startup failures. Usually clear on next run but sometimes persist across multiple cycles.

rough hollyBOT

A bunch of what you’re seeing is “working as designed, but sharp edges after a crash”, and a couple bits look like legit state-corruption / UX bugs.

I’ll go point-by-point and tie it to the actual cron semantics.

Docs worth having open while you read this:


1) “Stale sessionKeys” on isolated jobs → why they exist + why they drift to main

Two different things get called “sessionKey” in practice:

A) Internal cron run sessions (always exist)

Isolated runs intentionally create session entries + transcripts so you can debug runs and so retention works:

  • Isolated runs create session entries like ...:cron:<jobId>:run:<uuid> (and transcripts)
  • These get reaped via cron.sessionRetention

This is expected; isolated jobs can’t be totally “stateless” today because the agent runtime + run logs/transcripts are session-backed. (Docs call this out in the Maintenance section.)

B) job.sessionKey (optional override) — this is the one that bites you

Cron jobs can optionally store a sessionKey on the job itself (CLI: openclaw cron add/edit --session-key ...).

That field is used as a delivery targeting hint, especially when you use delivery.channel: "last" / implicit routing. Delivery target resolution does:

  1. look up store[job.sessionKey] (thread-specific session, if present)
  2. else fall back to store[mainSessionKey]

So after a crash/restart/reset, if that referenced session entry no longer exists (or got rolled), delivery silently falls back to main. That matches your “routes them to main session context” report.
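
A minimal sketch of that two-step lookup, to make the silent fallback visible. This is not the actual gateway code; the function name, the plain-dict store, and the returned "source" label are all made up for illustration:

```python
from typing import Optional


def resolve_delivery_target(
    store: dict,
    job_session_key: Optional[str],
    main_session_key: str,
) -> tuple[Optional[dict], str]:
    """Return (session_entry, source), where source is "job" or "main".

    Hypothetical model of the lookup described above, not the real implementation.
    """
    # Step 1: prefer the thread-specific session the job points at, if it still exists.
    if job_session_key is not None and job_session_key in store:
        return store[job_session_key], "job"
    # Step 2: otherwise fall back to the main session. Nothing is raised or logged
    # here, which is why the drift to main goes unnoticed.
    return store.get(main_session_key), "main"


# After a crash the thread session entry is gone, so delivery lands on main.
store_after_crash = {"main": {"channel": "discord"}}
entry, source = resolve_delivery_target(store_after_crash, "old-thread-session", "main")
```

Returning the source alongside the entry is the design point: a health check could alert whenever a job that has a sessionKey resolves to "main".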

Fix / best practice: for isolated jobs, avoid relying on “last route” + job.sessionKey unless you really mean “follow that thread’s routing forever”.

  • Prefer explicit delivery targets:
    • delivery.channel: "discord"
    • delivery.to: "channel:<threadOrChannelId>" (Discord threads are channels, so this is stable)
  • If you don’t need it, clear the job-level override:
    • openclaw cron edit <jobId> --clear-session-key

How to confirm quickly

openclaw cron list --all --json

Look for jobs where sessionKey is set unexpectedly.
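
If you have a lot of jobs, a small filter over that JSON saves eyeballing. This is a sketch under assumptions: I'm guessing the output is either a bare list of jobs or an object with a "jobs" list, and that each job has id, sessionTarget, and sessionKey fields. Adjust the field names to whatever your output actually contains:

```python
import json


def jobs_with_session_key(raw_json: str) -> list[dict]:
    """Return isolated jobs that carry a job-level sessionKey."""
    data = json.loads(raw_json)
    # Assumption: the CLI prints either a bare list or an object with a "jobs" list.
    jobs = data.get("jobs", []) if isinstance(data, dict) else data
    return [
        {"id": job.get("id"), "sessionKey": job["sessionKey"]}
        for job in jobs
        # Assumption: isolated jobs are marked with sessionTarget == "isolated".
        if job.get("sessionTarget") == "isolated" and job.get("sessionKey")
    ]


# Made-up sample data; the key value is a placeholder, not a real key format.
SAMPLE = json.dumps({
    "jobs": [
        {"id": "a", "sessionTarget": "isolated"},
        {"id": "b", "sessionTarget": "isolated", "sessionKey": "example-session-key"},
        {"id": "c", "sessionTarget": "main", "sessionKey": "example-main-key"},
    ]
})
suspects = jobs_with_session_key(SAMPLE)
```

To use it on real data, save the command output to a file and pass the file contents to jobs_with_session_key.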


2) Cron job timeouts + “why is isolated agent startup slow?”

Isolated cron runs are fresh agent turns by design (“no prior conversation carry-over”), so you pay startup costs each run:

  • resolve agent + workspace
  • bootstrap context injection (AGENTS/SOUL/…)
  • model catalog / allowlist checks
  • then the actual LLM + tool calls

Two concrete knobs that usually help:

A) Use lightweight bootstrap for chores

If the job doesn’t need the full workspace bootstrap context:

  • set payload.lightContext: true (CLI: --light-context)

This is explicitly documented for reducing overhead on scheduled chores.

B) Increase cron concurrency if runs queue up

Default is serialized:

  • cron.maxConcurrentRuns: 1

If several jobs are due at the same time, “startup feels slow” because they’re just waiting their turn.

Config (docs show this exact key):

{
  cron: {
    maxConcurrentRuns: 2
  }
}
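
To get a feel for how much of the "slowness" is queueing, here is a toy model. It assumes every job is due at the same moment and takes a fixed time; the durations are invented, and this is not how the scheduler is actually implemented:

```python
import heapq


def queue_waits(durations: list[float], max_concurrent: int) -> list[float]:
    """Seconds each job waits before it starts, with all jobs due at t=0."""
    # Min-heap of the times at which each run slot becomes free.
    free_at = [0.0] * max_concurrent
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest slot to free up
        waits.append(start)
        heapq.heappush(free_at, start + duration)
    return waits


# Four 60-second jobs due together.
serialized = queue_waits([60, 60, 60, 60], max_concurrent=1)
two_slots = queue_waits([60, 60, 60, 60], max_concurrent=2)
print(serialized)  # [0.0, 60.0, 120.0, 180.0]
print(two_slots)   # [0.0, 0.0, 60.0, 60.0]
```

With one slot the last job sits idle for three minutes before it even begins; a second slot cuts the worst wait to one minute.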

Also: if you’re doing multi-step “agentic” work on Sonnet, you often just need a bigger timeoutSeconds on that job; the timeout covers everything in the agent turn (startup, model calls, and tool calls).
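
A rough way to pick that number is to budget the turn and add headroom. This is back-of-envelope arithmetic, not an official formula; the function, its parameters, and the sample timings are all hypothetical, so plug in durations you measure from your own run logs:

```python
import math


def suggested_timeout_seconds(
    startup_s: float,
    step_s: float,
    steps: int,
    headroom: float = 1.5,
) -> int:
    """Estimate a timeoutSeconds value for a multi-step agent turn.

    startup_s: observed agent startup time
    step_s:    observed time per step (model call plus its tool call)
    steps:     expected number of steps in the turn
    headroom:  multiplier to absorb latency spikes
    """
    budget = startup_s + step_s * steps
    return math.ceil(budget * headroom)


# Example: 20 s startup, six steps of about 20 s each, 50% headroom.
timeout = suggested_timeout_seconds(startup_s=20, step_s=20, steps=6)
print(timeout)  # 210
```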


3) lastDeliveryError “Write failed …” on jobs that are now fine

Cron job state fields are basically “last run snapshot”. They do not auto-clear just because the gateway is healthy again.

So if a weekly job failed on March 10 and hasn’t run since, it’ll still show that old write failure until the next run updates the state.

Fast way to clear/refresh state: manually run it once now:

openclaw cron run <jobId>

If it has run successfully since and you still see the old error, that’s more likely store/writeback weirdness — then the run log is the source of truth:

openclaw cron runs --id <jobId> --limit 20

4) lastStatus=error with lastError: "" (empty)

In current code, an error status should normally carry a non-empty error string (it’s set from the caught exception / timeout message).

So empty-string errors usually mean one of:

  • old corrupted state persisted during the crash window
  • a bug where something wrote an empty error (worth an issue if reproducible)

Workaround: the run log JSONL almost always still contains the real failure text:

openclaw cron runs --id <jobId> --limit 50

If you paste one example job’s cron runs output + the job entry from cron list --json, it’ll be obvious whether this is just stale state or a real “lost error message” bug.
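
If you'd rather scan the run log yourself, something like this pulls the failure text out of JSONL lines. Big caveat: I'm assuming each line is a JSON object with status and error fields. Those names are placeholders, so check one real line first and rename accordingly:

```python
import json
from typing import Iterable


def failure_texts(lines: Iterable[str]) -> list[str]:
    """Collect non-empty error strings from errored runs in a JSONL run log."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # A crash can leave a truncated final line; skip it instead of failing.
            continue
        # Assumption: errored runs look like {"status": "error", "error": "..."}.
        if isinstance(entry, dict) and entry.get("status") == "error" and entry.get("error"):
            found.append(entry["error"])
    return found


# Made-up log lines, including a truncated one like a crash might leave behind.
sample_log = [
    '{"status": "ok"}',
    '{"status": "error", "error": "ENOENT: session file missing"}',
    '{"status": "error", "error": ""}',
    '{"status": "err',
]
errors = failure_texts(sample_log)
```

If this returns real messages while the job entry shows lastError: "", you're looking at stale or lost state rather than a missing error.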


5) P2 startup failures (lastDurationMs < 200ms)

Yeah — that signature is “never actually executed”, typically:

  • validation/preflight rejection
  • session store / workspace / ENOENT type failures
  • gateway mid-crash behavior
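
Pulling the patterns from this thread together, here is a sketch of a triage pass over one job's state. It only uses the field names mentioned above (lastStatus, lastError, lastDurationMs, consecutiveErrors); the flat dict shape and the flag labels are my own invention, and the 200 ms threshold is the one from your P2 rule:

```python
def triage(state: dict) -> list[str]:
    """Label a job's last-run snapshot with the failure patterns discussed here."""
    flags = []
    status = state.get("lastStatus")
    if status == "skipped":
        # Skipped is not an error, so plain error-based health checks miss it.
        flags.append("skipped")
    if status == "error":
        duration = state.get("lastDurationMs")
        if duration is not None and duration < 200:
            # Failed too fast to have run: likely a session startup failure.
            flags.append("startup-failure")
        if not state.get("lastError"):
            # Error status without a message: check the run log instead.
            flags.append("empty-error")
        if state.get("consecutiveErrors", 0) == 0:
            # Error status but no error streak: probably an old snapshot.
            flags.append("possibly-stale")
    return flags


# Made-up example resembling the jobs described in the question.
flags = triage({
    "lastStatus": "error",
    "lastError": "",
    "lastDurationMs": 120,
    "consecutiveErrors": 0,
})
```

Anything flagged possibly-stale is a candidate for a manual openclaw cron run <jobId> to refresh the snapshot.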