#Kanban DB corruption

1 messages · Page 1 of 1 (latest)

earnest copper
#

Kanban DB Corruption (recurring)

Hermes version: 0.15.1 (2026.5.29)
OS: Linux 5.15.0-177-generic x86_64
Python: 3.11.15
Memory provider: mnemosyne
Platform: telegram (gateway via systemd)
Kanban dispatcher: cron-based (dispatch_in_gateway: false)

Two corruption incidents

May 27 — Physical page corruption:

  • Dashboard kanban plugin polled DB every ~2s and detected corruption
  • Integrity check returned 10 pages unreadable (trees 2, 6, 7, 8, 14, 15, 18, 19, 26 — error 522)
  • Pages 29 and 30 marked "never used"
  • Created 311 .bak backup files in a crash loop before I fixed it
  • Preceded by disk I/O error events

May 30 — Index corruption:

  • wrong # of entries in index idx_events_run
  • Dashboard plugin detected once (not a crash loop this time)
  • DB otherwise readable — 12 tasks, 7,557 task_events, 493 task_runs all salvageable by rowid scan
  • Rebuilt DB by recreating schema + rowid-insert data → integrity: ok

Possible cause

Concurrent writes from dashboard plugin (every 2s poll) + gateway + hermes kanban CLI. SQLite gets corrupted when multiple processes write to WAL without coordination. The dashboard's kanban event stream polls aggressively.

Logs

  • Report: https://paste.rs/s2RpS
  • agent.log: https://paste.rs/jgFTB
  • gateway.log: https://paste.rs/QOXbj
  • Full extract with both corruption signatures and server stats also on Unraid: /mnt/user/data/Backups/hermes/hermes-kanban-debug-logs.tar.gz (but the actual broken DB was accidentally cleaned up — only the rebuilt healthy DB remains)

Fix applied locally

  • Rebuilt DB from rowid scan + fresh schema
  • Added --exclude='*.corrupt*' to backup script
  • Removed stale 45 × .bak files and 497 MB of workspace artifacts
  • Sync script changed to send only latest backup instead of entire folder
marsh rune
#

To confirm the Unraid/FUSE root cause (the highest-impact fix) vs. the SIGKILL-mid-write path, can you tell me where ~/.hermes (specifically kanban.db) physically lives?

#

@earnest copper

marsh rune
# earnest copper Ubuntu VPS

dug in. it's not the wal-concurrency bug everyone blames; that fix (busy_timeout + synchronous=FULL) is already in your build (0.15.1). on local ext4 yourwriters are coordinated and won't tear the db on their own.

looks like two different causes:

• may 27 (error 522 + 311 .bak files): storage-level fault, not a race — matches your "preceded by disk i/o errors" note, and you freed ~500MB so the disk may havebeen full during a checkpoint. the 311 backups are a separate known bug (PR #31795); your --exclude='.corrupt' is the right stopgap.

• may 30 (idx_events_run corruption): open gap (#33169/#34385) — a worker SIGKILLed mid-write desyncs the index. your logs show the trigger: TimeoutStopSec=120s ...may SIGKILL mid-drain. fix is one command: hermes gateway service install --replace.

to confirm which dominated, run this (safe on a live system — read-only, no writes/checkpoints, uses python3 so no sqlite3 cli needed):

bash kanban-corruption-diag.sh 2>&1 | tee kanban-diag.txt

i'm looking for: disk near 100% / kernel i/o errors (→ may 27 = disk), and whether stop timeout still shows 120s (→ may 30 = sigkill). send kanban-diag.txt back andi'll give you exact fix steps.

quick q: is ~/.hermes on local disk or any mounted/network volume? feel free to push back if anything doesn't match.

earnest copper
#

Here's the full diagnostic analysis:

  1. Disk ✅ — 67% used, 21G free. Not the cause.

  2. Filesystem ✅ — ext4, all good.

  3. I/O errors ✅ — no kernel disk faults in the last 7 days. May-27-style storage issue hasn't recurred.

  4. 🔴 Corrupt backup blast radius — 735 .corrupt.* backup files accumulated! Each corruption event spins off one, and they're never cleaned up. That's a lot of wasted space and a sign the crash-loop has been running for a while.

  5. DB integrity ✅ — all 4 databases (kanban.db, twenty-ai/kanban.db, phos-family/kanban.db, state.db) pass quick_check. Currently consistent. The corruption is transient — gets caught and restored from backup, but keeps happening.

  6. Journal mode — no mismatch concern.

  7. 🔴 TimeoutStopSec = 2min (120s) — below 210s threshold. This is the May-30 pattern: when the gateway restarts, systemd sends SIGKILL after 2 min if the process hasn't exited, potentially catching a SQLite write mid-transaction.

  8. dispatch_in_gateway: false — dispatcher runs as a separate process. This plus dashboard + kanban worker = multiple processes hitting the same DB concurrently, which is the root cause of the recurring corruption.

Bottom line: Two issues:

  1. Timeouts too short — hermes gateway service install --replace fixes the 120s → longer timeout
  2. Concurrent SQLite writes — dashboard, gateway, cron, and CLI all hit the same kanban.db. The fix is to move kanban to a dedicated database or increase the timeout so SIGKILL doesn't hit mid-write.
marsh rune
# earnest copper Here's the full diagnostic analysis: 1. Disk ✅ — 67% used, 21G free. Not the ca...

results are clear — thanks for running it. correcting my own earlier framing too:

not the cause: disk is fine (67%, no i/o errors — the may-27 storage issue isn't recurring), and you're on local ext4 so wal is safe. your writers are coordinated — concurrent dashboard/gateway/cli access alone isn't tearing the db. so ignore the "move kanban to a dedicated database" idea; it'd add complexity for nothing.

the actual root cause: TimeoutStopSec=2min. on every gateway restart, systemd SIGKILLs the process after 120s, which can land mid-write and desync the index (your may-30 idx_events_run signature). one fix:

#

hermes gateway service install --replace

#

that regenerates the unit with a safe stop timeout (≥210s) so restarts drain cleanly instead of getting killed mid-transaction.

#

the 735 .corrupt backups: that's not 735 separate corruptions — it's the probe crash-loop fanning one bad-file event across every restart + 2s dashboard poll (they're content-addressed, so far fewer real events). all 4 of your dbs pass quick_check right now, so you're consistent. safe to clear them:

#

find ~/.hermes -name 'kanban.db.corrupt.*.bak' -delete

#

also: you're 60 commits behind. there are open fixes upstream for exactly this (the integrity-probe retry/backoff that stops the .bak loop, #31795, plus related kanban hardening). worth a hermes update once you've snapshotted — good chance some are already merged into those 60.

#

tl;dr: hermes gateway service install --replace is the real fix, clear the .bak files, then hermes update. shout if it recurs after the timeout fix and we'll dig further.

#

@earnest copper

earnest copper
#

already ran it after running the diag.. will report back of any future corruption incidents