I keep seeing people share stories of Hermes running 30-60+ minutes and churning out features autonomously. I'd love to get there, but my experience has been very different, and I'm trying to understand whether it's my project, my workflow, or my expectations that need adjusting.
My question: Are people actually getting productive long autonomous runs on codebases with tight invariants and semantic correctness requirements? Or is the sweet spot really greenfield/CRUD/frontend-heavy work where "working" = "it renders"?
Context:
I'm building a prop/fintech platform on CF Workers, Neon, Drizzle, and TypeScript. The codebase has idempotency semantics, optimistic concurrency, BigInt levy distribution, RLS/audit triggers, manual SQL migrations kept in sync with the Drizzle schema, and a multi-phase PRD system where each phase depends on the prior ones being truthfully complete (this part is important).
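To give a sense of what I mean by a tight invariant, here is a minimal sketch of BigInt levy distribution. This is not my actual code: the function name `distributeLevy`, the cents unit, and the "leftover cents go to the earliest lots" rule are all assumptions made up for illustration. The point is that the shares must sum to exactly the total, which a loosely written test can easily miss.

```typescript
// Hypothetical helper, for illustration only: split a levy (in cents)
// across lots in proportion to their entitlement units.
function distributeLevy(totalCents: bigint, units: bigint[]): bigint[] {
  if (totalCents < 0n) throw new Error("totalCents must be non-negative");
  if (units.some((u) => u < 0n)) throw new Error("units must be non-negative");

  const totalUnits = units.reduce((a, b) => a + b, 0n);
  if (totalUnits === 0n) throw new Error("at least one lot needs units");

  // BigInt division truncates, so each share is rounded down.
  const shares = units.map((u) => (totalCents * u) / totalUnits);

  // Whatever rounding dropped is handed out one cent at a time,
  // in lot order, skipping lots that hold no units.
  let remainder = totalCents - shares.reduce((a, b) => a + b, 0n);
  for (let i = 0; remainder > 0n; i = (i + 1) % shares.length) {
    if (units[i] > 0n) {
      shares[i] += 1n;
      remainder -= 1n;
    }
  }
  return shares;
}
```

With this sketch, `distributeLevy(100n, [1n, 1n, 1n])` gives `[34n, 33n, 33n]`. A test that only checks "each share is roughly a third" would pass even if a cent went missing; the invariant that matters is that the sum equals the total.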
What happened today:
Hermes spent most of the day looping on Phase 2 backend tasks for a PRD. It correctly diagnosed that tests were failing, but instead of fixing the tests to match the (correct) service code, it kept questioning the service design and adding scope. It then found a stale code comment saying "Neon HTTP driver doesn't support transactions" and used that as a blocker, without checking that two other services in the same repo already use db.transaction() successfully. Each step was locally reasonable, but in aggregate it was a spiral that never closed.
I handed it to Claude Code CLI and the fix was mechanical. It added the missing idempotencyKey params to 3 test calls, added a mock response to 4 test queues, and added 2 columns to a migration. 202/202 tests green, clean typecheck, done in minutes.
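For anyone wondering what "add the missing idempotencyKey param" looks like in practice, here is a simplified sketch. Everything in it is hypothetical (the `PaymentService` class, `createPayment`, and the in-memory map stand in for the real service and database), but it shows why the service was right and the tests were the thing to fix.

```typescript
// Hypothetical in-memory stand-in for a service with idempotency semantics.
type Payment = { id: number; amountCents: bigint };

class PaymentService {
  private byKey = new Map<string, Payment>();
  private nextId = 1;

  // The same idempotencyKey must return the original payment
  // and must never create a second one.
  createPayment(amountCents: bigint, idempotencyKey: string): Payment {
    if (!idempotencyKey) throw new Error("idempotencyKey is required");

    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing;

    const payment: Payment = { id: this.nextId++, amountCents };
    this.byKey.set(idempotencyKey, payment);
    return payment;
  }
}

// The test-side fix: pass the key the service already requires.
const service = new PaymentService();
const first = service.createPayment(5000n, "levy-2024-q1-lot-7");
const retry = service.createPayment(5000n, "levy-2024-q1-lot-7");
```

Here `retry` is the same payment as `first`, which is the behaviour the tests should assert. A test that calls the method without a key fails because the test is out of date, not because the service design is wrong.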
Would love to hear what's actually working for people and any patterns for keeping agents on-task when the gap between "tests pass" and "honestly complete" matters.