still `SQLITE_BUSY` | Dagger | Page 1

halcyon tendon Nov 21, 2025, 7:58 PM

#

Seems like this is still popping up @silent fox

time="2025-11-21T19:51:38Z" level=error msg="failed to emit telemetry" error="map[error:database is locked (5) (SQLITE_BUSY) kind:*sqlite.Error stack:<nil>]"
^ and lots of similar logs

tbh I guess I'm not 100% sure the hanging is caused by sqlite vs. just co-occurring, but seems suspicious

#

not sure where to go from here yet, just an FYI

silent fox Nov 21, 2025, 7:58 PM

#

grr

halcyon tendon Nov 21, 2025, 8:00 PM

#

on a terminal in there

/var/lib/dagger/worker/clientdbs # du -h
613.2M .
not that it should matter, but that's a lot of telemetry

silent fox Nov 21, 2025, 8:25 PM

#

wasn't this all fine for a long period of time? or was it just less frequent / we weren't paying attention to it

halcyon tendon Nov 21, 2025, 8:27 PM

#

silent fox wasn't this all fine for a long period of time? or was it just less frequent / w...

the hangs started sometime in the last month or so. my best guess is that it's probably always been possible but the performance fixes changed execution enough to trigger it

#

we definitely used to have multi second pauses between operations all over the place due to the boltdb syncing, so we are definitely assaulting the telemetry dbs way faster than we used to

#

maybe the best route from here is to figure out why this situation results in hanging at all?

#

honestly it might even be expected to hit the busy timeout now, given there's 613MB of data being written over the course of less than 10m

#

I also noticed in the logs of that engine that the first busy timeout occurred right after the engine received hundreds of requests to /query over the course of a few seconds

halcyon tendon Nov 21, 2025, 8:30 PM

#

halcyon tendon maybe the best route from here is to figure out why this situation results in ha...

@silent fox there's no expectation that we hang indefinitely just because of failing to write telemetry right? any ideas on what would trigger that?

#

I guess we could simulate it by just changing the code to always return an error there

silent fox Nov 21, 2025, 8:32 PM

#

normally otel has retry with exponential backoff, but it shouldn't be indefinite, unless we took a shortcut with its config because "it's all local and not worth thinking about an extra layer of complexity"

#

not seeing anything too egregious, we batch writes to every 100ms, which is what we'd want for responsive updates

#

i feel like we're missing something, but also, our usage of sqlite seems so simple...

#

wonder what `PRAGMA journal_mode = MEMORY` does :thinkspin: https://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite

halcyon tendon Nov 21, 2025, 8:40 PM

#

silent fox i feel like we're missing something, but also, our usage of sqlite seems so simp...

yeah could also be worth biting the bullet and trying to swap out for mattn, which is at least more battle tested. Just on the off chance this actually is modernc (though still doubtful). Building the engine w/ cgo would be painful though, especially cause of cross-compilation

halcyon tendon Nov 21, 2025, 8:42 PM

#

halcyon tendon yeah could also be worth biting the bullet and trying to swap out for `mattn`, w...

https://github.com/cvilsmeier/go-sqlite-bench

silent fox Nov 21, 2025, 8:45 PM

#

https://tenor.com/s19wjjYYbYQ.gif

#

i wonder if we could repro the SQLITE_BUSY with a simulated test

#

or get an agent to flail until it repros 😂

halcyon tendon Nov 21, 2025, 8:55 PM

#

silent fox i wonder if we could repro the SQLITE_BUSY with a simulated test

Yeah I'm trying rn, put return fmt.Errorf("fuk u") for all of the writes to the db, but it actually finishes executing (even though I can't see what's happening besides engine logs)

#

Trying the read-side now

silent fox Nov 21, 2025, 8:59 PM

#

https://claude.ai/share/666a60f4-0cc8-4504-bb8e-ec719919962c

lots of interesting stuff here

Optimizing SQLite for maximum performance

Shared via Claude, an AI assistant from Anthropic

halcyon tendon Nov 21, 2025, 10:59 PM

#

hmmm... I noticed in namespace logs where a hang occurred that a little bit earlier there was a context cancelled in the middle of a DB write operation. So I changed the engine code to cancel contexts for those writes after 1ms. And that managed to get all the engine tests to hang seemingly indefinitely; not just in the trace but also the GHA job itself 🎉

And even stranger, the output according to GHA is 10s of thousands of lines of nothing? https://github.com/dagger/dagger/actions/runs/19585605887/job/56093699337?pr=11466

GitHub

An open-source runtime for composable workflows. Great for AI agents and CI/CD. - [DNM] attempt to repro hang during lint/tidy · dagger/dagger@d6d553b

halcyon tendon Nov 21, 2025, 11:34 PM

#

halcyon tendon hmmm... I noticed in namespace logs where a hang occurred that a little bit earl...

also interesting that when I do that I do indeed end up with a few SQLITE_BUSY errors amongst all the context deadline exceeded ones...

#

I'm getting the feeling that canceling the context of a write operation to the db causes something to catastrophically break...

halcyon tendon Nov 21, 2025, 11:41 PM

#

halcyon tendon I'm getting the feeling that canceling the context of a write operation to the d...

or perhaps causes all concurrent writes to fail with SQLITE_BUSY? :hyperthinkspin:

halcyon tendon Nov 22, 2025, 12:19 AM

#

extremely fun fact: the order of pragmas matters? https://gitlab.com/cznic/sqlite/-/issues/115#note_1156289731 and someone said they had broken behavior when busy_timeout was after journal_mode?

GitLab

SQLITE_BUSY error on concurrent SELECT queries while in WAL mode (#...

Scanning the result of multiple concurrent SELECT queries (eg. each in their own goroutine) results in the following error:

#

@silent fox this is a consistent repro: https://github.com/dagger/dagger/pull/11466/commits/f47216af38aaf87a1dace595bdbcdcdc3b6decc5

load bearing parts being:

1. randomly cancel contexts in middle of writes (which seems to cause other non-canceled writes to block for up to busy_timeout)
2. set extremely long busy timeout

The engine essentially grinds to a halt forever. From a goroutine dump of the engine, it seems like we can't progress due to this defer stdio.Close(): https://github.com/sipsma/dagger/blob/f47216af38aaf87a1dace595bdbcdcdc3b6decc5/core/telemetry.go#L268-L268

It gets blocked because it's waiting up to busy_timeout. It halts all forward progress because it's happening at the end of a call, and the call can't finish until it's done running.

So I bet what's happening in CI is we hit 1. (just randomly) and then every single operation after that has to wait for at least 10s, which ends up looking to us like it's all hanging (especially since we can't actually get any further telemetry).

#

1. is the really fundamental problem and smells strongly like a modernc bug. There are some issues in their repo that sound vaguely similar with people saying they didn't get that behavior on mattn... so might see how much the overhead of compiling the engine w/ cgo will ruin our lives in practice? mostly concerned w/ how slow it is

silent fox Nov 22, 2025, 12:59 AM

#

whoa nice find

halcyon tendon Nov 22, 2025, 1:37 AM

#

silent fox whoa nice find

Yeah unfortunately when I switch from modernc->mattn it all works perfect even with that repro code in place; there's some missing telemetry here and there since half got canceled 1ms into a write, but most of it made it and it all finished running...

#

also unfortunately, cgo is a PITA, but probably dealable-with

silent fox Nov 22, 2025, 1:39 AM

#

would putting that all in a sidecar binary help at all? or just move the problem

halcyon tendon Nov 22, 2025, 1:40 AM

#

silent fox would putting that all in a sidecar binary help at all? or just move the problem

possibly... though for the longer term of using sqlite for the engine cache I'm not 100% sure how that'd scale

#

also that sidecar binary approach is essentially how https://github.com/cvilsmeier/sqinn-go works

silent fox Nov 22, 2025, 1:40 AM

#

oh yea. easy for otel, not so much for that

silent fox Nov 22, 2025, 6:51 PM

#

@halcyon tendon alternatively, could we just not use a ctx for sqlite, or prevent it from being canceled? these should all be really fast queries, doesn't seem like we'd lose much

halcyon tendon Nov 24, 2025, 5:12 PM

#

silent fox @halcyon tendon alternatively, could we just not use a `ctx` for sqlite, o...

definitely worth consideration, but if switching to cgo ends up not being all that painful in terms of engine build times I'd probably prefer that just so we aren't on shaky ground indefinitely

#

i didn't get numbers yet but it felt like the first build w/ cgo was noticeably slower but rebuilds after that were not as noticeably bad, so might be benefitting from the go build cache? not sure how that interacts w/ c code. I'll get some actual numbers to confirm/deny

#

plus if we go this route we get the opportunity to be super cool and use zig for cross compilation https://github.com/goreleaser/example-zig-cgo 😎

halcyon tendon Nov 24, 2025, 6:47 PM

#

halcyon tendon i didn't get numbers yet but it felt like the first build w/ cgo was noticeably ...

using zig for cc (which is nice even in the non-cross-compile case since it has its own caching), seems that a rebuild of the engine (cache from previous build but code change) has an overhead of 4s? 44s vs. 48s. That feels pretty tolerable to me

silent fox Nov 24, 2025, 6:52 PM

#

yeah I probably won't notice that 😛 - not bad
