#still `SQLITE_BUSY`

1 messages ยท Page 1 of 1 (latest)

halcyon tendon
#

Seems like this is still popping up @silent fox

time="2025-11-21T19:51:38Z" level=error msg="failed to emit telemetry" error="map[error:database is locked (5) (SQLITE_BUSY) kind:*sqlite.Error stack:<nil>]"
^ and lots of similar logs

tbh I guess I'm not 100% sure the hanging is caused by sqlite vs. just co-occurring, but seems suspicious

#

not sure where to go from here yet, just an FYI

silent fox
#

grr

halcyon tendon
#

on a terminal in there

/var/lib/dagger/worker/clientdbs # du -h
613.2M .
not that it should matter, but that's a lot of telemetry notsureif

silent fox
#

wasn't this all fine for a long period of time? or was it just less frequent / we weren't paying attention to it

halcyon tendon
#

we definitely used to have multi second pauses between operations all over the place due to the boltdb syncing, so we are definitely assaulting the telemetry dbs way faster than we used to

#

maybe the best route from here is to figure out why this situation results in hanging at all?

#

honestly it might even be expected to hit the busy timeout now, given there's 613MB of data being written over the course of less than 10m

#

I also noticed in the logs of that engine that the first busy timeout occurred right after the engine received hundreds of request to /query over the course of a few seconds

halcyon tendon
#

I guess we could simulate it by just changing the code to always return an error there

silent fox
#

normally otel has retry with exponential backoff, but it shouldn't be indefinite, unless we took a shortcut with its config because "it's all local and not worth thinking about an extra layer of complexity"

#

not seeing anything too egregious, we batch writes to every 100ms, which is what we'd want for responsive updates

#

i feel like we're missing something, but also, our usage of sqlite seems so simple...

halcyon tendon
silent fox
#

i wonder if we could repro the SQLITE_BUSY with a simulated test

#

or get an agent to flail until it repros ๐Ÿ˜‚

halcyon tendon
#

Trying the read-side now

silent fox
halcyon tendon
#

hmmm... I noticed in namespace logs where a hang occurred that a little bit earlier there was a context cancelled in the middle of a DB write operation. So I changed the engine code to cancel contexts for those writes after 1ms. And that managed to get all the engine tests to hang seemingly indefinitely; not just in the trace but also the GHA job itself ๐ŸŽ‰

And even stranger, the output according to GHA is 10s of thousands of lines of nothing? https://github.com/dagger/dagger/actions/runs/19585605887/job/56093699337?pr=11466

GitHub

An open-source runtime for composable workflows. Great for AI agents and CI/CD. - [DNM] attempt to repro hang during lint/tidy ยท dagger/dagger@d6d553b

halcyon tendon
#

I'm getting the feeling that canceling the context of a write operation to the db causes something to catastrophically break...

halcyon tendon
halcyon tendon
#

@silent fox this is a consistent repro: https://github.com/dagger/dagger/pull/11466/commits/f47216af38aaf87a1dace595bdbcdcdc3b6decc5

load bearing parts being:

  1. randomly cancel contexts in middle of writes (which seems to cause other non-canceled writes to block for up to busy_timeout)
  2. set extremely long busy timeout

The engine essentially grinds to a halt forever. From a goroutine dump of the engine, it seems like we can't progress due to this defer stdio.Close(): https://github.com/sipsma/dagger/blob/f47216af38aaf87a1dace595bdbcdcdc3b6decc5/core/telemetry.go#L268-L268

It gets blocked because it's waiting up to busy_timeout. It halts all forward progress because its happening at the end of a call, and the call can't finish until its done running.

So I bet what's happening in CI is we hit 1. (just randomly) and then every single operation after that has to wait for at least 10s, which ends up looking to us like it's all hanging (especially since we can't actually get any further telemetry).

#

1. is the really fundamental problem and smells strongly like a modernc bug. There are some issues in their repo that sound vaguely similar with people saying they didn't get that behavior on mattn... so might see how much the overhead of compiling the engine w/ cgo will ruin our lives in practice? mostly concerned w/ how slow it is

silent fox
#

whoa nice find

halcyon tendon
# silent fox whoa nice find

Yeah unfortunately when I switch from modernc->mattn it all works perfect even with that repro code in place; there's some missing telemetry here and there since half got canceled 1ms into a write, but most of it made it and it all finished running...

#

also unfortunately, cgo is a PITA, but probably dealable-with

silent fox
#

would putting that all in a sidecar binary help at all? or just move the problem

halcyon tendon
silent fox
#

oh yea. easy for otel, not so much for that

silent fox
#

@halcyon tendon alternatively, could we just not use a ctx for sqlite, or prevent it from being canceled? these should all be really fast queries, doesn't seem like we'd lose much

halcyon tendon
#

i didn't get numbers yet but it felt like the first build w/ cgo was noticeably slower but rebuilds after that were not as noticeably bad, so might be benefitting from the go build cache? not sure how that interacts w/ c code. I'll get some actual numbers to confirm/deny

halcyon tendon
silent fox
#

yeah I probably won't notice that ๐Ÿ˜› - not bad