#@Erik Sipsma ack - i'm rebased on `main`
1 messages ยท Page 1 of 1 (latest)
I hit it seemingly very randomly, can't repro when I actually want to. Last night I ran a loop trying to repro it with the tweaks in place and ctrl-C'd after a few hours with no success in hitting it again. So would want more data points before assuming it's gotten worse and not just your turn at the freak occurence ๐
fair - i assumed it was more likely to happen under load so was a bit surprised to see it happen immediately
exiting the command and restarting it works fine
kind of feels like a deadlock? are we not doing BEGIN IMMEDIATE everywhere?
(can check once i pop the stack)
looks like the code tries to, but it's all just URL params so hard to say if it's being respected:
RawQuery: url.Values{
"_pragma": []string{
"foreign_keys=ON", // we don't use em yet, but makes sense anyway
"journal_mode=WAL", // readers don't block writers and vice versa
"synchronous=OFF", // we don't care about durability and don't want to be surprised by syncs
"busy_timeout=20000", // wait up to 20s when there are concurrent writers
},
"_txlock": []string{"immediate"}, // use BEGIN IMMEDIATE for transactions
}.Encode(),
Yeah I kind of doubt at this point it has to do with actual load, it seems like some bug somewhere.
One mildly interesting thing I saw is that BUSY_TIMEOUT can be returned in other scenarios such as if the db needs recovery: https://sqlite.org/forum/forumpost/4d93f49a04?t=h&unf
wonder if we're hitting that somehow
I also was wondering if the fact that we open the underlying DB multiple times (shared one for writing, but each sse handler re-opens for itself, which makes sense in that the client may be gone) could somehow be doing something funky? fwict that should be totally valid, but it raised my eyebrow a millimeter or two
the fact that it's transpiled sqlite makes me slightly more suspicious about something weird going on there, wouldn't be surprised if there's not many users of the modernc lib that open db multiple times like we do, which raises the chances of us hitting a bug not noticed previously
possibly, yeah i'm not sure how the difference shakes out vs. having a single read/write db connection
Actually I just went to the docs that post links to: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
And some of the scenarios do sound like something we could hit?
If another database connection has the database mode open in exclusive locking mode then all queries against the database will return SQLITE_BUSY. Both Chrome and Firefox open their database files in exclusive locking mode, so attempts to read Chrome or Firefox databases while the applications are running will run into this problem, for example.
When the last connection to a particular database is closing, that connection will acquire an exclusive lock for a short time while it cleans up the WAL and shared-memory files. If a separate attempt is made to open and query the database while the first connection is still in the middle of its cleanup process, the second connection might get an SQLITE_BUSY error.
That makes me wonder more if the opening of the db several times is a culprit...
I feel like it wouldn't be that hard to avoid doing that. We'd need some light ref counting around a shared db so we can know when to leave it open and when to close, but probably not too bad?
they're already per-client so we can probably just keep a single one open for the lifetime of the client
my understanding, which may be wrong, is that we do a separate open in the sse handler because it's possible that the client in question closed down, in which case the single shared one won't be around anymore https://github.com/dagger/dagger/blob/979ed22e8763953a0745b4beeeddb16a73aea0a9/engine/server/telemetry.go#L620-L620
if that's not true then even better
did some spelunking: https://github.com/vito/dagger/commit/7b1d13e20db1817732d625842183b89703ae024b
- client DB is closed on session shutdown, so we don't want to use that
one - subscribers wait for client shutdown instead of session shutdown
Signed-off-by: Alex Suraci alex@dagger.io
ah, it's to avoid it being closed, so the client can safely read the last few bits of telemetry (draining, my old friend)
so yep needs something like this (only close when final consumer closes)
I might give it a quick shot right now if you haven't started already, just so we can get it in the next release
haven't started it, go for it! ๐