#test-module-runtimes real bad
1 messages · Page 1 of 1 (latest)
the first bad one is "ensure cache volumes are namespaed"
but "fixup dependabot config schema violation" is still fast
which kind of indicates a level of flakiness maybe?
the two candidates potentially are the cache namespacing (would definitely explain it) and maybe the custom init (which might have some very subtle performance bug)
yeah i've got the init open here
lmao discord giving us a big pic of @white latch
huh my browser really doesn't like any of these traces
it seems pretty convinced that the trace never ends as well
i've had this happen a few times too (safari? might not be the browser)
gonna ping @vocal scaffold as well just in case this is a known issue 😢
you're talking about the trace showing green in GHA, but then you open it in v3 and random spans are pending, yeah?
got a trace link?
i'm out on a branch where i've been mildly fucking with spans and inserting printlns in bad places so i assumed it was my changes
hm it panik
runtime: out of memory: cannot allocate 4194304-byte block (2134278144 in use) rrweb-plugin-console-record.js:2447:17
fatal error: out of memory
ah yep, the 2gb memory limit rears its ugly head
probably some busted assumptions about memory usage since I made the change to send all the data from backend => frontend when the trace ends
will look into it soon
for now, you can use v2
hm okay i have a suspicion related to cache volumes
so the go sdk creates cache volumes
but we try and namespace them based on what "module" we're inside of
i wonder if module X has a go dependency Y - I wonder if this cache module is namespaced as part of Y
hm that would be very surprising actually.
that doesn't seem like it's possible 😛
i do wonder if there's something on the code path of using module-sdks though
because of the fact we see the slow-down in suites that explicitly test a lot of those
hm, maybe this is somewhat possible actually 🤔
cacheVolume in python-sdk , . -> modgobuildcache
diff --git a/core/schema/cache.go b/core/schema/cache.go
index c5c593b73..a838ad534 100644
--- a/core/schema/cache.go
+++ b/core/schema/cache.go
@@ -3,6 +3,7 @@ package schema
import (
"context"
"errors"
+ "fmt"
"github.com/dagger/dagger/core"
"github.com/dagger/dagger/dagql"
@@ -49,6 +50,19 @@ func (s *cacheSchema) cacheVolume(ctx context.Context, parent dagql.Instance[*co
if err != nil && !errors.Is(err, core.ErrNoCurrentModule) {
return inst, err
}
+ if m == nil {
+ fmt.Println("cacheVolume in _ -> ", args.Key)
+ } else {
+ name, err := m.Source.Self.ModuleName(ctx)
+ if err != nil {
+ panic(err)
+ }
+ symbolic, err := m.Source.Self.Symbolic()
+ if err != nil {
+ panic(err)
+ }
+ fmt.Println("cacheVolume in", name, ",", symbolic, "-> ", args.Key)
+ }
namespaceKey := ""
if m != nil {
the python-sdk module never creates the modgobuildcache volume - that's create by the builtin go sdk
(in a basic setup of having a python module call a go module)
this definitely doesn't seem right - it means that somehow we're ending up with distinct cache volumes where we should be getting the same ones
but also not sure how this would lead to a x2 slowdown 🤔
(heading off for the evening, but will pick this up again tomorrow, definitely has performance regressions vibes)
testing to see if this could be the issue: https://github.com/dagger/dagger/pull/9389
oh man, trying to run dagger call test specific --run="TestGo|TestPython|TestTypescript|TestElixir|TestPHP" locally... it chokes soooooooooo hard with 8g of memory
just the 24 tests in module_typescript_test.go take forever, saturate 8 CPUs, and over 12G of RAM (grain of salt on the ram obviously, htop)
loool sweet lord it even gets 137'd on a 16g vm
pushed a few optimizations to reduce memory pressure in the TUI/web UI - now that trace renders without crashing: https://v2-v3-redirect-v3.preview.dagger.cloud/dagger/traces/054d196b882a21f4e081e218521e54db?span=6cd71c7e08050aae
still TODO: figure out why the web UI thinks it's still running. this seems to come from spans that never finished (all the canceled ones - they're not actually canceled, that's the UI compensating, which I want to remove)
hm okay, in https://github.com/dagger/dagger/pull/9389, i tried reverting the cache volumes namespacing - and we're still at 20m
gonna try reverting the init change to see if that does it
playing around with the typescript module runtime tests, there does appear to be quite a lot of npm+pnpm+corepack+bun diskrw spikes.... but i'm also getting clean 6 minute runs from doing go test on my host instead of dagger call test specific
admittedly this is a shot in the dark, bc i'm not fully following what was added to le dumb-init copy-over--- but is it plausible that that is behaving badly specifically when we run engine-in-engine?
oh there we go, tripped the oomkill: https://dagger.cloud/dagger/traces/41be13b6523f4679c1ba1490bf9f01d0#7f30d5aec6f1d019 (it is the engine getting 137'd)
(continuing conversation from meeting this morning) The biggest contributor to the OOM's I was getting in https://github.com/dagger/dagger/pull/9204 was from that PR specifically (something incredibly dumb around buildkit op unmarshaling). But after fixing that locally I'm still getting some crazy RSS in the engine, just barely managing to avoid an OOM. I suspect I started seeing those consistent OOMs after rebasing on main, which combined with the pre-existing LLB unmarshal mem usage sent everything over the edge.
Either way, looking at pprof heap usage shows the worst offender now being:
1539.11MB 28.80% 28.80% 1539.11MB 28.80% go.opentelemetry.io/otel/sdk/log.newRing (inline)
Which is a new player I haven't seen in this game before... Not sure if I just haven't noticed it previously or if some otel dep change introduced it in the last few weeks? Going back in commits to see
Actually just realized we haven't bumped otel i na while so probably not that...
man you beat me to the heap dump result by 20 minutes after me spending all afternoon tryna get to that pprof port 
similar result out the gate on my end, haven't gotten the timing right to where it shows 1.5G yet
Yeah also interesting is that switching from inuse_space to alloc_space shows this behemoth, 133GB over a few mins of running TestModule lol
(that's running on main now, not my PR)
@vocal scaffold I'm gonna keep spelunking but just checking if you know of any tweaks we made to otel logging recently. Might be a totally innocent looking change that somehow accidentally resulted in us calling some otel func way more than before (speculating)
hmmmm i don't think so. is it possible this has been around for a while?
yes, but it also got marginally worse on 1/7
Okay good to know thanks, and yeah that's what I'll try to figure out now. Will do a manual git bisect I guess
Do you know down to which commit?
Yeah I see the same thing back to v0.14.0, so I guess it has been there for a while at least. I've definitely looked at heap profiles since then but probably not while running all of TestModule specifically... either way I'll open an issue in linear to fix this since it's kind of crazy in terms of number of allocations but probably not the culprit we were looking for yet
Wouldn't be the first time that there was an inefficiency in in the OTel logging Go SDK - looks like it's this call: https://github.com/open-telemetry/opentelemetry-go/blob/cbc3b6a5bb245ccce687559b92c135d680c5699b/sdk/log/batch.go#L162
Which already has this TODO lol: https://github.com/open-telemetry/opentelemetry-go/blob/cbc3b6a5bb245ccce687559b92c135d680c5699b/sdk/log/batch.go#L140
// TODO: investigate using a sync.Pool instead of cloning.
curious what the actual logs are that are going down this path - I guess stdout/stderr from things running, going into the OTel/clientdb pipeline?
Another weird thing is that when I disconnect everything from the engine, hit the manual debug/gc endpoint + give it a sec, the memory allocated by log.newRing never disappears, so it's somehow leaking seemingly. But goroutine dumps don't show anything left that seems like it could plausibly still be referencing it. So idk what could be happening other than that memory ending up in a global map/pool/thing somewhere
And sadly the current state of go tooling doesn't really answer the question of "what is still referencing this memory", which is a real shame since it's almost always the next question you ask after you see a heap profile
This might be a viable path though: https://github.com/dailymotion/hands-on-diagnosing-golang-apps/blob/master/core-files/README.md
Confirmed the above a genuine leak (which is a little tricky since it's hard to tell the diff between a leak and the GC just choosing to not free memory from sync.Pools and such), it seems pretty bad actually: https://linear.app/dagger/issue/DEV-4922/memory-leak-excessive-allocations-in-otel-log-code
Looks like over a GB is leaked over the course of just running TestModule.
Mention this in that issue, but apparently the go team is working on something that does exactly what we want here (take a core dump and then tell you exactly why a given object is still considered referenced by the GC), but sadly it doesn't seem to quite work yet despite jumping through a million hoops in desperation 😢 https://pkg.go.dev/golang.org/x/debug/cmd/viewcore
@late moat @vocal scaffold would either or both of you have time to look into that issue?
taking a look! i wonder if this is due to each client getting its own log processor - maybe we don't clean em up properly?
hmm... seems like it at least tries to; we shut down all client log processors when the session goes away
I would still expect to see high alloc_space here, since we unquestionablly alloc a queue of size 2048 for each client, but inuse_space shouldn't be leaking for sure
maybe there's a dangling reference somewhere tricky preventing GC
is it possible that all these clients accumulate before the session goes away? Like all tests are actually running many clients within one session or something? (I don't think so, right?)
do the clients (or sessions) have any concept of nesting on the server side? like a bunch of the module tests use a dagger binary inside of a container created by a dagger.Connect client
The important thing is that we aren't leaking goroutines. That's what I'd expect to see in an obvious leak of something related to client/session state. There are a handful of goroutines (besides the standard go runtime ones) still around once everything disconnects (no tests running) but they are all what you'd expect in terms of listeners/scheduler loops/etc. None of them had any obvious way of holding state that referenced the Value in these log ring things.
I also saw that the sessions map in the engine server is empty, so no obvious leaks there.
I guess the possibilities in my mind are:
- Some sneaky leak somewhere in the engine server or maybe buildkit. I do see that the cumulative mem usage for a few other parts of sessions/clients/dagql classes seems to stick around, but nothing in flat usage.
- Some global var literally anywhere in our whole dependency tree that's holding onto state
- An outside possibility but wouldn't rule it out: the OTEL libs do some crazy magic with unsafe, i.e. getting the
*byteunderlying a string and then reconstructing new strings with it (basically turning what's supposed to be immutable into more or less a C string): https://github.com/open-telemetry/opentelemetry-go/blob/cbc3b6a5bb245ccce687559b92c135d680c5699b/log/keyvalue.go#L69-L70- Even if this doesn't result in a leak, I wouldn't be shocked if it somehow confuses the tooling around all this and means it's reporting the source of the allocation incorrectly? </speculation>
other thing about leaks: it's really not guaranteed at all that this leak is related to poor performance on the module-runtime tests
also @white latch , have you been pulling heap dumps from ./hack/dev engines? or dind dagger call test engines? both?
also curious what errors you got out of viewcore if you have them handy, I can take a swing at it on my machine (low odds of working but might as well)
besides that I figure I can use runtime.SetFinalizer on some key things to see if they're ever GCed
upgrade to 1.24 preview and you can runtime.AddCleanup instead of ye olden finalizers
lol, thought about it but hopefully a simple println is easy enough to do with the OG api
I just was using ./hack/dev for simplicity
Yeah for sure, the second latest commit in viewcore is also "update for go 1.24" so might be related to the fact that we're on 1.23. Not sure if trying to build 1.24 from source is worth it vs just waiting, but I'll assemble the instructions for getting the core dump. It was not straightforward so will send after current meeting
Right sorry I sort of hijacked this thread. I'm still not 100% sure about the increase in runtime. It's not 100% clear to me yet if it's real or just noise, based on the graph. It looks "bumpier" than the last two weeks but there are other periods of random bumpiness in the past, so hard to tell what's happening
Okay to get a core dump, there is potentially a pretty easy path by installing gdb and then use sudo gcore <pid of engine>. However, that's from a live process and somewhat outside the tooling go says it explicitly supports, so when I started hitting errors later I decided to try to do it in the more explicitly doc'd way. Possible this is irrelevant.
Either way, to get the core dump in the way go docs say is supported:
Add these lines to the engine startup
ulimit -c unlimited
export GOTRACEBACK=crash
Change the ./hack/dev engine to have an init instead of being pid 1 (no clue why I had to do it, was a random shot in the dark that happened to work; pid 1 is weird). Can append --init here
Start the engine, run stuff. To get the core dump run docker exec -it dagger-engine.dev kill -ABRT 7
docker kill -s SIGABRT dagger-engine.devshould work, but doesn't 🤷♂️
To extra the core dump first run docker diff dagger-engine.dev | grep core, find the core file (should be under /). Then run docker cp dagger-engine.dev:<path to core> .
Also copy out the dagger-engine binary with docker cp dagger-engine.dev:/usr/local/bin/dagger-engine .
Then after installing viewcore, the command is like go run golang.org/x/debug/cmd/viewcore <core file path>--exe <dagger-engine exe path>.
I get errors at start about giving up on tracing some parts of the core dump. Also usually I get a panic when I run objects, but if I run it again it works more but there's a lot of types marked as unknown (and no occurences of anything interesting from otel/log pkgs that I saw). I basically don't know if the information is complete/correct.
I'll dump this in the linear issue as a comment for posterity since hopefully soon enough viewcore actually works and this will be relevant again
so far SetFinalizer 10-0 Alex - can't seem to get it to run even after forcing GC and even with a simple example:
slog.Warn("!!! SETTING FINALIZER")
type t struct{}
val := &t{}
runtime.SetFinalizer(val, func(v *t) {
slog.Warn("!!! IT'S FINALIZER TIME !!!", "tag", "welp", "type", fmt.Sprintf("%T", v))
})
will keep poking after lunch
If you didn't already, might be worth trying to set something like
export GOGC=1
export GOMEMLIMIT=1
to attempt to tell the GC to really always please go out of your way to collect everything even if it doesn't seem necessary
yeah i yoinked your flags from the linear issue, still no cigar
Could also be worth trying to do something with val, maybe the compiler is smart enough to optimize it out? Especially might be worth passing it out of a function so it for sure is allocated on the heap and not just in the stack, not sure if that could impact something here
oh, in that example it also doesn't help that it's struct{} lol - tinkering in the playground, and i need to add at least one field for it to finalize
alright - got finalizers working for my canary example, but I'm not seeing them run for the "real" values which is ... concerning?
real values being the ring objects? or something else?
slog.Warn("!!! SETTING FINALIZER")
type t struct{ a [1024000]byte }
snitchGC(&t{}, "canary")
// runtime.SetFinalizer(&t{}, func(v *t) {
// slog.Warn("!!! IT'S FINALIZER TIME !!!", "tag", "welp", "type", fmt.Sprintf("%T", v))
// })
snitchGC(client.loggerProvider, client.clientID)
snitchGC(client, client.clientID)
I see the finalizer log for the canary, but not for the client or loggerProvider. Also set finalizers on all the batch processors, and don't see them run either
Interesting... In the heap dumps I collected after tests ended and nothing was happening in the engine showed no "flat" memory usage from the engine/server package, only some cumulative due to other packages being called by that package having memory laying around still
If the client itself was never GC'd I would have expected >0 flat usage
I could be misunderstanding the tooling though
yeah fair - at this point it's most likely I'm holding SetFinalizer wrong
but, it worked for the canary 
Grr... it seems like using SetFinalizer was causing things to not finalize. I added a waste [10485760]byte to *daggerClient with a rand.Read(client.waste[:]) for good measure to make leaks more obvious, and I could see each client leaking with a SetFinalizer installed, using just while dagger core version; end to test. But once I removed the SetFinalizer, it still "leaked" but plateaued at a smaller amount of space, and once I stopped the loop everything eventually got fully GC'd. But, after the final GC, even the newRing space was gone! So it's hard to say if anything's actually going wrong here, or if it's the GC just being mercurial/lazy.
well this is cursed: grew impatient and decided to run multiple dagger-dev core version loops in parallel, and ... sometimes i get v0.15.2 back??? (along with a weird error)
❯ while dagger-dev core version; end
# command-line-arguments
open ./mage_output_file.go: no such file or directory
Error: error compiling magefiles
✔ connect 0.3s
✔ loading type definitions 0.6s
✔ parsing command line arguments 0.0s
✔ version: String! 0.0s
v0.15.2
✔ connect 0.4s
✔ loading type definitions 1.2s
✔ parsing command line arguments 0.0s
✔ version: String! 0.0s
v0.15.3-010101000000-dev-dfcd66e83e45
guessing due to parallelism somehow
Yeah if something crashes or I ctrl-C at the wrong time I occasionally have to manually delete that ./mage_output_file.go for some reason. Mage really doesn't need to exist anymore. Not sure if the version thing is related, also sounds possibly like https://github.com/dagger/dagger/pull/9400
Huh, did you try with running ModuleTest in it's entirety to see if you can repro the leak with that? I was afraid of the GC just being lazy which is why I added all those env vars to make it cleanup more and then tried to limit the memory of the engine container. It did get OOM killed so if the GC is just deciding to not cleanup memory that seems like a severe issue in the go runtime
ahh i see, it was failing to generate the devenv, hence no override, hence running with the default engine
trying to do that as a last resort since my WSL VM has been really wonky and tends to hard-lock-up with enough CPU/RAM usage 😅 (it just happened from my while loop with SetFinalizer set up...)
lol good to know WSL also has problems next time my rosetta network decides to disappear on wake
not bc it's useful information but because maybe i'll be slightly less annoyed
this SetFinalizer behavior is just baffling - I even pulled out the func so it's not closing over the value or anything silly like that, and still the whole client leaks, and stays leaked even after runtime.GC()
i guess i'll try 1.24 after all (...maybe - looks like we use Wolfi for this now which doesn't have it yet, possible dead end anyway. edit: nvm wasnt hard)
The new runtime.AddCleanup function is a finalization mechanism that is more flexible, more efficient, and less error-prone than runtime.SetFinalizer. AddCleanup attaches a cleanup function to an object that will run once the object is no longer reachable. However, unlike SetFinalizer, multiple cleanups may be attached to a single object, cleanups may be attached to interior pointers, cleanups do not generally cause leaks when objects form a cycle, and cleanups do not delay the freeing of an object or objects it points to. New code should prefer AddCleanup over SetFinalizer.
"cleanups do not generally cause leaks when objects form a cycle" sounds especially relevant - client/session are cyclic
yep. runtime.AddCleanup works 
ok, and this time after two calls to the /debug/gc endpoint, all the clients (and newRing) are gone - saw 4 clients get GC'd with each call.
I'll run TestModule now, rebooted a bit ago so hopefully the VM doesn't fry
(i do indeed repro an OOM there)
Stopping for now, but here's the last flamegraph I caught before OOM - still seems to imply *daggerClients are sticking around, which explains the newRing "leaking" since it's part of it. Not really satisfied with that answer though. It seems surprising that Go's GC would be at fault, but it seems to at least exhibit a behavior where runtime.GC() doesn't always clear everything - sometimes it takes multiple passes. Maybe if you're about to OOM, the fact that it leaves objects for a later pass results in it OOMing at the threshold? Purely speculative though.
Tomorrow I'll try setting GOMEMLIMIT to something like 1GB and see if it's able to respect it - right now it's probably not doing anything because I'm forcing it to always be surpassed with that 10MB array?
Pushed changes to https://github.com/dagger/dagger/compare/main...vito:dagger:oom-investigate?expand=1
i think this is probably happening because the mage stuff fails, and then the env variables aren't injected - so you're falling back to the not-dev engine
😱 sigh i want to get rid of mage
knowing what i know now i kinda wanna give killing mage another shot
@lone forum it was you, right, that showed some docker socket trick that'd let us plop the dev engine into docker from a dagger build? that was one of the problems that made me abandon my first attempt
Yeah that's in now : 🎉
I started hacking on maybe having a dagger shell script to do this earlier
But then came across legit shell bugs
I still would expect there to be memory usage from the engine/server package in that case though, was the flat usage of that zero for you too? I'm not sure the extent to which "zero really means zero" (i.e. is the heap snapshot not 100% complete?), but if it does then all the data I got was saying that engine/server showed up just because it was in the call stack that initially allocated the ring, even though everything created directly by engine/server was all gone.
I also still got this same behavior even when doing things like waiting 10m after all clients disconnected, running a few debug/gc calls throughout. I agree we can't rule out the go GC just deciding not to collect anything in that case but in combation with all the GOGC/GOMEMLIMIT stuff it seems like it must just be utterly broken in some way
Been looking into this again because I am still getting what appears to be OOM kills on https://github.com/dagger/dagger/pull/9204 despite some improvements and pprof heaps locally show similar sort of leaks (otel log related the worst but scattered other stuff)...
Reaching my wits end. I tried building w/ 1.24 and using viewcore but got same sort of errors as before. Did a ton of more manual digging w/ delve and the core dump but couldn't find anything leaked (can't be exhaustive there though).
My only remaining lead is that I saw buildkit's cache refs seem to be leaking. Running some integ tests creates unpruneable cache entries. That certainly should lead to some extra memory, but I can't at all see the connection to the packages that show up in the heap dump and what would be stored there, but will keep pursuing it tomorrow.
If that fails too, might need to just keep going back further in our commit history to really figure out when this started happening so it can be bisected (manually, very hard to automate I think)
cc @here, any other help/eyes would be appreciated. Kind of worried how much of a time sink this becoming and that something obvious is lurking somewhere
not quite managing to repro the oom kills? what technique are people using to get them?
i do spot some memory leak things though, i am seeing the same log.newRing stuff as pointed out above
but also
this isn't the biggest thing, but it's odd to me that even after having no clients, we somehow seem to keep hold of MkFile unmarshalling
this is on @white latch's pr, that supposedly has a fix for this?
The fix I had improved that situation w MkFile but didn’t get rid of it
Which is why Im still in a state of misery trying to figure out wtf is going on there
But yeah, the log ring thing is what seems to grow the most and ultimately trigger the OOM
It outpaces everything else, but I feel like it’s probably all connected somehow
There’s instructions in the linear issue I linked above to repro
My best guess is that cache refs are becoming unpruneable and refs to all these various leaking things are somehow pointed to via the buildkit cache refs
Another thing that occurred to me was if any dep is using SetFinalizer it could trigger the situation vito mentioned where circular objects become un-gc-able. I tried rm’ing all of buildkits usage of SetFinalizer on a hunch but didn’t make a difference
Could be any dep though
i wonder if maybe we just up-ped the size of logs
i picked this "fast" run
and this slow run
the call.log for the fast one is 50.9 MB
but for the slow one, it's 129.9 MB
so potentially some change in that window just increased the amount of logging we're doing
and exposed a pre-existing issue - since i'm struggling to see how we could possibly have changed some memory-related thing in that windows
can confirm that we see the massive jump in size after merging https://github.com/dagger/dagger/pull/8724
so something there is causing some massive jump in the amount of logs we produce
i think that's unexpected, so i'm going to dive in there
🤔 I have a thought. I think some CacheVolumes getting returned are impure. Where we do the Select with the elective purity is only in the case of creating a cache volume from another module.
But creating it from outside of a module (e.g. how we do the go stuff) I think means we don't ever rewrite the ID?
cool, i think i can confirm this is what's going on: https://github.com/dagger/dagger/pull/9428
in the above patch i rework the NewInstanceForCurrentID calls to only apply to pure calls (and re-call itself as pure if it's not)
this reduces the log sizes, and seems to go back to previous performance
@white latch I think this is probably fixable now with your ID modification work? without that, not sure what a reasonable solution is
will write this into the linear issue (update: https://linear.app/dagger/issue/DEV-4922/memory-leak-excessive-allocations-in-otel-log-code#comment-6a84d94e)
also found this perf issue while doing some profiling: https://github.com/dagger/dagger/pull/9430
While doing some profiling while investigating the slow down discussed in #1328789786718371861 message, I figured I'd look at the ...
if that is the root cause, I wonder what sort of metrics dashboard might have surfaced it sooner. Seems pretty subtle: no behavior changes, some queries just run more often?
yea, was thinking about this - honestly, having some idea of log size exports would be useful.
maybe having some info on trace size in cloud would be good?
that would have made us realize, "hey this doubled in size at some point, not great"
what was all the logging coming from?
i did also have a slightly unrelated thought about OTEL stuff in general - i guess it's clear that each span has a cost that builds up when you have as many as we do.
i wonder if we should try and tidy up what gets exported and span-ified by default. the smaller the traces, the quicker cloud can be
uh.
i think it was stuff being re-evaluated
my hunch is:
- the cache volume impurity meant things were being recomputed - this caused the slowdown
- more recomputations = more logs causing log size to boom
- more logs = more leaks (because of unknown weird OTELy issue)
Thank you @lone forum ❤️ , will see if I can fix the cache volume purity in my PR. I'll also see if my hunch about bk cache refs leaking is right since it would be nice to squash this entirely if feasible, but will timebox that.
i think auditing the span and log emissions may be a win-win here... usually you lose a lot of obs fidelity to gain a little perf, but currently it feels to me like we have too much fidelity even as an engine dev, and if we feel that way it's gotta be much worse for typical users
So at first it seemed like fixing the cache volume thing in my PR had no effect, but much later realized I just goofed something and the new cache key stuff works well for it and avoids the previous problems 🎉
In between those two points I went back to seeing if the buildkit cache ref leaking was behind the memory leaks since it seemed like there was no escaping it. We are indeed leaking lots of buildkit cache refs (Run a bunch of tests like TestModule, then run ./hack/with-dev ./bin/dagger core engine local-cache prune and after you'll still see a lot of refs sitting around "actively used" in ./hack/with-dev ./bin/dagger core engine local-cache entry-set entries).
I managed to trace where the leaked refs were originating from (exec op mount code) but then to figure out where they went I ended up deep in the buildkit solver 🤮 I then did some hacks to forcibly release all cache refs so bk's cache manager will empty it's references from memory, but that still didn't fix the memory leak. So that's just another issue.
I tried about 10,000 other shots in the dark at tracking down the leak but still no cigar, so giving up for now.
Then finally after figuring out my stupidity with the cache volume and fixing that, I got reasonable runs in CI to discover some legit failures not just from the engine dying. Most were easily but there's some remaining viztest failures...
- It looks like some custom spans are not showing up for module initialization. This is legit, will figure it out tomorrow
- Some of the changes I made to ID handling in dagql are resulting in default values showing up in telemetry. So e.g. you see in output
withExec(args: ["cmd"], expect: SUCCESS)- This one I'm a bit more ambivalent about, in some ways it might be preferable? Maybe I'm just trying to convince myself 😅
Huh actually I misread the diff yesterday, in my PR we actually get more custom spans showing up in the TUI, which seems more correct? e.g. this one https://github.com/sipsma/dagger/blob/71a30c50c7674ed0f105616da67f8a0a6320c833/core/schema/module.go#L930-L930 and some similar. Not sure how, but 🤷♂️
Also, side note for later, my nvim config has a ton of extra linters enabled and some interesting ones from one called spancheck are showing up: span is unassigned, probable memory leak here and span.End is not called on all paths, possible memory leak here
No clue if it's a false alarm, but intriguing given our recent woes, cc @vocal scaffold in case it's obvious one way or another to you whether those are legit
that should be fine, the linter just isn't recognizing that it intentionally outlives the func
since the whole point of that func is to create/wrap spans
Okay cool, good to know
Huh actually I misread the diff yesterday, in my PR we actually get more custom spans showing up in the TUI, which seems more correct? e.g. this one https://github.com/sipsma/dagger/blob/71a30c50c7674ed0f105616da67f8a0a6320c833/core/schema/module.go#L930-L930 and some similar. Not sure how, but 🤷♂️
Actually I was even more confused than I thought, mainly thrown off by the fact that there's some dupes of viztest data I think? e.g. old and new
Either way, merged that PR so the test runtimes seem like they are back to their old baseline
in case y'all were as curious as I was about how many more logs/spans were generated:
trace id | spans | logs
8f1d609a9e6e572f549fc847af362355 | 105927 | 444692
054d196b882a21f4e081e218521e54db | 291370 | 791546
so, yeah, if we had something displaying that it'd be super noticeable
Managed to figure the leak out finally. I basically grepped a core dump for human readable strings that seemed like they were contents of the stuff we were leaking, used hexdump to find the address in the core file, mapped that to the virtual memory address in the heap and then asked viewcore why that address was reachable and it actually output the chain of address refs, which while lacking type data for most of the nodes was just enough to manually poke around with in delve until I found it.
The problem is buildkit stores entire trace provider objects in the struct for every cache ref. That ends up including a ton of stuff, including our buildkitTracer object.
The root of the pointers I followed lead to this delightful global var https://github.com/moby/buildkit/blob/81d49f78197cd471bb92ed800a4873384389bfd4/cache/contenthash/checksum.go#L31 but I think there's likely other roots that lead to the same place.
I suspect we'll just need to patch this in our fork, it doesn't seem obvious atm how to fix it in another non-invasive way that would work upstream.
Also the log.ring stuff leaks here because there's attributes stored in spans which are sourced from that ring buffer (which is actually just a circular linked list that's entirely pre-allocated, and thus one attribute getting ref'd somewhere prevents the entire ring buffer from being free'd)
viewcore why that address was reachable and it actually output the chain of address refs, which while lacking type data for most of the nodes was just enough to manually poke around with in delve until I found it.
color me extremely impressed
eyyy nice find ❤️
for reference, our bk tracer structs... client.ops seems potentially expensive to hold on to like this, too...
type buildkitTracer struct {
embedded.Tracer
provider *buildkitTraceProvider
tracer trace.Tracer
}
type buildkitTraceProvider struct {
embedded.TracerProvider
tp trace.TracerProvider
lp *sdklog.LoggerProvider
mp *sdkmetric.MeterProvider
client *Client
}
// Client is dagger's internal interface to buildkit APIs
type Client struct {
*Opts
closeCtx context.Context
cancel context.CancelCauseFunc
closeMu sync.RWMutex
execMap sync.Map
ops map[digest.Digest]opCtx
opsmu sync.RWMutex
}
tryna figure out what a fix for this might look like... like on the one hand we could try to not pass our tracer through, but i imagine it needs to be passed through for trace continuity inside of bk. it'd be interesting to know if the tracers get reconstituted from cache, too, and whether that effects the downstream result...
in all of this, though, im struggling to go deeper because it's unclear to me where in the code entire trace providers get stored on cache refs... greps of both bk and d/d coming up empty rn
I'm working on some possible fixes, was gonna write up an explanation of the whole chain of pointers and details on everything as part of the commit but can now if helpful.
Quick version
What I found grepping in hexdump output for a core dump that looked like a leaked string related to what we saw in pprof:
01bfd1b0 7b 22 43 6c 69 65 6e 74 49 44 22 3a 22 76 38 73 |{"ClientID":"v8s|
01bfd1c0 75 74 31 33 61 63 79 6a 7a 6b 65 39 63 6a 7a 72 |ut13acyjzke9cjzr|
01bfd1d0 32 7a 65 62 36 38 22 2c 22 53 65 73 73 69 6f 6e |2zeb68","Session|
01bfd1e0 49 44 22 3a 22 69 71 35 6e 72 6d 35 37 6c 6f 73 |ID":"iq5nrm57los|
01bfd1f0 69 68 64 76 34 30 7a 31 6c 61 75 6f 6b 6a 22 2c |ihdv40z1lauokj",|
01bfd200 22 53 65 63 72 65 74 54 6f 6b 65 6e 22 3a 22 22 |"SecretToken":""|
01bfd210 2c 22 48 6f 73 74 6e 61 6d 65 22 3a 22 22 2c 22 |,"Hostname":"","|
01bfd220 43 6c 69 65 6e 74 53 74 61 62 6c 65 49 44 22 3a |ClientStableID":|
01bfd230 22 22 2c 22 45 78 65 63 49 44 22 3a 22 6d 6c 32 |"","ExecID":"ml2|
01bfd240 30 63 74 7a 36 6d 36 6e 64 61 61 63 31 61 32 72 |0ctz6m6ndaac1a2r|
01bfd250 61 6b 38 36 73 35 22 2c 22 49 6e 74 65 72 6e 61 |ak86s5","Interna|
....
Load core in viewcore, run mappings to find heap and do arithmetic to map 01bfd1b0->40001D6880 (the addr in the heap in virtual memory to that string)
Run reachable 40001D6880 in viewcore (first time it always errors but if you run it again it works 🤷♂️ ).
(viewcore) reachable 40001D6880
github.com/moby/buildkit/cache/contenthash.defaultManager →
4002a36660 github.com/moby/buildkit/cache/contenthash.cacheManager .lru → f0
4002bd52c0 unk32 f16 → f0
4002748bd0 unk48 f16 → f8
4042390000 unk896 f160 → f0
400cbf8a50 unk80 f8 → f0
400a634c80 unk80 f40 → f0
400a4af440 unk96 f32 → f0
4001f918c0 unk48 f16 → f0
4003a98e70 unk48 f16 → f0
4004732000 unk416 f136 → f0
40046ff440 unk64 f0 → f0
400625f458 unk24 f8 → f0
400197d400 unk512 f24 → f0
4004796aa0 unk80 f64 → f0
4002e91d40 unk288 f40 → f0
40056bf230 unk16 f8 → f0
40056fb590 unk24 f16 → f0
400399a900 unk64 f48 → f0
4001c55b00 unk128 f0 → f0
400297b360 unk160 f96 → f0
400078b410 unk48 f16 → f8
400b482000 unk10240 f104 → f0
400aa06d50 unk24 f8 → f0
401387e1c0 unk224 f16 → f0
401653d080 unk96 f48 → f0
400dd2ba00 unk512 f80 → f0
40116fbc20 unk48 f16 → f0
401cab4620 unk224 f160 → f0
4012651470 unk48 f8 → f0
4014457c40 unk32 f24 → f0
40126513e0 unk48 f16 → f0
40159404b0 unk24 f0 → f0
401c4a1320 unk144 f40 → f0
4006296c60 unk96 f24 → f0
4015ba2360 unk48 f32 → f0
4003a6cc90 unk24 f0 → f0
4003bbf430 unk16 f8 → f0
40039e0b00 unk64 f40 → f0
40022cac60 unk352 f32 → f0
400184fa40 unk224 f8 → f0
40022db320 unk288 f40 → f0
400275f3f0 unk16 f8 → f0
40013b3c08 unk24 f16 → f0
40022d7080 unk64 f48 → f0
400067e400 unk128 f0 → f0
400043ee60 unk160 f96 → f0
4000b3cde0 unk48 f16 → f8
4013c06000 unk5376 f40 → f0
4015d91d70 unk24 f16 → f0
400ad3e380 unk128 f88 → f0
4010effb60 unk48 f16 → f8
4004407800 unk3072 f288 → f0
401cd26000 unk144 f16 → f0
401c8e9bc0 unk48 f8 → f0
4014451d70 unk48 f16 → f0
400e920000 unk288 f136 → f0
40001d6880 unk3456
Most types are prefixed w/ unk since viewcore is buggy and can't figure them out, but if you open the core dump up in delve you can use the p command to print the structs and find each addr manually. This is a huge PITA still and sometimes you have to jump a few pointers forward since you end up with some that are internal to map datastructures and such. Thank god it at least could tell me the root type.
I didn't go all the way down the chain, just to the point I was satisfied I knew what was happening, but here's my annotated notes on the p command corresponding the various points in the chain:
(viewcore) reachable 40001D6880
github.com/moby/buildkit/cache/contenthash.defaultManager →
4002a36660 github.com/moby/buildkit/cache/contenthash.cacheManager .lru → f0
4002bd52c0 unk32 f16 → f0
4002748bd0 unk48 f16 → f8
4042390000 unk896 f160 → f0
400cbf8a50 unk80 f8 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"]
400a634c80 unk80 f40 → f0
400a4af440 unk96 f32 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value
4001f918c0 unk48 f16 → f0
4003a98e70 unk48 f16 → f0
4004732000 unk416 f136 → f0
40046ff440 unk64 f0 → f0. p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"]
400625f458 unk24 f8 → f0
400197d400 unk512 f24 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p
4004796aa0 unk80 f64 → f0
4002e91d40 unk288 f40 → f0 p &contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.jobs BECAUSE FIRST STRUCT FIELD ACTUAL ADDRES IS FOR g
40056bf230 unk16 f8 → f0 p &contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span
40056fb590 unk24 f16 → f0
400399a900 unk64 f48 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span.tp
4001c55b00 unk128 f0 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span.tp.client
400297b360 unk160 f96 → f0
400078b410 unk48 f16 → f8
400b482000 unk10240 f104 → f0
400aa06d50 unk24 f8 → f0
401387e1c0 unk224 f16 → f0
401653d080 unk96 f48 → f0
400dd2ba00 unk512 f80 → f0
40116fbc20 unk48 f16 → f0
401cab4620 unk224 f160 → f0
4012651470 unk48 f8 → f0
4014457c40 unk32 f24 → f0
40126513e0 unk48 f16 → f0
40159404b0 unk24 f0 → f0
401c4a1320 unk144 f40 → f0
4006296c60 unk96 f24 → f0
4015ba2360 unk48 f32 → f0
4003a6cc90 unk24 f0 → f0
4003bbf430 unk16 f8 → f0
40039e0b00 unk64 f40 → f0
40022cac60 unk352 f32 → f0
400184fa40 unk224 f8 → f0
40022db320 unk288 f40 → f0
400275f3f0 unk16 f8 → f0
40013b3c08 unk24 f16 → f0
40022d7080 unk64 f48 → f0
400067e400 unk128 f0 → f0
400043ee60 unk160 f96 → f0
4000b3cde0 unk48 f16 → f8
4013c06000 unk5376 f40 → f0
4015d91d70 unk24 f16 → f0
400ad3e380 unk128 f88 → f0
4010effb60 unk48 f16 → f8
4004407800 unk3072 f288 → f0
401cd26000 unk144 f16 → f0
401c8e9bc0 unk48 f8 → f0
4014451d70 unk48 f16 → f0
400e920000 unk288 f136 → f0
40001d6880 unk3456
So
contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span.tp.client
Is something like this (just the most significant parts of it)
- https://github.com/sipsma/buildkit/blob/09d1f0bb38d1a3c88b75db7ee6c757e2072dd3b7/cache/contenthash/checksum.go#L29-L29
- https://github.com/sipsma/buildkit/blob/09d1f0bb38d1a3c88b75db7ee6c757e2072dd3b7/cache/refs.go#L481-L481
- https://github.com/sipsma/buildkit/blob/09d1f0bb38d1a3c88b75db7ee6c757e2072dd3b7/cache/opts.go#L13-L13
- https://github.com/sipsma/buildkit/blob/09d1f0bb38d1a3c88b75db7ee6c757e2072dd3b7/util/resolver/pool.go#L148-L148
- https://github.com/sipsma/buildkit/blob/09d1f0bb38d1a3c88b75db7ee6c757e2072dd3b7/solver/jobs.go#L56-L56
- https://github.com/sipsma/dagger/blob/9a18a0bd7f346ad9a9c2d05e2d7ac77999d3051d/engine/buildkit/telemetry.go#L109-L109
I agree that if there's a way to avoid putting the whole client in that struct it might help, but I'm not 100% convinced yet that would totally solve this. The root problem is buildkit holding onto all this stuff in the first place. It's not even just OTEL stuff but also tons of other crap
2 things:
-
yeah, to me the chain makes sense, but doesn't directly imply a solution. are you thinking we try to move the provider-onwards or resolver-onwards parts of the chain to the stack? or patch bk to be more aggressive about cleaning up dirty=false lru items?
-
ok im lying a bit that the chain was entirely followable for me in code on the first go round, i actually didn't make the RefMetadata.descHandlers jump... lemme poke around with that grep bc if there's an avoidable assignment to descHandlers bc that could also be a fix
yeah, to me the chain makes sense, but doesn't directly imply a solution. are you thinking we try to move the provider-onwards or resolver-onwards parts of the chain to the stack? or patch bk to be more aggressive about cleaning up dirty=false lru items?
I'm going to start by rm'ing that global var in contenthash entirely. It's supposed to be a perf optimization but I question if it's actually making any difference. If that fixes it entirely, I'll call it good enough. If not we may need something else that just avoids these pointers existing.
ok im lying a bit that the chain was entirely followable for me in code on the first go round, i actually didn't make the RefMetadata.descHandlers jump... lemme poke around with that grep bc if there's an avoidable assignment to descHandlers bc that could also be a fix
descHandlersending up there happens in a bunch of codepaths in our own code and also internal in buildkit
- our code: https://github.com/sipsma/dagger/blob/9a18a0bd7f346ad9a9c2d05e2d7ac77999d3051d/core/contenthash.go#L106-L106 (the
ref) - internal in buildkit: https://github.com/sipsma/buildkit/blob/09d1f0bb38d1a3c88b75db7ee6c757e2072dd3b7/solver/llbsolver/ops/opsutils/contenthash.go#L48-L48 (the
ref.ImmutableRef)
The thing is, the contenthash pkg only needs some fields in the ref, not the whole thing. So another possibility is to somehow give it a canned impl of the ImmutableRef interface that doesn't include all the crap that's leaking tons of memory
I'm going to start by rm'ing that global var in contenthash entirely
this sounds audacious from where i'm sitting, didn't even seem like an option to me XD --- but i like it, cut off the problem at the root
descHandlers ending up there happens in a bunch of codepaths in our own code and also internal in buildkit
lol yeah i wish i had read this before running around via grep, "a bunch" is an understatement, it's like central to how ref works and theninterface Mountableis everywhere too.
this sounds audacious from where i'm sitting, didn't even seem like an option to me XD --- but i like it, cut off the problem at the root
Sadly at least on my first attempt this didn't work. I'm guessing that's just one root where cache refs leak... Doing another round of core dump inspection w/ that fix in place to see what else is happening
~~in your notes, 01bfd1b0/40001D6880, is that the end of the chain? like corresponding to 40001d6880 unk3456 , and your p invocations are climbing down back towards it to find what's going on?~~did viewcore just give you that one way it was reachable? (admittedly i don't know how to read that output and the docs are garbage) (also realized after hitting send that 40001d6880=40001D6880)
the fX bits are gibberish lmao, that's my main complaint about the output, what is f0?
Yeah exactly, that's the start of the {"ClientID"... string I found in hexdump -C <corefile> | less, it's the bottom and something that leaked. So the notes are me starting at the top and figuring out in delve the chain between
no clue lol
if I had to fathom a guess might be "f as in field", it could be that it's in a struct and telling us the byte offset in the struct? like f48 means "the 48th byte offset in the struct"?
f0 is the beginning of the struct, etc.
reachable: find path from root to an object says the code, but path like just the first one it can find? or path like we didn't pluralize "paths" correctly lmao
Oh yeah, that I also don't know about. I feel like it must just be one path because conceptually there must be multiple paths possible, but it seems to only ever show one
yeah looking at the impl it's a BFS from target object to root that finds the root, prints it, then unwheels the path it found there via a map of stored depths
so it only finds the shortest path to root basically
you could ostensibly change the done logic fairly easily to get all of paths to root https://cs.opensource.google/go/x/debug/+/bb5bb248:cmd/viewcore/main.go
i started reading the code with the intent to figure out wtf fX is, and you're definitely correct about the offsets btw
Yeah I'm getting to the point of possibly patching viewcore to make this a little easier...
Interestingly though, after rm'ing that global from contenthash the leak still exists but I can't repro my previous technique; when I ask viewcore why the object is reachable it says it's not found.
I have a mildly educated guess that:
- Previously this worked because the root was a global var and thus easy for viewcore to find
viewcorebugs out when trying to find things that still exist due to the root being in a stack of a running goroutine- it prints a bunch of errors at start about giving up on a bunch of stacks due to obscure errors after a few frames
- There's a bunch of lines in the relevant code for walking stacks that suggest it's hardcoded for amd64 (e.g. here and some other places)
- So, it's worth a try to replicate all this on an amd64 machine (mine is arm64) and see if I get better luck actually finding everything in the stacks. Spinning up an ec2 instance....
Okay seems like the magic formula for getting viewcore to have (seemingly) complete information is:
- run on
amd64 - get the core dump via sending
SIGABRTto the engine,gcorecreated ones will load w/ no error but most of the heap will be marked as unreachable and gc'able even though it's not (???)
With that you don't need any shenanigans like I was doing before, viewcore histogram and viewcore objects just say "here's all the addresses of all the objects of each type".
I found a couple of other sources of leaks, more buildkit related things that I'm patching in our fork. I can currently get the RSS down from 2+GB while tests are running to ~400MB after ctrl-C (whereas before it barely went down if at all). There does seem to be at least a little bit leaking left (e.g. 2MB of log ring vs. hundreds of MBs before), so I'll see if it's obvious. But think we should be able to get it for the release on wed
What does rss mean in this context?
resident set size, it's the amount of space a process is currently occupying in memory (what's actually being used and has pages allocated in RAM, not including virtual memory that's not been allocated a page somewhere, or is in swap/etc)
Okay took a while to sort through all my various attempts and find what was strictly necessary to patch in buildkit, but PR to our buildkit fork here: https://github.com/dagger/buildkit/pull/4
Interestingly buildkit uses SetFinalizer in 3 places and all 3 seem to be broken and result in unfreeable memory, as in running TestModule for a couple of minutes accumulates 1+GB of unfreeable garbage in the heap.
There's only one change for dagger, which is to just rm the dagger client field from buildkitTraceProvider since it's not used and seemed to be making the leaks worse (what @late moat pointed out earlier). So may as well rm it for if/when buildkit leaks stuff again. Sending that PR out in a min based on that dagger/buildkit commit to double check tests
lmao the go 24 finalizer release note is haunting us
cc @lone forum on the buildkit changes, would appreciate a double check the changes don't seem too risky
pffff, i was looking at them right now lol
who knows lol
like this feels terrifying for some reason
Ya, exactly...
hm. how do you feel about the concept of pushing this out to v0.15.4.
Yeah, fwiw we seem to handle those reasonably in our code https://github.com/sipsma/dagger/blob/bc8146a76ed55513c82d208850dd9f9a2073d5b9/engine/buildkit/ref.go#L303-L303
v0.15.3 has the fix that makes it no worse than before
i guess. there's no harm in shipping
sgtm, more baking time would be nice
huh, it seems mildly spooky that buildkit was using SetFinalizer in the first place
yeah, lol, wtf
i always assumed you needed a Rob Pike Certificate of Authority or something
Yeah, and trying to discern why is probably impossible
so for the resolver i think that's a little odd
those should be safe to remove
eventually the parent context should get cancelled, and we'll clean it up
assuming containerd does reasonable things there
the other one i have no idea, but also i've never seen that leak warning
though. not really sure how it gets cleaned up at all lol.
ah the answer is magical interface
Yeah I wasn't able to figure out exactly how any of those things w/ Finalizers actually connected to the leaked stuff we see in pprof.
Unfortunately, viewcore doesn't seem to include anything in the heap that's unreferenced, even if it still exists due to being unfreeable via finalizers. It tells you the amount of garbage the exists (just based on arithmetic, looking at its code) but no details on it. So all I can really say is that when all 3 finalizers are rm'd RSS drops after clients disconnect and garbage isn't in the heap anymore.
It's also just really really difficult to think through manually because SetFinalizer is so tricky and the finalizers being used are on extremely convoluted structs and/or callbacks (which then pull in anything they pick up from the closure, which gets even more confusing)
i guess the worst case is that we'd see long-running engines end up with refs that aren't ever free-d - but this seems like it was always a possibility anyways, if it was leaking anyways, and the engine died at the right point
(side note - i would love to have a way of tracking metrics like this over time, being able to track how things like disk usage / ref counts change after an update would be wonderful)
but yea, let's go for it
Oh right, the other thing is we are for sure leaking refs. If you run TestModule for a bit, then ./hack/with-dev ./bin/dagger core engine local-cache prune you will still see "active" records left in ./hack/with-dev ./bin/dagger core engine local-cache entry-set entries
Last week I spent a day trying to track it down, has something to do with exec mounts, but reached the point where I was drawing diagrams for ref logic in the wrappers the solver pkg used and remembered it's more worth our time to move off of all this than understand it lol