#test-module-runtimes real bad

1 messages · Page 1 of 1 (latest)

lone forum
#

the first bad one is "ensure cache volumes are namespaed"

#

but "fixup dependabot config schema violation" is still fast

#

which kind of indicates a level of flakiness maybe?

#

the two candidates potentially are the cache namespacing (would definitely explain it) and maybe the custom init (which might have some very subtle performance bug)

late moat
#

lmao discord giving us a big pic of @white latch

lone forum
#

huh my browser really doesn't like any of these traces

#

it seems pretty convinced that the trace never ends as well

late moat
lone forum
#

gonna ping @vocal scaffold as well just in case this is a known issue 😢

late moat
#

you're talking about the trace showing green in GHA, but then you open it in v3 and random spans are pending, yeah?

lone forum
#

yes

#

exactly that 😄

vocal scaffold
#

got a trace link?

lone forum
#

this is what i'm trying to get open

late moat
#

i'm out on a branch where i've been mildly fucking with spans and inserting printlns in bad places so i assumed it was my changes

lone forum
#
runtime: out of memory: cannot allocate 4194304-byte block (2134278144 in use) rrweb-plugin-console-record.js:2447:17
fatal error: out of memory
vocal scaffold
#

ah yep, the 2gb memory limit rears its ugly head

#

probably some busted assumptions about memory usage since I made the change to send all the data from backend => frontend when the trace ends

#

will look into it soon

#

for now, you can use v2

lone forum
#

hm okay i have a suspicion related to cache volumes

#

so the go sdk creates cache volumes

#

but we try and namespace them based on what "module" we're inside of

#

i wonder if module X has a go dependency Y - I wonder if this cache module is namespaced as part of Y

#

hm that would be very surprising actually.

#

that doesn't seem like it's possible 😛

#

i do wonder if there's something on the code path of using module-sdks though

#

because of the fact we see the slow-down in suites that explicitly test a lot of those

lone forum
#

cacheVolume in python-sdk , . -> modgobuildcache

diff --git a/core/schema/cache.go b/core/schema/cache.go
index c5c593b73..a838ad534 100644
--- a/core/schema/cache.go
+++ b/core/schema/cache.go
@@ -3,6 +3,7 @@ package schema
 import (
     "context"
     "errors"
+    "fmt"
 
     "github.com/dagger/dagger/core"
     "github.com/dagger/dagger/dagql"
@@ -49,6 +50,19 @@ func (s *cacheSchema) cacheVolume(ctx context.Context, parent dagql.Instance[*co
     if err != nil && !errors.Is(err, core.ErrNoCurrentModule) {
         return inst, err
     }
+    if m == nil {
+        fmt.Println("cacheVolume in _ -> ", args.Key)
+    } else {
+        name, err := m.Source.Self.ModuleName(ctx)
+        if err != nil {
+            panic(err)
+        }
+        symbolic, err := m.Source.Self.Symbolic()
+        if err != nil {
+            panic(err)
+        }
+        fmt.Println("cacheVolume in", name, ",", symbolic, "-> ", args.Key)
+    }
 
     namespaceKey := ""
     if m != nil {
#

the python-sdk module never creates the modgobuildcache volume - that's create by the builtin go sdk

#

(in a basic setup of having a python module call a go module)

#

this definitely doesn't seem right - it means that somehow we're ending up with distinct cache volumes where we should be getting the same ones

#

but also not sure how this would lead to a x2 slowdown 🤔

#

(heading off for the evening, but will pick this up again tomorrow, definitely has performance regressions vibes)

late moat
#

oh man, trying to run dagger call test specific --run="TestGo|TestPython|TestTypescript|TestElixir|TestPHP" locally... it chokes soooooooooo hard with 8g of memory

late moat
#

just the 24 tests in module_typescript_test.go take forever, saturate 8 CPUs, and over 12G of RAM (grain of salt on the ram obviously, htop)

late moat
#

loool sweet lord it even gets 137'd on a 16g vm

vocal scaffold
lone forum
#

gonna try reverting the init change to see if that does it

late moat
#

playing around with the typescript module runtime tests, there does appear to be quite a lot of npm+pnpm+corepack+bun diskrw spikes.... but i'm also getting clean 6 minute runs from doing go test on my host instead of dagger call test specific

#

admittedly this is a shot in the dark, bc i'm not fully following what was added to le dumb-init copy-over--- but is it plausible that that is behaving badly specifically when we run engine-in-engine?

white latch
#

(continuing conversation from meeting this morning) The biggest contributor to the OOM's I was getting in https://github.com/dagger/dagger/pull/9204 was from that PR specifically (something incredibly dumb around buildkit op unmarshaling). But after fixing that locally I'm still getting some crazy RSS in the engine, just barely managing to avoid an OOM. I suspect I started seeing those consistent OOMs after rebasing on main, which combined with the pre-existing LLB unmarshal mem usage sent everything over the edge.

Either way, looking at pprof heap usage shows the worst offender now being:

 1539.11MB 28.80% 28.80%  1539.11MB 28.80%  go.opentelemetry.io/otel/sdk/log.newRing (inline)

Which is a new player I haven't seen in this game before... Not sure if I just haven't noticed it previously or if some otel dep change introduced it in the last few weeks? Going back in commits to see

#

Actually just realized we haven't bumped otel i na while so probably not that...

late moat
#

man you beat me to the heap dump result by 20 minutes after me spending all afternoon tryna get to that pprof port lolsob

#

similar result out the gate on my end, haven't gotten the timing right to where it shows 1.5G yet

white latch
#

Yeah also interesting is that switching from inuse_space to alloc_space shows this behemoth, 133GB over a few mins of running TestModule lol

#

(that's running on main now, not my PR)

#

@vocal scaffold I'm gonna keep spelunking but just checking if you know of any tweaks we made to otel logging recently. Might be a totally innocent looking change that somehow accidentally resulted in us calling some otel func way more than before (speculating)

vocal scaffold
#

hmmmm i don't think so. is it possible this has been around for a while?

late moat
#

yes, but it also got marginally worse on 1/7

white latch
white latch
white latch
#

Yeah I see the same thing back to v0.14.0, so I guess it has been there for a while at least. I've definitely looked at heap profiles since then but probably not while running all of TestModule specifically... either way I'll open an issue in linear to fix this since it's kind of crazy in terms of number of allocations but probably not the culprit we were looking for yet

vocal scaffold
#

curious what the actual logs are that are going down this path - I guess stdout/stderr from things running, going into the OTel/clientdb pipeline?

white latch
#

Another weird thing is that when I disconnect everything from the engine, hit the manual debug/gc endpoint + give it a sec, the memory allocated by log.newRing never disappears, so it's somehow leaking seemingly. But goroutine dumps don't show anything left that seems like it could plausibly still be referencing it. So idk what could be happening other than that memory ending up in a global map/pool/thing somewhere

#

And sadly the current state of go tooling doesn't really answer the question of "what is still referencing this memory", which is a real shame since it's almost always the next question you ask after you see a heap profile

white latch
#

Confirmed the above a genuine leak (which is a little tricky since it's hard to tell the diff between a leak and the GC just choosing to not free memory from sync.Pools and such), it seems pretty bad actually: https://linear.app/dagger/issue/DEV-4922/memory-leak-excessive-allocations-in-otel-log-code

Looks like over a GB is leaked over the course of just running TestModule.

Mention this in that issue, but apparently the go team is working on something that does exactly what we want here (take a core dump and then tell you exactly why a given object is still considered referenced by the GC), but sadly it doesn't seem to quite work yet despite jumping through a million hoops in desperation 😢 https://pkg.go.dev/golang.org/x/debug/cmd/viewcore

#

@late moat @vocal scaffold would either or both of you have time to look into that issue?

vocal scaffold
#

hmm... seems like it at least tries to; we shut down all client log processors when the session goes away

#

I would still expect to see high alloc_space here, since we unquestionablly alloc a queue of size 2048 for each client, but inuse_space shouldn't be leaking for sure

#

maybe there's a dangling reference somewhere tricky preventing GC

#

is it possible that all these clients accumulate before the session goes away? Like all tests are actually running many clients within one session or something? (I don't think so, right?)

late moat
#

do the clients (or sessions) have any concept of nesting on the server side? like a bunch of the module tests use a dagger binary inside of a container created by a dagger.Connect client

white latch
#

The important thing is that we aren't leaking goroutines. That's what I'd expect to see in an obvious leak of something related to client/session state. There are a handful of goroutines (besides the standard go runtime ones) still around once everything disconnects (no tests running) but they are all what you'd expect in terms of listeners/scheduler loops/etc. None of them had any obvious way of holding state that referenced the Value in these log ring things.

I also saw that the sessions map in the engine server is empty, so no obvious leaks there.

I guess the possibilities in my mind are:

  1. Some sneaky leak somewhere in the engine server or maybe buildkit. I do see that the cumulative mem usage for a few other parts of sessions/clients/dagql classes seems to stick around, but nothing in flat usage.
  2. Some global var literally anywhere in our whole dependency tree that's holding onto state
  3. An outside possibility but wouldn't rule it out: the OTEL libs do some crazy magic with unsafe, i.e. getting the *byte underlying a string and then reconstructing new strings with it (basically turning what's supposed to be immutable into more or less a C string): https://github.com/open-telemetry/opentelemetry-go/blob/cbc3b6a5bb245ccce687559b92c135d680c5699b/log/keyvalue.go#L69-L70
    • Even if this doesn't result in a leak, I wouldn't be shocked if it somehow confuses the tooling around all this and means it's reporting the source of the allocation incorrectly? </speculation>
late moat
#

other thing about leaks: it's really not guaranteed at all that this leak is related to poor performance on the module-runtime tests

#

also @white latch , have you been pulling heap dumps from ./hack/dev engines? or dind dagger call test engines? both?

vocal scaffold
#

also curious what errors you got out of viewcore if you have them handy, I can take a swing at it on my machine (low odds of working but might as well)

#

besides that I figure I can use runtime.SetFinalizer on some key things to see if they're ever GCed

late moat
#

upgrade to 1.24 preview and you can runtime.AddCleanup instead of ye olden finalizers

vocal scaffold
#

lol, thought about it but hopefully a simple println is easy enough to do with the OG api

white latch
white latch
white latch
white latch
# vocal scaffold also curious what errors you got out of `viewcore` if you have them handy, I can...

Okay to get a core dump, there is potentially a pretty easy path by installing gdb and then use sudo gcore <pid of engine>. However, that's from a live process and somewhat outside the tooling go says it explicitly supports, so when I started hitting errors later I decided to try to do it in the more explicitly doc'd way. Possible this is irrelevant.

Either way, to get the core dump in the way go docs say is supported:

Add these lines to the engine startup

ulimit -c unlimited
export GOTRACEBACK=crash

Change the ./hack/dev engine to have an init instead of being pid 1 (no clue why I had to do it, was a random shot in the dark that happened to work; pid 1 is weird). Can append --init here

Start the engine, run stuff. To get the core dump run docker exec -it dagger-engine.dev kill -ABRT 7

  • docker kill -s SIGABRT dagger-engine.dev should work, but doesn't 🤷‍♂️

To extra the core dump first run docker diff dagger-engine.dev | grep core, find the core file (should be under /). Then run docker cp dagger-engine.dev:<path to core> .

Also copy out the dagger-engine binary with docker cp dagger-engine.dev:/usr/local/bin/dagger-engine .

Then after installing viewcore, the command is like go run golang.org/x/debug/cmd/viewcore <core file path>--exe <dagger-engine exe path>.

I get errors at start about giving up on tracing some parts of the core dump. Also usually I get a panic when I run objects, but if I run it again it works more but there's a lot of types marked as unknown (and no occurences of anything interesting from otel/log pkgs that I saw). I basically don't know if the information is complete/correct.

#

I'll dump this in the linear issue as a comment for posterity since hopefully soon enough viewcore actually works and this will be relevant again

white latch
#

Oh I forget you also need to re-enable debugging symbols, just comment out this line

vocal scaffold
#

so far SetFinalizer 10-0 Alex - can't seem to get it to run even after forcing GC and even with a simple example:

    slog.Warn("!!! SETTING FINALIZER")
    type t struct{}
    val := &t{}
    runtime.SetFinalizer(val, func(v *t) {
        slog.Warn("!!! IT'S FINALIZER TIME !!!", "tag", "welp", "type", fmt.Sprintf("%T", v))
    })

will keep poking after lunch

white latch
vocal scaffold
#

yeah i yoinked your flags from the linear issue, still no cigar

white latch
vocal scaffold
#

oh, in that example it also doesn't help that it's struct{} lol - tinkering in the playground, and i need to add at least one field for it to finalize

vocal scaffold
#

alright - got finalizers working for my canary example, but I'm not seeing them run for the "real" values which is ... concerning?

white latch
vocal scaffold
#
    slog.Warn("!!! SETTING FINALIZER")
    type t struct{ a [1024000]byte }
    snitchGC(&t{}, "canary")
    // runtime.SetFinalizer(&t{}, func(v *t) {
    //     slog.Warn("!!! IT'S FINALIZER TIME !!!", "tag", "welp", "type", fmt.Sprintf("%T", v))
    // })

    snitchGC(client.loggerProvider, client.clientID)
    snitchGC(client, client.clientID)

I see the finalizer log for the canary, but not for the client or loggerProvider. Also set finalizers on all the batch processors, and don't see them run either

white latch
#

Interesting... In the heap dumps I collected after tests ended and nothing was happening in the engine showed no "flat" memory usage from the engine/server package, only some cumulative due to other packages being called by that package having memory laying around still

#

If the client itself was never GC'd I would have expected >0 flat usage

#

I could be misunderstanding the tooling though

vocal scaffold
#

yeah fair - at this point it's most likely I'm holding SetFinalizer wrong

#

but, it worked for the canary thinkies

vocal scaffold
#

Grr... it seems like using SetFinalizer was causing things to not finalize. I added a waste [10485760]byte to *daggerClient with a rand.Read(client.waste[:]) for good measure to make leaks more obvious, and I could see each client leaking with a SetFinalizer installed, using just while dagger core version; end to test. But once I removed the SetFinalizer, it still "leaked" but plateaued at a smaller amount of space, and once I stopped the loop everything eventually got fully GC'd. But, after the final GC, even the newRing space was gone! So it's hard to say if anything's actually going wrong here, or if it's the GC just being mercurial/lazy.

vocal scaffold
#

well this is cursed: grew impatient and decided to run multiple dagger-dev core version loops in parallel, and ... sometimes i get v0.15.2 back??? (along with a weird error)

❯ while dagger-dev core version; end
# command-line-arguments
open ./mage_output_file.go: no such file or directory
Error: error compiling magefiles
✔ connect 0.3s
✔ loading type definitions 0.6s
✔ parsing command line arguments 0.0s

✔ version: String! 0.0s

v0.15.2

✔ connect 0.4s
✔ loading type definitions 1.2s
✔ parsing command line arguments 0.0s

✔ version: String! 0.0s

v0.15.3-010101000000-dev-dfcd66e83e45
#

guessing due to parallelism somehow

white latch
#

Yeah if something crashes or I ctrl-C at the wrong time I occasionally have to manually delete that ./mage_output_file.go for some reason. Mage really doesn't need to exist anymore. Not sure if the version thing is related, also sounds possibly like https://github.com/dagger/dagger/pull/9400

white latch
vocal scaffold
vocal scaffold
late moat
#

not bc it's useful information but because maybe i'll be slightly less annoyed

vocal scaffold
#

this SetFinalizer behavior is just baffling - I even pulled out the func so it's not closing over the value or anything silly like that, and still the whole client leaks, and stays leaked even after runtime.GC()

#

i guess i'll try 1.24 after all (...maybe - looks like we use Wolfi for this now which doesn't have it yet, possible dead end anyway. edit: nvm wasnt hard)

#

The new runtime.AddCleanup function is a finalization mechanism that is more flexible, more efficient, and less error-prone than runtime.SetFinalizer. AddCleanup attaches a cleanup function to an object that will run once the object is no longer reachable. However, unlike SetFinalizer, multiple cleanups may be attached to a single object, cleanups may be attached to interior pointers, cleanups do not generally cause leaks when objects form a cycle, and cleanups do not delay the freeing of an object or objects it points to. New code should prefer AddCleanup over SetFinalizer.
"cleanups do not generally cause leaks when objects form a cycle" sounds especially relevant - client/session are cyclic

#

yep. runtime.AddCleanup works facepalm

vocal scaffold
#

ok, and this time after two calls to the /debug/gc endpoint, all the clients (and newRing) are gone - saw 4 clients get GC'd with each call.

#

I'll run TestModule now, rebooted a bit ago so hopefully the VM doesn't fry

vocal scaffold
#

(i do indeed repro an OOM there)

vocal scaffold
#

Stopping for now, but here's the last flamegraph I caught before OOM - still seems to imply *daggerClients are sticking around, which explains the newRing "leaking" since it's part of it. Not really satisfied with that answer though. It seems surprising that Go's GC would be at fault, but it seems to at least exhibit a behavior where runtime.GC() doesn't always clear everything - sometimes it takes multiple passes. Maybe if you're about to OOM, the fact that it leaves objects for a later pass results in it OOMing at the threshold? Purely speculative though.

Tomorrow I'll try setting GOMEMLIMIT to something like 1GB and see if it's able to respect it - right now it's probably not doing anything because I'm forcing it to always be surpassed with that 10MB array?

Pushed changes to https://github.com/dagger/dagger/compare/main...vito:dagger:oom-investigate?expand=1

lone forum
#

😱 sigh i want to get rid of mage

late moat
#

knowing what i know now i kinda wanna give killing mage another shot

#

@lone forum it was you, right, that showed some docker socket trick that'd let us plop the dev engine into docker from a dagger build? that was one of the problems that made me abandon my first attempt

lone forum
#

I started hacking on maybe having a dagger shell script to do this earlier

#

But then came across legit shell bugs

white latch
# vocal scaffold Stopping for now, but here's the last flamegraph I caught before OOM - still see...

I still would expect there to be memory usage from the engine/server package in that case though, was the flat usage of that zero for you too? I'm not sure the extent to which "zero really means zero" (i.e. is the heap snapshot not 100% complete?), but if it does then all the data I got was saying that engine/server showed up just because it was in the call stack that initially allocated the ring, even though everything created directly by engine/server was all gone.

#

I also still got this same behavior even when doing things like waiting 10m after all clients disconnected, running a few debug/gc calls throughout. I agree we can't rule out the go GC just deciding not to collect anything in that case but in combation with all the GOGC/GOMEMLIMIT stuff it seems like it must just be utterly broken in some way

white latch
#

Been looking into this again because I am still getting what appears to be OOM kills on https://github.com/dagger/dagger/pull/9204 despite some improvements and pprof heaps locally show similar sort of leaks (otel log related the worst but scattered other stuff)...

Reaching my wits end. I tried building w/ 1.24 and using viewcore but got same sort of errors as before. Did a ton of more manual digging w/ delve and the core dump but couldn't find anything leaked (can't be exhaustive there though).

My only remaining lead is that I saw buildkit's cache refs seem to be leaking. Running some integ tests creates unpruneable cache entries. That certainly should lead to some extra memory, but I can't at all see the connection to the packages that show up in the heap dump and what would be stored there, but will keep pursuing it tomorrow.

If that fails too, might need to just keep going back further in our commit history to really figure out when this started happening so it can be bisected (manually, very hard to automate I think)

#

cc @here, any other help/eyes would be appreciated. Kind of worried how much of a time sink this becoming and that something obvious is lurking somewhere

lone forum
#

not quite managing to repro the oom kills? what technique are people using to get them?
i do spot some memory leak things though, i am seeing the same log.newRing stuff as pointed out above

#

but also

#

this isn't the biggest thing, but it's odd to me that even after having no clients, we somehow seem to keep hold of MkFile unmarshalling

#

this is on @white latch's pr, that supposedly has a fix for this?

white latch
#

The fix I had improved that situation w MkFile but didn’t get rid of it

#

Which is why Im still in a state of misery trying to figure out wtf is going on there

#

But yeah, the log ring thing is what seems to grow the most and ultimately trigger the OOM

#

It outpaces everything else, but I feel like it’s probably all connected somehow

#

There’s instructions in the linear issue I linked above to repro

#

My best guess is that cache refs are becoming unpruneable and refs to all these various leaking things are somehow pointed to via the buildkit cache refs

#

Another thing that occurred to me was if any dep is using SetFinalizer it could trigger the situation vito mentioned where circular objects become un-gc-able. I tried rm’ing all of buildkits usage of SetFinalizer on a hunch but didn’t make a difference

#

Could be any dep though

lone forum
#

i wonder if maybe we just up-ped the size of logs

#

i picked this "fast" run

#

and this slow run

#

the call.log for the fast one is 50.9 MB

#

but for the slow one, it's 129.9 MB

#

so potentially some change in that window just increased the amount of logging we're doing

#

and exposed a pre-existing issue - since i'm struggling to see how we could possibly have changed some memory-related thing in that windows

#

so something there is causing some massive jump in the amount of logs we produce

#

i think that's unexpected, so i'm going to dive in there

#

🤔 I have a thought. I think some CacheVolumes getting returned are impure. Where we do the Select with the elective purity is only in the case of creating a cache volume from another module.

#

But creating it from outside of a module (e.g. how we do the go stuff) I think means we don't ever rewrite the ID?

lone forum
#

cool, i think i can confirm this is what's going on: https://github.com/dagger/dagger/pull/9428
in the above patch i rework the NewInstanceForCurrentID calls to only apply to pure calls (and re-call itself as pure if it's not)
this reduces the log sizes, and seems to go back to previous performance

#

@white latch I think this is probably fixable now with your ID modification work? without that, not sure what a reasonable solution is

lone forum
vocal scaffold
#

if that is the root cause, I wonder what sort of metrics dashboard might have surfaced it sooner. Seems pretty subtle: no behavior changes, some queries just run more often?

lone forum
#

yea, was thinking about this - honestly, having some idea of log size exports would be useful.
maybe having some info on trace size in cloud would be good?

#

that would have made us realize, "hey this doubled in size at some point, not great"

vocal scaffold
#

what was all the logging coming from?

lone forum
#

i did also have a slightly unrelated thought about OTEL stuff in general - i guess it's clear that each span has a cost that builds up when you have as many as we do.
i wonder if we should try and tidy up what gets exported and span-ified by default. the smaller the traces, the quicker cloud can be

lone forum
#

i think it was stuff being re-evaluated

#

my hunch is:

  • the cache volume impurity meant things were being recomputed - this caused the slowdown
  • more recomputations = more logs causing log size to boom
  • more logs = more leaks (because of unknown weird OTELy issue)
white latch
#

Thank you @lone forum ❤️ , will see if I can fix the cache volume purity in my PR. I'll also see if my hunch about bk cache refs leaking is right since it would be nice to squash this entirely if feasible, but will timebox that.

late moat
white latch
#

So at first it seemed like fixing the cache volume thing in my PR had no effect, but much later realized I just goofed something and the new cache key stuff works well for it and avoids the previous problems 🎉

In between those two points I went back to seeing if the buildkit cache ref leaking was behind the memory leaks since it seemed like there was no escaping it. We are indeed leaking lots of buildkit cache refs (Run a bunch of tests like TestModule, then run ./hack/with-dev ./bin/dagger core engine local-cache prune and after you'll still see a lot of refs sitting around "actively used" in ./hack/with-dev ./bin/dagger core engine local-cache entry-set entries).

I managed to trace where the leaked refs were originating from (exec op mount code) but then to figure out where they went I ended up deep in the buildkit solver 🤮 I then did some hacks to forcibly release all cache refs so bk's cache manager will empty it's references from memory, but that still didn't fix the memory leak. So that's just another issue.

I tried about 10,000 other shots in the dark at tracking down the leak but still no cigar, so giving up for now.

Then finally after figuring out my stupidity with the cache volume and fixing that, I got reasonable runs in CI to discover some legit failures not just from the engine dying. Most were easily but there's some remaining viztest failures...

  1. It looks like some custom spans are not showing up for module initialization. This is legit, will figure it out tomorrow
  2. Some of the changes I made to ID handling in dagql are resulting in default values showing up in telemetry. So e.g. you see in output withExec(args: ["cmd"], expect: SUCCESS)
    • This one I'm a bit more ambivalent about, in some ways it might be preferable? Maybe I'm just trying to convince myself 😅
white latch
#

Huh actually I misread the diff yesterday, in my PR we actually get more custom spans showing up in the TUI, which seems more correct? e.g. this one https://github.com/sipsma/dagger/blob/71a30c50c7674ed0f105616da67f8a0a6320c833/core/schema/module.go#L930-L930 and some similar. Not sure how, but 🤷‍♂️

Also, side note for later, my nvim config has a ton of extra linters enabled and some interesting ones from one called spancheck are showing up: span is unassigned, probable memory leak here and span.End is not called on all paths, possible memory leak here

No clue if it's a false alarm, but intriguing given our recent woes, cc @vocal scaffold in case it's obvious one way or another to you whether those are legit

vocal scaffold
#

that should be fine, the linter just isn't recognizing that it intentionally outlives the func

#

since the whole point of that func is to create/wrap spans

white latch
#

Okay cool, good to know

white latch
#

Either way, merged that PR so the test runtimes seem like they are back to their old baseline

vocal scaffold
#

in case y'all were as curious as I was about how many more logs/spans were generated:

trace id                         | spans  | logs
8f1d609a9e6e572f549fc847af362355 | 105927 | 444692
054d196b882a21f4e081e218521e54db | 291370 | 791546

so, yeah, if we had something displaying that it'd be super noticeable

white latch
#

Managed to figure the leak out finally. I basically grepped a core dump for human readable strings that seemed like they were contents of the stuff we were leaking, used hexdump to find the address in the core file, mapped that to the virtual memory address in the heap and then asked viewcore why that address was reachable and it actually output the chain of address refs, which while lacking type data for most of the nodes was just enough to manually poke around with in delve until I found it.

The problem is buildkit stores entire trace provider objects in the struct for every cache ref. That ends up including a ton of stuff, including our buildkitTracer object.

The root of the pointers I followed lead to this delightful global var https://github.com/moby/buildkit/blob/81d49f78197cd471bb92ed800a4873384389bfd4/cache/contenthash/checksum.go#L31 but I think there's likely other roots that lead to the same place.

I suspect we'll just need to patch this in our fork, it doesn't seem obvious atm how to fix it in another non-invasive way that would work upstream.

GitHub

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit - moby/buildkit

#

Also the log.ring stuff leaks here because there's attributes stored in spans which are sourced from that ring buffer (which is actually just a circular linked list that's entirely pre-allocated, and thus one attribute getting ref'd somewhere prevents the entire ring buffer from being free'd)

late moat
#

viewcore why that address was reachable and it actually output the chain of address refs, which while lacking type data for most of the nodes was just enough to manually poke around with in delve until I found it.
color me extremely impressed

lone forum
#

eyyy nice find ❤️

late moat
# white latch Managed to figure the leak out finally. I basically grepped a core dump for huma...

for reference, our bk tracer structs... client.ops seems potentially expensive to hold on to like this, too...

type buildkitTracer struct {
    embedded.Tracer
    provider *buildkitTraceProvider
    tracer   trace.Tracer
}

type buildkitTraceProvider struct {
    embedded.TracerProvider
    tp     trace.TracerProvider
    lp     *sdklog.LoggerProvider
    mp     *sdkmetric.MeterProvider
    client *Client
}

// Client is dagger's internal interface to buildkit APIs
type Client struct {
    *Opts

    closeCtx context.Context
    cancel   context.CancelCauseFunc
    closeMu  sync.RWMutex
    execMap  sync.Map

    ops   map[digest.Digest]opCtx
    opsmu sync.RWMutex
}

tryna figure out what a fix for this might look like... like on the one hand we could try to not pass our tracer through, but i imagine it needs to be passed through for trace continuity inside of bk. it'd be interesting to know if the tracers get reconstituted from cache, too, and whether that effects the downstream result...

in all of this, though, im struggling to go deeper because it's unclear to me where in the code entire trace providers get stored on cache refs... greps of both bk and d/d coming up empty rn

white latch
# late moat for reference, our bk tracer structs... client.ops seems potentially expensive t...

I'm working on some possible fixes, was gonna write up an explanation of the whole chain of pointers and details on everything as part of the commit but can now if helpful.

Quick version

What I found grepping in hexdump output for a core dump that looked like a leaked string related to what we saw in pprof:

01bfd1b0  7b 22 43 6c 69 65 6e 74  49 44 22 3a 22 76 38 73  |{"ClientID":"v8s|
01bfd1c0  75 74 31 33 61 63 79 6a  7a 6b 65 39 63 6a 7a 72  |ut13acyjzke9cjzr|
01bfd1d0  32 7a 65 62 36 38 22 2c  22 53 65 73 73 69 6f 6e  |2zeb68","Session|
01bfd1e0  49 44 22 3a 22 69 71 35  6e 72 6d 35 37 6c 6f 73  |ID":"iq5nrm57los|
01bfd1f0  69 68 64 76 34 30 7a 31  6c 61 75 6f 6b 6a 22 2c  |ihdv40z1lauokj",|
01bfd200  22 53 65 63 72 65 74 54  6f 6b 65 6e 22 3a 22 22  |"SecretToken":""|
01bfd210  2c 22 48 6f 73 74 6e 61  6d 65 22 3a 22 22 2c 22  |,"Hostname":"","|
01bfd220  43 6c 69 65 6e 74 53 74  61 62 6c 65 49 44 22 3a  |ClientStableID":|
01bfd230  22 22 2c 22 45 78 65 63  49 44 22 3a 22 6d 6c 32  |"","ExecID":"ml2|
01bfd240  30 63 74 7a 36 6d 36 6e  64 61 61 63 31 61 32 72  |0ctz6m6ndaac1a2r|
01bfd250  61 6b 38 36 73 35 22 2c  22 49 6e 74 65 72 6e 61  |ak86s5","Interna|
....

Load core in viewcore, run mappings to find heap and do arithmetic to map 01bfd1b0->40001D6880 (the addr in the heap in virtual memory to that string)

Run reachable 40001D6880 in viewcore (first time it always errors but if you run it again it works 🤷‍♂️ ).

(viewcore) reachable 40001D6880
github.com/moby/buildkit/cache/contenthash.defaultManager → 
4002a36660 github.com/moby/buildkit/cache/contenthash.cacheManager .lru → f0
4002bd52c0 unk32 f16 → f0
4002748bd0 unk48 f16 → f8
4042390000 unk896 f160 → f0
400cbf8a50 unk80 f8 → f0
400a634c80 unk80 f40 → f0
400a4af440 unk96 f32 → f0
4001f918c0 unk48 f16 → f0
4003a98e70 unk48 f16 → f0
4004732000 unk416 f136 → f0
40046ff440 unk64 f0 → f0
400625f458 unk24 f8 → f0
400197d400 unk512 f24 → f0
4004796aa0 unk80 f64 → f0
4002e91d40 unk288 f40 → f0
40056bf230 unk16 f8 → f0
40056fb590 unk24 f16 → f0
400399a900 unk64 f48 → f0
4001c55b00 unk128 f0 → f0
400297b360 unk160 f96 → f0
400078b410 unk48 f16 → f8
400b482000 unk10240 f104 → f0
400aa06d50 unk24 f8 → f0
401387e1c0 unk224 f16 → f0
401653d080 unk96 f48 → f0
400dd2ba00 unk512 f80 → f0
40116fbc20 unk48 f16 → f0
401cab4620 unk224 f160 → f0
4012651470 unk48 f8 → f0
4014457c40 unk32 f24 → f0
40126513e0 unk48 f16 → f0
40159404b0 unk24 f0 → f0
401c4a1320 unk144 f40 → f0
4006296c60 unk96 f24 → f0
4015ba2360 unk48 f32 → f0
4003a6cc90 unk24 f0 → f0
4003bbf430 unk16 f8 → f0
40039e0b00 unk64 f40 → f0
40022cac60 unk352 f32 → f0
400184fa40 unk224 f8 → f0
40022db320 unk288 f40 → f0
400275f3f0 unk16 f8 → f0
40013b3c08 unk24 f16 → f0
40022d7080 unk64 f48 → f0
400067e400 unk128 f0 → f0
400043ee60 unk160 f96 → f0
4000b3cde0 unk48 f16 → f8
4013c06000 unk5376 f40 → f0
4015d91d70 unk24 f16 → f0
400ad3e380 unk128 f88 → f0
4010effb60 unk48 f16 → f8
4004407800 unk3072 f288 → f0
401cd26000 unk144 f16 → f0
401c8e9bc0 unk48 f8 → f0
4014451d70 unk48 f16 → f0
400e920000 unk288 f136 → f0
40001d6880 unk3456
#

Most types are prefixed w/ unk since viewcore is buggy and can't figure them out, but if you open the core dump up in delve you can use the p command to print the structs and find each addr manually. This is a huge PITA still and sometimes you have to jump a few pointers forward since you end up with some that are internal to map datastructures and such. Thank god it at least could tell me the root type.

I didn't go all the way down the chain, just to the point I was satisfied I knew what was happening, but here's my annotated notes on the p command corresponding the various points in the chain:

(viewcore) reachable 40001D6880
github.com/moby/buildkit/cache/contenthash.defaultManager → 
4002a36660 github.com/moby/buildkit/cache/contenthash.cacheManager .lru → f0
4002bd52c0 unk32 f16 → f0
4002748bd0 unk48 f16 → f8
4042390000 unk896 f160 → f0
400cbf8a50 unk80 f8 → f0         p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"]
400a634c80 unk80 f40 → f0
400a4af440 unk96 f32 → f0     p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value
4001f918c0 unk48 f16 → f0
4003a98e70 unk48 f16 → f0
4004732000 unk416 f136 → f0
40046ff440 unk64 f0 → f0. p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"]
400625f458 unk24 f8 → f0
400197d400 unk512 f24 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p
4004796aa0 unk80 f64 → f0
4002e91d40 unk288 f40 → f0 p &contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.jobs BECAUSE FIRST STRUCT FIELD ACTUAL ADDRES IS FOR g
40056bf230 unk16 f8 → f0 p &contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span
40056fb590 unk24 f16 → f0
400399a900 unk64 f48 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span.tp
4001c55b00 unk128 f0 → f0 p contenthash.defaultManager.lru.items["7t62qv4iwc4x586dgl086omc7"].prev.Value.md.RefMetadata.descHandlers["sha256:608d1fc5862244876c9d43c1240e97d02d3fb2e878fdf220ace7b4f613241ddf"].Provider.p.Resolver.g.mspan.Span.tp.client
400297b360 unk160 f96 → f0
400078b410 unk48 f16 → f8
400b482000 unk10240 f104 → f0
400aa06d50 unk24 f8 → f0
401387e1c0 unk224 f16 → f0
401653d080 unk96 f48 → f0
400dd2ba00 unk512 f80 → f0
40116fbc20 unk48 f16 → f0
401cab4620 unk224 f160 → f0
4012651470 unk48 f8 → f0
4014457c40 unk32 f24 → f0
40126513e0 unk48 f16 → f0
40159404b0 unk24 f0 → f0
401c4a1320 unk144 f40 → f0
4006296c60 unk96 f24 → f0
4015ba2360 unk48 f32 → f0
4003a6cc90 unk24 f0 → f0
4003bbf430 unk16 f8 → f0
40039e0b00 unk64 f40 → f0
40022cac60 unk352 f32 → f0
400184fa40 unk224 f8 → f0
40022db320 unk288 f40 → f0
400275f3f0 unk16 f8 → f0
40013b3c08 unk24 f16 → f0
40022d7080 unk64 f48 → f0
400067e400 unk128 f0 → f0
400043ee60 unk160 f96 → f0
4000b3cde0 unk48 f16 → f8
4013c06000 unk5376 f40 → f0
4015d91d70 unk24 f16 → f0
400ad3e380 unk128 f88 → f0
4010effb60 unk48 f16 → f8
4004407800 unk3072 f288 → f0
401cd26000 unk144 f16 → f0
401c8e9bc0 unk48 f8 → f0
4014451d70 unk48 f16 → f0
400e920000 unk288 f136 → f0
40001d6880 unk3456
#
white latch
late moat
#

2 things:

  1. yeah, to me the chain makes sense, but doesn't directly imply a solution. are you thinking we try to move the provider-onwards or resolver-onwards parts of the chain to the stack? or patch bk to be more aggressive about cleaning up dirty=false lru items?

  2. ok im lying a bit that the chain was entirely followable for me in code on the first go round, i actually didn't make the RefMetadata.descHandlers jump... lemme poke around with that grep bc if there's an avoidable assignment to descHandlers bc that could also be a fix

white latch
#

yeah, to me the chain makes sense, but doesn't directly imply a solution. are you thinking we try to move the provider-onwards or resolver-onwards parts of the chain to the stack? or patch bk to be more aggressive about cleaning up dirty=false lru items?
I'm going to start by rm'ing that global var in contenthash entirely. It's supposed to be a perf optimization but I question if it's actually making any difference. If that fixes it entirely, I'll call it good enough. If not we may need something else that just avoids these pointers existing.

ok im lying a bit that the chain was entirely followable for me in code on the first go round, i actually didn't make the RefMetadata.descHandlers jump... lemme poke around with that grep bc if there's an avoidable assignment to descHandlers bc that could also be a fix
descHandlers ending up there happens in a bunch of codepaths in our own code and also internal in buildkit

The thing is, the contenthash pkg only needs some fields in the ref, not the whole thing. So another possibility is to somehow give it a canned impl of the ImmutableRef interface that doesn't include all the crap that's leaking tons of memory

late moat
#

I'm going to start by rm'ing that global var in contenthash entirely
this sounds audacious from where i'm sitting, didn't even seem like an option to me XD --- but i like it, cut off the problem at the root

descHandlers ending up there happens in a bunch of codepaths in our own code and also internal in buildkit
lol yeah i wish i had read this before running around via grep, "a bunch" is an understatement, it's like central to how ref works and then interface Mountable is everywhere too.

white latch
late moat
#

the fX bits are gibberish lmao, that's my main complaint about the output, what is f0?

white latch
late moat
#

lmao classic

white latch
#

if I had to fathom a guess might be "f as in field", it could be that it's in a struct and telling us the byte offset in the struct? like f48 means "the 48th byte offset in the struct"?

#

f0 is the beginning of the struct, etc.

late moat
#

reachable: find path from root to an object says the code, but path like just the first one it can find? or path like we didn't pluralize "paths" correctly lmao

white latch
#

Oh yeah, that I also don't know about. I feel like it must just be one path because conceptually there must be multiple paths possible, but it seems to only ever show one

late moat
#

so it only finds the shortest path to root basically

#

i started reading the code with the intent to figure out wtf fX is, and you're definitely correct about the offsets btw

white latch
# late moat you could ostensibly change the done logic fairly easily to get all of paths to ...

Yeah I'm getting to the point of possibly patching viewcore to make this a little easier...

Interestingly though, after rm'ing that global from contenthash the leak still exists but I can't repro my previous technique; when I ask viewcore why the object is reachable it says it's not found.

I have a mildly educated guess that:

  1. Previously this worked because the root was a global var and thus easy for viewcore to find
  2. viewcore bugs out when trying to find things that still exist due to the root being in a stack of a running goroutine
    • it prints a bunch of errors at start about giving up on a bunch of stacks due to obscure errors after a few frames
  3. There's a bunch of lines in the relevant code for walking stacks that suggest it's hardcoded for amd64 (e.g. here and some other places)
  4. So, it's worth a try to replicate all this on an amd64 machine (mine is arm64) and see if I get better luck actually finding everything in the stacks. Spinning up an ec2 instance....
white latch
#

Okay seems like the magic formula for getting viewcore to have (seemingly) complete information is:

  1. run on amd64
  2. get the core dump via sending SIGABRT to the engine, gcore created ones will load w/ no error but most of the heap will be marked as unreachable and gc'able even though it's not (???)

With that you don't need any shenanigans like I was doing before, viewcore histogram and viewcore objects just say "here's all the addresses of all the objects of each type".

I found a couple of other sources of leaks, more buildkit related things that I'm patching in our fork. I can currently get the RSS down from 2+GB while tests are running to ~400MB after ctrl-C (whereas before it barely went down if at all). There does seem to be at least a little bit leaking left (e.g. 2MB of log ring vs. hundreds of MBs before), so I'll see if it's obvious. But think we should be able to get it for the release on wed

late moat
lone forum
# late moat What does rss mean in this context?

resident set size, it's the amount of space a process is currently occupying in memory (what's actually being used and has pages allocated in RAM, not including virtual memory that's not been allocated a page somewhere, or is in swap/etc)

white latch
#

Okay took a while to sort through all my various attempts and find what was strictly necessary to patch in buildkit, but PR to our buildkit fork here: https://github.com/dagger/buildkit/pull/4

Interestingly buildkit uses SetFinalizer in 3 places and all 3 seem to be broken and result in unfreeable memory, as in running TestModule for a couple of minutes accumulates 1+GB of unfreeable garbage in the heap.

#

There's only one change for dagger, which is to just rm the dagger client field from buildkitTraceProvider since it's not used and seemed to be making the leaks worse (what @late moat pointed out earlier). So may as well rm it for if/when buildkit leaks stuff again. Sending that PR out in a min based on that dagger/buildkit commit to double check tests

late moat
white latch
lone forum
#

pffff, i was looking at them right now lol

#

who knows lol

#

like this feels terrifying for some reason

white latch
lone forum
#

hm. how do you feel about the concept of pushing this out to v0.15.4.

lone forum
#

i guess. there's no harm in shipping

white latch
vocal scaffold
#

huh, it seems mildly spooky that buildkit was using SetFinalizer in the first place

vocal scaffold
#

i always assumed you needed a Rob Pike Certificate of Authority or something

white latch
lone forum
#

so for the resolver i think that's a little odd

#

those should be safe to remove

#

eventually the parent context should get cancelled, and we'll clean it up

#

assuming containerd does reasonable things there

#

the other one i have no idea, but also i've never seen that leak warning

#

though. not really sure how it gets cleaned up at all lol.

white latch
#

Yeah I wasn't able to figure out exactly how any of those things w/ Finalizers actually connected to the leaked stuff we see in pprof.

Unfortunately, viewcore doesn't seem to include anything in the heap that's unreferenced, even if it still exists due to being unfreeable via finalizers. It tells you the amount of garbage the exists (just based on arithmetic, looking at its code) but no details on it. So all I can really say is that when all 3 finalizers are rm'd RSS drops after clients disconnect and garbage isn't in the heap anymore.

It's also just really really difficult to think through manually because SetFinalizer is so tricky and the finalizers being used are on extremely convoluted structs and/or callbacks (which then pull in anything they pick up from the closure, which gets even more confusing)

lone forum
#

i guess the worst case is that we'd see long-running engines end up with refs that aren't ever free-d - but this seems like it was always a possibility anyways, if it was leaking anyways, and the engine died at the right point

#

(side note - i would love to have a way of tracking metrics like this over time, being able to track how things like disk usage / ref counts change after an update would be wonderful)

#

but yea, let's go for it

white latch
# lone forum i guess the worst case is that we'd see long-running engines end up with refs th...

Oh right, the other thing is we are for sure leaking refs. If you run TestModule for a bit, then ./hack/with-dev ./bin/dagger core engine local-cache prune you will still see "active" records left in ./hack/with-dev ./bin/dagger core engine local-cache entry-set entries

Last week I spent a day trying to track it down, has something to do with exec mounts, but reached the point where I was drawing diagrams for ref logic in the wrappers the solver pkg used and remembered it's more worth our time to move off of all this than understand it lol

late moat
#

me any time work makes me draw diagrams

lone forum
#

Approved both prs, but imo if we could hold off meeting the dagger/dagger one just until after the release tomorrow, that would give me a lot more confidence 🙏

#

Letting it soak for a bit would be pretty good, given the subtlety of some of those set finalizers