#buildkit/progrock/scheduling horrors
1 messages ยท Page 1 of 1 (latest)

here's a de-duped stack dump of when I've seen the nightmare deadlock from upgrading buildkit
i've fixed about 5 deadlocks in the last week, and found all sorts of bizarre inconsistencies in buildkit, most recently: https://github.com/moby/buildkit/pull/4587
and yet, it still occurs very very very easily
it's kind of funny that you've been debugging a deadlock during shutdown, while i've been debugging the lack of draining, suggesting we don't wait long enough synchronize on shutdown 
there's something in here that causes an issue: https://github.com/moby/buildkit/compare/c4f191410a411d7b7cb31426e2bd17fae305acf1...9d84cdcda004a471834309185730d6e7fc9e95c0
i'm not quite sure what it is, bisecting is kind of not a great option given how miserable it is to even repro consistently
Also, I took this process trace in the engine container when I noticed the stack trace - I suspect the /runtimes are modules that are connected over grpc to the progrock socket, which is why the server shutdown is an issue.
I dumped some of those, and they just seemed to be hanging on sending some data to the engine, so I suspect the deadlock isn't there, it's likely to actually be in the engine.
My current list of ideas:
- There's a subtle change in WithContext/TimeoutCause that for some reason means context.Canceled is not correctly propagated - but https://github.com/moby/buildkit/pull/4587 is the only case I found, and the issue still kind of occurs with this
- I wonder if somehow something subtle has changed with the runc handling - maybe a process should be being SIG(SOMETHING)ed, but for a reason this doesn't happen, and so it hangs around and locks everything up.
- Maybe there's a new race in buildkit somewhere - a subprocess somewhere has a race (I see this mostly in testimportant which has
-race), it crashes, and then some of our error handling means that we don't actually get a message for this anywhere. Seems unlikely, since data raced programs should exit with status code 66 by default. - Maybe this is OTEL again (somehow). OTEL was disabled for a super long time for us, and now the versioning means this kicks in again.
- There could still be something insanely wrong with the idea of progress writer chaining - but the scheduler does appear perfectly happy in all the stack traces I've managed to take.
I have a "sort of reproducer":
#!/bin/bash
set -x
for i in {0..10}; do
set +x
echo "--------------------------------------------------------------------------------"
echo "Doing $i..."
echo "--------------------------------------------------------------------------------"
echo
set -x
docker rm -f $(docker ps -aq)
docker volume rm $(docker volume ls -q)
./hack/make engine:testimportant
done
with the bad patches, the logs will just start hanging ๐ข
honestly, would not be surprised if this investigation ends up somewhere in the middle lol ๐
@amber dock can you confirm I'm thinking about this right:
- the progrock forwarder hosts the unix socket
- this gets exposed to nested clients, so we have a way of propagating that back
- the client disconnects, calls gracefulstop on the progrock socket
- but somehow one of the nested clients is still connected
how is this even possible? how are nested clients (through withexecs I'm assuming) still able to be running here? shouldn't their context have been cancelled, which should kill these execs off?
where do you see the gracefulstop on step 3? i only see a CloseAndRecv.
btw one thing I've noticed is that during engine/client.Client.Close the progrock RPC server actually gets interrupted and exits with context canceled, which I suspect is where we're losing updates
i'm gonna need to stare at this code for a while longer to really confirm, it's a bit of a rube-goldberg machine. ๐ตโ๐ซ
just about to type the same thing lol
the multiple defers are ... fun? also it's kinda unclear what contexts manage anything
we have a srv.Wait in an errgroup somewhere, but the srv.Close only appears after the errgroup definitely terminates? which i think makes sense, but is very confusing
session always unconditionally returns an error as well ๐ฑ https://github.com/dagger/dagger/pull/6499
lol, I've stared at that before and had a local fix but never pushed it, 
i need to head off for the day now, but i think i might just give up and attempt to bisect the buildkit thing tomorrow. it might be a multi-day job, but at this point i'm struggling to work out what the alternative possibly is
i've read through every commit in that upgrade in detail, and nothing stands out at all ๐ฆ
Yeah ๐ not that it helps now but we definitely need to abandon trying to re-use the existing buildkit controller API and just have our own thing there. And maybe abandon the whole grpc-in-grpc session stuff along the way if feasible
on the plus side, i thiiiiiink we can actually just delete all the progrock crap from the shim now
it seems like the big server-in-engine refactor and/or the ClientCallContext code path makes it redundant; i can completely comment it out, and everything is fine?!
which would be great, because that means my PR is completely unblocked
just double/triple checking now
Oh.. well yeah even better haha ๐
Wait so who's calling into the progrock socket at all then? If the shim doesn't
it should definitely be the shim, but it seems like 99% of the progress we're forwarding actually just come from the API server anyway, now i'm just seeing what the last 1% might be
Agree agree ๐ฏ
Is it just the "connecting to engine" logs? which don't matter for modules anyways
Mayyyybe removing it kills the deadlock as well if we're lucky riiight? Riiight?
seems like it, plus checkVersionCompatibility
yeah, seems like connect + checkVersionCompatibility, seems like a small price to pay for now
Ah yeah, also doesn't matter since modules are still going to be built using latest engine anyways, and all the version handling will be done separately
will do a full diff soon, right now i'm just eyeballing dagviz ๐
i'm a little surprised that checkVersionCompatibility goes away since that's just another API call
The go/ts codegen uses a recorder too IIRC, so the SDK modules would maybe be impacted also, but since modules are just execs anything they say could in theory just be to stdout and it should still show up
(going off memory, could be wrong)
relevant diff: (this is cat journal.log | jq -r '(.vertexes // [])[].name' | sort | uniq -c with all the noise removed)
11,12c11
< 2 check hut14q0g6ablk.ei5211547ktm4.dagger.local 1234/tcp
< 12 checkVersionCompatibility
---
> 2 check hut14q0g6ablk.b51rbqli15fg4.dagger.local 1234/tcp
14c13
< 40 connect
---
> 2 connect
22c21
< 16 defaultPlatform
---
> 2 defaultPlatform
so yeah, seems like it's all internal client plumbing that's not super interesting to show anyway
still not 100% sure why defaultPlatform and checkVersionCompatibility aren't showing up though
i've actually been wondering where that defaultPlatform call even comes from
yup, just ran into this. definitely seems simpler to turn this into plain old logging
teeny tiny DX hack: might be cool to exec /runtime as /runtime <function name> or something instead and let the arg be ignored, just so we can distinguish them in the process tree / raw buildkit vertex data
ok got it:
defaultPlatformis the engine client connect health-check above, so really not important to showcheckVersionCompatibilitycomes from the codegendagger.Connectcall, which relied on the progrock socket forwarding, and I just made that all go to stdout instead
if we're lucky. ๐ there's still a bit of progrock grpc forwarding going on, but this at least removes a super confusing part of the picture. we also don't call progrock.ServeRPC anymore, so WaitWriter is no more
@quick wolf https://github.com/dagger/dagger/pull/6456 is ready if you wanna take another look, everything is ๐ฅฌ
hm, now investigating a strange doubling-up of logs (notice the identity.NewID() I added is the exact same). only seems to happen for some vertices.
Seems like something broke on main since v0.9.7 (maybe the Buildkit bump? cc @frank quartz). First screenshot is main, second screenshot is v0.9.7.
My guess is that exec op output is being repeated for every subsequent cache hit within the same session. I wonder if this is related to Buildkit's behavior when multiple clients hit the same vertex, where it multiplexes output to both.
ah, that's https://github.com/moby/buildkit/pull/4347 which has @frank quartz 's fingerprints all over it already from investigating the deadlocks. wonder if the two are related?
this issue is already on main so I'll open an issue with this context and merge my PR (edit: opened https://github.com/dagger/dagger/issues/6501)
I've seen this type of flake a few times now: https://github.com/dagger/dagger/actions/runs/7662630188/job/20884170927?pr=6456#step:5:964 
failed to get state for index 0 on copy /bin/dagger-shim /usr/local/bin/dagger-shim
last time it was for dnsmasq or maybe dnsname
(just recording this so I can search for later, trying to keep track of flakes...)
ok this is a pain. what's happening is that:
- somehow the module generation is called from two different client-metadata contexts (not sure how/why, but this isn't really unexpected or an issue)
- we then pack this into the vertex metadata - this gives the vertexes different IDs, but the same cache key
- we do an edge merge - with https://github.com/moby/buildkit/commit/e1da8b7f82f09d3e622a98961f1483f613852390 in buildkit, the idea is to make sure that with an edge merge we keep progress. for example, suppose i have two parts of the graph that resolve two image tags to the same underlying image, then we can merge them, but keep progress for both, since they're different parts of the graph.
the core issue here is that from our perspective, the vertex isn't really different - it's the same execop, but because of our network hack, they have different vertexes.
Couple points/questions:
- ...are these actually kinda valid to merge at all? if i have an exec run within two different "networks" (server id/parent client id combination), are they actually safe to merge/combine/etc? We do this today, so it's "fine", but it does feel a liiiiitle strange.
- if the merging is definitely valid, then we should stop relying on edge-merge magic, and have the vertex ids actually be the same:
- this fixes the progress thing above
- this means we rely less on the edge merging paths in code, which are insanely annoying and tricky, so i'm happy
- genuinely just a nice cleanup
https://github.com/dagger/dagger/issues/6111 ๐ enjoy โค๏ธ
ok i've drafted something: https://github.com/dagger/dagger/pull/6505
gonna watch ci pass/fail, and then tidy it up some more
so this might have done the trick ๐ since merging, i don't see any instances of https://github.com/dagger/dagger/actions/workflows/engine-and-cli.yml timing out by hitting the hour mark ๐๐๐ maybe i am finally freed from my torment ๐
hm, i just saw another timeout on main: https://github.com/dagger/dagger/actions/runs/7697084682/job/20973328070
what's bizarre about this - looking at the stack dump we get, there's nothing in there that go thinks has been running for longer than a couple minutes (a few waiting for commands to exit for a long time - could be that the session is getting stuck, but nothing in the testing process seems to actually be depending on that??? which is also a weird thing.)
this seems to indicate to me that the tests are working - but that they're just now all of a sudden going a lot slower.
a couple things i noticed: from memory it seems like tests that do dagger call fail a bit more: not quite sure why that might be the case though.
panic: test timed out after 1h0m0s
running tests:
TestModuleConstructor (43m59s)
TestModuleConstructor/basic (30m39s)
TestModuleConstructor/basic/go (3m50s)
TestModuleConstructor/basic/python (3m51s)
TestModuleConstructor/basic/typescript (3m51s)
TestModuleDaggerCallArgTypes (38m12s)
TestModuleDaggerCallArgTypes/directory_arg_inputs (6m27s)
TestModuleDaggerCallArgTypes/directory_arg_inputs/git_dir (41s)
TestModuleDaggerCallArgTypes/directory_arg_inputs/local_dir (41s)
TestModuleDaggerCallArgTypes/secret_args (6m27s)
TestModuleDaggerCallArgTypes/secret_args/cmd (39s)
TestModuleDaggerCallArgTypes/secret_args/explicit_env (22s)
TestModuleDaggerCallArgTypes/secret_args/file (39s)
TestModuleDaggerCallArgTypes/secret_args/implicit_env (22s)
TestModuleDaggerCallArgTypes/service_args (6m27s)
TestModuleDaggerCallArgTypes/service_args/used_as_service_binding (22s)
TestModuleDaggerCallArgTypes/service_args/used_directly (18s)
TestModulePythonWithOtherModuleTypes (43m51s)
TestModulePythonWithOtherModuleTypes/arg_as_other_module_object (22m21s)
TestModulePythonWithOtherModuleTypes/arg_as_other_module_object/direct (3m35s)
TestModulePythonWithOtherModuleTypes/arg_as_other_module_object/list (3m35s)
TestModulePythonWithOtherModuleTypes/field_as_other_module_object (22m21s)
TestModulePythonWithOtherModuleTypes/field_as_other_module_object/direct (3m35s)
TestModulePythonWithOtherModuleTypes/field_as_other_module_object/list (3m34s)
TestModulePythonWithOtherModuleTypes/return_as_other_module_object (22m21s)
TestModulePythonWithOtherModuleTypes/return_as_other_module_object/direct (3m43s)
TestModulePythonWithOtherModuleTypes/return_as_other_module_object/list (3m35s)
TestModuleReservedWords (40m8s)
TestModuleReservedWords/id (10m55s)
TestModuleReservedWords/id/arg (2m15s)
TestModuleReservedWords/id/field (2m15s)
TestModuleReservedWords/id/field/go (13s)
TestModuleReservedWords/id/field/typescript (13s)
TestModuleReservedWords/id/fn (2m15s)
TestModuleReservedWords/id/fn/python (12s)
TestModuleReservedWords/id/fn/typescript (12s)
e.g. these look quite sus, it's always quite similar to these tests failing
one theory that sprung to mind: i wonder if maybe just producing more output (which the edge merging progress stuff would do) causes us to slow down a ton - we are producing like 50M+ output each run of testdev.
maybe now's the time to clean up our output a little bit? all of the data seems to be from the dagger engine logs in --debug mode ๐ข
in support of my slowness point: see a passed testdev run that took over an hour: https://github.com/dagger/dagger/actions/runs/7697146109/job/20973537705
looks like this just beat the timeout
one thought that's playing on my mind: our use of t.parallel does confuse things here. see https://github.com/dagger/dagger/blob/8483a5e7ace6174c60e37ba395ddd1ad9b849c1e/core/integration/module_test.go#L4852-L4865
because t.Parallel can block, it seems we can end up in scenarios where we can start a test, create a client, and then block while we wait for another test to finish somewhere - holding the client open longer than it's actually being used.
yeah ugh, i hate how t.Parallel works in that regard, also makes it really hard to read output when you're just trying to run one specific test, since technically they all start running and then decide not to (or something like that). makes me miss ginkgo
even if we could just move all the buildkit logs to a separate place to be read in a separate github step, that could be a huge help
where are we producing more output? that's going to the buildkit logs? i thought i would just go to parallel clients
the massive output that we get is almost entirely debug logs from buildkit
which tbf is somewhat useful upon occasion
ah gotcha
buildkit test don't have this issue - since we spin up a buildkit instance per test, so everything is nice and isolated
we could do something like that, but i do quite like that we really hammer the instance, the stress testing does bring me confidence ๐
maybe it's time to whip out the cpu profiler, could also be slowness somewhere in DagQL in fairness
i remember doing an empirical local benchmark and it was slightly slower, like within the ballpark enough for me to not worry about it and move on at the time, but maybe it's worse in CI
honestly, i'd also just love cpu+mem graphs during jobs as well โค๏ธ
do we get any of that from our test infra?
not as i'm aware today
(firing the @hardy rain bat signal)
perhaps we could lazy our connection? but there's still no guarantee that testA-subtest2 runs anywhere close to testA-subtest1 i don't think
looooove the "feature" that test order seems to be completely and utterly undefined in go
you might be able to set a seed (ginkgo lets you do that), but it all kinda goes out the window with parallelism anyway
or we could just create a new connection per subtest ๐ค
though that's suddenly a lot of sessions
actually, only 16 ๐
since parallel should restrict to that many
true true
loading context...
I think we're just wondering if there's a dashboard for CPU/memory stats for the infra that runs our tests
unless @frank quartz had a more nuanced request (per-test?!?!)
Only what the Engines are sending to Dagger Cloud.
We are working towards it with @barren fractal - including Parca.dev integration - but we are not there yet. Happy to discuss more on #infrastructure
@amber dock do services support stdout/stderr redirection? https://github.com/dagger/dagger/blob/main/internal/mage/engine.go#L425-L433
essentially, can i chuck the output of that somewhere else?
opened an issue for this https://github.com/dagger/dagger/issues/6553#issue-2112551755