buildkit/progrock/scheduling horrors | Dagger | Page 1

frank quartz Jan 25, 2024, 4:47 PM

#

👋

#

buildkit/progrock/scheduling horrors

amber dock Jan 25, 2024, 4:47 PM

#

delurk

frank quartz Jan 25, 2024, 4:48 PM

#

here's a de-duped stack dump of when I've seen the nightmare deadlock from upgrading buildkit

📎 out.txt

#

i've fixed about 5 deadlocks in the last week, and found all sorts of bizarre inconsistencies in buildkit, most recently: https://github.com/moby/buildkit/pull/4587
and yet, it still occurs very very very easily

amber dock Jan 25, 2024, 4:49 PM

#

it's kind of funny that you've been debugging a deadlock during shutdown, while i've been debugging the lack of draining, suggesting we don't ~~wait long enough~~ synchronize on shutdown thinkies

frank quartz Jan 25, 2024, 4:50 PM

#

there's something in here that causes an issue: https://github.com/moby/buildkit/compare/c4f191410a411d7b7cb31426e2bd17fae305acf1...9d84cdcda004a471834309185730d6e7fc9e95c0
i'm not quite sure what it is, bisecting is kind of not a great option given how miserable it is to even repro consistently

#

📎 message.txt

#

Also, I took this process trace in the engine container when I noticed the stack trace - I suspect the /runtimes are modules that are connected over grpc to the progrock socket, which is why the server shutdown is an issue.
I dumped some of those, and they just seemed to be hanging on sending some data to the engine, so I suspect the deadlock isn't there, it's likely to actually be in the engine.

#

My current list of ideas:

There's a subtle change in WithContext/TimeoutCause that for some reason means context.Canceled is not correctly propagated - but https://github.com/moby/buildkit/pull/4587 is the only case I found, and the issue still kind of occurs with this
I wonder if somehow something subtle has changed with the runc handling - maybe a process should be being SIG(SOMETHING)ed, but for a reason this doesn't happen, and so it hangs around and locks everything up.
Maybe there's a new race in buildkit somewhere - a subprocess somewhere has a race (I see this mostly in testimportant which has -race), it crashes, and then some of our error handling means that we don't actually get a message for this anywhere. Seems unlikely, since data raced programs should exit with status code 66 by default.
Maybe this is OTEL again (somehow). OTEL was disabled for a super long time for us, and now the versioning means this kicks in again.
There could still be something insanely wrong with the idea of progress writer chaining - but the scheduler does appear perfectly happy in all the stack traces I've managed to take.

#

I have a "sort of reproducer":

#!/bin/bash

set -x

for i in {0..10}; do
    set +x
    echo "--------------------------------------------------------------------------------"
    echo "Doing $i..."
    echo "--------------------------------------------------------------------------------"
    echo
    set -x

    docker rm -f $(docker ps -aq)
    docker volume rm $(docker volume ls -q)
    ./hack/make engine:testimportant
done

with the bad patches, the logs will just start hanging 😢

frank quartz Jan 25, 2024, 5:10 PM

#

amber dock it's kind of funny that you've been debugging a deadlock during shutdown, while ...

honestly, would not be surprised if this investigation ends up somewhere in the middle lol 😄

frank quartz Jan 25, 2024, 5:41 PM

#

@amber dock can you confirm I'm thinking about this right:

the progrock forwarder hosts the unix socket
this gets exposed to nested clients, so we have a way of propagating that back
the client disconnects, calls gracefulstop on the progrock socket
but somehow one of the nested clients is still connected
how is this even possible? how are nested clients (through withexecs I'm assuming) still able to be running here? shouldn't their context have been cancelled, which should kill these execs off?

amber dock Jan 25, 2024, 5:46 PM

#

frank quartz <@108011715077091328> can you confirm I'm thinking about this right: - the progr...

where do you see the gracefulstop on step 3? i only see a CloseAndRecv.

btw one thing I've noticed is that during engine/client.Client.Close the progrock RPC server actually gets interrupted and exits with context canceled, which I suspect is where we're losing updates

frank quartz Jan 25, 2024, 5:46 PM

#

📎 message.txt

#

^from the above stack trace

#

"13 minutes" yes seems reasonable

amber dock Jan 25, 2024, 5:58 PM

#

frank quartz <@108011715077091328> can you confirm I'm thinking about this right: - the progr...

i'm gonna need to stare at this code for a while longer to really confirm, it's a bit of a rube-goldberg machine. 😵‍💫

frank quartz Jan 25, 2024, 5:59 PM

#

just about to type the same thing lol

#

the multiple defers are ... fun? also it's kinda unclear what contexts manage anything

#

we have a srv.Wait in an errgroup somewhere, but the srv.Close only appears after the errgroup definitely terminates? which i think makes sense, but is very confusing

#

session always unconditionally returns an error as well 😱 https://github.com/dagger/dagger/pull/6499

amber dock Jan 25, 2024, 6:03 PM

#

frank quartz session always unconditionally returns an error as well 😱 https://github.com/da...

lol, I've stared at that before and had a local fix but never pushed it,

frank quartz Jan 25, 2024, 6:12 PM

#

i need to head off for the day now, but i think i might just give up and attempt to bisect the buildkit thing tomorrow. it might be a multi-day job, but at this point i'm struggling to work out what the alternative possibly is
i've read through every commit in that upgrade in detail, and nothing stands out at all 😦

quick wolf Jan 25, 2024, 7:03 PM

#

amber dock i'm gonna need to stare at this code for a while longer to really confirm, it's ...

Yeah 😔 not that it helps now but we definitely need to abandon trying to re-use the existing buildkit controller API and just have our own thing there. And maybe abandon the whole grpc-in-grpc session stuff along the way if feasible

amber dock Jan 25, 2024, 7:05 PM

#

on the plus side, i thiiiiiink we can actually just delete all the progrock crap from the shim now

#

it seems like the big server-in-engine refactor and/or the ClientCallContext code path makes it redundant; i can completely comment it out, and everything is fine?!

#

which would be great, because that means my PR is completely unblocked

#

just double/triple checking now

quick wolf Jan 25, 2024, 7:09 PM

#

amber dock it seems like the big server-in-engine refactor and/or the ClientCallContext cod...

Oh.. well yeah even better haha 😄

frank quartz Jan 25, 2024, 7:10 PM

#

amber dock on the plus side, i thiiiiiink we can actually just delete all the progrock crap...

Wait so who's calling into the progrock socket at all then? If the shim doesn't

amber dock Jan 25, 2024, 7:11 PM

#

it should definitely be the shim, but it seems like 99% of the progress we're forwarding actually just come from the API server anyway, now i'm just seeing what the last 1% might be

frank quartz Jan 25, 2024, 7:11 PM

#

quick wolf Yeah 😔 not that it helps now but we definitely need to abandon trying to re-use...

Agree agree 💯

quick wolf Jan 25, 2024, 7:11 PM

#

amber dock it should definitely be the shim, but it seems like 99% of the progress we're fo...

Is it just the "connecting to engine" logs? which don't matter for modules anyways

frank quartz Jan 25, 2024, 7:12 PM

#

amber dock it should definitely be the shim, but it seems like 99% of the progress we're fo...

Mayyyybe removing it kills the deadlock as well if we're lucky riiight? Riiight?

amber dock Jan 25, 2024, 7:12 PM

#

quick wolf Is it just the "connecting to engine" logs? which don't matter for modules anywa...

seems like it, plus checkVersionCompatibility

#

yeah, seems like connect + checkVersionCompatibility, seems like a small price to pay for now

quick wolf Jan 25, 2024, 7:12 PM

#

amber dock seems like it, plus `checkVersionCompatibility`

Ah yeah, also doesn't matter since modules are still going to be built using latest engine anyways, and all the version handling will be done separately

amber dock Jan 25, 2024, 7:13 PM

#

will do a full diff soon, right now i'm just eyeballing dagviz 😛

#

i'm a little surprised that checkVersionCompatibility goes away since that's just another API call

quick wolf Jan 25, 2024, 7:19 PM

#

The go/ts codegen uses a recorder too IIRC, so the SDK modules would maybe be impacted also, but since modules are just execs anything they say could in theory just be to stdout and it should still show up

#

(going off memory, could be wrong)

amber dock Jan 25, 2024, 7:23 PM

#

relevant diff: (this is cat journal.log | jq -r '(.vertexes // [])[].name' | sort | uniq -c with all the noise removed)

11,12c11
<       2 check hut14q0g6ablk.ei5211547ktm4.dagger.local 1234/tcp
<      12 checkVersionCompatibility
---
>       2 check hut14q0g6ablk.b51rbqli15fg4.dagger.local 1234/tcp
14c13
<      40 connect
---
>       2 connect
22c21
<      16 defaultPlatform
---
>       2 defaultPlatform

so yeah, seems like it's all internal client plumbing that's not super interesting to show anyway

#

still not 100% sure why defaultPlatform and checkVersionCompatibility aren't showing up though

#

i've actually been wondering where that defaultPlatform call even comes from

#

lol found it https://github.com/dagger/dagger/blob/228b48ab9d17624b8b85be2882417c5ea72f4f28/engine/client/client.go#L304

amber dock Jan 25, 2024, 7:43 PM

#

quick wolf The go/ts codegen uses a recorder too IIRC, so the SDK modules would maybe be im...

yup, just ran into this. definitely seems simpler to turn this into plain old logging

#

teeny tiny DX hack: might be cool to exec /runtime as /runtime <function name> or something instead and let the arg be ignored, just so we can distinguish them in the process tree / raw buildkit vertex data

amber dock Jan 25, 2024, 8:00 PM

#

amber dock still not 100% sure why `defaultPlatform` and `checkVersionCompatibility` aren't...

ok got it:

defaultPlatform is the engine client connect health-check above, so really not important to show
checkVersionCompatibility comes from the codegen dagger.Connect call, which relied on the progrock socket forwarding, and I just made that all go to stdout instead

amber dock Jan 25, 2024, 8:12 PM

#

frank quartz Mayyyybe removing it kills the deadlock as well if we're lucky riiight? Riiight?

if we're lucky. 😛 there's still a bit of progrock grpc forwarding going on, but this at least removes a super confusing part of the picture. we also don't call progrock.ServeRPC anymore, so WaitWriter is no more

amber dock Jan 25, 2024, 9:26 PM

#

@quick wolf https://github.com/dagger/dagger/pull/6456 is ready if you wanna take another look, everything is 🥬

amber dock Jan 25, 2024, 9:45 PM

#

hm, now investigating a strange doubling-up of logs (notice the identity.NewID() I added is the exact same). only seems to happen for some vertices.

amber dock Jan 25, 2024, 11:31 PM

#

Seems like something broke on main since v0.9.7 (maybe the Buildkit bump? cc @frank quartz). First screenshot is main, second screenshot is v0.9.7.

My guess is that exec op output is being repeated for every subsequent cache hit within the same session. I wonder if this is related to Buildkit's behavior when multiple clients hit the same vertex, where it multiplexes output to both.

amber dock Jan 26, 2024, 1:07 AM

#

Maybe it's this: https://github.com/moby/buildkit/commit/e1da8b7f82f09d3e622a98961f1483f613852390

GitHub

solver: fix printing progress messages after merged edges · moby/bu...

When different LLB vertexes (eg. parallel requests referencing
local sources from different sessions) generate same cache keys
during solve they are merged together into a single operation.

Curren...

#

ah, that's https://github.com/moby/buildkit/pull/4347 which has @frank quartz 's fingerprints all over it already from investigating the deadlocks. wonder if the two are related?

#

this issue is already on main so I'll open an issue with this context and merge my PR (edit: opened https://github.com/dagger/dagger/issues/6501)

#

I've seen this type of flake a few times now: https://github.com/dagger/dagger/actions/runs/7662630188/job/20884170927?pr=6456#step:5:964 thinkspin

failed to get state for index 0 on copy /bin/dagger-shim /usr/local/bin/dagger-shim

last time it was for dnsmasq or maybe dnsname

(just recording this so I can search for later, trying to keep track of flakes...)

frank quartz Jan 26, 2024, 11:32 AM

#

amber dock Seems like something broke on `main` since v0.9.7 (maybe the Buildkit bump? cc <...

ok this is a pain. what's happening is that:

somehow the module generation is called from two different client-metadata contexts (not sure how/why, but this isn't really unexpected or an issue)
we then pack this into the vertex metadata - this gives the vertexes different IDs, but the same cache key
we do an edge merge - with https://github.com/moby/buildkit/commit/e1da8b7f82f09d3e622a98961f1483f613852390 in buildkit, the idea is to make sure that with an edge merge we keep progress. for example, suppose i have two parts of the graph that resolve two image tags to the same underlying image, then we can merge them, but keep progress for both, since they're different parts of the graph.

the core issue here is that from our perspective, the vertex isn't really different - it's the same execop, but because of our network hack, they have different vertexes.

#

Couple points/questions:

...are these actually kinda valid to merge at all? if i have an exec run within two different "networks" (server id/parent client id combination), are they actually safe to merge/combine/etc? We do this today, so it's "fine", but it does feel a liiiiitle strange.
if the merging is definitely valid, then we should stop relying on edge-merge magic, and have the vertex ids actually be the same:
- this fixes the progress thing above
- this means we rely less on the edge merging paths in code, which are insanely annoying and tricky, so i'm happy
- genuinely just a nice cleanup

frank quartz Jan 26, 2024, 11:45 AM

#

amber dock I've seen this type of flake a few times now: <https://github.com/dagger/dagger/...

https://github.com/dagger/dagger/issues/6111 🎉 enjoy ❤️

frank quartz Jan 26, 2024, 1:00 PM

#

ok i've drafted something: https://github.com/dagger/dagger/pull/6505
gonna watch ci pass/fail, and then tidy it up some more

frank quartz Jan 26, 2024, 1:21 PM

#

amber dock if we're lucky. 😛 there's still a bit of progrock grpc forwarding going on, but...

so this might have done the trick 🎉 since merging, i don't see any instances of https://github.com/dagger/dagger/actions/workflows/engine-and-cli.yml timing out by hitting the hour mark 🎉🎉🎉 maybe i am finally freed from my torment 😄

frank quartz Jan 29, 2024, 3:44 PM

#

hm, i just saw another timeout on main: https://github.com/dagger/dagger/actions/runs/7697084682/job/20973328070
what's bizarre about this - looking at the stack dump we get, there's nothing in there that go thinks has been running for longer than a couple minutes (a few waiting for commands to exit for a long time - could be that the session is getting stuck, but nothing in the testing process seems to actually be depending on that??? which is also a weird thing.)

this seems to indicate to me that the tests are working - but that they're just now all of a sudden going a lot slower.

GitHub

docs: update zenith quickstart available functions (#6513) · dagger...

Application Delivery as Code that Runs Anywhere. Contribute to dagger/dagger development by creating an account on GitHub.

#

a couple things i noticed: from memory it seems like tests that do dagger call fail a bit more: not quite sure why that might be the case though.

#

panic: test timed out after 1h0m0s
running tests:
    TestModuleConstructor (43m59s)
    TestModuleConstructor/basic (30m39s)
    TestModuleConstructor/basic/go (3m50s)
    TestModuleConstructor/basic/python (3m51s)
    TestModuleConstructor/basic/typescript (3m51s)
    TestModuleDaggerCallArgTypes (38m12s)
    TestModuleDaggerCallArgTypes/directory_arg_inputs (6m27s)
    TestModuleDaggerCallArgTypes/directory_arg_inputs/git_dir (41s)
    TestModuleDaggerCallArgTypes/directory_arg_inputs/local_dir (41s)
    TestModuleDaggerCallArgTypes/secret_args (6m27s)
    TestModuleDaggerCallArgTypes/secret_args/cmd (39s)
    TestModuleDaggerCallArgTypes/secret_args/explicit_env (22s)
    TestModuleDaggerCallArgTypes/secret_args/file (39s)
    TestModuleDaggerCallArgTypes/secret_args/implicit_env (22s)
    TestModuleDaggerCallArgTypes/service_args (6m27s)
    TestModuleDaggerCallArgTypes/service_args/used_as_service_binding (22s)
    TestModuleDaggerCallArgTypes/service_args/used_directly (18s)
    TestModulePythonWithOtherModuleTypes (43m51s)
    TestModulePythonWithOtherModuleTypes/arg_as_other_module_object (22m21s)
    TestModulePythonWithOtherModuleTypes/arg_as_other_module_object/direct (3m35s)
    TestModulePythonWithOtherModuleTypes/arg_as_other_module_object/list (3m35s)
    TestModulePythonWithOtherModuleTypes/field_as_other_module_object (22m21s)
    TestModulePythonWithOtherModuleTypes/field_as_other_module_object/direct (3m35s)
    TestModulePythonWithOtherModuleTypes/field_as_other_module_object/list (3m34s)
    TestModulePythonWithOtherModuleTypes/return_as_other_module_object (22m21s)
    TestModulePythonWithOtherModuleTypes/return_as_other_module_object/direct (3m43s)
    TestModulePythonWithOtherModuleTypes/return_as_other_module_object/list (3m35s)
    TestModuleReservedWords (40m8s)
    TestModuleReservedWords/id (10m55s)
    TestModuleReservedWords/id/arg (2m15s)
    TestModuleReservedWords/id/field (2m15s)
    TestModuleReservedWords/id/field/go (13s)
    TestModuleReservedWords/id/field/typescript (13s)
    TestModuleReservedWords/id/fn (2m15s)
    TestModuleReservedWords/id/fn/python (12s)
    TestModuleReservedWords/id/fn/typescript (12s)

e.g. these look quite sus, it's always quite similar to these tests failing

#

one theory that sprung to mind: i wonder if maybe just producing more output (which the edge merging progress stuff would do) causes us to slow down a ton - we are producing like 50M+ output each run of testdev.

maybe now's the time to clean up our output a little bit? all of the data seems to be from the dagger engine logs in --debug mode 😢

#

in support of my slowness point: see a passed testdev run that took over an hour: https://github.com/dagger/dagger/actions/runs/7697146109/job/20973537705
looks like this just beat the timeout

#

one thought that's playing on my mind: our use of t.parallel does confuse things here. see https://github.com/dagger/dagger/blob/8483a5e7ace6174c60e37ba395ddd1ad9b849c1e/core/integration/module_test.go#L4852-L4865

because t.Parallel can block, it seems we can end up in scenarios where we can start a test, create a client, and then block while we wait for another test to finish somewhere - holding the client open longer than it's actually being used.

amber dock Jan 29, 2024, 4:40 PM

#

yeah ugh, i hate how t.Parallel works in that regard, also makes it really hard to read output when you're just trying to run one specific test, since technically they all start running and then decide not to (or something like that). makes me miss ginkgo

frank quartz Jan 29, 2024, 4:40 PM

#

even if we could just move all the buildkit logs to a separate place to be read in a separate github step, that could be a huge help

amber dock Jan 29, 2024, 4:42 PM

#

where are we producing more output? that's going to the buildkit logs? i thought i would just go to parallel clients

frank quartz Jan 29, 2024, 4:49 PM

#

the massive output that we get is almost entirely debug logs from buildkit

#

which tbf is somewhat useful upon occasion

amber dock Jan 29, 2024, 4:49 PM

#

ah gotcha

frank quartz Jan 29, 2024, 4:50 PM

#

buildkit test don't have this issue - since we spin up a buildkit instance per test, so everything is nice and isolated

#

we could do something like that, but i do quite like that we really hammer the instance, the stress testing does bring me confidence 😄

amber dock Jan 29, 2024, 4:53 PM

#

maybe it's time to whip out the cpu profiler, could also be slowness somewhere in DagQL in fairness

#

i remember doing an empirical local benchmark and it was slightly slower, like within the ballpark enough for me to not worry about it and move on at the time, but maybe it's worse in CI

frank quartz Jan 29, 2024, 4:54 PM

#

honestly, i'd also just love cpu+mem graphs during jobs as well ❤️

amber dock Jan 29, 2024, 4:55 PM

#

do we get any of that from our test infra?

frank quartz Jan 29, 2024, 4:55 PM

#

not as i'm aware today

amber dock Jan 29, 2024, 4:56 PM

#

(firing the @hardy rain bat signal)

frank quartz Jan 29, 2024, 4:56 PM

#

frank quartz one thought that's playing on my mind: our use of `t.parallel` does confuse thin...

perhaps we could lazy our connection? but there's still no guarantee that testA-subtest2 runs anywhere close to testA-subtest1 i don't think

#

looooove the "feature" that test order seems to be completely and utterly undefined in go

amber dock Jan 29, 2024, 4:58 PM

#

you might be able to set a seed (ginkgo lets you do that), but it all kinda goes out the window with parallelism anyway

frank quartz Jan 29, 2024, 4:59 PM

#

frank quartz perhaps we could lazy our connection? but there's still no guarantee that testA-...

or we could just create a new connection per subtest 🤔

#

though that's suddenly a lot of sessions

#

actually, only 16 😄

#

since parallel should restrict to that many

amber dock Jan 29, 2024, 5:01 PM

#

true true

frank quartz Jan 29, 2024, 5:01 PM

#

https://tenor.com/view/please-hold-nope-dont-like-gif-25735045

Tenor

#

one moment 😄 lemme throw that together

hardy rain Jan 29, 2024, 5:27 PM

#

amber dock (firing the <@796825768600141844> bat signal)

loading context...

amber dock Jan 29, 2024, 5:28 PM

#

I think we're just wondering if there's a dashboard for CPU/memory stats for the infra that runs our tests

#

unless @frank quartz had a more nuanced request (per-test?!?!)

hardy rain Jan 29, 2024, 5:28 PM

#

amber dock (firing the <@796825768600141844> bat signal)

Only what the Engines are sending to Dagger Cloud.

#

We are working towards it with @barren fractal - including Parca.dev integration - but we are not there yet. Happy to discuss more on #infrastructure

frank quartz Jan 29, 2024, 5:34 PM

#

frank quartz https://tenor.com/view/please-hold-nope-dont-like-gif-25735045

here: https://github.com/dagger/dagger/pull/6521

frank quartz Jan 29, 2024, 5:43 PM

#

amber dock where are we producing more output? that's going to the buildkit logs? i thought...

@amber dock do services support stdout/stderr redirection? https://github.com/dagger/dagger/blob/main/internal/mage/engine.go#L425-L433
essentially, can i chuck the output of that somewhere else?

frank quartz Feb 1, 2024, 1:49 PM

#

frank quartz <@108011715077091328> do services support stdout/stderr redirection? https://git...

opened an issue for this https://github.com/dagger/dagger/issues/6553#issue-2112551755

#buildkit/progrock/scheduling horrors