#buildkit/progrock/scheduling horrors

1 messages ยท Page 1 of 1 (latest)

frank quartz
#

๐Ÿ‘‹

#

buildkit/progrock/scheduling horrors

amber dock
frank quartz
#

here's a de-duped stack dump of when I've seen the nightmare deadlock from upgrading buildkit

#

i've fixed about 5 deadlocks in the last week, and found all sorts of bizarre inconsistencies in buildkit, most recently: https://github.com/moby/buildkit/pull/4587
and yet, it still occurs very very very easily

amber dock
#

it's kind of funny that you've been debugging a deadlock during shutdown, while i've been debugging the lack of draining, suggesting we don't wait long enough synchronize on shutdown thinkies

frank quartz
#

Also, I took this process trace in the engine container when I noticed the stack trace - I suspect the /runtimes are modules that are connected over grpc to the progrock socket, which is why the server shutdown is an issue.
I dumped some of those, and they just seemed to be hanging on sending some data to the engine, so I suspect the deadlock isn't there, it's likely to actually be in the engine.

#

My current list of ideas:

  • There's a subtle change in WithContext/TimeoutCause that for some reason means context.Canceled is not correctly propagated - but https://github.com/moby/buildkit/pull/4587 is the only case I found, and the issue still kind of occurs with this
  • I wonder if somehow something subtle has changed with the runc handling - maybe a process should be being SIG(SOMETHING)ed, but for a reason this doesn't happen, and so it hangs around and locks everything up.
  • Maybe there's a new race in buildkit somewhere - a subprocess somewhere has a race (I see this mostly in testimportant which has -race), it crashes, and then some of our error handling means that we don't actually get a message for this anywhere. Seems unlikely, since data raced programs should exit with status code 66 by default.
  • Maybe this is OTEL again (somehow). OTEL was disabled for a super long time for us, and now the versioning means this kicks in again.
  • There could still be something insanely wrong with the idea of progress writer chaining - but the scheduler does appear perfectly happy in all the stack traces I've managed to take.
#

I have a "sort of reproducer":

#!/bin/bash

set -x

for i in {0..10}; do
    set +x
    echo "--------------------------------------------------------------------------------"
    echo "Doing $i..."
    echo "--------------------------------------------------------------------------------"
    echo
    set -x

    docker rm -f $(docker ps -aq)
    docker volume rm $(docker volume ls -q)
    ./hack/make engine:testimportant
done

with the bad patches, the logs will just start hanging ๐Ÿ˜ข

frank quartz
frank quartz
#

@amber dock can you confirm I'm thinking about this right:

  • the progrock forwarder hosts the unix socket
  • this gets exposed to nested clients, so we have a way of propagating that back
  • the client disconnects, calls gracefulstop on the progrock socket
  • but somehow one of the nested clients is still connected
    how is this even possible? how are nested clients (through withexecs I'm assuming) still able to be running here? shouldn't their context have been cancelled, which should kill these execs off?
amber dock
frank quartz
#

^from the above stack trace

#

"13 minutes" yes seems reasonable

amber dock
frank quartz
#

just about to type the same thing lol

#

the multiple defers are ... fun? also it's kinda unclear what contexts manage anything

#

we have a srv.Wait in an errgroup somewhere, but the srv.Close only appears after the errgroup definitely terminates? which i think makes sense, but is very confusing

amber dock
frank quartz
#

i need to head off for the day now, but i think i might just give up and attempt to bisect the buildkit thing tomorrow. it might be a multi-day job, but at this point i'm struggling to work out what the alternative possibly is
i've read through every commit in that upgrade in detail, and nothing stands out at all ๐Ÿ˜ฆ

quick wolf
amber dock
#

on the plus side, i thiiiiiink we can actually just delete all the progrock crap from the shim now

#

it seems like the big server-in-engine refactor and/or the ClientCallContext code path makes it redundant; i can completely comment it out, and everything is fine?!

#

which would be great, because that means my PR is completely unblocked

#

just double/triple checking now

quick wolf
frank quartz
amber dock
#

it should definitely be the shim, but it seems like 99% of the progress we're forwarding actually just come from the API server anyway, now i'm just seeing what the last 1% might be

quick wolf
frank quartz
amber dock
#

yeah, seems like connect + checkVersionCompatibility, seems like a small price to pay for now

quick wolf
amber dock
#

will do a full diff soon, right now i'm just eyeballing dagviz ๐Ÿ˜›

#

i'm a little surprised that checkVersionCompatibility goes away since that's just another API call

quick wolf
#

The go/ts codegen uses a recorder too IIRC, so the SDK modules would maybe be impacted also, but since modules are just execs anything they say could in theory just be to stdout and it should still show up

#

(going off memory, could be wrong)

amber dock
#

relevant diff: (this is cat journal.log | jq -r '(.vertexes // [])[].name' | sort | uniq -c with all the noise removed)

11,12c11
<       2 check hut14q0g6ablk.ei5211547ktm4.dagger.local 1234/tcp
<      12 checkVersionCompatibility
---
>       2 check hut14q0g6ablk.b51rbqli15fg4.dagger.local 1234/tcp
14c13
<      40 connect
---
>       2 connect
22c21
<      16 defaultPlatform
---
>       2 defaultPlatform

so yeah, seems like it's all internal client plumbing that's not super interesting to show anyway

#

still not 100% sure why defaultPlatform and checkVersionCompatibility aren't showing up though

#

i've actually been wondering where that defaultPlatform call even comes from

amber dock
#

teeny tiny DX hack: might be cool to exec /runtime as /runtime <function name> or something instead and let the arg be ignored, just so we can distinguish them in the process tree / raw buildkit vertex data

amber dock
amber dock
amber dock
amber dock
#

hm, now investigating a strange doubling-up of logs (notice the identity.NewID() I added is the exact same). only seems to happen for some vertices.

amber dock
#

Seems like something broke on main since v0.9.7 (maybe the Buildkit bump? cc @frank quartz). First screenshot is main, second screenshot is v0.9.7.

My guess is that exec op output is being repeated for every subsequent cache hit within the same session. I wonder if this is related to Buildkit's behavior when multiple clients hit the same vertex, where it multiplexes output to both.

amber dock
frank quartz
# amber dock Seems like something broke on `main` since v0.9.7 (maybe the Buildkit bump? cc <...

ok this is a pain. what's happening is that:

  1. somehow the module generation is called from two different client-metadata contexts (not sure how/why, but this isn't really unexpected or an issue)
  2. we then pack this into the vertex metadata - this gives the vertexes different IDs, but the same cache key
  3. we do an edge merge - with https://github.com/moby/buildkit/commit/e1da8b7f82f09d3e622a98961f1483f613852390 in buildkit, the idea is to make sure that with an edge merge we keep progress. for example, suppose i have two parts of the graph that resolve two image tags to the same underlying image, then we can merge them, but keep progress for both, since they're different parts of the graph.

the core issue here is that from our perspective, the vertex isn't really different - it's the same execop, but because of our network hack, they have different vertexes.

#

Couple points/questions:

  • ...are these actually kinda valid to merge at all? if i have an exec run within two different "networks" (server id/parent client id combination), are they actually safe to merge/combine/etc? We do this today, so it's "fine", but it does feel a liiiiitle strange.
  • if the merging is definitely valid, then we should stop relying on edge-merge magic, and have the vertex ids actually be the same:
    • this fixes the progress thing above
    • this means we rely less on the edge merging paths in code, which are insanely annoying and tricky, so i'm happy
    • genuinely just a nice cleanup
frank quartz
frank quartz
frank quartz
#

hm, i just saw another timeout on main: https://github.com/dagger/dagger/actions/runs/7697084682/job/20973328070
what's bizarre about this - looking at the stack dump we get, there's nothing in there that go thinks has been running for longer than a couple minutes (a few waiting for commands to exit for a long time - could be that the session is getting stuck, but nothing in the testing process seems to actually be depending on that??? which is also a weird thing.)

this seems to indicate to me that the tests are working - but that they're just now all of a sudden going a lot slower.

GitHub

Application Delivery as Code that Runs Anywhere. Contribute to dagger/dagger development by creating an account on GitHub.

#

a couple things i noticed: from memory it seems like tests that do dagger call fail a bit more: not quite sure why that might be the case though.

#
panic: test timed out after 1h0m0s
running tests:
    TestModuleConstructor (43m59s)
    TestModuleConstructor/basic (30m39s)
    TestModuleConstructor/basic/go (3m50s)
    TestModuleConstructor/basic/python (3m51s)
    TestModuleConstructor/basic/typescript (3m51s)
    TestModuleDaggerCallArgTypes (38m12s)
    TestModuleDaggerCallArgTypes/directory_arg_inputs (6m27s)
    TestModuleDaggerCallArgTypes/directory_arg_inputs/git_dir (41s)
    TestModuleDaggerCallArgTypes/directory_arg_inputs/local_dir (41s)
    TestModuleDaggerCallArgTypes/secret_args (6m27s)
    TestModuleDaggerCallArgTypes/secret_args/cmd (39s)
    TestModuleDaggerCallArgTypes/secret_args/explicit_env (22s)
    TestModuleDaggerCallArgTypes/secret_args/file (39s)
    TestModuleDaggerCallArgTypes/secret_args/implicit_env (22s)
    TestModuleDaggerCallArgTypes/service_args (6m27s)
    TestModuleDaggerCallArgTypes/service_args/used_as_service_binding (22s)
    TestModuleDaggerCallArgTypes/service_args/used_directly (18s)
    TestModulePythonWithOtherModuleTypes (43m51s)
    TestModulePythonWithOtherModuleTypes/arg_as_other_module_object (22m21s)
    TestModulePythonWithOtherModuleTypes/arg_as_other_module_object/direct (3m35s)
    TestModulePythonWithOtherModuleTypes/arg_as_other_module_object/list (3m35s)
    TestModulePythonWithOtherModuleTypes/field_as_other_module_object (22m21s)
    TestModulePythonWithOtherModuleTypes/field_as_other_module_object/direct (3m35s)
    TestModulePythonWithOtherModuleTypes/field_as_other_module_object/list (3m34s)
    TestModulePythonWithOtherModuleTypes/return_as_other_module_object (22m21s)
    TestModulePythonWithOtherModuleTypes/return_as_other_module_object/direct (3m43s)
    TestModulePythonWithOtherModuleTypes/return_as_other_module_object/list (3m35s)
    TestModuleReservedWords (40m8s)
    TestModuleReservedWords/id (10m55s)
    TestModuleReservedWords/id/arg (2m15s)
    TestModuleReservedWords/id/field (2m15s)
    TestModuleReservedWords/id/field/go (13s)
    TestModuleReservedWords/id/field/typescript (13s)
    TestModuleReservedWords/id/fn (2m15s)
    TestModuleReservedWords/id/fn/python (12s)
    TestModuleReservedWords/id/fn/typescript (12s)

e.g. these look quite sus, it's always quite similar to these tests failing

#

one theory that sprung to mind: i wonder if maybe just producing more output (which the edge merging progress stuff would do) causes us to slow down a ton - we are producing like 50M+ output each run of testdev.

maybe now's the time to clean up our output a little bit? all of the data seems to be from the dagger engine logs in --debug mode ๐Ÿ˜ข

amber dock
#

yeah ugh, i hate how t.Parallel works in that regard, also makes it really hard to read output when you're just trying to run one specific test, since technically they all start running and then decide not to (or something like that). makes me miss ginkgo

frank quartz
#

even if we could just move all the buildkit logs to a separate place to be read in a separate github step, that could be a huge help

amber dock
#

where are we producing more output? that's going to the buildkit logs? i thought i would just go to parallel clients

frank quartz
#

the massive output that we get is almost entirely debug logs from buildkit

#

which tbf is somewhat useful upon occasion

amber dock
#

ah gotcha

frank quartz
#

buildkit test don't have this issue - since we spin up a buildkit instance per test, so everything is nice and isolated

#

we could do something like that, but i do quite like that we really hammer the instance, the stress testing does bring me confidence ๐Ÿ˜„

amber dock
#

maybe it's time to whip out the cpu profiler, could also be slowness somewhere in DagQL in fairness

#

i remember doing an empirical local benchmark and it was slightly slower, like within the ballpark enough for me to not worry about it and move on at the time, but maybe it's worse in CI

frank quartz
#

honestly, i'd also just love cpu+mem graphs during jobs as well โค๏ธ

amber dock
#

do we get any of that from our test infra?

frank quartz
#

not as i'm aware today

amber dock
#

(firing the @hardy rain bat signal)

frank quartz
#

looooove the "feature" that test order seems to be completely and utterly undefined in go

amber dock
#

you might be able to set a seed (ginkgo lets you do that), but it all kinda goes out the window with parallelism anyway

frank quartz
#

though that's suddenly a lot of sessions

#

actually, only 16 ๐Ÿ˜„

#

since parallel should restrict to that many

amber dock
#

true true

frank quartz
#

one moment ๐Ÿ˜„ lemme throw that together

hardy rain
amber dock
#

I think we're just wondering if there's a dashboard for CPU/memory stats for the infra that runs our tests

#

unless @frank quartz had a more nuanced request (per-test?!?!)

hardy rain
#

We are working towards it with @barren fractal - including Parca.dev integration - but we are not there yet. Happy to discuss more on #infrastructure

frank quartz