#Lockfile on egraph ๐Ÿงต

1 messages ยท Page 1 of 1 (latest)

tawny apex
#

FYI @earnest spruce @uneven plinth @unique robin @whole ibex

Starting this thread to track

#

@earnest spruce let me know if you need help today

#

Oh wait there was already another thread... facepalm

#

Sorry

earnest spruce
earnest spruce
#

I'm currently working on TestLockfile/TestLiveDiscoversQueryEntries in particular -- I think there's some sort of race where the client is shutting down before the lockfile gets written.

uneven plinth
#

@tawny apex were the lockfile tests passing originally before the rebase?

tawny apex
#

I would not bet on that

#

Very possible that it's a genuine bug

earnest spruce
#

hahah..... oh dear. yeah that's a good question to see if the failing ones were initially failing.

tawny apex
#

Or a dumb test, or both

earnest spruce
#

here's the current state after my merge:

--- FAIL: TestLockfile (0.00s)
    --- PASS: TestLockfile/TestWorkspaceUpdateRequiresLockfile (1.22s)
    --- PASS: TestLockfile/TestFromLockfilePinnedUsesPinEntry (1.93s)
    --- PASS: TestLockfile/TestFromLockfileFrozenUsesFloatEntry (1.95s)
    --- FAIL: TestLockfile/TestLiveDiscoversQueryEntries (3.07s)
    --- PASS: TestLockfile/TestFromLockfileDisabledIgnoresEntry (3.76s)
    --- FAIL: TestLockfile/TestFromLockfileLiveRefreshesEntry (4.10s)
    --- FAIL: TestLockfile/TestFromLockfilePinnedRefreshesFloatEntry (4.16s)
    --- PASS: TestLockfile/TestFromLockfileFrozenRequiresEntry (1.14s)
    --- FAIL: TestLockfile/TestLiveDiscoversGitEntries (10.30s)
    --- PASS: TestLockfile/TestLockUpdateRefreshesExistingEntry (7.30s)
    --- PASS: TestLockfile/TestGitBranchFrozenUsesFloatEntry (10.43s)
    --- PASS: TestLockfile/TestLockUpdateRefreshesExistingGitEntry (6.46s)
    --- FAIL: TestLockfile/TestLiveNestedQuery (6.53s)
    --- FAIL: TestLockfile/TestModuleSourceFrozenUsesLockedEntry (8.71s)
    --- PASS: TestLockfile/TestWorkspaceUpdate (8.73s)
    --- FAIL: TestLockfile/TestLockUpdateRefreshesExistingModuleResolveEntry (10.73s)
    --- FAIL: TestLockfile/TestWorkspaceUpdateNestedQuery (9.51s)
    --- FAIL: TestLockfile/TestLiveModuleCall (6.22s)
    --- FAIL: TestLockfile/TestLiveDiscoversModuleSourceEntries (11.15s)
    --- FAIL: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (12.20s)
tawny apex
#

If it's only lockfile tests failing post rebase/merge, that's good news IMO ๐Ÿ™‚

uneven plinth
#

No there's still other test failures but there the known ones I have left on egraph too

earnest spruce
#

I just re-ran it, and see that some flipped from passing to failing, so defintely feels like a race

uneven plinth
#

@earnest spruce Test failures I have left right now are:

  • some non-deterministic output for TestTelemetry, squashed most of them now but the last non-determinism bits are tricky
  • a few TestLLMs
  • Legacy enum support and legacy service support... which is a real nightmare but tackling it now
  • Some unit test failures due to stale tests which would show up in test-base

So those ones will still be red for you. But if you see something outside that lemme know

earnest spruce
#

thanks for that heads up, yeah I noticed a TestContainer ...network.None label related test was failing and went and confirmed it was also failing under your branch so I simply ignored it for now.

uneven plinth
uneven plinth
#

@earnest spruce #1488639315255890012 message I forget what the state of the lockfile PR was at the time you took it over but I had a bunch of feedback on the approach when I had reviewed it last. There were a bunch of issues so it may be worth just addressing those, might fix any race conditions as a side effect

tawny apex
#

@uneven plinth quick summary:

  • You did your initial review. Had specific concerns
  • I incorporated your feedback, but did not fully test or deslopify the result
  • Asked Alex for help, he took over the branch (2nd priority after the Dockerfile work)
#

So, yes your first round of feedback was taken into account, but no I can't tell whether it was applied fully and to your satisfaction ๐Ÿ™‚

#

I think in spirit we agreed on the direction - 1) avoid round trip on every lockfile lookup; 2) don't force every lookup to go through a recursive dagql call; 3) treat the problem as a specialized sync problem ("locksync" rather than "filesync") because of multiple clients sharing a workspace & therefore its lockfile;

earnest spruce
tawny apex
#

Important context for that:

  • 1 workspace = 1 lockfile
  • 1 workspace may be shared by multiple clients (because modules inherit their caller's workspace, and its lockfile with it)
  • Concretely: all modules in my dependency chain will populate the same lockfile
  • Consequence 1: one lockfile shared by multiple clients (so don't couple the lockfile too tightly to a client)
  • Consequence 2: one session may involve multiple workspaces (so don't couple the lockfile too tightly to a session)
earnest spruce
#

so I think I have a better understand of the race, BuildkitSessionServer.Run() is called and will wait on a context before closing: https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/engine/client/client.go#L743-L760

and it seems like this context gets closed before the POST /shutdown request comes back under https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/engine/client/client.go#L1034-L1037

The issue is that the POST triggers the lockfile sync, which then performs a filesync which depends on the filesyncer (created under https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/engine/client/client.go#L542 ), which is an attachable on the buildkit session.

uneven plinth
tawny apex
uneven plinth
earnest spruce
#

are the protos still the original buildkit protos, just moved into engine/session/...?

uneven plinth
#

there's probably more trimming we could do of those since we only use a subset, but yeah saving that sort of superficial stuff for last now

earnest spruce
#

yeah, just like a bunch of bk variables could eventually be renamed, but best to keep it as-is for now. we don't want to have to change every single line in the repo for this PR

earnest spruce
#

I just changed ctx, cc, err := grpcClientConn(ctx, conn) to ctx, cc, err := grpcClientConn(context.TODO(), conn), and that solved the race, so it's defintely the ctx that gets passed into handleConn(....) that's triggering the premature shutdown

uneven plinth
# earnest spruce <@949034677610643507> any chance you have an idea where the attachables session ...

It can be multiple places technically, i.e. if the main client disconnects entirely the attachables will close.

But https://github.com/sipsma/dagger/blob/0e8d78c59290c34b08265270a2aa767525f13703/engine/server/session.go#L1132-L1132 is the area you'd hit typically during graceful shutdown.

That'll get canceled after beginClosing is called https://github.com/sipsma/dagger/blob/0e8d78c59290c34b08265270a2aa767525f13703/engine/server/session.go#L340-L340

#

so I'd make sure that you take care of anything that needs session attachables before you call beginClosing()

earnest spruce
#

ahh, that's super helpful, i'll see if I can move sess.beginClosing() further down.

uneven plinth
#

yeah sorry there's a lot of cancellation and shutdown paths that have accumulated over the years.

For theseus I also had to modify this whole area of the code significantly to help ensure that all clients are really for sure fully disconnected before trying to persist the cache, to ensure it is in a settled consistent state (aka I spent like 2 days debugging why the cache was occasionally off after reloads ๐Ÿ˜…)

#

It's probably worth a cleanup+consolidation pass sometime in the future

earnest spruce
#

woohoo, moving it further down solved the TestLockfile/TestLiveDiscoversQueryEntries failure I was hunting down. and in fact many other ones, now I'm down to only 4 failures:

--- FAIL: TestLockfile (0.00s)
    --- PASS: TestLockfile/TestWorkspaceUpdateRequiresLockfile (1.36s)
    --- PASS: TestLockfile/TestFromLockfilePinnedUsesPinEntry (2.75s)
    --- PASS: TestLockfile/TestLiveDiscoversQueryEntries (4.55s)
    --- PASS: TestLockfile/TestFromLockfilePinnedRefreshesFloatEntry (4.84s)
    --- PASS: TestLockfile/TestFromLockfileDisabledIgnoresEntry (5.39s)
    --- PASS: TestLockfile/TestFromLockfileLiveRefreshesEntry (5.80s)
    --- PASS: TestLockfile/TestLockUpdateRefreshesExistingEntry (3.30s)
    --- PASS: TestLockfile/TestWorkspaceUpdate (6.27s)
    --- PASS: TestLockfile/TestFromLockfileFrozenRequiresEntry (1.22s)
    --- PASS: TestLockfile/TestWorkspaceUpdateNestedQuery (7.43s)
    --- PASS: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (7.46s)
    --- PASS: TestLockfile/TestFromLockfileFrozenUsesFloatEntry (1.76s)
    --- FAIL: TestLockfile/TestLiveDiscoversGitEntries (7.79s)
    --- FAIL: TestLockfile/TestLiveNestedQuery (3.30s)
    --- PASS: TestLockfile/TestLockUpdateRefreshesExistingGitEntry (5.48s)
    --- FAIL: TestLockfile/TestLiveModuleCall (5.52s)
    --- FAIL: TestLockfile/TestGitBranchFrozenUsesFloatEntry (5.30s)
    --- PASS: TestLockfile/TestLockUpdateRefreshesExistingModuleResolveEntry (20.95s)
    --- PASS: TestLockfile/TestModuleSourceFrozenUsesLockedEntry (19.89s)
    --- PASS: TestLockfile/TestLiveDiscoversModuleSourceEntries (21.28s)
tawny apex
#

I sense a rebase in my near future ๐Ÿ™‚

earnest spruce
uneven plinth
#

speaking of which I do still need to update my branch after all the merges to main last 24h... lemme finish my current task and I'll get to that next. I'm currently working on a fix to TestLegacy/TestContainerAsService, which despite being a stupid test for obscure things (why I saved it for the end) did manage to reveal a flaw in the new lazy evaluation logic elmofire Half-way through fixing it though

#

I'll let you know once I've done a merge commit from main. Might be easier for me to do that and then you to merge that from my branch, so then you don't have to resolve new conflicts on your own outside of the code you were actually changing

whole ibex
#

my branch after all the merges to main last 24h...
sorry about that ๐Ÿ˜‡ hope its gonna be fine ๐Ÿคž

earnest spruce
#

plus you know your branch much more ๐Ÿ˜›

uneven plinth
uneven plinth
#

Merged main into my PR. Also got all of the tests that were failing passing locally! So provided I didn't break any other tests in the process, should have all green tests in CI hyperfastparrot

earnest spruce
#

failed to read file: failed to create diff copy client: rpc error: code = Unavailable desc = rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: only one connection allowed", weeeeee.

I wonder if the filesync client needs a lock?

uneven plinth
earnest spruce
#

weird thing is I only see a single call to the lockfile flush?

earnest spruce
uneven plinth
#

Make sure there's no clientIDs or other per-client state stored on cached objects which is then being used here.

Funnily, one large area of work with theseus has been finding places where we stored per-client state on objects, which just so happened to work because caching was broken enough that we never got cache hits and thus never hit bugs where we leaked client information across cache

#

But then when caching actually works, the bug gets revealed elmofire

earnest spruce
#

yeah that sounds very similar to what I'm seeing:

    if client.clientID == sess.mainClientCallerID {
        slog.Info("main client is shutting down")
        flushCtx := context.WithoutCancel(ctx)
        if err := srv.flushWorkspaceLocks(flushCtx, client); err != nil {

the main calling client is getting passed to flushWorkspaceLocks, but I suspect the flush needs to happen in the nested clients.

uneven plinth
#

If so this might actually be more like a design and/or test harness issue

earnest spruce
#

it pulls out the session, which then has a bunch of lockfiles which are tied to workspaces which then have a workspaceCtx which pulls out a connection from there, AFAIK

uneven plinth
#

I guess the most coherent thing would be to always export back to the client that loaded the lockfile in the first place

#

that might be a significant design departure though. I don't remember whether nested CLI clients call /shutdown or not

tawny apex
#

@uneven plinth @earnest spruce will catch up as soon as release emergency is over

earnest spruce
#

I'm not entirely sure if this is a nested CLI client issue, it's getting triggered by

            "container": {
                "from": {
                    "file": {
                        "contents": "3.23.3\n"
                    }
                }
            }
        }
#

where as it seems to work when it's simply a container.from (without any file)

uneven plinth
uneven plinth
#

Some of the other tests use hostDaggerExec, which would avoid this issue technically

#

but more in the sense of hiding from it rather than solving it

earnest spruce
tawny apex
#

Now catching up...

earnest spruce
#

I'm thinking it's that the flush is occurring under the main client shutdown; however at that point the nested clients will already have been closed.

#

maybe we need to scan through the workspaces and if any are associated with the client shuttingdown (either main, or nested), then we do a flush?

uneven plinth
#

The other option of course is to not care about nested clients since it's outside the realm of 99% of real world use cases, but we would rule out some legit uses like with the engine-dev playground container

earnest spruce
#

I have a printf that's being hit when if client.clientID == sess.mainClientCallerID { ... } else { printf("here") }, so I think it's being called by nested.

#

apologies for all the noisy printfs in this

#

the main changes are that I'm calling srv.flushWorkspaceLocks from all shutdown instances now, and that I also changed the flushing code to only flush matching clients via a if state.ws.ClientID == client.clientID {

uneven plinth
#

at least directionally

earnest spruce
#

looks like my change has now caused TestGitBranchPinnedRefreshesFloatEntry to fail though, it used to pass before this ๐Ÿ˜ž

tawny apex
#

ok now REALLY catching u

earnest spruce
#

oh it's failing with a

it could very well be that it used to be passing because the string used to contain "error: this is not possible", at least that's me being optimistic ๐Ÿ˜†

tawny apex
#

@earnest spruce is the architecture of lockfile itself an issue after all;?

earnest spruce
#

I'm avoiding refactoring anything right now just to see if I can get the tests passing first, then maybe see if we can optimize things later.

tawny apex
#

Can i help get to the answer?

earnest spruce
#

I only have two failings tests now:

    --- FAIL: TestLockfile/TestLiveDiscoversGitEntries (2.79s)
    --- FAIL: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (1.84s)

if I can get these working I think we're in good shape.

tawny apex
#

thanks for carrying it

#

I'm very curious to see what happens when I rebase workspace to egraph (via your new lockfile)

earnest spruce
earnest spruce
earnest spruce
#

so this git test is a bit racy, when I run ./hack/build && ./hack/with-dev go test -count=1 -v -run TestLockfile/TestGitBranchPinnedRefreshesFloatEntry ./core/integration/ it passes, however when I change my test regex to TestLockfile/TestGit, it fails since it runs two tests:

--- FAIL: TestLockfile (0.00s)
    --- PASS: TestLockfile/TestGitBranchFrozenUsesFloatEntry (2.75s)
    --- FAIL: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (3.96s)

i think the only real difference between these tests is that dagger is run with either "--lock=pinned" or "--lock=frozen". I wonder if these args are missing from the arg digest and are getting deduped?

uneven plinth
#

I was under the impression when I last reviewed that all the lockfile operations ran in operations that were per-client (which would avoid this), but I must have missed this case

#

Kind of gets at the broader problem that the approach of trying to do this stuff "inline" in operations is gonna be finnicky for these sorts of reasons

#

conceptually it would be better to just integrate it into the cache itself

#

the cache has some client/session awareness already to deal with stuff like secrets/sockets

earnest spruce
#

yeah it's a whole other dimension of things that can affect the execution, and changes the op's result ๐Ÿ˜

earnest spruce
tawny apex
#

Pre-theseus I wouldn't have either dared to explore that direction

uneven plinth
tawny apex
#

What's finnicky about the current design though?

#

as opposed to less elegant / fast / simple

uneven plinth
#

Actually, now that I say this out loud... That "worked" in the previous system because the cache just dissappeared and wasn't actually persistent. But now in theseus, cache is actually retained when it's referenced by something that is persistent (a.k.a. it actually makes sense now)

But that may actually not be what we desire anymore. We don't want to have persistent cache for the per-client Query.git op

#

(This has been the story of my life for the last 2 months, some part of our system relied on broken caching behavior and thus once caching actually works it needs updating)

tawny apex
#

After "the engine is too fast", introducing "the cache works too well"

#

reverse bottlenecks

uneven plinth
uneven plinth
uneven plinth
# earnest spruce why yes it does

I know I'm a broken record but it feels soooooooooooo good when it's actually possible to predict the behavior of caching, it's such a goddamn relief after years of nonsense

uneven plinth
# earnest spruce why yes it does

Ok so I think we probably do want to delete that code. But we need replacements elsewhere. Basically need to "shift" the places where we scope cache by content.

E.g. by deleting that chunk of code each GitRepository operation will be per-client (so run once per-client), but we definitely want to attach a content digest at the end of GitRepository.ref. That way we retain content caching but in a more fine grain manner

#

That should avoid the issue for the lockfile stuff, we can probably put off trying to do deep cache integration (even though that might be a good long term direction)

earnest spruce
tawny apex
#

@uneven plinth if we punt on deep cache integration, does the overall shape of the lockfile feature (CLI flags, "lock modes", location and format of the lockfile itself...) allow doing it later without breaking the contract?

uneven plinth
# tawny apex <@949034677610643507> if we punt on deep cache integration, does the overall sha...

Yeah for sure. It'd be an implementation detail only. Changing the implementing to have deep cache integration just helps avoid situations like what we currently hit. We were lucky in this case since we actually really wanted to delete that chunk of code anyways for basic correctness, and it also happened to unblock the lockfile problem.

But highly possible we might hit a situation in the future where we aren't so lucky. Deeper cache integration would make it all way easier

tawny apex
#

BTW one stopgap available to us if needed: we could reduce the scope of lockfile-enabled operations in this branch. The only one we really need for merging workspace, is module.resolve. I added git and container operations mostly to make sure it generalized as I hoped, and also because it's a cool feature to have with little marginal cost.

But, if removing container and git ops from lockfile helps get this branch merged, it doesn't affect workspace

earnest spruce
tawny apex
#

If you want to pull the ripcord and punt on git, it's an option

#

(we need module lookup though)

#

your call

earnest spruce
tawny apex
# earnest spruce do the modules use any of the dagger git API?

What I mean is: all the lockfile features are awesome, but only one is a blocker for merging the workspace branch: patching moduleSource() to use lockfile when looking up pins for module addresses. That's because the new workspace config file .dagger/config.toml doesn't contain pins; so we need a lockfile to persist module pins in .dagger/lockfile. The beauty of the lockfile feature is that it generalizes beyond just pinning dependencies for dagger modules installed in your workspace: really it can pin any engine lookup operation, at runtime. It just happens that one of those operations is looking up the pinned ref of a module

earnest spruce
tawny apex
#

I'm relatively confident that they are decoupled - yes there is a dependency between module ref lookup and git ref lookup. But from a lockfile perspective they are cleanly decoupled. (so a single module lookup on a remote git address, will produce 2 lockfile entries, one of each type)

tawny apex
earnest spruce
#

I can give it a try, if you're feeling a bit of pressure to get the workspace PR rebased. We can always work on the git changes in a seperate PR

tawny apex
#

It depends how you feel about the remaining git issues

earnest spruce
#

I would be tempted to Erik's idea of moving the digests elsewhere, and see if that doesn't break any of the existing regular (non-lockfile) git tests. But if those explode then maybe yes cut out the git pinning otherwise

tawny apex
#

@earnest spruce aloha ๐Ÿ™‚ How is this looking?

#

(Trying to make a plan for workspace for this week)

earnest spruce
#

I got nerdsniped on cgroups stuff ๐Ÿ˜

#

I wrote a TestGit/... test to verify caching is still working after I removed the content digest overrides, and it still is, so I'm wondering if we even need to add them back in.

#

so I think we would be good to try rebasing workspace on the lockfile-on-egraph branch we have.

tawny apex
#

ah! nice ๐Ÿ™‚

#

maybe we just overwrite lockfile with the egraph version? And keep the current version as backup somewhere

#

It's not like we have any path to shipping the non-lockfile version..

earnest spruce
tawny apex
#

Yes ๐Ÿ‘

#

(don't forget the backup though)

earnest spruce
#

well maybe not a good thing since I would be able to directly push to lockfile, but I can push it to my own fork and maybe you can do the push on my behalf when its ready?'

tawny apex
#

@earnest spruce can you try again? I upgraded from "write" to "maintain"

#

Weird that "write" doesn't let you push a branch

earnest spruce
#
$ git push dagger lockfile-before-egraph 
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
remote: error: GH013: Repository rule violations found for refs/heads/lockfile-before-egraph.
remote: Review all repository rules at https://github.com/dagger/dagger/rules?ref=refs%2Fheads%2Flockfile-before-egraph
remote: 
remote: - Cannot create ref due to creations being restricted.
remote: 
To github.com:dagger/dagger.git
 ! [remote rejected]     lockfile-before-egraph -> lockfile-before-egraph (push declined due to repository rule violations)
error: failed to push some refs to 'github.com:dagger/dagger.git'
earnest spruce
tawny apex
#

Ah I see

#

@earnest spruce so that means you should be able to do the push -f lockfile then

earnest spruce
tawny apex
#

If it's ready, then yeah - no point in waiting. Nobody else is using that branch

earnest spruce
#

looks like I need to do another rebase onto egraph first, plus fix a lint issue. I'll do that and if I get my tests passing I'll do the push.

earnest spruce
tawny apex
#

@earnest spruce try again? I upgraded you to maintain again

whole ibex
earnest spruce
#

I was sort of afraid to step on people's toes if I pushed it directly to the egraph branch

whole ibex
#

its just you / lockfile

earnest spruce
earnest spruce
tawny apex
#

@earnest spruce nice, so you were able to push lockfile?

earnest spruce
#

another rebase laughcry

#

then trying the --force push

earnest spruce
# tawny apex <@691818044368093268> try again? I upgraded you to `maintain` again

๐Ÿ’€ I have no idea why it's still not working:

remote: error: GH013: Repository rule violations found for refs/heads/lockfile.
remote: Review all repository rules at https://github.com/dagger/dagger/rules?ref=refs%2Fheads%2Flockfile
remote: 
remote: - Cannot force-push to this branch
remote: 
To github.com:dagger/dagger.git
 ! [remote rejected]     lockfile -> lockfile (push declined due to repository rule violations)
#

@tawny apex do you want to just try running git checkout lockfile && git reset --hard c9aeb2d70079ef27e96cc55a8d4cf199657289ef && git push --force-with-lease?

tawny apex
#

@earnest spruce try again?

earnest spruce
#

it worked!

earnest spruce
#

@tawny apex should I try out the workspace branch next, and see if I can get it rebased onto the new branch I just force pushed?

tawny apex
#

@earnest spruce sorry got pulled into something else..

tawny apex
tawny apex
#

@earnest spruce any luck? (ignore if you're off today, I don't know how you allocate your time throughout the week ๐Ÿ™ )

whole ibex
#

Happy to sync this week too if you wanna have a rubber duck, once release if out, will try to resolve some of the egraph conflicts, in the meantime

earnest spruce
tawny apex
#

nice!

#

@earnest spruce just to be clear, the tests were not passing before

earnest spruce
#

ah well that's good news ๐Ÿ˜…

tawny apex
#

So unfortunately it will be hard to identity if the rebase itself broke anything - moving target

#

BTW make sure to get latest workspace tip if you're going to mess with the tests, I'm actively refactoring them

earnest spruce
#

I think I introduced a race/cache related bug to TestDirectory/TestDiff, because it's intermittently failing, and is also doing that on the lockfile branch too.

tawny apex
#

@whole ibex is egraph itself very far behind main, with all the release fire-fighting?

earnest spruce
tawny apex
#

@earnest spruce since we don't have the objective truth of "tests are green again" to tell us if rebase is successful: we should define an arbitrary truth, let's call it "Alex gut check" ๐Ÿ™‚

If you think we're close to passing the Alex Gut Check on latest tip, I can freeze while you catch up to tip, then can do the switch?

#

I'm doing a pass on docs right now anyway

#

I think you only have very superficial changes to catch up to (like UI polish, changing the text of an error message, docs, and tests)

earnest spruce
#

gut check defintely has some food poisioning right now, since TestDirectory should be failing ๐Ÿ˜

tawny apex
#

New success metric: "tests are red again" ๐Ÿ˜›

earnest spruce
#

do we think we'll be able to merge egraph this or next week?

tawny apex
#

It's been non-stop firefighting on main... 0.20.4, .5, now struggling to get .6 out. So everyone has been distracted.

#

But we can't afford to wait too long on egraph, or it will get harder and harder

earnest spruce
#

yeah, i've been slightly afraid to ask.... seems like it's been a struggle

tawny apex
#

We're in the final stretch though

#

The root cause for most of the problems (although not all) has been our attempt to merge workspace-plumbing fast, to sneak in as much of the workspace branch as we could without breaking backwards compat. Turns out, that was harder than planned, we had gaps in backwards compatibility. And, the problem was compounded by us not catching it before merge; or even after merge... we only caught it after release. Which created more problems because Dagger Cloud could no longer upgrade to latest release, but customers did start to upgrade...anyway

#

So anyway:

  1. We are still committed to merging egraph asap
  2. I don't think anyone had time to actually work on that
  3. So in practice I don't know what we can realistically aim to merge it
  4. Are you interested in working on that? ๐Ÿ˜›
earnest spruce
#

I can help with another merge from main if needed

tawny apex
#

@earnest spruce I believe there's a specific failing test holding back merge

#

(plus right now, main is probably a few commits ahead too because of exceptions made to the freeze for the release)

earnest spruce
#

before our cloud/release issues I think it was only the python-sdk tests that were failing, but now it looks like there's a merge conflict plus many other tests that were failing due to release issues.

tawny apex
#

assuming we get the tests green again on main (should be imminent), then the remaining issue are 1) the original python-sdk test, 2) the merge conflicts - to be expected since the release fixes involved module loading, routing of context directory... but hopefully the conflicts are superficial

whole ibex
whole ibex
earnest spruce
#

I just tried a quick merge, and the one that's confusing me is:

func (mod *Module) GetContextSource() *ModuleSource {
    if !mod.ContextSource.Valid {
        return nil
    }
<<<<<<< HEAD
    return mod.ContextSource.Value.Self()
=======

    return hashutil.HashStrings(
        contentDigest,
        contentCacheScope,
        "asModule",
        mod.AsModuleVariantDigest,
    ).String()
>>>>>>> dagger/main
}
#

from core/module.go

#

@unique robin added AsModuleVariantDigest under b35e37b0e042e8ca148e8580b52bec65846576b6 however that was to func (mod *Module) ContentDigestCacheKey() string { ....}, but in the egraph source the function it's trying to merge into is a different (or maybe renamed?) function called func (mod *Module) GetContextSource() *ModuleSource {

whole ibex
#

I think its ok to remove that cache key change, and implement it in egraph style. But we need the test coverage for that (as a followup)

earnest spruce
unique robin
#

That cache commit is about the following case: Two modules using the SAME toolchain, but with differing customizations fields (example: ignore pattern for a directory arg)

#

One should not cache hit for the other

tawny apex
#

@earnest spruce @whole ibex can we clarify ownership of getting egraph merged? Can I / should I make you Alex the owner of that? Not to do all the work necessarily, just making sure someone knows what's going on and can authoritatively pull others to help

whole ibex
tawny apex
#

IMO it makes sense for Alex to own it, since 1) merging egraph, and 2) rebasing lockfile on egraph, and 3) rebasing workspace on lockfile, are tightly coupled and would require a lot of back-and-forth - better to have the same person own all 3 IMO

#

It boils down to: do you have the bandwidth this week @earnest spruce , since I know you're not full time

earnest spruce
#

I only have a few more hours left this week (I got some house stuff to tackle tomorrow and friday), I have a halfway merge here locally, just no test or anything for it.

#

I could see if my merge can pass a quick local test.

whole ibex
earnest spruce
#

I can have a quick try at manually testing it while you're wrapping up with the release

whole ibex
whole ibex
whole ibex
earnest spruce
#

I'm currently running the test via ./hack/with-dev go test -count=1 -v -run TestToolchain/TestToolchainCustomizationCacheKey ./core/integration/ and it seems to have stalled ๐Ÿ˜•

#

but this is under egraph

#

I'll try it via dagger call engine-dev test --run '^TestToolchain/TestToolchainCustomizationCacheKey$' --test-verbose in case it's different

#

yeah it's failing here too

=== CONT  TestToolchain/TestToolchainCustomizationCacheKey                                                                                                                                        
    testctx.go:174:                                                                                                                                                                               
                Error Trace:    /app/core/integration/toolchain_test.go:522                                                                                                                       
                                                        /go/pkg/mod/github.com/dagger/testctx/oteltest@v0.1.2/trace.go:133                                                                        
                                                        /go/pkg/mod/github.com/dagger/testctx@v0.1.2/middleware.go:25                                                                             
                                                        /go/pkg/mod/github.com/dagger/testctx@v0.1.2/testctx.go:150                                                                               
                Error:          Should not be: "1776293353521819777"                                                                                                                              
                Test:           TestToolchain/TestToolchainCustomizationCacheKey                                                                                                                  
                Messages:       different toolchain customizations must not cache-hit       
#

I'm ok removing the salting of the cache key
I can push my merge in that case, and see how it goes.

#

give me a quick sec to try a few unrelated tests first.

hidden current
earnest spruce
hidden current
#

are you ok in us merging the egraph branch if we get it green on CI?

earnest spruce
#

oh yes please! the only last test failure are the python-sdk errors (and some linting, which seems easy enough to clean up).

#

once egraph is in main, I can rebase the lockfile and workspace branches onto it -- no need to worry about those.

hidden current