#Lockfile on egraph ๐งต
1 messages ยท Page 1 of 1 (latest)
FYI @earnest spruce @uneven plinth @unique robin @whole ibex
Starting this thread to track
@earnest spruce let me know if you need help today
Oh wait there was already another thread... 
Sorry
I have https://github.com/dagger/dagger/pull/12922 in place, which was a result of first rebasing lockfile onto main, then merging that into the egraph branch. It's compiling and passing some tests there, but not all of them, which I'm trying to sort out today.
I'm currently working on TestLockfile/TestLiveDiscoversQueryEntries in particular -- I think there's some sort of race where the client is shutting down before the lockfile gets written.
@tawny apex were the lockfile tests passing originally before the rebase?
hahah..... oh dear. yeah that's a good question to see if the failing ones were initially failing.
Or a dumb test, or both
here's the current state after my merge:
--- FAIL: TestLockfile (0.00s)
--- PASS: TestLockfile/TestWorkspaceUpdateRequiresLockfile (1.22s)
--- PASS: TestLockfile/TestFromLockfilePinnedUsesPinEntry (1.93s)
--- PASS: TestLockfile/TestFromLockfileFrozenUsesFloatEntry (1.95s)
--- FAIL: TestLockfile/TestLiveDiscoversQueryEntries (3.07s)
--- PASS: TestLockfile/TestFromLockfileDisabledIgnoresEntry (3.76s)
--- FAIL: TestLockfile/TestFromLockfileLiveRefreshesEntry (4.10s)
--- FAIL: TestLockfile/TestFromLockfilePinnedRefreshesFloatEntry (4.16s)
--- PASS: TestLockfile/TestFromLockfileFrozenRequiresEntry (1.14s)
--- FAIL: TestLockfile/TestLiveDiscoversGitEntries (10.30s)
--- PASS: TestLockfile/TestLockUpdateRefreshesExistingEntry (7.30s)
--- PASS: TestLockfile/TestGitBranchFrozenUsesFloatEntry (10.43s)
--- PASS: TestLockfile/TestLockUpdateRefreshesExistingGitEntry (6.46s)
--- FAIL: TestLockfile/TestLiveNestedQuery (6.53s)
--- FAIL: TestLockfile/TestModuleSourceFrozenUsesLockedEntry (8.71s)
--- PASS: TestLockfile/TestWorkspaceUpdate (8.73s)
--- FAIL: TestLockfile/TestLockUpdateRefreshesExistingModuleResolveEntry (10.73s)
--- FAIL: TestLockfile/TestWorkspaceUpdateNestedQuery (9.51s)
--- FAIL: TestLockfile/TestLiveModuleCall (6.22s)
--- FAIL: TestLockfile/TestLiveDiscoversModuleSourceEntries (11.15s)
--- FAIL: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (12.20s)
If it's only lockfile tests failing post rebase/merge, that's good news IMO ๐
No there's still other test failures but there the known ones I have left on egraph too
I just re-ran it, and see that some flipped from passing to failing, so defintely feels like a race
@earnest spruce Test failures I have left right now are:
- some non-deterministic output for TestTelemetry, squashed most of them now but the last non-determinism bits are tricky
- a few TestLLMs
- Legacy enum support and legacy service support... which is a real nightmare but tackling it now
- Some unit test failures due to stale tests which would show up in
test-base
So those ones will still be red for you. But if you see something outside that lemme know
thanks for that heads up, yeah I noticed a TestContainer ...network.None label related test was failing and went and confirmed it was also failing under your branch so I simply ignored it for now.
Goal is to finish squashing those today so egraph is green+clean ready to merge tomorrow or whenever the release is done by
Oh yeah that one I think I pushed already, was just a fat fingered codebase-wide rename
@earnest spruce #1488639315255890012 message I forget what the state of the lockfile PR was at the time you took it over but I had a bunch of feedback on the approach when I had reviewed it last. There were a bunch of issues so it may be worth just addressing those, might fix any race conditions as a side effect
@uneven plinth quick summary:
- You did your initial review. Had specific concerns
- I incorporated your feedback, but did not fully test or deslopify the result
- Asked Alex for help, he took over the branch (2nd priority after the Dockerfile work)
So, yes your first round of feedback was taken into account, but no I can't tell whether it was applied fully and to your satisfaction ๐
I think in spirit we agreed on the direction - 1) avoid round trip on every lockfile lookup; 2) don't force every lookup to go through a recursive dagql call; 3) treat the problem as a specialized sync problem ("locksync" rather than "filesync") because of multiple clients sharing a workspace & therefore its lockfile;
good point regarding the back and forth bits between the client and potential races there. I'll take a closer look and fix anything up I come across.
Important context for that:
- 1 workspace = 1 lockfile
- 1 workspace may be shared by multiple clients (because modules inherit their caller's workspace, and its lockfile with it)
- Concretely: all modules in my dependency chain will populate the same lockfile
- Consequence 1: one lockfile shared by multiple clients (so don't couple the lockfile too tightly to a client)
- Consequence 2: one session may involve multiple workspaces (so don't couple the lockfile too tightly to a session)
so I think I have a better understand of the race, BuildkitSessionServer.Run() is called and will wait on a context before closing: https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/engine/client/client.go#L743-L760
and it seems like this context gets closed before the POST /shutdown request comes back under https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/engine/client/client.go#L1034-L1037
The issue is that the POST triggers the lockfile sync, which then performs a filesync which depends on the filesyncer (created under https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/engine/client/client.go#L542 ), which is an attachable on the buildkit session.
(I still gotta rename BuildkitSessionServer, it's not buildkit anymore ๐)
That does sound plausible to me though. We should feel free to re-arrange the shutdown logic however we need.
I was just going to ask about that ๐
I'm saving the final renaming and trimming for last, mostly cause there's still bigger fish to fry left, but alos will be satisfying to finish on that note 
are the protos still the original buildkit protos, just moved into engine/session/...?
for the session attachables? yes
there's probably more trimming we could do of those since we only use a subset, but yeah saving that sort of superficial stuff for last now
yeah, just like a bunch of bk variables could eventually be renamed, but best to keep it as-is for now. we don't want to have to change every single line in the repo for this PR
@uneven plinth any chance you have an idea where the attachables session is getting closed? I've traced it to waiting on a c.ctx.Done() under https://github.com/alexcb/dagger/blob/407eebeeff14ee5bccdd13441dd61dbf6458b7b5/internal/buildkit/session/manager.go#L151
I think it's the engine that's initiating the shutdown, i.e. marking the context as done, or closing the attachbles connection rather than the client initiating the closing of the attachables connection.
I just changed ctx, cc, err := grpcClientConn(ctx, conn) to ctx, cc, err := grpcClientConn(context.TODO(), conn), and that solved the race, so it's defintely the ctx that gets passed into handleConn(....) that's triggering the premature shutdown
It can be multiple places technically, i.e. if the main client disconnects entirely the attachables will close.
But https://github.com/sipsma/dagger/blob/0e8d78c59290c34b08265270a2aa767525f13703/engine/server/session.go#L1132-L1132 is the area you'd hit typically during graceful shutdown.
That'll get canceled after beginClosing is called https://github.com/sipsma/dagger/blob/0e8d78c59290c34b08265270a2aa767525f13703/engine/server/session.go#L340-L340
Among other places that is called in the handler for /shutdown itself here: https://github.com/sipsma/dagger/blob/0e8d78c59290c34b08265270a2aa767525f13703/engine/server/session.go#L1277-L1277
so I'd make sure that you take care of anything that needs session attachables before you call beginClosing()
ahh, that's super helpful, i'll see if I can move sess.beginClosing() further down.
yeah sorry there's a lot of cancellation and shutdown paths that have accumulated over the years.
For theseus I also had to modify this whole area of the code significantly to help ensure that all clients are really for sure fully disconnected before trying to persist the cache, to ensure it is in a settled consistent state (aka I spent like 2 days debugging why the cache was occasionally off after reloads ๐ )
It's probably worth a cleanup+consolidation pass sometime in the future
woohoo, moving it further down solved the TestLockfile/TestLiveDiscoversQueryEntries failure I was hunting down. and in fact many other ones, now I'm down to only 4 failures:
--- FAIL: TestLockfile (0.00s)
--- PASS: TestLockfile/TestWorkspaceUpdateRequiresLockfile (1.36s)
--- PASS: TestLockfile/TestFromLockfilePinnedUsesPinEntry (2.75s)
--- PASS: TestLockfile/TestLiveDiscoversQueryEntries (4.55s)
--- PASS: TestLockfile/TestFromLockfilePinnedRefreshesFloatEntry (4.84s)
--- PASS: TestLockfile/TestFromLockfileDisabledIgnoresEntry (5.39s)
--- PASS: TestLockfile/TestFromLockfileLiveRefreshesEntry (5.80s)
--- PASS: TestLockfile/TestLockUpdateRefreshesExistingEntry (3.30s)
--- PASS: TestLockfile/TestWorkspaceUpdate (6.27s)
--- PASS: TestLockfile/TestFromLockfileFrozenRequiresEntry (1.22s)
--- PASS: TestLockfile/TestWorkspaceUpdateNestedQuery (7.43s)
--- PASS: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (7.46s)
--- PASS: TestLockfile/TestFromLockfileFrozenUsesFloatEntry (1.76s)
--- FAIL: TestLockfile/TestLiveDiscoversGitEntries (7.79s)
--- FAIL: TestLockfile/TestLiveNestedQuery (3.30s)
--- PASS: TestLockfile/TestLockUpdateRefreshesExistingGitEntry (5.48s)
--- FAIL: TestLockfile/TestLiveModuleCall (5.52s)
--- FAIL: TestLockfile/TestGitBranchFrozenUsesFloatEntry (5.30s)
--- PASS: TestLockfile/TestLockUpdateRefreshesExistingModuleResolveEntry (20.95s)
--- PASS: TestLockfile/TestModuleSourceFrozenUsesLockedEntry (19.89s)
--- PASS: TestLockfile/TestLiveDiscoversModuleSourceEntries (21.28s)
I sense a rebase in my near future ๐
yeah no doubt, it's pretty gnarly code with the various connection callbacks. The cache debugging issue sounds rather challenging.
speaking of which I do still need to update my branch after all the merges to main last 24h... lemme finish my current task and I'll get to that next. I'm currently working on a fix to TestLegacy/TestContainerAsService, which despite being a stupid test for obscure things (why I saved it for the end) did manage to reveal a flaw in the new lazy evaluation logic
Half-way through fixing it though
I'll let you know once I've done a merge commit from main. Might be easier for me to do that and then you to merge that from my branch, so then you don't have to resolve new conflicts on your own outside of the code you were actually changing
my branch after all the merges to main last 24h...
sorry about that ๐ hope its gonna be fine ๐ค
yeah best to have only one of us resolve conflicts, rather than two different ways to resolve them ๐
plus you know your branch much more ๐
No apology needed, I appreciate you checking before merge. Yeah it's not gonna be a bad one at all based on what I saw, just saving it for after I finish up my current changes
Merged main into my PR. Also got all of the tests that were failing passing locally! So provided I didn't break any other tests in the process, should have all green tests in CI 
failed to read file: failed to create diff copy client: rpc error: code = Unavailable desc = rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: only one connection allowed", weeeeee.
I wonder if the filesync client needs a lock?
You might be hitting the wrong client
weird thing is I only see a single call to the lockfile flush?
good idea, this seems to be occuring when there's nested clients.
Yeah I forget the exact context, but I hit a similar error once with secrets and it came down to me trying to get a secret from a nested client that had exited
Make sure there's no clientIDs or other per-client state stored on cached objects which is then being used here.
Funnily, one large area of work with theseus has been finding places where we stored per-client state on objects, which just so happened to work because caching was broken enough that we never got cache hits and thus never hit bugs where we leaked client information across cache
But then when caching actually works, the bug gets revealed 
yeah that sounds very similar to what I'm seeing:
if client.clientID == sess.mainClientCallerID {
slog.Info("main client is shutting down")
flushCtx := context.WithoutCancel(ctx)
if err := srv.flushWorkspaceLocks(flushCtx, client); err != nil {
the main calling client is getting passed to flushWorkspaceLocks, but I suspect the flush needs to happen in the nested clients.
Is flush the place where it's written back to the client's filesystem?
If so this might actually be more like a design and/or test harness issue
it pulls out the session, which then has a bunch of lockfiles which are tied to workspaces which then have a workspaceCtx which pulls out a connection from there, AFAIK
It's kind of ambiguous what the behavior should be here. In real-world use cases you do only want to export back to the main client. But our tests make heavy use of nested clients, and we in this test case essentially want the nested client to be more like a simulated main client
I guess the most coherent thing would be to always export back to the client that loaded the lockfile in the first place
that might be a significant design departure though. I don't remember whether nested CLI clients call /shutdown or not
@uneven plinth @earnest spruce will catch up as soon as release emergency is over
I'm not entirely sure if this is a nested CLI client issue, it's getting triggered by
"container": {
"from": {
"file": {
"contents": "3.23.3\n"
}
}
}
}
where as it seems to work when it's simply a container.from (without any file)
can you link to the code for the test case that fails with this error
ok yeah it is a nested client in that case https://github.com/alexcb/dagger/blob/33eb5f0a73d041d9daf54866732ebfe8580c2713/core/integration/lockfile_test.go#L246
Some of the other tests use hostDaggerExec, which would avoid this issue technically
but more in the sense of hiding from it rather than solving it
ahh I see now, it's a hostDaggerExec (which works), vs With(daggerExec(....)), which doesn't work
Now catching up...
I'm thinking it's that the flush is occurring under the main client shutdown; however at that point the nested clients will already have been closed.
maybe we need to scan through the workspaces and if any are associated with the client shuttingdown (either main, or nested), then we do a flush?
i would check if nested CLI clients even call /shutdown today. If they do, then you'd need to update the accounting code for all of this to track which client a workspace+lockfile were loaded from and then trigger the export back on their shutdown. If they don't call /shutdown, then we'd need a broader change
The other option of course is to not care about nested clients since it's outside the realm of 99% of real world use cases, but we would rule out some legit uses like with the engine-dev playground container
I have a printf that's being hit when if client.clientID == sess.mainClientCallerID { ... } else { printf("here") }, so I think it's being called by nested.
this is feeling pretty hacky, but the test is now passing with this change: https://github.com/alexcb/dagger/commit/e068b39b403bc81efbe52e838517d619a00acd0c
apologies for all the noisy printfs in this
the main changes are that I'm calling srv.flushWorkspaceLocks from all shutdown instances now, and that I also changed the flushing code to only flush matching clients via a if state.ws.ClientID == client.clientID {
that sounds reasonable tbh
at least directionally
looks like my change has now caused TestGitBranchPinnedRefreshesFloatEntry to fail though, it used to pass before this ๐
ok now REALLY catching u
oh it's failing with a
it could very well be that it used to be passing because the string used to contain "error: this is not possible", at least that's me being optimistic ๐
@earnest spruce is the architecture of lockfile itself an issue after all;?
I'm not sure to be honest, it seems like I'm making progress on adding some extra steps in how the flush is called. it might work.
I'm avoiding refactoring anything right now just to see if I can get the tests passing first, then maybe see if we can optimize things later.
Can i help get to the answer?
I only have two failings tests now:
--- FAIL: TestLockfile/TestLiveDiscoversGitEntries (2.79s)
--- FAIL: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (1.84s)
if I can get these working I think we're in good shape.
thanks for carrying it
I'm very curious to see what happens when I rebase workspace to egraph (via your new lockfile)
yeah that'll be another big moment
so I checked against an earlier commit and it was legit passing earlier, so I gotta figure out what changed with this.
so this git test is a bit racy, when I run ./hack/build && ./hack/with-dev go test -count=1 -v -run TestLockfile/TestGitBranchPinnedRefreshesFloatEntry ./core/integration/ it passes, however when I change my test regex to TestLockfile/TestGit, it fails since it runs two tests:
--- FAIL: TestLockfile (0.00s)
--- PASS: TestLockfile/TestGitBranchFrozenUsesFloatEntry (2.75s)
--- FAIL: TestLockfile/TestGitBranchPinnedRefreshesFloatEntry (3.96s)
i think the only real difference between these tests is that dagger is run with either "--lock=pinned" or "--lock=frozen". I wonder if these args are missing from the arg digest and are getting deduped?
yeah looks like the lockfile operations run in GitRepository.ref, which at least for public repos will be cached correctly between clients. So the first client gets what it expects, second client probably hits cache GitRepository.ref and thus never even executes anything lockfile related
I was under the impression when I last reviewed that all the lockfile operations ran in operations that were per-client (which would avoid this), but I must have missed this case
Kind of gets at the broader problem that the approach of trying to do this stuff "inline" in operations is gonna be finnicky for these sorts of reasons
conceptually it would be better to just integrate it into the cache itself
the cache has some client/session awareness already to deal with stuff like secrets/sockets
yeah it's a whole other dimension of things that can affect the execution, and changes the op's result ๐
do you think that's a very big detour to do that?
yeah it would be, but not as bad as it used to be. it'd be like 1 day of designing w/ codex and implementing
What's finnicky about the current design though?
as opposed to less elegant / fast / simple
it actually is the same sort of problem that @whole ibex hit with git caching. Query.git itself is executed per-client but it returns an object cache-scoped to the repo (https://github.com/alexcb/dagger/blob/33eb5f0a73d041d9daf54866732ebfe8580c2713/core/schema/git.go#L730-L755)
So subsequent calls chained off of it can share cache
Actually, now that I say this out loud... That "worked" in the previous system because the cache just dissappeared and wasn't actually persistent. But now in theseus, cache is actually retained when it's referenced by something that is persistent (a.k.a. it actually makes sense now)
But that may actually not be what we desire anymore. We don't want to have persistent cache for the per-client Query.git op
(This has been the story of my life for the last 2 months, some part of our system relied on broken caching behavior and thus once caching actually works it needs updating)
After "the engine is too fast", introducing "the cache works too well"
reverse bottlenecks
@earnest spruce does the flake disappear if you delete this chunk of code? https://github.com/alexcb/dagger/blob/33eb5f0a73d041d9daf54866732ebfe8580c2713/core/schema/git.go#L730-L755
yes, good problems to have in the long term
why yes it does
I know I'm a broken record but it feels soooooooooooo good when it's actually possible to predict the behavior of caching, it's such a goddamn relief after years of nonsense
Ok so I think we probably do want to delete that code. But we need replacements elsewhere. Basically need to "shift" the places where we scope cache by content.
E.g. by deleting that chunk of code each GitRepository operation will be per-client (so run once per-client), but we definitely want to attach a content digest at the end of GitRepository.ref. That way we retain content caching but in a more fine grain manner
That should avoid the issue for the lockfile stuff, we can probably put off trying to do deep cache integration (even though that might be a good long term direction)
ah that seems reasonable, since we'll force the ops to execute which will let us check the lockfile, then once we know exactly what the ref will be, that will go into the content digest key.
@uneven plinth if we punt on deep cache integration, does the overall shape of the lockfile feature (CLI flags, "lock modes", location and format of the lockfile itself...) allow doing it later without breaking the contract?
Yeah for sure. It'd be an implementation detail only. Changing the implementing to have deep cache integration just helps avoid situations like what we currently hit. We were lucky in this case since we actually really wanted to delete that chunk of code anyways for basic correctness, and it also happened to unblock the lockfile problem.
But highly possible we might hit a situation in the future where we aren't so lucky. Deeper cache integration would make it all way easier
BTW one stopgap available to us if needed: we could reduce the scope of lockfile-enabled operations in this branch. The only one we really need for merging workspace, is module.resolve. I added git and container operations mostly to make sure it generalized as I hoped, and also because it's a cool feature to have with little marginal cost.
But, if removing container and git ops from lockfile helps get this branch merged, it doesn't affect workspace
it's only the git related tests that are failing, the container.From lockfile tests are working
If you want to pull the ripcord and punt on git, it's an option
(we need module lookup though)
your call
do the modules use any of the dagger git API?
What I mean is: all the lockfile features are awesome, but only one is a blocker for merging the workspace branch: patching moduleSource() to use lockfile when looking up pins for module addresses. That's because the new workspace config file .dagger/config.toml doesn't contain pins; so we need a lockfile to persist module pins in .dagger/lockfile. The beauty of the lockfile feature is that it generalizes beyond just pinning dependencies for dagger modules installed in your workspace: really it can pin any engine lookup operation, at runtime. It just happens that one of those operations is looking up the pinned ref of a module
I do see some references to dag.Select(ctx, gitRef, so maybe the pinning of modules depends on git? It's probably more robust to also get the git pinning working, but I really don't know if it's a blocker for module pinning or not.
I'm relatively confident that they are decoupled - yes there is a dependency between module ref lookup and git ref lookup. But from a lockfile perspective they are cleanly decoupled. (so a single module lookup on a remote git address, will produce 2 lockfile entries, one of each type)
100% confident that I designed it that way. Moderately confident that the agent didn't change the implementation behind my back ๐
I can give it a try, if you're feeling a bit of pressure to get the workspace PR rebased. We can always work on the git changes in a seperate PR
It depends how you feel about the remaining git issues
I would be tempted to Erik's idea of moving the digests elsewhere, and see if that doesn't break any of the existing regular (non-lockfile) git tests. But if those explode then maybe yes cut out the git pinning otherwise
@earnest spruce aloha ๐ How is this looking?
(Trying to make a plan for workspace for this week)
I got nerdsniped on cgroups stuff ๐
I wrote a TestGit/... test to verify caching is still working after I removed the content digest overrides, and it still is, so I'm wondering if we even need to add them back in.
here's the new, rather simple test: https://github.com/alexcb/dagger/blob/306a0964c709ad97ca08c77d4793e9e3771f1610/core/integration/git_test.go#L1834-L1851 but it's working
so I think we would be good to try rebasing workspace on the lockfile-on-egraph branch we have.
ah! nice ๐
maybe we just overwrite lockfile with the egraph version? And keep the current version as backup somewhere
It's not like we have any path to shipping the non-lockfile version..
that sounds good, can I just do a git push -f when the time comes?
I don't think I have direct dagger push access (which is a good thing), so I pushed a backup to https://github.com/alexcb/dagger/tree/lockfile-before-egraph
well maybe not a good thing since I would be able to directly push to lockfile, but I can push it to my own fork and maybe you can do the push on my behalf when its ready?'
Weird you have write access to the repo
@earnest spruce can you try again? I upgraded from "write" to "maintain"
Weird that "write" doesn't let you push a branch
$ git push dagger lockfile-before-egraph
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
remote: error: GH013: Repository rule violations found for refs/heads/lockfile-before-egraph.
remote: Review all repository rules at https://github.com/dagger/dagger/rules?ref=refs%2Fheads%2Flockfile-before-egraph
remote:
remote: - Cannot create ref due to creations being restricted.
remote:
To github.com:dagger/dagger.git
! [remote rejected] lockfile-before-egraph -> lockfile-before-egraph (push declined due to repository rule violations)
error: failed to push some refs to 'github.com:dagger/dagger.git'
looks like i'm running into https://github.com/dagger/dagger/rules/1146738?ref=refs%2Fheads%2Flockfile-before-egraph
looks like it's specific to creating a new branch
Ah I see
@earnest spruce so that means you should be able to do the push -f lockfile then
probably... I could try it right now if we really want.
If it's ready, then yeah - no point in waiting. Nobody else is using that branch
looks like I need to do another rebase onto egraph first, plus fix a lint issue. I'll do that and if I get my tests passing I'll do the push.
I just tried it now, and I can't force push either ๐
my latest is on https://github.com/alexcb/dagger/tree/lockfile-on-egraph under f3ca85ef0e67bacf031eb4b1b53dab1eed3262f3
@earnest spruce try again? I upgraded you to maintain again
did you do the merge commit for egraph ? Looking into it rn otherwise
I only ever resolved it on my branch, here's the patch I had to apply to resolve the conflict: https://gist.github.com/alexcb/d93a89c7122815e61f6cada1c57e732f
I was sort of afraid to step on people's toes if I pushed it directly to the egraph branch
if youre confident with it, push on it, i think
its just you / lockfile
will do! give me a sec
thanks! I'll try it once I get the above egraph conflict updated.
wow, and here I thought I was going to run into more permissions because the egraph branch is under Erik's fork, but apparently it worked: https://github.com/sipsma/dagger/commit/e482d5c5a1e021fd0b949ec4bfe3f3632a6ba78a
๐ I have no idea why it's still not working:
remote: error: GH013: Repository rule violations found for refs/heads/lockfile.
remote: Review all repository rules at https://github.com/dagger/dagger/rules?ref=refs%2Fheads%2Flockfile
remote:
remote: - Cannot force-push to this branch
remote:
To github.com:dagger/dagger.git
! [remote rejected] lockfile -> lockfile (push declined due to repository rule violations)
@tawny apex do you want to just try running git checkout lockfile && git reset --hard c9aeb2d70079ef27e96cc55a8d4cf199657289ef && git push --force-with-lease?
@earnest spruce try again?
@tawny apex should I try out the workspace branch next, and see if I can get it rebased onto the new branch I just force pushed?
@earnest spruce sorry got pulled into something else..
Yes please! That would be awesome
@earnest spruce any luck? (ignore if you're off today, I don't know how you allocate your time throughout the week ๐ )
Happy to sync this week too if you wanna have a rubber duck, once release if out, will try to resolve some of the egraph conflicts, in the meantime
just got back from a nice lunchtime walk, I've managed to get all the workspace commits cherry-picked, and pushed it to https://github.com/alexcb/dagger/tree/workspace-on-rebased-lockfile but so far it's only compiling, and now I'm sorting out some failing tests.
ah well that's good news ๐
So unfortunately it will be hard to identity if the rebase itself broke anything - moving target
BTW make sure to get latest workspace tip if you're going to mess with the tests, I'm actively refactoring them
I think I introduced a race/cache related bug to TestDirectory/TestDiff, because it's intermittently failing, and is also doing that on the lockfile branch too.
@whole ibex is egraph itself very far behind main, with all the release fire-fighting?
that's good to know, I've only picked up to 20399820f5b5b1c337bcc95cfaab6533f0fbfc9e which is from 2026-04-08
@earnest spruce since we don't have the objective truth of "tests are green again" to tell us if rebase is successful: we should define an arbitrary truth, let's call it "Alex gut check" ๐
If you think we're close to passing the Alex Gut Check on latest tip, I can freeze while you catch up to tip, then can do the switch?
I'm doing a pass on docs right now anyway
I think you only have very superficial changes to catch up to (like UI polish, changing the text of an error message, docs, and tests)
gut check defintely has some food poisioning right now, since TestDirectory should be failing ๐
New success metric: "tests are red again" ๐
do we think we'll be able to merge egraph this or next week?
cc @whole ibex @hidden current on that
It's been non-stop firefighting on main... 0.20.4, .5, now struggling to get .6 out. So everyone has been distracted.
But we can't afford to wait too long on egraph, or it will get harder and harder
yeah, i've been slightly afraid to ask.... seems like it's been a struggle
We're in the final stretch though
The root cause for most of the problems (although not all) has been our attempt to merge workspace-plumbing fast, to sneak in as much of the workspace branch as we could without breaking backwards compat. Turns out, that was harder than planned, we had gaps in backwards compatibility. And, the problem was compounded by us not catching it before merge; or even after merge... we only caught it after release. Which created more problems because Dagger Cloud could no longer upgrade to latest release, but customers did start to upgrade...anyway
So anyway:
- We are still committed to merging egraph asap
- I don't think anyone had time to actually work on that
- So in practice I don't know what we can realistically aim to merge it
- Are you interested in working on that? ๐
I can help with another merge from main if needed
@earnest spruce I believe there's a specific failing test holding back merge
(plus right now, main is probably a few commits ahead too because of exceptions made to the freeze for the release)
before our cloud/release issues I think it was only the python-sdk tests that were failing, but now it looks like there's a merge conflict plus many other tests that were failing due to release issues.
assuming we get the tests green again on main (should be imminent), then the remaining issue are 1) the original python-sdk test, 2) the merge conflicts - to be expected since the release fixes involved module loading, routing of context directory... but hopefully the conflicts are superficial
i think its ok, there might be a cache conflict with the PR we introduced from Tibor, but it should be well scoped.
Yes, we still wanna merge asap (I'll let Marcos and Tibor confirm), but it seems that the few conflicts are scoped. I wanna add 1 cache integration test I think, regarding
I just tried a quick merge, and the one that's confusing me is:
func (mod *Module) GetContextSource() *ModuleSource {
if !mod.ContextSource.Valid {
return nil
}
<<<<<<< HEAD
return mod.ContextSource.Value.Self()
=======
return hashutil.HashStrings(
contentDigest,
contentCacheScope,
"asModule",
mod.AsModuleVariantDigest,
).String()
>>>>>>> dagger/main
}
from core/module.go
@unique robin added AsModuleVariantDigest under b35e37b0e042e8ca148e8580b52bec65846576b6 however that was to func (mod *Module) ContentDigestCacheKey() string { ....}, but in the egraph source the function it's trying to merge into is a different (or maybe renamed?) function called func (mod *Module) GetContextSource() *ModuleSource {
yes that's the thing i was mentioning above
I think its ok to remove that cache key change, and implement it in egraph style. But we need the test coverage for that (as a followup)
yeah a test will help a lot ๐ฅบ
That cache commit is about the following case: Two modules using the SAME toolchain, but with differing customizations fields (example: ignore pattern for a directory arg)
One should not cache hit for the other
@earnest spruce @whole ibex can we clarify ownership of getting egraph merged? Can I / should I make you Alex the owner of that? Not to do all the work necessarily, just making sure someone knows what's going on and can authoritatively pull others to help
Alex is fine by me totally ๐
IMO it makes sense for Alex to own it, since 1) merging egraph, and 2) rebasing lockfile on egraph, and 3) rebasing workspace on lockfile, are tightly coupled and would require a lot of back-and-forth - better to have the same person own all 3 IMO
It boils down to: do you have the bandwidth this week @earnest spruce , since I know you're not full time
I only have a few more hours left this week (I got some house stuff to tackle tomorrow and friday), I have a halfway merge here locally, just no test or anything for it.
I could see if my merge can pass a quick local test.
Sorry if youre waiting on me, still on the release above, will switch soon, let me hack the test real fast
I just pushed my untested merge to https://github.com/alexcb/dagger/tree/egraph-merge , @whole ibex if you're able to come up with an integration test for it, that'll make me feel more comfortable.
I can have a quick try at manually testing it while you're wrapping up with the release
let me push
https://github.com/grouville/dagger/tree/add-caching-regression-test @unique robin the fail still seems to fail until I add the second commit, which actually salts the and relies on the internal bindings you added. I remember your PR actually fixed a cache issue, but this very test might still trigger another variation ?
To run it (pre-egraph): dagger call engine-dev test --run '^TestToolchain/TestToolchainCustomizationCacheKey$' --test-verbose
To me, I'm ok removing the salting of the cache key @earnest spruce because I think it's been failing for a while (never really worked)
I'm currently running the test via ./hack/with-dev go test -count=1 -v -run TestToolchain/TestToolchainCustomizationCacheKey ./core/integration/ and it seems to have stalled ๐
but this is under egraph
I'll try it via dagger call engine-dev test --run '^TestToolchain/TestToolchainCustomizationCacheKey$' --test-verbose in case it's different
yeah it's failing here too
=== CONT TestToolchain/TestToolchainCustomizationCacheKey
testctx.go:174:
Error Trace: /app/core/integration/toolchain_test.go:522
/go/pkg/mod/github.com/dagger/testctx/oteltest@v0.1.2/trace.go:133
/go/pkg/mod/github.com/dagger/testctx@v0.1.2/middleware.go:25
/go/pkg/mod/github.com/dagger/testctx@v0.1.2/testctx.go:150
Error: Should not be: "1776293353521819777"
Test: TestToolchain/TestToolchainCustomizationCacheKey
Messages: different toolchain customizations must not cache-hit
I'm ok removing the salting of the cache key
I can push my merge in that case, and see how it goes.
give me a quick sec to try a few unrelated tests first.
I just pushed e104ce1 onto https://github.com/dagger/dagger/pull/11856
LGTM ๐๐
@earnest spruce are you available tomorrow? Our plan is to rebase Erik's PR with main and merge it so we can start dogfooding egraph ASAP. My understanding is that you can keep rebasing your Lockfile + workspaces on top of that, correct?
no sorry, I have plans tomorrow.
perfect, yes it's what I understood
are you ok in us merging the egraph branch if we get it green on CI?
oh yes please! the only last test failure are the python-sdk errors (and some linting, which seems easy enough to clean up).
once egraph is in main, I can rebase the lockfile and workspace branches onto it -- no need to worry about those.
awesome, thx ๐ . Looking forward to hopefully merge that tomorrow
