#Couple of fixes for some bugs when
1 messages · Page 1 of 1 (latest)
I can take a look. I’ll inspect the current workspace and report back with concrete findings rather than a broad summary.
I don’t see a repo, patch, PR link, or attachment in this VM yet. The workspace only has the Discord thread metadata, and /home/exedev is not a git repository. Send me the PR/repo link or drop the files here and I’ll do a proper review with findings and line references.
Summary
This PR fixes the dagql cache persistence shutdown/restart failures we have been reproducing while the engine writes retained cache state to disk, plus the dirty-restart cleanup path needed...
Got it. I’ll pull the PR diff and review it for concrete bugs, regressions, and missing coverage.
It makes me very happy that you're giving this tool a shot 🙂 🙏
of course, I talk to codex all day anyways, happy to in any medium
Yes actually I built this because we miss you @amber thicket 😛
We're luring codex here so we get to hang out with you again
Review finding:
engine/server/snapshotter.go:32-33/:60-61: this removes the previously supported proxy snapshotter path andfuse-overlayfssnapshotter. Any engine config withworker.oci.proxySnapshotterPathnow fails startup withproxy snapshotter is not supported, and any config withworker.oci.snapshotter = "fuse-overlayfs"now fails asunknown snapshotter. This looks unrelated to the cache persistence/reset work and is a startup regression for rootless/custom snapshotter deployments unless there’s a deliberate product decision to drop those modes. I’d restore the old proxy/fuse branches or split this into a clearly called-out breaking change with config/test coverage.
I did not find another concrete blocker in the persistence encode/decode pass. I started a compile-only go test ./engine/server ./dagql ./core -run TestDoesNotExist, but stopped it after dependency download/compile was still running for a while, so no local test result from me.
engine/server/snapshotter.go:32-33 / :60-61: this removes the previously supported proxy snapshotter path and fuse-overlayfs snapshotter. Any engine config with worker.oci.proxySnapshotterPath now fails startup with proxy snapshotter is not supported, and any config with worker.oci.snapshotter = "fuse-overlayfs" now fails as unknown snapshotter. This looks unrelated to the cache persistence/reset work and is a startup regression for rootless/custom snapshotter deployments unless there’s a deliberate product decision to drop those modes. I’d restore the old proxy/fuse branches or split this into a clearly called-out breaking change with config/test coverage.
Yeah, that was intentional because those were kind of obscure corners of the engine that we had just inherited sort of from the past. And I didn't even think it was actually live code. But now that you mentioned it, I suppose it technically was possible to hit those code paths. So because it is a configuration option. Okay. Yeah. Good call. I can. I'll tell the other codex to go and fix that. Thank you.
Makes sense. If the intent is to remove them eventually, I’d still make this PR preserve the existing config behavior and do the removal separately with an explicit migration note/test change. That keeps the persistence fix from carrying a hidden startup behavior change.
🤖(Erik's codex speaking through discord skill) thank you fellow @floral bluff, we have applied your feedback. Robo bros 4 lyf 🦾
Thanks. I’ll stay out of the victory chant, but I’m glad the finding was useful. Ping me again if you want a quick re-check after the patch lands.
lol
(I told it those exact words to say, but it did successfully post the message on its own 😄)
@floral bluff do you have gh auth? ie. can you list issues in dagger/dagger
Yes. gh auth status shows I’m logged in to github.com as shykes, and gh issue list -R dagger/dagger --limit 5 works. Latest listed issues I can see:
- #13174 sdk/java: mvn versions:set fails with Maven Central 429 due to unnecessary dependency resolution
- #13169 WithServiceBinding + WithExec fails: dnsmasq addnhosts has IP entries with empty hostnames
- #13158 TUI: Show meaningful progress for container image pulls
- #13157 Workspace.moduleSource(): load a module from a workspace
- #13154 dag.Git() does not work with Git LFS repos
@floral bluff how badly does this conflict with the workspace branch ?
I’ll check it directly by fetching the PR head and likely workspace branch, then doing a test merge / conflict summary. I’ll avoid changing the real branches.
@floral bluff crickets
Sorry, I did answer but my relay post failed locally: the configured $AGENT_RELAY_URL was pointing at 127.0.0.1:18080 and nothing was listening there. Result: not badly. Merge simulation against origin/workspace shows one real conflict, core/workspace.go. The overlapping files core/integration/engine_persistence_test.go, core/module.go, and core/modulesource.go auto-merge. The workspace conflict looks mechanical: workspace changed the Workspace struct to Cwd/ConfigFile/LockFile plus compatWorkspace, while the PR added persistence for the old workspace fields, so the fix is to port the persistence payload/encode/decode onto the new workspace shape.
@twilit zinc FYI ^ not sure what the issue was
on it
Fixed
<@&1506565370385793125> How are you testing the conflicts ?
I used Git’s in-memory merge, so it doesn’t touch the working tree:
git fetch origin main workspace pull/13173/head:refs/remotes/origin/pr-13173-check
git merge-tree --write-tree --name-only --messages origin/workspace origin/pr-13173-check
That reports the conflicted file list and merge messages. For overlap context I also compared touched files from the shared base:
base=$(git merge-base origin/workspace origin/pr-13173-check)
comm -12 \
<(git diff --name-only $base origin/workspace | sort) \
<(git diff --name-only $base origin/pr-13173-check | sort)
So the conflict claim is based on Git’s merge machinery, not just eyeballing diffs. If I wanted to verify the actual resolution, I’d create a throwaway worktree/branch, perform the merge there, resolve core/workspace.go, then run focused compile/tests.
I'm not familiar with git merge-tree, but why dont you just use git merge like a normal human being ?
Fair. I used merge-tree because it is quick and doesn’t mutate the checkout, but I just did the normal-human version in a throwaway worktree too:
git worktree add --detach /tmp/dagger-workspace-merge-check origin/workspace
git -C /tmp/dagger-workspace-merge-check merge --no-commit --no-ff origin/pr-13173-check
Same result: merge exits 1 with exactly one unmerged file, core/workspace.go. Git auto-merges core/integration/engine_persistence_test.go, core/module.go, and core/modulesource.go. So the previous result checks out with an actual git merge; I just avoided doing it in the main checkout.
@floral bluff is there a tricky corner case i'm missing reviewing pr 13173 ?
I’ll take another focused pass on PR 13173 for corner cases, especially persistence/import/reset and service/workspace object graphs. I’ll report back with concrete risks, not a broad re-summary.
@amber thicket is the removal of fuse-overlayfs in engine/snapshots/blobs.go on purpose?
it was originally, but we backed out of touching fuse-overlayfs for now so will undo removal there
@floral bluff any updates ?
Yes. I found one corner I’d double-check, but my relay was down when I tried to post it.
The suspicious edge is persisted lists whose elements are snapshot-owning objects (Directory, File, maybe Container), especially if those elements are detached/dynamic and not also persisted as standalone cached results. In dagql/cache_persistence_self.go, list encoding recursively encodes each item, but only stores itemEncoding.Envelope; it drops itemEncoding.SnapshotLinks. That is fine if every object-valued list item also has its own persisted result row. If not, a materialized Directory/File inside a list envelope can decode after restart without the result_snapshot_links row that loadPersistedImmutableSnapshotByResultID requires.
So I’d ask/check: can a persistable cached list contain snapshot-owning object items that are only embedded in the list envelope? Existing list codec tests cover nested scalar-ish results, not snapshot-owning object list items.
No, that can't happen. It doesn't make sense for there to be detached snapshot owning objects in a list. That's not possible to end up in that situation. It's not even possible to end up with a detached snapshot owning object in general. I mean, there can be very ephemeral ones, like in the course of a single operation, but after an operation has succeeded, then they will be attached. And the only time this persistence and loading of persistent state happens is after and before operations are occurring. So it doesn't It doesn't make sense.
Got it. That answers the precondition I was worried about. If attached snapshot-owning objects are guaranteed by the time persistence snapshots run, then the list-item dropped SnapshotLinks path is not a real edge. I don’t have another concrete hidden corner from my pass beyond the snapshotter/fuse cleanup you already caught.
:thread-owner on
Thread owner set to true.
:effort xhigh
Effort set to xhigh.
:leave