#theseus: cause/effect failure
1 messages · Page 1 of 1 (latest)
I can look into this too since I have half of the context, just might need some pointers (will ask here)
Yeah it's gotten fuzzy over vacation time but IIRC I made this one in response to a TestPython failure where some codegen issue wasn't bubbled up correctly. I made it 50% in frustration/desperation and 50% in actual reasoning.
I also remember that I was trying to avoid the O(n^2) behavior of the previous implementation where you had to track/send the entire set of effect IDs for an individual operation rather than just send links
I do also recall just feeling like I didn't understand enough on the client-side to find the right solution, so would appreciate your eyes here but very happy to help navigate all the new parts of the code on the server-side
There's some background info on how laziness works in skills/cache-expert/references/lazy_evaluation.md, not all exactly relevant to this, but generally that's where all this cause/effect stuff comes into play in the new world
I also remember that I was trying to avoid the O(n^2) behavior of the previous implementation where you had to track/send the entire set of effect IDs for an individual operation rather than just send links
The telemetry mitigated that a bit - it would only send newly installed effect IDs relative to the previous call in the chain. That was a later improvement that made a big difference, but it was definitely O(n^2) at one point
https://github.com/dagger/dagger/blob/9da5b7dc29779e131af7f2777ff0e9fc6d9c1c3c/core/telemetry.go#L307-L326
Though looking at it again, we also sent all the parent effects asEffectsCompleted- I think that was a recent change, that part is definitely expensive but I think that was a stopgap until Theseus arrived
Yeah, but I think a further problem is that even in the situation of "only send newly installed effect IDs relative to the previous call", to be able to know what effect IDs are newly installed it still needed to do either O(n^2) work or use O(n^2) space. IIRC of course, going off memory
theseus: cause/effect failure
@signal hazel How's this look? https://github.com/dagger/dagger/pull/13073/changes/a842a85b9f141275e1accb58f90d713b261dcb70
Here's the pi + Codex session if you're curious: https://pi.dev/session/#cbac1cdfa5c9db9beb0032444451348f
Quick explanation: https://pi.dev/session/#cbac1cdfa5c9db9beb0032444451348f&leafId=ba100d7a&targetId=b3c5c7d3
I'm not 100% sure about having to extend GeneratedCode but I thought that might be excusable on account of it being a core object, and maybe Module objects automatically get that by walking fields or something, but that's a total guess
Some of the TestTelemetry diffs are ... interesting? Maybe just needs TUI adjustment? Looks like we're tracking too many relationships now, which feels like a preferable problem at least:
diff --git a/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail b/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail
index 2dd6a9627..cf79f5a86 100644
--- a/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail
+++ b/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail
@@ -49,12 +49,20 @@ $ viztest: Viztest! X.Xs CACHED
├╴✔ withWorkdir / X.Xs
├╴✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
│ ┃ im failing
- │ ! exit code: 1
+ │ ┇ .dockerBuildFail › .dockerBuild ›
+ │ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+ │ ┃ im failing
├╴✘ withEnvVariable PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin X.Xs ERROR
- │ ! exit code: 1
+ │ ┇ .dockerBuildFail › .dockerBuild ›
+ │ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+ │ ┃ im failing
├╴✘ .withoutEntrypoint(keepDefaultArgs: true): Container! X.Xs ERROR
- │ ! exit code: 1
+ │ ┇ .dockerBuildFail › .dockerBuild ›
+ │ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+ │ ┃ im failing
╰╴✘ .withDefaultArgs(args: ["sh"]): Container! X.Xs ERROR
- ! exit code: 1
+ ┇ .dockerBuildFail › .dockerBuild ›
+ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+ ┃ im failing
Setup tracing at https://dagger.cloud/traces/setup. To hide set DAGGER_NO_NAG=1
(this could indicate a O(n^2) situation - iterating...)
that's better! "pending" states are back on the menu
OK, so: after a bit of iteration, the results are great - all tests are passing, and the TestTelemetry golden file updates are solid, but I'm cautious about introducing complexity to Theseus right after it lands. Hopefully I'm not making a huge mess of things. Even added a new test showing that when a module returns a module object that embeds a newly constructed Container, and that Container fails, the API call returning the module object fails (similar to the PythonSdk.codegen case - wanted to make sure we weren't special-casing core types like GeneratedCode).
I had GPT 5.5 draft a design doc for how the effect tracking system works, and then had Opus 4.6 1M review the doc against the code to check for accuracy, and then clearly mark which parts are **NEW** for clarity:
Design doc: https://gist.github.com/vito/e9018ec9445a0b0831a34f64e28fa149
Diff: https://github.com/dagger/dagger/pull/13073/changes
Trying out a simpler version acquired by trying make Opus 4.7 jealous of its competitor's solution and find a better one
Individual commit (PR is polluted with a bunch of go.mod/go.sum bumps): https://github.com/dagger/dagger/pull/13073/changes/517207c165dc98d9dfbee33507c890659975e901
(Sorry it reads like word salad)
@signal hazel Do you recall what UIResumeOutputAttr and LogTargetCallDigest are about? doesn't seem to be wired up into anything in the UI, it's only ever assigned
I noticed service spans changed from attributing logs to .asService to having an exec ... span beneath asService that has the logs, wondering if that was intentional or something that just came along for the ride
(Context: while I'm in the area, I'm tidying up service related telemetry, related to the other PR for fail-fast services; the goal is that the asService span becomes the 'error origin' so you can easily jump to it from wherever the exit error bubbled up)
Yes, somewhat... The egraph branch on sipsma/dagger still has the unsquashed history (well, less squashed anyways), so I bet there's more definitive records in commit messages somewhere in there, which an agent would probably delight in finding.
For LogTargetCallDigest I remember that the problem was Service storing the otel span context on the object itself. That does not play nicely if a Service ends up in the DAG underneath something getting persisted in cache (either memory or disk). It's bad to load up a cached Service and then just use some old span context that has nothing to do with the current client. This was actually causing problems.
I recall that LogTargetCallDigest addressed that in some way, though I'd be lying if I said I immediately remembered exactly how/why. I think it was just a client-independent way of pointing to where logs for the service should end up?
Lemme know if that's useless, I can dig up more too
Oh wait, sorry I misread your message, are you saying they aren't even hooked up at all? If so, then what I was describing was true at some point, but may have shifted
Ah ok, sounds plausible and very closely related to the error attribution refactor above which sort of generalizes that pattern ("what span created this object")
Oh, maybe it is used - at least, LogTargetCallDigest seems to ultimately get used by dagui.DB to map back from a log record (which sends it as DagDigestAttr) back to the creating span