#theseus: cause/effect failure

1 messages · Page 1 of 1 (latest)

elder crystal
#

I can look into this too since I have half of the context, just might need some pointers (will ask here)

signal hazel
#

Yeah it's gotten fuzzy over vacation time but IIRC I made this one in response to a TestPython failure where some codegen issue wasn't bubbled up correctly. I made it 50% in frustration/desperation and 50% in actual reasoning.

I also remember that I was trying to avoid the O(n^2) behavior of the previous implementation where you had to track/send the entire set of effect IDs for an individual operation rather than just send links

#

I do also recall just feeling like I didn't understand enough on the client-side to find the right solution, so would appreciate your eyes here but very happy to help navigate all the new parts of the code on the server-side

#

There's some background info on how laziness works in skills/cache-expert/references/lazy_evaluation.md, not all exactly relevant to this, but generally that's where all this cause/effect stuff comes into play in the new world

elder crystal
# signal hazel Yeah it's gotten fuzzy over vacation time but IIRC I made this one in response t...

I also remember that I was trying to avoid the O(n^2) behavior of the previous implementation where you had to track/send the entire set of effect IDs for an individual operation rather than just send links
The telemetry mitigated that a bit - it would only send newly installed effect IDs relative to the previous call in the chain. That was a later improvement that made a big difference, but it was definitely O(n^2) at one point
https://github.com/dagger/dagger/blob/9da5b7dc29779e131af7f2777ff0e9fc6d9c1c3c/core/telemetry.go#L307-L326
Though looking at it again, we also sent all the parent effects as EffectsCompleted - I think that was a recent change, that part is definitely expensive but I think that was a stopgap until Theseus arrived

signal hazel
elder crystal
#

theseus: cause/effect failure

elder crystal
#

I'm not 100% sure about having to extend GeneratedCode but I thought that might be excusable on account of it being a core object, and maybe Module objects automatically get that by walking fields or something, but that's a total guess

#

Some of the TestTelemetry diffs are ... interesting? Maybe just needs TUI adjustment? Looks like we're tracking too many relationships now, which feels like a preferable problem at least:

diff --git a/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail b/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail
index 2dd6a9627..cf79f5a86 100644
--- a/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail
+++ b/dagql/idtui/testdata/TestTelemetry/TestGolden/docker-build-fail
@@ -49,12 +49,20 @@ $ viztest: Viztest! X.Xs CACHED
   ├╴✔ withWorkdir / X.Xs
   ├╴✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
   │ ┃ im failing
-  │ ! exit code: 1
+  │ ┇ .dockerBuildFail › .dockerBuild ›
+  │ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+  │ ┃ im failing
   ├╴✘ withEnvVariable PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin X.Xs ERROR
-  │ ! exit code: 1
+  │ ┇ .dockerBuildFail › .dockerBuild ›
+  │ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+  │ ┃ im failing
   ├╴✘ .withoutEntrypoint(keepDefaultArgs: true): Container! X.Xs ERROR
-  │ ! exit code: 1
+  │ ┇ .dockerBuildFail › .dockerBuild ›
+  │ ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+  │ ┃ im failing
   ╰╴✘ .withDefaultArgs(args: ["sh"]): Container! X.Xs ERROR
-    ! exit code: 1
+    ┇ .dockerBuildFail › .dockerBuild ›
+    ✘ withExec /bin/sh -c 'echo im failing && false' X.Xs ERROR
+    ┃ im failing
 
 Setup tracing at https://dagger.cloud/traces/setup. To hide set DAGGER_NO_NAG=1

(this could indicate a O(n^2) situation - iterating...)

elder crystal
#

that's better! "pending" states are back on the menu

elder crystal
#

OK, so: after a bit of iteration, the results are great - all tests are passing, and the TestTelemetry golden file updates are solid, but I'm cautious about introducing complexity to Theseus right after it lands. Hopefully I'm not making a huge mess of things. Even added a new test showing that when a module returns a module object that embeds a newly constructed Container, and that Container fails, the API call returning the module object fails (similar to the PythonSdk.codegen case - wanted to make sure we weren't special-casing core types like GeneratedCode).

I had GPT 5.5 draft a design doc for how the effect tracking system works, and then had Opus 4.6 1M review the doc against the code to check for accuracy, and then clearly mark which parts are **NEW** for clarity:

Design doc: https://gist.github.com/vito/e9018ec9445a0b0831a34f64e28fa149
Diff: https://github.com/dagger/dagger/pull/13073/changes

elder crystal
#

Trying out a simpler version acquired by trying make Opus 4.7 jealous of its competitor's solution and find a better one

elder crystal
elder crystal
elder crystal
#

@signal hazel Do you recall what UIResumeOutputAttr and LogTargetCallDigest are about? doesn't seem to be wired up into anything in the UI, it's only ever assigned

I noticed service spans changed from attributing logs to .asService to having an exec ... span beneath asService that has the logs, wondering if that was intentional or something that just came along for the ride

(Context: while I'm in the area, I'm tidying up service related telemetry, related to the other PR for fail-fast services; the goal is that the asService span becomes the 'error origin' so you can easily jump to it from wherever the exit error bubbled up)

signal hazel
# elder crystal <@949034677610643507> Do you recall what `UIResumeOutputAttr` and `LogTargetCall...

Yes, somewhat... The egraph branch on sipsma/dagger still has the unsquashed history (well, less squashed anyways), so I bet there's more definitive records in commit messages somewhere in there, which an agent would probably delight in finding.

For LogTargetCallDigest I remember that the problem was Service storing the otel span context on the object itself. That does not play nicely if a Service ends up in the DAG underneath something getting persisted in cache (either memory or disk). It's bad to load up a cached Service and then just use some old span context that has nothing to do with the current client. This was actually causing problems.

I recall that LogTargetCallDigest addressed that in some way, though I'd be lying if I said I immediately remembered exactly how/why. I think it was just a client-independent way of pointing to where logs for the service should end up?

#

Lemme know if that's useless, I can dig up more too

#

Oh wait, sorry I misread your message, are you saying they aren't even hooked up at all? If so, then what I was describing was true at some point, but may have shifted

elder crystal
#

Ah ok, sounds plausible and very closely related to the error attribution refactor above which sort of generalizes that pattern ("what span created this object")

elder crystal