verbose tui trace confusion | Dagger | Page 1

snow lake Jan 28, 2025, 6:28 PM

#

@tardy ridge am i just confused here? i feel like i'm just confused, but regardless of this is expected or unexpected tui behavior i'd like to understand what's going on

tardy ridge Jan 28, 2025, 6:30 PM

#

do you have a cloud link? won't have metrics but some parts will be easier to grok than screenshots

snow lake Jan 28, 2025, 6:31 PM

#

lemme repro again

#

here's one of the "middle" spans that acts cached (0 duration) but doesn't show as cached https://v3.dagger.cloud/dagger/traces/732f12ddfc38696018dfba3a2237ab88?span=21a5a703dabb0981, and if you pop back up to https://v3.dagger.cloud/dagger/traces/732f12ddfc38696018dfba3a2237ab88?listen=37f20fae79f4f797&listen=817f266a5373f954&listen=235c8597dd57d773&listen=72d47fb511bf4a3e&listen=9b78cf33f126c038&listen=7123dbc8f65819d1&listen=2431fbd33fe04524&listen=5acaf01f4520f47d&listen=1cf2eb4d9f73dda1&listen=f8220805feca2e8f&listen=bcc75b6e58954620&listen=aab8e1a71b9e06aa you can see the context with the cache miss followed by the 0 duration call

Dagger Cloud

Browse and visualize Dagger traces.

Dagger Cloud

Browse and visualize Dagger traces.

#

(focusing on the corepack use call here, but the same thing happens in other contexts)

#

and seems kinda unpredictable... like it oscillates between showing CACHED and not, like these same calls start showing as CACHED under the generatedCode span and then go back to 0-duration under TypescriptSdk.moduleRuntime

Dagger Cloud

Browse and visualize Dagger traces.

Dagger Cloud

Browse and visualize Dagger traces.

#

(some part of me is hoping that i'm not just confused here and there's some un-instrumented timecost to the 0-duration spans, like maybe copies or something)

tardy ridge Jan 28, 2025, 6:55 PM

#

so, one thing that happens is that in a call like this, the File argument is passed by ID, and that ID gets loaded by just re-evaluating it and taking the resulting object - so that's why you see its calls repeated beneath the withFile call. those should be dagql-level in-memory cache hits

#

i think at one point I wanted to just not show spans at all for dagql cache hits, but that led to weirdness with parallelism (like call chains being split arbitrarily)

snow lake Jan 28, 2025, 7:08 PM

#

so 0-duration spans are re-eval-dagql-cache-hits, but CACHED spans are exclusively bk vertex cache hits?

i also found this comment which implies that we'd like the re-eval-dagql-cache-hits to read as CACHED but they don't for some reason? is the comment referring to these 0-duration cache hits or different ones?

GitHub

dagger/engine/buildkit/telemetry.go at 4da07e4712de142a5da7592a4f51...

An engine to run your pipelines in containers. Contribute to cwlbraa/dagger development by creating an account on GitHub.

GitHub

dagger/core/telemetry.go at 4da07e4712de142a5da7592a4f519195808dc76...

An engine to run your pipelines in containers. Contribute to cwlbraa/dagger development by creating an account on GitHub.

tardy ridge Jan 28, 2025, 7:13 PM

#

CACHED spans are exclusively bk vertex cache hits?
correct

#

but, now that i look at the code, it seems like we actually shouldn't see spans at all for dagql cache hits

#

at the very least, cached is always false for the callback

#

oh my local repo checkout is out of date

#

one sec, rebasing

#

rebased, and it seems to have the same intention (telemetry is only emitted if we're NOT a dagql cache hit) - so I'm not sure why those spans are showing up multiple times

#

unless for some reason it's tainted

#

hmm would be cool if the UI highlighted tainted calls

snow lake Jan 28, 2025, 8:07 PM

#

wouldn't there be logs and durations if they were tainted?

tardy ridge Jan 28, 2025, 8:13 PM

#

not necessarily, the API calls themselves don't really do anything that takes time and don't print logs on their own, that would only happen if the LLB ops themselves re-ran

snow lake Jan 28, 2025, 8:13 PM

#

ah, false assumption on my part that tainted implies re-run, then?

tardy ridge Jan 28, 2025, 8:13 PM

#

tainted just means it won't be cached in memory by dagql, so the resolver will always run

#

which would explain why we see the spans again

snow lake Jan 28, 2025, 8:14 PM

#

gotcha

tardy ridge Jan 28, 2025, 8:14 PM

#

but, i'm not sure if that's what's happening

snow lake Jan 28, 2025, 8:20 PM

#

thought experiment for myself, if the resolver wasn't re-running, what would we want the traces to look like? like in this screenshot, we corepack use for real right before yarn install, then save off the lockfile. the call right after does .withFile(lockfile) and continues onwards... if the resolver wasn't re-running up to corepack use, what container-producing span would that withFile parent to?

Screenshot_2025-01-28_at_12.17.57_PM.png

#

it kinda feels like the only way we have to visualize these dagql "continuances" is to either redraw the parent or do some weird nesting, although nesting feels like it'd read weird because it's not like the next-step-in-chain forked shit is down-stack from its parent

snow lake Jan 29, 2025, 11:56 PM

#

@dim panther kinda curious your thoughts on this as well, especially coming off of work moving more things to dagql cache... with the -vvv and the entailed exec metrics printing, it's kinda weird how it'll report the utilization multiple times (subsequent ones having 0-duration) ... im kinda inclined to look for a way to either:
a. mark these as CACHED even though they're not bk-cached but dagql cached
b. "hide" these calls by having their down-chain referencers point back at the previous call (no idea how to visualize this though... maybe in the webgui we use those little "chips" like we use for directory and file params, but for the method-root object, but in the tui it'd have to be by ID? so like Container(CACHED: $IDXYZ).withExec(...) instead of reprinting the whole chain of already-exec'd calls that led to that point? am i missing some detail of purity re-execs that makes this weird?)

tardy ridge Jan 30, 2025, 1:38 AM

#

i feel like i'm seeing more duplicate call spans lately notsureif - e.g. dagger -m github.com/vito/bass call integration --runtime Buildkit

tardy ridge Jan 30, 2025, 2:17 AM

#

note the repeated apko.wolfi calls

#

seems to happen repeatedly too, maybeinterestingly this one is only twice

tardy ridge Jan 30, 2025, 2:57 AM

#

here it is in Cloud: https://v3.dagger.cloud/dagger/traces/34c5f2524bfed4e711b7f45b36b41659?span=ca8432b938e90468

Dagger Cloud

Browse and visualize Dagger traces.

dim panther Jan 30, 2025, 7:16 AM

#

snow lake <@949034677610643507> kinda curious your thoughts on this as well, especially co...

Not immediately sure, feels like the dupes probably shouldn't be there in the first place, but not sure what's going wrong yet

dim panther Jan 30, 2025, 7:20 AM

#

tardy ridge i feel like i'm seeing more duplicate call spans lately <:notsureif:102505866723...

I get an error loading the module when running that (go: conflicting replacements for go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc), so can't reach the point where you were seeing dupes.
I was just gonna look at it w/ -vvvvvvvvvv so it showed the digests of the calls. I'm presuming that if we are seeing duplicates it could plausibly be that we're sending the same telemetry but w/ different call digests?

tardy ridge Jan 30, 2025, 2:29 PM

#

dim panther I get an error loading the module when running that (`go: conflicting replacemen...

hmm make sure you're on v0.15.3 (latest) - it should work, my bass stuff was affected by the OTel bump but should be good now

tardy ridge Jan 30, 2025, 6:25 PM

#

oh... module functions aren't pure/cached within a session anymore. but it's been this way for four months apparently - not sure why we'd only be seeing dupes now:
https://github.com/dagger/dagger/blame/df40da5a61c787ef321f503287dcde92222279d8/core/typedef.go#L81-L88
cc @ancient heath (this came up in your demo)

snow lake Jan 30, 2025, 6:32 PM

#

squint could that have perf implications?

#

4 months feels like it potentially lines up with some CI slowdown

dim panther Jan 30, 2025, 9:18 PM

#

tardy ridge hmm make sure you're on v0.15.3 (latest) - it should work, my bass stuff was aff...

Ah thanks, I forgot to update the CLI. It looks like the call digests of duped spans are the same? I am not familiar with the tui frontend code enough to know what that would imply, there must be something different in the telemetry metadata that causes the tui to think there's two spans running when only one is, but if it's not the call digest then what would it be? Span id?

tardy ridge Jan 30, 2025, 9:18 PM

#

yeah, span ID; I think that'll show up on max verbosity

#

considering those functions are marked impure now, it makes sense to me that you would see two spans if there are two things trying to evaluate it, since they won't be deduped by the caching logic

dim panther Jan 30, 2025, 9:20 PM

#

tardy ridge considering those functions are marked impure now, it makes sense to me that you...

Ah okay, and I'm guessing in your module code you actually do end up calling that twice in parallel or something

tardy ridge Jan 30, 2025, 9:21 PM

#

not really hyperthinkspin i actually don't know why it shows up so many times (4x by the looks)

#

my guess is they're actually inside multiple IDs being passed in which end up being evaluated in parallel somehow, maybe server side

#

oh, wait - does client-side resolve objects to IDs in parallel? maybe that's it

#

i kind of remember that somewhere in the go sdk

dim panther Jan 30, 2025, 10:02 PM

#

Okay seems like I got a quick fix working locally. Can get rid of the dupes w/ the bass example while also still passing the Secret integ tests that cover the original reason that impurity was added 4 months ago.

I just replaced the Impure: true with a custom cache key func using the stuff in that PR I merged the other day. End result is function calls are cached in dagql per-client, which is granular enough to allow the secret corner cases to still be covered, while not as granular as Impure:true such that they appear multiple times.

Will send out a PR in a min

tardy ridge Jan 30, 2025, 10:25 PM

#

sweet!

dim panther Jan 30, 2025, 10:44 PM

#

https://github.com/dagger/dagger/pull/9483, some subtlety to this of course so hopefully CI catches anything I didn't think of. Also curious to see if there's any discernable performance impact amongst the noise since I agree @snow lake this probably is a performance improvement (just not sure if it's enough of a bottleneck to be noticeable)

snow lake Jan 30, 2025, 11:32 PM

#

snow lake 4 months feels like it potentially lines up with some CI slowdown

was hoping to drop a graph of this but alas https://ui.honeycomb.io/dagger/environments/experimental/board/tXGtVxx7uVP/Is-Dagger-Dagger-production-ready?endTime=0&startTime=-9304219 ... am i correct in guessing we only have 60 day honeycomb retention @silver kraken ? or was this data just added to hc on 12/1?

#verbose tui trace confusion