#verbose tui trace confusion
1 messages · Page 1 of 1 (latest)
@tardy ridge am i just confused here? i feel like i'm just confused, but regardless of this is expected or unexpected tui behavior i'd like to understand what's going on
do you have a cloud link? won't have metrics but some parts will be easier to grok than screenshots
lemme repro again
here's one of the "middle" spans that acts cached (0 duration) but doesn't show as cached https://v3.dagger.cloud/dagger/traces/732f12ddfc38696018dfba3a2237ab88?span=21a5a703dabb0981, and if you pop back up to https://v3.dagger.cloud/dagger/traces/732f12ddfc38696018dfba3a2237ab88?listen=37f20fae79f4f797&listen=817f266a5373f954&listen=235c8597dd57d773&listen=72d47fb511bf4a3e&listen=9b78cf33f126c038&listen=7123dbc8f65819d1&listen=2431fbd33fe04524&listen=5acaf01f4520f47d&listen=1cf2eb4d9f73dda1&listen=f8220805feca2e8f&listen=bcc75b6e58954620&listen=aab8e1a71b9e06aa you can see the context with the cache miss followed by the 0 duration call
(focusing on the corepack use call here, but the same thing happens in other contexts)
and seems kinda unpredictable... like it oscillates between showing CACHED and not, like these same calls start showing as CACHED under the generatedCode span and then go back to 0-duration under TypescriptSdk.moduleRuntime
(some part of me is hoping that i'm not just confused here and there's some un-instrumented timecost to the 0-duration spans, like maybe copies or something)
so, one thing that happens is that in a call like this, the File argument is passed by ID, and that ID gets loaded by just re-evaluating it and taking the resulting object - so that's why you see its calls repeated beneath the withFile call. those should be dagql-level in-memory cache hits
i think at one point I wanted to just not show spans at all for dagql cache hits, but that led to weirdness with parallelism (like call chains being split arbitrarily)
so 0-duration spans are re-eval-dagql-cache-hits, but CACHED spans are exclusively bk vertex cache hits?
i also found this comment which implies that we'd like the re-eval-dagql-cache-hits to read as CACHED but they don't for some reason? is the comment referring to these 0-duration cache hits or different ones?
An engine to run your pipelines in containers. Contribute to cwlbraa/dagger development by creating an account on GitHub.
An engine to run your pipelines in containers. Contribute to cwlbraa/dagger development by creating an account on GitHub.
CACHED spans are exclusively bk vertex cache hits?
correct
but, now that i look at the code, it seems like we actually shouldn't see spans at all for dagql cache hits
at the very least, cached is always false for the callback
oh my local repo checkout is out of date
one sec, rebasing
rebased, and it seems to have the same intention (telemetry is only emitted if we're NOT a dagql cache hit) - so I'm not sure why those spans are showing up multiple times
unless for some reason it's tainted
hmm would be cool if the UI highlighted tainted calls
wouldn't there be logs and durations if they were tainted?
not necessarily, the API calls themselves don't really do anything that takes time and don't print logs on their own, that would only happen if the LLB ops themselves re-ran
ah, false assumption on my part that tainted implies re-run, then?
tainted just means it won't be cached in memory by dagql, so the resolver will always run
which would explain why we see the spans again
gotcha
but, i'm not sure if that's what's happening
thought experiment for myself, if the resolver wasn't re-running, what would we want the traces to look like? like in this screenshot, we corepack use for real right before yarn install, then save off the lockfile. the call right after does .withFile(lockfile) and continues onwards... if the resolver wasn't re-running up to corepack use, what container-producing span would that withFile parent to?
it kinda feels like the only way we have to visualize these dagql "continuances" is to either redraw the parent or do some weird nesting, although nesting feels like it'd read weird because it's not like the next-step-in-chain forked shit is down-stack from its parent
@dim panther kinda curious your thoughts on this as well, especially coming off of work moving more things to dagql cache... with the -vvv and the entailed exec metrics printing, it's kinda weird how it'll report the utilization multiple times (subsequent ones having 0-duration) ... im kinda inclined to look for a way to either:
a. mark these as CACHED even though they're not bk-cached but dagql cached
b. "hide" these calls by having their down-chain referencers point back at the previous call (no idea how to visualize this though... maybe in the webgui we use those little "chips" like we use for directory and file params, but for the method-root object, but in the tui it'd have to be by ID? so like Container(CACHED: $IDXYZ).withExec(...) instead of reprinting the whole chain of already-exec'd calls that led to that point? am i missing some detail of purity re-execs that makes this weird?)
i feel like i'm seeing more duplicate call spans lately
- e.g. dagger -m github.com/vito/bass call integration --runtime Buildkit
note the repeated apko.wolfi calls
seems to happen repeatedly too, maybeinterestingly this one is only twice
here it is in Cloud: https://v3.dagger.cloud/dagger/traces/34c5f2524bfed4e711b7f45b36b41659?span=ca8432b938e90468
Not immediately sure, feels like the dupes probably shouldn't be there in the first place, but not sure what's going wrong yet
I get an error loading the module when running that (go: conflicting replacements for go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc), so can't reach the point where you were seeing dupes.
I was just gonna look at it w/ -vvvvvvvvvv so it showed the digests of the calls. I'm presuming that if we are seeing duplicates it could plausibly be that we're sending the same telemetry but w/ different call digests?
hmm make sure you're on v0.15.3 (latest) - it should work, my bass stuff was affected by the OTel bump but should be good now
oh... module functions aren't pure/cached within a session anymore. but it's been this way for four months apparently - not sure why we'd only be seeing dupes now:
https://github.com/dagger/dagger/blame/df40da5a61c787ef321f503287dcde92222279d8/core/typedef.go#L81-L88
cc @ancient heath (this came up in your demo)
could that have perf implications?
4 months feels like it potentially lines up with some CI slowdown
Ah thanks, I forgot to update the CLI. It looks like the call digests of duped spans are the same? I am not familiar with the tui frontend code enough to know what that would imply, there must be something different in the telemetry metadata that causes the tui to think there's two spans running when only one is, but if it's not the call digest then what would it be? Span id?
yeah, span ID; I think that'll show up on max verbosity
considering those functions are marked impure now, it makes sense to me that you would see two spans if there are two things trying to evaluate it, since they won't be deduped by the caching logic
Ah okay, and I'm guessing in your module code you actually do end up calling that twice in parallel or something
not really
i actually don't know why it shows up so many times (4x by the looks)
my guess is they're actually inside multiple IDs being passed in which end up being evaluated in parallel somehow, maybe server side
oh, wait - does client-side resolve objects to IDs in parallel? maybe that's it
i kind of remember that somewhere in the go sdk
Okay seems like I got a quick fix working locally. Can get rid of the dupes w/ the bass example while also still passing the Secret integ tests that cover the original reason that impurity was added 4 months ago.
I just replaced the Impure: true with a custom cache key func using the stuff in that PR I merged the other day. End result is function calls are cached in dagql per-client, which is granular enough to allow the secret corner cases to still be covered, while not as granular as Impure:true such that they appear multiple times.
Will send out a PR in a min
sweet!
https://github.com/dagger/dagger/pull/9483, some subtlety to this of course so hopefully CI catches anything I didn't think of. Also curious to see if there's any discernable performance impact amongst the noise since I agree @snow lake this probably is a performance improvement (just not sure if it's enough of a bottleneck to be noticeable)
was hoping to drop a graph of this but alas https://ui.honeycomb.io/dagger/environments/experimental/board/tXGtVxx7uVP/Is-Dagger-Dagger-production-ready?endTime=0&startTime=-9304219 ... am i correct in guessing we only have 60 day honeycomb retention @silver kraken ? or was this data just added to hc on 12/1?