#@vito random services question 🧵
1 messages · Page 1 of 1 (latest)
If it became possible for this Creator span we store on Service to refer to a span from a client that no longer exists, would that break anything? And if so what?
Asking in the context of the global cache work https://github.com/dagger/dagger/pull/9682. So far services have actually been almost entirely resilient to various new cases introduced by a cross-session cache, which is great, but that field feels like it might be a bit frought just because it's the only statefulness that's tied to a specific client/session in such a way that it won't be "recalculated" based on who is calling it
Ah fun. It would affect telemetry - you wouldn't see service run durations accounted for in the UI after the first one that ever used the service (between engine restarts I suppose, until it persists). The spans would arrive for the service start but their linked span wouldn't.
But... hmm, it might end up working for other reasons, by falling back to the cause/effect attributes (at least for withServiceBinding call sites, which currently somewhat confusingly also show the service run duration, I think because of the cause/effect fallback)
Okay cool, yeah I'm mainly trying to figure out what to even test for in terms of breakage, so that's super useful, thank you! Basically just check to see if service run durations are still coherent when there's cross-session cache hits, IIUC.
If it does end breaking something, I'll see if I can do something like dynamically find it from the current client rather than store it on the cached Service struct (or something like that). That approach is why the rest of services have been super resilient so far, which has been a pleasant surprise (especially relative to secrets...)
sgtm! and nice. the telemetry tests could almost confirm it too since they do the warmup phase, which would make sure the second run hits the issue. too bad durations are sanitized.
tempting to just bite the bullet and add a slow one, but... man there'd probably be a ton of variance under load
So I started testing this and realized telemetry was more fundamentally broken than just services when there were cross-session cache hits. I was getting missing IDs even in simple cases.
I found a nice fix of basically moving telemetry outside of cache.GetOrInitialize (here) so that it's always called, even when there's a dagql cache hit. And then also adding support for checking whether the result was from a cache hit, setting the CACHED attr if so. So basically getting cache hits from dagql on the same feature-level as buildkit cache hits in terms of telemetry.
That feels like the the right way forward generally but now I'm getting some weird output in cloud+local progress, e.g. here if you expand echoSvc you can see repeat rows for the calls made by that function where the latter ones are Cached.
- Eventually I tracked down the duplicate calls to be from this place where we load function call results, so it makes sense that they are cached, but still bad UX for them to be shown to users (I think)
I'm also mildly concerned that this will explode the amount of telemetry we're sending to cli/cloud since we previously just never sent anything when there was a dagql cache hit. But that may be FUD, idk yet.
I've been trying some stuff to get telemetry looking the same as before but so far it all feels kind of hacky and probably doesn't work in every case.
My most recent thought was to setup some plumbing that has the end-effect of sending telemetry for the 1st cache hit on a given call in a given session, but don't send telemetry for subsequent cache hits on that object in that session. Hopefully the tracking isn't too harsh in terms of memory usage. Does that sound reasonable to you @proper river? Or any other ideas?
(Also, at least so far Service span times are working without any changes, so seems like the fallback cause/effect stuff may be saving it 🎉 )
Makes sense, yeah it was a conscious choice before to avoid emitting a span on cache hits, since .load etc will lead to a ton of dupes, which felt a bit too overwhelming. And confusing, in the case of parallel calls being deduped - you'd see a the same call repeated 2-4-8x in even somewhat simple cases.
So that idea sounds reasonable to me 👍
And we also wouldn't emit a cache hit for something that was run by the same session, right? Or 1 "real" call and 1 cache hit?