caching ttl | Dagger | Page 1

radiant pecan May 10, 2024, 11:08 PM

#

🧵

#

Example of the pattern in the wild: https://github.com/chainloop-dev/chainloop/blob/main/extras/dagger/main.go#L205-L209

GitHub

chainloop/extras/dagger/main.go at main · chainloop-dev/chainloop

Chainloop is an Open Source Metadata Vault for your Software Supply Chain metadata, SBOMs, VEX, SARIF files, QA reports, and more. - chainloop-dev/chainloop

#

and the implementation: https://github.com/chainloop-dev/chainloop/blob/main/extras/dagger/main.go#L237-L238

#

disclaimer: I contributed that particular snippet 🙂

hallow shard May 10, 2024, 11:09 PM

#

Also a fan of that pattern 👍 - so this would be for impure functions right? We'd still default to pure?

radiant pecan May 10, 2024, 11:10 PM

#

Yeah just a refinement on what you suggested

#

Also, I'm still obsessed with the concept of "lookup functions", it seems relevant to this, but I don't know exactly how it interacts with this rip-off-the-bandaid plan

hallow shard May 10, 2024, 11:11 PM

#

I think it'd work naturally with self calls

#

You'd have a lookup function with a ttl that delegates to a pure function with the resolved version

#

And that return value would be pure and pinned as a result

radiant pecan May 10, 2024, 11:16 PM

#

A lookup function would be a function marked by the developer as special:

A lookup function is never cached (ttl is always zero, I think?)
It can only take scalar values as arguments, and can only return scalar values
The engine uses a special section of the caller's dagger.json as a special lookup cache.
In addition to reading from dagger.json, the engine can also be configured to call all lookup functions and update the lock file
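A minimal sketch of the lookup-cache idea listed above. `LockFile`, `LookupFunc` and `Resolve` are hypothetical names standing in for a section of dagger.json and the engine behavior; none of them exist in Dagger.

```go
package main

import (
	"fmt"
	"strings"
)

// LockFile is a hypothetical stand-in for a lookup section of dagger.json.
// Keys identify a lookup call (function name plus scalar arguments),
// values are the pinned scalar results.
type LockFile map[string]string

// LookupFunc is a function marked as a lookup: scalars in, scalar out.
type LookupFunc func(args ...string) (string, error)

// Resolve returns the pinned value when one exists. When update is true,
// or when nothing is pinned yet, it calls fn and records the result.
func (l LockFile) Resolve(name string, fn LookupFunc, update bool, args ...string) (string, error) {
	key := name + "(" + strings.Join(args, ",") + ")"
	if v, ok := l[key]; ok && !update {
		return v, nil
	}
	v, err := fn(args...)
	if err != nil {
		return "", err
	}
	l[key] = v
	return v, nil
}

func main() {
	calls := 0
	// latestTag pretends to query a registry; its answer changes on every call.
	latestTag := func(args ...string) (string, error) {
		calls++
		return fmt.Sprintf("%s@v1.0.%d", args[0], calls), nil
	}

	lock := LockFile{}
	pinned, _ := lock.Resolve("latestTag", latestTag, false, "example.com/mod")
	again, _ := lock.Resolve("latestTag", latestTag, false, "example.com/mod")
	bumped, _ := lock.Resolve("latestTag", latestTag, true, "example.com/mod")
	fmt.Println(pinned, again, bumped)
}
```

The second call never reaches the lookup function because the value is already pinned; only the call with `update` set re-runs it and overwrites the entry.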

#

(sorry I was a bit slower than you there. just wanted to dump my thoughts)

hallow shard May 10, 2024, 11:16 PM

#

Oh I see, yeah you're thinking of it as also encompassing pinning and such?

radiant pecan May 10, 2024, 11:16 PM

#

right

hallow shard May 10, 2024, 11:17 PM

#

Would be a great time to figure that out

radiant pecan May 10, 2024, 11:17 PM

#

feels like maybe an intertwined design

#

but haven't fully figured it out yet

#

what if lookup functions only returned scalars - so you have to do that work once, eg. for Go, you'd need to write the function that takes a go module name and/or version, and returns a digest

#

but once that's done - possibly everything else can be pure

#

I'm wondering if anything we are planning now might make this harder to ship later?

hallow shard May 10, 2024, 11:19 PM

#

I went down that rabbit hole in one of the early CLI design PRs (hell of a tangent in hindsight), there may be parts worth recovering, lots of potential prior art with Bass's memoization system too

radiant pecan May 10, 2024, 11:19 PM

#

I think if we could somehow make dagger.json a universal lock file, that could be insanely powerful

#

of course if you call core git(), container().from() or http(), that would work out of the box (under the hood there would be builtin lookup functions - possibly wrapping the buildkit calls, so buildkit only ever gets pinned calls)

#

I was worried about this being harder to ship later, if we ship a stopgap // +cached pragma

#

But actually, if we ship a // +ttl=<n> pragma instead, it gets easier

#

2 possible scenarios of future interference:

Your function has a // +ttl, and now it's a lookup function. Lookup functions always have ttl=0 (no caching) so your ttl is ignored or you get an error. Unlikely scenario anyway.

hallow shard May 10, 2024, 11:23 PM

#

The magic trick I think would be if we can have a context-free way of bumping dependencies by just re-evaluating functions described by the dagger.json

#

But the hard part I found there is knowing what can be pruned from dagger.json

#

Otherwise it just keeps growing

radiant pecan May 10, 2024, 11:24 PM

#

Your function has a // +ttl, and now it doesn't need a ttl at all, because it's pure. So you'll get unnecessary cache misses. Worst case: you forget to remove the ttl, and have a non-perfectly-optimized function. Solution: remove the ttl. Seems fine.

radiant pecan May 10, 2024, 11:25 PM

#

hallow shard But the hard part I found there is knowing what can be pruned from dagger.json

Even though that problem is real, I think we can safely ignore it for now, because it's possible that in practice the file never grows large enough for this to really matter.

#

IMO bumping dependencies in the file would be done with a special flag to dagger call

trim marsh May 10, 2024, 11:31 PM

#

Couple of things:

There isn't really a need to limit to scalars for lookup functions. Inputs+outputs of functions are required to be json serializable, so you can just store any of the inputs/outputs in dagger.json. There are practical exceptions, which leads to the next point
Any function that does side-effectful API calls is gonna have to be marked as uncached by the user. SetSecret comes to mind. I don't think it's practical for us currently to introspect the code to find those (would be another huge burden for SDKs to support).
- The other possibility would be to identify that a function made those calls after it runs and then prune buildkit's cache of it, but I don't know of a great way to do that out-of-the-box today. Certainly has to be possible, but may or may not require upstream changes.
- This isn't a blocker, but is probably a big gotcha for users after ripping the bandaid off

hallow shard May 10, 2024, 11:32 PM

#

(fam dinner time, bbl [...Drizzy])

trim marsh May 10, 2024, 11:41 PM

#

I guess if we 100% detach from buildkit's cache entirely for all this then we don't need to even think about pruning. So we could identify that SetSecret calls were made and then after the fact say "don't store that in dagger.json". But that would encompass not just lookup functions, but all functions. In which case dagger.json is gonna get real big real fast 😄 And also this wouldn't work for users calling external modules, in which case there's no dagger.json to write to, etc.

radiant pecan May 10, 2024, 11:42 PM

#

I guess I was only considering impure functions that have an input not visible in the DAG (eg. downloading go packages). But I was ignoring impure functions that have outputs not visible in the DAG (eg. uploading an artifact to a registry, or deploying, etc)

#

I'm not sure if your SetSecret example fits in the second category, or is in a third category I also hadn't thought of

trim marsh May 10, 2024, 11:45 PM

#

radiant pecan I'm not sure if your `SetSecret` example fits in the second category, or is in a...

In my mind it's an example of a side-effectful function that should never be cached, so it's the same as users running side-effectful functions that don't even hit our API (i.e. deploying to AWS using an aws sdk directly, etc.).

The only reason I brought it up in particular is that I know there are multiple users calling SetSecret from their function, so we need to be very loud about those functions needing to be marked as uncached. It's also slightly different in that there are plausible routes we could automatically do it for them, as opposed to things like the "deploy to AWS" example where we could never figure it out on their behalf

radiant pecan May 10, 2024, 11:48 PM

#

What I think I've learned so far:

The concept of a lookup function is potentially interesting 🙂
We can ship default-cached functions + optional +ttl pragma now; and later we can ship lookup functions; they will not conflict.
Even after lookup functions, there will still be the need for manual control of function caching behavior, in other words: we'll still need +ttl. We can't just say "all functions that aren't lookup functions are pure and always cached".
Any function could be designated as a lookup function, which is cool (but what if my function returns a directory, will it cache the ID?)

radiant pecan May 10, 2024, 11:48 PM

#

trim marsh In my mind it's an example of a side-effectful function that should never be cac...

What does SetSecret do again? It's for dynamically creating secrets basically

trim marsh May 10, 2024, 11:49 PM

#

but what if my function returns a directory, will it cache the ID?
Yes, and we can also identify whether it's reproducible or not (i.e. not from a local dir) and choose whether to store in dagger.json based on that. Stuff like that

radiant pecan May 10, 2024, 11:49 PM

#

I guess the problem is not calling SetSecret itself, it's the external data source that presumably is used as cleartext argument to it

#

So then it goes in bucket 1: it has a hidden input (eg. it connects to hashicorp vault to get the latest ephemeral token, then shoves that in a new dagger secret)

trim marsh May 10, 2024, 11:50 PM

#

radiant pecan What does `SetSecret` do again? It's for dynamically creating secrets basically

It registers a Secret in memory, which is ephemeral to the session, so you can't cache it since the memory is blown away next time you run (or never there to begin with if on different machine, etc.)

radiant pecan May 10, 2024, 11:50 PM

#

oh I see

#

not only do you probably not want to cache it (because probably the cleartext is dynamically downloaded from somewhere), but you can't cache it anyway

#

so what happens if I set //+ttl=1000 to such a function ?

trim marsh May 10, 2024, 11:51 PM

#

radiant pecan I guess the problem is not calling `SetSecret` itself, it's the external data so...

Yeah precisely, secret services solve this, but SetSecret still exists right now so we just need to be very clear about calling it in the "WARNING BREAKING CHANGE" release notes that would come with this

radiant pecan May 10, 2024, 11:51 PM

#

https://tenor.com/view/mbappe-convoy-running-after-guards-security-gif-12170844

trim marsh May 10, 2024, 11:52 PM

#

radiant pecan so what happens if I set `//+ttl=1000` to such a function ?

It would still break if you re-run the function calls that rely on it within the next 1000s. Basically, any function that makes that call has to be marked as not cached at all, ttl=0

radiant pecan May 10, 2024, 11:53 PM

#


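// +ttl=1000
func getToken(ctx context.Context, vaultAddress string, key string, vaultAuth *Secret) *Secret {
  // HashiCorp Vault logic (doVaultStuff is a placeholder for the real client calls)
  cleartext, _ := doVaultStuff(ctx, vaultAddress, key, vaultAuth)
  // SetSecret registers the cleartext under a name in the session's in-memory store
  return dag.SetSecret(key, cleartext)
}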

#

Is this 👆 a realistic example?

#

So this function as I wrote it, would have a bug

trim marsh May 10, 2024, 11:55 PM

#

Yeah that would break things the way SetSecret is implemented today. Say another function calls that one and then sets the returned secret as a secret env var in a container. If getToken was cached so it never ran, the SetSecret call will never run and the secret will not be present in the in-memory store and the container won't be able to use it to set the value of the env var.
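A minimal simulation of the failure described above. `Session`, `resultCache` and this `getToken` are simplified, hypothetical stand-ins, not the Dagger API: the function result cache outlives a session, while the secret store does not.

```go
package main

import "fmt"

// Session models one engine session with an in-memory secret store.
type Session struct {
	secrets map[string]string
}

func NewSession() *Session { return &Session{secrets: map[string]string{}} }

// SetSecret registers a plaintext in this session only and returns its name as a handle.
func (s *Session) SetSecret(name, plaintext string) string {
	s.secrets[name] = plaintext
	return name
}

// UseSecret is what a container would do when it needs the secret's value.
func (s *Session) UseSecret(handle string) (string, error) {
	v, ok := s.secrets[handle]
	if !ok {
		return "", fmt.Errorf("secret %q not found in this session", handle)
	}
	return v, nil
}

// resultCache persists across sessions, like a function-call cache would.
type resultCache map[string]string

// getToken returns the cached handle if present; otherwise it runs the body,
// which is the only place SetSecret gets called.
func getToken(s *Session, cache resultCache) string {
	if h, ok := cache["getToken"]; ok {
		return h // body skipped: SetSecret never runs in this session
	}
	h := s.SetSecret("vault-token", "s3cr3t")
	cache["getToken"] = h
	return h
}

// demo runs getToken in two sessions that share one result cache.
func demo() (firstErr, secondErr error) {
	cache := resultCache{}
	first := NewSession()
	_, firstErr = first.UseSecret(getToken(first, cache))
	second := NewSession() // fresh secret store, warm function cache
	_, secondErr = second.UseSecret(getToken(second, cache))
	return firstErr, secondErr
}

func main() {
	firstErr, secondErr := demo()
	fmt.Println("first session:", firstErr)
	fmt.Println("second session:", secondErr)
}
```

The first session succeeds because the function body ran; the second gets a handle from the cache that points at nothing in its own store.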

#

If however secrets were looked up from services, then when the secret is needed it will always be obtained from the service (services are already never cached in execution, so it works out perfectly)

radiant pecan May 11, 2024, 12:02 AM

#

I guess there is a way to find these functions statically though: they return a secret

#

so we could disable caching of functions that return a secret (and return an error if you try to set a ttl)

trim marsh May 11, 2024, 12:07 AM

#

radiant pecan I guess there is a way to find these functions statically though: they return a ...

That's true, also need to handle functions that return something that has a Secret somewhere inside of it (i.e. a container with a secret set, basically need to crawl the whole DAG) but that's doable (overlaps with all the crap needed for the "pass sockets as CLI args" epic I've been working through).

I guess there are still corner cases since one function can call another one directly (and thus call SetSecret without a Secret being present in its signature).
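A rough sketch of the "crawl the returned value for a Secret" check discussed above. `Node` is a hypothetical, heavily simplified model of the DAG behind a return value, and `checkCaching` is an illustrative name rather than engine code.

```go
package main

import "fmt"

// Node is a simplified node in the DAG behind a returned value.
type Node struct {
	Kind   string  // e.g. "Container", "Directory", "Secret"
	Inputs []*Node // values this node was built from
}

// containsSecret reports whether a Secret is reachable from n.
// The seen set avoids revisiting shared nodes.
func containsSecret(n *Node, seen map[*Node]bool) bool {
	if n == nil || seen[n] {
		return false
	}
	seen[n] = true
	if n.Kind == "Secret" {
		return true
	}
	for _, in := range n.Inputs {
		if containsSecret(in, seen) {
			return true
		}
	}
	return false
}

// checkCaching disables caching for results that carry a Secret and
// rejects an explicit ttl on such a function.
func checkCaching(ret *Node, ttlSet bool) (cacheable bool, err error) {
	if containsSecret(ret, map[*Node]bool{}) {
		if ttlSet {
			return false, fmt.Errorf("result of kind %s carries a Secret: ttl is not allowed", ret.Kind)
		}
		return false, nil
	}
	return true, nil
}

func main() {
	secret := &Node{Kind: "Secret"}
	withSecret := &Node{Kind: "Container", Inputs: []*Node{{Kind: "Directory"}, secret}}
	plain := &Node{Kind: "Container", Inputs: []*Node{{Kind: "Directory"}}}

	fmt.Println(checkCaching(plain, true))
	fmt.Println(checkCaching(withSecret, false))
	fmt.Println(checkCaching(withSecret, true))
}
```

This only inspects what a function returns, so it still misses the corner case mentioned above: a function that calls SetSecret internally and returns something unrelated.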

#

Again though, this is just something to make users aware of, I don't think it needs to be 100% solved necessarily before we can enable caching by default

radiant pecan May 11, 2024, 12:13 AM

#

Yeah we can probably cut that corner. When people hit it though, it will be painful, because the error will probably be opaque

#

We should still do it though, small enough number of people affected that we can give them VIP support on discord

trim marsh May 11, 2024, 12:27 AM

#

radiant pecan Yeah we can probably cut that corner. When people hit it though, it will be pain...

It's probably trivial to make the error message less opaque too, like "Secret %s not found, did you cache a Function that calls SetSecret? Don't do that, silly!"

#

Though if you are calling a function transitively that's extra painful since it's not your fault and you need to wait for dependencies to update... We can do the "break glass scenario" and use the engine version tracking that was added to dagger.json to retain old behavior on modules that weren't updated yet

hallow shard May 11, 2024, 1:32 AM

#

radiant pecan IMO bumping dependencies in the file would be done with a special flag to `dagge...

[circling back quickly] the downside here is if you have dependencies that are only needed for a side-effectful thing like releasing/publishing and you don't actually want to publish, you just want to bump the dependency. assuming you mean "you run the thing you normally would with a flag to bump dependencies"

radiant pecan May 11, 2024, 2:56 AM

#

hallow shard [circling back quickly] the downside here is if you have dependencies that are o...

I see what you mean. I guess there could be a dagger command (dagger update ?) that only calls the lookup functions themselves, without calling anything else. This would be possible if all required information is included in the lockfile: function path, arguments. Then it's up to you to make sure your lookup functions don't have side effects. But that seems fair since they're, you know, for lookup 🙂
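A sketch of what such an update pass could look like, assuming each lockfile entry stores the function path and arguments as described. `LockEntry`, `LookupFunc` and `Update` are hypothetical names; `dagger update` itself is only a suggestion in this thread.

```go
package main

import "fmt"

// LockEntry records everything needed to re-run a lookup with no other context.
type LockEntry struct {
	Function string   // path of the lookup function
	Args     []string // scalar arguments
	Value    string   // pinned result
}

// LookupFunc is a lookup function: scalar arguments in, scalar value out.
type LookupFunc func(args []string) (string, error)

// Update re-evaluates every entry using only the lookup functions in registry
// and returns how many pinned values changed. Nothing else gets called.
func Update(entries []LockEntry, registry map[string]LookupFunc) (int, error) {
	changed := 0
	for i := range entries {
		fn, ok := registry[entries[i].Function]
		if !ok {
			return changed, fmt.Errorf("unknown lookup function %q", entries[i].Function)
		}
		v, err := fn(entries[i].Args)
		if err != nil {
			return changed, err
		}
		if v != entries[i].Value {
			entries[i].Value = v
			changed++
		}
	}
	return changed, nil
}

func main() {
	registry := map[string]LookupFunc{
		// goModuleDigest pretends to resolve a module to its current digest.
		"goModuleDigest": func(args []string) (string, error) {
			return "sha256:new-" + args[0], nil
		},
	}
	entries := []LockEntry{
		{Function: "goModuleDigest", Args: []string{"example.com/mod"}, Value: "sha256:old"},
	}
	changed, err := Update(entries, registry)
	fmt.Println(changed, err, entries[0].Value)
}
```

Because only registered lookup functions run, a publish or deploy function elsewhere in the module is never triggered by a dependency bump.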

hallow shard May 11, 2024, 3:04 AM

#

radiant pecan I see what you mean. I guess there could be a dagger command (`dagger update` ?)...

Yep yep - that's how Bass's lock files work, they embed a module + func + args and just re-evaluate them to yield the new value. I think we can do basically the same thing, but yeah the last part I couldn't quite figure out was how to prevent it from just growing forever, because sometimes you'd have dependencies between dependencies and it couldn't really tell when some of those became unused. Kinda hard to explain from memory, hopefully it's something we won't run into, not sure

#

Could maybe be resolved with some sort of aliasing system

#

Would be great to integrate this directly into things like container.from and git/http too

radiant pecan May 11, 2024, 3:10 AM

#

hallow shard Would be great to integrate this directly into things like `container.from` and ...

yeah tons of instant value if we do that

hallow shard May 11, 2024, 3:11 AM

#

(for posterity: https://github.com/vito/bass/blob/main/bass/bass.lock - it got pretty verbose)

radiant pecan May 11, 2024, 3:13 AM

#

Also I was wondering: how to represent the "path" of the lookup function, ie should the object state be part of the key? or the function calls that produced it?

#

maybe lookup functions are only allowed at top level of the module?

hallow shard May 11, 2024, 3:24 AM

#

One tempting way to look at it is that it's a persistent form of DagQL's query cache, which is keyed by a hash of the ID. Those are also almost easy to evaluate as-is, except we only have loadFooFromID for objects, not scalar types (even though IDs can technically be any type)

hallow shard May 11, 2024, 3:31 AM

#

radiant pecan maybe lookup functions are only allowed at top level of the module?

This would be simpler though and probably avoid the whole inter-dependent-dependencies mess I mentioned

trim marsh May 11, 2024, 4:00 AM

#

I almost feel like besides only allowing top level functions it may make sense to also only allow functions with no args to start. Otherwise if your lookup function accepts e.g. an int are we going to cache every result for every value of the int that it’s ever invoked with?

You could solve that with a whole pruning UX but even then it sounds kind of annoying to manage, besides more upfront work for us. I think a lot of use cases would already be covered even without arg support (you can still use consts and such) and then the pruning logic is just “if the function doesn’t exist anymore, prune it”.

Just as a starting place that sounds more tractable while not ruling out more in the future.
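A small sketch of that pruning rule, assuming lookup functions are top-level and take no arguments, so the function name alone is the lock key. `Prune` and the map shapes are hypothetical.

```go
package main

import "fmt"

// Prune drops pinned values whose lookup function no longer exists in the
// module and returns how many entries were removed.
func Prune(lock map[string]string, existing map[string]bool) int {
	removed := 0
	for name := range lock {
		if !existing[name] {
			delete(lock, name) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	lock := map[string]string{
		"alpineDigest":   "sha256:aaa",
		"oldToolVersion": "1.2.3", // the function behind this entry was deleted
	}
	existing := map[string]bool{"alpineDigest": true}
	fmt.Println(Prune(lock, existing), len(lock))
}
```

With arguments in the key, this rule would no longer be enough: every distinct argument value would leave its own entry behind.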

trim marsh May 11, 2024, 4:37 AM

#

hallow shard One tempting way to look at it is that it's a persistent form of DagQL's query c...

Oh btw my recent forays into the guts of our buildkit interface revealed a very interesting way that we could practically abandon LLB altogether and instead just operate purely on dagql IDs, having buildkit's persistent cache be a map of dagql digests to layer cache.

Most importantly, it’s actually just straight up supported and not abusing any functionality or otherwise hacking something. And it also doesn’t require rewriting tons of buildkit code (some, but not a really significant amount)

Which is interesting because a “cache export” at that point would almost feel like a slight generalization of what we’re talking about here.

Not pursuing at this moment obviously but something to keep in mind, especially since storage drivers are coming soon.

hallow shard May 11, 2024, 1:50 PM

#

trim marsh Oh btw my recent forays into the guts of our buildkit interface revealed a very ...

https://tenor.com/bEirJ.gif

hallow shard May 11, 2024, 2:19 PM

#

trim marsh Oh btw my recent forays into the guts of our buildkit interface revealed a very ...

Super interesting - for function analytics I've had to somewhat tightly couple the relationship between our DAG and LLB (https://github.com/dagger/dagger/pull/7321/files#diff-b93188b9d79aa9868ea185ca205ea31f7619506c4c764b9a5f49182aa9b48f3eR87-R173) and I've wondered how that might be made "whole" in the future. For now I'm calling LLB "effects" at the persistence layer to avoid keeping that terminology coupling forever (as in, Container.withExec returns a Container with a baked-in "future effect" that we'll see run later). Maybe in the future they're one and the same, or maybe we have some other identifier for "effects" 🤷‍♂️ (also open to other terms, but I guess either way we can migrate etc)

radiant pecan May 11, 2024, 7:27 PM

#

hallow shard https://tenor.com/bEirJ.gif

https://tenor.com/view/coming-samwise-lord-of-the-rings-fellow-ship-of-the-ring-coming-with-you-gif-5758122

trim marsh May 13, 2024, 5:25 PM

#

hallow shard Super interesting - for function analytics I've had to somewhat tightly couple t...

Ah interesting, yeah I was thinking through what this would look like in practice and there would absolutely still be a difference between dagql fields that produce a persistent filesystem as a side effect and those that are derived from those fields. E.g. in dag.Container().From(alpine).WithEnvVar(foo, bar).WithExec(blah).Stdout(), From and WithExec create persistent cached filesystems but WithEnvVar and Stdout don't. The difference is that today we internally have to translate all those "persistently cached" fields to LLB, solve that instead and then lose the "dagql-nativeness".

The breadcrumb for my realization around all this is this method: https://github.com/dagger/dagger/blob/2ae9f4d815c1fcf48626ce9437c788d0efb91a79/engine/buildkit/worker.go#L337 which is what gets called by buildkit to determine how to get the cache key and implementation for a given vertex. But that's coming from Buildkit's "generic solver" that is agnostic to LLB, so if you look at the types of the solver.Vertex and solver.Op, they aren't actually LLB specific at all. We could just be passing types from the dagql protobuf package instead.

It remains to be seen if the benefits of this are more than "that's super cool" but I'm gonna pull on the thread as part of storage drivers just enough to see if it would make sense to pursue further.

radiant pecan May 14, 2024, 4:59 PM

#

@dark nymph @hallow shard may I suggest we continue the thread here 🙂

#

@hallow shard so a small new dilemma for me: is it weird that a function dev can specify its own TTL value. What is the correct value? And what if the admin wants to configure their own value in a site-dependent way?

#

But on the other hand - you can implement a crude cache buster in code today. So if we don't let devs do it the clean way, they may just do it the dirty way with old school cache busters

#

so that's my dilemma

hallow shard May 14, 2024, 5:01 PM

#

radiant pecan @hallow shard so a small new dilemma for me: is it weird that a function...

yeah... now i'm thinking it should be more caller-dependent tbh. For example, someone might want to bump certain dependencies every minute, and others every month

#

on the callee side, I can for sure know that something should have cache controls, but determining how long it caches for seems like you'd just be pulling a number out of thin air most of the time

radiant pecan May 14, 2024, 5:09 PM

#

OK so we need:

1. A way for the callee to declare its caching behavior ("I need caching / I need no caching" or perhaps "I have side effects / I depend on an external data source") but not declare a ttl
2. A way for the caller to set a ttl programmatically
3. A way for the user to set the default ttl session-wide?

hallow shard May 14, 2024, 5:12 PM

#

3) is kinda interesting, I could see that being the "I just stepped on a plane and would like to stop bumping everything for a bit" switch

#

only, you would want the transition from ttl=60s to ttl=infinity (or whatever) to not bust the caches on its own 😛

radiant pecan May 14, 2024, 5:14 PM

#

Also you may want to choose between 1) setting the TTL of functions that expect to be cached, and 2) forcing a TTL on all functions, even those that indicated they should not be cached. Do we want both? And if so how do we differentiate?

hallow shard May 14, 2024, 5:15 PM

#

also - I've personally run into cases where being able to literally set the cache key is useful, vs. just TTL-based. For example, if you have a cache that you know only needs to be busted for each commit. In the past this was mostly relevant for cache volumes, where you can put it in the name. Maybe there's a connection there, maybe not

radiant pecan May 14, 2024, 5:15 PM

#

A way for the callee to declare its caching behavior ("I need caching / I need no caching" or perhaps "I have side effects / I depend on an external data source") but not declare a ttl

Follow-up question: is there any chance we may want to differentiate between those two types of "don't cache me" functions?

dark nymph May 14, 2024, 5:16 PM

#

I'd certainly expect as a module author to have some control over what the user can or can't do

radiant pecan May 14, 2024, 5:16 PM

#

I guess "I depend on an external data source" might be the same as "// +lookup"

dark nymph May 14, 2024, 5:17 PM

#

When caching makes absolutely zero sense, I want to be able to disable caching completely.

#

Caching is hard to debug, I'd like to minimize the possibility of user error 🙂

#

I wonder how this became a conversation about "ttl". Isn't the cache content based?

hallow shard May 14, 2024, 5:20 PM

#

radiant pecan > 1. A way for the callee to declare its caching behavior ("I need caching / I n...

there could maybe be value in distinguishing side-effect-ful functions from functions whose return value changes over time, but there might be more value in just saying side-effect-ful functions should try their hardest to be idempotent, i.e. they should be able to safely re-run and converge on the desired state, because we want users to be able to re-run builds that may have partially succeeded, for example one big 'shipit' job that publishes a bunch of stuff

#

(that was Concourse's stance, and why the resource write operation is called put)

radiant pecan May 14, 2024, 5:21 PM

#

OK how about this:

1. // +lookup to mark your function as a lookup function. This means it may read data out-of-band from an external source, and will be cached in a special way (including in the future, a lockfile feature).
2. // +side_effect to mark your function as having side effects. This means it may write data out-of-band to an external source, and will be cached in a special way. (simplest implementation: no caching. but perhaps some optimizations in the future?)
3. Lookup functions cannot have side effects
4. All other functions are cached by default. The callee cannot specify a TTL
5. Caller can specify a TTL programmatically with dag.WithTTL(<milliseconds>)
6. Operator can set a default TTL in the CLI with dagger --ttl <milliseconds>
7. TTL settings are not applied to lookup and side-effect functions
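A sketch of how the proposal above might resolve to a caching policy. Everything here is hypothetical: `Kind`, `Policy` and `Resolve` are illustrative names, a zero duration means "not set", and letting the caller's TTL win over the operator default is an assumption the thread does not settle.

```go
package main

import (
	"fmt"
	"time"
)

// Kind is how the callee marked its function.
type Kind int

const (
	Plain      Kind = iota // no pragma: cached by default
	Lookup                 // "+lookup": reads from an external source
	SideEffect             // "+side_effect": writes to an external source
)

// Policy is the resulting behavior for the TTL-based function cache.
type Policy struct {
	Cached bool
	TTL    time.Duration // zero means "no expiry" when Cached is true
}

// Resolve applies the rules: lookup and side-effect functions ignore TTL
// settings entirely; plain functions are cached, with the caller's TTL
// (if set) taking precedence over the operator's default.
func Resolve(kind Kind, callerTTL, operatorTTL time.Duration) Policy {
	switch kind {
	case Lookup, SideEffect:
		// Side effects: simplest implementation is no caching.
		// Lookups: pinning would be handled by a lockfile, not by this cache.
		return Policy{Cached: false}
	}
	if callerTTL > 0 {
		return Policy{Cached: true, TTL: callerTTL}
	}
	return Policy{Cached: true, TTL: operatorTTL}
}

func main() {
	fmt.Println(Resolve(Plain, 0, 0))
	fmt.Println(Resolve(Plain, 0, 10*time.Minute))
	fmt.Println(Resolve(Plain, time.Minute, 10*time.Minute))
	fmt.Println(Resolve(SideEffect, time.Minute, 10*time.Minute))
}
```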

hallow shard May 14, 2024, 5:23 PM

#

dark nymph I wonder how this became a conversation about "ttl". Isn't the cache content bas...

The "real" cache is, but this is a TTL for things that feed into that content-based key, i.e. knowing when there's a new version of something available. All values downstream of that would be cached independently, with the TTL only influencing how frequently those new values are discovered, completely disconnected from the eventual "real" cache. For example, going from TTL=10m to TTL=1m would cause more lookups, but you'd still get cache hits for all of the "real work" provided it discovers the same version
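A small simulation of that point, with hypothetical names throughout (`Pipeline`, `simulate`) and a fake clock: the TTL gates how often the version lookup runs, while the build is keyed by the version it discovers.

```go
package main

import (
	"fmt"
	"time"
)

// Pipeline counts how often the lookup and the "real work" actually run.
type Pipeline struct {
	ttl        time.Duration
	lastLookup time.Time
	version    string
	built      map[string]string // content-keyed cache: version -> artifact
	Lookups    int
	Builds     int
}

func NewPipeline(ttl time.Duration) *Pipeline {
	return &Pipeline{ttl: ttl, built: map[string]string{}}
}

// Run re-runs the lookup only once the TTL has elapsed, then builds,
// reusing the artifact whenever the discovered version is already known.
func (p *Pipeline) Run(now time.Time, latest func() string) string {
	if p.Lookups == 0 || now.Sub(p.lastLookup) >= p.ttl {
		p.version = latest()
		p.lastLookup = now
		p.Lookups++
	}
	if artifact, ok := p.built[p.version]; ok {
		return artifact
	}
	p.Builds++
	artifact := "artifact-for-" + p.version
	p.built[p.version] = artifact
	return artifact
}

// simulate performs one run per minute against an upstream that never changes.
func simulate(ttlMinutes, runs int) (lookups, builds int) {
	p := NewPipeline(time.Duration(ttlMinutes) * time.Minute)
	start := time.Date(2024, 5, 14, 0, 0, 0, 0, time.UTC)
	latest := func() string { return "v1.2.3" }
	for i := 0; i < runs; i++ {
		p.Run(start.Add(time.Duration(i)*time.Minute), latest)
	}
	return p.Lookups, p.Builds
}

func main() {
	for _, ttl := range []int{10, 1} {
		lookups, builds := simulate(ttl, 30)
		fmt.Printf("ttl=%dm lookups=%d builds=%d\n", ttl, lookups, builds)
	}
}
```

Over 30 one-minute runs, the 10-minute TTL performs 3 lookups and the 1-minute TTL performs 30, yet both build exactly once because the discovered version never changes.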

radiant pecan May 14, 2024, 5:24 PM

#

@dark nymph yeah the TTL is just API cosmetics. The cachebuster pattern is a messier way of setting a TTL
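For reference, a sketch of the cache-buster pattern mentioned here. `bucket` and `cacheBuster` are hypothetical helpers; the idea is to feed a value that only changes once per time window into an otherwise unused argument or environment variable, so the cache key rolls over on its own.

```go
package main

import (
	"fmt"
	"time"
)

// bucket maps a Unix timestamp to a window index. The value only changes
// once every ttlSeconds, so anything keyed on it is reused within a window.
// A non-positive ttl falls back to the raw timestamp (a new key every second).
func bucket(unixSeconds, ttlSeconds int64) int64 {
	if ttlSeconds <= 0 {
		return unixSeconds
	}
	return unixSeconds / ttlSeconds
}

// cacheBuster returns the string a caller would pass along to force a cache
// miss whenever the window rolls over.
func cacheBuster(now time.Time, ttl time.Duration) string {
	return fmt.Sprintf("window-%d", bucket(now.Unix(), int64(ttl/time.Second)))
}

func main() {
	start := time.Date(2024, 5, 14, 17, 0, 0, 0, time.UTC)
	for _, offset := range []time.Duration{0, 30 * time.Second, 90 * time.Second} {
		fmt.Println(offset, cacheBuster(start.Add(offset), time.Minute))
	}
}
```

Windows are aligned to the Unix epoch rather than to the first call, so a cached result lives for at most one TTL and often less, which is part of what makes this cruder than a real TTL.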

dark nymph May 14, 2024, 5:29 PM

#

Makes sense

hallow shard May 14, 2024, 5:30 PM

#

radiant pecan OK how about this: 1. `// +lookup` to mark your function as a lookup function. ...

sounds pretty good, I don't grok 7) though - you'd want it applied to lookup, but not side-effect-ful ones right? Or you mean it's not applied to ones that have a custom WithTTL already set?

#

also, @trim marsh probably has thoughts on // +side_effect, he was talking about modeling it instead as a function that takes the 'current state' (auto-provided/fetched somehow), which might just be a // +lookup pattern

#

though even just having that metadata ("I mutate the outside world!") seems pretty nice

radiant pecan May 14, 2024, 5:48 PM

#

hallow shard sounds pretty good, I don't grok 7) though - you'd want it applied to lookup, bu...

For 7: for lookup functions, I think the caching mechanism transcends TTLs, it is its own thing. The "cache" is the lockfile, so it never expires. If there's a "cache miss", you run the lookup and update the lockfile. If there's a "cache hit", you either use the value (with infinite ttl) or re-run lookup and overwrite, depending on what the user asked for. But TTL is never relevant IMO. (Like there's no TTL in go dependency caching for example).

--> I could be wrong, thinking out loud

radiant pecan May 14, 2024, 5:50 PM

#

hallow shard also, @trim marsh probably has thoughts on `// +side_effect`, he was t...

I guess that could be a separate, more advanced feature. I just need a way for devs to tag their existing functions that happen to have side effects. There are lots of those and they're not going away. Instead of // +nocache I'm looking for something more specific basically.

So we tell developers "don't worry about whether dagger caches your function or not. Just worry about flagging side effects and external lookups, and dagger will do the right thing"

hallow shard May 14, 2024, 5:52 PM

#

radiant pecan For 7: for lookup functions, I think the caching mechanism transcends TTLs, it i...

Oh yeah that makes sense in the context of a lock file. I guess that opens another question around whether/how the locking is opt-in on a call-by-call basis

radiant pecan May 14, 2024, 6:13 PM

#

hallow shard Oh yeah that makes sense in the context of a lock file. I guess that opens anoth...

Since we wanted to ship a stopgap quickly, and not rush a full implementation of lookup functions (so we can sweat the details on lockfile etc), maybe we start by just disabling caching completely on // +lookup?

#

Initially I thought the marker // +lookup would not appear in the stopgap at all; but now thinking that having it from the start, just with a more naive implementation, is even better

hallow shard May 14, 2024, 8:13 PM

#

radiant pecan Since we wanted to ship a stopgap quickly, and not rush a full implementation of...

What do you think of adding // +impure as the primitive "don't cache me" and then adding // +lookup and // +side_effect once we actually have those more specific semantics pinned down? It was originally brought up here: https://linear.app/dagger/issue/DEV-3034/cache-control-for-functions#comment-8eb7205f - and matches DagQL terminology. The lookup and side_effect patterns seem like specializations of impure, so we could just start with that and add more nuance later. Starting with lookup seems like we could have the adverse effect of users marking not-actual-"lookup"-pattern things as lookup just to disable caching, but then getting undesired behavior later once we add meaning to it.

#

(Not that we need to leak DagQL terminology to the user, but it aligns well with functional terms anyway imo)

radiant pecan May 14, 2024, 8:43 PM

#

I think specializing from the start works better, because it will cause less breakage in the future.

Specialize from the start: most devs get it right (if we explain the difference correctly). Let's call it a 20% error rate. Later when the behavior becomes really different (lockfile etc), those 20% break. Solution: rename lookup to side_effect or vice versa.
Specialize later: everyone uses the impure stopgap for now. Later we migrate them to either lookup or side_effect. impure is deprecated. At this point the breakage rate is 100%: everyone is using a deprecated API and must migrate. More painful.

hallow shard May 14, 2024, 8:46 PM

#

I wasn't assuming we'd deprecate impure - unless we're pretty sure lookup and side_effect cover 100% of the cases (maybe true). But sure, point taken about having to change more things to lookup if we think that's the massive majority case, but a lot of it might be non-lookup cases. Hard to say

#

Maybe we could look over the daggerverse - I would have guessed we have a lot of side_effect use cases right now, and not many lookup

#

"deploy to X"

radiant pecan May 14, 2024, 8:50 PM

#

Side effects:

Upload
Publish
Deploy
Send messages

Lookup:

HTTP download
Package registry download
Git download
Docker pull

hallow shard May 14, 2024, 8:51 PM

#

in terms of code sprawl, I'd wager the 'lookup' cases are a few more centralized super common cases (mostly core APIs even), and 'side effects' is a ton of third-party modules

radiant pecan May 14, 2024, 8:52 PM

#

I don't know, pretty much every language-specific build function is a lookup; and linux distro package install also

#

lots of packaging systems out there

#

anything that deals with dependencies at runtime basically

hallow shard May 14, 2024, 8:53 PM

#

yea true. and it could be totally incidental which things have been built so far (fluent-ci has a ton of 'deploy to X' modules)

radiant pecan May 14, 2024, 8:54 PM

#

one concern I have is possible mixing of the two. For example deployment systems sometimes can be queried for the current digest of the deployed content. That digest can then be incorporated into the cache key of the deployment function. In that case, should the deployment function still be marked as +side_effect?

hallow shard May 14, 2024, 8:58 PM

#

I think it's fine for side_effect to mean 'possible side effects ideally with idempotency/intelligent no-ops involved', not 'I will always do something', but it might be worth thinking more about what side_effect could possibly even mean in terms of behavior

#

for example, does it produce a particular type of return value? do we use it to draw a fancy 'output' in a Concourse-style data flow pipeline UI?

radiant pecan May 14, 2024, 9:00 PM

#

I think side_effect would mean:

Never cache
Use as an identifier in Traces (render in a special way, allow searching / filtering)
Possibly instrument DNS to get analytics on outbound connections

hallow shard May 14, 2024, 9:00 PM

#

is it always idempotent? does it tie back in to a symmetrical // +lookup function, much like how Concourse has a single Resource definition that implements check (// +lookup), get (the cached fetch), and put (// +side_effect)?

#

(ideally Resources would just be something you can implement in terms of these more generic building blocks, the parallels just seem useful to analyze)
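A rough sketch of that parallel, assuming a Go interface with one method per Concourse step. The `Resource` interface, the `memResource` stand-in and the annotations named in the comments are all hypothetical; none of this is an existing Dagger API:

```go
package main

import (
	"errors"
	"fmt"
)

// Resource mirrors the Concourse shape discussed above (hypothetical).
type Resource interface {
	// Check would carry "// +lookup": it resolves the current version.
	Check() (string, error)
	// Get is the pure, cacheable fetch: same version in, same bytes out.
	Get(version string) ([]byte, error)
	// Put would carry "// +side_effect": it mutates the outside world.
	Put(data []byte) (string, error)
}

// memResource is an in-memory stand-in for a real external system.
type memResource struct {
	versions []string
	blobs    map[string][]byte
}

func newMemResource() *memResource {
	return &memResource{blobs: map[string][]byte{}}
}

func (m *memResource) Check() (string, error) {
	if len(m.versions) == 0 {
		return "", errors.New("no versions yet")
	}
	return m.versions[len(m.versions)-1], nil
}

func (m *memResource) Get(version string) ([]byte, error) {
	b, ok := m.blobs[version]
	if !ok {
		return nil, fmt.Errorf("unknown version %q", version)
	}
	return b, nil
}

func (m *memResource) Put(data []byte) (string, error) {
	v := fmt.Sprintf("v%d", len(m.versions)+1)
	m.versions = append(m.versions, v)
	m.blobs[v] = data
	return v, nil
}

func main() {
	var r Resource = newMemResource()
	r.Put([]byte("hello"))
	v, _ := r.Check()
	data, _ := r.Get(v)
	fmt.Println(v, string(data)) // v1 hello
}
```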

radiant pecan May 14, 2024, 9:06 PM

#

Yeah, since you guys mentioned all objects are already required to be serializable to JSON, a lookup function could be check+get

#

which would make it easier to fit your existing code into a lookup function

hallow shard May 14, 2024, 9:09 PM

#

there may be another interesting case, where you want // +lookup but you don't want it pinned, because it actually returns all available versions, or maybe even new versions from a cursor, which is how check works in Concourse

#

(substitute 'versions' for 'states' to make it more general)

radiant pecan May 14, 2024, 9:12 PM

#

I see, like querying the github API for a list of PRs for example?

hallow shard May 14, 2024, 9:12 PM

#

yeah exactly

#

in Concourse for example, you could get all PRs => configure a pipeline for each PR

#

that might reduce to a general 'get current state' pattern

#

with a more intense focus on current (compared to lookup which you would happily pin, because that pinned state is always available)

hallow shard May 14, 2024, 9:17 PM

#

radiant pecan one concern I have is possible mixing of the two. For example deployment systems...

feels related to this

radiant pecan May 14, 2024, 9:17 PM

#

yeah that wouldn't qualify as a lookup in the meaning i intended at least. which is your point I guess

#

is impure the opposite of reproducible?

hallow shard May 14, 2024, 9:18 PM

#

yeah, sort of, it's hard to say with the word 'reproducible' being heavily debated, but it at minimum means you can't trust it to do/return the same thing when given the same parameters

#

but, it's still possible for an impure function to return a pure value

#

since it could return a value that has its own ID that has the pinned value from that particular call baked in
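A small sketch of that idea, assuming a tag-to-digest lookup. `PinnedRef`, `resolve` and the registry map are hypothetical stand-ins, and the digests are placeholders:

```go
package main

import "fmt"

// PinnedRef carries the digest resolved by one particular call, so
// anything derived from it no longer depends on the outside world.
type PinnedRef struct {
	Name   string
	Digest string
}

// ID bakes the pinned digest into the value's identity.
func (p PinnedRef) ID() string { return p.Name + "@" + p.Digest }

// resolve is the impure step: the same tag can map to different digests
// over time. The map stands in for a real registry query.
func resolve(registry map[string]string, name, tag string) (PinnedRef, error) {
	d, ok := registry[name+":"+tag]
	if !ok {
		return PinnedRef{}, fmt.Errorf("unknown ref %s:%s", name, tag)
	}
	return PinnedRef{Name: name, Digest: d}, nil
}

func main() {
	registry := map[string]string{"alpine:latest": "sha256:1111"}
	pinned, _ := resolve(registry, "alpine", "latest")

	registry["alpine:latest"] = "sha256:2222" // upstream moves the tag

	fmt.Println(pinned.ID()) // alpine@sha256:1111
}
```

Calling `resolve` twice can give different answers, but each returned value stays stable once it exists.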

radiant pecan May 14, 2024, 9:20 PM

#

ok so: // +pure=false ? and later // +lookup is a different thing?

hallow shard May 14, 2024, 9:21 PM

#

works for me - maybe we could even start with a few nuanced terms, and for now they're just aliases, and later we give them more meaning?

radiant pecan May 14, 2024, 9:24 PM

#

yeah I just worry that I was too optimistic that we can or should eventually distinguish the 2 properties of pure functions

#


1. Always produces the same output given the same input.
2. Has no side effects (does not alter any state or interact with the outside world).

These characteristics make pure functions predictable and easier to test.

hallow shard May 14, 2024, 9:32 PM

#

There might be a whole other avenue to approach that from where we build on the interfaces support and start recognizing particular interfaces instead of trying to cram a bunch of things into function annotations. Sort of like how Go has io.Writer everywhere, we could have canonical Deployment and Resource interfaces that we recognize and visualize in certain ways

#

Just planting a seed, don't have much concrete there yet, the thought came up while I was implementing my Concourse module; we don't actually make much use of interfaces yet, so it might be an untapped mine

radiant pecan May 14, 2024, 10:03 PM

#

I love that and want to play with interfaces. but I think it's orthogonal

#

because purity is tied to the implementation

#

for example if I mock a resource i would expect my mock to be cached

hallow shard May 14, 2024, 10:20 PM

#

yea true, these annotations would still live on the interface's function implementations

#

personally, i can see the value in having these simple aliases even if we're not 100% sure how they'll be used yet. like, even if it's just metadata, there's value in knowing what flavor of non-hermetic/pure a function is

#

even if it's just for browsing the daggerverse - "I want to find an effect-ful function for GitHub"

radiant pecan May 14, 2024, 10:25 PM

#

Maybe it's // +pure=false (never deprecated) + additional information you could optionally fill in?

For example you could specify which external resources you read from, write to, or both

#

(via a url for example)

#

there's also things like randomness and system clock

hallow shard May 14, 2024, 10:35 PM

#

that feels more like data on the object to me (and interfaces/etc) - since in a lot of cases it might be dynamically configured, not something you can put in a static comment. Unless you mean something like an API schema definition URL/identifier

radiant pecan May 14, 2024, 10:43 PM

#

I don't know.. just throwing stuff at the wall

#

there's value in knowing what flavor of non-hermetic/pure a function is

I guess step one for this is: defining a nomenclature that captures 100% of non-pure functions like you said

#

My list so far, of behavior that can make a function impure:

It interacts with an external system without mutating it (external data source)
It interacts with an external system and does mutate it (side effect)
It uses the system clock
It uses the system's random generator
It executes code that has unpredictable behavior (aka a flaky test, which we're discussing right now in #wandb-ai)
other?
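One way to write the list down, as a sketch only: a set of combinable flags, assuming a function can be impure for several reasons at once. The `Impurity` type and its constant names are made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// Impurity is a hypothetical bit set of reasons a function is impure.
type Impurity uint8

const (
	ExternalRead  Impurity = 1 << iota // reads an external system without mutating it
	ExternalWrite                      // mutates an external system (side effect)
	Clock                              // uses the system clock
	Randomness                         // uses the system's random generator
	Flaky                              // runs code with unpredictable behavior
)

var impurityNames = []struct {
	flag Impurity
	name string
}{
	{ExternalRead, "external_read"},
	{ExternalWrite, "external_write"},
	{Clock, "clock"},
	{Randomness, "randomness"},
	{Flaky, "flaky"},
}

// Pure reports whether no impurity reason is set.
func (i Impurity) Pure() bool { return i == 0 }

// String lists the set reasons in declaration order.
func (i Impurity) String() string {
	if i.Pure() {
		return "pure"
	}
	var parts []string
	for _, n := range impurityNames {
		if i&n.flag != 0 {
			parts = append(parts, n.name)
		}
	}
	return strings.Join(parts, "|")
}

func main() {
	fmt.Println(ExternalRead | Clock) // external_read|clock
	fmt.Println(Impurity(0))          // pure
}
```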

hallow shard May 14, 2024, 11:55 PM

#

sounds like fun homework (unironically), will think on it! would be especially interesting if it boils down to just a few things, or some sort of combining scheme like read/write, external/internal

radiant pecan May 15, 2024, 12:03 AM

#

This flows into a list of things the engine does differently between pure and impure functions:

Pure functions are cached
Pure functions get deterministic system clock?
Pure functions get deterministic system randomness?
Pure functions are never retried (built-in retry feature for impure functions?)
Pure functions can, if they are very very pure, be run without internet connectivity because they have no external dependency at all? (But some functions are pure even though they interact with an outside system, is that possible? Depends if "the function failed because docker hub was down" qualifies as "always get the same output for the same inputs")
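A sketch of what the deterministic clock and randomness items could look like, assuming the engine hands each function its ambient inputs. `Env`, `pureEnv`, `impureEnv` and `stamp` are hypothetical names, not engine APIs:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Env is a hypothetical bundle of the ambient inputs a function sees.
type Env struct {
	Now  func() time.Time
	Rand *rand.Rand
}

// pureEnv pins the clock to the Unix epoch and seeds the generator, so
// repeated calls with the same seed observe identical values.
func pureEnv(seed int64) Env {
	fixed := time.Unix(0, 0).UTC()
	return Env{
		Now:  func() time.Time { return fixed },
		Rand: rand.New(rand.NewSource(seed)),
	}
}

// impureEnv exposes the real clock and a time-seeded generator.
func impureEnv() Env {
	return Env{
		Now:  time.Now,
		Rand: rand.New(rand.NewSource(time.Now().UnixNano())),
	}
}

// stamp is a toy function whose output depends only on its Env.
func stamp(e Env) string {
	return fmt.Sprintf("%d-%d", e.Now().Unix(), e.Rand.Intn(1000))
}

func main() {
	fmt.Println(stamp(pureEnv(42)) == stamp(pureEnv(42))) // true
	fmt.Println(stamp(impureEnv()))                       // varies between runs
}
```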

#

Cache volumes: technically this could make your function impure?

hallow shard May 15, 2024, 12:25 AM

#

radiant pecan - Cache volumes: technically this could make your function impure?

Technically yeah, but it seems pragmatic to allow it to be pure, from the POV that cache presence shouldn't affect the result. But if you're using it to hack in some non content addressed data, totally

hallow shard May 15, 2024, 9:29 PM

#

I just got bit by an unpinned dependency in my build, so raising while the topic is hot. In my case it was a broken Wolfi package: https://github.com/wolfi-dev/os/pull/19530

Here's the apko pattern for pinning: https://www.chainguard.dev/unchained/reproducing-chainguards-reproducible-image-builds

I think this might just fit the // +lookup pattern, where the value being stored is a list of package version constraints.
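A rough sketch of how that could work, assuming a lock section that maps a lookup key to the resolved package constraints. The `Lock` type, its JSON layout and the package string are hypothetical, not the actual dagger.json format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Lock is a hypothetical lock section: lookup key -> pinned constraints.
type Lock struct {
	Lookups map[string][]string `json:"lookups"`
}

// Resolve returns the pinned value when one exists. It only runs the
// lookup when nothing is pinned yet or when update is requested.
func (l *Lock) Resolve(key string, update bool, lookup func() ([]string, error)) ([]string, error) {
	if pinned, ok := l.Lookups[key]; ok && !update {
		return pinned, nil
	}
	fresh, err := lookup()
	if err != nil {
		return nil, err
	}
	if l.Lookups == nil {
		l.Lookups = map[string][]string{}
	}
	l.Lookups[key] = fresh
	return fresh, nil
}

func main() {
	lock := &Lock{}
	calls := 0
	lookup := func() ([]string, error) {
		calls++
		return []string{"example-pkg=1.2.3-r0"}, nil // placeholder constraint
	}

	lock.Resolve("apko:packages", false, lookup)
	lock.Resolve("apko:packages", false, lookup) // served from the lock

	out, _ := json.Marshal(lock)
	fmt.Println(calls, string(out))
}
```

A broken upstream package would then only reach the build when someone deliberately re-runs the lookup with `update` set.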

radiant pecan May 20, 2024, 10:55 PM

#

FYI I opened an issue: https://github.com/dagger/dagger/issues/7428

GitHub

Cache function calls by default · Issue #7428 · dagger/dagger

Problem Dagger has great caching, but Dagger Functions don't fully benefit from it, because their runtime containers are not cached. This has several consequences: Functions that perform comput...

hallow shard May 29, 2024, 1:37 AM

#

radiant pecan My list so far, of behavior that can make a function impure: - It interacts wit...

Just remembered DagQL actually makes you specify a reason for the impurity: https://github.com/dagger/dagger/pull/7438/files#diff-74a9c9e133ee9e1936ab0992a17621f577bf69d08455e7610afef8ce50cc033fR56

So I guess that's one option: to enable impurity you need to provide a freeform description, and the onus is on the module author to be descriptive; it doubles as documentation anyway. Maybe we don't need a precise taxonomy or set of constants. That sort of thing is more interesting to observe from a vast dataset anyway; unlikely we'd predict them all.
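A sketch of that option, assuming impurity can only be switched on together with a non-empty description. This is not DagQL's actual API; `Function`, `Impure` and `Cacheable` are invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Function is a hypothetical function definition.
type Function struct {
	Name         string
	ImpureReason string // empty means the function is pure
}

// Impure returns a copy marked impure. It refuses a blank reason, so
// the author has to document why the function cannot be cached.
func (f Function) Impure(reason string) (Function, error) {
	if strings.TrimSpace(reason) == "" {
		return f, errors.New("impure functions must state a reason")
	}
	f.ImpureReason = reason
	return f, nil
}

// Cacheable reports whether the function is still considered pure.
func (f Function) Cacheable() bool { return f.ImpureReason == "" }

func main() {
	deploy, err := Function{Name: "deploy"}.Impure("pushes an image to a registry")
	fmt.Println(deploy.Cacheable(), err) // false <nil>

	_, err = Function{Name: "build"}.Impure("")
	fmt.Println(err) // impure functions must state a reason
}
```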

hallow shard May 29, 2024, 1:57 AM

#

radiant pecan This flows into a list of things the engine does differently between pure and im...

Pure functions can, if they are very very pure, be run without internet connectivity because they have no external dependency at all? (But some functions are pure even though they interact with an outside system, is that possible? Depends if "the function failed because docker hub was down" qualifies as "always get the same output for the same inputs")

I think disabling network for pure functions is a cool idea but I'm not sure about the practicality.

Practically, I think we have to allow networked pure functions, in the same way we allow wget ... to be cached. In the space Dagger occupies, networked resources are just a reality, and network fetches are the exact thing we want to cache, so if purity is what determines caching, we have to allow pure networked functions. re: failure to fetch a networked resource - I don't see failure as an output, I see it as failure to produce one, so it's still pure. But it would be terrible if the function actually returned different things for different calls. I don't know what to do about that, besides requiring checksums for everything. But honestly, not everyone wants that. Sometimes you just want to YOLO and cache/roll-forward on a best-effort basis.
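A minimal sketch of the two modes, assuming the checksum is optional: strict when one is given, best-effort when it is left empty. `digest` and `verify` are hypothetical helpers:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digest returns the hex-encoded SHA-256 of data.
func digest(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verify checks fetched bytes against an expected checksum. An empty
// wantHex means best-effort mode: whatever was fetched is accepted.
func verify(data []byte, wantHex string) error {
	if wantHex == "" {
		return nil
	}
	if got := digest(data); got != wantHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	body := []byte("hello") // stands in for a fetched network resource

	fmt.Println(verify(body, digest(body)) == nil)            // true: pinned and matching
	fmt.Println(verify(body, digest([]byte("other"))) == nil) // false: content changed
	fmt.Println(verify(body, "") == nil)                      // true: no checksum requested
}
```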

#

Placing Dagger on a 'purity spectrum' might be nice for docs. Something like this, only less embarrassing.

#

width of the slice representing flexibility
