#caching ttl
Example of the pattern in the wild: https://github.com/chainloop-dev/chainloop/blob/main/extras/dagger/main.go#L205-L209
and the implementation: https://github.com/chainloop-dev/chainloop/blob/main/extras/dagger/main.go#L237-L238
disclaimer: I contributed that particular snippet 🙂
Also a fan of that pattern 👍 - so this would be for impure functions right? We'd still default to pure?
Yeah just a refinement on what you suggested
Also, I'm still obsessed with the concept of "lookup functions", it seems relevant to this, but I don't know exactly how it interacts with this rip-off-the-bandaid plan
I think it'd work naturally with self calls
You'd have a lookup function with a ttl that delegates to a pure function with the resolved version
And that return value would be pure and pinned as a result
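To make the "lookup with a ttl delegating to a pure function" idea concrete, here is a minimal sketch. It is not Dagger code: `lookupCache`, `buildCache`, the module name and the version string are all hypothetical stand-ins. The point it tries to show is that the TTL only controls how often the lookup re-runs, while the pure function is cached by its resolved inputs.

```go
package main

import (
	"fmt"
	"time"
)

// lookupCache holds the last result of a lookup function for a limited time.
type lookupCache struct {
	ttl     time.Duration
	value   string
	fetched time.Time
	valid   bool
	lookups int // how many times the lookup really ran
}

// resolve returns the cached version if it is younger than ttl, otherwise it
// calls fetch (the impure part) and stores the result.
func (c *lookupCache) resolve(now time.Time, fetch func() string) string {
	if !c.valid || now.Sub(c.fetched) >= c.ttl {
		c.value, c.fetched, c.valid = fetch(), now, true
		c.lookups++
	}
	return c.value
}

// buildCache memoizes the pure function by its full input (module + version).
type buildCache struct {
	results map[string]string
	builds  int // how many times the pure function really ran
}

func (b *buildCache) build(module, version string) string {
	key := module + "@" + version
	if out, ok := b.results[key]; ok {
		return out
	}
	b.builds++
	out := "artifact-for-" + key
	b.results[key] = out
	return out
}

// demo calls the pair three times over two minutes with a one-minute TTL and
// an upstream version that never changes.
func demo() (lookups, builds int) {
	lc := &lookupCache{ttl: time.Minute}
	bc := &buildCache{results: map[string]string{}}
	fetch := func() string { return "v1.4.2" } // hypothetical external source
	start := time.Unix(0, 0)
	for _, offset := range []time.Duration{0, 30 * time.Second, 2 * time.Minute} {
		version := lc.resolve(start.Add(offset), fetch)
		bc.build("example.com/lib", version)
	}
	return lc.lookups, bc.builds
}

func main() {
	lookups, builds := demo()
	fmt.Printf("lookups=%d builds=%d\n", lookups, builds) // lookups=2 builds=1
}
```

The lookup runs twice (once at the start, once after the TTL expired), but the pinned build runs only once because the resolved version did not change.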
A lookup function would be a function marked by the developer as special:
- A lookup function is never cached (ttl is always zero, I think?)
- It can only take scalar values as arguments, and can only return scalar values
- The engine uses a special section of the caller's dagger.json as a special lookup cache.
- In addition to reading from dagger.json, the engine can also be configured to call all lookup functions and update the lock file
(sorry I was a bit slower than you there. just wanted to dump my thoughts)
Oh I see, yeah you're thinking of it as also encompassing pinning and such?
right
Would be a great time to figure that out
feels like maybe an intertwined design
but haven't fully figured it out yet
what if lookup functions only returned scalars - so you have to do that work once, eg. for Go, you'd need to write the function that takes a go module name and/or version, and returns a digest
but once that's done - possibly everything else can be pure
I'm wondering if anything we are planning now might make this harder to ship later?
I went down that rabbit hole in one of the early CLI design PRs (hell of a tangent in hindsight), there may be parts worth recovering, lots of potential prior art with Bass's memoization system too
I think if we could somehow make dagger.json a universal lock file, that could be insanely powerful
of course if you call core git(), container().from() or http(), that would work out of the box (under the hood there would be builtin lookup functions - possibly wrapping the buildkit calls, so buildkit only ever gets pinned calls)
I was worried about this being harder to ship later, if we ship a stopgap // +cached pragma
But actually, if we ship a // +ttl=<n> pragma instead, it gets easier
2 possible scenarios of future interference:
- Your function has a // +ttl, and now it's a lookup function. Lookup functions always have ttl=0 (no caching) so your ttl is ignored or you get an error. Unlikely scenario anyway.
The magic trick I think would be if we can have a context-free way of bumping dependencies by just re-evaluating functions described by the dagger.json
But the hard part I found there is knowing what can be pruned from dagger.json
Otherwise it just keeps growing
- Your function has a // +ttl, and now it doesn't need a ttl at all, because it's pure. So you'll get unnecessary cache misses. Worst case: you forget to remove the ttl, and have a non-perfectly-optimized function. Solution: remove the ttl. Seems fine.
Even though that problem is real, I think we can safely ignore it for now, because it's possible that in practice the file never grows large enough for this to really matter.
IMO bumping dependencies in the file would be done with a special flag to dagger call
Couple of things:
- There isn't really a need to limit to scalars for lookup functions. Inputs+outputs of functions are required to be json serializable, so you can just store any of the inputs/outputs in dagger.json. There are practical exceptions, which leads to the next point
- Any function that does side-effectful API calls is gonna have to be marked as uncached by the user. SetSecret comes to mind. I don't think it's practical for us currently to introspect the code to find those (would be another huge burden for SDKs to support).
- The other possibility would be to identify that a function made those calls after it runs and then prune buildkit's cache of it, but I don't know of a great way to do that out-of-the-box today. Certainly has to be possible, but may or may not require upstream changes.
- This isn't a blocker, but is probably a big gotcha for users after ripping the bandaid off
(fam dinner time, bbl [...Drizzy])
I guess if we 100% detach from buildkit's cache entirely for all this then we don't need to even think about pruning. So we could identify that SetSecret calls were made and then after the fact say "don't store that in dagger.json". But that would encompass not just lookup functions, but all functions. In which case dagger.json is gonna get real big real fast 😄 And also this wouldn't work for users calling external modules, in which case there's no dagger.json to write to, etc.
I guess I was only considering impure functions that have an input not visible in the DAG (eg. downloading go packages). But I was ignoring impure functions that have outputs not visible in the DAG (eg. uploading an artifact to a registry, or deploying, etc)
I'm not sure if your SetSecret example fits in the second category, or is in a 3rd category I also hadn't thought of
In my mind it's an example of a side-effectful function that should never be cached, so it's the same as users running side-effectful functions that don't even hit our API (i.e. deploying to AWS using an aws sdk directly, etc.).
The only reason I brought it up in particular is that I know there are multiple users calling SetSecret from their function, so we need to be very loud about those functions needing to be marked as uncached. It's also slightly different in that there are plausible routes we could automatically do it for them, as opposed to things like the "deploy to AWS" example where we could never figure it out on their behalf
What I think I've learned so far:
- The concept of a lookup function is potentially interesting 🙂
- We can ship default-cached functions + optional +ttl pragma now; and later we can ship lookup functions; they will not conflict.
- Even after lookup functions, there will still be the need for manual control of function caching behavior, in other words: we'll still need +ttl. We can't just say "all functions that aren't lookup functions are pure and always cached".
- Any function could be designated as a lookup function, which is cool (but what if my function returns a directory, will it cache the ID?)
What does SetSecret do again? It's for dynamically creating secrets basically
but what if my function returns a directory, will it cache the ID?
Yes, and we can also identify whether it's reproducible or not (i.e. not from a local dir) and choose whether to store in dagger.json based on that. Stuff like that
I guess the problem is not calling SetSecret itself, it's the external data source that presumably is used as cleartext argument to it
So then it goes in bucket 1: it has a hidden input (eg. it connects to hashicorp vault to get the latest ephemeral token, then shoves that in a new dagger secret)
It registers a Secret in memory, which is ephemeral to the session, so you can't cache it since the memory is blown away next time you run (or never there to begin with if on different machine, etc.)
oh I see
not only do you probably not want to cache it (because probably the cleartext is dynamically downloaded from somewhere), but you can't cache it anyway
so what happens if I set //+ttl=1000 to such a function ?
Yeah precisely, secret services solve this, but SetSecret still exists right now so we just need to be very clear about calling it in the "WARNING BREAKING CHANGE" release notes that would come with this
It would still break if you re-run the function calls that rely on it within the next 1000s. Basically, any function that makes that call has to be marked as not cached at all, ttl=0
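// +ttl=1000
func getToken(ctx context.Context, vaultAddress string, key string, vaultAuth *Secret) *Secret {
	// Hashicorp logic: fetch the current token from Vault (a hidden input)
	cleartext, _ := do_vault_stuff(ctx, vaultAddress, key, vaultAuth)
	// SetSecret takes a name and the plaintext; the secret only lives in this session's memory
	return dag.SetSecret(key, cleartext)
}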
Is this 👆 a realistic example?
So this function as I wrote it, would have a bug
Yeah that would break things the way SetSecret is implemented today. Say another function calls that one and then sets the returned secret as a secret env var in a container. If getToken was cached so it never ran, the SetSecret call will never run and the secret will not be present in the in-memory store and the container won't be able to use it to set the value of the env var.
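The failure mode described here can be simulated without Dagger at all. The sketch below is a toy model under stated assumptions: `session`, `resultCache`, `setSecret` and `useSecret` are hypothetical stand-ins, with the session's map playing the role of the in-memory secret store and the result cache surviving across sessions the way a function-call cache would.

```go
package main

import "fmt"

// session models one engine session: secrets live only in its memory.
type session struct {
	secrets map[string]string
}

func newSession() *session { return &session{secrets: map[string]string{}} }

// setSecret is a stand-in for a SetSecret-style call: it registers the
// plaintext in the session and returns a reference (here just the name).
func (s *session) setSecret(name, plaintext string) string {
	s.secrets[name] = plaintext
	return name
}

// resultCache persists across sessions, like a function-call cache would.
type resultCache map[string]string

// getToken mimics a cached function: on a cache hit its body, including the
// setSecret call, is skipped entirely.
func getToken(s *session, cache resultCache) string {
	if ref, ok := cache["getToken"]; ok {
		return ref
	}
	ref := s.setSecret("vault-token", "s3cr3t")
	cache["getToken"] = ref
	return ref
}

// useSecret is the consumer, e.g. something setting a secret env var.
func useSecret(s *session, ref string) error {
	if _, ok := s.secrets[ref]; !ok {
		return fmt.Errorf("secret %q not found in this session", ref)
	}
	return nil
}

// demo runs two sessions that share one result cache.
func demo() (first, second error) {
	cache := resultCache{}
	s1 := newSession()
	first = useSecret(s1, getToken(s1, cache))
	s2 := newSession()
	second = useSecret(s2, getToken(s2, cache))
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("session 1:", first)
	fmt.Println("session 2:", second)
}
```

The first session works because the function body ran. The second session gets a cache hit, so the secret was never registered and the consumer fails.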
If however secrets were looked up from services, then when the secret is needed it will always be obtained from the service (services are already never cached in execution, so it works out perfectly)
I guess there is a way to find these functions statically though: they return a secret
so we could disable caching of functions that return a secret (and return an error if you try to set a ttl)
That's true, also need to handle functions that return something that has a Secret somewhere inside of it (i.e. a container with a secret set, basically need to crawl the whole DAG) but that's doable (overlaps with all the crap needed for the "pass sockets as CLI args" epic I've been working through).
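A rough sketch of what "crawl the whole DAG" could look like, assuming a simplified value graph where each node records the operation that produced it and its inputs. The `node` type and the operation names are hypothetical; this is not the engine's ID representation.

```go
package main

import "fmt"

// node is one call in a simplified, hypothetical value DAG.
type node struct {
	op     string
	inputs []*node
}

// containsSecret reports whether n, or anything it was derived from, is the
// result of a setSecret call. The seen set keeps shared sub-graphs from being
// walked more than once.
func containsSecret(n *node, seen map[*node]bool) bool {
	if n == nil || seen[n] {
		return false
	}
	seen[n] = true
	if n.op == "setSecret" {
		return true
	}
	for _, in := range n.inputs {
		if containsSecret(in, seen) {
			return true
		}
	}
	return false
}

// demo checks a container without a secret and one with a secret variable.
func demo() (plain, withSecret bool) {
	base := &node{op: "container.from"}
	secret := &node{op: "setSecret"}
	env := &node{op: "withEnvVariable", inputs: []*node{base}}
	sec := &node{op: "withSecretVariable", inputs: []*node{env, secret}}
	return containsSecret(env, map[*node]bool{}), containsSecret(sec, map[*node]bool{})
}

func main() {
	plain, withSecret := demo()
	fmt.Println(plain, withSecret) // false true
}
```

A check like this only sees secrets that end up in the returned value. It would not catch a function that calls SetSecret internally and returns something unrelated.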
I guess there's still corner cases since one function can call another one directly (and thus call SetSecret without a Secret being present in its signature).
Again though, this is just something to make users aware of, I don't think it needs to be 100% solved necessarily before we can enable caching by default
Yeah we can probably cut that corner. When people hit it though, it will be painful, because the error will probably be opaque
We should still do it though, small enough number of people affected that we can give them VIP support on discord
It's probably trivial to make the error message less opaque too, like "Secret %s not found, did you cache a Function that calls SetSecret? Don't do that, silly!"
Though if you are calling a function transitively that's extra painful since it's not your fault and you need to wait for dependencies to update... We can do the "break glass scenario" and use the engine version tracking that was added to dagger.json to retain old behavior on modules that weren't updated yet
[circling back quickly] the downside here is if you have dependencies that are only needed for a side-effectful thing like releasing/publishing and you don't actually want to publish, you just want to bump the dependency. assuming you mean "you run the thing you normally would with a flag to bump dependencies"
I see what you mean. I guess there could be a dagger command (dagger update ?) that only calls the lookup functions themselves, without calling anything else. This would be possible if all required information is included in the lockfile: function path, arguments. Then it's up to you to make sure your lookup functions don't have side effects. But that seems fair since they're, you know, for lookup 🙂
Yep yep - that's how Bass's lock files work, they embed a module + func + args and just re-evaluate them to yield the new value. I think we can do basically the same thing, but yeah the last part I couldn't quite figure out was how to prevent it from just growing forever, because sometimes you'd have dependencies between dependencies and it couldn't really tell when some of those became unused. Kinda hard to explain from memory, hopefully it's something we won't run into, not sure
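As a sketch of the "embed function + args and just re-evaluate" idea: the structure below is a hypothetical lock section, not the real dagger.json schema, and `goModuleDigest` is a made-up lookup function. An update pass re-runs each recorded lookup from the file alone and overwrites the pinned value when it changed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry records enough to re-run one lookup without any other context.
type entry struct {
	Function string   `json:"function"`
	Args     []string `json:"args"`
	Value    string   `json:"value"`
}

// lockfile is a hypothetical "lookups" section, not the real dagger.json schema.
type lockfile struct {
	Lookups []entry `json:"lookups"`
}

// lookupFunc is a side-effect-free function from scalar args to a scalar.
type lookupFunc func(args []string) string

// update re-evaluates every entry and returns how many values changed.
func update(lf *lockfile, funcs map[string]lookupFunc) int {
	changed := 0
	for i := range lf.Lookups {
		e := &lf.Lookups[i]
		fn, ok := funcs[e.Function]
		if !ok {
			continue // unknown function: leave the pinned value alone
		}
		if v := fn(e.Args); v != e.Value {
			e.Value = v
			changed++
		}
	}
	return changed
}

// demo bumps one pinned digest and returns the change count and new file.
func demo() (int, string) {
	lf := &lockfile{Lookups: []entry{
		{Function: "goModuleDigest", Args: []string{"example.com/lib", "v1"}, Value: "sha256:old"},
	}}
	funcs := map[string]lookupFunc{
		// Hypothetical lookup that always reports a newer digest.
		"goModuleDigest": func(args []string) string { return "sha256:new" },
	}
	changed := update(lf, funcs)
	out, _ := json.Marshal(lf)
	return changed, string(out)
}

func main() {
	changed, out := demo()
	fmt.Println(changed, out)
}
```

Nothing else in the module needs to run for this pass, which is why the lookup functions themselves must be free of side effects.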
Could maybe be resolved with some sort of aliasing system
Would be great to integrate this directly into things like container.from and git/http too
yeah tons of instant value if we do that
(for posterity: https://github.com/vito/bass/blob/main/bass/bass.lock - it got pretty verbose)
Also I was wondering: how to represent the "path" of the lookup function, ie should the object state be part of the key? or the function calls that produced it?
maybe lookup functions are only allowed at top level of the module?
One tempting way to look at it is that it's a persistent form of DagQL's query cache, which is keyed by a hash of the ID. Those are also almost easy to evaluate as-is, except we only have loadFooFromID for objects, not scalar types (even though IDs can technically be any type)
This would be simpler though and probably avoid the whole inter-dependent-dependencies mess I mentioned
I almost feel like besides only allowing top level functions it may make sense to also only allow functions with no args to start. Otherwise if your lookup function accepts e.g. an int are we going to cache every result for every value of the int that it’s ever invoked with?
You could solve that with a whole pruning UX but even then it sounds kind of annoying to manage, besides more upfront work for us. I think a lot of use cases would already be covered even without arg support (you can still use consts and such) and then the pruning logic is just “if the function doesn’t exist anymore, prune it”.
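With top-level, no-argument lookup functions the lock can be keyed by function name alone, and the pruning rule becomes a set difference. A minimal sketch, assuming a plain map as the lock (the names and values are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// prune deletes lock entries whose function no longer exists in the module
// and returns the removed names in sorted order.
func prune(lock map[string]string, existing []string) []string {
	keep := make(map[string]bool, len(existing))
	for _, name := range existing {
		keep[name] = true
	}
	var removed []string
	for name := range lock {
		if !keep[name] {
			removed = append(removed, name)
		}
	}
	for _, name := range removed {
		delete(lock, name)
	}
	sort.Strings(removed)
	return removed
}

// demo prunes a lock where one function has been deleted from the module.
func demo() ([]string, int) {
	lock := map[string]string{
		"alpineDigest": "sha256:aaa",
		"goVersion":    "1.22.3",
		"oldHelper":    "42",
	}
	removed := prune(lock, []string{"alpineDigest", "goVersion"})
	return removed, len(lock)
}

func main() {
	removed, left := demo()
	fmt.Println(removed, left) // [oldHelper] 2
}
```

Because there is at most one entry per function, the lock cannot grow past the number of lookup functions in the module.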
Just as a starting place that sounds more tractable while not ruling out more in the future.
Oh btw my recent forays into the guts of our buildkit interface revealed a very interesting way that we could practically abandon LLB altogether and instead just operate purely on dagql IDs, having buildkit's persistent cache be a map of Dagql digests to layer cache.
Most importantly, it’s actually just straight up supported and not abusing any functionality or otherwise hacking something. And it also doesn’t require rewriting tons of buildkit code (some, but not a really significant amount)
Which is interesting because a “cache export” at that point would almost feel like a slight generalization of what we’re talking about here.
Not pursuing at this moment obviously but something to keep in mind, especially since storage drivers are coming soon.
Super interesting - for function analytics I've had to somewhat tightly couple the relationship between our DAG and LLB (https://github.com/dagger/dagger/pull/7321/files#diff-b93188b9d79aa9868ea185ca205ea31f7619506c4c764b9a5f49182aa9b48f3eR87-R173) and I've wondered how that might be made "whole" in the future. For now I'm calling LLB "effects" at the persistence layer to avoid keeping that terminology coupling forever (as in, Container.withExec returns a Container with a baked-in "future effect" that we'll see run later). Maybe in the future they're one and the same, or maybe we have some other identifier for "effects" 🤷‍♂️ (also open to other terms, but I guess either way we can migrate etc)
Ah interesting, yeah I was thinking through what this would look like in practice and there would absolutely still be a difference between dagql fields that produce a persistent filesystem as a side effect and those that are derived from those fields. E.g. in dag.Container().From(alpine).WithEnvVar(foo, bar).WithExec(blah).Stdout(), From and WithExec create persistent cached filesystems but WithEnvVar and Stdout don't. The difference is that today we internally have to translate all those "persistently cached" fields to LLB, solve that instead and then lose the "dagql-nativeness".
The breadcrumb for my realization around all this is this method: https://github.com/dagger/dagger/blob/2ae9f4d815c1fcf48626ce9437c788d0efb91a79/engine/buildkit/worker.go#L337 which is what gets called by buildkit to determine how to get the cache key and implementation for a given vertex. But that's coming from Buildkit's "generic solver" that is agnostic to LLB, so if you look at the types of the solver.Vertex and solver.Op, they aren't actually LLB specific at all. We could just be passing types from the dagql protobuf package instead.
It remains to be seen if the benefits of this are more than "that's super cool" but I'm gonna pull on the thread as part of storage drivers just enough to see if it would make sense to pursue further.
@dark nymph @hallow shard may I suggest we continue the thread here 🙂
@hallow shard so a small new dilemma for me: is it weird that a function dev can specify its own TTL value. What is the correct value? And what if the admin wants to configure their own value in a site-dependent way?
But on the other hand - you can implement a crude cache buster in code today. So if we don't let devs do it the clean way, they may just do it the dirty way with old school cache busters
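For reference, the "old school cache buster" usually looks something like the sketch below: an extra argument whose value changes once per time window, so a content-keyed cache misses roughly once per TTL. The `cacheBuster` helper is hypothetical and only illustrates the pattern.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// cacheBuster returns a string that stays constant within one ttl-sized
// window and changes when the next window starts. Passing it as an extra
// argument to a cached call forces a re-run roughly once per ttl.
func cacheBuster(now time.Time, ttl time.Duration) string {
	if ttl <= 0 {
		// No window: use the raw timestamp, so calls effectively never hit.
		return strconv.FormatInt(now.UnixNano(), 10)
	}
	return strconv.FormatInt(now.UnixNano()/int64(ttl), 10)
}

// demo compares busters inside one 10-minute window and across two windows.
func demo() (sameWindow, nextWindow bool) {
	ttl := 10 * time.Minute
	base := time.Unix(600000, 0) // aligned to a 10-minute boundary
	a := cacheBuster(base, ttl)
	b := cacheBuster(base.Add(5*time.Minute), ttl)
	c := cacheBuster(base.Add(10*time.Minute), ttl)
	return a == b, a != c
}

func main() {
	sameWindow, nextWindow := demo()
	fmt.Println(sameWindow, nextWindow) // true true
}
```

It is crude in two ways: windows are aligned to the clock rather than to the last run, and changing the ttl changes the value, which busts the cache on its own.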
so that's my dilemma
yeah... now i'm thinking it should be more caller-dependent tbh. For example, someone might want to bump certain dependencies every minute, and others every month
on the callee side, I can for sure know that something should have cache controls, but determining how long it caches for seems like you'd just be pulling a number out of thin air most of the time
OK so we need:
- A way for the callee to declare its caching behavior ("I need caching / I need no caching" or perhaps "I have side effects / I depend on an external data source") but not declare a ttl
- A way for the caller to set a ttl programmatically
- A way for the user to set the default ttl session-wide?
The third one (session-wide default ttl) is kinda interesting, I could see that being the "I just stepped on a plane and would like to stop bumping everything for a bit" switch
only, you would want the transition from ttl=60s to ttl=infinity (or whatever) to not bust the caches on its own 😛
Also you may want to choose between 1) setting the TTL of functions that expect to be cached, and 2) forcing a TTL on all functions, even those that indicated they should not be cached. Do we want both? And if so how do we differentiate?
also - I've personally run into cases where being able to literally set the cache key is useful, vs. just TTL-based. For example, if you have a cache that you know only needs to be busted for each commit. In the past this was mostly relevant for cache volumes, where you can put it in the name. Maybe there's a connection there, maybe not
- A way for the callee to declare its caching behavior ("I need caching / I need no caching" or perhaps "I have side effects / I depend on an external data source") but not declare a ttl
Follow-up question: is there any chance we may want to differentiate between those two types of "don't cache me" functions?
I'd certainly expect as a module author to have some control over what the user can or can't do
I guess "I depend on an external data source" might be the same as "// +lookup"
When caching makes absolutely zero sense, I want to be able to disable caching completely.
Caching is hard to debug, I'd like to minimize the possibility of user error 🙂
I wonder how this became a conversation about "ttl". Isn't the cache content based?
there could maybe be value in distinguishing side-effect-ful functions from functions whose return value changes over time, but there might be more value in just saying side-effect-ful functions should try their hardest to be idempotent, i.e. they should be able to safely re-run and converge on the desired state, because we want users to be able to re-run builds that may have partially succeeded, for example one big 'shipit' job that publishes a bunch of stuff
(that was Concourse's stance, and why the resource write operation is called put)
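A tiny sketch of what "safely re-run and converge" means for a side-effectful function. The `target` type is a hypothetical stand-in for an external system that can report what is currently deployed; nothing here is Concourse or Dagger API.

```go
package main

import "fmt"

// target models an external system that can report what is deployed.
type target struct {
	deployed string
	writes   int // number of real mutations performed
}

// put converges the target on the desired digest. It is a no-op when the
// target is already in that state, so re-running it is always safe.
func put(t *target, digest string) bool {
	if t.deployed == digest {
		return false
	}
	t.deployed = digest
	t.writes++
	return true
}

// demo deploys, re-runs the same deploy, then deploys a new digest.
func demo() (int, string) {
	t := &target{}
	put(t, "sha256:abc")
	put(t, "sha256:abc") // re-run, e.g. after another step of the job failed
	put(t, "sha256:def")
	return t.writes, t.deployed
}

func main() {
	writes, deployed := demo()
	fmt.Println(writes, deployed) // 2 sha256:def
}
```

Three calls result in two writes: the repeated call converges without touching the target.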
OK how about this:
1. // +lookup to mark your function as a lookup function. This means it may read data out-of-band from an external source, and will be cached in a special way (including in the future, a lockfile feature).
2. // +side_effect to mark your function as having side effects. This means it may write data out-of-band to an external source, and will be cached in a special way. (simplest implementation: no caching. but perhaps some optimizations in the future?)
3. Lookup functions cannot have side effects
4. All other functions are cached by default. The callee cannot specify a TTL
5. Caller can specify a TTL programmatically with dag.WithTTL(<milliseconds>)
6. Operator can set a default TTL in the CLI with dagger --ttl <milliseconds>
7. TTL settings are not applied to lookup and side-effect functions
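One way to read the proposal above is as a small decision function. The sketch below is only an interpretation: the `kind` and `policy` types are hypothetical, and the precedence of a caller-set TTL over the operator default is an assumption that the proposal does not spell out.

```go
package main

import (
	"fmt"
	"time"
)

type kind int

const (
	plain      kind = iota // no annotation
	lookup                 // marked +lookup
	sideEffect             // marked +side_effect
)

// policy describes how a call would be cached.
type policy struct {
	mode string        // "lockfile", "never", "ttl", or "forever"
	ttl  time.Duration // only meaningful when mode == "ttl"
}

// resolve applies the rules. callerTTL and operatorTTL are nil when unset.
// Assumption: a caller-set TTL takes precedence over the operator default.
func resolve(k kind, callerTTL, operatorTTL *time.Duration) policy {
	switch k {
	case lookup:
		return policy{mode: "lockfile"} // TTL settings are not applied
	case sideEffect:
		return policy{mode: "never"} // simplest implementation: no caching
	}
	if callerTTL != nil {
		return policy{mode: "ttl", ttl: *callerTTL}
	}
	if operatorTTL != nil {
		return policy{mode: "ttl", ttl: *operatorTTL}
	}
	return policy{mode: "forever"} // cached by default
}

// demo resolves a few representative combinations.
func demo() []policy {
	caller, operator := 30*time.Second, 10*time.Minute
	return []policy{
		resolve(plain, nil, nil),
		resolve(plain, nil, &operator),
		resolve(plain, &caller, &operator),
		resolve(lookup, &caller, &operator),
		resolve(sideEffect, &caller, &operator),
	}
}

func main() {
	for _, p := range demo() {
		fmt.Println(p.mode, p.ttl)
	}
}
```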
The "real" cache is, but this is a TTL for things that feed into that content-based key, i.e. knowing when there's a new version of something available. All values downstream of that would be cached independently, with the TTL only influencing how frequently those new values are discovered, completely disconnected from the eventual "real" cache. For example, going from TTL=10m to TTL=1m would cause more lookups, but you'd still get cache hits for all of the "real work" provided it discovers the same version
@dark nymph yeah the TTL is just API cosmetics. The cachebuster pattern is a messier way of setting a TTL
Makes sense
sounds pretty good, I don't grok 7) though - you'd want it applied to lookup, but not side-effect-ful ones right? Or you mean it's not applied to ones that have a custom WithTTL already set?
also, @trim marsh probably has thoughts on // +side_effect, he was talking about modeling it instead as a function that takes the 'current state' (auto-provided/fetched somehow), which might just be a // +lookup pattern
though even just having that metadata ("I mutate the outside world!") seems pretty nice
For 7: for lookup functions, I think the caching mechanism transcends TTLs, it is its own thing. The "cache" is the lockfile, so it never expires. If there's a "cache miss", you run the lookup and update the lockfile. If there's a "cache hit", you either use the value (with infinite ttl) or re-run lookup and overwrite, depending on what the user asked for. But TTL is never relevant IMO. (Like there's no TTL in go dependency caching for example).
--> I could be wrong, thinking out loud
I guess that could be a separate, more advanced feature. I just need a way for devs to tag their existing functions that happen to have side effects. There are lots of those and they're not going away. Instead of // +nocache I'm looking for something more specific basically.
So we tell developers "don't worry about whether dagger caches your function or not. Just worry about flagging side effects and external lookups, and dagger will do the right thing"
Oh yeah that makes sense in the context of a lock file. I guess that opens another question around whether/how the locking is opt-in on a call-by-call basis
Since we wanted to ship a stopgap quickly, and not rush a full implementation of lookup functions (so we can sweat the details on lockfile etc), maybe we start by just disabling caching completely on // +lookup?
Initially I thought the marker // +lookup would not appear in the stopgap at all; but now thinking that having it from the start, just with a more naive implementation, is even better
What do you think of adding // +impure as the primitive "don't cache me" and then adding // +lookup and // +side_effect once we actually have those more specific semantics pinned down? It was originally brought up here: https://linear.app/dagger/issue/DEV-3034/cache-control-for-functions#comment-8eb7205f - and matches DagQL terminology. The lookup and side_effect patterns seem like specializations of impure, so we could just start with that and add more nuance later. Starting with lookup seems like we could have the adverse effect of users marking not-actual-"lookup"-pattern things as lookup just to disable caching, but then getting undesired behavior later once we add meaning to it.
(Not that we need to leak DagQL terminology to the user, but it aligns well with functional terms anyway imo)
I think specializing from the start works better, because it will cause less breakage in the future.
- Specialize from the start: most devs get it right (if we explain the difference correctly). Let's call it a 20% error rate. Later when the behavior becomes really different (lockfile etc), those 20% break. Solution: rename lookup to side_effect or vice versa.
- Specialize later: everyone uses the impure stopgap for now. Later we migrate them to either lookup or side_effect. impure is deprecated. At this point the breakage rate is 100%: everyone is using a deprecated API and must migrate. More painful.
I wasn't assuming we'd deprecate impure - unless we're pretty sure lookup and side_effect cover 100% of the cases (maybe true). But sure, point taken about having to change more things to lookup if we think that's the massive majority case, but a lot of it might be non-lookup cases. Hard to say
Maybe we could look over the daggerverse - I would have guessed we have a lot of side_effect use cases right now, and not many lookup
"deploy to X"
Side effects:
- Upload
- Publish
- Deploy
- Send messages
Lookup:
- HTTP download
- Package registry download
- Git download
- Docker pull
in terms of code sprawl, I'd wager the 'lookup' cases are a few more centralized super common cases (mostly core APIs even), and 'side effects' is a ton of third-party modules
I don't know, pretty much every language-specific build function is a lookup; and linux distro package install also
lots of packaging systems out there
anything that deals with dependencies at runtime basically
yea true. and it could be totally incidental which things have been built so far (fluent-ci has a ton of 'deploy to X' modules)
one concern I have is possible mixing of the two. For example deployment systems sometimes can be queried for the current digest of the deployed content. That digest can then be incorporated into the cache key of the deployment function. In that case, should the deployment function still be marked as +side_effect?
I think it's fine for side_effect to mean 'possible side effects ideally with idempotency/intelligent no-ops involved', not 'I will always do something', but it might be worth thinking more about what side_effect could possibly even mean in terms of behavior 
for example, does it produce a particular type of return value? do we use it to draw a fancy 'output' in a Concourse-style data flow pipeline UI?
I think side_effect would mean:
- Never cache
- Use as an identifier in Traces (render in a special way, allow searching / filtering)
- Possibly instrument DNS to get analytics on outbound connections
is it always idempotent? does it tie back in to a symmetrical // +lookup function, much like how Concourse has a single Resource definition that implements check (// +lookup), get (the cached fetch), and put (// +side_effect)?
(ideally Resources would just be something you can implement in terms of these more generic building blocks, the parallels just seem useful to analyze)
Yeah, since you guys mentioned all objects are already required to be serializable to JSON, a lookup function could be check+get
which would make it easier to fit your existing code into a lookup function
there may be another interesting case, where you want // +lookup but you don't want it pinned, because it actually returns all available versions, or maybe even new versions from a cursor, which is how check works in Concourse
(substitute 'versions' for 'states' to make it more general)
I see, like querying the github API for a list of PRs for example?
yeah exactly
in Concourse for example, you could get all PRs => configure a pipeline for each PR
that might reduce to a general 'get current state' pattern
with a more intense focus on current (compared to lookup which you would happily pin, because that pinned state is always available)
feels related to this
yeah that wouldn't qualify as a lookup in the meaning i intended at least. which is your point I guess
is impure the opposite of reproducible?
yeah, sort of, it's hard to say with the word 'reproducible' being heavily debated, but it at minimum means you can't trust it to do/return the same thing when given the same parameters
but, it's still possible for an impure function to return a pure value
since it could return a value that has its own ID that has the pinned value from that particular call baked in
ok so: // +pure=false ? and later // +lookup is a different thing?
works for me - maybe we could even start with a few nuanced terms, and for now they're just aliases, and later we give them more meaning?
yeah I just worry that I was too optimistic that we can or should eventually distinguish the 2 properties of pure functions
1. Always produces the same output given the same input.
2. Has no side effects (does not alter any state or interact with the outside world).
These characteristics make pure functions predictable and easier to test.
There might be a whole other avenue to approach that from where we build on the interfaces support and start recognizing particular interfaces instead of trying to cram a bunch of things into function annotations. Sort of like how Go has io.Writer everywhere, we could have canonical Deployment and Resource interfaces that we recognize and visualize in certain ways
Just planting a seed, don't have much concrete there yet, the thought came up while I was implementing my Concourse module; we don't actually make much use of interfaces yet, so it might be an untapped mine
I love that and want to play with interfaces. but I think it's orthogonal
because purity is tied to the implementation
for example if I mock a resource i would expect my mock to be cached
yea true, these annotations would still live on the interface's function implementations
personally, i can see the value in having these simple aliases even if we're not 100% sure how they'll be used yet. like, even if it's just metadata, there's value in knowing what flavor of non-hermetic/pure a function is
even if it's just for browsing the daggerverse - "I want to find an effect-ful function for GitHub"
Maybe it's // +pure=false (never deprecated) + additional information you could optionally fill in?
For example you could specify which external resources you read from, write to, or both
(via a url for example)
there's also things like randomness and system clock
that feels more like data on the object to me (and interfaces/etc) - since in a lot of cases it might be dynamically configured, not something you can put in a static comment. Unless you mean something like an API schema definition URL/identifier
I don't know.. just throwing stuff at the wall
there's value in knowing what flavor of non-hermetic/pure a function is
I guess step one for this is: defining a nomenclature that captures 100% of non-pure functions like you said
My list so far, of behavior that can make a function impure:
- It interacts with an external system without mutating it (external data source)
- It interacts with an external system and does mutate it (side effect)
- It uses the system clock
- It uses the system's random generator
- It executes code that has unpredictable behavior (aka a flaky test, which we're discussing right now in #wandb-ai )
- other?
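If the list does boil down to a handful of combinable properties, one possible encoding is a bit set. This is purely illustrative: the `impurity` type, its flag names and the string labels are hypothetical and not part of any existing annotation scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// impurity is a set of reasons a function cannot be treated as pure.
type impurity uint8

const (
	externalRead  impurity = 1 << iota // reads from an external system
	externalWrite                      // mutates an external system
	clock                              // uses the system clock
	random                             // uses the system's random generator
	unpredictable                      // e.g. runs a flaky test
)

// names lists the labels in the same order as the flags above.
var names = []string{"external-read", "external-write", "clock", "random", "unpredictable"}

// String renders the set as "a+b", or "pure" when no flag is set.
func (i impurity) String() string {
	if i == 0 {
		return "pure"
	}
	var parts []string
	for bit, name := range names {
		if i&(impurity(1)<<uint(bit)) != 0 {
			parts = append(parts, name)
		}
	}
	return strings.Join(parts, "+")
}

// cacheable is true only when no impurity flag is set.
func (i impurity) cacheable() bool { return i == 0 }

// demo renders one combined set and the empty set.
func demo() (string, string, bool) {
	f := externalRead | clock
	return f.String(), impurity(0).String(), f.cacheable()
}

func main() {
	combined, none, ok := demo()
	fmt.Println(combined, none, ok) // external-read+clock pure false
}
```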
sounds like fun homework (unironically), will think on it! would be especially interesting if it boils down to just a few things, or some sort of combining scheme like read/write, external/internal
This flows into a list of things the engine does differently between pure and impure functions:
- Pure functions are cached
- Pure functions get deterministic system clock?
- Pure functions get deterministic system randomness?
- Pure functions are never retried (built-in retry feature for impure functions?)
- Pure functions can, if they are very very pure, be run without internet connectivity because they have no external dependency at all? (But some functions are pure even though they interact with an outside system, is that possible? Depends if "the function failed because docker hub was down" qualifies as "always get the same output for the same inputs")
- Cache volumes: technically this could make your function impure?
Technically yeah, but it seems pragmatic to allow it to be pure, from the POV that cache presence shouldn't affect the result. But if you're using it to hack in some non content addressed data, totally
I just got bit by an unpinned dependency in my build, so raising while the topic is hot. In my case it was a broken Wolfi package: https://github.com/wolfi-dev/os/pull/19530
Here's the apko pattern for pinning: https://www.chainguard.dev/unchained/reproducing-chainguards-reproducible-image-builds
I think this might just fit the // +lookup pattern, where the value being stored is a list of package version constraints.
FYI I opened an issue: https://github.com/dagger/dagger/issues/7428
Just remembered DagQL actually makes you specify a reason for the impurity: https://github.com/dagger/dagger/pull/7438/files#diff-74a9c9e133ee9e1936ab0992a17621f577bf69d08455e7610afef8ce50cc033fR56
So I guess that's one option: to enable impurity you need to provide a freeform description, and the onus is on the module author to be descriptive; it doubles as documentation anyway. Maybe we don't need a precise taxonomy or set of constants. That sort of thing is more interesting to observe from a vast dataset anyway; unlikely we'd predict them all.
Pure functions can, if they are very very pure, be run without internet connectivity because they have no external dependency at all? (But some functions are pure even though they interact with an outside system, is that possible? Depends if "the function failed because docker hub was down" qualifies as "always get the same output for the same inputs")
I think disabling network for pure functions is a cool idea but I'm not sure about the practicality.
Practically, I think we have to allow networked pure functions, in the same way we allow wget ... to be cached. In the space Dagger occupies, networked resources are just a reality, and network fetches are the exact thing we want to cache, so if purity is what determines caching, we have to allow pure networked functions. re: failure to fetch a networked resource - I don't see failure as an output, I see it as failure to produce one, so it's still pure. But it would be terrible if the function actually returned different things for different calls. I don't know what to do about that, besides requiring checksums for everything. But honestly, not everyone wants that. Sometimes you just want to YOLO and cache/roll-forward on a best-effort basis.
Placing Dagger on a 'purity spectrum' might be nice for docs. Something like this, only less embarrassing.
width of the slice representing flexibility