#caching ttl

radiant pecan
#

🧵

#

disclaimer: I contributed that particular snippet 🙂

hallow shard
#

Also a fan of that pattern 👍 - so this would be for impure functions right? We'd still default to pure?

radiant pecan
#

Yeah just a refinement on what you suggested

#

Also, I'm still obsessed with the concept of "lookup functions", it seems relevant to this, but I don't know exactly how it interacts with this rip-off-the-bandaid plan

hallow shard
#

I think it'd work naturally with self calls

#

You'd have a lookup function with a ttl that delegates to a pure function with the resolved version

#

And that return value would be pure and pinned as a result

radiant pecan
#

A lookup function would be a function marked by the developer as special:

  • A lookup function is never cached (ttl is always zero, I think?)
  • It can only take scalar values as arguments, and can only return scalar values
  • The engine uses a special section of the caller's dagger.json as a special lookup cache.
  • In addition to reading from dagger.json, the engine can also be configured to call all lookup functions and update the lock file
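
A minimal sketch of the lookup idea in the bullets above, assuming the lock section of dagger.json can be modeled as a plain map. `LockFile`, `resolveGoModule`, `lookup`, and `build` are hypothetical names for illustration, not Dagger APIs, and the digest is fabricated rather than fetched.

```go
package main

import "fmt"

// LockFile is a hypothetical stand-in for the lookup section of dagger.json.
// It maps a lookup call (function name plus scalar args) to a pinned result.
type LockFile map[string]string

// resolveGoModule is a hypothetical lookup function: scalars in, scalar out.
// A real one would query a module proxy; this one fabricates a digest.
func resolveGoModule(module, version string) string {
	return "digest-of-" + module + "@" + version
}

// lookup returns the pinned value when the lock file has one. Otherwise, or
// when update is true, it runs the lookup function and records the result.
func lookup(lock LockFile, update bool, module, version string) string {
	key := fmt.Sprintf("resolveGoModule(%q,%q)", module, version)
	if pinned, ok := lock[key]; ok && !update {
		return pinned
	}
	digest := resolveGoModule(module, version)
	lock[key] = digest
	return digest
}

// build is pure: its output depends only on the pinned digest it is given.
func build(digest string) string {
	return "artifact built from " + digest
}

func main() {
	lock := LockFile{}
	fmt.Println(build(lookup(lock, false, "example.com/mod", "latest")))
	fmt.Println("pinned entries:", len(lock))
}
```

The `update` flag stands in for the "call all lookup functions and update the lock file" mode; everything downstream of `lookup` only ever sees a pinned value.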
#

(sorry I was a bit slower than you there. just wanted to dump my thoughts)

hallow shard
#

Oh I see, yeah you're thinking of it as also encompassing pinning and such?

radiant pecan
#

right

hallow shard
#

Would be a great time to figure that out

radiant pecan
#

feels like maybe an intertwined design

#

but haven't fully figured it out yet

#

what if lookup functions only returned scalars - so you have to do that work once, eg. for Go, you'd need to write the function that takes a go module name and/or version, and returns a digest

#

but once that's done - possibly everything else can be pure

#

I'm wondering if anything we are planning now might make this harder to ship later?

hallow shard
#

I went down that rabbit hole in one of the early CLI design PRs (hell of a tangent in hindsight), there may be parts worth recovering, lots of potential prior art with Bass's memoization system too

radiant pecan
#

I think if we could somehow make dagger.json a universal lock file, that could be insanely powerful

#

of course if you call core git(), container().from() or http(), that would work out of the box (under the hood there would be builtin lookup functions - possibly wrapping the buildkit calls, so buildkit only ever gets pinned calls)

#

I was worried about this being harder to ship later, if we ship a stopgap // +cached pragma

#

But actually, if we ship a // +ttl=<n> pragma instead, it gets easier

#

2 possible scenarios of future interference:

  1. Your function has a // +ttl, and now it's a lookup function. Lookup functions always have ttl=0 (no caching) so your ttl is ignored or you get an error. Unlikely scenario anyway.
hallow shard
#

The magic trick I think would be if we can have a context-free way of bumping dependencies by just re-evaluating functions described by the dagger.json

#

But the hard part I found there is knowing what can be pruned from dagger.json

#

Otherwise it just keeps growing

radiant pecan
#
  2. Your function has a // +ttl, and now it doesn't need a ttl at all, because it's pure. So you'll get unnecessary cache misses. Worst case: you forget to remove the ttl, and have a non-perfectly-optimized function. Solution: remove the ttl. Seems fine.
radiant pecan
#

IMO bumping dependencies in the file would be done with a special flag to dagger call

trim marsh
#

Couple of things:

  1. There isn't really a need to limit to scalars for lookup functions. Inputs+outputs of functions are required to be json serializable, so you can just store any of the inputs/outputs in dagger.json. There are practical exceptions, which leads to the next point
  2. Any function that does side-effectful API calls is gonna have to be marked as uncached by the user. SetSecret comes to mind. I don't think it's practical for us currently to introspect the code to find those (would be another huge burden for SDKs to support).
    • The other possibility would be to identify that a function made those calls after it runs and then prune buildkit's cache of it, but I don't know of a great way to do that out-of-the-box today. Certainly has to be possible, but may or may not require upstream changes.
    • This isn't a blocker, but is probably a big gotcha for users after ripping the bandaid off
hallow shard
#

(fam dinner time, bbl)

trim marsh
#

I guess if we 100% detach from buildkit's cache entirely for all this then we don't need to even think about pruning. So we could identify that SetSecret calls were made and then after the fact say "don't store that in dagger.json". But that would encompass not just lookup functions, but all functions. In which case dagger.json is gonna get real big real fast 😄 And also this wouldn't work for users calling external modules, in which case there's no dagger.json to write to, etc.

radiant pecan
#

I guess I was only considering impure functions that have an input not visible in the DAG (eg. downloading go packages). But I was ignoring impure functions that have outputs not visible in the DAG (eg. uploading an artifact to a registry, or deploying, etc)

#

I'm not sure if your SetSecret example fits in the second category, or is in a 3rd category I also hadn't thought of

trim marsh
# radiant pecan I'm not sure if your `SetSecret` example fits in the second category, or is in a...

In my mind it's an example of a side-effectful function that should never be cached, so it's the same as users running side-effectful functions that don't even hit our API (i.e. deploying to AWS using an aws sdk directly, etc.).

The only reason I brought it up in particular is that I know there are multiple users calling SetSecret from their function, so we need to be very loud about those functions needing to be marked as uncached. It's also slightly different in that there are plausible routes we could automatically do it for them, as opposed to things like the "deploy to AWS" example where we could never figure it out on their behalf

radiant pecan
#

What I think I've learned so far:

  • The concept of a lookup function is potentially interesting 🙂
  • We can ship default-cached functions + optional +ttl pragma now; and later we can ship lookup functions; they will not conflict.
  • Even after lookup functions, there will still be the need for manual control of function caching behavior, in other words: we'll still need +ttl. We can't just say "all functions that aren't lookup functions are pure and always cached".
  • Any function could be designated as a lookup function, which is cool (but what if my function returns a directory, will it cache the ID?)
trim marsh
#

but what if my function returns a directory, will it cache the ID?
Yes, and we can also identify whether it's reproducible or not (i.e. not from a local dir) and choose whether to store in dagger.json based on that. Stuff like that

radiant pecan
#

I guess the problem is not calling SetSecret itself, it's the external data source that presumably is used as cleartext argument to it

#

So then it goes in bucket 1: it has a hidden input (eg. it connects to hashicorp vault to get the latest ephemeral token, then shoves that in a new dagger secret)

radiant pecan
#

oh I see

#

not only do you probably not want to cache it (because probably the cleartext is dynamically downloaded from somewhere), but you can't cache it anyway

#

so what happens if I set //+ttl=1000 to such a function ?

radiant pecan
#

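// +ttl=1000
func getToken(ctx context.Context, vaultAddress string, key string, vaultAuth *Secret) *Secret {
  // Hashicorp Vault logic (do_vault_stuff is a placeholder, error ignored for brevity)
  cleartext, _ := do_vault_stuff(ctx)
  // SetSecret takes a name and the plaintext value
  return dag.SetSecret(key, cleartext)
}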
#

Is this 👆 a realistic example?

#

So this function as I wrote it, would have a bug

trim marsh
#

Yeah that would break things the way SetSecret is implemented today. Say another function calls that one and then sets the returned secret as a secret env var in a container. If getToken was cached so it never ran, the SetSecret call will never run and the secret will not be present in the in-memory store and the container won't be able to use it to set the value of the env var.
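
The failure mode described above can be reproduced with a toy model. This sketch assumes a result cache that outlives a session and a secret store that does not; `engine`, `secretStore`, `getToken`, and `useSecret` are hypothetical stand-ins, not the engine's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// secretStore models a per-session, in-memory secret store (an assumption
// about the behavior described above, not the engine's real data structure).
type secretStore map[string]string

// engine holds a function-result cache that outlives sessions, plus the
// current session's secret store.
type engine struct {
	resultCache map[string]string // function call -> returned secret name
	secrets     secretStore
}

// getToken stands in for a function that fetches cleartext and calls SetSecret.
// Its side effect (populating the store) only happens when the body runs.
func (e *engine) getToken() string {
	if name, ok := e.resultCache["getToken"]; ok {
		return name // cache hit: body skipped, so the store is never populated
	}
	e.secrets["token"] = "s3cr3t" // the SetSecret side effect
	e.resultCache["getToken"] = "token"
	return "token"
}

// useSecret models a container reading the secret as an env var.
func (e *engine) useSecret(name string) (string, error) {
	v, ok := e.secrets[name]
	if !ok {
		return "", errors.New("secret " + name + " not found in session store")
	}
	return v, nil
}

func main() {
	e := &engine{resultCache: map[string]string{}, secrets: secretStore{}}
	_, err := e.useSecret(e.getToken())
	fmt.Println("first session error:", err)

	e.secrets = secretStore{} // new session: empty store, result cache persists
	_, err = e.useSecret(e.getToken())
	fmt.Println("second session error:", err)
}
```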

#

If however secrets were looked up from services, then when the secret is needed it will always be obtained from the service (services are already never cached in execution, so it works out perfectly)

radiant pecan
#

I guess there is a way to find these functions statically though: they return a secret

#

so we could disable caching of functions that return a secret (and return an error if you try to set a ttl)

trim marsh
# radiant pecan I guess there is a way to find these functions statically though: they return a ...

That's true, also need to handle functions that return something that has a Secret somewhere inside of it (i.e. a container with a secret set, basically need to crawl the whole DAG) but that's doable (overlaps with all the crap needed for the "pass sockets as CLI args" epic I've been working through).

I guess there are still corner cases since one function can call another one directly (and thus call SetSecret without a Secret being present in its signature).
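
A rough sketch of the "crawl the whole DAG" check, assuming the returned object can be viewed as a graph of typed nodes. The `node` type and `containsSecret` are hypothetical, and as the corner case above notes, this only catches secrets that are reachable from the return value.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of one object in a returned DAG.
type node struct {
	Type   string  // e.g. "Container", "Directory", "Secret"
	Inputs []*node // objects this one was derived from or references
}

// containsSecret reports whether a Secret is reachable from n.
// The seen set avoids revisiting shared sub-graphs.
func containsSecret(n *node, seen map[*node]bool) bool {
	if n == nil || seen[n] {
		return false
	}
	seen[n] = true
	if n.Type == "Secret" {
		return true
	}
	for _, in := range n.Inputs {
		if containsSecret(in, seen) {
			return true
		}
	}
	return false
}

func main() {
	secret := &node{Type: "Secret"}
	base := &node{Type: "Container"}
	withSecret := &node{Type: "Container", Inputs: []*node{base, secret}}

	fmt.Println(containsSecret(base, map[*node]bool{}))
	fmt.Println(containsSecret(withSecret, map[*node]bool{}))
}
```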

#

Again though, this is just something to make users aware of, I don't think it needs to be 100% solved necessarily before we can enable caching by default

radiant pecan
#

Yeah we can probably cut that corner. When people hit it though, it will be painful, because the error will probably be opaque

#

We should still do it though, small enough number of people affected that we can give them VIP support on discord

trim marsh
#

Though if you are calling a function transitively that's extra painful since it's not your fault and you need to wait for dependencies to update... We can do the "break glass scenario" and use the engine version tracking that was added to dagger.json to retain old behavior on modules that weren't updated yet

radiant pecan
# hallow shard [circling back quickly] the downside here is if you have dependencies that are o...

I see what you mean. I guess there could be a dagger command (dagger update ?) that only calls the lookup functions themselves, without calling anything else. This would be possible if all required information is included in the lockfile: function path, arguments. Then it's up to you to make sure your lookup functions don't have side effects. But that seems fair since they're, you know, for lookup 🙂

hallow shard
# radiant pecan I see what you mean. I guess there could be a dagger command (`dagger update` ?)...

Yep yep - that's how Bass's lock files work, they embed a module + func + args and just re-evaluate them to yield the new value. I think we can do basically the same thing, but yeah the last part I couldn't quite figure out was how to prevent it from just growing forever, because sometimes you'd have dependencies between dependencies and it couldn't really tell when some of those became unused. Kinda hard to explain from memory, hopefully it's something we won't run into, not sure
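
A sketch of the re-evaluation idea, assuming each lock entry embeds module + func + args as described. `lockEntry`, `lookupFunc`, and `update` are hypothetical names; this is not how Bass or Dagger actually store entries.

```go
package main

import "fmt"

// lockEntry is a hypothetical lock file record: enough to re-run the lookup.
type lockEntry struct {
	Module string
	Func   string
	Args   []string
	Value  string // the currently pinned result
}

// lookupFunc is the shape of a lookup function in this sketch.
type lookupFunc func(args []string) string

// update re-evaluates every entry whose function is known and reports how
// many pins changed. Nothing else in the pipeline needs to run.
func update(entries []lockEntry, registry map[string]lookupFunc) int {
	changed := 0
	for i := range entries {
		e := &entries[i]
		fn, ok := registry[e.Module+"."+e.Func]
		if !ok {
			continue // unknown function: leave the pin untouched
		}
		if v := fn(e.Args); v != e.Value {
			e.Value = v
			changed++
		}
	}
	return changed
}

func main() {
	registry := map[string]lookupFunc{
		"golang.latestDigest": func(args []string) string { return "digest-2-of-" + args[0] },
	}
	entries := []lockEntry{{
		Module: "golang", Func: "latestDigest",
		Args: []string{"example.com/mod"}, Value: "digest-1-of-example.com/mod",
	}}
	fmt.Println("changed:", update(entries, registry))
	fmt.Println("new pin:", entries[0].Value)
}
```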

#

Could maybe be resolved with some sort of aliasing system

#

Would be great to integrate this directly into things like container.from and git/http too

radiant pecan
#

Also I was wondering: how to represent the "path" of the lookup function, ie should the object state be part of the key? or the function calls that produced it?

#

maybe lookup functions are only allowed at top level of the module?

hallow shard
#

One tempting way to look at it is that it's a persistent form of DagQL's query cache, which is keyed by a hash of the ID. Those are also almost easy to evaluate as-is, except we only have loadFooFromID for objects, not scalar types (even though IDs can technically be any type)

trim marsh
#

I almost feel like besides only allowing top level functions it may make sense to also only allow functions with no args to start. Otherwise if your lookup function accepts e.g. an int are we going to cache every result for every value of the int that it’s ever invoked with?

You could solve that with a whole pruning UX but even then it sounds kind of annoying to manage, besides more upfront work for us. I think a lot of use cases would already be covered even without arg support (you can still use consts and such) and then the pruning logic is just “if the function doesn’t exist anymore, prune it”.

Just as a starting place that sounds more tractable while not ruling out more in the future.
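
With top-level, no-arg lookup functions, the pruning rule above fits in a few lines. A sketch assuming the lock section is a map keyed by function name; `prune` and the example names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// prune drops lock entries whose lookup function no longer exists in the
// module and returns the removed names in sorted order.
func prune(lock map[string]string, exists map[string]bool) []string {
	var removed []string
	for name := range lock {
		if !exists[name] {
			delete(lock, name) // deleting while ranging over a map is safe in Go
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	lock := map[string]string{"alpineDigest": "sha256:aaa", "oldLookup": "sha256:bbb"}
	exists := map[string]bool{"alpineDigest": true}
	fmt.Println(prune(lock, exists), len(lock))
}
```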

trim marsh
# hallow shard One tempting way to look at it is that it's a persistent form of DagQL's query c...

Oh btw my recent forays into the guts of our buildkit interface revealed a very interesting way that we could practically abandon LLB altogether and instead just operate purely on dagql IDs, having buildkit's persistent cache be a map of Dagql digests to layer cache.

Most importantly, it’s actually just straight up supported and not abusing any functionality or otherwise hacking something. And it also doesn’t require rewriting tons of buildkit code (some, but not a really significant amount)

Which is interesting because a “cache export” at that point would almost feel like a slight generalization of what we’re talking about here.

Not pursuing at this moment obviously but something to keep in mind, especially since storage drivers are coming soon.

hallow shard
# trim marsh Oh btw my recent forays into the guts of our buildkit interface revealed a very ...

Super interesting - for function analytics I've had to somewhat tightly couple the relationship between our DAG and LLB (https://github.com/dagger/dagger/pull/7321/files#diff-b93188b9d79aa9868ea185ca205ea31f7619506c4c764b9a5f49182aa9b48f3eR87-R173) and I've wondered how that might be made "whole" in the future. For now I'm calling LLB "effects" at the persistence layer to avoid keeping that terminology coupling forever (as in, Container.withExec returns a Container with a baked-in "future effect" that we'll see run later). Maybe in the future they're one and the same, or maybe we have some other identifier for "effects" 🤷‍♂️ (also open to other terms, but I guess either way we can migrate etc)

trim marsh
# hallow shard Super interesting - for function analytics I've had to somewhat tightly couple t...

Ah interesting, yeah I was thinking through what this would look like in practice and there would absolutely still be a difference between dagql fields that produce a persistent filesystem as a side effect and those that are derived from those fields. E.g. in dag.Container().From(alpine).WithEnvVar(foo, bar).WithExec(blah).Stdout(), From and WithExec create persistent cached filesystems but WithEnvVar and Stdout don't. The difference is that today we internally have to translate all those "persistently cached" fields to LLB, solve that instead and then lose the "dagql-nativeness".

The breadcrumb for my realization around all this is this method: https://github.com/dagger/dagger/blob/2ae9f4d815c1fcf48626ce9437c788d0efb91a79/engine/buildkit/worker.go#L337 which is what gets called by buildkit to determine how to get the cache key and implementation for a given vertex. But that's coming from Buildkit's "generic solver" that is agnostic to LLB, so if you look at the types of the solver.Vertex and solver.Op, they aren't actually LLB specific at all. We could just be passing types from the dagql protobuf package instead.

It remains to be seen if the benefits of this are more than "that's super cool" but I'm gonna pull on the thread as part of storage drivers just enough to see if it would make sense to pursue further.

radiant pecan
#

@dark nymph @hallow shard may I suggest we continue the thread here 🙂

#

@hallow shard so a small new dilemma for me: is it weird that a function dev can specify its own TTL value? What is the correct value? And what if the admin wants to configure their own value in a site-dependent way?

#

But on the other hand - you can implement a crude cache buster in code today. So if we don't let devs do it the clean way, they may just do it the dirty way with old school cache busters

#

so that's my dilemma

hallow shard
#

on the callee side, I can for sure know that something should have cache controls, but determining how long it caches for seems like you'd just be pulling a number out of thin air most of the time

radiant pecan
#

OK so we need:

  1. A way for the callee to declare its caching behavior ("I need caching / I need no caching" or perhaps "I have side effects / I depend on an external data source") but not declare a ttl
  2. A way for the caller to set a ttl programmatically
  3. A way for the user to set the default ttl session-wide?
hallow shard
#
  3. is kinda interesting, I could see that being the "I just stepped on a plane and would like to stop bumping everything for a bit" switch
#

only, you would want the transition from ttl=60s to ttl=infinity (or whatever) to not bust the caches on its own 😛

radiant pecan
#

Also you may want to choose between 1) setting the TTL of functions that expect to be cached, and 2) forcing a TTL on all functions, even those that indicated they should not be cached. Do we want both? And if so how do we differentiate?

hallow shard
#

also - I've personally run into cases where being able to literally set the cache key is useful, vs. just TTL-based. For example, if you have a cache that you know only needs to be busted for each commit. In the past this was mostly relevant for cache volumes, where you can put it in the name. Maybe there's a connection there, maybe not

radiant pecan
#
  1. A way for the callee to declare its caching behavior ("I need caching / I need no caching" or perhaps "I have side effects / I depend on an external data source") but not declare a ttl

Follow-up question: is there any chance we may want to differentiate between those two types of "don't cache me" functions?

dark nymph
#

I'd certainly expect as a module author to have some control over what the user can or can't do

radiant pecan
#

I guess "I depend on an external data source" might be the same as "// +lookup"

dark nymph
#

When caching makes absolutely zero sense, I want to be able to disable caching completely.

#

Caching is hard to debug, I'd like to minimize the possibility of user error 🙂

#

I wonder how this became a conversation about "ttl". Isn't the cache content based?

hallow shard
# radiant pecan > 1. A way for the callee to declare its caching behavior ("I need caching / I n...

there could maybe be value in distinguishing side-effect-ful functions from functions whose return value changes over time, but there might be more value in just saying side-effect-ful functions should try their hardest to be idempotent, i.e. they should be able to safely re-run and converge on the desired state, because we want users to be able to re-run builds that may have partially succeeded, for example one big 'shipit' job that publishes a bunch of stuff

#

(that was Concourse's stance, and why the resource write operation is called put)

radiant pecan
#

OK how about this:

  1. // +lookup to mark your function as a lookup function. This means it may read data out-of-band from an external source, and will be cached in a special way (including in the future, a lockfile feature).

  2. // +side_effect to mark your function as having side effects. This means it may write data out-of-band to an external source, and will be cached in a special way. (simplest implementation: no caching. but perhaps some optimizations in the future?)

  3. Lookup functions cannot have side effects

  4. All other functions are cached by default. The callee cannot specify a TTL

  5. Caller can specify a TTL programmatically with dag.WithTTL(<milliseconds>)

  6. Operator can set a default TTL in the CLI with dagger --ttl <milliseconds>

  7. TTL settings are not applied to lookup and side-effect functions

hallow shard
# dark nymph I wonder how this became a conversation about "ttl". Isn't the cache content bas...

The "real" cache is, but this is a TTL for things that feed into that content-based key, i.e. knowing when there's a new version of something available. All values downstream of that would be cached independently, with the TTL only influencing how frequently those new values are discovered, completely disconnected from the eventual "real" cache. For example, going from TTL=10m to TTL=1m would cause more lookups, but you'd still get cache hits for all of the "real work" provided it discovers the same version

radiant pecan
#

@dark nymph yeah the TTL is just API cosmetics. The cachebuster pattern is a messier way of setting a TTL

dark nymph
#

Makes sense

hallow shard
#

also, @trim marsh probably has thoughts on // +side_effect, he was talking about modeling it instead as a function that takes the 'current state' (auto-provided/fetched somehow), which might just be a // +lookup pattern

#

though even just having that metadata ("I mutate the outside world!") seems pretty nice

radiant pecan
# hallow shard sounds pretty good, I don't grok 7) though - you'd want it applied to lookup, bu...

For 7: for lookup functions, I think the caching mechanism transcends TTLs, it is its own thing. The "cache" is the lockfile, so it never expires. If there's a "cache miss", you run the lookup and update the lockfile. If there's a "cache hit", you either use the value (with infinite ttl) or re-run lookup and overwrite, depending on what the user asked for. But TTL is never relevant IMO. (Like there's no TTL in go dependency caching for example).

--> I could be wrong, thinking out loud

radiant pecan
# hallow shard also, @trim marsh probably has thoughts on `// +side_effect`, he was t...

I guess that could be a separate, more advanced feature. I just need a way for devs to tag their existing functions that happen to have side effects. There are lots of those and they're not going away. Instead of // +nocache I'm looking for something more specific basically.

So we tell developers "don't worry about whether dagger caches your function or not. Just worry about flagging side effects and external lookups, and dagger will do the right thing"

radiant pecan
#

Initially I thought the marker // +lookup would not appear in the stopgap at all; but now thinking that having it from the start, just with a more naive implementation, is even better

hallow shard
# radiant pecan Since we wanted to ship a stopgap quickly, and not rush a full implementation of...

What do you think of adding // +impure as the primitive "don't cache me" and then adding // +lookup and // +side_effect once we actually have those more specific semantics pinned down? It was originally brought up here: https://linear.app/dagger/issue/DEV-3034/cache-control-for-functions#comment-8eb7205f - and matches DagQL terminology. The lookup and side_effect patterns seem like specializations of impure, so we could just start with that and add more nuance later. Starting with lookup seems like we could have the adverse effect of users marking not-actual-"lookup"-pattern things as lookup just to disable caching, but then getting undesired behavior later once we add meaning to it.

#

(Not that we need to leak DagQL terminology to the user, but it aligns well with functional terms anyway imo)

radiant pecan
#

I think specializing from the start works better, because it will cause less breakage in the future.

  • Specialize from the start: most devs get it right (if we explain the difference correctly). Let's call it 20% error rate. Later when the behavior becomes really different (lockfile etc), those 20% break. Solution: rename lookup to side_effect or vice versa.

  • Specialize later: everyone uses the impure stopgap for now. Later we migrate them to either lookup or side_effect. impure is deprecated. At this point the breakage rate is 100%: everyone is using a deprecated API and must migrate. More painful.

hallow shard
#

I wasn't assuming we'd deprecate impure - unless we're pretty sure lookup and side_effect cover 100% of the cases (maybe true). But sure, point taken about having to change more things to lookup if we think that's the massive majority case, but a lot of it might be non-lookup cases. Hard to say

#

Maybe we could look over the daggerverse - I would have guessed we have a lot of side_effect use cases right now, and not many lookup

#

"deploy to X"

radiant pecan
#

Side effects:

  • Upload
  • Publish
  • Deploy
  • Send messages

Lookup:

  • HTTP download
  • Package registry download
  • Git download
  • Docker pull
hallow shard
#

in terms of code sprawl, I'd wager the 'lookup' cases are a few more centralized super common cases (mostly core APIs even), and 'side effects' is a ton of third-party modules

radiant pecan
#

I don't know, pretty much every language-specific build function is a lookup; and linux distro package install also

#

lots of packaging systems out there

#

anything that deals with dependencies at runtime basically

hallow shard
#

yea true. and it could be totally incidental which things have been built so far (fluent-ci has a ton of 'deploy to X' modules)

radiant pecan
#

one concern I have is possible mixing of the two. For example deployment systems sometimes can be queried for the current digest of the deployed content. That digest can then be incorporated into the cache key of the deployment function. In that case, should the deployment function still be marked as +side_effect?

hallow shard
#

I think it's fine for side_effect to mean 'possible side effects ideally with idempotency/intelligent no-ops involved', not 'I will always do something', but it might be worth thinking more about what side_effect could possibly even mean in terms of behavior

#

for example, does it produce a particular type of return value? do we use it to draw a fancy 'output' in a Concourse-style data flow pipeline UI?

radiant pecan
#

I think side_effect would mean:

  • Never cache
  • Use as an identifier in Traces (render in a special way, allow searching / filtering)
  • Possibly instrument DNS to get analytics on outbound connections
hallow shard
#

is it always idempotent? does it tie back in to a symmetrical // +lookup function, much like how Concourse has a single Resource definition that implements check (// +lookup), get (the cached fetch), and put (// +side_effect)?

#

(ideally Resources would just be something you can implement in terms of these more generic building blocks, the parallels just seem useful to analyze)
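
To make the parallel concrete, here is a sketch of a Concourse-style resource expressed as a Go interface, with comments mapping each method to the annotation it would carry. `Resource` and `memResource` are hypothetical and use integer versions for brevity; real Concourse versions are structured data.

```go
package main

import "fmt"

// Resource mirrors the three Concourse operations discussed above.
type Resource interface {
	// Check lists versions newer than cursor (the "// +lookup" part).
	Check(cursor int) []int
	// Get fetches a pinned version; pure given the version, so cacheable.
	Get(version int) (string, bool)
	// Put publishes content and returns its version ("// +side_effect").
	Put(content string) int
}

// memResource is an in-memory implementation; versions start at 1.
type memResource struct{ contents []string }

func (m *memResource) Check(cursor int) []int {
	var out []int
	for v := cursor + 1; v <= len(m.contents); v++ {
		out = append(out, v)
	}
	return out
}

func (m *memResource) Get(version int) (string, bool) {
	if version < 1 || version > len(m.contents) {
		return "", false
	}
	return m.contents[version-1], true
}

func (m *memResource) Put(content string) int {
	m.contents = append(m.contents, content)
	return len(m.contents)
}

func main() {
	var r Resource = &memResource{}
	r.Put("build-1")
	r.Put("build-2")
	fmt.Println(r.Check(0), r.Check(1))
	fmt.Println(r.Get(2))
}
```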

radiant pecan
#

Yeah, since you guys mentioned all objects are already required to be serializable to JSON, a lookup function could be check+get

#

which would make it easier to fit your existing code into a lookup function

hallow shard
#

there may be another interesting case, where you want // +lookup but you don't want it pinned, because it actually returns all available versions, or maybe even new versions from a cursor, which is how check works in Concourse

#

(substitute 'versions' for 'states' to make it more general)

radiant pecan
#

I see, like querying the github API for a list of PRs for example?

hallow shard
#

yeah exactly

#

in Concourse for example, you could get all PRs => configure a pipeline for each PR

#

that might reduce to a general 'get current state' pattern

#

with a more intense focus on current (compared to lookup which you would happily pin, because that pinned state is always available)

radiant pecan
#

yeah that wouldn't qualify as a lookup in the meaning i intended at least. which is your point I guess

#

is impure the opposite of reproducible?

hallow shard
#

yeah, sort of, it's hard to say with the word 'reproducible' being heavily debated, but it at minimum means you can't trust it to do/return the same thing when given the same parameters

#

but, it's still possible for an impure function to return a pure value

#

since it could return a value that has its own ID that has the pinned value from that particular call baked in

radiant pecan
#

ok so: // +pure=false ? and later // +lookup is a different thing?

hallow shard
#

works for me - maybe we could even start with a few nuanced terms, and for now they're just aliases, and later we give them more meaning?

radiant pecan
#

yeah I just worry that I was too optimistic that we can or should eventually distinguish the 2 properties of pure functions

#

1. Always produces the same output given the same input.
2. Has no side effects (does not alter any state or interact with the outside world).

These characteristics make pure functions predictable and easier to test.
hallow shard
#

There might be a whole other avenue to approach that from where we build on the interfaces support and start recognizing particular interfaces instead of trying to cram a bunch of things into function annotations. Sort of like how Go has io.Writer everywhere, we could have canonical Deployment and Resource interfaces that we recognize and visualize in certain ways

#

Just planting a seed, don't have much concrete there yet, the thought came up while I was implementing my Concourse module; we don't actually make much use of interfaces yet, so it might be an untapped mine

radiant pecan
#

I love that and want to play with interfaces. but I think it's orthogonal

#

because purity is tied to the implementation

#

for example if I mock a resource i would expect my mock to be cached

hallow shard
#

yea true, these annotations would still live on the interface's function implementations

#

personally, i can see the value in having these simple aliases even if we're not 100% sure how they'll be used yet. like, even if it's just metadata, there's value in knowing what flavor of non-hermetic/pure a function is

#

even if it's just for browsing the daggerverse - "I want to find an effect-ful function for GitHub"

radiant pecan
#

Maybe it's // +pure=false (never deprecated) + additional information you could optionally fill in?

For example you could specify which external resources you read from, write to, or both

#

(via a url for example)

#

there's also things like randomness and system clock

hallow shard
#

that feels more like data on the object to me (and interfaces/etc) - since in a lot of cases it might be dynamically configured, not something you can put in a static comment. Unless you mean something like an API schema definition URL/identifier

radiant pecan
#

I don't know.. just throwing stuff at the wall

#

there's value in knowing what flavor of non-hermetic/pure a function is

I guess step one for this is: defining a nomenclature that captures 100% of non-pure functions like you said

#

My list so far, of behavior that can make a function impure:

  • It interacts with an external system without mutating it (external data source)
  • It interacts with an external system and does mutate it (side effect)
  • It uses the system clock
  • It uses the system's random generator
  • It executes code that has unpredictable behavior (aka a flaky test, which we're discussing right now in #wandb-ai )
  • other?
hallow shard
#

sounds like fun homework (unironically), will think on it! would be especially interesting if it boils down to just a few things, or some sort of combining scheme like read/write, external/internal
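
One possible shape for a "combining scheme", purely as a thought experiment: bit flags for each source of impurity listed above, which can be OR-ed together. The `Impurity` type and flag names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Impurity is a set of reasons a function may be impure.
type Impurity uint8

const (
	ReadsExternal  Impurity = 1 << iota // external data source
	WritesExternal                      // side effect
	UsesClock                           // system clock
	UsesRandom                          // random generator
	Unpredictable                       // e.g. flaky tests
)

// impurityNames is indexed by bit position, matching the constants above.
var impurityNames = []string{"reads-external", "writes-external", "clock", "random", "unpredictable"}

// String renders the set, or "pure" when no flag is set.
func (i Impurity) String() string {
	if i == 0 {
		return "pure"
	}
	var parts []string
	for bit, name := range impurityNames {
		if i&(Impurity(1)<<uint(bit)) != 0 {
			parts = append(parts, name)
		}
	}
	return strings.Join(parts, "+")
}

func main() {
	fmt.Println(Impurity(0))
	fmt.Println(ReadsExternal | WritesExternal)
}
```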

radiant pecan
#

This flows into a list of things the engine does differently between pure and impure functions:

  • Pure functions are cached
  • Pure functions get deterministic system clock?
  • Pure functions get deterministic system randomness?
  • Pure functions are never retried (built-in retry feature for impure functions?)
  • Pure functions can, if they are very very pure, be run without internet connectivity because they have no external dependency at all? (But some functions are pure even though they interact with an outside system, is that possible? Depends if "the function failed because docker hub was down" qualifies as "always get the same output for the same inputs")
#
  • Cache volumes: technically this could make your function impure?
hallow shard
# radiant pecan My list so far, of behavior that can make a function impure: - It interacts wit...

Just remembered that DagQL actually makes you specify a reason for the impurity: https://github.com/dagger/dagger/pull/7438/files#diff-74a9c9e133ee9e1936ab0992a17621f577bf69d08455e7610afef8ce50cc033fR56

So I guess that's one option: to enable impurity you need to provide a freeform description, and the onus is on the module author to be descriptive; it doubles as documentation anyway. Maybe we don't need a precise taxonomy or set of constants. That sort of thing is more interesting to observe from a vast dataset anyway; unlikely we'd predict them all.
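
A sketch of what "you must give a reason" could look like at the module level. `FuncSpec`, `Impure`, and `Cacheable` are hypothetical names chosen for this example and are not the DagQL API linked above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// FuncSpec is a hypothetical function definition carrying its impurity reason.
type FuncSpec struct {
	Name         string
	ImpureReason string // empty means the function is treated as pure
}

// Impure marks the function as impure and insists on a freeform reason,
// which doubles as documentation.
func (f FuncSpec) Impure(reason string) (FuncSpec, error) {
	if strings.TrimSpace(reason) == "" {
		return f, errors.New("impure functions must say why they are impure")
	}
	f.ImpureReason = reason
	return f, nil
}

// Cacheable reports whether the engine may cache calls to this function.
func (f FuncSpec) Cacheable() bool { return f.ImpureReason == "" }

func main() {
	deploy, err := FuncSpec{Name: "deploy"}.Impure("pushes an image to a registry")
	fmt.Println(deploy.Cacheable(), err)
	_, err = FuncSpec{Name: "lint"}.Impure("  ")
	fmt.Println(err)
}
```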

hallow shard
# radiant pecan This flows into a list of things the engine does differently between pure and im...

Pure functions can, if they are very very pure, be run without internet connectivity because they have no external dependency at all? (But some functions are pure even though they interact with an outside system, is that possible? Depends if "the function failed because docker hub was down" qualifies as "always get the same output for the same inputs")

I think disabling network for pure functions is a cool idea but I'm not sure about the practicality.

Practically, I think we have to allow networked pure functions, in the same way we allow wget ... to be cached. In the space Dagger occupies, networked resources are just a reality, and network fetches are the exact thing we want to cache, so if purity is what determines caching, we have to allow pure networked functions. re: failure to fetch a networked resource - I don't see failure as an output, I see it as failure to produce one, so it's still pure. But it would be terrible if the function actually returned different things for different calls. I don't know what to do about that, besides requiring checksums for everything. But honestly, not everyone wants that. Sometimes you just want to YOLO and cache/roll-forward on a best-effort basis.
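
For the "requiring checksums" end of that trade-off, the check itself is small. A sketch using SHA-256 from the Go standard library; `digest` and `verify` are hypothetical helper names, and the content is a stand-in for whatever was fetched over the network.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digest returns the hex-encoded SHA-256 of content.
func digest(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// verify fails when fetched content does not match the pinned checksum,
// turning "returned something different" into an error instead of an output.
func verify(content []byte, want string) error {
	if got := digest(content); got != want {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}

func main() {
	pinned := digest([]byte("release-1.0 tarball"))
	fmt.Println(verify([]byte("release-1.0 tarball"), pinned))
	fmt.Println(verify([]byte("tampered tarball"), pinned))
}
```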

#

Placing Dagger on a 'purity spectrum' might be nice for docs. Something like this, only less embarrassing.

#

width of the slice representing flexibility