#Are Dagger's cache semantics content-based or topological?


sick urchin
#

I am hoping to develop a stronger intuition for Dagger's internal mechanisms by understanding when a DAG node will or won't be recalculated. (Aside: I think I saw in a recent community call that Dagger Cloud now shows whether a step was pulled from cache, so forgive me for not just locking in and testing all of this for myself -- but in any case, I'd like to know if my mental model here makes sense.)

First, what I mean by content-based or topological: consider the DAG Input(A) -> Job(B) -> Value(C) -> Job(D), which has been run and cached. Then the DAG is rerun with Input(Z), which when fed to Job(B) also outputs Value(C), i.e. Input(Z) -> Job(B) -> Value(C) -> Job(D). Will Job(D) be recalculated, or will it be retrieved from cache? If cache semantics are content-based, it will be retrieved from cache, since Value(C) matches the input to Job(D) from the initial run. If cache semantics are topological, Job(D) will be recalculated, since it is a descendant of Input(Z), which has not been cached.
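To make the distinction concrete, here is a toy sketch of the two keying strategies. It is only a model of the question, not Dagger's or BuildKit's actual implementation; `digest`, `job_b`, and `run` are hypothetical names, and `job_b` is deliberately written so that different inputs produce the same Value(C).

```python
import hashlib


def digest(*parts):
    """Hash a sequence of strings into one hex digest (toy cache key)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode())
        h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()


def job_b(source):
    """Hypothetical Job(B): every input maps to the same Value(C)."""
    return "C"


def run(source, mode, cache):
    """Run Input -> Job(B) -> Job(D); return (result, Job(D) cache hit?)."""
    value = job_b(source)
    if mode == "topological":
        # Key derived from the chain of definitions upstream of Job(D).
        key = digest("D", digest("B", digest("input", source)))
    else:
        # Key derived only from the content that Job(D) actually receives.
        key = digest("D", digest("value", value))
    hit = key in cache
    if not hit:
        cache[key] = "D(" + value + ")"
    return cache[key], hit


for mode in ("topological", "content"):
    cache = {}
    for source in ("A", "Z"):
        _, hit = run(source, mode, cache)
        print(mode, source, "hit" if hit else "miss")
```

In this model, the second run with Input(Z) misses under topological keys and hits under content keys, which is exactly the behavior I am asking about.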

Of course, if we are optimizing for efficiency, content-based cache semantics are superior, as long as you can guarantee that the output of Job(D) actually will be the same in both cases. As far as I know, Dagger can't rigorously guarantee this (e.g. if a container makes an external HTTP request that retrieves a timestamped payload, does the container node's hash now incorporate the payload data?), but if the purpose of the Dagger API is simply to allow you to model your dependency graph with the semantics that are germane to your use case, then this may be "close enough" to catch what matters and ignore what doesn't.

In reading https://docs.dagger.io/features/caching/, it seems to me like the answer might be "both" -- assuming I understand correctly. Running out of characters on this post; will reply with what I mean by this.

#

The caching doc describes layer caching and volume caching, and I think I would be correct to interpret layer caching as topological (with the understanding that, like a git commit graph, OCI layers include the hash of their previous layers) and volume caching as content-based (with files being referenced by content hashes). Are layers and volumes exhaustive of the inputs and outputs in a Dagger DAG? Is it correct that any container build is going to use layer caching (as provided by BuildKit, like it would for a traditional Dockerfile), and that any other data input/output between functions uses volume caching, or are there some types of inputs that fall outside the scope of Dagger's caching?
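For reference, the "git commit graph" analogy can be sketched with the ChainID recurrence from the OCI image spec, where each layer's chain identifier folds in the chain identifier of everything beneath it. This is a minimal sketch of layer identity only; I am not assuming BuildKit's cache keys are computed this way, and the layer contents below are made up.

```python
import hashlib


def sha256_digest(data):
    """Return an OCI-style digest string for some bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def chain_id(diff_ids):
    """Fold a non-empty list of layer DiffIDs into a ChainID.

    ChainID(L0) = DiffID(L0)
    ChainID(L0..Ln) = Digest(ChainID(L0..Ln-1) + " " + DiffID(Ln))
    """
    chain = diff_ids[0]
    for diff_id in diff_ids[1:]:
        chain = sha256_digest((chain + " " + diff_id).encode())
    return chain


# Hypothetical layer contents, hashed to stand in for real DiffIDs.
base = sha256_digest(b"base layer")
other_base = sha256_digest(b"other base layer")
app = sha256_digest(b"app layer")

# The same top layer gets a different chain identity on a different base.
print(chain_id([base, app]) == chain_id([other_base, app]))
```

The top layer's own content hash is identical in both stacks, but its chain identity differs, which is what I mean by layer identity being topological rather than purely content-based.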

hollow sinew
#

having said that, I'm silently summoning @acoustic gyro and @tribal jolt, who have more context about the caching, as that's pretty much coming from BuildKit's internal cache.