#A moment of doubt about cache volumes...
🧵
Also copying @oblique garnet @pearl lance @cyan agate
Basically: we have been straining our brains to make one particular feature of Dagger, cache volumes, work great. This feature is very powerful (enables powerful app-level caching) but also generally a pain.
We are making progress in supporting cache volumes better and better. But it does feel like the closer we get, the more painful it gets. Like approaching the speed of light, it feels like the closer we want to get, the more expensive it gets, and of course by definition you can never reach 100%.
The whole time we have never questioned the core mechanics of the feature itself, because it's part of core buildkit. We've treated it like an immutable constant, for us to work around as best as we could.
So this thread is just a thought experiment, where I ask: what would happen if we didn't do that?
Do we really need cache volumes as currently designed? Are they the best design to give users what they need, or are they a local maximum?
This goes all the way back to our epic dogfooding session with @ruby quartz, trying to convert his CUE configuration to the 0.2 design (codename Europa). Even back then, incorporating cache mounts was THE most painful part.
These memories resurfaced in the context of namespacing (https://github.com/dagger/dagger/issues/3345). And that topic itself resurfaced in the context of shipping Dagger Cloud, aka "a control plane that can distribute cache efficiently to a bunch of ephemeral CI runners".
The problem of namespacing cache volumes is not specific to cache distribution in CI (it exists when running dagger locally too), it's just exacerbated by it.
TLDR: we have never had a good, finished design for cache volumes, ever. We just had periods of time where the flaws and unfinished parts were not a pressing matter.
@ivory raptor I still think the idea of "auto-namespacing by extension" we discussed is a step in the right direction. But it doesn't solve everything... For example, the npm extension might be used to build multiple, unrelated apps as part of the same DAG. So it will need to manage multiple, unrelated cache directories. Can it really do so in a way that is hidden from the caller? That seems unlikely to end well. So it may be inevitable for extensions that use cache volumes to have leaks: they can't completely hide how they use cache directories.
If that's the case, that means one way or the other these cache directories will be part of these extension APIs
And if that's the case: then the current design of cache volumes makes it harder.
To illustrate what I mean, here's the old CUE API for cache volumes in the yarn/npm package: https://github.com/dagger/dagger/blob/v0.2.x/pkg/universe.dagger.io/yarn/yarn.cue#L131-L142
// Run a yarn command (`yarn <ARGS>`)
#Command: {
	[...]
	// Project name, used for cache scoping
	project: string | *"default"
	"Yarn cache": {
		dest: "/cache/yarn"
		contents: core.#CacheDir & {
			id: "\(project)-yarn"
		}
	}
	"NodeJS cache": {
		dest: "/src/node_modules"
		type: "cache"
		contents: core.#CacheDir & {
			id: "\(project)-nodejs"
		}
	}
	[...]
}
Sorry for jumping on this thread directly.
Last week, I was working with cache on GHA and Dagger. I also spent some time thinking about it. I thought @ruby quartz's suggestion in https://github.com/dagger/dagger/issues/5583#issuecomment-1667963869 ("In terms of design/implementation, what are your thoughts on a Zenith-based extension for this issue") is a viable option for the cache as well.
An extension mechanism would give users more flexibility overall.
I don't see the connection but maybe I'm missing something
In this example, to get a high-level yarn builder, in addition to the expected inputs (source code, settings etc) you also need to pass a project name. This is a string which has only one purpose: allow the underlying namespacing of cache volumes.
That API worked, but from experience - including the painful experience using it together with @ruby quartz - it is hard to reason about, because it's so different from the rest of the API. Basically: if the npm cache is a directory, why can't I pass that directory as input and output, like all my other artifacts?
Answer: well you can't, because you see it's a very special kind of directory, it's persisted out of band by buildkit, so you need to give it a string, etc etc
After reading that issue, I had a thought: if Dagger had an extension-based caching system, we could have more control and customization over it. There is no direct connection beyond that.
If my NPM cache is going to be something I have to manage and pass around, I would much rather have it be a regular directory. And if there's magic in how that directory is persisted between runs (and really that's all it is - persistence across runs), then let me manage that in a familiar way: like a simple persistence API for directories?
extend type Directory {
  save(key: String!): Directory!
  load(key: String!, create: Boolean = true): Directory!
}
I'm not sure about the API exactly, but since we end up snapshotting / checksumming the contents of these cache volumes anyway for distributed use - maybe it's not such a stretch to make it happen all the time, and inject those directories into the regular layer cache?
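To make the save/load idea a bit more concrete, here is a minimal Go sketch of what "persistence for regular directories" could look like. Everything in it is an assumption made for illustration: `Directory` is modeled as an in-memory map, and `Store`, `Save`, `Load` and `ErrNotFound` are hypothetical names, not part of any Dagger SDK. The point is only that a saved directory is an immutable snapshot addressed by a key, and `create` controls what happens when the key is unknown.

```go
package main

import (
	"errors"
	"fmt"
)

// Directory is a stand-in for an immutable directory snapshot:
// a map from file path to file contents. Hypothetical type, not the Dagger SDK.
type Directory map[string]string

// Store is a hypothetical key/value persistence layer for directories.
type Store struct {
	saved map[string]Directory
}

// ErrNotFound is returned by Load when the key is unknown and create is false.
var ErrNotFound = errors.New("no directory saved under this key")

func NewStore() *Store {
	return &Store{saved: make(map[string]Directory)}
}

// copyDir copies a directory so that saved snapshots cannot be mutated later.
func copyDir(d Directory) Directory {
	out := make(Directory, len(d))
	for path, contents := range d {
		out[path] = contents
	}
	return out
}

// Save persists a snapshot of d under key and returns d for chaining.
func (s *Store) Save(key string, d Directory) Directory {
	s.saved[key] = copyDir(d)
	return d
}

// Load returns a copy of the snapshot saved under key. If there is none,
// it returns an empty directory when create is true, or ErrNotFound otherwise.
func (s *Store) Load(key string, create bool) (Directory, error) {
	if d, ok := s.saved[key]; ok {
		return copyDir(d), nil
	}
	if create {
		return Directory{}, nil
	}
	return nil, ErrNotFound
}

func main() {
	store := NewStore()

	// First run: the cache starts empty, the tool fills it, we save the result.
	cache, _ := store.Load("myapp-npm", true)
	cache["left-pad/index.js"] = "// downloaded on first run"
	store.Save("myapp-npm", cache)

	// Second run: the same key yields the directory from the previous run.
	cache, _ = store.Load("myapp-npm", true)
	fmt.Println("files available on second run:", len(cache))
}
```

In this model the cache is passed in and out like any other artifact, and the only "magic" left is the key used to find the previous run's snapshot.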
To also hop in (hi, everyone, btw, I've been lurking in #maintainers for ageees)
I've had this idea for some time that a more high-level interaction of cache could be reached by buildkit clients that actually understood their files and dependencies instead of just "this directory here has some persistent cache".
Like, classically in C/C++, each unit is compiled separately, and then linked together - you don't need some generic cache directory here, the build tool understands each point. So an extension that builds C++ projects could ideally construct an LLB graph where each file is an input, and the compiled file is an output - then you could stitch this all together using MergeOp.
For the npm case, a package.json that says install X, and Y, and Z is kinda the same as installing X and installing Y and installing Z and then MergeOping them together. If you change a package, you still get the cache from the other unchanged packages.
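A rough Go sketch of the "install X, install Y, install Z, then merge" idea follows. It is a toy model under stated assumptions: layers are in-memory maps, `Installer` and `InstallAll` are hypothetical names, the package specs are made up, and the merge only imitates the "later layer wins" behavior one would expect from a MergeOp-style union. It does not call buildkit.

```go
package main

import (
	"fmt"
	"sort"
)

// Layer is a stand-in for a filesystem layer: path -> contents.
type Layer map[string]string

// Installer installs one package per layer and caches each layer by package spec.
// Hypothetical type for illustration only.
type Installer struct {
	cache    map[string]Layer
	Installs int // number of real (uncached) installs performed so far
}

func NewInstaller() *Installer {
	return &Installer{cache: make(map[string]Layer)}
}

// install returns the layer for a single package spec, reusing the cached
// layer when the exact same spec was installed before.
func (in *Installer) install(spec string) Layer {
	if layer, ok := in.cache[spec]; ok {
		return layer
	}
	in.Installs++
	layer := Layer{"node_modules/" + spec + "/index.js": "// contents of " + spec}
	in.cache[spec] = layer
	return layer
}

// InstallAll installs every spec independently and merges the resulting layers.
// Specs are processed in sorted order so the merge result is deterministic;
// on a path conflict the later layer wins.
func (in *Installer) InstallAll(specs []string) Layer {
	sorted := append([]string(nil), specs...)
	sort.Strings(sorted)

	merged := Layer{}
	for _, spec := range sorted {
		for path, contents := range in.install(spec) {
			merged[path] = contents
		}
	}
	return merged
}

func main() {
	in := NewInstaller()

	in.InstallAll([]string{"x@1.0.0", "y@1.0.0", "z@1.0.0"})
	fmt.Println("installs after first run:", in.Installs)

	// Only z changed, so only z is installed again; x and y come from cache.
	in.InstallAll([]string{"x@1.0.0", "y@1.0.0", "z@2.0.0"})
	fmt.Println("installs after bumping z:", in.Installs)
}
```

The sketch ignores everything that makes real package managers hard (hooks, lockfiles, shared index files), which is exactly the kind of limitation raised later in the thread.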
(maybe I'm missing the point though)
I see 2 possible wins:
- OX: only one cache to worry about. TTL, rotation, eviction etc. Not to mention terminology. It's all just "the cache"
- DX: simpler APIs. You're just passing directories around. You see exactly where the directories come from, and where they go. No spooky out of band magic.
Just to add to the melting pot: @pearl lance also had questions earlier about the semantics of setting initial content for a cache, which is pretty difficult to reason about, has some surprising behavior (changes to "initial content" value busts caches for subsequent WithExecs), and is tightly coupled to Container (can only configure initial content via Container.WithMountedCache)
Yeah, my attitude in this thought experiment is "maybe this API is best handled at a higher level", and basically we would stop using buildkit "cache mounts" completely
So all the concerns you're talking about @ivory raptor would be "solved" by virtue of us not using that code anymore (of course that only makes sense if being at a higher layer somehow allows us to solve the same problem better)
I think the big difference is that buildkit cannot count on callers being programmable. There's no DX above
Are there any known cons or performance penalties if we stop using buildkit's cache mounts?
This kind of does make your cache avoid having a magical special "blob" that is very difficult to cache well.
However, it doesn't necessarily make cache volumes obsolete. Implementing the higher level API in every single extension might be tricky - e.g. how does this work for go? Or rust? Maybe cache volumes are a nice "less-good" way to solve this, and could be a good stop-gap until a developer adds the full integration that really understands the toolchain.
Cache mounts are also still quite useful for entirely local use cases - so even if they don't make sense to share, I think it would be good to make sure there's still a way to do them locally.
There shouldn't be, we're directly integrating with Buildkit internals already so we should have access to the same internal mechanisms
Two things worth mentioning here I think:
- there are some situations where MergeOp can't be used, mainly when there's a single file that gets updated to reflect the current set of content, since there's no generic way to "merge" that file. Nix is an example here, because it maintains a SQLite db of installed packages (something like that)
- we've also been floating the idea of just representing these as "plain old volumes", less specific to the caching use case. You could imagine pointing a database to a volume for persisting state for local dev loops, for example (Docker Compose style)
They are definitely related topics at the very least. so worth discussing side by side
since both have to do with app messing with / enhancing the tool's default cache logic
all those things need a default cache key too
maybe "cache volumes" are just directories with a custom cache key?
and likewise those individual package installs combined with mergeop - they will need custom caching for the same reason: out-of-DAG dependency
aka "is there a new package I can download?"
so MAYBE if you squint, those two patterns (cache volume, mergeop-fest) are the same pattern at different levels of granularity & sophistication?
One thing is certain, the defining feature of cache volumes is that it lets you change how dagger does caching. So whatever replaces it will need that capability as well. It's about customizing your DAG's caching
I know that @oblique garnet knows more about this... but I'm pretty sure you can have a directory, do some operations on it (npm install, go get, etc), then get the state after those operations. I'm betting that might be the buildkit primitive that we could use to build up something more user-friendly.
@smoky night It definitely feels like there is a better pattern here than the current cache volume design.
I think this is the buildkit-level API you're thinking of? https://github.com/moby/buildkit/blob/master/frontend/gateway/pb/gateway.proto#L17
Yes you can do that, but you also need a way to re-inject that modified version of the directory in the next run
That's the "messing with the caching logic" part
True. Without always busting cache. 
- Ugh, I keep forgetting about file conflicts for some reason. Perhaps blasphemous - but potentially for the nix case, would it make some amount of sense for the database file to actually be owned and managed by the extension, instead of by buildkit? So an extension would read the database file input, add the packages and metadata in there that's necessary, and then add that as an input to the MergeOp, so that it'll overwrite the input. From a cache perspective, you'd then have each nix package -> layer, MergeOp concats all the layers together to produce the real fs. Though you really have problems with things like hooks, or anything that's ordered. Also, the cache graph of install "X" and then install "Y", versus install "X and Y" together, looks slightly different here (though maybe that doesn't matter). Maybe my lack of nix knowledge is showing here.
That might be it, yeah.
Actually, I'm pretty sure it's not.
One thing I like about merge-op fest (funky name, i like it), is that you could potentially populate pre-built global caches (from trusted compute obv) - "here is the result of building rust crate X", "here is the result of installing npm package Y". Rust build times are (not just recently) a contentious topic - having a global, pre-compiled cache of things to pull from for cases like this would be super cool imo.
@hidden elk I think this is it: https://pkg.go.dev/github.com/moby/buildkit@v0.12.1/client/llb#ExecState.GetMount
Global world-readable cache is a planned feature of our distributed cache service. The best part is that it works for all package ecosystems. Including rust
Of course it does assume there are adaptors so that each native tool leverages Dagger APIs. (what we're discussing here basically)
Is one viable route to make mergeop-fest so obviously better, and easy to use, that cache volumes are just deprecated?
i'm biased, but i'd love this
from my pov, cache volumes are a bit of a workaround for the fact that dockerfiles and similar don't have understanding of the underlying tools they're calling - so caching a directory is really the best you can do. if you move away from that to understanding how tools are really working, then i think building something that fits more naturally into the DAG becomes much more viable.
but how would it work for, say, Go build?
so you'd want to use the low level compiler and linker: https://pkg.go.dev/cmd/compile
thankfully, for go, there's a 1-1 case for packages and .o files, quote from the above page:
The generated files contain type information about the symbols exported by the package and about types used by symbols imported by the package from other packages. It is therefore not necessary when compiling client C of package P to read the files of P's dependencies, only the compiled output of P.
(edit: you'd still somehow need the extension to actually understand the graph of package dependencies, though, in a sense, this feels intuitively correct - if you want really good graph caching, the thing building the graph needs to understand its own graph before you can translate it efficiently into dagger)
I think the decision to make mergeop-fest the only way, depends on how hard implementation is, and how bad the experience is when it's not implemented.
C is a true nightmare here, because you need to track pre-processor dependencies - in this case, you could somehow use https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html -M flags to work these out maybe, so you know how to construct the actual build graph in the first place? but tbf, build tools like cmake, etc, already have to handle the misery that this is anyways, compiling c is very hard and sad
maybe i'm dragging this wayy off topic (happy to chat about this somewhere else), but..
it feels like what you have is an "application-level DAG" - something expressed through makefiles, or through the go tooling, or cargo build, or even npm dependencies. what you want as an extension dev, is to translate your application-level DAG into a dagger DAG - that way you get all this caching, data, encapsulation, etc. maybe instead of just providing a way to give dagger a DAG, you want a utility to help devs to express what their own DAG looks like, and then a utility to help devs convert that into dagger.
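One way to picture "translating an application-level DAG into cacheable steps" is to derive a cache key for each package from its own source plus the keys of the packages it depends on. The Go sketch below does only that. It is an illustration under assumptions: `Package` and `CacheKeys` are hypothetical names, sources are plain strings, and the graph is assumed to be acyclic. It says nothing about how Dagger or buildkit actually compute cache keys.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Package is one node of a hypothetical application-level DAG.
type Package struct {
	Source string   // stand-in for the package's source contents
	Deps   []string // names of the packages it imports
}

// CacheKeys derives one key per package from its source and the keys of its
// dependencies, so a change only invalidates the package itself and whatever
// depends on it. The graph is assumed to be acyclic.
func CacheKeys(graph map[string]Package) map[string]string {
	keys := make(map[string]string, len(graph))

	var visit func(name string) string
	visit = func(name string) string {
		if key, ok := keys[name]; ok {
			return key // already computed
		}
		pkg := graph[name]

		// Sort dependencies so the key does not depend on declaration order.
		deps := append([]string(nil), pkg.Deps...)
		sort.Strings(deps)

		h := sha256.New()
		fmt.Fprintf(h, "src:%s;", pkg.Source)
		for _, dep := range deps {
			fmt.Fprintf(h, "dep:%s=%s;", dep, visit(dep))
		}
		key := hex.EncodeToString(h.Sum(nil))
		keys[name] = key
		return key
	}

	for name := range graph {
		visit(name)
	}
	return keys
}

func main() {
	graph := map[string]Package{
		"app":  {Source: "package app", Deps: []string{"lib", "util"}},
		"lib":  {Source: "package lib"},
		"util": {Source: "package util"},
	}
	before := CacheKeys(graph)

	// Edit util: util and app must rebuild, lib can be served from cache.
	graph["util"] = Package{Source: "package util // edited"}
	after := CacheKeys(graph)

	for _, name := range []string{"app", "lib", "util"} {
		fmt.Printf("%s changed: %v\n", name, before[name] != after[name])
	}
}
```

A helper of this shape is roughly what "a utility to help devs express what their own DAG looks like" could start from; turning each keyed node into an actual build step is the part that stays tool-specific.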
definitely not off topic - I would say it's a 2-part topic with unclear connection between the two parts
Catching up
> I think the decision to make mergeop-fest the only way, depends on how hard implementation is, and how bad the experience is when it's not implemented.
It's quite a bit of work and also going to involve a lot of specificity to every application we integrate with. So we can (and should) do the work that takes a go.mod and converts it into a Directory that is a mergeop of each individual package downloaded to the go mod cache. But then we've only added support for go mod cache, we'd need to do the same for the go build cache (which is separate), pip, gradle, cargo, etc. etc.
I 100% think we should do this, it would be more efficient + safe than opaque cache mounts and it would plug right into hypothetical support for merging cache mount contents in the cloud (i.e. across different engines). But it's a very large barrier to entry, which is the main argument for retaining opaque cache mounts as the fallback when merge-op based optimization hasn't been implemented yet.
All that aside, there's other properties of cache mounts that are tricky to deal with if we change models here. One example is the fact that a shared cache mount can be read+written to by execs running in parallel. So if i have 10 exec ops running in parallel and they all share go mod+build cache mounts, they can de-dupe work between each other as they are running.
If we want to retain that, any new model we switch to would need to deal with it. It's not entirely clear to me how we could do that in a general way that doesn't just become essentially the same model as cache mounts currently are.
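The de-duplication property described above can be modeled in a few lines of Go. This is only an in-process analogy, assuming a hypothetical `SharedCache` type: ten goroutines stand in for ten parallel execs, and a map guarded by a mutex plus one `sync.Once` per entry stands in for the shared cache mount. Real cache mounts get this effect from the tools' own file locking and on-disk layout, not from anything like this code.

```go
package main

import (
	"fmt"
	"sync"
)

// entry holds one cached module; once ensures it is fetched a single time.
type entry struct {
	once sync.Once
	data string
}

// SharedCache is a toy model of a cache mount shared by concurrent execs.
type SharedCache struct {
	mu        sync.Mutex
	entries   map[string]*entry
	downloads int
}

func NewSharedCache() *SharedCache {
	return &SharedCache{entries: make(map[string]*entry)}
}

// Get returns the module contents. The first caller performs the (simulated)
// download; concurrent callers block until it finishes and reuse the result.
func (c *SharedCache) Get(module string) string {
	c.mu.Lock()
	e, ok := c.entries[module]
	if !ok {
		e = &entry{}
		c.entries[module] = e
	}
	c.mu.Unlock()

	e.once.Do(func() {
		c.mu.Lock()
		c.downloads++
		c.mu.Unlock()
		e.data = "contents of " + module
	})
	return e.data
}

// Downloads reports how many real downloads happened.
func (c *SharedCache) Downloads() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.downloads
}

func main() {
	cache := NewSharedCache()

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cache.Get("example.com/mod@v1.0.0") // hypothetical module name
		}()
	}
	wg.Wait()

	fmt.Println("downloads:", cache.Downloads()) // prints "downloads: 1"
}
```

Any replacement model that hands each exec its own immutable input directory would lose this sharing while the execs are still running, which is the concern raised here.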
@crisp spire I think pluggable caching is very interesting but too early to discuss until we have clarity on the model and APIs. In the issue you linked, engine provisioning is now more clear, so now we can discuss a plugin system on top
You're right re: concurrency settings. But, to be fair, we can't guarantee concurrent access across machines... so it may end up being another buildkit promise we have to bend over backwards to keep in distributed deployments
I think the key thing there is that because these are cache volumes, they are best effort, which means there's no promises. It's more about trying to give users best-effort performance optimizations than it is to promise them certain contents are there. In the distributed case, the extent of our best effort would just be that the cache mounts are shared on a given host (at least initially, more optimizations to combine cache mount contents across hosts would come alongside support for merging cache mount contents in the cloud).
One feature of cache mounts that's probably underutilized and underappreciated is that they can also be layered. You can specify a source Directory when specifying a cache mount, in which case that Directory will be the base and the actual cache contents will just be a mutable layer on top of the base.
One idea that has come up previously would be a pattern where cache mounts can get written to in "trusted" settings and then those trusted contents are used as a base for ephemeral cache mounts elsewhere.
For example, in our Dagger CI, when we merge into main we would populate the gomodcache cache mount using all the packages we currently use in main. Then in PRs (i.e. more untrusted executions), we'd use that gomodcache from main as the source Directory, but anything written by the pipelines running as part of the PR would be ephemeral changes on top of that base.
The reason that came up before was more about stopping a malicious or buggy PR from corrupting the cache mount contents that's shared between all other PRs. But it feels potentially relevant here too. Maybe.
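The "trusted base plus ephemeral layer" pattern can be sketched as a copy-on-write overlay. The Go code below is a toy model with hypothetical names (`LayeredCache`, `NewLayered`); the base and upper layers are in-memory maps rather than real snapshots, and it is not how buildkit implements layered cache mounts. It only shows the property that matters for the CI example: writes from a PR never reach the base that other PRs read from.

```go
package main

import "fmt"

// LayeredCache models a cache mount with a read-only base (the "source
// Directory") and a mutable upper layer that receives all writes.
type LayeredCache struct {
	base  map[string]string // trusted contents, never modified here
	upper map[string]string // ephemeral changes made by the current pipeline
}

func NewLayered(base map[string]string) *LayeredCache {
	return &LayeredCache{base: base, upper: make(map[string]string)}
}

// Read looks in the mutable layer first and falls back to the base.
func (c *LayeredCache) Read(path string) (string, bool) {
	if v, ok := c.upper[path]; ok {
		return v, true
	}
	v, ok := c.base[path]
	return v, ok
}

// Write only ever touches the mutable layer.
func (c *LayeredCache) Write(path, contents string) {
	c.upper[path] = contents
}

func main() {
	// Populated by a trusted run, e.g. on merge to main.
	trusted := map[string]string{"gomodcache/a@v1": "trusted contents of a"}

	// A PR run overwrites one entry and adds another.
	pr := NewLayered(trusted)
	pr.Write("gomodcache/a@v1", "tampered")
	pr.Write("gomodcache/b@v1", "downloaded by the PR")

	seenByPR, _ := pr.Read("gomodcache/a@v1")
	fmt.Println("PR sees:", seenByPR)
	fmt.Println("base still has:", trusted["gomodcache/a@v1"])

	// A different PR starts from the same base and sees none of those writes.
	other := NewLayered(trusted)
	_, found := other.Read("gomodcache/b@v1")
	fmt.Println("other PR sees b:", found)
}
```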
TBH I am not fully caught up on the exact problems around namespacing that a totally different approach like this would solve, so I don't really know if that's relevant yet.
Maybe there's some connection between a namespace and a layer of a cache mount? Certainly feels like a similar sort of structure somehow
Namespacing problems:
1. I'm developing apps A and B. They both use Dagger. By coincidence, both their Dagger pipelines create a cache volume called "foo". As a result, they both use the same cache volume to store unrelated data, causing conflicts.
2. I'm developing an app using Dagger. My Dagger pipeline imports 3rd-party components (library or extension) for different parts of the pipeline. Both components by coincidence create a cache volume "foo". This results in conflict between components of the pipeline (eg. Go builder and Python builder share data).
3. I'm developing an app using Dagger. My Dagger pipeline imports a single 3rd-party component, but it invokes it in several different parts of the pipeline. For example a builder function is used to build different apps. All invocations of the same component create a cache volume "foo". This results in a conflict between instances of the same component within the pipeline (eg. Go builder uses the same cache directory to build 10 different apps).
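The collisions listed above, and one possible way to avoid them, can be illustrated with a small Go sketch. It is purely hypothetical: `Engine` models cache volumes as a flat map keyed by ID, and `ScopedID` is an invented scheme that hashes a component name, a caller-supplied scope and the local volume name. It is not the "auto-namespacing by extension" design, just a picture of what any namespacing scheme has to achieve.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Engine models engine-wide cache volumes as a flat map keyed by volume ID.
type Engine struct {
	volumes map[string]map[string]string
}

func NewEngine() *Engine {
	return &Engine{volumes: make(map[string]map[string]string)}
}

// Volume returns the volume with the given ID, creating it on first use.
func (e *Engine) Volume(id string) map[string]string {
	v, ok := e.volumes[id]
	if !ok {
		v = make(map[string]string)
		e.volumes[id] = v
	}
	return v
}

// ScopedID derives a volume ID from the component that owns the volume, a
// scope chosen by the caller (for example the app being built) and the local
// name. Each part is length-prefixed so ("ab","c") and ("a","bc") differ.
func ScopedID(component, scope, name string) string {
	h := sha256.New()
	for _, part := range []string{component, scope, name} {
		fmt.Fprintf(h, "%d:%s;", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	e := NewEngine()

	// Flat names: two unrelated components both pick "foo" and clobber each other.
	e.Volume("foo")["state"] = "written by go builder"
	e.Volume("foo")["state"] = "written by python builder"
	fmt.Println(`flat "foo" now holds:`, e.Volume("foo")["state"])

	// Scoped IDs: same local name, distinct volumes.
	goA := ScopedID("go-builder", "app-a", "foo")
	goB := ScopedID("go-builder", "app-b", "foo")
	py := ScopedID("python-builder", "app-a", "foo")
	fmt.Println("go/app-a differs from go/app-b:", goA != goB)
	fmt.Println("go/app-a differs from python/app-a:", goA != py)
}
```

Note that the component part can be filled in automatically, but the scope still has to come from somewhere, which is the open question for the third problem.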
Right okay. I'm not sure of the exact details of the "auto-namespacing by extension" idea, so maybe I'm just repeating, but one possibility would be to only allow persistent cache volumes to be created+used by extensions.
In a given extension, a cache mount can be re-used and shared across different entrypoints in the extension, but it cannot be "exported" and used outside of that code (either returned to callers or passed to dependency envs).
So, if you want efficient Go mod/build caching, you have to either use our built-in Go extension or, if you really don't like the builtin one or need something highly-custom for whatever reason, create your own third-party extension that does Go builds with its own cache volume.
It would still be possible for Cache volumes to be created outside of an extension, but they would always be given a random ID and thus only beneficial over the course of a single session.
This approach would allow the extension to have centralized control over the cache volume.
- It can decide whether to just directly run pipelines on top of the cache volume, or whether to instead run them as ephemeral mutable layers on top of an immutable base (the source Directory feature I mentioned above).
- Or it can even decide to not use cache volumes at all and instead implement mergeop-fest types of optimizations (doesn't even have to be merge-op, it can do whatever it wants)
- The extension can also choose to accept configuration parameters that somehow influence the cache volume behavior (or even disable it entirely, etc.). That's up to the extension.
This doesn't totally stop arbitrary pipelines that end up using the same cache mount from interfering with each other, but honestly that's sort of the implicit tradeoff of cache volumes, better performance for more risk of conflicts.
- At least now, the decisions around that tradeoff are in a central point of control, whereas today it's 100% distributed chaos.
That would be a much more restrictive model than what exists today. However, I suspect it would still be just as useful for the vast majority of cases. Most users are just gonna end up using a builtin extension for this type of stuff. And those who really need something custom are free to write their own extension.
I'm sure there's also modifications we could make to the idea if it ends up being way too restrictive, this is just at a thinking-out-loud stage
E.g. one possible relaxation that would still retain the same benefits would be that it's allowed for extensions to "export" their cache volumes to callers or other extensions, but once that happens the "source" volume becomes a base layer for the new cache volume. So the user of the exported cache volume cannot make any changes to the source volume, it's an immutable layer that any changes they make are on top of.
I wouldn't want to start there, but that would be totally possible to support if a compelling use case arises.
The "auto-namespace by extension" idea is jotted down here: https://linear.app/dagger/issue/DEV-2526#comment-558e04ea
(Sorry it's a private team issue, since it started out as a cloud-only issue, then we gradually realized it had ramifications beyond that...)
Skimming that, I think what I'm proposing is somewhat different, partially in that it'd be much more of a breaking change (or at least, it would require introducing a new concept/api sibling to current one and probably deprecate the current api for removal in a future version).
The idea I'm suggesting is that cache volumes are not only namespaced by extension but also fully controlled+isolated by a given extension. To start, only that extension's code can use a given cache volume, it's totally private to it.
And then if a need to relax that arises, we can make use of the layered cache mounts feature to do it safely.
I think it solves all the problems you listed here while remaining usable
It does look the same, or at least I can't spot the difference
What about problem 3? How does the go builder extension (for example) make sure it uses a different go cache for each go project? It has to key it off of something right? Wouldn't that require the caller passing a key?
EDIT: meant problem 3, sorry
> It does look the same, or at least I can't spot the difference
There's no hierarchical namespacing and cache volumes are (to start) never passed between extensions. Maybe I'm misunderstanding the other idea or not explaining something clearly though.
> What about problem 3?
Well I'd first say that doing that would most likely be a pretty obscure need; one of the main benefits of cache volumes is that they transparently optimize by de-duping work across different pipelines.
For use cases that call for this though, the org with multiple teams would need to define their own extension, in which they'd define all the separate cache volumes they want for the different teams. These get automatically uniquely named because they are in an extension.
- I don't think this is much to ask, creating a platform-wide extension is probably what any org advanced enough to have this need would be doing anyways
Then I suppose they would still want to be able to use the built-in go extension for building. So probably need to tweak the idea such that the cache volume to be used by a given extension can actually be passed to another extension's function as an argument. So the API for the go extension's Build function would accept an optional parameter that overrides the cache mount to be used.
That would still retain the same benefits since there's still no identifying of cache volumes by strings; they still all have to be fixtures defined in an extension. But then as long as that cache volume to be used by a given function is just an option, the other 95% of use cases that don't care about this will just transparently use the default cache volume and get all the benefits without having to care about any of this.
So, maybe the crux of the solution is more just that "string-identified" cache volumes are gone entirely; cache volumes can only ever be defined in an extension, in which case they get an automatically assigned unique name.
I think that approach side-steps the need for workspaces and hierarchies of cache volume keys, but maybe I'm missing some other constraints you were thinking about that called for that?
Example in Go:
package main

import "context"

func main() {
	dag.CurrentEnvironment().
		WithArtifact(TeamFooBinary).
		WithArtifact(TeamBarBinary)
}

// in python this would just be annotated with @env.CacheVolume
func TeamFooGoModCache() *CacheVolume {
	return dag.CacheVolume() // could also support opts here
}

func TeamBarGoModCache() *CacheVolume {
	return dag.CacheVolume() // could also support opts here
}

func TeamFooBinary(ctx context.Context) (*File, error) {
	return dag.Go().Build(GoBuildOpts{
		GoModCache: TeamFooGoModCache(),
		// other go build opts...
	})
}

func TeamBarBinary(ctx context.Context) (*File, error) {
	return dag.Go().Build(GoBuildOpts{
		GoModCache: TeamBarGoModCache(),
		// other go build opts...
	})
}
And like I said, if the use case to export cache volumes arises (e.g. dag.CurrentEnvironment().WithCacheVolume(TeamFooGoModCache())), we could support that too, just with the rule that once exported to callers, then the existing contents of the cache volume become an immutable base layer and any other changes by the caller are a mutable layer on top of that.
Catching up..
OK just trying out a crazy idea... What if the entire concept of volume as a "special directory outside of the caching system" disappeared, and was replaced by a simple cache merge API?
- Before: "Please create a special directory that is not affected by regular caching. Then please mount it in my container before executing this command. This is a way for me to selectively bypass the caching system"
- After: "Please execute this command, with this custom cache merge strategy"
You have my curiosity. How would you do, say, a typical $GOMODCACHE? Like these: https://github.com/sipsma/dagger/blob/zenith/universe/go/util.go#L9-L12