what are cache volumes? | Dagger | Page 1

untold nimbus Aug 4, 2023, 7:04 PM

#

🧵

#

Here’s a prior discussion if that helps: #1124075148786544671 message

#

Sometimes Dagger executes a tool that has its own caching feature: usually package managers or compilers like npm, Maven, Go, etc. For those tools to cache properly, they need their own cache data (usually a directory) to be persisted between runs. So Dagger needs to pass these directories to the container in a special way, because whatever the tool writes into a regular directory is not carried over from one run of the tool to the next. That’s why there’s a special type of directory called CacheVolume. Cache volumes are not meant to be shared outside of Dagger: they are for persisting specific parts of the internal state of your pipeline, for optimal use of your tool’s native caching features.
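
To make the difference concrete, here is a toy Python sketch. It does not use Dagger's API: `install` is a hypothetical stand-in for a package manager, and a temporary directory stands in for the volume. The only point is that a tool's cache helps only when the same directory survives from one run to the next.

```python
import tempfile
from pathlib import Path


def install(package: str, cache_dir: Path) -> str:
    """Pretend package manager: 'download' only when the cache has no copy."""
    cached = cache_dir / package
    if cached.exists():
        return "hit"
    cached.write_text("package contents")  # stands in for a slow download
    return "miss"


def run_with_fresh_dir(package: str) -> str:
    """Each run gets a brand-new directory, so earlier writes are gone."""
    with tempfile.TemporaryDirectory() as tmp:
        return install(package, Path(tmp))


# Fresh directory every run: nothing the tool wrote survives to the next run.
fresh_results = [run_with_fresh_dir("left-pad") for _ in range(2)]

# Cache-volume-like directory: the same directory is reused across runs.
with tempfile.TemporaryDirectory() as persistent:
    persistent_results = [install("left-pad", Path(persistent)) for _ in range(2)]

print(fresh_results)       # ['miss', 'miss']
print(persistent_results)  # ['miss', 'hit']
```

The fresh-directory case is also roughly what an ephemeral CI runner sees: every run starts with an empty cache.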

#

Dagger Cloud simply allows cross-host synchronization of the cache volume’s contents. Otherwise the benefits of the cache volume are limited to the scope of each machine.

#

And in the case of ephemeral CI runners, the benefit is lost entirely, since the cache volume will start empty every time.

#

“environment” is an emerging concept, still being designed, that aims to address several requirements at the same time. One of those requirements is to provide a scope for caching. For example when you create a cache volume called “npm_cache”, what is the scope of that? What if you have multiple pipelines running in parallel on that engine, that use the same volume name for different purposes?

rugged comet Aug 4, 2023, 7:10 PM

#

> Dagger Cloud simply allows cross-host synchronization of the cache volume’s contents.
This sounds like quite a challenge unto itself.

untold nimbus Aug 4, 2023, 7:10 PM

#

Right now the volume namespace is global to the engine. That problem is exacerbated by cross-engine sync, because now the volume namespace is global to your organization.
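
A toy Python sketch of why a flat namespace is a problem. The `Engine` class here is hypothetical and is not Dagger's API; it only models "volumes are looked up by name, and nothing else".

```python
class Engine:
    """Toy engine with one flat namespace of cache volumes, keyed by name only."""

    def __init__(self):
        self.volumes = {}

    def cache_volume(self, name):
        # The same name always returns the same volume, whoever asks.
        return self.volumes.setdefault(name, {})


engine = Engine()

# Two unrelated pipelines pick the same volume name for different purposes.
frontend_cache = engine.cache_volume("npm_cache")
backend_cache = engine.cache_volume("npm_cache")

frontend_cache["lockfile-hash"] = "frontend-deps"
backend_cache["lockfile-hash"] = "backend-deps"  # overwrites the frontend entry

print(frontend_cache is backend_cache)  # True
print(frontend_cache["lockfile-hash"])  # backend-deps
```

With cross-engine sync, the same collision can happen between pipelines that never even run on the same machine.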

untold nimbus Aug 4, 2023, 7:10 PM

#

rugged comet > Dagger Cloud simply allows cross-host synchronization of the cache volume’s co...

Yes it is our primary commercial product and difficult enough to be worth paying for 🙂

rugged comet Aug 4, 2023, 7:54 PM

#

When I read syncing, I imagine something without strict consistency. As a user of a persistent volume, do I need to design my workloads around some consistency constraints?

#

In other words, given two independent runs both mounting and writing to the same files on the same persistent volume, will each run operate in isolation to mutations from the other? Might they interleave?

#

Like is it shared mutable state?

untold nimbus Aug 4, 2023, 9:05 PM

#

We only sync in between pipeline runs. So when you use a cache volume, you have the guarantee that it is in a state that was snapshotted after a pipeline completed. We do not sync volumes mid-pipeline.

#

@soft wagon can tell you more about the specifics. But under the hood we are going for "tried and true" rather than "black magic".

#

General idea: each engine uploads a snapshot of modified cache volumes at an opportune time (i.e. not mid-pipeline). If several engines are using the cache volume concurrently, they each download an entirely different snapshot, and the last snapshot upload wins.
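
A toy Python model of "last snapshot upload wins". This is only a sketch of the idea described above, not how Dagger Cloud is implemented: `CloudStore`, the volume name, and the package names are all made up, and it assumes for simplicity that both engines happen to start from the same snapshot.

```python
import copy


class CloudStore:
    """Toy remote store holding one snapshot per volume name."""

    def __init__(self):
        self.snapshots = {}

    def download(self, name):
        # Each engine gets its own private copy of the latest snapshot.
        return copy.deepcopy(self.snapshots.get(name, {}))

    def upload(self, name, contents):
        # No merging: the whole snapshot is replaced.
        self.snapshots[name] = copy.deepcopy(contents)


cloud = CloudStore()
cloud.upload("go_mod_cache", {"pkg-a": "v1"})

# Two engines start pipelines concurrently, each from its own downloaded copy.
engine_1 = cloud.download("go_mod_cache")
engine_2 = cloud.download("go_mod_cache")

# Each pipeline mutates only its local copy; neither sees the other's writes.
engine_1["pkg-b"] = "v1"
engine_2["pkg-c"] = "v1"

# Uploads happen after each pipeline completes; the later one wins.
cloud.upload("go_mod_cache", engine_1)
cloud.upload("go_mod_cache", engine_2)

print(cloud.snapshots["go_mod_cache"])  # {'pkg-a': 'v1', 'pkg-c': 'v1'}
```

Engine 1's `pkg-b` is lost from the shared snapshot, but no pipeline ever observed a half-written volume, which is the trade-off being described.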

#

We may look into smarter merging in the future, but right now there is absolutely no fanciness involved.

#

And I take your point about "sync" possibly giving the wrong idea @rugged comet . If other words come to mind to describe the feature, I'm interested 🙂

rugged comet Aug 4, 2023, 9:13 PM

#

I like this idea of internally-consistent views.

#

I was trying to understand if a cloud vs local build could produce indeterminate results.

soft wagon Aug 4, 2023, 9:21 PM

#

In terms of the syncing w/ our cloud, what Solomon said above 🙂 We default to as safe and unclever as possible right now. If we add smart merging in the future, it will be done very carefully since it absolutely turns into a distributed systems problem that involves logic specific to the programs accessing the cache mount.

I'll also add that in the context of just your local execution, when specifying the mount for a cache volume, it defaults to "shared" mode but there's also a "locked" mode. In shared mode, any concurrent execs can have it mounted and use it in parallel. This is common for Go module caches, pip cache, npm cache, etc. since those package managers handle concurrent access well.

In locked mode, only one exec at a time can run with it mounted. Any concurrent execs trying to run with it will run in serial, one at a time. Locked mode is meant for situations where you can benefit from reusable mutable state but the programs you're running don't handle concurrency well. Some package managers have either not supported concurrent access or had bugs in their implementation from time to time, which is what locked mode helps with.
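
A rough Python simulation of the difference between the two modes. It uses threads rather than real Dagger execs, and the `max_concurrency` helper and the mode strings are made up for this sketch; it only shows the scheduling behavior (parallel vs. one at a time), not Dagger's actual mount API.

```python
import threading
import time


def max_concurrency(mode: str, execs: int = 4) -> int:
    """Run fake execs against one volume; return the peak number using it at once."""
    volume_lock = threading.Lock()   # only taken in "locked" mode
    counter_lock = threading.Lock()  # protects the bookkeeping below
    barrier = threading.Barrier(execs)
    state = {"active": 0, "peak": 0}

    def use_volume():
        with counter_lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.1)  # stands in for the tool working on the cache
        with counter_lock:
            state["active"] -= 1

    def run_exec():
        barrier.wait()  # release all execs at the same moment
        if mode == "locked":
            with volume_lock:  # one exec at a time, the rest wait their turn
                use_volume()
        else:
            use_volume()  # shared: everyone uses the volume in parallel

    threads = [threading.Thread(target=run_exec) for _ in range(execs)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["peak"]


print("locked peak:", max_concurrency("locked"))  # locked peak: 1
print("shared peak:", max_concurrency("shared"))  # more than 1 when execs overlap
```

Locked mode trades throughput for safety: the execs still all run, just serially, so it only makes sense when the tool can't cope with concurrent access to its cache.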
