#cache mount id
@bleak wharf
I think we should hide the IDs (working theory)
- We control buildkit so hypothetical conflict with non-dagger client (already low) is even lower now
- We can give mounts their own ID just like containers and directories
- it’s not weird to reference a mount by id, same pattern as for everything else (was not so in Cue)
hm interesting, so would we expose a MountID and Mount type?
fwiw I have an initial implementation without that, withMountedCache just grabs the parent container ID + mount target path to come up with a cache ID
will push soon
Would that mean that if you change something unrelated like an env var then you end up with a different cache mount ID?
oooh right it’s that kind of ID…
I think that's the problem, there's not a good "content-addressable" way of creating IDs for these (that I can immediately think of)
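To make the concern concrete, here is a minimal sketch of that brittleness. It is not the WIP implementation: `parent_derived_cache_id` and the dict standing in for a container are hypothetical, and it only assumes the cache ID is a hash of the whole parent state plus the mount target.

```python
import hashlib
import json


def parent_derived_cache_id(parent: dict, target: str) -> str:
    """Hypothetical: derive a cache ID from the whole parent container state."""
    # Serialize deterministically so equal parents give equal IDs.
    payload = json.dumps({"parent": parent, "target": target}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


# Two parents that differ only in an unrelated env var (made-up values).
base = {"image": "node:18", "env": {}}
with_env = {"image": "node:18", "env": {"DEBUG": "1"}}

id_a = parent_derived_cache_id(base, "/cache/yarn")
id_b = parent_derived_cache_id(with_env, "/cache/yarn")

# The IDs differ, so the second container would start with an empty cache.
print(id_a == id_b)
```

Anything that feeds into the parent busts the cache, whether or not it has anything to do with what the cache holds.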
sorry I totally missed that part
No I mean there may be some way of doing it, it's just a lot trickier
Cache mounts in general go against the whole philosophy of content-addressable LLB, they introduce tons of weirdness like this
observation: if we don’t support sharing cache directories between containers, everything gets easier
then we can use some internal magic to infer cache id without exposing it to devs
for example it could be the graphql “path” of the mount
if you change any part of the parent yes, so you'd want to minimize change there and put everything inside of withMountedCache
so container { bustsCache { withMountedCache { usesCache } } }
yeah, this implementation becomes a more tightly scoped cache lifecycle, basically per-query, which is still a totally valid feature
are you sure? since cache mounts only are added at exec in llb… seems that everything would be the “parent”?
the ID is calculated at withMountedCache time, not LLB generation time, so it's just a string by that point
oh I see you’re not computing the ID from the actual llb
here's the WIP just to make things concrete: https://github.com/vito/dagger/commit/af0e439e4bac099792d091fe62ed0ff6c3a6ac48
Am I correct in thinking that the only way to get no weird cache misses at all, and no weird implicit API to learn, would be to require an individual API call to create each mount directory, then use their ID in a container call?
So that would be the verbose, but most correct design
would that be creating unique mount IDs?
Here's an example of where we are currently using cache mounts for npm dependencies (saves a ton of time): https://github.com/sipsma/dagger/blob/57685be3dd4eb5f95ef0e38d46ce4b35a1d03cf8/examples/yarn/index.ts#L65-L65
If we change the cache ID to be the container ID, that will include the LLB of the rootfs and any other mounts in the exec, right? And so wouldn't that mean that every time the input mount changed, a new empty cache mount would be used?
I'm just not sure if cache mounts are very useful anymore if that's true.
yes, which seems inevitable if we want to avoid weird glitches or arcane patterns
example of “weird glitch” 👆
yeah, it's very brittle, I was also worried about patterns like running go mod download as an earlier step with only go.mod/go.sum - if either changed you'd lose your cache
example of arcane pattern 👆
Yeah I think this is probably the way. Make them a first-class entity like secrets, filesystems, etc.
from experience with europa. A little verbosity on top of a robust primitive is WAY easier
undebuggable cache mount issues were a good 50% of all developer support cycles I’d say
aggravated by cue dx to be sure, but it’s not all cue’s fault
what's the difference between having a mount creation API (ultimately for ID generation) and just generating IDs client-side?
Consistency of DX I’d say
Everything else is an object with server-side generated IDs
right, but those are all content-addressed
for now
I'm not sure how the client will know when to generate a mount ID and when to re-use one (or even know how to get back to it)
but secrets may not be
and services definitely aren’t
we could call them “volumes” instead to clear up the confusion. makes it more clear that it’s not content addressed, similar to services and secrets
We would need to make it very clear that they are best-effort only, the data may be pruned at any time. So don't store your database in there, etc.
ah right
volumeish
I think calling it a CacheDirectory would work
you want to reuse the same cache content, reuse the same cache directory
CacheVolume ?
I guess one nice part is that it will make it easier to support different ways of generating an ID in the longer-term. So, we could hypothetically start with the same (confusing, bad) interface today where users just provide an ID. But then in the future we could add a different way of creating cache mounts where the ID is derived on behalf of the user and then deprecate the old way.
I think it's easier to do that with a separate API than it would be to continue with the current approach where it's all embedded in the Container api (I could be wrong though, maybe it's just easier to think about but implementation-wise it's arbitrary?)
CacheVolume seems reasonable (also implies maybe we support other types of volumes in the future)
i think this makes sense as long as the IDs are deterministic. to me the minimal API is "I give you a bunch of strings and you give me a cache ID", where those strings are just the user deciding what values should bust the cache. the dumb implementation of this would just be joining those strings together and/or hashing, but we could also add our own values to it to control scoping.
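A rough sketch of that "dumb implementation", assuming nothing about how the engine would really do it: `cache_id` and its `scope` parameter are hypothetical names, with `scope` standing in for any value the engine might add on its own to control scoping.

```python
import hashlib


def cache_id(*keys: str, scope: str = "") -> str:
    """Hypothetical: hash user-chosen strings, plus an optional engine-side scope."""
    parts = [scope, *keys]
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    payload = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(payload.encode()).hexdigest()


# The user decides which values should bust the cache.
amd64 = cache_id("yarn", "linux/amd64")
arm64 = cache_id("yarn", "linux/arm64")
print(amd64 == cache_id("yarn", "linux/amd64"))
print(amd64 == arm64)
```

The ID is deterministic, so the same strings always lead back to the same cache, and unrelated changes to the container no longer matter.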
Oh sure, that sounds great too
I think that's sort of like what GHA cache does iirc
oh right that's probably a good comparison
GHA cache also has 'restore-keys' and special prefix semantics, which is interesting
with a key like 'foo-bar-1234' you can configure restore-keys as [foo-, foo-bar-] and then when you're creating 'foo-bar-5678' it'll find 'foo-bar-1234' and restore it first, instead of starting from scratch
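The lookup described above can be sketched roughly like this. It only models the prefix fallback as described in this thread, not how GHA or Dagger would implement it; `find_restore_key` is a hypothetical helper, and it assumes the list of existing keys is ordered oldest to newest.

```python
from typing import Optional


def find_restore_key(
    key: str, restore_keys: list[str], existing: list[str]
) -> Optional[str]:
    """Return the existing cache key to restore from, or None.

    `existing` is assumed to be ordered oldest to newest.
    """
    # An exact hit always wins.
    if key in existing:
        return key
    # Otherwise try each restore prefix in the order given.
    for prefix in restore_keys:
        matches = [k for k in existing if k.startswith(prefix)]
        if matches:
            # Prefer the most recently created match.
            return matches[-1]
    return None


found = find_restore_key("foo-bar-5678", ["foo-", "foo-bar-"], ["foo-bar-1234"])
print(found)
```

With nothing matching any prefix the function returns `None`, which corresponds to starting from scratch.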
i have no idea how we would implement that, but i've found it useful in the past
https://github.com/concourse/governance/blob/69b3d4aaae0d9df7f71a1b50924383f339456af5/.github/workflows/terraform.yml#L52-L53 I use it to store Terraform state so I can avoid a S3 bill 😆
Yeah we could figure out something similar to that, my only concern is that whenever I've read the GHA docs about it, my brain has started melting (same as when I try to remember all the details of buildkit cache mounts). So I think we could hopefully come up with something better (or maybe caching is just hard, idk).
(need a reaction emoji that is old man yells at AWS)
I personally am fine with starting with the "dumb implementation" you describe here

your wish is my command
Thank you, besides all the nice code, discord has also improved 10x since you joined. We appreciate your leadership on this matter
@fallow gust 👆
@nocturne charm note we'll need to use an approach like this for things like yarn
https://github.com/vito/dagger/commit/af0e439e4bac099792d091fe62ed0ff6c3a6ac48#diff-2f732241fe4998f181c3703849d454990682cc44f9981c719b75ad7de777e340R715-R720
(subbing RAND with YARN_CACHE_FOLDER etc)
We're looking to use caching patterns in extensions for various language builds, etc
I think in practice we'd still want to do something like this though, right? https://github.com/actions/setup-go/blob/main/src/cache-restore.ts#L16-L32
e.g. caching on platform + runtime version + dependency file checksum
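A small sketch of that kind of key, assuming a simple `os-runtime-checksum` layout. This is not the setup-go code linked above; `dependency_cache_key` and the key format are made up for illustration.

```python
import hashlib
import platform


def dependency_cache_key(
    runtime_version: str, lockfile_bytes: bytes, os_name: str = ""
) -> str:
    """Hypothetical key: platform + runtime version + dependency file checksum."""
    # Fall back to the current OS when none is given.
    os_name = os_name or platform.system().lower()
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{os_name}-{runtime_version}-{digest}"


# Made-up lockfile contents; in practice this would be go.sum, yarn.lock, etc.
key = dependency_cache_key("1.19", b"example lockfile contents", os_name="linux")
print(key.startswith("linux-1.19-"))
```

Changing the lockfile changes the checksum and therefore the key, which is where something like restore-keys becomes useful for falling back to an older cache.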
The current plan is to hide raw buildkit cache mount IDs from direct developer control, because it's too easy to shoot yourself in the foot, and it's very different from the rest of the API
so if you need 4 "cache volumes" (current name for them) for 4 different platform/runtime combinations, then you'd make 4 calls to the Dagger API, to create 4 different cache volumes, and track their IDs in your code the usual way. Then use the IDs to mount the right volume in the right place
Slightly more tedious because you need to make 1 separate API call for each cache volume you want to create (as opposed to seamlessly configuring mount directly in your container creation call).
But way more reliable (we think)
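As a sketch of that flow, here is what "4 calls, 4 IDs" could look like. `FakeEngine` and `create_cache_volume` are hypothetical stand-ins, not the real Dagger API, and the platform and runtime names are made up.

```python
import hashlib
from itertools import product


class FakeEngine:
    """In-memory stand-in for the API; not Dagger's real client."""

    def __init__(self) -> None:
        self.volumes: dict[str, str] = {}

    def create_cache_volume(self, key: str) -> str:
        # The same key returns the same volume ID, so later runs reuse it.
        vol_id = "vol-" + hashlib.sha256(key.encode()).hexdigest()[:12]
        self.volumes[vol_id] = key
        return vol_id


engine = FakeEngine()
platforms = ["linux/amd64", "linux/arm64"]
runtimes = ["node16", "node18"]

# One API call per cache volume; track the IDs in ordinary code.
volume_ids = {
    (p, r): engine.create_cache_volume(f"yarn-{p}-{r}")
    for p, r in product(platforms, runtimes)
}
print(len(volume_ids))
```

Each container call would then pick the ID for its own platform/runtime pair and mount that volume at the right path.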
That makes sense, but how do you reuse that cache across engines?
(full context in this thread)
Do you mean across API calls to the same engine?
Or, do you mean how to persist eg. in between CI runs?
Between CI runs
Well that's a different question, you have the same problem regardless of the API to use cache mounts
I guess it's not distinct from a regular cached layer at that point
sorry I misunderstood that code you were showing, thought it was setting up Dagger cache mounts
Cache persistence in CI environments remains a PITA and is actually unchanged in the cloak design
We basically haven't touched that part
(I think)
The solution for CI remains: 1) slow the bleeding for now; 2) perhaps make it easier to hand off runs to persistent worker machines, à la bass loop; 3) one day run a magical caching service that makes all the pain go away
yeah I think it's mostly not important to persist the cache volume across CI runs since you'd have the whole layer cached. It would be potentially useful for different pipelines that could reuse the same cache but that's way more complicated
if your cache volumes get wiped in between runs, your runs will be slower though
So if hypothetically I had a query that just did this: https://github.com/dagger/dagger/blob/cloak/examples/yarn/index.ts#L25-L45
and that layer was cached (but the cache volume isn't there because we're in a new engine) I guess is that even possible? Or am I misunderstanding how the api works
yes that will be possible, although with a slightly redesigned API per my earlier comments. In a CI environment if you don’t arrange for cache persistence, your yarn cache will always be empty
Right with the new api, cool. Yeah the reason it came up was because we were talking about putting together extensions for the popular language package managers to handle all the caching scary bits for the users. So part of that will have to be help on setting up the underlying CI cache
Got it. With dagger those 2 are decoupled:
- Dagger extensions to handle language-specific caching (CI-agnostic)
- CI-specific configuration to persist Dagger cache (language agnostic)