#Distributed caching options
🧵 starting a thread here @coarse torrent
May I ask why you consider Dagger Cloud to not be an option?
We are an F50 and there's no way we're going to be able to get something like Dagger Cloud through procurement.
My team is having significant issues with companies that are several times the size of Dagger that would have smaller overall liability exposure.
Got it. I do want to make sure that's not true forever, but that doesn't help you today 🙂
May I ask what your runner infra looks like? Is this self-hosted on something like Kubernetes, or using a managed CI offering?
The hurdle for companies this size is largely liability when dealing with vendors 
We're currently using self-hosted GHA runners provided by another team. The machines themselves are GCP VMs with some tools installed, but otherwise largely "vanilla"
Of course there's the standard "corp proxy" type stuff as well.
I'm going to focus on finding a path for a successful technical implementation, but at some point in the future I would love to pick your brain on that particular problem, if you'll let me?
For us, we're invoking Dagger via a CLI that we publish (we are a platform team) for our consumers. Our CLI spins up the engine container before invoking client.Connect() so that we can tweak a few things.
Sure, happy to answer questions as I'm able.
On the technical side: are those GCP VMs provisioned dynamically for each GHA job, 1 ephemeral VM per job?
Yeah.
They're pre-provisioned and sit in a pool and then commit sudoku once the job is done.
Blargh, the env variables that are supposed to be set for the URL and the token are not. I smell shenanigans from Actions.
Focusing on the Dagger workload running inside those GHA runners: what's the general shape of the workload? Builds, tests, deployments, other?
I'm asking to get a rough sense of the scale tradeoffs
Build, test, scan, scan some more, publish.
Specifically scale of compute vs. scale of cache data
We're currently kludging caches via a ridiculous process of hoisting the cache contents into and out of the engine via mounted directories and the cache action. The cache speed up is significant.
On some projects on the order of 80%
What we've found is that, when you make everything cached by default, sometimes the scale bottlenecks are not where you'd expect them. Often that's a good thing, opens more options.
One possible path, which may or may not be a good fit for you, would be to have a beefy long-running VM dedicated to the Dagger engine, then have the dagger CLI point at that engine to run the pipelines. This will give you optimal caching with no moving parts (i.e., you dodge the complications of cache export, whether on GHA cache or buildkit).
Depending on the cache speed up, that may be all you need.
If it's too much load for a single beefy machine, even with cache speedup accounted for, you could look for orthogonal workloads: jobs that you are running in Dagger, but don't share any cache data. Then you can allocate each to its own beefy machine.
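A minimal sketch of the "one engine per orthogonal workload" idea, assuming the CLI is pointed at a remote engine through the `_EXPERIMENTAL_DAGGER_RUNNER_HOST` environment variable. The host names, port, and workload groups below are hypothetical placeholders, not anything from this thread.

```python
import os

# Hypothetical mapping: each workload group that shares no cache data with
# the others gets its own long-running engine VM.
ENGINE_HOSTS = {
    "jvm": "tcp://dagger-engine-jvm.internal:8080",
    "go": "tcp://dagger-engine-go.internal:8080",
}


def engine_env(workload_group, base_env=None):
    """Return a copy of the environment that targets the group's engine."""
    env = dict(os.environ if base_env is None else base_env)
    env["_EXPERIMENTAL_DAGGER_RUNNER_HOST"] = ENGINE_HOSTS[workload_group]
    return env


# Usage (not executed here): pass the result as `env=` when launching the
# pipeline process, e.g. subprocess.run(pipeline_cmd, env=engine_env("go")).
```

Keeping the mapping static means a given project always reuses the cache of the same engine, at the cost of having to rebalance by hand when one machine gets overloaded.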
Cache export (what you asked your original question about) is still a viable path. It just has a bunch of operational paper cuts, that really do add up. So it's worth looking for opportunities to not do it, if possible.
A persistent machine in the same network as the CI runners is going to be a significant implementation effort for us because of the usual corporate stuff.
Using the GHA cache or even a registry would ideally allow us to get many of the benefits at a fraction of the overhead.
Also am I correct in understanding that the GHA cache would not save cache volumes in Dagger?
OK, it looks like that really is the only viable option for you at the moment 🙂
Yes. That is one of the downsides. This is true for all buildkit cache export targets (including GHA cache, registry, S3 etc)
Interesting, the docs only mention the limitation for the GHA backend.
does that limitation apply to the local cache export as well?
I believe so
Bleh.
It's a fundamental design decision of buildkit that cache mounts (as they are called in buildkit) are meant to be regular local bind mounts, for performance optimization purposes, and nothing more
So sounds like my option here if I want cache volumes is to actually just mount /var/lib/dagger to the host and save that.
In the context of a fleet of ephemeral VMs, that would be very tricky to do in a way that is both safe and scalable. Assuming you mean via attached block device, or something like NFS, then you'll be limited to one writer at a time - and you will be responsible for enforcing that when orchestrating jobs
If you do find a way to make it work at the host filesystem level, do let us know 🙂 Would save us a lot of trouble
It's not that tricky, just save the mounted directory to the Actions cache via the standard cache action.
It's slower, but still effective.
I see what you mean. It's not guaranteed to work, though: things like the engine ID might change from one instance to the next.
The result is effectively having an engine container that is paused between runs, with the tradeoff of having to unpack and repack the cache each time.
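A rough sketch of the "mount the engine state on the host" setup, assuming the engine is started with `docker run` by a wrapper CLI. The container name and the image argument are hypothetical; only the `/var/lib/dagger` path comes from this thread. The functions only build the arguments, they do not start anything.

```python
def engine_run_args(state_dir, image, name="dagger-engine-ci"):
    """Build a `docker run` command that keeps engine state in state_dir."""
    return [
        "docker", "run", "--detach", "--privileged",
        "--name", name,
        # Bind-mount a host directory over the engine's state directory so
        # it can be saved and restored by the CI cache step between jobs.
        "--volume", f"{state_dir}:/var/lib/dagger",
        image,
    ]


def runner_host(name="dagger-engine-ci"):
    """Value for _EXPERIMENTAL_DAGGER_RUNNER_HOST targeting that container."""
    return f"docker-container://{name}"
```

The host directory would then be the path handed to the cache action, restored before the engine starts and saved after it stops.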
But if you want to explore, we'll do our best to help
The engine ID should be stored in that same directory, no? I thought I saw it there last time I poked around.
Also, we're reaching the limits of my expertise, to continue down this particular rabbit hole, we'll have to wait for a core dev to be around (probably tomorrow pacific time)
flagging @drowsy horizon now for later 🙂
@coarse torrent my educated guess of possible complications, in the "persist the engine state" approach: that state would be all over the place. In the engine container; in its associated volumes; and possibly in the containerd snapshotter state?
Since the container is privileged, the state could be spread across host and container
It's also possible that I'm mixing up current and future architecture, as we're in the middle of swapping out parts of the guts of buildkit, for operational improvements orthogonal to this discussion
Buildkit's self-contained as an execution engine. Currently everything is still fairly monolithic, so if I persist all the Buildkit state it Should Work(tm)
@oak briar what about BYOB? Is that something that could work in this case if they self-host the caching infra?
Possibly, but it looks like the main obstacle for Dagger Cloud is not even technical, it's simply getting through procurement. Any technical obstacles (possibly addressed, or at least mitigated, by BYOB) would be on top of that
Since @coarse torrent mentioned the liability exposure I thought it was more related to third party hosting than anything else
Having said that, I don't think storing /var/lib/dagger in the GHA cache has a promising future given that buildkit storage is not designed to be concurrently safe. If you have multiple builds accessing the same underlying volume concurrently, it's quite likely that you'll corrupt the state metadata
Right, but the runners are ephemeral so we have a strong guarantee that the engine is only used for a single job. Multiple requests for the same cache content are fulfilled as copies and saved to unique keys (or not saved at all).
I know that it's common to run the engine as a service, but that's not the model we're using currently.
The lifecycle is tightly bound to the CI job execution.
Right, in the scenario described by @coarse torrent (dump the whole state on a bucket when the job is done, download it again on the next job), concurrent access won't be an issue. Instead possible issues will be:
- Performance hit of re-downloading and re-uploading a possibly massive state at each job
- Concurrent uploads of state snapshots from multiple workers. You need to make sure each upload is atomic, ideally without keeping every copy forever. Not sure what the GHA cache primitives are exactly, and whether they help you deal with this
Yep, fortunately the latter is taken care of by GitHub.
Oh, I see.. you're basically creating N copies of /var/lib/dagger across several pipelines/jobs
The cache Action doesn't allow you to upload a cache to a key that already exists, so race conditions are largely avoided. In any case, we use a unique key per job run anyway, so this is all strongly guaranteed.
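A small sketch of the "unique key per job run" scheme, assuming the key is built from GitHub's default environment variables and that older snapshots are picked up through prefix matching (the cache action's `restore-keys` input). The `dagger-state` prefix and the key layout are illustrative choices, not the scheme actually used here.

```python
def state_cache_keys(env):
    """Return (primary_key, restore_prefixes) for one job run."""
    os_prefix = f"dagger-state-{env['RUNNER_OS']}-"
    branch_prefix = f"{os_prefix}{env['GITHUB_REF_NAME']}-"
    # Run ID + attempt + job name never repeat, so a save can never collide
    # with a key that already exists.
    primary = (
        f"{branch_prefix}{env['GITHUB_RUN_ID']}-"
        f"{env['GITHUB_RUN_ATTEMPT']}-{env['GITHUB_JOB']}"
    )
    # Restore falls back from "latest on this branch" to "latest anywhere".
    return primary, [branch_prefix, os_prefix]
```

Because every save goes to a fresh key, each job writes its own snapshot; the tradeoff is that old snapshots pile up until the cache's eviction policy removes them.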
The former thing is something we already experience, but to your point might get worse.
Actually, scratch that, will get worse since Buildkit's state will include layers in addition to the cache mounts.
So it'll have every layer of every pipeline run that gets committed to main.
Yes, that's what I was thinking about
The amount of data will be quite large
I'm really only interested in the cache mounts, to be quite honest.
Execution caching is very much a nice to have.
We feel you. That's one of the reasons we had to build Dagger Cloud distributed caching in the first place
For the parts where we're going to get a cache hit for execution caching, the execution step will be trivial anyway. Everything else will be significant and non-cacheable.
We're basically relying on whatever caching the build tools we're wrapping around have, hence the use of cache mounts everywhere.
@coarse torrent even if you could self-host it, that wouldn't make procurement easier?
Gradle build cache, Go incremental builds, etc.
@sharp flicker easier != easy.
It took me 2 months to get a PO out to a vendor that we've already got an MSA signed with.
We have our own MSA that we provide to vendors to sign whenever we're looking at even spinning up a PoC and that has been a significant hurdle in a few cases already.
Again, largely liability related, but also a lot of privacy concerns, etc. BYOC/BYOB makes those concerns easier to explain to the lawyers, but there's a lot that goes into it still.
We'll get to the procurement part 🙂 Let's focus on the current parameters, getting something to work without having to ask management to buy a new thing
@coarse torrent if the main concern is cache volumes... What if you persisted them as part of your dagger pipeline?
Got it.. makes total sense. In that case the options here are quite limited. Maybe Erik has some other ideas, but only exporting cache volumes doesn't seem trivial
You could add specific steps that get the right volumes mounted in the right place, and either upload or download their contents
Ooo, actually that's not bad.
Just have a step that we inject into all our pipelines that just stuffs everything into the cache manually.
Then you can worry about the layer caching separately, via cache export. There are other downsides, but a major one (lack of cache volume persistence) would be eliminated
As I said I mostly don't care about layer persistence.
Ideally you would have different upload/download steps for different domains, for example one set of functions would deal with gradle persistence, another with go persistence, etc
Nah, no need.
Have a final step that you mount all the cache volumes to, with each cache volume being mounted to a folder named the ID of the cache volume, e.g. /cache/mycachevol. Have a binary that runs and saves everything directly to the cache. Do the opposite for a restore
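The save/restore binary described above could look roughly like this. It is a sketch under the assumption that every cache volume is mounted as a directory under one root (such as `/cache/<volume-id>`) and that one archive per volume is handed to the CI cache; the function names and archive layout are made up for illustration.

```python
import tarfile
from pathlib import Path


def save_volumes(cache_root, archive_dir):
    """Pack each directory under cache_root into <archive_dir>/<id>.tar.gz."""
    cache_root, archive_dir = Path(cache_root), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for vol in sorted(p for p in cache_root.iterdir() if p.is_dir()):
        with tarfile.open(archive_dir / f"{vol.name}.tar.gz", "w:gz") as tar:
            tar.add(vol, arcname=vol.name)
        saved.append(vol.name)
    return saved


def restore_volumes(archive_dir, cache_root):
    """Unpack every <id>.tar.gz back into cache_root/<id>."""
    cache_root = Path(cache_root)
    cache_root.mkdir(parents=True, exist_ok=True)
    restored = []
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(cache_root)
        restored.append(archive.name[: -len(".tar.gz")])
    return restored
```

One archive per volume keeps the volumes independent, so a pipeline that only mounts some of them can restore just those.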
Eventually though, you'll need to mount the right content in the right path of the right container
Anyway what you're saying works 🙂 Slight variations of the same thing. This is my favorite option so far, because you're not messing with the guts of dagger
We'll probably just mount /var/lib/dagger for the moment until we can work out the logic for the cache dance.
It gives us the functionality we care about but massively simplifies pipelines, then we can worry about building the cache dance.
Right now the pipeline complexity is killing us in maintenance overhead.
@coarse torrent curious if you've seen https://github.com/reproducible-containers/buildkit-cache-dance?
It's about as close to official as it gets for a solution like this: it's maintained by Akihiro (one of the buildkit maintainers), and it does exactly what you described for native buildkit. It grabs the contents of a cache volume and exports it to disk, before putting it into GHA's cache.
I'm not exactly sure how well this translates to dagger, since you'd probably be building using the dagger sdk instead of dockerfiles, but the approach is pretty much the same.
@carmine rivet yeah, I've seen that. I'm thinking something conceptually similar, but more tightly coupled to Dagger.
I am interested in something similar: running multiple dagger-engine instances in different AZs pointing to one large NFS mount. However, someone wrote here that buildkit storage is not designed to be concurrently safe. In that case I'd rather solve this with consistent-hashing-based routing of dagger builds, so that the builds for X always land on the same machine pointing to its own NFS mount. Do you see any problem with that, or have any other (better) ideas?
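For reference, consistent-hashing-based routing can be sketched as a hash ring with virtual nodes. This is a generic illustration, not existing Dagger tooling; the engine names, the build key (for example a repository name), and the replica count are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Map build keys to engines so a key keeps landing on the same engine."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, build_key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(build_key)) % len(self._ring)
        return self._ring[idx][1]
```

When an engine is removed, only the keys that were routed to it move elsewhere; every other build keeps its engine and therefore its warm cache.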
There is no tooling for such a hashing-based routing. Not impossible though, depending on the nature of your workload it could work. You would have to do some trailblazing though.
My recommendation is to start with Dagger Cloud distributed cache, supported & works out of the box, then see if you run into issues with that before considering building anything too custom
The other reliable option is to just use one beefy machine, and scale vertically
I'm in the same boat as OP, Fortune 500 corp here. We like the product, but for now, in a semi-PoC, we'd like to increase our cache hit rate
we already have around a dozen (10ish) beefy machines, but the cache hit rate is not good enough. You have to run at least 10 runs to get your cache populated on all 10 machines
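To put a rough number on that: 10 runs is the best case. Assuming each run lands on one of the machines uniformly at random and independently (an assumption, the actual scheduler may behave differently), the expected number of runs before every machine has seen the build at least once follows the coupon collector formula n * (1 + 1/2 + ... + 1/n).

```python
def expected_runs_to_warm_all(machines):
    """Expected runs until every machine has been hit at least once,
    assuming uniform, independent random assignment of runs to machines."""
    return sum(machines / k for k in range(1, machines + 1))


# For 10 machines this comes out at roughly 29 runs, not 10.
print(round(expected_runs_to_warm_all(10), 1))
```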
Perhaps naive but is it really not an option to hook up a SaaS service for a semi-poc?
same procurement issues as the OP, plus we have a lot of SOX and PCI-compliant docker images, and storing them in SaaS cloud would be a nightmare for our security teams
I wish it was easier though
We’re happy to set you up on a trial account, talk to your buyers & management etc. Also I’m curious if bring-your-own-storage would make a difference?
With BYOS we would never touch the cache data. Only cache orchestration and telemetry (for visualization)
if you're offering such a hybrid model then that would be way more acceptable
what would be the pricing / cost model for that? if we bring our own storage
We also just started SOC2 type 2 evaluation period in case that helps. We’re in this to make the bosses & buyers happy 😁
Our whole pricing is metered by managed minutes, to scale with utilization. Typically pays for itself because of the compute savings from caching, and engineering savings from visualization, faster dev loop etc
Then, on top of that, larger buyers typically want to have a sales conversation for a yearly contract, PO, volume discounts etc. We are more than happy to accommodate all of that and make it as painless as possible for you.
We learned our lesson from Docker, and want to build out the business side early, in a way that integrates well with the community side. That will allow us to be masters of our own destiny, and actually deliver something reliable for the long term. In my experience, a great software product that is not financially viable is only temporarily great.