#Distributed caching options

oak briar
#

🧵 starting a thread here @coarse torrent

#

May I ask why you consider Dagger Cloud to not be an option?

coarse torrent
#

We are an F50 and there's no way we're going to be able to get something like Dagger Cloud through procurement.

#

My team is having significant issues with companies that are several times the size of Dagger that would have smaller overall liability exposure.

oak briar
#

May I ask what your runner infra looks like? Is this self-hosted on something like Kubernetes, or using a managed CI offering?

coarse torrent
#

The hurdle for companies this size is largely liability when dealing with vendors 😂

#

We're currently using self-hosted GHA runners provided by another team. The machines themselves are GCP VMs with some tools installed, but otherwise largely "vanilla"

#

Of course there's the standard "corp proxy" type stuff as well.

coarse torrent
#

For us, we're invoking Dagger via a CLI that we publish (we are a platform team) for our consumers. Our CLI spins up the engine container before invoking client.Connect() so that we can tweak a few things.

oak briar
#

On the technical side: are those GCP VMs provisioned dynamically for each GHA job, 1 ephemeral VM per job?

coarse torrent
#

Yeah.

#

They're pre-provisioned and sit in a pool and then commit sudoku once the job is done.

#

Blargh, the env variables that are supposed to be set for the URL and the token are not. I smell shenanigans from Actions.

oak briar
#

Focusing on the Dagger workload running inside those GHA runners: what's the general shape of the workload? Builds, tests, deployments, other?

#

I'm asking to get a rough sense of the scale tradeoffs

coarse torrent
#

Build, test, scan, scan some more, publish.

oak briar
#

Specifically scale of compute vs. scale of cache data

coarse torrent
#

We're currently kludging caches via a ridiculous process of hoisting the cache contents into and out of the engine via mounted directories and the cache action. The cache speed up is significant.

#

On some projects on the order of 80%

oak briar
#

What we've found is that, when you make everything cached by default, sometimes the scale bottlenecks are not where you'd expect them. Often that's a good thing, opens more options.

#

One possible path, which may or may not be a good fit for you, would be to have a beefy long-running VM dedicated to the Dagger engine, then have the dagger CLI point at that engine to run the pipelines. This will give you optimal caching with no moving parts (i.e. you dodge the complications of cache export, whether on GHA cache or buildkit).

Depending on the cache speed up, that may be all you need.

If it's too much load for a single beefy machine, even with cache speedup accounted for, you could look for orthogonal workloads: jobs that you are running in Dagger, but don't share any cache data. Then you can allocate each to its own beefy machine.
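
A rough sketch of what "point the CLI at that engine" could look like from the runner side, in Python. It assumes the engine is reachable over TCP and that the CLI and SDKs honor the `_EXPERIMENTAL_DAGGER_RUNNER_HOST` environment variable (experimental, so check it against the docs for your Dagger version). The host name in the usage note and both helper names are made up for illustration.

```python
import os
import subprocess
from typing import Mapping, Optional, Sequence

# Experimental variable read by the Dagger CLI/SDKs to find an already-running
# engine instead of starting a local one. Verify it for your Dagger version.
RUNNER_HOST_VAR = "_EXPERIMENTAL_DAGGER_RUNNER_HOST"


def engine_env(runner_host: str, base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment that points at a shared engine."""
    if "://" not in runner_host:
        raise ValueError(f"expected a URL such as tcp://host:port, got {runner_host!r}")
    env = dict(os.environ if base is None else base)
    env[RUNNER_HOST_VAR] = runner_host
    return env


def run_pipeline(argv: Sequence[str], runner_host: str) -> int:
    """Run a pipeline command against the shared engine and return its exit code."""
    return subprocess.run(list(argv), env=engine_env(runner_host)).returncode
```

Something like `run_pipeline(["my-platform-cli", "build"], "tcp://dagger-engine.internal:8080")` would then reuse the long-running engine's cache. Splitting orthogonal workloads across machines is just a matter of passing a different host per workload.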

#

Cache export (what you asked your original question about) is still a viable path. It just has a bunch of operational paper cuts that really do add up. So it's worth looking for opportunities to not do it, if possible.

coarse torrent
#

A persistent machine in the same network as the CI runners is going to be a significant implementation effort for us because of the usual corporate stuff.

#

Using the GHA cache or even a registry would ideally allow us to get many of the benefits at a fraction of the overhead.

#

Also am I correct in understanding that the GHA cache would not save cache volumes in Dagger?

oak briar
#

OK, it looks like that really is the only viable option for you at the moment 🙂

coarse torrent
#

Interesting, the docs only mention the limitation for the GHA backend.

#

does that limitation apply to the local cache export as well?

oak briar
#

I believe so

coarse torrent
#

Bleh.

oak briar
#

It's a fundamental design decision of buildkit that cache mounts (as they are called in buildkit) are meant to be regular local bind mounts, for performance optimization purposes, and nothing more

coarse torrent
#

So sounds like my option here if I want cache volumes is to actually just mount /var/lib/dagger to the host and save that.

oak briar
#

If you do find a way to make it work at the host filesystem level, do let us know 🙂 Would save us a lot of trouble

coarse torrent
#

It's not that tricky, just save the mounted directory to the Actions cache via the standard cache action.

#

It's slower, but still effective.

oak briar
#

I see what you mean. Not guaranteed to work, things like engine ID might change from one instance to the next.

coarse torrent
#

The result is effectively having an engine container that is paused between runs, with the tradeoff of having to unpack and repack the cache each time.

oak briar
#

But if you want to explore, we'll do our best to help

coarse torrent
#

The engine ID should be stored in that same directory, no? I thought I saw it there last time I poked around.

oak briar
#

Also, we're reaching the limits of my expertise. To continue down this particular rabbit hole, we'll have to wait for a core dev to be around (probably tomorrow, Pacific time)

#

flagging @drowsy horizon now for later 🙂

#

@coarse torrent my educated guess of possible complications, in the "persist the engine state" approach: that state would be all over the place. In the engine container; in its associated volumes; and possibly in the containerd snapshotter state?

#

Since the container is privileged, the state could be spread across host and container

#

It's also possible that I'm mixing up current and future architecture, as we're in the middle of swapping out parts of the guts of buildkit, for operational improvements orthogonal to this discussion

coarse torrent
#

Buildkit's self-contained as an execution engine. Currently everything is still fairly monolithic, so if I persist all the Buildkit state it Should Work(tm)

sharp flicker
#

@oak briar what about BYOB? Is that something that could work in this case if they self-host the caching infra?

sharp flicker
#

Having said that, I don't think storing /var/lib/dagger in the GHA cache has a promising future given that buildkit storage is not designed to be concurrently safe. If you have multiple builds accessing the same underlying volume concurrently, it's quite likely that you'll corrupt the state metadata

coarse torrent
#

Right, but the runners are ephemeral so we have a strong guarantee that the engine is only used for a single job. Multiple requests for the same cache content are fulfilled as copies and saved to unique keys (or not saved at all).

#

I know that it's common to run the engine as a service, but that's not the model we're using currently.

#

The lifecycle is tightly bound to the CI job execution.

oak briar
#

Right, in the scenario described by @coarse torrent (dump the whole state on a bucket when the job is done, download it again on the next job), concurrent access won't be an issue. Instead possible issues will be:

  1. Performance hit of re-downloading and re-uploading a possibly massive state at each job

  2. Concurrent uploads of state snapshot from multiple workers. You need to make sure each upload is atomic; ideally without keeping every copy forever. Not sure what the GHA cache primitives are exactly, and whether they help you deal with this

coarse torrent
#

Yep, fortunately the latter is taken care of by GitHub.

sharp flicker
#

Oh, I see.. you're basically creating N copies of /var/lib/dagger across several pipelines/jobs

coarse torrent
#

The cache Action doesn't allow you to upload a cache to a key that already exists, so race conditions are largely avoided. In any case, we use a unique key per job run anyway, so this is all strongly guaranteed.
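
For illustration, a small Python sketch of that keying scheme. It assumes the cache action's usual behavior of exact-key lookup first, then each restore-key prefix in order with the most recently created match winning. The `dagger-state-` naming and both function names are hypothetical; in a workflow the run id and attempt would come from something like `github.run_id` and `github.run_attempt`.

```python
from typing import List, Optional, Tuple


def cache_keys(repo: str, branch: str, run_id: str, attempt: int = 1) -> Tuple[str, List[str]]:
    """Return (save_key, restore_prefixes) for one job run.

    The save key is unique per run, so two jobs never write the same entry.
    Restore prefixes go from most to least specific.
    """
    prefix = f"dagger-state-{repo}-{branch}-"
    save_key = f"{prefix}{run_id}-{attempt}"
    return save_key, [prefix, f"dagger-state-{repo}-"]


def pick_restore(
    existing: List[Tuple[str, int]], save_key: str, prefixes: List[str]
) -> Optional[str]:
    """Mimic the lookup: exact key first, then the newest entry per prefix.

    `existing` holds (key, created_at) pairs for entries already in the cache.
    """
    if any(key == save_key for key, _ in existing):
        return save_key
    for prefix in prefixes:
        matches = [(created, key) for key, created in existing if key.startswith(prefix)]
        if matches:
            return max(matches)[1]  # newest entry for this prefix
    return None
```

Each job restores the newest snapshot for its branch (or any branch as a fallback) and saves under its own key, which is why concurrent jobs end up with copies rather than a shared, contended state.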

#

The former thing is something we already experience, but to your point might get worse.

#

Actually, scratch that, will get worse since Buildkit's state will include layers in addition to the cache mounts.

#

So it'll have every layer of every pipeline run that gets committed to main.

sharp flicker
#

The amount of data will be quite large

coarse torrent
#

I'm really only interested in the cache mounts, to be quite honest.

#

Execution caching is very much a nice to have.

coarse torrent
#

For the parts where we're going to get a cache hit for execution caching, the execution step will be trivial anyway. Everything else will be significant and non-cacheable.

#

We're basically relying on whatever caching the build tools we're wrapping around have, hence the use of cache mounts everywhere.

sharp flicker
#

@coarse torrent even if you could self-host it, that wouldn't make procurement easier?

coarse torrent
#

Gradle build cache, Go incremental builds, etc.

#

@sharp flicker easier != easy.

#

It took me 2 months to get a PO out to a vendor that we've already got an MSA signed with.

#

We have our own MSA that we provide to vendors to sign whenever we're looking at even spinning up a PoC and that has been a significant hurdle in a few cases already.

#

Again, largely liability related, but also a lot of privacy concerns, etc. BYOC/BYOB makes those concerns easier to explain to the lawyers, but there's a lot that goes into it still.

oak briar
#

We'll get to the procurement part 🙂 Let's focus on the current parameters, getting something to work without having to ask management to buy a new thing

#

@coarse torrent if the main concern is cache volumes... What if you persisted them as part of your dagger pipeline?

sharp flicker
#

Got it.. makes total sense. In that case the options here are quite limited. Maybe Erik has some other ideas, but only exporting cache volumes doesn't seem trivial

oak briar
#

You could add specific steps that get the right volumes mounted in the right place, and either upload or download their contents

coarse torrent
#

Ooo, actually that's not bad.

#

Just have a step that we inject into all our pipelines that just stuffs everything into the cache manually.

oak briar
#

Then you can worry about the layer caching separately, via cache export. There are other downsides, but a major one (lack of cache volume persistence) would be eliminated

coarse torrent
#

As I said I mostly don't care about layer persistence.

coarse torrent
#

Nah, no need.

#

Have a final step that you mount all the cache volumes to, with each cache volume being mounted to a folder named after the ID of the cache volume, e.g. /cache/mycachevol. Have a binary that runs and saves everything directly to the cache. Do the opposite for a restore.
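
A minimal Python sketch of what that save/restore binary might do, assuming every cache volume is already mounted at `<cache_root>/<volume-id>` inside the final step's container. It only packs and unpacks archives; handing the archives to the actual cache backend is left out, and all function names are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def save_volumes(cache_root: Path, out_dir: Path) -> List[Path]:
    """Write one <volume-id>.tar.gz per directory found under cache_root."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for vol in sorted(p for p in cache_root.iterdir() if p.is_dir()):
        archive = out_dir / f"{vol.name}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for child in sorted(vol.iterdir()):
                tar.add(child, arcname=child.name)  # paths relative to the volume
        archives.append(archive)
    return archives


def restore_volumes(in_dir: Path, cache_root: Path) -> List[str]:
    """Unpack every archive into cache_root/<volume-id>; return the ids restored."""
    # Use the safer "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    restored = []
    for archive in sorted(in_dir.glob("*.tar.gz")):
        vol_id = archive.name[: -len(".tar.gz")]
        target = cache_root / vol_id
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target, **kwargs)
        restored.append(vol_id)
    return restored


def demo() -> str:
    """Round-trip a fake Gradle cache volume through save and restore."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "cache" / "gradle" / "caches").mkdir(parents=True)
        (root / "cache" / "gradle" / "caches" / "a.bin").write_text("hello")
        save_volumes(root / "cache", root / "saved")
        restore_volumes(root / "saved", root / "restored")
        return (root / "restored" / "gradle" / "caches" / "a.bin").read_text()
```

One archive per volume keeps the restore selective: a pipeline that only mounts the Gradle volume does not have to download the Go one.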

oak briar
#

Eventually though, you'll need to mount the right content in the right path of the right container

#

Anyway what you're saying works 🙂 Slight variations of the same thing. This is my favorite option so far, because you're not messing with the guts of dagger

coarse torrent
#

We'll probably just mount /var/lib/dagger for the moment until we can work out the logic for the cache dance.

#

It gives us the functionality we care about but massively simplifies pipelines, then we can worry about building the cache dance.

#

Right now the pipeline complexity is killing us in maintenance overhead.

carmine rivet
#

@coarse torrent curious if you've seen https://github.com/reproducible-containers/buildkit-cache-dance?
It's about as close to official as it gets for a solution like this: it's maintained by Akihiro (one of the buildkit maintainers), and it does exactly what you described for native buildkit. It grabs the contents of a cache volume and exports it to disk, before putting it into gha's cache.
I'm not exactly sure how well this translates to dagger, since you'd probably be building using the dagger sdk instead of dockerfiles, but the approach is pretty much the same.

coarse torrent
#

@carmine rivet yeah, I've seen that. I'm thinking something conceptually similar, but more tightly coupled to Dagger.

fiery plover
#

I am interested in something similar, running multiple dagger-engine instances in different AZs pointing to one large NFS mount. However, someone wrote here that buildkit storage is not designed to be concurrently safe. In that case I'd rather solve this with consistent-hashing-based routing of dagger builds, so that the builds for X always land on the same machine pointing to its own NFS mount. Do you see any problem with that, or have any other (better) ideas?
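
For reference, a small Python sketch of that routing idea as a hash ring with virtual nodes. It assumes each engine has its own storage, so no two engines ever share state, and that builds can be keyed by something stable such as a repository name. The class name, engine names and `vnodes` default are illustrative only.

```python
import bisect
import hashlib
from typing import Iterable


def _point(value: str) -> int:
    """Map a string to a stable position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class EngineRing:
    """Consistent-hash ring that maps a build key to one engine."""

    def __init__(self, engines: Iterable[str], vnodes: int = 64) -> None:
        # Each engine gets `vnodes` points so load spreads more evenly.
        self._ring = sorted(
            (_point(f"{engine}#{i}"), engine)
            for engine in set(engines)
            for i in range(vnodes)
        )
        if not self._ring:
            raise ValueError("at least one engine and one vnode are required")
        self._points = [point for point, _ in self._ring]

    def engine_for(self, build_key: str) -> str:
        """Return the engine owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, _point(build_key)) % len(self._ring)
        return self._ring[idx][1]
```

With `ring = EngineRing(["engine-0", "engine-1"])`, every call to `ring.engine_for("org/repo-x")` returns the same engine, and removing an engine only remaps the builds that were routed to it, so the other engines keep their warm caches.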

oak briar
#

The other reliable option is to just use one beefy machine, and scale vertically

fiery plover
#

I'm in the same boat as OP, Fortune 500 corp here. We like the product, but for now, in a semi-PoC, we'd like to increase our cache hit rate

#

we already have about a dozen (10ish) beefy machines but the cache hit rate is not good enough. You have to run at least 10 runs to get your cache populated on all 10 machines

oak briar
#

Perhaps naive but is it really not an option to hook up a SaaS service for a semi-poc?

fiery plover
#

same procurement issues as the OP, plus we have a lot of SOX and PCI-compliant docker images, and storing them in SaaS cloud would be a nightmare for our security teams

#

I wish it was easier though

oak briar
#

We’re happy to set you up on a trial account, talk to your buyers & management etc. Also I’m curious if bring-your-own-storage would make a difference?

#

With BYOS we would never touch the cache data. Only cache orchestration and telemetry (for visualization)

fiery plover
#

if you're offering such a hybrid model then that would be way more acceptable

#

what would be the pricing / cost model for that? if we bring our own storage

oak briar
#

We also just started SOC2 type 2 evaluation period in case that helps. We’re in this to make the bosses & buyers happy 😁

#

Our whole pricing is metered by managed minutes, to scale with utilization. Typically pays for itself because of the compute savings from caching, and engineering savings from visualization, faster dev loop etc

oak briar
#

We learned our lesson from Docker, and want to build out the business side early, in a way that integrates well with the community side. That will allow us to be masters of our own destiny, and actually deliver something reliable for the long term. In my experience, a great software product that is not financially viable is only temporarily great.