Distributed caching options | Dagger

oak briar Nov 2, 2023, 2:15 AM

#

🧵 starting a thread here @coarse torrent

#

May I ask why you consider Dagger Cloud to not be an option?

coarse torrent Nov 2, 2023, 2:23 AM

#

We are an F50 and there's no way we're going to be able to get something like Dagger Cloud through procurement.

#

My team is having significant issues with companies that are several times the size of Dagger that would have smaller overall liability exposure.

oak briar Nov 2, 2023, 2:25 AM

#

coarse torrent We are an F50 and there's no way we're going to be able to get something like Da...

Got it. I do want to make sure that's not true forever, but that doesn't help you today 🙂

#

May I ask what your runner infra looks like? Is this self-hosted on something like Kubernetes, or using a managed CI offering?

coarse torrent Nov 2, 2023, 2:26 AM

#

The hurdle for companies this size is largely liability when dealing with vendors 😂

#

We're currently using self-hosted GHA runners provided by another team. The machines themselves are GCP VMs with some tools installed, but otherwise largely "vanilla"

#

Of course there's the standard "corp proxy" type stuff as well.

oak briar Nov 2, 2023, 2:28 AM

#

coarse torrent The hurdle for companies this size is largely liability when dealing with vendor...

I'm going to focus on finding a path for a successful technical implementation, but at some point in the future I would love to pick your brain on that particular problem, if you'll let me?

coarse torrent Nov 2, 2023, 2:28 AM

#

For us, we're invoking Dagger via a CLI that we publish (we are a platform team) for our consumers. Our CLI spins up the engine container before invoking client.Connect() so that we can tweak a few things.

coarse torrent Nov 2, 2023, 2:28 AM

#

oak briar I'm going to focus on finding a path for a successful technical implementation, ...

Sure, happy to answer questions as I'm able.

oak briar Nov 2, 2023, 2:30 AM

#

On the technical side: are those GCP VMs provisioned dynamically for each GHA job, 1 ephemeral VM per job?

coarse torrent Nov 2, 2023, 2:30 AM

#

Yeah.

#

They're pre-provisioned and sit in a pool and then commit sudoku once the job is done.

#

Blargh, the env variables that are supposed to be set for the URL and the token are not. I smell shenanigans from Actions.

oak briar Nov 2, 2023, 2:32 AM

#

Focusing on the Dagger workload running inside those GHA runners: what's the general shape of the workload? Builds, tests, deployments, other?

#

I'm asking to get a rough sense of the scale tradeoffs

coarse torrent Nov 2, 2023, 2:33 AM

#

Build, test, scan, scan some more, publish.

oak briar Nov 2, 2023, 2:33 AM

#

Specifically scale of compute vs. scale of cache data

coarse torrent Nov 2, 2023, 2:34 AM

#

We're currently kludging caches via a ridiculous process of hoisting the cache contents into and out of the engine via mounted directories and the cache action. The cache speed up is significant.

#

On some projects on the order of 80%

oak briar Nov 2, 2023, 2:34 AM

#

What we've found is that, when you make everything cached by default, sometimes the scale bottlenecks are not where you'd expect them. Often that's a good thing, opens more options.

#

One possible path, which may or may not be a good fit for you, would be to have a beefy long-running VM dedicated to the Dagger engine, then have the dagger CLI point at that engine to run the pipelines. This will give you optimal caching with no moving parts (i.e. you dodge the complications of cache export, whether on GHA cache or buildkit).

Depending on the cache speed up, that may be all you need.

If it's too much load for a single beefy machine, even with cache speedup accounted for, you could look for orthogonal workloads: jobs that you are running in Dagger, but don't share any cache data. Then you can allocate each to its own beefy machine.
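
A minimal sketch of the "point the CLI at a long-running engine" idea. It assumes the engine VM has been configured to listen on TCP, and that your Dagger version honors the experimental `_EXPERIMENTAL_DAGGER_RUNNER_HOST` variable (check the docs for the release you run). The host name and port are hypothetical.

```python
import os

# Assumption: the Dagger CLI/SDKs read this experimental variable to locate an
# already-running engine instead of starting their own.
RUNNER_HOST_VAR = "_EXPERIMENTAL_DAGGER_RUNNER_HOST"


def remote_engine_env(host, port, base_env=None):
    """Return a copy of the environment that points Dagger at a remote engine."""
    env = dict(os.environ if base_env is None else base_env)
    env[RUNNER_HOST_VAR] = f"tcp://{host}:{port}"
    return env


# Hypothetical address of the dedicated engine VM.
env = remote_engine_env("dagger-engine.internal", 8080, base_env={"PATH": "/usr/bin"})
print(env[RUNNER_HOST_VAR])
```

The resulting dict can be passed as `env=` to `subprocess.run` when a wrapper CLI launches the pipeline, so every job on the runner pool talks to the same warm engine.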

#

Cache export (what you asked your original question about) is still a viable path. It just has a bunch of operational paper cuts, that really do add up. So it's worth looking for opportunities to not do it, if possible.

coarse torrent Nov 2, 2023, 2:41 AM

#

A persistent machine in the same network as the CI runners is going to be a significant implementation effort for us because of the usual corporate stuff.

#

Using the GHA cache or even a registry would ideally allow us to get many of the benefits at a fraction of the overhead.

#

Also am I correct in understanding that the GHA cache would not save cache volumes in Dagger?

oak briar Nov 2, 2023, 2:43 AM

#

OK, it looks like that really is the only viable option for you at the moment 🙂

oak briar Nov 2, 2023, 2:44 AM

#

coarse torrent Also am I correct in understanding that the GHA cache would not save cache volum...

Yes. That is one of the downsides. This is true for all buildkit cache export targets (including GHA cache, registry, S3 etc)

coarse torrent Nov 2, 2023, 2:45 AM

#

Interesting, the docs only mention the limitation for the GHA backend.

#

does that limitation apply to the local cache export as well?

oak briar Nov 2, 2023, 2:47 AM

#

I believe so

coarse torrent Nov 2, 2023, 2:47 AM

#

Bleh.

oak briar Nov 2, 2023, 2:48 AM

#

It's a fundamental design decision of buildkit that cache mounts (as they are called in buildkit) are meant to be regular local bind mounts, for performance optimization purposes, and nothing more

coarse torrent Nov 2, 2023, 2:48 AM

#

So sounds like my option here if I want cache volumes is to actually just mount /var/lib/dagger to the host and save that.

oak briar Nov 2, 2023, 2:49 AM

#

coarse torrent So sounds like my option here if I want cache volumes is to actually just mount ...

In the context of a fleet of ephemeral VMs, that would be very tricky to do in a way that is both safe and scalable. Assuming you mean via attached block device, or something like NFS, then you'll be limited to one writer at a time - and you will be responsible for enforcing that when orchestrating jobs
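
To make the "one writer at a time" constraint concrete, here is a rough sketch of an exclusive lock file that a job could take before touching shared engine state. This is an illustration only: the class and file names are hypothetical, it does not handle stale locks left by crashed jobs, and lock-file semantics on network filesystems should be verified before relying on them.

```python
import os
import tempfile


class StateLock:
    """Exclusive lock file guarding a shared engine state directory."""

    def __init__(self, path):
        self.path = path
        self.held = False

    def acquire(self):
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # caller can create it.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))  # record the owner for debugging
        self.held = True
        return True

    def release(self):
        if self.held:
            os.remove(self.path)
            self.held = False


lock_path = os.path.join(tempfile.mkdtemp(), "engine-state.lock")
first, second = StateLock(lock_path), StateLock(lock_path)
print(first.acquire())   # first writer takes the lock
print(second.acquire())  # second writer is refused while it is held
first.release()
print(second.acquire())  # lock is available again
```

The orchestration layer would still have to decide what a refused job does (wait, or run without the shared state), which is the part that gets tricky at fleet scale.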

#

If you do find a way to make it work at the host filesystem level, do let us know 🙂 Would save us a lot of trouble

coarse torrent Nov 2, 2023, 2:50 AM

#

It's not that tricky, just save the mounted directory to the Actions cache via the standard cache action.

#

It's slower, but still effective.

oak briar Nov 2, 2023, 2:51 AM

#

I see what you mean. Not guaranteed to work, things like engine ID might change from one instance to the next.

coarse torrent Nov 2, 2023, 2:51 AM

#

The result is effectively having an engine container that is paused between runs, with the tradeoff of having to unpack and repack the cache each time.

oak briar Nov 2, 2023, 2:51 AM

#

But if you want to explore, we'll do our best to help

coarse torrent Nov 2, 2023, 2:51 AM

#

The engine ID should be stored in that same directory, no? I thought I saw it there last time I poked around.

oak briar Nov 2, 2023, 2:52 AM

#

Also, we're reaching the limits of my expertise. To continue down this particular rabbit hole, we'll have to wait for a core dev to be around (probably tomorrow Pacific time)

#

flagging @drowsy horizon now for later 🙂

#

@coarse torrent my educated guess of possible complications, in the "persist the engine state" approach: that state would be all over the place. In the engine container; in its associated volumes; and possibly in the containerd snapshotter state?

#

Since the container is privileged, the state could be spread across host and container

#

It's also possible that I'm mixing up current and future architecture, as we're in the middle of swapping out parts of the guts of buildkit, for operational improvements orthogonal to this discussion

coarse torrent Nov 2, 2023, 2:57 AM

#

Buildkit's self-contained as an execution engine. Currently everything is still fairly monolithic, so if I persist all the Buildkit state it Should Work(tm)

sharp flicker Nov 2, 2023, 3:03 AM

#

@oak briar what about BYOB? Is that something that could work in this case if they self-host the caching infra?

oak briar Nov 2, 2023, 3:04 AM

#

sharp flicker @oak briar what about BYOB? Is that something that could work in this...

Possibly, but it looks like the main obstacle for Dagger Cloud is not even technical, it's simply getting through procurement. Any technical obstacles (possibly addressed, or at least mitigated, by BYOB) would be on top of that

sharp flicker Nov 2, 2023, 3:05 AM

#

oak briar Possibly, but it looks like the main obstacle for Dagger Cloud is not even techn...

Since @coarse torrent mentioned the liability exposure I thought it was more related to third party hosting than anything else

#

Having said that, I don't think storing /var/lib/dagger in the GHA cache has a promising future given that buildkit storage is not designed to be safe under concurrent access. If you have multiple builds accessing the same underlying volume concurrently, it's quite likely that you'll corrupt the state metadata

coarse torrent Nov 2, 2023, 3:10 AM

#

Right, but the runners are ephemeral so we have a strong guarantee that the engine is only used for a single job. Multiple requests for the same cache content are fulfilled as copies and saved to unique keys (or not saved at all).

#

I know that it's common to run the engine as a service, but that's not the model we're using currently.

#

The lifecycle is tightly bound to the CI job execution.

oak briar Nov 2, 2023, 3:13 AM

#

Right, in the scenario described by @coarse torrent (dump the whole state on a bucket when the job is done, download it again on the next job), concurrent access won't be an issue. Instead possible issues will be:

Performance hit of re-downloading and re-uploading a possibly massive state at each job
Concurrent uploads of state snapshot from multiple workers. You need to make sure each upload is atomic; ideally without keeping every copy forever. Not sure what the GHA cache primitives are exactly, and whether they help you deal with this

coarse torrent Nov 2, 2023, 3:14 AM

#

Yep, fortunately the latter is taken care of by GitHub.

sharp flicker Nov 2, 2023, 3:15 AM

#

Oh, I see.. you're basically creating N copies of /var/lib/dagger across several pipelines/jobs

coarse torrent Nov 2, 2023, 3:16 AM

#

The cache Action doesn't allow you to upload a cache to a key that already exists, so race conditions are largely avoided. In any case, we use a unique key per job run anyway, so this is all strongly guaranteed.
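
A small sketch of the "unique key per job run" scheme, assuming the standard `GITHUB_RUN_ID`, `GITHUB_RUN_ATTEMPT` and `GITHUB_JOB` variables that Actions sets on runners. The key layout and the `dagger-state` prefix are hypothetical, not the layout used in this thread.

```python
def cache_keys(prefix, env):
    """Build a unique save key and a restore prefix for one job run."""
    run_id = env["GITHUB_RUN_ID"]
    attempt = env.get("GITHUB_RUN_ATTEMPT", "1")
    job = env.get("GITHUB_JOB", "job")
    # Unique per run and attempt, so two jobs never try to save the same key.
    save_key = f"{prefix}-{job}-{run_id}-{attempt}"
    # Restoring by prefix lets a new run pick up a cache saved by an earlier run.
    restore_prefix = f"{prefix}-{job}-"
    return save_key, restore_prefix


key, restore = cache_keys(
    "dagger-state",
    {"GITHUB_RUN_ID": "6731", "GITHUB_RUN_ATTEMPT": "2", "GITHUB_JOB": "build"},
)
print(key)
print(restore)
```

The save key would go in the cache action's `key` input and the prefix in `restore-keys`, so each job writes a fresh entry and reads a recent matching one.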

#

The former thing is something we already experience, but to your point might get worse.

#

Actually, scratch that, will get worse since Buildkit's state will include layers in addition to the cache mounts.

#

So it'll have every layer of every pipeline run that gets committed to main.

sharp flicker Nov 2, 2023, 3:17 AM

#

coarse torrent Actually, scratch that, _will_ get worse since Buildkit's state will include lay...

Yes, that's what I was thinking about

#

The amount of data will be quite large

coarse torrent Nov 2, 2023, 3:18 AM

#

I'm really only interested in the cache mounts, to be quite honest.

#

Execution caching is very much a nice to have.

sharp flicker Nov 2, 2023, 3:19 AM

#

coarse torrent I'm really _only_ interested in the cache mounts, to be quite honest.

We feel you. That's one of the reasons we had to build Dagger Cloud distributed caching in the first place

coarse torrent Nov 2, 2023, 3:19 AM

#

For the parts where we're going to get a cache hit for execution caching, the execution step will be trivial anyway. Everything else will be significant and non-cacheable.

#

We're basically relying on whatever caching the build tools we're wrapping around have, hence the use of cache mounts everywhere.

sharp flicker Nov 2, 2023, 3:21 AM

#

@coarse torrent even if you could self-host it, that wouldn't make procurement easier?

coarse torrent Nov 2, 2023, 3:21 AM

#

Gradle build cache, Go incremental builds, etc.

#

@sharp flicker easier != easy.

#

It took me 2 months to get a PO out to a vendor that we've already got an MSA signed with.

#

We have our own MSA that we provide to vendors to sign whenever we're looking at even spinning up a PoC and that has been a significant hurdle in a few cases already.

#

Again, largely liability related, but also a lot of privacy concerns, etc. BYOC/BYOB makes those concerns easier to explain to the lawyers, but there's a lot that goes into it still.

oak briar Nov 2, 2023, 3:25 AM

#

We'll get to the procurement part 🙂 Let's focus on the current parameters, getting something to work without having to ask management to buy a new thing

#

@coarse torrent if the main concern is cache volumes... What if you persisted them as part of your dagger pipeline?

sharp flicker Nov 2, 2023, 3:26 AM

#

Got it.. makes total sense. In that case the options here are quite limited. Maybe Erik has some other ideas, but only exporting cache volumes doesn't seem trivial

oak briar Nov 2, 2023, 3:26 AM

#

You could add specific steps that get the right volumes mounted in the right place, and either upload or download their contents

coarse torrent Nov 2, 2023, 3:27 AM

#

Ooo, actually that's not bad.

#

Just have a step that we inject into all our pipelines that just stuffs everything into the cache manually.

oak briar Nov 2, 2023, 3:27 AM

#

Then you can worry about the layer caching separately, via cache export. There are other downsides, but a major one (lack of cache volume persistence) would be eliminated

coarse torrent Nov 2, 2023, 3:27 AM

#

As I said I mostly don't care about layer persistence.

oak briar Nov 2, 2023, 3:28 AM

#

coarse torrent Just have a step that we inject into all our pipelines that just stuffs everythi...

Ideally you would have different upload/download steps for different domains, for example one set of functions would deal with gradle persistence, another with go persistence, etc

coarse torrent Nov 2, 2023, 3:29 AM

#

Nah, no need.

#

Have a final step that you mount all the cache volumes to, with each cache volume being mounted to a folder named the ID of the cache volume, e.g. /cache/mycachevol. Have a binary that runs and saves everything directly to the cache. Do the opposite for a restore
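
The save and restore halves described above could look roughly like this. It is a sketch under assumptions: every cache volume is already mounted at `<cache root>/<volume id>` inside the final step's container, and the output directory is what later gets handed to the CI cache. The function names are hypothetical.

```python
import os
import tarfile
import tempfile


def save_volumes(cache_root, out_dir):
    """Archive each directory under cache_root as <out_dir>/<volume id>.tar."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for vol_id in sorted(os.listdir(cache_root)):
        src = os.path.join(cache_root, vol_id)
        if not os.path.isdir(src):
            continue
        with tarfile.open(os.path.join(out_dir, vol_id + ".tar"), "w") as tar:
            # arcname="." stores paths relative to the volume root.
            tar.add(src, arcname=".")
        saved.append(vol_id)
    return saved


def restore_volumes(in_dir, cache_root):
    """Unpack every <volume id>.tar in in_dir into <cache_root>/<volume id>."""
    restored = []
    for name in sorted(os.listdir(in_dir)):
        if not name.endswith(".tar"):
            continue
        vol_id = name[: -len(".tar")]
        dest = os.path.join(cache_root, vol_id)
        os.makedirs(dest, exist_ok=True)
        with tarfile.open(os.path.join(in_dir, name)) as tar:
            tar.extractall(dest)
        restored.append(vol_id)
    return restored


# Demo with a throwaway directory standing in for /cache.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "cache", "mycachevol"))
with open(os.path.join(root, "cache", "mycachevol", "a.txt"), "w") as f:
    f.write("cached")
print(save_volumes(os.path.join(root, "cache"), os.path.join(root, "out")))
print(restore_volumes(os.path.join(root, "out"), os.path.join(root, "restored")))
```

Naming each archive after the volume ID keeps the restore step generic: it does not need to know which tool (Gradle, Go, etc.) owns which volume.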

oak briar Nov 2, 2023, 3:31 AM

#

Eventually though, you'll need to mount the right content in the right path of the right container

#

Anyway what you're saying works 🙂 Slight variations of the same thing. This is my favorite option so far, because you're not messing with the guts of dagger

coarse torrent Nov 2, 2023, 3:33 AM

#

We'll probably just mount /var/lib/dagger for the moment until we can work out the logic for the cache dance.

#

It gives us the functionality we care about but massively simplifies pipelines, then we can worry about building the cache dance.

#

Right now the pipeline complexity is killing us in maintenance overhead.

carmine rivet Nov 2, 2023, 11:02 AM

#

@coarse torrent curious if you've seen https://github.com/reproducible-containers/buildkit-cache-dance?
It's about as close to official as it gets for a solution like this. It's maintained by Akihiro (one of the buildkit maintainers), and it does exactly what you described for native buildkit: it grabs the contents of a cache volume and exports it to disk, before putting it into GHA's cache.
I'm not exactly sure how well this translates to dagger, since you'd probably be building using the dagger sdk instead of dockerfiles, but the approach is pretty much the same.

coarse torrent Nov 2, 2023, 12:36 PM

#

@carmine rivet yeah, I've seen that. I'm thinking something conceptually similar, but more tightly coupled to Dagger.

fiery plover Nov 9, 2023, 10:20 PM

#

I am interested in something similar, running multiple dagger-engine instances in different AZs pointing to one large NFS mount. However, someone wrote here that buildkit storage is not designed to be concurrently safe. In that case I'd rather solve this with consistent-hashing-based routing of dagger builds, so that the builds for X always land on the same machine pointing to its own NFS mount. Do you see any problem with that, or have any other (better) ideas?
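
For reference, consistent-hashing-based routing can be sketched in a few lines. This is a generic hash ring, not existing Dagger tooling; the engine names and the build key format are hypothetical.

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring mapping build keys to engine hosts."""

    def __init__(self, nodes, replicas=100):
        # Each node gets several virtual points so load spreads more evenly.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]


ring = HashRing(["engine-a", "engine-b", "engine-c"])
print(ring.node_for("team-x/service-y"))
```

The useful property is that removing or adding one engine only remaps the builds that hashed to that engine's points; everything else keeps landing on the machine that already holds its cache.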

oak briar Nov 9, 2023, 10:23 PM

#

fiery plover I am interested in something similar, running multiple dagger-engine instances i...

There is no tooling for such a hashing-based routing. Not impossible though, depending on the nature of your workload it could work. You would have to do some trailblazing though.

My recommendation is to start with Dagger Cloud distributed cache, supported & works out of the box, then see if you run into issues with that before considering building anything too custom

#

The other reliable option is to just use one beefy machine, and scale vertically

fiery plover Nov 9, 2023, 10:24 PM

#

i'm in the same boat as OP, Fortune 500 corp here. We like the product, but for now in semi-PoC we'd like to increase our cache hit rate

#

we already have about a dozen (10ish) beefy machines but the cache hit rate is not good enough. You have to do at least 10 runs to get your cache populated on all 10 machines

oak briar Nov 9, 2023, 10:25 PM

#

Perhaps naive but is it really not an option to hook up a SaaS service for a semi-poc?

fiery plover Nov 9, 2023, 10:26 PM

#

same procurement issues as the OP, plus we have a lot of SOX and PCI-compliant docker images, and storing them in SaaS cloud would be a nightmare for our security teams

#

I wish it was easier though

oak briar Nov 9, 2023, 10:27 PM

#

We’re happy to set you up on a trial account, talk to your buyers & management etc. Also I’m curious if bring-your-own-storage would make a difference?

#

With BYOS we would never touch the cache data. Only cache orchestration and telemetry (for visualization)

fiery plover Nov 9, 2023, 10:28 PM

#

if you're offering such a hybrid model then that would be way more acceptable

#

what would be the pricing / cost model for that? if we bring our own storage

oak briar Nov 9, 2023, 10:28 PM

#

We also just started SOC2 type 2 evaluation period in case that helps. We’re in this to make the bosses & buyers happy 😁

#

Our whole pricing is metered by managed minutes, to scale with utilization. Typically pays for itself because of the compute savings from caching, and engineering savings from visualization, faster dev loop etc

oak briar Nov 9, 2023, 10:37 PM

#

oak briar Our whole pricing is metered by managed minutes, to scale with utilization. Typi...

Then on top of that, larger buyers typically want to have a sales conversation for a yearly contract, PO, volume discounts etc. We are more than happy to accommodate all of that and make it as painless as possible for you.

#

We learned our lesson from Docker, and want to build out the business side early, in a way that integrates well with the community side. That will allow us to be masters of our own destiny, and actually deliver something reliable for the long term. In my experience, a great software product that is not financially viable is only temporarily great.
