centralized vs p2p caching | Dagger

potent raven Aug 25, 2023, 6:30 AM

#

If you don't mind me asking, what does this mean for Dagger Cloud? Was managed cache a major part? I do want Dagger (as a company and project) to succeed so I'll understand whatever decision you folks make 🙂

Our approach is pretty simple:

There are 2 parts to the Dagger platform: the engine (open-source, decentralized) and the cloud (proprietary, centralized).
When we decide where a new feature should go, it's based on where it will work best. Some features work better in the decentralized engine. Others, in the centralized cloud. And others are best left to another product entirely. This is a product design decision first, business second.
This helps ensure that our ecosystem and business amplify each other, instead of being at odds, because our business doesn't need the Dagger Engine to be crippled or relicensed to succeed.
Distributed caching is one feature that we believe works better in the cloud. Therefore we offer a distributed cache service as part of Dagger Cloud.
If we are wrong and distributed caching is not made better by a centralized service, then we will change the design. But initial results are confirming our hypothesis that a centralized service indeed works better - and actually gets even better over time.

brazen elm Aug 25, 2023, 12:08 PM

#

That's good to hear. Thanks for clarifying.

#

Do you mean a centralised Buildkit cache here rather than ephemeral caches? Or do you mean a centralised cache is better than a P2P shared cache via a containerd backend? Also, how would Dagger Cloud work for people who have their own runners?

uncut moon Aug 25, 2023, 2:18 PM

#

What are you caching exactly? Docker volumes? Docker builds?

potent raven Aug 25, 2023, 2:48 PM

#

p2p containerd backend has been tried at Netflix (ipfs + s3 fallback) and the feedback I heard is that it’s very slow

#

So working at the buildkit layer seems to work better

#

But buildkit’s implementation via cache-from and cache-to has limitations. Some of those limitations are “just” implementation (cache mounts not persisted; export is fully sequential; no manifest merge depending on the storage target) but others are more fundamental because there is no central component to orchestrate the movement of data between remote cold cache and local hot cache. Each node has no choice but to download everything and re-upload everything each time.

Basically shared buildkit cache by existing means does not scale.
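
A rough way to picture that scaling problem is to count bytes moved per run. The sketch below is a toy model only, not BuildKit or Dagger code: cache objects are just names with sizes, and the function names, object names and sizes are all made up for illustration. It compares a node that imports and re-exports the whole cache with one that is told exactly which objects to move.

```python
# Toy model (hypothetical, not BuildKit or Dagger code): cache objects are
# names mapped to sizes in MB.

def blind_sync_mb(remote: dict[str, int], produced: dict[str, int]) -> int:
    """No coordinator: download the whole remote cache, then re-upload everything."""
    downloaded = sum(remote.values())
    merged = {**remote, **produced}
    uploaded = sum(merged.values())
    return downloaded + uploaded


def coordinated_sync_mb(
    remote: dict[str, int],
    local: dict[str, int],
    needed: set[str],
    produced: dict[str, int],
) -> int:
    """With a coordinator: fetch only needed objects missing locally, upload only new ones."""
    downloaded = sum(
        size for name, size in remote.items() if name in needed and name not in local
    )
    uploaded = sum(size for name, size in produced.items() if name not in remote)
    return downloaded + uploaded


remote = {"base-image": 500, "deps": 300, "test-data": 200}  # cold cache
local = {"base-image": 500}                                   # hot cache on this node
needed = {"base-image", "deps"}                               # what this run reads
produced = {"app-build": 50}                                  # what this run creates

blind = blind_sync_mb(remote, produced)
coordinated = coordinated_sync_mb(remote, local, needed, produced)
print(blind, coordinated)
```

With these made-up numbers the blind sync moves 2050 MB and the coordinated one 350 MB; the gap grows with the size of the shared cache.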

#

Dagger Cloud provides a control plane that tells each engine what to download or upload and when. That can be p2p transfers between nodes, or to/from an object storage service. The important thing is that there is a centralized “brain” that keeps track of which cache object needs to go where. We also provide object storage out of the box (multi-cloud and multi-region) for convenience. But for larger teams there will be a bring-your-own-bucket feature.
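
To make the "brain" idea concrete, here is a minimal sketch of a component that tracks where each cache object lives and answers "where should this node get it from?". This is an assumption-laden illustration, not Dagger Cloud's actual API or data model: the `ControlPlane` class, its methods and the `peer:`/`bucket`/`rebuild` labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Hypothetical central 'brain': knows which node holds which cache object."""

    locations: dict[str, set[str]] = field(default_factory=dict)  # object -> nodes
    bucket: set[str] = field(default_factory=set)  # objects in object storage

    def report(self, node: str, objects: set[str]) -> None:
        """A node tells the control plane which objects it has in its hot cache."""
        for obj in objects:
            self.locations.setdefault(obj, set()).add(node)

    def plan_download(self, node: str, wanted: set[str]) -> dict[str, str]:
        """Decide, per object, whether to pull from a peer, the bucket, or rebuild."""
        plan: dict[str, str] = {}
        for obj in sorted(wanted):
            holders = self.locations.get(obj, set())
            if node in holders:
                continue  # already local, nothing to transfer
            peers = sorted(holders - {node})
            if peers:
                plan[obj] = f"peer:{peers[0]}"
            elif obj in self.bucket:
                plan[obj] = "bucket"
            else:
                plan[obj] = "rebuild"
        return plan


cp = ControlPlane(bucket={"base-image", "deps"})
cp.report("node-a", {"base-image"})
cp.report("node-b", {"deps"})
plan = cp.plan_download("node-a", {"base-image", "deps", "app-build"})
print(plan)
```

Here `node-a` skips `base-image` (already local), is pointed at `node-b` for `deps`, and is told to build `app-build` because nobody has it yet.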

#

The other nice part is the telemetry service: telemetry allows caching to get smarter over time. Eventually you can even move cache data preemptively based on past patterns, and provide job scheduling cues for horizontal scaling: instead of distributing jobs round robin, why not ask the control plane which node should run it, based on past run history, current cpu load and current cache state 🙂
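
As an illustration of that kind of scheduling cue, the sketch below picks a node by combining cache state and CPU load instead of going round robin. The scoring formula, the `load_weight` parameter and the node data are assumptions made up for the example (past run history is left out for brevity); nothing here describes how Dagger Cloud actually scores nodes.

```python
def pick_node(job_inputs: set[str], nodes: dict[str, dict], load_weight: float = 0.5) -> str:
    """Pick the node with the best trade-off between warm cache and low CPU load.

    Hypothetical scoring: fraction of job inputs already cached on the node,
    minus a penalty proportional to its current CPU load (0.0 to 1.0).
    """

    def score(name: str) -> float:
        info = nodes[name]
        hit_ratio = len(job_inputs & info["cache"]) / len(job_inputs) if job_inputs else 0.0
        return hit_ratio - load_weight * info["cpu_load"]

    # sorted() makes ties resolve deterministically by node name
    return max(sorted(nodes), key=score)


nodes = {
    "node-a": {"cache": {"base-image", "deps"}, "cpu_load": 0.9},
    "node-b": {"cache": {"base-image"}, "cpu_load": 0.1},
    "node-c": {"cache": set(), "cpu_load": 0.0},
}
job_inputs = {"base-image", "deps"}

warm_first = pick_node(job_inputs, nodes)                   # cache state dominates
idle_first = pick_node(job_inputs, nodes, load_weight=1.0)  # load counts for more
print(warm_first, idle_first)
```

With the default weight the busy but fully warm `node-a` wins; raising `load_weight` to 1.0 shifts the job to the mostly idle, partially warm `node-b`.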

#

Lastly @brazen elm : Dagger Cloud is entirely “bring your own runner”, we don’t offer managed runners. So it’s designed to accelerate your runners wherever they are, not replace them.

#

I hope this helps guys

uncut moon Aug 25, 2023, 3:05 PM

#

Buildkit cache is only for Docker build layer caching right? Or are you using it in a different way?

potent raven Aug 25, 2023, 3:07 PM

#

We use buildkit under the hood as a general purpose pipeline execution engine. So Dagger pipelines always get caching out of the box, whether they’re build, test, deployment, data transformation, ML inference etc.

#

So this is really about distributing the Dagger cache. Since it’s buildkit tech under the hood, we can use all prior art for buildkit caching, which harsh is familiar with. As I explained above, that prior art has limitations so we’re improving it 🙂

uncut moon Aug 25, 2023, 3:10 PM

#

Ah well that is making a lot more sense.

#

If we have a buildkit server deployed can we plug dagger into it today?

potent raven Aug 25, 2023, 3:17 PM

#

That used to be the case but not anymore. We wrap buildkit in a pretty thick layer of wrappers and hooks, to get some of the cooler features to work out of the box, like capturing stdout/stderr or ephemeral service containers

(also support matrix was getting hairy)

#

But you can customize the engine itself if needed

#

The deployment best practices are usually different anyway

#

With dagger you typically are deploying it as a companion to your CI runner, like a specialized GPU sitting next to your CPU 🙂 So you typically want one dagger engine on each machine in your cluster; then your CI runner software hands work to the closest engine. In kubernetes we do this with a daemonset for example

uncut moon Aug 25, 2023, 3:25 PM

#

Yeah I was going to deploy buildkit as a daemonset and hand it off to that

#

But I understand now that it won't work like that

potent raven Aug 25, 2023, 3:25 PM

#

Exactly. So same thing but with dagger engine in the daemonset and you’re good to go 🙂

uncut moon Aug 25, 2023, 3:27 PM

#

I see, and is that doable today? Or are we talking about the Cloud offering?

potent raven Aug 25, 2023, 3:28 PM

#

daemonset + engine is all open source. Then you need to decide how to get observability on the pipelines, and distributed caching. That’s where you would plug in Dagger Cloud

#

Or alternatively you run dagger on a single super-powerful machine (possibly bare metal), in which case distributed caching is less of a problem in the short term

#

That’s the preferred model of @modern yoke for example

uncut moon Aug 25, 2023, 3:31 PM

#

Gotcha, it's all making sense now, really appreciate it.

potent raven Aug 25, 2023, 3:34 PM

#

No problem!

brazen elm Aug 25, 2023, 4:24 PM

#

re: netflix - I think I followed the trail of the same person who worked there: https://github.com/hinshun/ipcs#results

IPFS's performance seems to slow down as the number of nodes (size of total image) goes up. There was a recent regression in go-ipfs v0.4.21 that was fixed in this commit on master:

I've also seen this talk where GitPod was trying to cache images per node with IPFS, but they found issues too: https://www.youtube.com/watch?v=kS6aDScfVuw&pp=ygUTZ2l0cG9kIGlwZnMga3ViZWNvbg%3D%3D

My impression was that IPFS out of the box may not work, but P2P sharing, either via other projects or something custom using libp2p, might. However, it doesn't seem like an easy path to experiment with, so I totally see the value Dagger Cloud's operating model can bring, especially considering how "smart" the control plane could get by making decisions on when to pull cache vs when to build, consistent hashing and other things you've mentioned.
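
For readers unfamiliar with consistent hashing, here is a generic textbook-style sketch of how it could map cache objects to nodes. It is not taken from Dagger, BuildKit or IPFS; the `HashRing` class, the virtual-node count and the key names are assumptions for illustration. The point is that removing a node only remaps the keys that lived on it.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node gets several points on the ring to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the owner of the first ring point after the key's hash (wrapping around)."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"cache-object-{i}" for i in range(200)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b"])  # node-c leaves the cluster

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Every key that changed owner was previously owned by the removed node.
only_removed_node_moved = all(before.node_for(k) == "node-c" for k in moved)
print(len(moved), only_removed_node_moved)
```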
