Theseus & state dir | Dagger | Page 1

frosty jewel Sep 16, 2025, 4:38 PM

#

🧵

#

Theseus & state dir

#

I have a specific use case in mind 🙂 So checking for landmines

arctic galleon Sep 16, 2025, 4:40 PM

#

Won't be backwards compatible, engines will lose existing cache when they upgrade but we already essentially enforce that since buildkit itself has sometimes made back-incompat changes.

The new state directory will still be using containerd snapshots+content-store, so quite a bit of it will be the same. The main difference is that the boltdbs buildkit uses for cache will be consolidated to one and will be sqlite

frosty jewel Sep 16, 2025, 4:42 PM

#

Nice

#

So my follow-up question is: if I wanted to develop a tool that can snapshot and merge state directories, for the purposes of proprietary cache optimization, should I wait for theseus to do that? 🙂 Or would it be easy enough on the current version, then can be updated later to theseus?

#

The tricky parts in my mind being the merging

arctic galleon Sep 16, 2025, 4:46 PM

#

frosty jewel So my follow-up question is: if I wanted to develop a tool that can snapshot *an...

Would be pretty much out of the question for the current version. You'd need to merge the various containerd and buildkit boltdbs (flat files)

frosty jewel Sep 16, 2025, 4:46 PM

#

arctic galleon Would be pretty much of out the question for the current version. You'd need to ...

https://tenor.com/view/benjammins-don't-threaten-me-with-a-good-time-with-a-good-time-seems-fun-sounds-fun-gif-10940005011859343234


arctic galleon Sep 16, 2025, 4:48 PM

#

the containerd ones would probably be a bit of a tedious slog to work through, but possible. The buildkit cache formats are not even worth trying (that's what magicache was more or less doing, and the main source of its various woes)

frosty jewel Sep 16, 2025, 4:49 PM

#

Do you think theseus will make this easier or at least possible?

#

I am thinking we could use this for a major PARC scale-out optimization

arctic galleon Sep 16, 2025, 4:50 PM

#

Yeah for sure, though the model there will be that you can export the local cache state and import it elsewhere

frosty jewel Sep 16, 2025, 4:52 PM

#

I'm thinking of an optimization where PARC itself maintains a pool of state volumes per module (and/or module), and merges them on the fly when provisioning an engine for each new session, based on the session's top-level module. A further optimization can be: PARC analyzes a module's dependencies, and includes those cache volumes in the merge too
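
As a rough illustration of the selection step described above (which volumes to hand to the merge for a given session), here is a minimal sketch. The `deps` and `pool` structures, the module names, and the volume IDs are all hypothetical; nothing here reflects an actual PARC or Dagger API, and the merge itself is left out.

```python
from collections import deque


def volumes_to_merge(top_module, deps, pool):
    """Pick state volumes for a module and its transitive dependencies.

    deps: hypothetical mapping of module name -> list of direct dependencies.
    pool: hypothetical mapping of module name -> list of saved state volume IDs.
    Returns volume IDs in breadth-first order, top-level module first.
    """
    seen = {top_module}
    order = [top_module]
    queue = deque([top_module])
    while queue:
        mod = queue.popleft()
        for dep in deps.get(mod, []):
            if dep not in seen:  # visit each module once, even in diamond-shaped graphs
                seen.add(dep)
                order.append(dep)
                queue.append(dep)

    volumes = []
    for mod in order:
        volumes.extend(pool.get(mod, []))  # modules without saved volumes contribute nothing
    return volumes


# Hypothetical example data.
deps = {"app": ["lib-a", "lib-b"], "lib-a": ["lib-c"], "lib-b": ["lib-c"]}
pool = {"app": ["vol-1"], "lib-a": ["vol-2"], "lib-c": ["vol-3", "vol-4"]}
selected = volumes_to_merge("app", deps, pool)
```

Ordering the top-level module first is only one possible choice; it would matter if the merge gives priority to earlier volumes.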

#

So this would be distinct from cache export

arctic galleon Sep 16, 2025, 4:57 PM

#

Sure, though the only distinction with cache export is when you do the merge.

For cache export you merge everything together in the backend when an engine does an export (i.e. when it shuts down, when an export is triggered, etc.)

For what you're describing you save the state dir and then do the merge on demand for new engines.

The actual meat of the work (the merge) is the same either way though, just arranged differently

#

(I don't have a clear point here, just musing)

#

I guess the tradeoff is:

Cache export:

- Only move data to the backend that doesn't already exist (which is significant if the data started out being imported from the backend anyways)
- Slower shutdown (since export needs to run)
- Faster startup (since everything is already merged in backend, can be pulled lazily if desired, etc.)

Save state dir, merge on demand:

- Potentially duplicating tons of data if each state dir overlaps a lot
- Faster shutdown (rsync data out, or whatever equivalent)
- Slower startup (need to run merge on demand)
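
The data-movement side of this tradeoff can be made concrete with a toy model. This is only a sketch: the blob digests and sizes are made up, and the two functions are hypothetical helpers, not part of any engine or backend API.

```python
def export_upload_bytes(engine_blobs, backend_blobs):
    """Cache export: only blobs the backend does not already hold are uploaded."""
    return sum(size for digest, size in engine_blobs.items()
               if digest not in backend_blobs)


def saved_state_dir_bytes(state_dirs):
    """Save state dir: every directory is stored in full, overlap included."""
    return sum(sum(blobs.values()) for blobs in state_dirs)


# Hypothetical blobs (digest -> size in bytes). Both engines started from the
# same imported data ("a" and "b") and each produced one new blob.
backend = {"a": 100, "b": 200}
engine_1 = {"a": 100, "b": 200, "c": 50}
engine_2 = {"a": 100, "b": 200, "d": 70}

# Cache export: each engine uploads only what the backend lacks.
uploaded = export_upload_bytes(engine_1, backend)
backend.update(engine_1)
uploaded += export_upload_bytes(engine_2, backend)

# Save state dir: both directories are kept whole until merged on demand.
stored = saved_state_dir_bytes([engine_1, engine_2])
```

In this made-up case the export path moves 120 bytes while the saved state dirs occupy 720 bytes, because the shared blobs are stored once per directory.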

frosty jewel Sep 16, 2025, 5:14 PM

#

Yeah pretty much 🙂 I wasn't thinking of them as mutually exclusive, although maybe I should have. In my mind, the end goal is a magical stateless engine that can juggle object storage at full granularity, possibly be extended by proprietary hooks to make smarter decisions based on proprietary telemetry, centralized scheduler etc.

But since that will take a while to fully materialize, I was picturing a stopgap where the proprietary infra (PARC) makes the most of an engine that only knows about its local state. We're doing such a stopgap anyway, but I was exploring whether we could make that stopgap smarter, by leveraging the ability to snapshot and "stack" state directories. Perhaps I've ventured too far and it no longer qualifies as a stopgap... But I was thinking that perhaps, if the directory layout allows it, one could "merge" them at the system layer, simply by stacking the directories in an overlayfs or equivalent? i.e. without actually reading or editing files. Perhaps that's unrealistic.

#

Basically I was looking for a "slightly cache-aware scale-out which we could ship sooner" option

#

With the understanding that we don't want to repeat the mistake of magicache, which became a "permanent stopgap" 🙂

arctic galleon Sep 16, 2025, 5:17 PM

#

frosty jewel Yeah pretty much 🙂 I wasn't thinking of them as mutually exclusive, although ma...

Yeah I really really wish that it was possible to just straight up overlay state dirs on each other and get a merged result, but besides the buildkit boltdbs, you'd even run into problems with containerd boltdbs and containerd snapshot dirs.

The only thing that would work would be the content store since the directory names are the hashes of the content. I wish containerd had given a way for snapshots to work the same, but unfortunately the directories there are named with their DB id, which is just an incrementing int
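
To illustrate why the naming scheme decides whether stacking works, here is a small model. It represents each directory as a plain dict of name -> contents and treats an overlay as "upper layer wins on a name clash". The blob contents and snapshot descriptions are invented, and the dicts are a simplification of the real on-disk layouts.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content-addressed name: the hash of the content itself."""
    return hashlib.sha256(data).hexdigest()


def conflicts(lower, upper):
    """Names present in both layers whose contents differ.

    An overlay would silently hide the lower entry for each of these.
    """
    return sorted(name for name in lower.keys() & upper.keys()
                  if lower[name] != upper[name])


# Content stores: entries are named by hash, so a shared name implies
# identical content and stacking loses nothing.
content_a = {digest(b): b for b in (b"layer-x", b"layer-y")}
content_b = {digest(b): b for b in (b"layer-y", b"layer-z")}
content_conflicts = conflicts(content_a, content_b)

# Snapshot dirs: named by an incrementing DB id, so two independent engines
# both use "1", "2", ... for unrelated snapshots (hypothetical contents).
snapshots_a = {"1": "rootfs built by engine A", "2": "cache mount from engine A"}
snapshots_b = {"1": "rootfs built by engine B", "2": "cache mount from engine B"}
snapshot_conflicts = conflicts(snapshots_a, snapshots_b)
```

The content stores produce no conflicts, while every snapshot ID collides, which is the problem described above. The boltdb metadata that references those IDs would still need a real merge on top of this.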

#

All this stuff was just not designed with dagger's use case in mind 😄

frosty jewel Sep 16, 2025, 5:19 PM

#

Yeah fair enough

#

There might still be an intermediary between "engine magically works with object store" and "engine is a little smarter about merging but still only works with local state"

#

I'm kind of grasping at straws I know 🙂

#

To me, anything that involves an export of any kind is an additional step into the unknown, so it automatically goes into the "will take longer" bucket

#

Maybe it could be adding support for lazy merging on the local state? i.e. if the engine discovers several state directories side by side (unmerged), it knows how to merge them (and it's configurable whether it should do so)
