Question on engine state | Dagger | Page 1

latent lantern Mar 13, 2026, 5:22 PM

#

We're experimenting with the "cache seeding" idea, as a stopgap for lack of proper persistent storage

#

The general idea: build custom engine images which are pre-warmed by running an ephemeral service on the vanilla image, having it load a module and perhaps even run checks, then snapshotting the resulting state and baking it into the engine image.

Then on deployment, we have the option of deploying a blend of vanilla and specialized images. No persistent storage - further writes to cache are discarded.

The theory is that the layers common to every run of a given check are expensive enough that the speedup is worth it

#

Right now we're trying to prototype this at small scale, to validate whether it is worth it

#

But, we're running into an implementation issue: you can't just snapshot containerd overlay layers, and bake them into an OCI image. It causes an "overlay on overlay" situation which breaks the final engine

primal lotus Mar 13, 2026, 5:27 PM

#

Still a viable option, I did realize there’s one wrinkle in that the engine strictly requires /var/lib/dagger to NOT be overlay (cause you can’t do overlay-on-overlay) but if the cache is stored in an image layer it’ll be overlay. So this idea requires a small dance of mv’ing the cache at startup, but I doubt that would be a huge overhead. Just FYI

#

Oh

#

Lol

#

spidermanpointing

latent lantern Mar 13, 2026, 5:27 PM

#

So the question is: how difficult would it be to get, say, a super low-level "export state / import state" engine-wide, to support the same idea conceptually?

latent lantern Mar 13, 2026, 5:28 PM

#

primal lotus Still a viable option, I did realize there’s one wrinkle in that the engine stri...

I'll take this as an encouraging sign that this is possible and possibly even a good idea? 🙂

#

Actually, rather than an imperative API call, it might make more sense to make it a configurable directory in the engine image

#

-> at startup, if the engine finds anything in that path, it ingests it

primal lotus Mar 13, 2026, 5:30 PM

#

latent lantern I'll take this as an encouraging sign that this is possible and possibly even a ...

If you haven’t tried just doing a rsync of the cache off overlay to a bind mounted non-overlay dir yet (so local only rsync, just probably faster than mv), I would start there. If overhead is really bad we could try more
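
A minimal sketch of that local copy, assuming the baked cache sits in one directory and the bind-mounted, non-overlay directory is the destination. The function name `copy_state` is made up for illustration, and the `cp -a` fallback is an addition for images that do not ship rsync.

```shell
# copy_state SRC DST
# Copy the contents of SRC into DST, keeping permissions, ownership,
# timestamps, symlinks and hard links where the tool allows it.
copy_state() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    if command -v rsync >/dev/null 2>&1; then
        # Trailing slashes: copy the contents of SRC, not SRC itself.
        # -a archive mode, -H keep hard links, -X keep extended attributes.
        rsync -aHX "$src"/ "$dst"/
    else
        # "SRC/." also picks up dotfiles at the top level.
        cp -a "$src"/. "$dst"/
    fi
}

# Example, with the paths used later in the thread:
#   copy_state /var/lib/dagger2 /var/lib/dagger
```

Unlike `mv`, this leaves the source in place, so the image layer data stays readable if the copy has to be retried.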

latent lantern Mar 13, 2026, 5:30 PM

#

Ultra bonus if it supports ingesting & merging several such directories, that would allow us to experiment with even more strategies 😇

primal lotus Mar 13, 2026, 5:31 PM

#

latent lantern -> at startup, if the engine finds anything in that path, it ingests it

It needs to be more like a full replacement than an additive ingest. Even after theseus we are still using containerd snapshots which cannot be merged like that

latent lantern Mar 13, 2026, 5:31 PM

#

primal lotus If you haven’t tried just doing a rsync of the cache off overlay to a bind mount...

Haven't tried yet. If it's at all possible to use image-baking, there is a strong benefit: tooling. We are a Dagger-savvy shop (obviously) so anything that involves assembling and moving lots of images is relatively easy for us. Would require less bespoke tooling than an rsync-based setup

latent lantern Mar 13, 2026, 5:31 PM

#

primal lotus It needs to be more like a full replacement than an additive ingest. Even after ...

Ah. Would it be basically engine doing the equivalent of the rsync, itself?

latent lantern Mar 13, 2026, 5:32 PM

#

primal lotus If you haven’t tried just doing a rsync of the cache off overlay to a bind mount...

OK I will try

primal lotus Mar 13, 2026, 5:32 PM

#

latent lantern Ah. Would it be basically engine doing the equivalent of the rsync, itself?

Yeah to be clear I only mentioned rsync cause I think it would be faster than the mv command due to parallelism, but I could be misinformed and maybe mv is already parallel
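
On the parallelism question: as far as I know, neither `mv` nor a single `rsync` invocation copies files in parallel, and a `mv` across filesystems is a copy followed by a delete. If the copy turns out to be the bottleneck, one option is to fan out one copy per top-level entry. This is only a sketch, assuming `find` with `-mindepth`/`-maxdepth` and `xargs` with `-0`/`-P` (GNU or BusyBox); `parallel_copy` is a hypothetical helper name.

```shell
# parallel_copy SRC DST [JOBS]
# Copy each top-level entry of SRC into DST with its own `cp -a`,
# running up to JOBS copies at once (default 4).
parallel_copy() {
    src=$1
    dst=$2
    njobs=${3:-4}
    mkdir -p "$dst"
    # -print0 / -0 keep unusual file names intact.
    find "$src" -mindepth 1 -maxdepth 1 -print0 |
        xargs -0 -P "$njobs" -I{} cp -a {} "$dst"/
}

# Example:
#   parallel_copy /var/lib/dagger2 /var/lib/dagger 8
```

How much this helps depends on how evenly the state is spread across top-level entries; if one subdirectory holds nearly everything, the fan-out buys little.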

latent lantern Mar 13, 2026, 5:33 PM

#

But wait, after I rsync it, can I still pack it into an OCI layer? Or no

#

Here's what we already do:

Build a dagger engine container
start it in an ephemeral service with a dedicated state cache
run a client against it that loads the given remote module
cleanly shut the engine down once module loading completes
snapshot the seeded state back into the returned container image

We already do this with /var/lib/dagger in a cache volume (not rsync, but same idea)

But, I do this with an engine configured to use buildkit native snapshotter (plain files) because I assumed that if we use overlay, the resulting state cannot be safely packed into the image (regardless of whether it was mv-ed or rsynced out prior to baking)

#

sorry if I'm not clear 🙂

#

If the issue is packaging rather than extraction, perhaps I could wrap the raw directory in a tarball, and the OCI image contains the tarball?
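
The tarball variant could be sketched like this: pack the state directory into one archive at bake time, then unpack it onto the non-overlay state directory at startup. Whether this actually sidesteps the packaging concern is not confirmed in the thread, and the function names and paths are illustrative only.

```shell
# pack_state STATE_DIR ARCHIVE
# Archive the contents of STATE_DIR with paths relative to it.
pack_state() {
    tar -C "$1" -cf "$2" .
}

# unpack_state ARCHIVE STATE_DIR
# Extract ARCHIVE into STATE_DIR, preserving permissions (-p).
unpack_state() {
    mkdir -p "$2"
    tar -C "$2" -xpf "$1"
}

# Example:
#   bake time:  pack_state /var/lib/dagger /seed/state.tar
#   startup:    unpack_state /seed/state.tar /var/lib/dagger
```

One caveat: plain `tar` does not carry extended attributes unless asked to (GNU tar has `--xattrs`), so it is worth checking whether the snapshot data relies on them.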

primal lotus Mar 13, 2026, 5:40 PM

#

Here's what I'm suggesting spelled out more:

Baking step:

Run engine with custom engine state dir (arbitrary, but let's say /var/lib/dagger2), this is a setting that can be controlled by an arg to the engine process (and/or config file)
Load modules, do whatever you want to be baked
Stop engine, build + publish image

Starting pre-baked engine step:

Run pre-baked engine, but with default engine dir /var/lib/dagger
On engine startup, before even starting the engine process do a mv /var/lib/dagger2 /var/lib/dagger
- I am wondering if rsync might be faster because it does stuff in parallel, but that's just an optimization maybe. That's the only reason I mentioned rsync
Then run as normal
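
The startup step above could be sketched as an entrypoint helper like the one below. It assumes the baked state is at /var/lib/dagger2 (the example path above) and that /var/lib/dagger is already a non-overlay mount. `seed_state_dir` is a hypothetical name, and the "only seed when empty" guard is an addition, not something from the thread. It moves the children of the seed directory rather than the directory itself, because a plain `mv /var/lib/dagger2 /var/lib/dagger` onto an existing directory would nest the data one level down.

```shell
# seed_state_dir SEED STATE
# Move everything from SEED into STATE, but only when SEED exists and
# STATE is still empty, so a restart does not clobber a live cache.
seed_state_dir() {
    seed=$1
    state=$2
    # Nothing baked into this image: behave like a vanilla engine.
    [ -d "$seed" ] || return 0
    mkdir -p "$state"
    # Skip seeding if the state dir already has content.
    if [ -n "$(ls -A "$state")" ]; then
        return 0
    fi
    # Move the children (dotfiles included) into the state dir.
    find "$seed" -mindepth 1 -maxdepth 1 -exec mv {} "$state"/ \;
}

# Example entrypoint tail, before the engine process is exec'd:
#   seed_state_dir /var/lib/dagger2 /var/lib/dagger
#   exec "$@"
```

Note that a freshly formatted volume containing `lost+found` would count as non-empty here, so the guard may need adjusting for that case.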

#

There's a million variations and potential little optimizations you could try but that's the gist of the simplest approach I think

latent lantern Mar 13, 2026, 5:41 PM

#

what's the difference between that and just changing /var/lib/dagger ?

primal lotus Mar 13, 2026, 5:42 PM

#

latent lantern what's the difference between that and just changing /var/lib/dagger ?

Not following the question. What I'm suggesting avoids any overlay-on-overlay problems entirely

#

you will never try to run overlay-on-overlay

latent lantern Mar 13, 2026, 5:44 PM

#

In the baking step (while the baking ephemeral engine is running), should /var/lib/dagger2 be in a cache volume, or not in a cache volume, or indifferent?

primal lotus Mar 13, 2026, 5:45 PM

#

latent lantern In the baking step (while the baking ephemeral engine is running), should `/var/...

Cache volume (or in docker terms, a bind mount from the host). Then the baking doesn't have to be extremely slow + expensive.

latent lantern Mar 13, 2026, 5:45 PM

#

OK. At the moment we already bake in a cache volume (bind mount), but mounted at /var/lib/dagger

#

Then we copy the contents of the cache volume out into /var/lib/dagger in the container proper. And publish that

#

But if I understand correctly, the crucial step is to avoid that content being mounted into /var/lib/dagger directly when the final engine runs

primal lotus Mar 13, 2026, 5:47 PM

#

latent lantern Then we copy the contents of the cache volume out into `/var/lib/dagger` in the ...

Yeah the problem is that you want the final engine state dir to be a bind mount from the host (to avoid overlay-on-overlay). But if you make that bind mount, you are masking all of the data underneath it, which defeats the point of course

latent lantern Mar 13, 2026, 5:49 PM

#

OK. The part that seems magical to me, is that /var/lib/dagger2 which will be mounted as an overlay (because part of the image), can be copied into the cache volume at deployment, and somehow nothing is lost in translation

#

So the dagger process can't access it directly, but somehow rsync can

#

(this is definitely a me problem 🙂 )

#

just comprehension

primal lotus Mar 13, 2026, 5:53 PM

#

latent lantern So the dagger process can't access it directly, but somehow `rsync` can

To be clear this is something that'd run probably as part of the entrypoint and before the actual engine process is exec'd. It's nothing too magical, it would just be copying data from one directory (/var/lib/dagger2) to another (/var/lib/dagger).

The problem isn't that the engine couldn't use /var/lib/dagger2, it's just that it's overlay and thus trying to use it will result in the native snapshotter being used. But if the data is copied to /var/lib/dagger (which isn't overlay) then that's not an issue
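
A quick way to see which case a given directory falls into is to ask for its filesystem type. A small sketch, assuming GNU coreutils or BusyBox `stat` (the `-f -c %T` form); the helper names are made up.

```shell
# fs_type DIR
# Print the type of the filesystem backing DIR, e.g. "overlayfs".
fs_type() {
    stat -f -c %T "$1"
}

# is_overlay_type TYPE
# Succeed if TYPE names an overlay filesystem.
is_overlay_type() {
    case $1 in
        overlay|overlayfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, run inside the engine container:
#   if is_overlay_type "$(fs_type /var/lib/dagger)"; then
#       echo "state dir is on overlay" >&2
#   fi
```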

#

To be clear this is all quite stupid, it's limitations in the kernel and in containerd, it's just what we have to work with for the time being

latent lantern Mar 13, 2026, 5:54 PM

#

ah I see!

#

simpler than I imagined

primal lotus Mar 13, 2026, 5:54 PM

#

IIRC you technically can use overlay as a lowerdir, but not an upperdir. Theoretically that could save us but it would require containerd snapshotters knowing how to be "split brain" between two state dirs, which isn't gonna happen in a short time frame

primal lotus Mar 13, 2026, 5:55 PM

#

latent lantern simpler than I imagined

Yeah stupid and simple. I think it's worth trying as a first step

latent lantern Mar 13, 2026, 5:55 PM

#

so the question is: do we want the engine itself to support this as an optional config key or hook - or should we keep it in the image entrypoint

#

(will do it in entry point for prototype regardless)

primal lotus Mar 13, 2026, 5:56 PM

#

latent lantern so the question is: do we want the engine itself to support this as an optional ...

I'd consider this a short term hack. Not worth enshrining more.

latent lantern Mar 13, 2026, 5:57 PM

#

side note: what I like about this caching/deployment strategy in general, is that it feels like a step towards stateless & even ephemeral engines post-theseus

primal lotus Mar 13, 2026, 5:58 PM

#

latent lantern side note: what I like about this caching/deployment strategy in general, is tha...

Yeah I agree, it's the utterly simplest version of a "remote cache import" as you could get. But conceptually it is close to where we're headed

latent lantern Mar 13, 2026, 5:58 PM

#

i'm assuming that even post-theseus, it will be useful to configure different engines to read and write cache differently (eg. pre-warm for this or that specialized workload). It's just that it will no longer require moving data in advance- the engine will be able to do its own moving

lean warren Mar 13, 2026, 6:30 PM

#

@primal lotus wouldn't it be possible to bundle these module-loading layers in the same way that we're bundling the SDKs' runtimes now? As part of the engine build process we generate the required .tar files and then we add them to the content store so everything can be bundled as part of the same OCI image and we don't have to do anything in the engine entrypoint

primal lotus Mar 13, 2026, 6:33 PM

#

lean warren @primal lotus wouldn't it be possible to bundle this module loading laye...

It wouldn't really work the same. What we do with the SDKs is bundle very bespoke base images that we then hardcode certain SDKs to use for the base of their runtimes. It's not actually cache logic, it's more like hardcoding some base images and telling the SDKs to use that.

Trying to get that to work with "arbitrary user modules" would be really complicated since you can't just hardcode anymore

#

IIRC I originally tried having the SDK bundling use buildkit local cache import but obviously it was a nightmare, which is how we ended up with the hardcoded images

latent lantern Mar 13, 2026, 6:34 PM

#

Post-theseus, I see a possibility that these 2 very different kinds of bundling will start converging?

#

There could be a first-class concept of "cache pool", that you configure this or that engine to read from, write to etc

#

some of which would be pre-configured, public read-only, private, etc

primal lotus Mar 13, 2026, 6:36 PM

#

latent lantern Post-theseus, I see a possibility that these 2 very different kinds of bundling ...

Yeah 100% the SDK bundles would just become an additional local cache you import from, not bespoke base images. May potentially help performance since I am not even convinced the base images really work as consistently as we want today (it's just too confusing and convoluted to think through and get working perfectly)

latent lantern Mar 13, 2026, 11:16 PM

#

https://dagger.cloud/dagger/traces/67236d8b368072ee21b792d695c9d017#fb80b10d6ba494a7

👀

#

trying

latent lantern Mar 14, 2026, 12:00 AM

#

@primal lotus I think I'm doing something stupid, what's the env var I should set to use a given image in my local docker engine? Failures are silent so it's hard to know for sure

#

I'm trying this:

# Copy my local vanilla image
docker image tag registry.dagger.io/engine:v0.20.0 dagger-vanilla-engine:v2
# Have dagger auto-start a new engine container with an empty cache (control experiment)
time _EXPERIMENTAL_DAGGER_RUNNER_HOST=docker-image://dagger-vanilla-engine:v2 dagger core version
v0.20.1

--> output should be v0.20.0 😭

primal lotus Mar 14, 2026, 12:01 AM

#

latent lantern I'm trying this: ``` # Copy my local vanilla image docker image tag registry.da...

that looks correct to me

#

I mean in terms of the env var

#

trying too

primal lotus Mar 14, 2026, 12:02 AM

#

latent lantern I'm trying this: ``` # Copy my local vanilla image docker image tag registry.da...

I'm presuming you actually ran docker image tag, not dagger image tag, right?

latent lantern Mar 14, 2026, 12:03 AM

#

oh..yes 🙂

#

$ docker image ls dagger-vanilla-engine:v2
REPOSITORY              TAG       IMAGE ID       CREATED       SIZE
dagger-vanilla-engine   v2        ef79ee63b9e5   2 weeks ago   676MB

#

$ docker image ls registry.dagger.io/engine:v0.20.0
REPOSITORY                  TAG       IMAGE ID       CREATED       SIZE
registry.dagger.io/engine   v0.20.0   ef79ee63b9e5   2 weeks ago   676MB

#

No engine started:

$ docker container ls | grep vanilla | wc -l
       0

primal lotus Mar 14, 2026, 12:05 AM

#

Here's what I get thinkspin

sipsma@dagger_dev:~/repo/github.com/sipsma/dagger/.github/workflows$ docker image tag registry.dagger.io/engine:v0.20.0 dagger-vanilla-engine:v2
sipsma@dagger_dev:~/repo/github.com/sipsma/dagger/.github/workflows$ time _EXPERIMENTAL_DAGGER_RUNNER_HOST=docker-image://dagger-vanilla-engine:v2 dagger core version
✘ connect 0.0s ERROR
! start engine: parse runner host: parse "docker-image://dagger-vanilla-engine:v2": invalid port ":v2" after host

real    0m0.835s
user    0m0.095s
sys     0m0.127s

#

do you have dagger aliased to dagger --cloud or something?

latent lantern Mar 14, 2026, 12:07 AM

#

oh... duh I had DAGGER_CLOUD_ENGINE=1 🤦‍♂️

#

But... now that I'm getting the same as you: isn't it weird that we can't set the image tag?

#

maybe it's reserved for setting the CLI version?

primal lotus Mar 14, 2026, 12:08 AM

#

latent lantern But... now that I'm getting the same as you: isn't it weird that we can't set th...

Yeah that's a bug for sure, but it does seem to work if you drop the tag

latent lantern Mar 14, 2026, 12:08 AM

#

Oh it's trying to pull

#

Ah nevermind, I have to hardcode latest
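
Putting the last few messages together, a workaround sketch: tag the local image as `latest` and pass the runner host without a tag. This reads the thread as saying that an untagged reference works and ends up on `latest`, which is an assumption. `untagged_runner_host` is a hypothetical helper that only strips the tag from an image reference.

```shell
# untagged_runner_host IMAGE_REF
# Print IMAGE_REF as a docker-image:// URL with any trailing :tag removed.
untagged_runner_host() {
    # Only the last path component can carry a tag; a colon earlier in
    # the reference would be a registry port.
    case ${1##*/} in
        *:*) name=${1%:*} ;;
        *)   name=$1 ;;
    esac
    printf 'docker-image://%s\n' "$name"
}

# Example, using the image names from the thread:
#   unset DAGGER_CLOUD_ENGINE
#   docker image tag registry.dagger.io/engine:v0.20.0 dagger-vanilla-engine:latest
#   _EXPERIMENTAL_DAGGER_RUNNER_HOST=$(untagged_runner_host dagger-vanilla-engine:latest) \
#       dagger core version
```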
