#Question on engine state

latent lantern
#

We're experimenting with the "cache seeding" idea, as a stopgap for lack of proper persistent storage

#

The general idea: build custom engine images which are pre-warmed by running an ephemeral service on the vanilla image, having it load a module and perhaps even run checks; then snapshot the resulting state and bake it into the engine image.

Then on deployment, we have the option of deploying a blend of vanilla and specialized images. No persistent storage - further writes to cache are discarded.

The theory is that the layers common to every run of a given check are expensive enough that the speedup is worth it

#

Right now we're trying to prototype this at small scale, to validate whether it is worth it

#

But, we're running into an implementation issue: you can't just snapshot containerd overlay layers, and bake them into an OCI image. It causes an "overlay on overlay" situation which breaks the final engine

primal lotus
#

Still a viable option. I did realize there’s one wrinkle: the engine strictly requires /var/lib/dagger to NOT be overlay (because you can’t do overlay-on-overlay), but if the cache is stored in an image layer it’ll be overlay. So this idea requires a small dance of mv’ing the cache at startup, but I doubt that would be a huge overhead. Just FYI

#

Oh

#

Lol

latent lantern
#

So the question is: how difficult would it be to get, say, a super low-level "export state / import state" engine-wide, to support the same idea conceptually?

latent lantern
#

Actually, rather than an imperative API call, it might make more sense to make it a configurable directory in the engine image

#

-> at startup, if the engine finds anything in that path, it ingests it

primal lotus
latent lantern
#

Ultra bonus if it supports ingesting & merging several such directories, that would allow us to experiment with even more strategies 😇

primal lotus
latent lantern
latent lantern
primal lotus
latent lantern
#

But wait, after I rsync it, can I still pack it into an OCI layer? Or no

#

Here's what we already do:

  1. Build a dagger engine container
  2. start it in an ephemeral service with a dedicated state cache
  3. run a client against it that loads the given remote module
  4. cleanly shut the engine down once module loading completes
  5. snapshot the seeded state back into the returned container image

We already do this with /var/lib/dagger in a cache volume (not rsync, but same idea)

But I do this with an engine configured to use the buildkit native snapshotter (plain files), because I assumed that if we use overlay, the resulting state cannot be safely packed into the image (regardless of whether it was mv-ed or rsynced out prior to baking)

#

sorry if I'm not clear 🙂

#

If the issue is packaging rather than extraction, perhaps I could wrap the raw directory in a tarball, and the OCI image contains the tarball?
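A minimal sketch of the tarball idea, for illustration only. The function names `pack_state` and `unpack_state` are hypothetical, and whether the engine accepts state restored this way is an assumption, not something confirmed in this thread. State produced by an overlay snapshotter may also need extended attributes preserved, which this sketch does not attempt.

```shell
# Sketch only: round-trip a state directory through a tarball.
# pack_state and unpack_state are hypothetical helper names.

# Pack the contents of directory $1 into tarball $2.
pack_state() {
    # -C changes into the directory first, so archive paths are relative.
    tar -C "$1" -cf "$2" .
}

# Unpack tarball $1 into directory $2, creating it if needed.
unpack_state() {
    mkdir -p "$2"
    # -p asks tar to restore file permissions as recorded in the archive.
    tar -C "$2" -xpf "$1"
}
```

At bake time this would be something like `pack_state /var/lib/dagger /seed.tar`, and at startup `unpack_state /seed.tar /var/lib/dagger`, with both paths chosen here purely as examples.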

primal lotus
#

Here's what I'm suggesting spelled out more:

Baking step:

  1. Run engine with custom engine state dir (arbitrary, but let's say /var/lib/dagger2), this is a setting that can be controlled by an arg to the engine process (and/or config file)
  2. Load modules, do whatever you want to be baked
  3. Stop engine, build + publish image

Starting pre-baked engine step:

  1. Run pre-baked engine, but with default engine dir /var/lib/dagger
  2. On engine startup, before even starting the engine process do a mv /var/lib/dagger2 /var/lib/dagger
    • I am wondering if rsync might be faster because it does stuff in parallel, but that's just an optimization maybe. That's the only reason I mentioned rsync
  3. Then run as normal
#

There's a million variations and potential little optimizations you could try but that's the gist of the simplest approach I think
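The startup step above could be sketched as a small helper that an image entrypoint calls before exec'ing the engine. This is a rough illustration under stated assumptions, not the engine's actual entrypoint: the function name `seed_state_dir` is hypothetical, and it uses `cp -a` on the directory contents rather than a literal `mv`, so that it also works when the destination already exists as a mount point.

```shell
# Sketch only: copy pre-baked state from the image layer into the real
# state directory before the engine starts. seed_state_dir is a
# hypothetical name; the paths are passed in by the caller.
seed_state_dir() {
    seed_dir="$1"    # e.g. /var/lib/dagger2 (baked into the image, so overlay)
    state_dir="$2"   # e.g. /var/lib/dagger (must not be overlay)

    # Nothing baked in: behave like a vanilla engine.
    [ -d "$seed_dir" ] || return 0

    mkdir -p "$state_dir"

    # Only seed an empty state dir, so existing state is never overwritten.
    if [ -n "$(ls -A "$state_dir")" ]; then
        return 0
    fi

    # cp -a preserves ownership, modes, timestamps and symlinks.
    # The trailing "/." copies the directory contents, dotfiles included.
    cp -a "$seed_dir/." "$state_dir/"
}
```

An entrypoint would call `seed_state_dir /var/lib/dagger2 /var/lib/dagger` and then `exec` the engine process with its usual arguments. Swapping `cp -a` for `rsync` is the kind of optimization mentioned above.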

latent lantern
#

what's the difference between that and just changing /var/lib/dagger ?

primal lotus
#

you will never try to run overlay-on-overlay

latent lantern
#

In the baking step (while the baking ephemeral engine is running), should /var/lib/dagger2 be in a cache volume, or not in a cache volume, or indifferent?

primal lotus
latent lantern
#

OK. At the moment we already bake in a cache volume (bind mount), but mounted at /var/lib/dagger

#

Then we copy the contents of the cache volume out into /var/lib/dagger in the container proper. And publish that

#

But if I understand correctly, the crucial step is to avoid that content being mounted into /var/lib/dagger directly when the final engine runs

primal lotus
latent lantern
#

OK. The part that seems magical to me, is that /var/lib/dagger2 which will be mounted as an overlay (because part of the image), can be copied into the cache volume at deployment, and somehow nothing is lost in translation

#

So the dagger process can't access it directly, but somehow rsync can

#

(this is definitely a me problem 🙂 )

#

just comprehension

primal lotus
# latent lantern So the dagger process can't access it directly, but somehow `rsync` can

To be clear this is something that'd run probably as part of the entrypoint and before the actual engine process is exec'd. It's nothing too magical, it would just be copying data from one directory (/var/lib/dagger2) to another (/var/lib/dagger).

The problem isn't that the engine couldn't use /var/lib/dagger2, it's just that it's overlay and thus trying to use it will result in the native snapshotter being used. But if the data is copied to /var/lib/dagger (which isn't overlay) then that's not an issue
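One way to see which case applies is to ask the kernel what filesystem backs each directory. A small sketch, assuming a Linux host with GNU coreutils `stat`; the helper names `fs_type` and `is_overlay` are made up for illustration.

```shell
# Sketch only: report the filesystem type behind a path.
# Assumes GNU coreutils stat on Linux; helper names are hypothetical.

# Print the filesystem type for path $1 (for example "overlayfs" or "ext2/ext3").
fs_type() {
    stat -f -c %T "$1"
}

# Succeed if path $1 sits on an overlay filesystem.
is_overlay() {
    case "$(fs_type "$1")" in
        overlay*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Inside a running engine container, `is_overlay /var/lib/dagger2` would be expected to succeed for a directory that came from an image layer, while a directory on a mounted volume would normally report the volume's own filesystem instead.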

#

To be clear this is all quite stupid, it's limitations in the kernel and in containerd, it's just what we have to work with for the time being

latent lantern
#

ah I see!

#

simpler than I imagined

primal lotus
#

IIRC you technically can use overlay as a lowerdir, but not an upperdir. Theoretically that could save us but it would require containerd snapshotters knowing how to be "split brain" between two state dirs, which isn't gonna happen in a short time frame

primal lotus
latent lantern
#

so the question is: do we want the engine itself to support this as an optional config key or hook - or should we keep it in the image entrypoint

#

(will do it in entry point for prototype regardless)

primal lotus
latent lantern
#

side note: what I like about this caching/deployment strategy in general, is that it feels like a step towards stateless & even ephemeral engines post-theseus

primal lotus
latent lantern
#

I'm assuming that even post-theseus, it will be useful to configure different engines to read and write cache differently (e.g. pre-warm for this or that specialized workload). It's just that it will no longer require moving data in advance: the engine will be able to do its own moving

lean warren
#

@primal lotus wouldn't it be possible to bundle these module-loading layers in the same way that we're bundling the SDKs' runtimes now? As part of the engine build process we generate the required .tar files and then add them to the content store, so everything can be bundled as part of the same OCI image and we don't have to do anything in the engine entrypoint

primal lotus
# lean warren @primal lotus wouldn't it be possible to bundle this module loading laye...

It wouldn't really work the same. What we do with the SDKs is bundle very bespoke base images that we then hardcode certain SDKs to use for the base of their runtimes. It's not actually cache logic, it's more like hardcoding some base images and telling the SDKs to use that.

Trying to get that to work with "arbitrary user modules" would be really complicated since you can't just hardcode anymore

#

IIRC I originally tried having the SDK bundling use buildkit local cache import but obviously was a nightmare, which is how we ended up with the hardcoded images

latent lantern
#

Post-theseus, I see a possibility that these 2 very different kinds of bundling will start converging?

#

There could be a first-class concept of "cache pool", that you configure this or that engine to read from, write to etc

#

some of which would be pre-configured, public read-only, private, etc

primal lotus
latent lantern
#

trying

latent lantern
#

@primal lotus I think I'm doing something stupid, what's the env var I should set to use a given image in my local docker engine? Failures are silent so it's hard to know for sure

#

I'm trying this:

# Copy my local vanilla image
docker image tag registry.dagger.io/engine:v0.20.0 dagger-vanilla-engine:v2
# Have dagger auto-start a new engine container with an empty cache (control experiment)
time _EXPERIMENTAL_DAGGER_RUNNER_HOST=docker-image://dagger-vanilla-engine:v2 dagger core version
v0.20.1

--> output should be v0.20.0 😭

primal lotus
#

I mean in terms of the env var

#

trying too

primal lotus
latent lantern
#

oh..yes 🙂

#
$ docker image ls dagger-vanilla-engine:v2
REPOSITORY              TAG       IMAGE ID       CREATED       SIZE
dagger-vanilla-engine   v2        ef79ee63b9e5   2 weeks ago   676MB
#
$ docker image ls registry.dagger.io/engine:v0.20.0
REPOSITORY                  TAG       IMAGE ID       CREATED       SIZE
registry.dagger.io/engine   v0.20.0   ef79ee63b9e5   2 weeks ago   676MB
#

No engine started:

$ docker container ls | grep vanilla | wc -l
       0
primal lotus
#

Here's what I get:

sipsma@dagger_dev:~/repo/github.com/sipsma/dagger/.github/workflows$ docker image tag registry.dagger.io/engine:v0.20.0 dagger-vanilla-engine:v2
sipsma@dagger_dev:~/repo/github.com/sipsma/dagger/.github/workflows$ time _EXPERIMENTAL_DAGGER_RUNNER_HOST=docker-image://dagger-vanilla-engine:v2 dagger core version
✘ connect 0.0s ERROR
! start engine: parse runner host: parse "docker-image://dagger-vanilla-engine:v2": invalid port ":v2" after host

real    0m0.835s
user    0m0.095s
sys     0m0.127s
#

do you have dagger aliased to dagger --cloud or something?

latent lantern
#

oh... duh I had DAGGER_CLOUD_ENGINE=1 🤦‍♂️

#

But... now that I'm getting the same as you: isn't it weird that we can't set the image tag?

#

maybe it's reserved for setting the CLI version?

primal lotus
latent lantern
#

Oh it's trying to pull

#

Ah nevermind, I have to hardcode latest