#Trouble sharing cache mounts across machines - what is expected behavior with Dagger Cloud?


tender magnet
#

I have the Team plan on Dagger Cloud. My understanding is that with the distributed caching it offers, my teammates' machines and our CI machines should be able to benefit from cache hits on shared cache mounts. Unfortunately I haven't observed our Dagger runs in CI use any caching at all. I have seen the dev machines successfully use caching, but I can't tell whether they are benefiting from their local cache or from a distributed cache pull. I've got a handful of cache mounts configured like so:

    WithMountedCache(
        path,
        cache,
        dagger.ContainerWithMountedCacheOpts{
            Sharing: dagger.Shared,
        }),

In CI, we use GitHub Actions on ephemeral self-hosted runners. I've built an AWS AMI with the minimum dependencies and the Dagger engine running, which has saved us 30-60s per run already. My hope and understanding was that this should be enough for the runners to take advantage of the caches. I've ensured that we're calling docker stop on the Dagger engine and waiting up to 300s for that to finish. I've spent a dozen or more hours just on this aspect of onboarding my team to Dagger, and if I can't truly realize distributed caching, it may be a show stopper for us.

Here is an example trace: https://dagger.cloud/Tallied-Technologies-Inc/traces/e22cb64f385a60335abee08ca3635149

tender magnet
#

I'm wondering if the engine itself requires a cloud token; right now we start the engine on VM boot with no cloud token, and it's the dagger call that provides the cloud token. Could this result in no use of Dagger Cloud caches?

next sparrow
#

that's automatically done if dagger call is the one that spawns the engine and the DAGGER_CLOUD_TOKEN env var is set

#

if you're starting the engine yourself, you need to set that variable manually

#

when the engine is using the cache service, you'll see something like this in the engine logs:

DEBU[2024-08-09T15:42:36Z] importing cache                              
DEBU[2024-08-09T15:42:36Z] calling import cache                         
DEBU[2024-08-09T15:42:39Z] finished import cache call in 2.556976163s   
DEBU[2024-08-09T15:42:39Z] creating descriptor provider pairs           
DEBU[2024-08-09T15:42:39Z] finished creating descriptor provider pairs in 674.829µs 
DEBU[2024-08-09T15:42:39Z] parsing cache config
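
One way to check for those lines automatically is to scan the engine output for the messages quoted above. This is a minimal sketch, not an official Dagger tool: the marker strings are copied from the log excerpt in this thread and may differ between engine versions, and because they are debug-level lines the engine presumably has to run with debug logging enabled.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cacheImportMarkers are the debug messages quoted in the thread.
// Assumption: other engine versions may word these differently.
var cacheImportMarkers = []string{
	"importing cache",
	"finished import cache call",
}

// sawCacheImport reports whether any line of logs contains one of the markers.
func sawCacheImport(logs string) bool {
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		for _, m := range cacheImportMarkers {
			if strings.Contains(line, m) {
				return true
			}
		}
	}
	return false
}

func main() {
	// Sample lines taken from the excerpt above; in practice the input
	// would be the captured output of the engine container's logs.
	sample := "DEBU[2024-08-09T15:42:36Z] importing cache\n" +
		"DEBU[2024-08-09T15:42:39Z] finished import cache call in 2.556976163s\n"
	fmt.Println("cache import seen:", sawCacheImport(sample))
}
```

If none of the markers appear in a CI run, that would be consistent with the engine having started without a cloud token.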
tender magnet
#

Thanks for the reply! I just confirmed, the caching is working if I let the dagger action start the engine with the cloud token env var.

#

It takes ~1min for dagger connect though, presumably it's pulling down the cache/caches.

#

When we let dagger call start the engine, I see exec docker run --name dagger-engine-f16ce8ba8491e744 -d --restart always -v /var/lib/dagger --privileged -e DAGGER_CLOUD_TOKEN registry.dagger.io/engine:v0.12.4 --debug. Do I understand correctly that -e DAGGER_CLOUD_TOKEN binds the host's DAGGER_CLOUD_TOKEN var to the engine container under the same name? As in, it must be set on the host prior to docker run?
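
For reference, Docker's documented behavior for `-e NAME` with no value is to pass the host's current value of that variable into the container, so the variable does need to be set in the environment that runs docker run. A minimal sketch of a pre-flight check that could run before starting the engine by hand follows. The argument list mirrors the command quoted above; the container name `dagger-engine-custom` and the helpers `engineRunArgs` and `tokenPresent` are hypothetical names for illustration only.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// engineRunArgs mirrors the docker run invocation quoted in the thread.
// "-e DAGGER_CLOUD_TOKEN" names the variable without a value, so docker
// copies whatever the host environment holds at the time of the call.
func engineRunArgs(name, image string) []string {
	return []string{
		"run", "--name", name, "-d", "--restart", "always",
		"-v", "/var/lib/dagger", "--privileged",
		"-e", "DAGGER_CLOUD_TOKEN",
		image, "--debug",
	}
}

// tokenPresent reports whether DAGGER_CLOUD_TOKEN is set to a non-empty value.
// The lookup function is injected so the check can be exercised without
// touching the real environment.
func tokenPresent(lookup func(string) (string, bool)) bool {
	v, ok := lookup("DAGGER_CLOUD_TOKEN")
	return ok && v != ""
}

func main() {
	if !tokenPresent(os.LookupEnv) {
		fmt.Fprintln(os.Stderr, "DAGGER_CLOUD_TOKEN is not set on the host; the engine would start without it")
		return
	}
	// Hypothetical container name; the image tag is the one from the thread.
	args := engineRunArgs("dagger-engine-custom", "registry.dagger.io/engine:v0.12.4")
	fmt.Println("docker " + strings.Join(args, " "))
}
```

The sketch only prints the command rather than executing it, so it is safe to run on a machine without Docker.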

#

Which has me thinking... does the dagger engine pull down caches on startup, or upon the first session connection?

next sparrow
#

Which has me thinking... does the dagger engine pull down caches on startup, or upon the first session connection?

On startup. We're currently working on making this initial cache pull more efficient by making it lazy. It currently downloads a bit more than it needs, and we'll be able to make it pull some things only when needed, which should speed up the startup time considerably.

tender magnet
#

That's great news, I think for the time being, I'm going to embed the dagger cloud token in our private AMI (locked down into a GHA runners AWS account with no access by others). We start the dagger engine as a daemon as early as possible, so the sooner the caches start downloading, the better. Will test and report back if there were any substantial gains worth the short-term security tradeoff.

tender magnet
#

How does the dagger engine authenticate when running on our machines? I don't see DAGGER_CLOUD_TOKEN set. I just tested:

  1. docker rm -fv $(docker ps --filter name="dagger-engine-*" -q)
  2. dagger login
  3. dagger call ...

Engine startup and client connection was near instant, and no caches were used; the full dagger call execution occurred.

tender magnet
#

I just ran the engine myself with the cloud token set, and I'm seeing cache activity.

next sparrow
#

In general we don't see developers using the distributed cache locally on their machines, for several reasons:

  • Local cache is generally good enough for the typical inner dev loop of change, build, etc.
  • A developer might accidentally wipe the cache while doing some tests locally, which would affect the production engines as well
  • Dagger Cloud distributed cache is "cloud/locality aware". This means that if your GHA runners are within the AWS network, the distributed cache will use an S3 bucket in the same region as your runners for maximum performance. If your dev machines are in a different network (which they generally are), Cloudflare is used instead, so they won't really be sharing the "production" engines' cache
next sparrow
#

cc @wanton cedar

tender magnet
#

Further, I'm curious if having -ci and -dev cache key suffixes (based on whether it's running in GHA or on a dev machine) would sufficiently isolate cache usage, such that the risk of 'accidental wipe' is eliminated?
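
The suffix idea could be sketched roughly as follows. This only illustrates deriving the key; whether separate keys actually remove the accidental-wipe risk is not settled in this thread. GitHub Actions sets `GITHUB_ACTIONS=true` on its runners. The base names `go-mod` and `go-build` and both helper functions are hypothetical. The resulting string would be the name passed when creating the cache volume. Inside a Dagger module the host environment is generally not visible, so the CI flag would likely need to be passed in as a function argument rather than read from the environment as done here.

```go
package main

import (
	"fmt"
	"os"
)

// cacheKey appends "-ci" or "-dev" to a base cache volume name.
func cacheKey(base string, inCI bool) string {
	if inCI {
		return base + "-ci"
	}
	return base + "-dev"
}

// runningInGitHubActions checks the GITHUB_ACTIONS variable, which
// GitHub Actions sets to "true" on its runners. getenv is injected so
// the check can be exercised without changing the real environment.
func runningInGitHubActions(getenv func(string) string) bool {
	return getenv("GITHUB_ACTIONS") == "true"
}

func main() {
	inCI := runningInGitHubActions(os.Getenv)
	// Hypothetical base names for two cache volumes.
	for _, base := range []string{"go-mod", "go-build"} {
		fmt.Println(cacheKey(base, inCI))
	}
}
```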

tender magnet
# next sparrow I agree with you. Adding an issue to fix that. cc <@1013436980249505882>

Just realized the linked doc suggests local dev environment -> dagger cloud cache is a thing:

If you've configured cache volumes for the first time in a local development environment, call your Dagger Module via the Dagger CLI and then run the command docker container stop -t 300 "$(docker container list --filter 'name=^dagger-engine-*' -q)". This step ensures your new cache volumes populate to Dagger Cloud as these are during the engine shutdown phase. You only need to do this the first time you use Dagger Cloud locally with cache volumes or when you add new cache volumes in your Dagger pipeline.

next sparrow
#

mostly because devs like to fiddle with their dev setup a lot and sometimes they remove their docker containers, which makes subsequent dagger calls take a while as the cache needs to be re-synced, etc

tender magnet
#

That makes sense, fortunately this is the first and only use of containerization on the team, so I have somewhat of a 'clean slate' to work with in terms of setting a standard.

next sparrow
#

what I'd advise in that case, is to give you a different DAGGER_CLOUD_TOKEN so your developers use a different one than production

#

you can't currently create new tokens in the UI, but we can do it from our backend and you'll still see them in your configuration page. cc @wanton cedar WDYT?

tender magnet
#

One of the things I tried during my troubleshooting leading up to this point was setting the cache volume mount as Shared CacheSharingMode = "SHARED". I'm curious how dangerous that is. It's a Go monorepo with a single root Go module, so it seems safe for go mod caching, but for caching go build/go test I can imagine parallel invocations (CI Run A and CI Run B) might step on each other somehow?

next sparrow
wanton cedar