#CacheVolumes different permissions?


frosty token
#

We have a pipeline that mounts a dag.CacheVolume and tries to install artifacts using the cache.

base := dag.Container().
        From("rust:1.81.0").
        WithEnvVariable("CARGO_TARGET_DIR", "/rust/target").
        WithMountedCache("/rust/target", dag.CacheVolume("cargo-targets")).
        WithExec([]string{"cargo", "install", "cargo-chef@0.1.55"})

When using this locally (on my laptop), everything works fine, but when running it in our remote environment (self-hosted GitHub runners, with a shared Dagger engine on the same node as the runners), we see the following:

Workflow output:

Caused by:
  Permission denied (os error 13)
error: failed to compile `cargo-chef v0.1.55`, intermediate artifacts can be found at `/rust/target`.
To reuse those artifacts with a future compilation, set the environment variable `CARGO_TARGET_DIR` to that path.

To debug it further, we created a simple function that mounts a CacheVolume, and we found that the permissions differ between remote and local executions.

func (m *Forge) CacheVolume(ctx context.Context) (string, error) {
    return dag.Container().
        From("rust:1.81.0").
        WithEnvVariable("CARGO_TARGET_DIR", "/rust/target").
        WithMountedCache("/rust/target", dag.CacheVolume("test-volume"), dagger.ContainerWithMountedCacheOpts{Owner: "root"}).
        WithExec([]string{"ls", "-la", "/rust/target/"}).Stdout(ctx)
}

Locally, the volume mounted is owned by root:root

$> dagger call cache-volume              
✔ connect 0.2s
✔ loading module 2.0s
✔ parsing command line arguments 0.0s
✔ forge: Forge! 0.0s
✔ Forge.cacheVolume: String! 0.9s

total 16
drwxr-xr-x 3 root root 4096 Nov 21 22:02 .
drwxr-xr-x 3 root root 4096 Dec 12 13:25 ..

While on the runner dagger-engine:

17  :   Container.withExec DONE [7.0s]
17  :   [6.9s] | total 16
17  :   [6.9s] | drwxr-xr-x 3 root 1001 4096 Nov 21 14:11 .
17  :   [6.9s] | drwxr-xr-x 3 root 1001 4096 Dec 12 13:36 ..
#

This caused the operations to fail with permissions.
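The local and remote listings above differ only in the group column (root vs 1001; `ls` prints the numeric ID when the container has no name for that gid). As a minimal sketch of how that comparison could be automated, the following standalone Go program parses `ls -la` output and reports entries whose owner or group is not the expected one. The names `Ownership`, `parseLsLine`, and `mismatches` are hypothetical helpers for illustration and not part of the Dagger SDK; the sample input is the remote listing with the log prefix stripped.

```go
package main

import (
	"fmt"
	"strings"
)

// Ownership holds the owner and group columns of one `ls -la` entry.
type Ownership struct {
	Name  string
	User  string
	Group string
}

// parseLsLine extracts owner, group, and name from a single `ls -la` line.
// It returns ok=false for lines that are not entries, such as "total 16".
func parseLsLine(line string) (Ownership, bool) {
	fields := strings.Fields(line)
	// An entry has at least: mode, links, user, group, size, month, day, time, name.
	if len(fields) < 9 {
		return Ownership{}, false
	}
	return Ownership{
		Name:  strings.Join(fields[8:], " "),
		User:  fields[2],
		Group: fields[3],
	}, true
}

// mismatches returns the entries whose user or group differs from the expected values.
func mismatches(lsOutput, wantUser, wantGroup string) []Ownership {
	var out []Ownership
	for _, line := range strings.Split(lsOutput, "\n") {
		o, ok := parseLsLine(line)
		if ok && (o.User != wantUser || o.Group != wantGroup) {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	// Remote listing from the runner, without the engine log prefix.
	remote := "total 16\n" +
		"drwxr-xr-x 3 root 1001 4096 Nov 21 14:11 .\n" +
		"drwxr-xr-x 3 root 1001 4096 Dec 12 13:36 .."
	for _, o := range mismatches(remote, "root", "root") {
		fmt.Printf("%s is owned by %s:%s\n", o.Name, o.User, o.Group)
	}
}
```

Feeding it the stdout of the debug function from both environments would flag the remote engine while staying silent for the local one.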

Probably this is a lack of my understanding of the internals of dagger!

We can solve this by explicitly stating the owner to mount the cache volume with (dagger.ContainerWithMountedCacheOpts{Owner: "root"}).
But the fact that the experience was not the same in both environments broke our promise of "local == remote" 🙂

Locally we run Docker Desktop on Mac, with the setup that comes from the dagger CLI.
Our GH runners have a dagger engine installed and started through the official helm chart, and agents connect to it with the env var _EXPERIMENTAL_DAGGER_RUNNER_HOST pointing to the socket ("unix:///var/run/buildkit/buildkitd.sock"), which is mounted on every runner on the same host where the engine lives (1 engine serves multiple GH runners).

Are there any pointers towards what may cause this difference that I can look at?

trail gate
#

This is really surprising 🤔 hailing @vague sorrel @clever flax @safe ember

clever flax
#

hm

#

indeed, really surprising

#

just to clarify the very obvious - which versions of dagger are involved here? just want to check that there's not some subtle difference between the dagger versions used locally and remotely

plain yarrow
#

@frosty token by any chance are your remote runners stateful? The only thing that I'm thinking of is that at some other point, someone else populated that cache volume with different permissions which is causing this issue

#

have you tried renaming your cache volume to something unique? It's definitely strange as it should yield the same results everywhere

frosty token
#

the versions of dagger involved:
0.14.0

#

and yes, @plain yarrow, the remote runners are "stateful". We're running beefy nodes in k8s where the dagger-engine gets ~80% of the resources. A node may live for weeks before it rotates for maintenance/operational issues/scaling.
Let me do a quick test with a new, uniquely named cache volume

#

🤔

func (m *Forge) CacheVolume(ctx context.Context) (string, error) {
    return dag.Container().
        From("rust:1.81.0").
        WithMountedCache("/cache", dag.CacheVolume("avocados-are-great")).
        WithExec([]string{"ls", "-la", "/cache"}).Stdout(ctx)
}

Locally:

$> dagger call cache-volume
✔ connect 0.3s
✔ loading module 3.4s
✔ parsing command line arguments 0.0s
✔ forge: Forge! 0.0s
✔ Forge.cacheVolume: String! 0.8s

total 8
drwxr-xr-x 2 root root 4096 Dec 12 17:31 .
drwxr-xr-x 1 root root 4096 Dec 12 17:31 ..

Remote:

9   : Forge.cacheVolume: String!
10  :   Container.from(address: "rust:1.81.0"): Container!
11  :     resolving docker.io/library/rust:1.81.0
11  :     resolving docker.io/library/rust:1.81.0 DONE [0.4s]
10  :   Container.from DONE [0.5s]
12  :   Container.withMountedCache(
12  :       cache: cacheVolume(key: "avocados-are-great"): CacheVolume!
12  :       path: "/cache"
12  :     ): Container!
13  :     cache request: pull docker.io/library/rust:1.81.0@sha256:7b7f7ae5e49819e708369d49925360bde2af4f1962842e75a14af17342f08262
14  :   Container.withExec(args: ["ls", "-la", "/cache"]): Container!
14  :   Container.withExec DONE [0.0s]
15  :   Container.stdout: String!
14  :   Container.withExec DONE [0.7s]
14  :   [0.6s] | total 8
14  :   [0.6s] | drwxr-xr-x 2 root 1001 4096 Dec 12 17:33 .
14  :   [0.6s] | drwxr-xr-x 1 root 1001 4096 Dec 12 17:33 ..
15  :   Container.stdout DONE [0.8s]
9   : Forge.cacheVolume DONE [2.7s]


A new release of dagger is available: v0.14.0 → v0.15.0
To upgrade, see https://docs.dagger.io/install
https://github.com/dagger/dagger/releases/tag/v0.15.0
total 8
drwxr-xr-x 2 root 1001 4096 Dec 12 17:33 .
drwxr-xr-x 1 root 1001 4096 Dec 12 17:33 ..
#

I'm pretty sure that no one used that cache volume name in our engines before, and it still has a different group ownership... 🤔

#

And, quoting one of the engineers that initially reported this:

the issue you’re seeing is a break in the abstraction, if you’re not weird like our project and using --insecure-root-capabilities the “root” uuid is different for every run but I don’t think the cache volume mechanism is smart enough to handle that without an explicit owner. I think this is the issue anyway
#

Not sure how close he is to the issue 🙂

plain yarrow
#

I think I've managed to repro

#

1 sec

#

yeah, I was able to repro, the TL;DR is that cache volumes will take the same user:group values that the engine executed with. @frosty token I'd assume that your self hosted runners are running the Dagger engine with root:1001 by default, that's why all the cache volumes get those permissions there
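If the mount inherits the user:group of the engine process, as described above, one way to confirm it is to compare the ids a process runs with against the ids that own a directory. The following is a minimal sketch using only the Go standard library; `dirOwner` is a hypothetical helper, not a Dagger API, and where to run it (for example next to the engine, against a directory it created) is an assumption left to the reader. It relies on `syscall.Stat_t`, so it only works on Unix-like systems.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// dirOwner returns the numeric uid and gid that own path.
// It relies on syscall.Stat_t, so it only works on Unix-like systems.
func dirOwner(path string) (uid, gid uint32, err error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, 0, fmt.Errorf("no Unix ownership data for %s", path)
	}
	return st.Uid, st.Gid, nil
}

func main() {
	// Inspect the directory given as the first argument, or the current one.
	path := "."
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	uid, gid, err := dirOwner(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("process runs as %d:%d\n", os.Getuid(), os.Getgid())
	fmt.Printf("%s is owned by %d:%d\n", path, uid, gid)
}
```

A gid of 1001 on both lines would match the root:1001 ownership seen in the remote listings.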

#

having said that, WithMountedCache has an Owner option to force the owner of the volume, so you have a way to standardize this at least

#

just wondering if we should set the default owner to root:root @clever flax as we know that'll always work in the engine

plain yarrow
#

@frosty token make sure you add root:root. Not sure what happens if you omit the group

plain yarrow
#

root alone is fine

frosty token
#

I've double-checked my response to the engineers that reported this in their pipelines, and (luckily?) I responded with root:root, so all fine 🙂

plain yarrow
#

I'm opening an issue to address this
