#gitlab runner caching


misty cove
#

I have two cpu architectures building, one uses dagger engine on kubernetes, the other is a build server with gitlab-runner running with the docker executor. I'm using this as a reference: https://docs.dagger.io/759201/gitlab-google-cloud/#appendix-b-configure-a-self-hosted-gitlab-runner-for-use-with-dagger
the problem is that only the kubernetes-based one seems to be caching correctly: if I rerun/restart the build, it caches everything. The build server is not caching anything when I rerun/restart. How can I verify that it's able to cache correctly?

#

I have this for the docker config (gitlab-runner 16.7.0):

    tls_verify = false
    image = "docker:20.10.16"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    shm_size = 0
    network_mtu = 0
naive smelt
#

Hey @misty cove, just checking if you still need help here. Catching up with issues

naive smelt
#

ok, I think I know what's happening here. That reference guide doesn't have the required steps to correctly set up the engine. cc @severe cliff not sure if you might know off the top of your head what this configuration would look like

grand coyote
#

It sounds like there needs to be some additional configuration for the gitlab-runner cache...maybe?

#

@misty cove do you see similar behaviour if you use a Gitlab shared or dedicated runner instead of the self-hosted one?

severe cliff
#

Taking a closer look at the config above, and https://docs.gitlab.com/runner/configuration/advanced-configuration.html, I am missing how the dagger-engine* container is mounting a volume from the runner which is persisted across runs.

If the Dagger Engine is getting restarted between runs, and doesn't mount a data volume which was used in a previous run, it will always start with an empty state.

I suspect that figuring out how to re-use the same volume for /var/lib/dagger in the Dagger Engine will solve this.
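
One way to check this on the build server is to list what the engine container actually has mounted. A minimal sketch, assuming the Docker CLI is available on the host; `list_mounts` and `state_mount` are hypothetical helper names, and the container name passed in is whatever `docker ps` shows for the engine:

```shell
#!/bin/sh
# Print "<type> <source> <destination>" for every mount of a container.
# Uses `docker inspect --format` with a Go template over .Mounts.
# Assumption: mount paths contain no spaces.
list_mounts() {
    docker inspect \
        --format '{{range .Mounts}}{{println .Type .Source .Destination}}{{end}}' \
        "$1"
}

# Print the mount backing a destination (default: /var/lib/dagger).
# Fails with a non-zero status if the container has no mount there.
state_mount() {
    container=$1
    dest=${2:-/var/lib/dagger}
    list_mounts "$container" |
        awk -v d="$dest" '$3 == d { print; found = 1 } END { exit !found }'
}
```

Calling `state_mount <engine-container>` prints the mount type, source, and destination. A `bind` mount or a named `volume` whose source stays the same across runs suggests the state is persisted; a `volume` whose source contains a long random ID is an anonymous volume, which does not carry over to a newly created container.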

naive smelt
#

We need to find a setup where the engine gets started once by the gitlab runner and then subsequent jobs should connect to it

severe cliff
#

If that's the case, then yes. A few questions for @misty cove that will help move this forward:

  1. Does your Gitlab setup have the option of a long-running service which multiple jobs can connect to? This would be similar to multiple jobs connecting to the same db instance which retains data between job runs.
  2. If you had to run a long-running service outside of Gitlab, which IaaS or PaaS is your go-to? Maybe you already have AWS infra, or DigitalOcean, or even Fly.io. I would like to make a recommendation which is closest to what you have, rather than introducing something new. If you prefer something new, which scales down to 0 when not in use, let me know.
misty cove
#

it's #1. yes, the gitlab-runner on this build server is running as a daemon. it is capable of running multiple concurrent jobs, but it is only running jobs for one monorepo.

#

I tried running dagger-engine on it but don't know how to get gitlab to connect to it, probably something to do with accessing /var/run/dagger

severe cliff
#

OK, that makes sense.

If you can share the same mount between the dagger-engine and the gitlab-runner, then mount it as /var/run/buildkit in both the dagger-engine and the gitlab-runner.

Then, in your jobs, use the following environment variable: _EXPERIMENTAL_DAGGER_RUNNER_HOST: "unix:///var/run/buildkit/buildkitd.sock". E.g. https://github.com/dagger/dagger/blob/53179a064559cf376fa2ad7596e32bd4e4934c74/.github/workflows/_hack_make.yml#L79C11-L79C86

When the Dagger CLI runs, it will use the mount in gitlab-runner to connect to the Dagger Engine which will write this unix socket in the same mount. To confirm this, you can simply run dagger query to get the Engine ID. If this runs in Docker, it will be the container id. If it runs in K8s, it will be the pod name.
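
Before running the Dagger CLI, a job can first confirm that the shared mount is really there by checking for the socket. A rough sketch, assuming a POSIX shell in the job image; `runner_host_socket_path` and `check_runner_host` are hypothetical helper names, not part of Dagger or GitLab:

```shell
#!/bin/sh
# Turn a unix:// runner host URL into the filesystem path of the socket.
runner_host_socket_path() {
    case $1 in
        unix://*) printf '%s\n' "${1#unix://}" ;;
        *) return 1 ;;  # not a unix socket URL
    esac
}

# Succeed only if the URL points at an existing unix socket.
check_runner_host() {
    path=$(runner_host_socket_path "$1") || {
        echo "not a unix:// URL: $1" >&2
        return 1
    }
    if [ -S "$path" ]; then
        echo "engine socket found at $path"
    else
        echo "no socket at $path (is the shared mount present in this job?)" >&2
        return 1
    fi
}
```

In a job script this could be called as `check_runner_host "$_EXPERIMENTAL_DAGGER_RUNNER_HOST"`. It only shows that the socket file is visible in the job container; it does not prove the engine behind it is healthy.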

misty cove
#

so I've started dagger-engine like this, with some help from the operator manual https://github.com/dagger/dagger/blob/main/core/docs/d7yxc-operator_manual.md :

    docker run --rm -it --name local-dagger --privileged \
      --volume /data/gitlab-dagger-cache:/var/lib/dagger:rw \
      --volume /var/run/buildkit:/var/run/buildkit:rw \
      registry.dagger.io/engine:v0.9.7 \
      --config /etc/dagger/engine.toml --debug

and created a new runner to test this with:

    sudo gitlab-runner register -n \
      --url https://gitlab.private/ \
      --registration-token _reg_token_ \
      --executor docker \
      --description local_dagger \
      --docker-image docker:20.10.16 \
      --docker-privileged \
      --docker-volumes /certs/client \
      --docker-volumes /cache \
      --docker-volumes /var/run/buildkit:/var/run/buildkit \
      --tag-list "dagger-x86" \
      --builds-dir /data/gitlab-runner-builds \
      --cache-dir /data/gitlab-runner-cache

and set _EXPERIMENTAL_DAGGER_RUNNER_HOST: "unix:///var/run/buildkit/buildkitd.sock", I have a parallel job defined in gitlab:

    matrix:
      - ARCH: "amd64"
        RUNNER: "dagger-x86"
        IMAGE: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:20.10-git
        UPLOAD_IMAGE_REF: $CI_REGISTRY_IMAGE
        UPLOAD_IMAGE_TAG: ${CI_COMMIT_REF_SLUG}-${DOCKER_ARCH}
        _EXPERIMENTAL_DAGGER_RUNNER_HOST: unix:///var/run/buildkit/buildkitd.sock

and I can see build related logs in the local-dagger container logs

#

but I'm getting permission errors:

40: [0.23s] runtime/debug.Stack()
40: [0.23s]     /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
40: [0.23s] main.main.func1()
40: [0.23s]     /app/cmd/shim/main.go:63 +0x38
40: [0.23s] panic({0xf975c0?, 0xc0006d1170?})
40: [0.23s]     /usr/local/go/src/runtime/panic.go:914 +0x21f
40: [0.23s] main.shim()
40: [0.23s]     /app/cmd/shim/main.go:215 +0x112b
40: [0.23s] main.main()
40: [0.23s]     /app/cmd/shim/main.go:76 +0x74
40: [0.23s] 

and in dagger-engine:
DEBU[2024-01-29T23:22:32Z] getExecMetaFile: failed to stat file: lstat meta/stdout: no such file or directory client_call_digest= client_hostname=runner-xy3ewnlt-project-542-concurrent-0 client_id=338emfd57cmh07foxfiqkspkk server_id=yzp8mtjdezh141p6k1wwosqor

#

I'm guessing it has something to do with this. I don't know buildkit well, but the permissions don't look right:

total 12
drwx--x--x+ 3 root root 4096 Jan 29 18:22 .
drwx------+ 4 root root 4096 Jan 29 18:06 ..
d---------  2 root root 4096 Jan 29 18:22 work
severe cliff
#

Yes, that permission doesn't look right, but I'm not sure where it's coming from. CC-ing @gusty vapor in case he can spot the issue. I know for sure that @trim jungle has all the context. FWIW, this is what a healthy /var/lib/dagger & /var/run/buildkit looks like:

#

The above is an empty Engine v0.9.7 in Docker, with no running pipelines.

misty cove
#

I see one difference from yours in runc-overlayfs/content, @severe cliff: mine is not world-writeable:

total 16
drwxrwxr-x    4 root     root          4096 Jan 29 23:04 .
drwx------    6 root     root          4096 Jan 29 22:17 ..
drwxr-xr-x    3 root     root          4096 Jan 29 23:04 blobs
drwxrwxr-x    2 root     root          4096 Jan 30 00:07 ingest

the dagger-engine that I started via the dagger cli DOES have it as world-writeable
sanity check:

3e9b939ebb74   registry.dagger.io/engine:v0.9.7   "dagger-entrypoint.s…"   2 minutes ago   Up 2 minutes                                           local-dagger
1cfe48c544d7   registry.dagger.io/engine:v0.9.7   "dagger-entrypoint.s…"   8 days ago      Up 2 days                                              dagger-engine-e4885af3e03f784c
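
To compare directory permissions like the listings above between the two engines, it can help to make the world-writable bit explicit. This is only a sketch, assuming a `stat` that supports `-c` (GNU coreutils or BusyBox); `is_world_writable` and `report_modes` are hypothetical helper names, and the thread does not establish that this bit is the root cause:

```shell
#!/bin/sh
# Succeed if the "other" write bit is set on the given path.
is_world_writable() {
    mode=$(stat -c '%a' "$1") || return 2
    # %a prints octal digits such as 775 or 1777; the leading 0 makes the
    # shell arithmetic read the value as octal before masking the o+w bit.
    [ $(( 0$mode & 2 )) -ne 0 ]
}

# Print mode, owner, and path for each directory, flagging o+w.
report_modes() {
    for dir in "$@"; do
        if is_world_writable "$dir"; then
            flag="world-writable"
        else
            flag="not world-writable"
        fi
        printf '%s  %s\n' "$(stat -c '%A %U:%G %n' "$dir")" "$flag"
    done
}
```

Running `report_modes` against the same directory inside each engine container (for example via `docker exec`) gives two lines that can be compared directly.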
severe cliff
#

How do you install the Gitlab runner? If I have a minute, I might check this out to see if anything obvious shows up.

misty cove
#

    sudo gitlab-runner register -n \
      --url REMOTE_URL \
      --registration-token TOKEN \
      --executor docker \
      --description local_dagger \
      --docker-image docker:20.10.16 \
      --docker-privileged \
      --docker-volumes /certs/client \
      --docker-volumes /cache \
      --docker-volumes /var/run/buildkit:/var/run/buildkit \
      --tag-list "dagger-x86" \
      --builds-dir /data/gitlab-runner-builds \
      --cache-dir /data/gitlab-runner-cache

#

I checked umask between the two and both are showing this:

Umask:  0000
#

just remembering that I restarted the gitlab-runner yesterday to change the bind mount to "/var/run/buildkit:/var/run/buildkit:rw", so that register command may need to be adjusted for the ":rw" option.
I also created a test file in /var/run/buildkit in one of the containers and verified that the other container could see it. don't remember the order of dagger vs gitlab-runner but both are mounted RW
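
The shared-mount check described here can be scripted roughly as follows. A sketch, assuming both containers are running and reachable with `docker exec`; `check_shared_mount` is a hypothetical helper name:

```shell
#!/bin/sh
# Create a marker file in DIR from one container and check that the other
# container sees it. Returns 0 when both see the same directory.
check_shared_mount() {
    writer=$1
    reader=$2
    dir=$3
    marker="$dir/.shared-mount-check-$$"

    docker exec "$writer" touch "$marker" || return 1
    if docker exec "$reader" test -e "$marker"; then
        result=0
    else
        result=1
    fi
    # Clean up the marker regardless of the outcome.
    docker exec "$writer" rm -f "$marker"
    return "$result"
}
```

For example, `check_shared_mount local-dagger <job-container> /var/run/buildkit`, where the job container name is taken from `docker ps` while a job is running. Swapping the two container arguments checks the other direction.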

#

found another diff, via docker inspect:

        "Mounts": [                                                             "Mounts": [
            {                                                                       {
                "Type": "bind",                                      |                  "Type": "volume",
                "Source": "/data/gitlab-dagger-cache",               |                  "Name": "5519633ae97d3fc542644b3b586b28df503c6db782a
                                                                     >                  "Source": "/var/lib/docker/volumes/5519633ae97d3fc54
                "Destination": "/var/lib/dagger",                                       "Destination": "/var/lib/dagger",
                "Mode": "rw",                                        |                  "Driver": "local",
                                                                     >                  "Mode": "",
                "RW": true,                                                             "RW": true,
                "Propagation": "rprivate"                            |                  "Propagation": ""
            },                                                       <
            {                                                        <
                "Type": "bind",                                      <
                "Source": "/var/run/buildkit",                       <
                "Destination": "/var/run/buildkit",                  <
                "Mode": "rw",                                        <
                "RW": true,                                          <
                "Propagation": "rprivate"                            <
            }                                                                       }
        ],                                                                      ],
misty cove
#

ok, so I created a docker volume, ran the engine with:

    docker run --rm -it --name local-dagger --privileged \
      --mount source=dagger-cache,target=/var/lib/dagger \
      --volume /var/run/buildkit:/var/run/buildkit:rw \
      registry.dagger.io/engine:v0.9.7 \
      --config /etc/dagger/engine.toml --debug

and now the content dir is world-writeable and my pipeline is working, just like the kubernetes one

#

also intrigued at the possibility of creating a docker ZFS volume, because this other SSD needs more work.

severe cliff
#

Great news! Anything that we can do in our docs to make this clearer / more detailed?

misty cove
#

I didn't see docs on how to run dagger-engine via docker like this anywhere, maybe I missed it. The operator manual shows that there are several more options to connect, and maybe those would be easier or preferred by others for setting up: https://github.com/dagger/dagger/blob/main/core/docs/d7yxc-operator_manual.md#connection-interface
this all helps with evaluating the potential for using gitlab's docker autoscaler, which will be next on my TODO: https://docs.gitlab.com/runner/executors/docker_autoscaler.html

by the way, since it's possible to run multiple gitlab runners on one server, have to wonder if the buildkitd.sock can handle multiple clients.

misty cove
#

is there a way to "trace" the cause of caching or the invalidation of the cache? or plans to explain caching? I'd like to know, for instance, which env vars were used to calculate the cache, which host/server originally produced the layer, maybe even the output of commands that produced the layer.
I've suspected, for instance, that a layer was cached where a sub command failed but didn't propagate a non-zero exit code up, and it's not easy to find the proof of that. I like the idea of running dagger with the silent mode option in automated pipelines but then I lose the ability to find what was cached when.
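
One common way a failure gets hidden like this, independent of Dagger, is a shell step where a later command succeeds after an earlier one failed. A small illustration in plain POSIX `sh`, not a claim about what happened in this pipeline; `masked_status` and `strict_status` are hypothetical names:

```shell
#!/bin/sh
# A failing step followed by a succeeding one: the script's exit status is
# that of the last command, so the failure is hidden from the caller.
masked_status() {
    sh -c 'false; true'
}

# The same steps with `set -e`: the shell stops at the first failure and
# the non-zero status reaches the caller.
strict_status() {
    sh -c 'set -e; false; true'
}

status=0; masked_status || status=$?
echo "without set -e: exit $status"   # prints "without set -e: exit 0"
status=0; strict_status || status=$?
echo "with set -e: exit $status"      # prints "with set -e: exit 1"
```

A step that exits 0 looks successful to whatever runs it, so anything that caches on success would keep its result.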

gusty vapor
#

Yes, we have plans for this 😄
Unfortunately, this is really tricky, but I'm slightly selfish, cause I want to use this feature 🎉

#

Matching cache hits is theoretically possible (though we don't do it today), but working out invalidation is a lot harder - when you hit, you hit a specific instance - when you miss, you miss all the potentially previous cached runs - so how do you communicate that?
If you have ideas, super open to hear them btw, I think it's a super tricky UI/UX problem

misty cove
#

maybe some OCI metadata that doesn't add too much to image size?

#

I don't know the OCI spec at all, really, just brainstorming...

gusty vapor
#

mm, so we actually have access to get the cache info, it's not that bad to find it (and maybe even display it)!
it just requires a bunch of plumbing around through buildkit/dagger/etc that we just haven't done yet