I'm building for two CPU architectures: one uses the Dagger Engine on Kubernetes, the other is a build server running gitlab-runner with the docker executor. I'm using this as a reference: https://docs.dagger.io/759201/gitlab-google-cloud/#appendix-b-configure-a-self-hosted-gitlab-runner-for-use-with-dagger
The problem is that the Kubernetes-based one seems to be caching correctly: if I rerun/restart the build, it caches everything. The build server is not caching anything when I rerun/restart. How can I verify that it's able to cache correctly?
#gitlab runner caching
I have this for the docker config (gitlab-runner 16.7.0):
tls_verify = false
image = "docker:20.10.16"
privileged = true
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["/certs/client", "/cache"]
shm_size = 0
network_mtu = 0
Hey @misty cove, just checking if you still need help here. Catching up with issues
yes. have been busy with other things and haven't looked at this further.
ok, I think I know what's happening here. That reference guide doesn't have the required steps to correctly set up the engine. cc @severe cliff not sure if you might know off the top of your head what this configuration would look like
I don't, as I haven't used Gitlab in the last 5 years, but I know someone that might: @grand coyote @vital flume
It sounds like there needs to be some additional configuration for the gitlab-runner cache...maybe?
@misty cove do you see similar behaviour if you use a Gitlab shared or dedicated runner instead of the self-hosted one?
Taking a closer look at the config above, and https://docs.gitlab.com/runner/configuration/advanced-configuration.html, I don't see how the dagger-engine* container mounts a volume from the runner that is persisted across runs.
If the Dagger Engine is getting restarted between runs, and doesn't mount a data volume which was used in a previous run, it will always start with an empty state.
I suspect that figuring out how to re-use the same volume for /var/lib/dagger in the Dagger Engine will solve this.
I think a data volume won't be enough since multiple engines can't share the same data volume concurrently
We need to find a setup where the engine gets started once by the gitlab runner and then subsequent jobs should connect to it
If that's the case, then yes. A few questions for @misty cove that will help move this forward:
- Does your Gitlab setup have the option of a long-running service which multiple jobs can connect to? This would be similar to multiple jobs connecting to the same db instance which retains data between job runs.
- If you had to run a long-running service outside of Gitlab, which IaaS or PaaS is your go-to? Maybe you already have AWS infra, or DigitalOcean, or even Fly.io. I would like to make a recommendation which is closest to what you have, rather than introducing something new. If you prefer something new, which scales down to 0 when not in use, let me know.
it's #1. yes, the gitlab-runner on this build server is running as a daemon. it is capable of running multiple concurrent jobs, but it is only running jobs for one monorepo.
I tried running dagger-engine on it but don't know how to get gitlab to connect to it, probably something to do with accessing /var/run/dagger
OK, that makes sense.
If you can share the same mount between the dagger-engine and the gitlab-runner, then mount it as /var/run/buildkit in the dagger-engine and gitlab-runner too.
Then, in your jobs, use the following environment variable: _EXPERIMENTAL_DAGGER_RUNNER_HOST: "unix:///var/run/buildkit/buildkitd.sock" . E.g. https://github.com/dagger/dagger/blob/53179a064559cf376fa2ad7596e32bd4e4934c74/.github/workflows/_hack_make.yml#L79C11-L79C86
When the Dagger CLI runs, it will use the mount in gitlab-runner to connect to the Dagger Engine which will write this unix socket in the same mount. To confirm this, you can simply run dagger query to get the Engine ID. If this runs in Docker, it will be the container id. If it runs in K8s, it will be the pod name.
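Before running anything Dagger-specific, it can help to rule out the mount itself by checking that the socket is visible from inside the job container. This is a minimal sketch, assuming the socket path quoted above; `check_engine_socket` is a hypothetical helper name, not part of Dagger or GitLab.

```shell
# Hypothetical helper: succeed only if the given path is a unix socket.
# The default path is the one quoted in this thread.
check_engine_socket() {
  sock="${1:-/var/run/buildkit/buildkitd.sock}"
  if [ -S "$sock" ]; then
    echo "ok: $sock is a unix socket"
  else
    echo "missing: $sock is not a unix socket" >&2
    return 1
  fi
}
```

Calling `check_engine_socket` at the top of a job's script would fail the job early if the bind mount is missing, instead of failing later inside the pipeline.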
so I've started dagger-engine like this, with some help from the operator manual https://github.com/dagger/dagger/blob/main/core/docs/d7yxc-operator_manual.md :
docker run --rm -it --name local-dagger --privileged --volume /data/gitlab-dagger-cache:/var/lib/dagger:rw --volume /var/run/buildkit:/var/run/buildkit:rw registry.dagger.io/engine:v0.9.7 --config /etc/dagger/engine.toml --debug
and created a new runner to test this with:
sudo gitlab-runner register -n --url https://gitlab.private/ --registration-token _reg_token_ --executor docker --description local_dagger --docker-image docker:20.10.16 --docker-privileged --docker-volumes /certs/client --docker-volumes /cache --docker-volumes /var/run/buildkit:/var/run/buildkit --tag-list "dagger-x86" --builds-dir /data/gitlab-runner-builds --cache-dir /data/gitlab-runner-cache
and set _EXPERIMENTAL_DAGGER_RUNNER_HOST: "unix:///var/run/buildkit/buildkitd.sock", I have a parallel job defined in gitlab:
matrix:
- ARCH: "amd64"
RUNNER: "dagger-x86"
IMAGE: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:20.10-git
UPLOAD_IMAGE_REF: $CI_REGISTRY_IMAGE
UPLOAD_IMAGE_TAG: ${CI_COMMIT_REF_SLUG}-${DOCKER_ARCH}
_EXPERIMENTAL_DAGGER_RUNNER_HOST: unix:///var/run/buildkit/buildkitd.sock
and I can see build related logs in the local-dagger container logs
but I'm getting permission errors:
40: [0.23s] runtime/debug.Stack()
40: [0.23s] /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
40: [0.23s] main.main.func1()
40: [0.23s] /app/cmd/shim/main.go:63 +0x38
40: [0.23s] panic({0xf975c0?, 0xc0006d1170?})
40: [0.23s] /usr/local/go/src/runtime/panic.go:914 +0x21f
40: [0.23s] main.shim()
40: [0.23s] /app/cmd/shim/main.go:215 +0x112b
40: [0.23s] main.main()
40: [0.23s] /app/cmd/shim/main.go:76 +0x74
40: [0.23s]
and in dagger-engine:
DEBU[2024-01-29T23:22:32Z] getExecMetaFile: failed to stat file: lstat meta/stdout: no such file or directory client_call_digest= client_hostname=runner-xy3ewnlt-project-542-concurrent-0 client_id=338emfd57cmh07foxfiqkspkk server_id=yzp8mtjdezh141p6k1wwosqor
I'm guessing it has something to do with this. I don't know buildkit well, but the permissions don't look right:
total 12
drwx--x--x+ 3 root root 4096 Jan 29 18:22 .
drwx------+ 4 root root 4096 Jan 29 18:06 ..
d--------- 2 root root 4096 Jan 29 18:22 work
Yes, that permission doesn't look right, but I'm not sure where it's coming from. CC-ing @gusty vapor in case he can spot the issue. I know for sure that @trim jungle has all the context. FWIW, this is what a healthy /var/lib/dagger & /var/run/buildkit looks like:
The above is an empty Engine v0.9.7 in Docker, with no running pipelines.
this is slightly bizarre - could you share your dagger pipeline?
.dagger_meta_mount should always be world-writeable: https://github.com/dagger/dagger/blob/a51c3ca670e2b88cecacfc64e64d67216882ba07/core/container.go#L1724
and the stdout file should only be created with 0o666 permissions: https://github.com/dagger/dagger/blob/a51c3ca670e2b88cecacfc64e64d67216882ba07/cmd/shim/main.go#L212-L215
I see one difference from yours in runc-overlayfs/content, @severe cliff: mine is not world-writeable:
total 16
drwxrwxr-x 4 root root 4096 Jan 29 23:04 .
drwx------ 6 root root 4096 Jan 29 22:17 ..
drwxr-xr-x 3 root root 4096 Jan 29 23:04 blobs
drwxrwxr-x 2 root root 4096 Jan 30 00:07 ingest
the dagger-engine that I started via the dagger cli DOES have it as world-writeable
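The world-writeable comparison above can be scripted rather than read off `ls -l` output. A minimal sketch, assuming GNU `stat` is available; `is_world_writable` is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if "other" has write permission on the path.
# Assumes GNU stat, where %a prints the octal mode (for example 777 or 775).
is_world_writable() {
  mode=$(stat -c '%a' "$1") || return 2
  other=${mode#"${mode%?}"}      # last octal digit holds the "other" bits
  [ $((other & 2)) -ne 0 ]       # bit value 2 is the write bit
}
```

For the bind-mounted engine in this thread, the directory to test would presumably be `runc-overlayfs/content` under the host path backing `/var/lib/dagger`; the exact host path depends on how the engine was started.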
sanity check:
3e9b939ebb74 registry.dagger.io/engine:v0.9.7 "dagger-entrypoint.s…" 2 minutes ago Up 2 minutes local-dagger
1cfe48c544d7 registry.dagger.io/engine:v0.9.7 "dagger-entrypoint.s…" 8 days ago Up 2 days dagger-engine-e4885af3e03f784c
How do you install the Gitlab runner? If I have a minute, I might check this out to see if anything obvious shows up.
sudo gitlab-runner register -n --url REMOTE_URL --registration-token TOKEN --executor docker --description local_dagger --docker-image docker:20.10.16 --docker-privileged --docker-volumes /certs/client --docker-volumes /cache --docker-volumes /var/run/buildkit:/var/run/buildkit --tag-list "dagger-x86" --builds-dir /data/gitlab-runner-builds --cache-dir /data/gitlab-runner-cache
I checked umask between the two and both are showing this:
Umask: 0000
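The `Umask:` line matches the format of `/proc/<pid>/status` on Linux, so one way to repeat this comparison is to read it from there. The thread does not say how the value was obtained, so this is an assumption. `proc_umask` is a hypothetical helper, and its optional second argument (an alternative proc root) exists only to make the sketch easy to try against a sample file.

```shell
# Hypothetical helper: print the umask of a process by PID.
# Reads the "Umask:" line that Linux 4.7 and later expose in /proc/<pid>/status.
# $1 = PID, $2 = proc root (optional, defaults to /proc)
proc_umask() {
  awk '/^Umask:/ { print $2 }' "${2:-/proc}/$1/status"
}
```

Running it against the engine process in each container gives two values that can be compared directly.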
just remembering that I restarted the gitlab-runner yesterday to change the bind mount to "/var/run/buildkit:/var/run/buildkit:rw", so that register command may need to be adjusted for the ":rw" option.
I also created a test file in /var/run/buildkit in one of the containers and verified that the other container could see it. don't remember the order of dagger vs gitlab-runner but both are mounted RW
found another diff, via docker inspect:
"Mounts": [ "Mounts": [
{ {
"Type": "bind", | "Type": "volume",
"Source": "/data/gitlab-dagger-cache", | "Name": "5519633ae97d3fc542644b3b586b28df503c6db782a
> "Source": "/var/lib/docker/volumes/5519633ae97d3fc54
"Destination": "/var/lib/dagger", "Destination": "/var/lib/dagger",
"Mode": "rw", | "Driver": "local",
> "Mode": "",
"RW": true, "RW": true,
"Propagation": "rprivate" | "Propagation": ""
}, <
{ <
"Type": "bind", <
"Source": "/var/run/buildkit", <
"Destination": "/var/run/buildkit", <
"Mode": "rw", <
"RW": true, <
"Propagation": "rprivate" <
} }
], ],
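The same difference can be pulled out without diffing the full `docker inspect` output, by asking for only the mount type at a given destination. A minimal sketch; `mount_type_for` is a hypothetical helper name, and the Go template uses the `Type` and `Destination` fields visible in the output above.

```shell
# Hypothetical helper: print the mount type ("bind" or "volume") that a
# container uses for a given destination path.
# $1 = container name or ID, $2 = destination path inside the container
mount_type_for() {
  docker inspect --format \
    "{{range .Mounts}}{{if eq .Destination \"$2\"}}{{.Type}}{{end}}{{end}}" "$1"
}
```

With the two containers from this thread, `mount_type_for local-dagger /var/lib/dagger` would be expected to print `bind`, and the CLI-started engine `volume`.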
ok, so I created a docker volume, ran the engine with:
docker run --rm -it --name local-dagger --privileged --mount source=dagger-cache,target=/var/lib/dagger --volume /var/run/buildkit:/var/run/buildkit:rw registry.dagger.io/engine:v0.9.7 --config /etc/dagger/engine.toml --debug
and now the content dir is world-writeable and my pipeline is working, just like the kubernetes one
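For anyone scripting this setup, the volume creation step can be made idempotent so the engine start script is safe to rerun. A minimal sketch; `ensure_volume` is a hypothetical helper name, and `dagger-cache` is the volume name used in the command above.

```shell
# Hypothetical helper: create the named Docker volume only if it is missing.
# $1 = volume name
ensure_volume() {
  docker volume inspect "$1" >/dev/null 2>&1 || docker volume create "$1"
}
```

Calling `ensure_volume dagger-cache` before the `docker run` command above keeps the cache volume across engine restarts, since `--rm` removes the container but not a named volume.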
also intrigued at the possibility of creating a docker ZFS volume, because this other SSD needs more work.
Great news! Anything that we can do in our docs to make this clearer / more detailed?
I didn't see docs anywhere on how to run dagger-engine via docker like this, maybe I missed it. The operator manual shows that there are several more options to connect, and maybe those would be easier or preferred by others for setting up: https://github.com/dagger/dagger/blob/main/core/docs/d7yxc-operator_manual.md#connection-interface
this all helps with evaluating the potential for using gitlab's docker autoscaler, which will be next on my TODO: https://docs.gitlab.com/runner/executors/docker_autoscaler.html
by the way, since it's possible to run multiple gitlab runners on one server, have to wonder if the buildkitd.sock can handle multiple clients.
Yes it can. Each Dagger CLI instance opens a new session.
is there a way to "trace" the cause of caching or the invalidation of the cache? or plans to explain caching? I'd like to know, for instance, which env vars were used to calculate the cache, which host/server originally produced the layer, maybe even the output of commands that produced the layer.
I've suspected, for instance, that a layer was cached where a subcommand failed but didn't propagate a non-zero exit code up, and it's not easy to find proof of that. I like the idea of running dagger with the silent mode option in automated pipelines, but then I lose the ability to find what was cached when.
Yes, we have plans for this 😄
Unfortunately, this is really tricky, but I'm slightly selfish, cause I want to use this feature 🎉
Matching cache hits is theoretically possible (though we don't do it today), but working out invalidation is a lot harder - when you hit, you hit a specific instance - when you miss, you miss all the potentially previous cached runs - so how do you communicate that?
If you have ideas, super open to hear them btw, I think it's a super tricky UI/UX problem
maybe some OCI metadata that doesn't add too much to image size?
I don't know the OCI spec at all, really, just brainstorming..
mm, so we actually have access to get the cache info, it's not that bad to find it (and maybe even display it)!
it just requires a bunch of plumbing around through buildkit/dagger/etc that we just haven't done yet