#Hi. I have a project with multiple repos
I was thinking smth like running dagger-engine pods with hostNetwork: true or hostPort and then spinning up my CI runners with
env:
  - name: DAGGER_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
somewhere in envs, and using it to assemble the _EXPERIMENTAL_DAGGER_RUNNER_HOST variable.
Is there perhaps a better way?
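A minimal sketch of the assembly step described above, assuming the engine is reachable over TCP on the node IP. The helper name runner_host_from_node_ip, the DAGGER_PORT variable and the port value 1234 are hypothetical placeholders; only DAGGER_HOST and _EXPERIMENTAL_DAGGER_RUNNER_HOST come from the message.

```shell
# Build a tcp:// address for the engine on the current node.
# Hypothetical helper: the port must match whatever port the engine pod exposes.
runner_host_from_node_ip() {
  node_ip="${1:?Missing node IP}"
  port="${2:?Missing port}"
  printf 'tcp://%s:%s\n' "${node_ip}" "${port}"
}

# Only export the variable when the downward-API value is actually present.
# DAGGER_PORT and its fallback 1234 are assumptions, not values from the thread.
if [ -n "${DAGGER_HOST:-}" ]; then
  _EXPERIMENTAL_DAGGER_RUNNER_HOST="$(runner_host_from_node_ip "${DAGGER_HOST}" "${DAGGER_PORT:-1234}")"
  export _EXPERIMENTAL_DAGGER_RUNNER_HOST
fi
```

If DAGGER_HOST is empty, nothing is exported and the dagger CLI falls back to its default behavior.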
This is something I was also putting some thinking into and am not sure I've come up with any decent solution.
I do believe you should run your engines as a daemon set. You can then use the unix socket variant of the _EXPERIMENTAL_DAGGER_RUNNER_HOST to run your CI runners on each node and communicate directly with the engine on that node.
But, the caches of the engines won't be shared or sync'ed, which is a problem only solved currently via the paid service of the Dagger Cloud AFAIK (and I'm not totally certain about that either reading the docs).
The problem is, my CI runners are auto-scaled and when a CI job is enqueued, there is no way to know on which node it will be scheduled.
This is a predicament. The only thing I could come up with in theory is to have some form of sticky session process for your CI jobs and you'd have to have a gateway of some sort for your CI calls, to then be able to route the CI jobs to the same engine that was used before to take advantage of the cache and I'm really unsure that is feasible either. Until the cache feature offers a centralized storage and Dagger's engine somehow understands which part of the cache to hit, we are stuck with this predicament, and again, AFAIK. If there is a better solution, I'd love to hear about it too.
I'm pretty sure with the daemonset configuration in the helm chart you automatically connect to the current node's engine because of the shared volume between the CI runner pod and the daemon pod. I'm not sure if we have any usage examples to point to (cc @velvet comet @warm oak)
@digital wraith @glass haven
it's a bit more tricky than that...
I've just tried implementing this with BitBucket runners deployed on k8s. I exposed dagger pods' nodeport, modified BB runners' configuration to export a NODE_IP env (dynamically obtained), and it worked as expected up to this point. What did NOT work though is getting that env var into the CI runtime
The problem is, when CI runner starts a job, it spins it up in a dedicated container (think: DinD). And I did not find a way to modify that container to ingest the NODE_IP env. W/o that I can't know from inside the CI job what's the dagger engine's address.
If I hardcode node IP in the CI job (along with tcp port), the job successfully connects to the dagger engine and runs the build there.
So if I cannot pass an env into a CI job container, I see no way to pass a file or a volume (.socket) into it.
that sucks....
At this point IMO even a dagger service would be more robust, albeit somewhat slower and torturing my OCD mercilessly
I see no way to pass a file or a volume (.socket) into it
I think the trick is adding the volume to the CI runner pod itself, if that's something you have control over
I have control of this, yes. But the runner's pod spins up a separate DinD runtime and spawns job containers in there rather than on the pod I pass the volume/env into. Yet another layer of containerization
Got it, yeah that is tricky. I don't have any ideas at the moment but I'll think about it some more and maybe someone else has it solved
Yeah.. When you write a CI definition, you DO specify an image [alpine? Maven? Nodejs? Smth else?] in your yaml. And the runner spins up a new container with that img. I didn't find a way to specify envs or volumes along with the image repo:tag yet
A Dagger service would be a very sound approach with CNIs like Cilium, because it picks service pods closest to the requesting pod, i.e. if available, Cilium will pick a service pod running on the same node.
But other CNIs don't do that afaik
@foggy brook - Out of curiosity, what is the reason for needing the DinD-like pod for your CI runners?
@digital wraith
because of the shared volume between the CI runner pod and the daemon pod.
What shared volume and what CI runner pod? The Helm chart only creates engine pods. I'm confused.
I don't need it. But apparently the self-hosted CI runner needs it
kubectl --context "${KUBE_CONTEXT}" -n bb-runners exec -it runner-1fdec5f3-54a4-5684-a60f-38cd65fa1064-xzdf8 -- sh -c 'env | grep NODE'
Defaulted container "runner" out of: runner, docker
NODE_NAME=ip-10-101-11-170.us-west-2.compute.internal
NODE_IP=10.101.11.170
runner pod has 2 containers: "runner" and "docker". And all the jobs running on k8s are launched in a separate container, running on the BB runner pod's 'docker' container.
I don't do DinD. The BB CI runner does :/
That setup sounds similar to how GitLab CI runners work, so maybe this is a good reference: https://docs.dagger.io/ci/integrations/gitlab#kubernetes-executor
What shared volume and what CI runner pod? The Helm chart only creates engine pods. I'm confused
The helm chart also mounts the volume /run/dagger to the engine. It can then be mounted to the CI runner pod as a means to connect to the engine's socket
Wow. Really? That is brand new to me and I just checked the docs for k8s integration and there is nothing about it I could find. Might need to be added to the docs?
You said this would also be automatic. So, if I also mount the volume in the runner pod, both the CLI and the SDK client can connect to the engine, if they find /run/dagger is mounted?
I just updated Dagger and looked over values.yaml. Not sure how to get that volume. I don't see any PVC or volume set up.
Yeah I think it's explained a bit in the examples that show running the engine as a sidecar, and afaik it works the same. Agreed it should be documented in there. It's not so much that the CLI looks for a /run/dagger mount, but rather it looks for an engine first before creating one. So if it's sharing that volume with an engine pod, it'll see that engine and skip trying to create one
https://github.com/dagger/dagger/blob/main/helm/dagger/templates/engine-daemonset.yaml#L129-L132
I guess it's .Values.engine.hostPath.runVolume.enabled? Sorry, this is outside of my expertise
Yes, I saw that. But, there is no volume/PVC created, when it is set to true. And, how would the CLI communicate over a volume with the Engine in another pod? I'm sort of bewildered that is possible.
Currently, I'm using the unix socket. And I also learned a few days ago about how to open a port and get the communication going that way. The volume I'd get, if one engine were to share the cache with another engine.
cache, or engine setup. Whatever is in /run/dagger
Or is it the actual engine?
From what I can tell it's a local filesystem volume mount? https://github.com/dagger/dagger/blob/main/helm/dagger/templates/engine-daemonset.yaml#L154-L156
Here's what I've worked with personally with sidecar mode https://docs.dagger.io/ci/integrations/argo-workflows
In the argo-workflows example, it is also using the unix socket.
I also get now, the /run/dagger mount is the socket itself. Interesting. Let me play with this tomorrow and get back to you with my findings and understandings.
I'm new to unix sockets in general.
@foggy brook - I apologize for sidetracking/ hijacking your thread. If you wish, I can open my own. Just let me know.
yes exactly. That env var may be required if it's not the default path for the socket, I can't remember off the top of my head. The other volume for the engine is /var/lib/dagger which is for the cache. It does not need to be shared with the CI runner, but if you wanted to experiment with funky ways of sharing cache, that's the spot
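A minimal sketch of pointing the CLI at a socket from the mounted /run/dagger directory, in case the env var does turn out to be needed. The socket file name engine.sock and the helper name socket_runner_host are assumptions, not something confirmed in this thread; check the actual file name inside the mount.

```shell
# Print a unix:// runner address if a socket exists at the given path.
# Assumption: the engine socket is a file named engine.sock under /run/dagger.
socket_runner_host() {
  sock="${1:-/run/dagger/engine.sock}"
  # -S is true only for an existing unix socket file
  [ -S "${sock}" ] || { echo "No engine socket at ${sock}" >&2; return 1; }
  printf 'unix://%s\n' "${sock}"
}

# Export only when the socket is really there; otherwise leave the CLI defaults alone.
if host="$(socket_runner_host 2>/dev/null)"; then
  _EXPERIMENTAL_DAGGER_RUNNER_HOST="${host}"
  export _EXPERIMENTAL_DAGGER_RUNNER_HOST
fi
```

The check makes a missing or mis-mounted volume show up as "no socket" instead of a confusing connection error later in the job.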
I'm wondering now if it might be possible to use a Longhorn volume for a cross-node shared volume for the cache. I'm not sure how much of a performance hit that might cause or if it would even work.
afaik the big scary part here is that the engine assumes it's the only thing writing to its volume, so if you get into a world with a single shared volume and multiple-writers it can lead to bad things. And read/write latency on that filesystem is extremely important. But you could try copy-on-read volumes if you have a way to find a smart place to copy from
Ah. So a shared cache volume won't work. Ok. Back to my work routing idea. I'll play with the socket volume tomorrow to see if I can get it to work. Thanks!
I did some experimenting today, but couldn't get the socket volume working. I'll do some more experimenting tomorrow.
I tried setting up dagger engine behind a k8s service (and tried to configure dagger client to connect to the dagger engine using k8s service name + port). While the connection is established, dagger builds fail with some weird error saying smth about an invalid session. I straced dagger and saw that it opens 4 connections to the dagger engine. Since k8s services are load-balanced round-robin, each connection was probably made to a different engine instance, and the session was mishandled.
Question is: why does dagger open 4 sockets to the engine to do a simple dagger run echo HELLO...? Can't it do it with just one?
@sacred rose
I've given up on the volume mounting. Gone back to the experimental socket.
no idea why, could you open an issue please?
Done. Thanks.
What is the issue? Pardon my laziness, I will copy-paste from discord I tried setting up dagger engine behind a k8s service (and tried to configure dagger client to connect to the dagger engine usi...
hey! did you succeed with sockets?
The socket was working for me from day one and also why I stopped fiddling with the socket volume. Never play with a working system, right?
I made it work with a headless kubernetes service
.scriptlets:
  setup:
    dagger: &setup-dagger >
      : "Setting up Dagger and initializing modules"
      && mkdir -p ./bin/ && curl -sfL https://releases.dagger.io/dagger/install.sh | DAGGER_VERSION=0.18.5 sh && export PATH="${HOME}/.local/bin:${PWD}/bin:${PATH}"
      && dagger version
      && DAGGER_REMOTE_IP=$(getent hosts dagger-headless-svc.dagger.svc | awk '{print $1}')
      && if [ -n "${DAGGER_REMOTE_IP}" ]; then echo "Configuring remote dagger runner: ${DAGGER_REMOTE_IP}"; export _EXPERIMENTAL_DAGGER_RUNNER_HOST=tcp://${DAGGER_REMOTE_IP}:33808 ; fi
      && dagger run echo Dagger is READY
Cool, but how would you make sure cache is working with this setup?
distributed cache is another problem.
My primary problem was computing power required to build several projects in parallel. Now with that out of the way, I'll start thinking about cache centralization.
I have a few ideas, will want to sleep on them.
- using CI job name hash to calc which dagger node IP to select, rather than always selecting the first one. This will avoid the need to sync cache in the first place, as all the jobs of the same kind will always run on the same host.
- using prep job in shell's background that will connect dagger to a remote node dedicated for caching only (no CI jobs running there), pull cached data from there, re-push it to the dagger's runner node. Will update the cache node with cache contents at the end of the run. Not a big fan of this approach, as there's a lot of data going back and forth
- a preagreed external caching medium, e.g. s3. A separate dagger script that would update s3 cache at the end of a CI job
- TBD
@glass haven well... it ain't pretty, but it serves me OK as a poor-man's sticky load balancer
shell: &setup-shell |
  : "Set up shell"
  if [ "${DEBUG}" = true ] ; then set -x ; fi
  from_range_by_key() {
    min="${1:?Missing min}"
    max="${2:?Missing max}"
    key="${3:?Missing key}"
    [ "${min}" -le "${max}" ] || { echo "Invalid range: min > max [${min} > ${max}]" >&2; return 1; }
    range=$((max - min + 1))
    hash_hex=$(echo -n "${key}" | sha256sum | cut -c1-8)
    hash_int=$((0x${hash_hex}))
    _r=$((min + (hash_int % range)))
  }
  apk add wget curl tar gzip
dagger: &setup-dagger |
  : "Setting up Dagger and initializing modules"
  env
  mkdir -p ./bin/ && curl -sfL https://releases.dagger.io/dagger/install.sh | DAGGER_VERSION=0.18.5 sh && export PATH="${HOME}/.local/bin:${PWD}/bin:${PATH}" \
  && dagger version \
  && if [ "${DAGGER_REMOTE:-true}" = "true" ]; then
    echo "Obtaining dagger engine configuration";
    IPS=$(getent ahostsv4 dagger-headless-svc.dagger.svc | awk '{print $1}' | sort | uniq);
    IP_COUNT=$(($(echo ${IPS:?Missing IPs} | wc -w) - 1));
    from_range_by_key 0 ${IP_COUNT} "${BITBUCKET_REPO_FULL_NAME}";
    DAGGER_REMOTE_IP=$(echo "${IPS}" | sed -n "$((_r + 1))p")
  fi \
  && if [ -n "${DAGGER_REMOTE_IP}" ]; then echo "Configuring remote dagger runner: ${DAGGER_REMOTE_IP}"; export _EXPERIMENTAL_DAGGER_RUNNER_HOST=tcp://${DAGGER_REMOTE_IP}:33808 ; fi \
  && dagger run echo Dagger is READY
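The key-to-engine mapping used above can also be sketched as one small standalone function. This is not the author's code: it swaps sha256sum for the POSIX cksum utility (a CRC, which is enough for spreading keys and not meant as a secure hash), and the name pick_engine is hypothetical. It assumes the IP list is passed in a stable, sorted order, as in the script above.

```shell
# pick_engine KEY IP [IP...]
# Prints one of the given IPs; the same key and the same list always give the same IP.
pick_engine() {
  key="${1:?Missing key}"
  shift
  [ "$#" -gt 0 ] || { echo "No engine IPs given" >&2; return 1; }
  # cksum prints "<crc> <byte count>"; keep only the CRC
  hash=$(printf '%s' "${key}" | cksum | cut -d ' ' -f 1)
  # Drop (hash mod count) entries from the front; the chosen IP is then $1
  shift $((hash % $#))
  printf '%s\n' "$1"
}

# Example call with made-up addresses:
# pick_engine "my-workspace/my-repo" 10.0.0.1 10.0.0.2 10.0.0.3
```

One design caveat with any modulo-based scheme like this: when the number of engine pods changes, many keys map to a different engine, so those repos start with a cold cache on their new engine.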
I was considering leveraging envoy, as it should integrate with kube nicely [at least in theory], but I was not sure how well would it choose which requests belong to the same session. And if it uses src_ip as the key -- might be not universal enough.
I would be very interested in seeing benchmarks for this setup
I have been wondering about the benefits of "sharding" pipelines in this way. But we didn't get a chance to try it for ourselves yet.
My guess is that you sacrifice some compute elasticity, and in return you get better (possibly much better) cache reuse
well. cache reuse is a 100%, as if it were run locally. This way I can have lean CI runners, offloading all the actual work to beefier dagger pods. And since dagger node IP is selected/sharded using a repo name (I'm in a multi-repo project), I'm guaranteed to always hit the same dagger node from my CI jobs. This means immediate cache mounts, this means hot OCI cache.
Compared to installing and spinning up a new dagger client and engine with each CI job run, this current approach, where I offload work to remote dagger pods, is saving me >50% of CI time (smth like 8min down to 2-3min). I'm yet to refactor my biggest repository's pipelines to do the same -- maybe I'll finish before EOW.. BitBucket's caching is slow and capped and quite dumb, so I spend more time zipping/downloading/uploading BB cache than when not using BB cache at all. With dagger I'm sending code to the host where cache is (LOTS of tiny files: think .m2, node_modules). I hope it will speed up my behemoth-of-a-repo builds too.
cache reuse is a 100%, as if it were run locally.
As long as the local storage of each engine is persisted, and the entire shard can fit on a single machine.
since dagger node IP is selected/sharded using a repo name (I'm in a multi-repo project), I'm guaranteed to always hit the same dagger node from my CI jobs. This means immediate cache mounts, this means hot OCI cache.
Yeah that's the most valuable aspect of this setup for sure.
Compared to installing and spinning up a new dagger client and engine with each CI job run, this current approach, where I offload work to remote dagger pods, is saving me >50% of CI time (smth like 8min down to 2-3min).
The comparison I'm curious about, is to the most common prod architecture on self-hosted CI clusters: daemon set, each node on the cluster gets a dedicated engine, CI runners route calls to their current node's engine.
That comparison is more interesting to me, because it's very workload-specific and therefore hard to guess. You have to run both, and look at the real-world numbers. Which I'm very eager to do
cc @velvet comet @nova aurora @digital wraith @warm oak
CI runners route calls to their current node's engine.
that's the part I'm unable to ensure, unfortunately. I'm using the default CNI -- VPC CNI. Should I migrate to Cilium, I guess I could leverage host-local service resolution and would not need this manual IP pinning magic
in the official Dagger helm chart, the runner pods connect to their engine via a unix socket on a node volume. So CNI wouldn't impact which pod talks to which engine.
@sacred rose
ahh, you mean mounting a hostPath volume containing the unix socket on both engine and client... Got it.
Except I don't have this flexibility with bitbucket CI runners.
BitBucket k8s CI runners have 2 pods: autoscaler and cleanup. Cleanup is a boring one, it does what it says on the label -- cleans up idle worker pods.
Now, autoscaler is more complex.
It polls BB API to see if there are any jobs queued for any repo in preconfigured workspace (think: group of repos). If found -- it creates a k8s Job based on a template, predefined in a configmap. Technically, I am able to change that template, assuming I am changing code fetched from BB runner autoscaler's repo (my changes might be backwards-incompatible with new runner versions though..). This template defines a Job pod with 2 containers: runner and docker. Docker is a docker:dind and runner uses it as a service.
When a CI job starts, the runner spawns a CI run container inside docker:dind, using the image preconfigured in the CI yaml. Inside that container I am downloading and installing dagger.
So while I technically CAN hostPath-mount /run/dagger/ on my runner and docker:dind containers, I cannot propagate it to the container running inside docker:dind. I have no way to configure it, because it's the runner's proprietary code that spawns the container and AFAIK there's no easy way to add my custom mounts there..
That's why I'm resorting to the more versatile CNI-based communication to the engine, rather than CSI-based.
Unless I'm missing something?
very roughly
with "CI job exec container" being the container running in the docker:dind, and it's the container where my CI job script from the CI yaml is being run in.