hosting a dagger engine - load balancing and auth | Dagger

spiral sandal Nov 23, 2024, 4:12 AM

#

hey, I'm looking into hosting a dagger engine. Actually trying to do it has left me with a few questions:

if I run the Kubernetes integration, it doesn't look like the cache is shared at all between nodes. If I choose a random pod to execute in inside a large cluster or using ephemeral nodes, it seems more likely than not that I'd get a cold cache. Is this on purpose? Is there a way I can get around it?
how can I choose a not-busy Kubernetes pod? I guess I could sort them by node metrics, but having some kind of intelligent load balancing would be useful. Is there a way to hook it up to a Service instead?
how do you do auth in this at all? I tried searching on here and don't see a lot. For Kubernetes I guess you just need access to the API, which can be secured, but what if I just wanted to host on a single beefy node to get around some of the other problems above? Do I have to control access with VPN or SSH or something, or is there a way to do authentication against the engine API?

I do have to congratulate y'all, though, on making the first bring-your-own-compute solution that I've been able to stick into my hobby-sized test Kubernetes cluster without completely bogging it down. 😆 I unironically appreciate that!

#

oh yeah, I guess I should say that the point of doing LB and auth is that I want to have my Dagger workflows in open source projects cached the same as on my laptop. If I like this and bring it to work, it'll be GitLab CI/CD runners talking to a larger/ephemeral Kubernetes.

#

I guess I'm also confused about why the Helm chart installs a daemonset if we're supposed to choose a single pod to execute on. 🤔

#

I guess the other thing I want here is to be able to distribute parallel work among multiple workers. Say I've got a bajillion rspec tests to run… I want to distribute that work into the maximum amount of parallelism that I can. I thought Dagger might be able to do this for me, but I'm not so sure after seeing how it seems to work in practice.

vestal pawn Nov 23, 2024, 5:54 PM

#

So we do it with k8s resource requests and AWS auto scaling. I have an ASG hosting dagger engines. I run clients on those engines. The clients have resource requests that drive scaling.

#

It is a little weird since the clients don’t actually use resources themselves but drive utilization in the engine on the node. So the resources are transitive.

spiral sandal Nov 23, 2024, 6:15 PM

#

I get that part, but how do you choose which pod to use, do auth, spread out parallel workloads, etc?

vestal pawn Nov 23, 2024, 6:21 PM

#

Hmm. We just let the k8s scheduler handle all that

#

We are using GitLab pipelines but it shouldn’t matter. It asks k8s to schedule a pod, and k8s does so based on resource requests, which is what drives parallel work and scaling

#

There are 2 pods: the DaemonSet engine pod and the client pod, which runs on the same machine as an engine. The client pod scheduling drives all that

#

The client pod gets scheduled to a node running an engine and uses that engine
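
A minimal sketch of that pattern, assuming the DaemonSet engine exposes its socket on the host under /var/run/dagger. The pod name, image, host path, and request sizes are hypothetical placeholders, not taken from a real setup:

      apiVersion: v1
      kind: Pod
      metadata:
        name: dagger-client              # hypothetical name
      spec:
        restartPolicy: Never
        containers:
          - name: client
            image: alpine:latest         # placeholder; the image needs the dagger CLI
            command: [ "sh", "-c" ]
            args: [ "dagger call <your command>" ]
            env:
              # Point the CLI at the engine running on the same node
              - name: "_EXPERIMENTAL_DAGGER_RUNNER_HOST"
                value: "unix:///var/run/dagger/buildkitd.sock"
            resources:
              # The client itself is mostly idle; these requests stand in for
              # the load it causes in the engine and drive scheduling/scaling
              requests:
                cpu: "2"                 # illustrative value
                memory: 4Gi              # illustrative value
            volumeMounts:
              - mountPath: /var/run/dagger
                name: dagger-socket
        volumes:
          - name: dagger-socket
            hostPath:
              path: /var/run/dagger      # assumption: where the engine's socket lives on the host

Because the socket comes from a hostPath volume, the client always talks to whichever engine runs on the node it was scheduled to, so no separate load balancer is involved.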

#

See here: #gitlab message

#

That message mentions a thread with details

spiral sandal Nov 23, 2024, 6:47 PM

#

Ok that was helpful, thanks. So it looks like you’re scaling vertically instead of splitting the job across multiple nodes, right? And then I guess you’re creating runner nodes on demand. How do you manage the cache?

#

I’m gonna have to figure something else out for my own stuff because I don’t want to host my own runners if I don’t have to. We may borrow your pattern at work though!

vestal pawn Nov 23, 2024, 7:37 PM

#

No. Horizontally

#

Each pipeline run creates a pod per job and that scales out horizontally. Caching is a problem but I hear the team is working on remote caching solutions

#

We do keep a min set of nodes around to help with caching

#

Reread, and yes, there is a 1:1 mapping between a job and a dagger call, and each call goes to a single engine. Not sure how you would break that apart.

turbid parcel Nov 25, 2024, 10:42 AM

#

@spiral sandal Like you, I first thought that running multiple Dagger Engines was the best solution, but after some testing I hit the same load-balancing issue you did. In the end, I decided to create a single pod with the dagger-engine as a sidecar and a simple Alpine image as the main container that launches the dagger call command.
Here are some snippets from the pod I'm creating. They are all part of the pod YAML that is generated on the fly when I request a new run from my CI/CD pipeline system (Argo Workflows).
The sidecar with the dagger engine:

      sidecars:
        - name: dagger-engine
          image: "registry.dagger.io/engine:v{{inputs.parameters.dagger-version}}"
          securityContext:
            privileged: true
            capabilities:
              add:
                - ALL
          readinessProbe:
            exec:
              command: [ "buildctl", "debug", "workers" ]
          volumeMounts:
            # Expose the socket for the main container
            - mountPath: /var/run/buildkit
              name: dagger-socket
            - mountPath: /var/lib/dagger
              name: dagger-storage

The main container

      container:
        image: alpine:latest
        command: [ "sh", "-c" ]
        args: [ "cd /src; dagger call <your command>" ]
        workingDir: /work
        env:
          # Link the dagger socket to the one in the sidecar
          - name: "_EXPERIMENTAL_DAGGER_RUNNER_HOST"
            value: "unix:///var/run/dagger/buildkitd.sock"
          # Using Dagger cloud for the cache, see https://docs.dagger.io/api/cache-volumes
          - name: DAGGER_CLOUD_TOKEN
            valueFrom:
              secretKeyRef:
                name: dagger-cloud
                key: token
          - name: <Any other secret>
            valueFrom:
              secretKeyRef:
                ...
        volumeMounts:
          - mountPath: /var/run/dagger
            name: dagger-socket
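
Both snippets reference the volumes dagger-socket and dagger-storage, which are not shown above. A minimal sketch of how they could be declared in the same template, assuming plain emptyDir volumes (an assumption, not necessarily what the original pod uses):

      volumes:
        # Shared between sidecar and main container so the socket created by
        # the engine under /var/run/buildkit shows up under /var/run/dagger
        - name: dagger-socket
          emptyDir: {}
        # Engine state; with emptyDir it is discarded when the pod goes away
        - name: dagger-storage
          emptyDir: {}

With emptyDir, every run starts with an empty local engine cache, so anything that should survive between runs has to come from a remote cache or from a persistent volume swapped in for dagger-storage.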

#

I hope this helps!
