#Dagger vs traditional CI


indigo birch
#

Starting a thread 🙂

dusk crane
#

@cunning night would be awesome if you could give us a simplified buildkite example of how you're running your pipelines so we can take a look and give the best advice about how you could leverage Dagger

#

particularly highlighting the current pain you/your team is going through and what type of solutions you're aiming for

cunning night
#

definitely

#

working on anonymizing it...nothing terribly sensitive but 🤷‍♂️

#

haha dang this is gonna be slightly embarrassing, and I just did a bunch of work on this

indigo birch
#

No worries, we appreciate the time you're investing in sharing all this with us

cunning night
#

The dependabot stuff is because we're happy with our rules & specs around dependency updates, so we usually just merge version bumps and do a quick smoke test if specs pass, so we don't deploy those to staging

#

I need to get us off of some of the buildkite plugins, and clean up some duplication - but part of my frustration with YAML is that it's pretty tedious to comb through and find what I can dedupe

#

oh, concurrency groups are another significant issue, forgot to mention those

#

we use a shared RDS instance for all the staging envs, the create-db step only actually does anything if we haven't made a db for the branch yet, but it's a very bad time if you get a few branches trying to create themselves at once

#

What I want is to take all that crazy, vendor-specific DAG stuff around retries, parallelism, concurrency, timeouts, and dependencies, and implement them in a real programming language, preferably a statically-typed one
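
For illustration, here is a minimal TypeScript sketch of what retries, timeouts, and bounded parallelism could look like as ordinary typed functions. Nothing in it is a Buildkite or Dagger API: `Step`, `withRetry`, `withTimeout`, and `runWithLimit` are hypothetical names chosen for this example.

```typescript
// Hypothetical helpers: none of these names come from Buildkite or Dagger.
type Step<T> = () => Promise<T>;

// Re-run a step until it succeeds or the attempt budget is spent.
async function withRetry<T>(step: Step<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Wrap a step so it rejects if it has not settled within `ms` milliseconds.
function withTimeout<T>(step: Step<T>, ms: number): Step<T> {
  return () =>
    new Promise<T>((resolve, reject) => {
      const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
      step().then(
        (value) => { clearTimeout(timer); resolve(value); },
        (err) => { clearTimeout(timer); reject(err); },
      );
    });
}

// Run steps with at most `limit` in flight at once, keeping result order.
async function runWithLimit<T>(steps: Step<T>[], limit: number): Promise<T[]> {
  const results: T[] = new Array(steps.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < steps.length) {
      const index = next++;
      results[index] = await steps[index]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, steps.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

The helpers compose, so a step that should be retried three times with a one-second budget per attempt could be written as `withRetry(withTimeout(step, 1000), 3)`. This only covers concurrency inside one process; it says nothing about spreading work across machines.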

cunning night
#
```typescript
import Client, { connect } from "@dagger.io/dagger"

// Wishlist sketch, not working code: useContext, setEnv, withEnv, run,
// runProdDeps and annotateBuild are imagined helpers.

const withSource = (client: Client) => {
  const buildContext = useContext('build')
  return client.container().from("base").withExec(['git', 'checkout', buildContext.commit])
}

const builder = (client: Client) => {
  const envContext = useContext('env')
  return withSource(client).setEnv(envContext.env)
}

const withImage = (image: string, client: Client) => {
  return withSource(client).from(image)
}

const runTestDeps = (parent: Promise<string>, client: Client) => {
  return new Promise((resolve, reject) => {
    parent
      .then((imageName: string) => {
        const promises: Promise<unknown>[] = []
        // one rspec shard per PARALLEL_JOB index in 0..3
        for (let i = 0; i <= 3; i++) {
          promises.push(withImage(imageName, client).withEnv('PARALLEL_JOB', i).run('bundle exec rspec'))
        }
        return Promise.all(promises)
      })
      .then(resolve)
      .catch(reject)
  })
}

// initialize Dagger client
connect(async (client: Client) => {
  const promises: Promise<unknown>[] = []
  const context = useContext('build')

  // dependabot branches skip the prod build
  if (context.branch !== 'dependabot') {
    const prodBuild = builder(client).run('docker/build-prod')
    promises.push(runProdDeps(prodBuild, client))
  }

  const testBuild = builder(client).run('docker/build-test')

  promises.push(runTestDeps(testBuild, client))
  const output = await Promise.all(promises)
  await annotateBuild(output)

  // print output
  console.log(`Build ${context.buildNumber} completed successfully`)
})
```
#

something like that but with less hideous promise/await semantics 😆

dusk crane
#

gotta run folks, it's quite late here in Uruguay. I'll try to check this out briefly during the weekend or come back first thing on Monday. Have a good weekend!

indigo birch
#

@cunning night TypeScript is your preferred choice of language for this, correct?

dusk crane
#

👋 was just giving a quick look at this. Some questions and preliminary thoughts:

  • How does parallelization work? Do all the steps get executed in parallel by default unless a depends_on key is used?
  • re: taking all the vendor-specific things out, that one won't be straightforward. As we mentioned before, Dagger currently doesn't handle build infrastructure, so it doesn't understand the concept of "split this DAG across X workers". Having said that, what we've realized is that a lot of users set up parallelization in their CI not for performance reasons, but because the DAG is sometimes easier to visualize and specific steps are easier to retry when they fail.

As an example, here's a GHA pipeline https://github.com/salaboy/fmtok8s-frontend/blob/main/.github/workflows/ci_workflow.yml#L52 which I recently helped migrate to Dagger https://github.com/salaboy/fmtok8s-frontend/blob/main/.github/workflows/dagger_workflow.yml. You'll notice that the Dagger pipeline doesn't make use of multiple parallel jobs because they aren't actually needed. The original pipeline design of using multiple jobs is mostly a convenience (which may cause other problems and has some drawbacks) so it can be visualized and retried in a more friendly way.

I see your example pipeline also has a bunch of depends_on directives. What I'd love to know, maybe having a quick chat tomorrow, is how this parallelization works for you in practice.

As a last note, setting parallelization aside, there are indeed a lot of other things, like retries and plugins, that could be ported to Dagger so you can not only run them locally but also get better reusability across your codebase.

cunning night
# dusk crane 👋 was just giving a quick look at this. Some questions and preliminary thoughts...

> How does parallelization work? Do all the steps get executed in parallel by default unless a depends_on key is used?

correct

> I see your example pipeline also has a bunch of depends_on directives. What I'd love to know, maybe having a quick chat tomorrow, is how this parallelization works for you in practice.

I mean, in practice it works to do what parallelization is intended to do: reduce wall-clock time. In aggregate that pipeline uses about 40 minutes of worker time, but completes in 9 minutes of wall-clock time. A fast step failing and retrying will often have no impact at all on the total runtime of the pipeline, because everything is executed in parallel where possible. You'll notice that a large chunk of the depends_on directives depend on one of the initial build steps, so while virtually all of the steps depend on something, we still get a lot of parallelization in practice
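
To make the worker-time versus wall-clock distinction concrete, here is a small TypeScript sketch that computes both for a toy DAG. The step names and durations are invented for illustration and are not taken from the real pipeline; it also assumes unlimited workers and no dependency cycles.

```typescript
// Step names and durations below are invented for illustration.
interface StepSpec {
  minutes: number;
  dependsOn: string[];
}

// Wall-clock finish time of a step = its own duration plus the latest
// finish time among its dependencies (assuming unlimited workers).
function finishTime(
  name: string,
  steps: Record<string, StepSpec>,
  memo: Map<string, number> = new Map(),
): number {
  const cached = memo.get(name);
  if (cached !== undefined) return cached;
  const step = steps[name];
  const start = Math.max(0, ...step.dependsOn.map((dep) => finishTime(dep, steps, memo)));
  const finish = start + step.minutes;
  memo.set(name, finish);
  return finish;
}

const pipeline: Record<string, StepSpec> = {
  build: { minutes: 2, dependsOn: [] },
  rspec: { minutes: 5, dependsOn: ["build"] },
  jest: { minutes: 7, dependsOn: ["build"] },
  lint: { minutes: 1, dependsOn: ["build"] },
  deploy: { minutes: 3, dependsOn: ["build"] },
};

// Aggregate worker time: every step's duration added up.
const workerMinutes = Object.values(pipeline).reduce((sum, s) => sum + s.minutes, 0);
// Wall-clock time: the latest finish time, i.e. the longest dependency chain.
const wallClockMinutes = Math.max(
  ...Object.keys(pipeline).map((name) => finishTime(name, pipeline)),
);

console.log({ workerMinutes, wallClockMinutes }); // { workerMinutes: 18, wallClockMinutes: 9 }
```

Because almost everything hangs off the single `build` step, the wall-clock time is set by the longest chain (`build` then `jest`) rather than by the sum of all steps.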
#

> The original pipeline design of using multiple jobs is mostly a convenience (which may cause other problems and has some drawbacks) so it can be visualized and retried in a more friendly way.

That may be the case for such a small pipeline, it's not the case for mine. For instance, the ABQ Test/ABQ Work steps will fully consume 3 c6i.2xlarge instances for ~5 minutes. The Jest step will do the same for 7.5 minutes. The deployment tasks (staging db, subdomain, ecs deployment, proxy config) are all a bit variable since they reach out to external services and interact with them. Little resource usage, but high latency. Having my team wait 50 minutes because I 'rolled the DAG into dagger' instead of 9 is not an option 😆

indigo birch
#

Yeah, there are definitely workflows that benefit from parallelism across machines, and we’re working on adding the hooks necessary for Dagger to handle that automatically, the same way it handles caching automatically today. Until those hooks exist, the pragmatic approach is to rely on your CI as a loader, and manually split dagger invocations into parallel CI jobs, to get those 3 2xlarge boxes you need

cunning night
#

oh, on a related note, how does dagger play with compose?

#

do I need to do compose-inside-docker?

indigo birch
#

At the same time it is very common for workflows to get vastly over-allocated compute, while still suffering from suboptimal concurrency, because spawning more VMs is the only tool available, and it’s a very coarse one. So pipelines end up too expensive and too slow. In those cases, simply switching to Dagger on only one machine (possibly a huge one if needed) may often make it magically faster and cheaper, simply by reducing waste

cunning night
#

fair enough. I am definitely eyeing adding a second buildkite stack of tiny workers so I can say 'these steps don't need compute'

indigo birch
#

It’s kind of like switching from a hand made C++ parallel computing framework, to Goroutines. Doesn’t magically make your code faster, but high likelihood that it will eventually, because it’s easier to make more of your code more concurrent without being a wizard

cunning night
#

having a c6i.2xl sitting around waiting for ECS to deploy isn't the bestest thing 😆

#

But still, those steps need to be in separate VMs, I can't just make them run serially on one box

indigo birch
#

If I may ask: do you allocate compute per job, or a fixed allocation per project?

indigo birch
cunning night
#

per job practically speaking. Our project has a max instance count but that's just because it seems safe

indigo birch
#

(but at the moment, only across the CPUs of a single machine)

cunning night
#

Realistically I could set it to be unbounded and we'd spend the same amount of money

indigo birch
#

this is on Github’s runners? Sorry if you already said it

cunning night
#

We use the buildkite cloudformation stack - it creates an EC2 auto-scale group that uses spot pricing. When there are jobs waiting for agents, the ASG grows, when the queue is empty it shrinks to 0

#

the ASG launches buildkite's agent AMI in the instance types we select

indigo birch
#

🤦‍♂️ right buildkite, sorry got mixed up with another thread

cunning night
# indigo birch that’s the thing with Dagger, everything that can run in parallel, will run in p...

That doesn't really help a lot in this scenario. Say I put 3 tasks that can be run in parallel into dagger. The runner gets preempted. Now all 3 tasks are going to be re-run. One task fails? Again, all 3 get re-run. Splitting the DAG up between two tools ends up being more of a liability than anything else. If I really have 3 tasks that can run on one runner at one time, I can just do `command1 & command2 & command3 &` (which, in fact, we do in a few of our current steps)

indigo birch
#

I was just responding to “I need multiple VMs, I can’t just run them sequentially on the same machine”.

Of course there are other ways to run tasks concurrently on one machine. A shell script is one of them 🙂 There are various ways in which dagger would advantageously replace that shell script, but ultimately it depends on what you’re comfortable with.

#

Note that in your example you are in fact splitting your DAG up between 2 tools: your CI runner, and bash 🙂

cunning night
#

@indigo birch @dusk crane any updates on this?

indigo birch
#

@cunning night 👋 can you refresh my memory? Updates on what specifically?

cunning night
#

mm, I was expecting some clarification/explanation as to how dagger would improve my existing CI as it stands today. So far it seems like dagger can't address the pain points in the actual DAG, just let me write my bash scripts in something non-bash which I already have the option to do (and am doing in a few places)

indigo birch
#

What do you mean by “address the pain points in the actual DAG”? And what problems do you have in your CI today that you were hoping Dagger might solve?

cunning night
#

😅 probably faster to read the thread? The stuff inside the bash scripts isn't the problem - it's the ci-vendor-specific program-in-yaml-but-yaml-is-not-a-programming-language stuff that's the problem. That's why marco asked for my buildkite config ^

indigo birch
#

I did read the thread, and I'm looking for a way to avoid repeating it all over again

#

Dagger gives you an API to construct a new DAG, and an engine to run it

#

The python/js/go code you write is more than the contents of a node in the DAG. It’s the controller logic for a whole new DAG.

#

It just happens that Dagger can be embedded inside another tool's DAG (for example buildkite). This is because CI tools, even though in theory they run your DAG, are not capable of modeling and running the whole DAG. That's why you need makefiles, shell scripts etc: to take over from CI

cunning night
#

so in my existing setup, dagger lets me replace my existing step scripts, but that's it?

indigo birch
#

It depends on what your scripts do

#

Typically Dagger would replace the terrible glue that you wish you didn’t have to read and maintain

cunning night
#

that would be the YAML 😆

indigo birch
#

That may be yaml, scripts, or a mix of both

#

depends on the particular flavor of glue

#

but yeah it replaces a lot of yaml typically

cunning night
#

ok, so, given the yaml above, what will dagger replace?

#

I can provide just about any of the step scripts if you need them

indigo birch
#

could you paste it inline or in a gist please? I’m on my phone and it won’t let me read a .yaml file

cunning night
#

it's bigger than the inline limit 😆

#
```bash
#!/bin/bash
set -e

# Wait up to 5 minutes for Elasticsearch to answer health checks.
timeout 300 bash -c "until curl --silent --output /dev/null http://$ES_HOST:9201/_cat/health?h=st; do printf '.'; sleep 5; done; printf '\n'"

bundle exec rake db:drop db:setup

# From here on, keep going after failures so results and coverage are still collected.
set +e
mkdir -p junit
#ABQ_LOG=abq=trace abq test \
#  --worker ${BUILDKITE_PARALLEL_JOB} \
#  --run-id $RWX_RUN_ID \
#  --reporter junit-xml=junit/rspec_${BUILDKITE_PARALLEL_JOB}.xml \
#  --reporter rwx-v1-json=junit/rspec_${BUILDKITE_PARALLEL_JOB}.json \
#  -- bundle exec rspec

RWX_JSON="junit/rspec_${BUILDKITE_PARALLEL_JOB}.json"

ABQ_LOG=abq=trace captain run \
  --suite-id cm-api-rspec \
  --test-results "$RWX_JSON" \
  -- abq test \
  --worker "${BUILDKITE_PARALLEL_JOB}" \
  --run-id "${RWX_RUN_ID}-${BUILDKITE_STEP_KEY}" \
  --reporter "junit-xml=junit/rspec_${BUILDKITE_PARALLEL_JOB}.xml" \
  --reporter "rwx-v1-json=$RWX_JSON" \
  -- bundle exec rspec
CODE=$?
echo "ABQ/Captain exited with code $CODE"
mkdir -p coverage_results

cp coverage/.resultset.json "coverage_results/.resultset-${BUILDKITE_PARALLEL_JOB}.json"
cp coverage/lcov/content-manager.lcov "coverage_results/content-manager.${BUILDKITE_PARALLEL_JOB}.lcov"
# Remap exit code 1 to 2; pass every other code through unchanged.
if [[ $CODE -eq 1 ]]
then
  exit 2
else
  exit $CODE
fi
```
^ about the most complex bash script there, the ruby steps are a bit more complex because they leverage the AWS SDK
indigo birch
#

Those CI yaml files are really something… I find them unreadable

cunning night
#

the choir, preaching to it you are 😆

indigo birch
#

It looks like roughly speaking they orchestrate executing shell scripts and building docker images in a certain sequence and under certain conditions

cunning night
#

I even tried going through and de-duping stuff, but the fact that you can't append to or merge arrays in yaml means I didn't actually save that much LOC :/
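
As a rough illustration of the de-duplication that plain YAML makes awkward, here is a TypeScript sketch that appends to shared arrays when building step definitions. The field names mirror common Buildkite step keys (`label`, `command`, `plugins`, `depends_on`, `timeout_in_minutes`), but the `StepDef` shape, the `step` helper, the plugin names, and the commands are all placeholders made up for this example.

```typescript
// Placeholder shape: a simplified stand-in for a CI step definition.
interface StepDef {
  label: string;
  command: string;
  plugins: string[];
  depends_on: string[];
  timeout_in_minutes?: number;
}

// Shared defaults live in one place (placeholder plugin names).
const basePlugins = ["shared-plugin-a", "shared-plugin-b"];

// Each step appends to the shared arrays instead of repeating them.
function step(label: string, command: string, extra: Partial<StepDef> = {}): StepDef {
  return {
    timeout_in_minutes: 10,
    ...extra,
    label,
    command,
    plugins: [...basePlugins, ...(extra.plugins ?? [])],
    depends_on: ["build", ...(extra.depends_on ?? [])],
  };
}

const rspecStep = step("rspec", "bin/ci/rspec", { plugins: ["extra-plugin-c"] });
const deployStep = step("deploy", "bin/ci/deploy", {
  depends_on: ["rspec"],
  timeout_in_minutes: 20,
});

// Serializing the result would give a step list a CI tool could consume.
console.log(JSON.stringify([rspecStep, deployStep], null, 2));
```

The point is only that array spreading gives the "append to a shared list" operation that YAML anchors and merge keys do not; whether the output is uploaded as generated pipeline config or used some other way is a separate choice.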

indigo birch
#

I guess I would try to nuke that file (possibly in incremental chunks if it's at all possible) and replace it with controller code that orchestrates running the same scripts and dockerfiles

cunning night
#

Does Dagger let me do that?

#

I've poked at buildkite support a bit, since they support uploading steps dynamically, but uploaded steps don't run until the uploader step has completed, so I can't like upload a step, poll its outcome, then upload another step conditionally
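
The "run a step, look at its outcome, then decide what runs next" flow described here is easy to express in ordinary code. The sketch below is only an illustration of that control flow: `runStep` is a hypothetical stand-in that returns canned results, and the step names are invented. It does not show how the steps would be dispatched to workers.

```typescript
// Hypothetical step runner: a real one would shell out or call an API;
// this one just returns a canned result so the control flow is visible.
interface StepResult {
  name: string;
  ok: boolean;
}

async function runStep(name: string, ok: boolean): Promise<StepResult> {
  return { name, ok };
}

// Returns the names of the steps that ran, in order.
async function controller(specsPass: boolean): Promise<string[]> {
  const ran: string[] = [];

  const build = await runStep("build", true);
  ran.push(build.name);

  const specs = await runStep("specs", specsPass);
  ran.push(specs.name);

  // The next step is chosen only after the previous outcome is known,
  // which a step list uploaded up front cannot express.
  if (specs.ok) {
    ran.push((await runStep("deploy-staging", true)).name);
  } else {
    ran.push((await runStep("annotate-failure", true)).name);
  }
  return ran;
}
```

Calling `controller(true)` runs the build, specs, and staging deploy in that order, while `controller(false)` ends with the failure annotation instead.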

indigo birch
#

I can take a closer look at your config today, hopefully my buildkite noobness will not get in the way too much

cunning night
#

ok, everything I saw around orchestration + dagger indicated that was a no-go

indigo birch
#

Do you mean orchestration of the underlying job runners?

cunning night
#

maybe. The way buildkite works is, an AWS lambda listens for pending jobs on buildkite's api. When jobs are present, it scales up an ASG that runs workers. Workers then pull jobs from the queue and execute them. Theoretically what I could do is have a build just grab 10 workers and have them run like dagger-slave --worker=${BUILDKITE_PARALLEL_JOB} - this would likely be hideously inefficient if I allocated enough workers to preserve my existing wall-clock time though

#

the number of runners a build needs will oscillate between 2 and 10 during its execution

cunning night
#

just to be clear I'm not even really married to buildkite - I do absolutely need a CI env that scales dynamically and is at least as efficient, but I'm willing to entertain other options that provide a better DX

indigo birch
#

How are jobs dispatched to workers? I assumed a simple 1-1 mapping (like in Github Actions for example) but you mention "number of runners for a build" which implies something more?

dusk crane
#

@cunning night @indigo birch what I commented earlier (#1066081182938308608) pretty much still applies.

TL;DR:

Pros:

  • Define your pipelines as code
  • Execute your pipelines locally
  • Better code re-use
  • Better caching
  • Better DAG parallelization

"Cons":

  • You can't remove your CI runner entirely
  • There's still going to be a tiny bridge to orchestrate infrastructure parallelization.
  • You'll have to migrate some of the plugins to code yourself. This can be progressive, it doesn't need to happen in a one-shot migration; you can start with the plugins that are unmaintained or a burden to use and move on from there.

indigo birch
#

Assuming a typical job dispatch system (similar to other CI platforms), the ideal pattern would be to embed a dagger engine in each buildkite runner. And dispatch jobs with buildkite the usual way. This means keeping just enough of the yaml to keep the job dispatching logic, but "carve it out" such that each job is simply running your dagger controller
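
One way to picture the "carved out" arrangement is a single controller entry point that each CI job calls with its own step key. The TypeScript sketch below assumes the key is passed in by the caller (for example from the `BUILDKITE_STEP_KEY` variable seen in the script above); `stepTable`, `runStepByKey`, and the step names and bodies are hypothetical placeholders, not Dagger or Buildkite APIs.

```typescript
// Hypothetical step bodies: in practice each would run a pipeline.
type StepFn = () => Promise<string>;

const stepTable = new Map<string, StepFn>([
  ["build-test", async () => "built test image"],
  ["rspec", async () => "ran rspec"],
  ["jest", async () => "ran jest"],
]);

// Look up and run the step named by the key the CI job was given.
async function runStepByKey(stepKey: string | undefined): Promise<string> {
  const step = stepKey === undefined ? undefined : stepTable.get(stepKey);
  if (step === undefined) {
    throw new Error(`unknown step key: ${String(stepKey)}`);
  }
  return step();
}
```

Under this arrangement the CI config keeps only the dispatch information (which jobs exist, what they depend on, which queue they use), and every job runs the same command with a different key.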

cunning night
# indigo birch How are jobs dispatched to workers? I assumed a simple 1-1 mapping (like in Gith...

There's a queue of jobs from all builds - workers just work down the queue (you can have named queues, which I'm planning to exploit for different instance sizes). The number of runners comment is because e.g. the build starts using one worker to build the docker image - once that's done there are around 10 steps that execute concurrently, then some of those steps trigger other steps, etc. So a given build will be using from 2-10 workers at any given time

indigo birch
#

In this context, what's inside one "build" ? Is it the entire contents of that yaml file?

cunning night
#

yeah, that yaml file is run once per build

#

the buildkite nomenclature is a 'pipeline'

indigo birch
#

(in a call but will reply right after, sorry)

dusk crane
#

I mean.. after all buildkite is like github actions

dusk crane
#

in your case you'll still have multiple steps orchestrated by buildkite, and each step will be a Dagger pipeline itself

cunning night
#

this is what timing looks like

cunning night
#

@indigo birch seemed to think that dagger would let me do orchestration?

dusk crane
#

Dagger won't take care of the build infrastructure for you or give you any hooks to handle that

#

I think he's referring to the same thing I'm talking about, just using "proper" terms

#

but, unless I'm missing something, you won't be able to replace your buildkite job orchestration logic by using Dagger

cunning night
#

well by not solving the infrastructure question, Dagger effectively can't provide caching/immutability 😆 there's no guarantee that the worker that picks up a given step during one build will pick it up on a second build

dusk crane
#

the cache can live in the OCI registry

cunning night
#

hmm ok, so like, what would be cached in my pipeline? and how does dagger know when cache is valid?

dusk crane
# cunning night hmm ok, so like, what would be cached in my pipeline? and how does dagger know w...

you can decide what's cached and what you want to export. And Dagger does cache invalidation at the DAG vertex level through content addressable artifacts.

e.g.: You can export/import an OCI image with your node_modules dependencies, and if something changed in your package.json, Dagger will re-trigger that DAG node and install the missing dependency. This works the same way the native buildkit cache works.

This export/import cache is very useful for stateless runners while preserving immutability and optimal caching. Optionally, if you have stateful nodes and your CI runner (buildkite in this case) can pin specific Dagger pipelines to the same node each time, you don't need to export/import the cache, as Dagger will use the local node disk cache efficiently

cunning night
# dusk crane you can decide what's cached and what you want to export. And Dagger does cache ...

I mean, node_modules caching is kinda table stakes isn't it? We're already caching that kind of stuff (node_modules, bundler, system deps, etc) via a multistage dockerfile, and using --link to further reduce cache invalidations. The other steps are all dependent on the actual code being run through the pipeline (test runs, inspection) so that cache would necessarily be invalidated whenever the commit hash changed. Are you suggesting I do something like if (File.glob('*/**/*.js').changed?) { run_js_specs() }

dusk crane
#

> The other steps are all dependent on the actual code being run through the pipeline (test runs, inspection) so that cache would necessarily be invalidated whenever the commit hash changed. Are you suggesting I do something like `if (File.glob('*/**/*.js').changed?) { run_js_specs() }`

This is exactly where Dagger / Buildkit caching excels. Since your cache is content addressable, it knows about the files and steps involved in generating an OCI layer. In this particular case, Dagger will realize that since something in `*/**/*.js` changed, there will be a checksum mismatch on that DAG vertex, and it'll run that vertex operation (run_js_specs()) again automatically. You don't need to do anything about it, since it works the same way as running the Dockerfile locally

#

cache import/export basically exports all the OCI layers along with the content-addressable SHAs of their inputs, so it can tell whether something changed and re-compute it
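
To illustrate the general idea of content-addressed invalidation described above, here is a toy TypeScript sketch. It is not how Dagger or BuildKit is implemented: FNV-1a stands in for a real content hash such as SHA-256, the file contents and the `yarn jest` operation string are invented, and `digest`, `vertexKey`, `only`, and `jestKey` are hypothetical names.

```typescript
// FNV-1a is a tiny stand-in for a real content hash such as SHA-256.
function digest(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

type Files = Record<string, string>;

// Key for one DAG vertex: the operation plus the content of the files it reads.
function vertexKey(operation: string, inputs: Files): string {
  const parts = Object.keys(inputs)
    .sort()
    .map((path) => `${path}:${digest(inputs[path])}`);
  return digest([operation, ...parts].join("\n"));
}

// Keep only the files a vertex actually consumes (here: by extension).
function only(files: Files, extension: string): Files {
  const picked: Files = {};
  for (const path of Object.keys(files)) {
    if (path.endsWith(extension)) picked[path] = files[path];
  }
  return picked;
}

// Three invented commits: B changes only Ruby, C changes only JavaScript.
const commitA: Files = { "app/ui.js": "render()", "app/model.rb": "class Model; end" };
const commitB: Files = { ...commitA, "app/model.rb": "class Model; def x; end; end" };
const commitC: Files = { ...commitA, "app/ui.js": "render(props)" };

const jestKey = (files: Files): string => vertexKey("yarn jest", only(files, ".js"));

console.log(jestKey(commitA) === jestKey(commitB)); // true: only Ruby changed, so the Jest vertex is a cache hit
console.log(jestKey(commitA) === jestKey(commitC)); // false: a .js input changed, so the Jest vertex re-runs
```

The commit hash never enters the key; only the operation and the content of the selected inputs do, which is why a new commit does not by itself invalidate every vertex.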

dusk crane
#

@cunning night making more sense now? I mean..we'd love to ultimately solve the CI infra part eventually.. but you know.. finite resources and we needed to start somewhere. Having said that, we're confident that having a great DX, a programmable engine with caching and container native support from the start is a great product. If you think we're missing something at this stage, happy to hear feedback

cunning night
# dusk crane @cunning night making more sense now? I mean..we'd love to ultimately so...

I mean I think I understand what you're saying. Looking at my pipeline I could maybe cache/skip jest unit tests if none of those files changed... and possibly the same thing on the ruby side. That'd be only if Dagger only hashes WithDirectory etc and doesn't care if the hash of the underlying docker image has changed. It'd also have to invalidate the cache if any of the docker-compose services' hashes changed (e.g. if we only pin a major version of redis and the minor version changes on a subsequent build - ruby specs need to re-run). But those savings seem to be more due to the fact that this project is essentially a mini monorepo so there are some chunks that can be easily skipped. It'd probably be different if this pipeline had more artifacts consumed by subsequent steps, but right now we've got 1 major artifact (the docker image) produced at the start, and all the major caching is already handled there. Not that skipping js tests/ruby tests conditionally is insignificant, but...

dusk crane
#

> That'd be only if Dagger only hashes WithDirectory etc and doesn't care if the hash of the underlying docker image has changed.

If you mean something like using the --link flag, yes, Dagger works that way by default.

> It'd also have to invalidate the cache if any of the docker-compose services' hashes changed (e.g. if we only pin a major version of redis and the minor version changes on a subsequent build - ruby specs need to re-run).

This is something currently outside the scope of Dagger, but our "services" API is on the way (cc @low cove ) and I wouldn't be surprised if this is requested by users. Good thing is that docker-compose doesn't handle this currently either.