#Dagger vs traditional CI
@cunning night would be awesome if you could give us a simplified buildkite example of how you're running your pipelines so we can take a look and give the best advice about how you could leverage Dagger
particularly highlighting the pain you/your team is currently going through and what type of solutions you're aiming for
definitely
working on anonymizing it...nothing terribly sensitive but 🤷‍♂️
haha dang this is gonna be slightly embarrassing, and I just did a bunch of work on this
No worries, we appreciate the time you're investing in sharing all this with us
ok thar she be
The dependabot stuff is because we're happy with our rules & specs around dependency updates, so we usually just merge version bumps and do a quick smoke test if specs pass, so we don't deploy those to staging
I need to get us off of some of the buildkite plugins, and clean up some duplication - but part of my frustration with YAML is that it's pretty tedious to comb through and find what I can dedupe
oh, concurrency groups are another significant issue, forgot to mention those
we use a shared RDS instance for all the staging envs, the create-db step only actually does anything if we haven't made a db for the branch yet, but it's a very bad time if you get a few branches trying to create themselves at once
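The create-db race described above is what a concurrency group is meant to prevent. A minimal TypeScript sketch of the idea, assuming everything runs in a single process: jobs that share a key run one at a time, while jobs with different keys are unaffected. `ConcurrencyGroup` and the `create-db` key are hypothetical names, and a real CI has to enforce this across agents and builds, which this sketch does not attempt.

```typescript
// Serialize async jobs that share a key, like a CI concurrency group with a limit of 1.
// In-process only: this is an illustration, not a replacement for the CI feature.
class ConcurrencyGroup {
  // For each key, a promise that settles when the latest queued job has finished.
  private tails = new Map<string, Promise<unknown>>()

  run<T>(key: string, job: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve()
    // Start only after the previous job with this key has settled.
    const next = previous.then(() => job())
    // Store a version that never rejects, so one failure does not block later jobs.
    this.tails.set(key, next.catch(() => undefined))
    return next
  }
}
```

With this, several branches calling something like `group.run("create-db", () => createDb(branch))` would create their databases one after another instead of all at once (`createDb` being whatever the existing step does).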
What I want is to take all that crazy, vendor-specific DAG stuff around retries, parallelism, concurrency, timeouts, and dependencies, and implement them in a real programming language, preferably a statically-typed one
```ts
import Client, { connect } from "@dagger.io/dagger"

// NOTE: wishlist pseudocode. useContext, awaitAll, annotateBuild, runProdDeps,
// setEnv, withEnv and run are hypothetical helpers, not part of the Dagger SDK.
const withSource = (client: Client) => {
  const buildContext = useContext('build')
  return client.container().from("base").withExec(['git', 'checkout', buildContext.commit])
}

const builder = (client: Client) => {
  const envContext = useContext('env')
  return withSource(client).setEnv(envContext.env)
}

const withImage = (image: string, client: Client) => {
  return withSource(client).from(image)
}

const runTestDeps = (parent: Promise<string>, client: Client) => {
  return new Promise((resolve, reject) => {
    parent.then((imageName: string) => {
      const promises: Promise<unknown>[] = []
      // one rspec run per parallel job (0..3)
      for (let i = 0; i <= 3; i++) {
        promises.push(withImage(imageName, client).withEnv('PARALLEL_JOB', i).run('bundle exec rspec'))
      }
      awaitAll(promises).then(resolve).catch(reject)
    }).catch(reject)
  })
}

// initialize Dagger client
connect(async (client: Client) => {
  const promises: Promise<unknown>[] = []
  const context = useContext('build')
  if (context.branch !== 'dependabot') {
    const prodBuild = builder(client).run('docker/build-prod')
    promises.push(runProdDeps(prodBuild, client))
  }
  const testBuild = builder(client).run('docker/build-test')
  promises.push(runTestDeps(testBuild, client))
  const output = await awaitAll(promises)
  await annotateBuild(output)
  // print output
  console.log(`Build ${context.buildNumber} completed successfully`)
})
```
something like that but with less hideous promise/await semantics 😆
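For comparison, the wish stated above (dependencies, parallelism, retries and timeouts expressed as typed code instead of YAML) can be sketched in plain TypeScript with no vendor API at all. This is only a sketch under assumptions: the `Step` shape, `runStep` and `runPipeline` are hypothetical names, everything runs in one process, and the dependency graph is assumed to have no cycles. It says nothing about spreading steps across machines.

```typescript
// A pipeline step. All names here are illustrative, not a real CI or Dagger API.
type Step = {
  name: string
  dependsOn?: string[] // steps that must succeed first
  retries?: number     // extra attempts after the first failure
  timeoutMs?: number   // per-attempt time limit
  run: () => Promise<void>
}

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    work.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) },
    )
  })
}

// Run one step, retrying up to `retries` extra times before giving up.
async function runStep(step: Step): Promise<void> {
  const attempts = 1 + (step.retries ?? 0)
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const work = step.run()
      await (step.timeoutMs ? withTimeout(work, step.timeoutMs) : work)
      return
    } catch (err) {
      if (attempt === attempts) throw err
    }
  }
}

// Start every step as soon as its dependencies have finished, so independent
// steps overlap. Resolves with step names in completion order.
async function runPipeline(steps: Step[]): Promise<string[]> {
  const byName = new Map(steps.map((s) => [s.name, s]))
  const started = new Map<string, Promise<void>>()
  const finished: string[] = []

  const start = (name: string): Promise<void> => {
    const existing = started.get(name)
    if (existing) return existing
    const step = byName.get(name)
    if (!step) throw new Error(`unknown step: ${name}`)
    const promise = Promise.all((step.dependsOn ?? []).map((dep) => start(dep)))
      .then(() => runStep(step))
      .then(() => { finished.push(name) })
    started.set(name, promise)
    return promise
  }

  await Promise.all(steps.map((s) => start(s.name)))
  return finished
}
```

The type checker catches misspelled fields, and duplication becomes ordinary functions and arrays, which is the part YAML makes tedious.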
gotta run folks, it's quite late here in Uruguay. I'll try to check this out briefly during the weekend or come back first thing on Monday. Have a good weekend!
@cunning night typescript is your preferred choice of language for this correct?
I can read go fine if you prefer to work examples in it, just been a long time since I used it
👋 was just giving a quick look at this. Some questions and preliminary thoughts:
- How does parallelization work? Do all the `steps` get executed in parallel by default unless a `depends_on` key is used?
- re: taking all the vendor-specific things out, that one won't be straightforward since, as we mentioned before, Dagger currently doesn't handle build infrastructure, so it doesn't understand the concept of "split this DAG across X workers". Having said that, what we've realized is that a lot of users set up parallelization in their CI not for performance reasons, but because the DAG is sometimes easier to visualize and specific steps are easier to retry when they fail.
As an example, here's a GHA pipeline https://github.com/salaboy/fmtok8s-frontend/blob/main/.github/workflows/ci_workflow.yml#L52 which I recently helped migrate to Dagger: https://github.com/salaboy/fmtok8s-frontend/blob/main/.github/workflows/dagger_workflow.yml. You'll notice that the Dagger pipeline doesn't make use of multiple parallel jobs because they aren't actually needed. The original pipeline design of using multiple jobs is mostly a convenience (which may cause other problems and has some drawbacks) so it can be visualized and retried in a more friendly way.
I see your example pipeline also has a bunch of `depends_on` directives. What I'd love to know, maybe in a quick chat tomorrow, is how this parallelization works for you in practice.
As a last note, setting parallelization aside, there are indeed a lot of other things, like retries and plugins, that could be ported to Dagger so you can not only run them locally, but also get better reusability across your codebase.
> How does parallelization work? Do all the `steps` get executed in parallel by default unless a `depends_on` key is used?
correct
> I see your example pipeline also has a bunch of `depends_on` directives. What I'd love to know, maybe in a quick chat tomorrow, is how this parallelization works for you in practice.
I mean, in practice it works to do what parallelization is intended to do: reduce wall-clock time. In aggregate that pipeline uses about 40 minutes of worker time, but completes in 9 minutes of wall-clock time. A fast step failing and retrying will often have no impact at all on the total runtime of the pipeline, because everything is executed in parallel where possible. You'll notice that a large chunk of the `depends_on` directives depend on one of the initial build steps, so while virtually all of the steps depend on something, we still get a lot of parallelization in practice
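The gap between worker time and wall-clock time follows from the shape of the DAG: worker time is the sum of all step durations, while the best-case wall-clock time (with enough workers) is the longest dependency chain. A small TypeScript sketch of that arithmetic, where the step names and durations are made up for illustration and are not the real pipeline's numbers:

```typescript
// Durations in minutes. Names and numbers are invented for illustration.
type TimedStep = { name: string; minutes: number; dependsOn?: string[] }

// Total worker time: every step's duration added up.
function workerMinutes(steps: TimedStep[]): number {
  return steps.reduce((sum, s) => sum + s.minutes, 0)
}

// Best-case wall-clock time with unlimited workers: the longest dependency chain.
// Assumes the graph has no cycles.
function wallClockMinutes(steps: TimedStep[]): number {
  const byName = new Map(steps.map((s) => [s.name, s]))
  const memo = new Map<string, number>()
  const finishTime = (name: string): number => {
    const cached = memo.get(name)
    if (cached !== undefined) return cached
    const step = byName.get(name)
    if (!step) throw new Error(`unknown step: ${name}`)
    // A step starts when its slowest dependency has finished.
    const startsAt = Math.max(0, ...(step.dependsOn ?? []).map((dep) => finishTime(dep)))
    const result = startsAt + step.minutes
    memo.set(name, result)
    return result
  }
  return Math.max(0, ...steps.map((s) => finishTime(s.name)))
}

const example: TimedStep[] = [
  { name: "build", minutes: 3 },
  { name: "rspec", minutes: 5, dependsOn: ["build"] },
  { name: "jest", minutes: 8, dependsOn: ["build"] },
  { name: "deploy-staging", minutes: 4, dependsOn: ["build"] },
]
```

With these invented numbers the steps add up to 20 worker-minutes, but the longest chain (build, then jest) takes 11 minutes. A retry of the 4-minute deploy step would not change the 11-minute total, which is the effect described above.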
> The original pipeline design of using multiple jobs is mostly a convenience (which may cause other problems and has some drawbacks) so it can be visualized and retried in a more friendly way.
That may be the case for such a small pipeline, it's not the case for mine. For instance, the ABQ Test/ABQ Work steps will fully consume 3 c6i.2xlarge instances for ~5 minutes. The Jest step will do the same for 7.5 minutes. The deployment tasks (staging db, subdomain, ecs deployment, proxy config) are all a bit variable since they reach out to external services and interact with them. Little resource usage, but high latency. Having my team wait 50 minutes because I 'rolled the DAG into dagger' instead of 9 is not an option 😆
Yeah, there are definitely workflows that benefit from parallelism across machines, and we’re working on adding the hooks necessary for Dagger to handle that automatically, the same way it handles caching automatically today. Until those hooks exist, the pragmatic approach is to rely on your CI as a loader, and manually split dagger invocations into parallel CI jobs, to get those 3 2xlarge boxes you need
oh, on a related note, how does dagger play with compose?
do I need to do compose-inside-docker?
At the same time it is very common for workflows to get vastly over-allocated compute, while still suffering from suboptimal concurrency, because spawning more VMs is the only tool available, and it’s a very coarse one. So pipelines end up too expensive and too slow. In those cases, simply switching to Dagger on only one machine (possibly a huge one if needed) may often make it magically faster and cheaper, simply by reducing waste
fair enough. I am definitely eyeing adding a second buildkite stack of tiny workers so I can say 'these steps don't need compute'
It’s kind of like switching from a hand made C++ parallel computing framework, to Goroutines. Doesn’t magically make your code faster, but high likelihood that it will eventually, because it’s easier to make more of your code more concurrent without being a wizard
having a c6i.2xl sitting around waiting for ECS to deploy isn't the bestest thing 😆
But still, those steps need to be in separate VMs, I can't just make them run serially on one box
If I may ask: do you allocate compute per job, or a fixed allocation per project?
that’s the thing with Dagger, everything that can run in parallel, will run in parallel, implicitly
per job practically speaking. Our project has a max instance count but that's just because it seems safe
(but at the moment, only across the CPUs of a single machine)
Realistically I could set it to be unbounded and we'd spend the same amount of money
this is on Github’s runners? Sorry if you already said it
We use the buildkite cloudformation stack - it creates an EC2 auto-scale group that uses spot pricing. When there are jobs waiting for agents, the ASG grows, when the queue is empty it shrinks to 0
the ASG launches buildkite's agent AMI in the instance types we select
🤦‍♂️ right, buildkite, sorry, got mixed up with another thread
That doesn't really help a lot in this scenario. Say I put 3 tasks that can be run in parallel into dagger. The runner gets preempted. Now all 3 tasks are going to be re-run. One task fails? Again, all 3 get re-run. Splitting the DAG up between two tools ends up being more of a liability than anything else. If I really have 3 tasks that can run on one runner at one time, I can just do `command1 & command2 & command3 &` (which, in fact, we do in a few of our current steps)
I was just responding to “I need multiple VMs, I can’t just run them sequentially on the same machine”.
Of course there are other ways to run tasks concurrently on one machine. A shell script is one of them 🙂 There are various ways in which dagger would advantageously replace that shell script, but ultimately it depends on what you’re comfortable with.
Note that in your example you are in fact splitting your DAG up between 2 tools: your CI runner, and bash 🙂
Indeed, not ideal! Just a hack necessitated by the current state of affairs.
@indigo birch @dusk crane any updates on this?
@cunning night 👋 can you refresh my memory? Updates on what specifically?
mm, I was expecting some clarification/explanation as to how dagger would improve my existing CI as it stands today. So far it seems like dagger can't address the pain points in the actual DAG, just let me write my bash scripts in something non-bash which I already have the option to do (and am doing in a few places)
What do you mean by “address the pain points in the actual DAG”? And what problems do you have in your CI today that you were hoping Dagger might solve?
😅 probably faster to read the thread? The stuff inside the bash scripts isn't the problem - it's the ci-vendor-specific program-in-yaml-but-yaml-is-not-a-programming-language stuff that's the problem. That's why marco asked for my buildkite config ^
I did read the thread, and looking for a way to avoid repeating it all over again
Dagger gives you an API to construct a new DAG, and an engine to run it
The python/js/go code you write is more than the contents of a node in the DAG. It’s the controller logic for a whole new DAG.
It just happens that Dagger can be embedded inside another tool’s DAG (for example buildkite). This is because CI tools, even though in theory they run your DAG, are not capable of modeling and running the whole dag. That’s why you need makefiles, shell scripts etc: to take over from CI
so in my existing setup, dagger lets me replace my existing step scripts, but that's it?
It depends on what your scripts do
Typically Dagger would replace the terrible glue that you wish you didn’t have to read and maintain
that would be the YAML 😆
That may be yaml, scripts, or a mix of both
depends on the particular flavor of glue
but yeah it replaces a lot of yaml typically
ok, so, given the yaml above, what will dagger replace?
I can provide just about any of the step scripts if you need them
could you paste it inline or in a gist please? I’m on my phone and it won’t let me read a .yaml file
it's bigger than the inline limit 😆
```bash
#!/bin/bash
set -e

# Wait up to 300 seconds for the health endpoint on $ES_HOST to respond.
timeout 300 bash -c "until curl --silent --output /dev/null http://$ES_HOST:9201/_cat/health?h=st; do printf '.'; sleep 5; done; printf '\n'"
bundle exec rake db:drop db:setup

# From here on, do not abort on errors; the exit code is handled explicitly below.
set +e
mkdir -p junit

#ABQ_LOG=abq=trace abq test \
#  --worker ${BUILDKITE_PARALLEL_JOB} \
#  --run-id $RWX_RUN_ID \
#  --reporter junit-xml=junit/rspec_${BUILDKITE_PARALLEL_JOB}.xml \
#  --reporter rwx-v1-json=junit/rspec_${BUILDKITE_PARALLEL_JOB}.json \
#  -- bundle exec rspec

RWX_JSON="junit/rspec_${BUILDKITE_PARALLEL_JOB}.json"
ABQ_LOG=abq=trace captain run \
  --suite-id cm-api-rspec \
  --test-results $RWX_JSON \
  -- abq test \
  --worker ${BUILDKITE_PARALLEL_JOB} \
  --run-id ${RWX_RUN_ID}-${BUILDKITE_STEP_KEY} \
  --reporter junit-xml=junit/rspec_${BUILDKITE_PARALLEL_JOB}.xml \
  --reporter rwx-v1-json=$RWX_JSON \
  -- bundle exec rspec
CODE=$?
echo "ABQ/Captain exited with code $CODE"

mkdir -p coverage_results
cp coverage/.resultset.json coverage_results/.resultset-${BUILDKITE_PARALLEL_JOB}.json
cp coverage/lcov/content-manager.lcov coverage_results/content-manager.${BUILDKITE_PARALLEL_JOB}.lcov

# Report exit code 1 as 2; pass every other exit code through unchanged.
if [[ $CODE -eq 1 ]]
then
  exit 2
else
  exit $CODE
fi
```
^ about the most complex bash script there; the ruby steps are a bit more complex because they leverage the AWS SDK
Those CI yaml files are really something… I find them unreadable
the choir, preaching to it you are 😆
It looks like roughly speaking they orchestrate executing shell scripts and building docker images in a certain sequence and under certain conditions
I even tried going through and de-duping stuff, but the fact that you can't append to or merge arrays in yaml means I didn't actually save that much LOC :/
I guess I would try to nuke that file (possibly in incremental chunks, if that's at all possible) and replace it with controller code that orchestrates running the same scripts and dockerfiles
Does Dagger let me do that?
I've poked at buildkite support a bit, since they support uploading steps dynamically, but uploaded steps don't run until the uploader step has completed, so I can't like upload a step, poll its outcome, then upload another step conditionally
Yes 😁
I can take a closer look at your config today, hopefully my buildkite noobness will not get in the way too much
ok, everything I saw around orchestration + dagger indicated that was a no-go
Do you mean orchestration of the underlying job runners?
maybe. The way buildkite works is, an AWS lambda listens for pending jobs on buildkite's api. When jobs are present, it scales up an ASG that runs workers. Workers then pull jobs from the queue and execute them. Theoretically what I could do is have a build just grab 10 workers and have them run like dagger-slave --worker=${BUILDKITE_PARALLEL_JOB} - this would likely be hideously inefficient if I allocated enough workers to preserve my existing wall-clock time though
the number of runners a build needs will oscillate between 2 and 10 during its execution
just to be clear I'm not even really married to buildkite - I do absolutely need a CI env that scales dynamically and is at least as efficient, but I'm willing to entertain other options that provide a better DX
How are jobs dispatched to workers? I assumed a simple 1-1 mapping (like in Github Actions for example) but you mention "number of runners for a build" which implies something more?
@cunning night @indigo birch it pretty much still applies to what I commented here #1066081182938308608 message.
TL;DR:
Pros:
- Define your pipelines as code
- Execute your pipelines locally
- Better code re-use
- Better caching
- Better DAG parallelization
"Cons":
- You can't remove your CI runner entirely
- There's still gonna be a tiny bridge to orchestrate infrastructure parallelization.
- You'll have to migrate some of the plugins to code yourself. This can be progressive; it doesn't need to happen in a one-shot migration. You can start with the plugins that are unmaintained or a burden to use, and move on from there.
Assuming a typical job dispatch system (similar to other CI platforms), the ideal pattern would be to embed a dagger engine in each buildkite runner. And dispatch jobs with buildkite the usual way. This means keeping just enough of the yaml to keep the job dispatching logic, but "carve it out" such that each job is simply running your dagger controller
There's a queue of jobs from all builds - workers just work down the queue (you can have named queues, which I'm planning to exploit for different instance sizes). The number of runners comment is because e.g. the build starts using one worker to build the docker image - once that's done there are around 10 steps that execute concurrently, then some of those steps trigger other steps, etc. So a given build will be using 2-10 workers at any given time
In this context, what's inside one "build" ? Is it the entire contents of that yaml file?
yeah, that yaml file is run once per build
the buildkite nomenclature is a 'pipeline'
how is 'dispatching logic' different from 'controller code that orchestrates' #1066081182938308608 message ?
(in a call but will reply right after, sorry)
Yeah I saw that, but theoretically this thread is about taking the concepts you outlined there and applying them to my IRL pipeline. Would you mind providing an example that simplifies the DAG defined in my pipeline in concrete terms?
gotcha. I'm a bit swamped right now, but will try to work on something with @indigo birch. Having said that, the example I posted above for github actions should give you an idea of what that YAML simplification looks like
I mean.. after all buildkite is like github actions
^ if you open both links on that message you'll see the benefits I'm highlighting
you'll notice that in the Dagger workflow, "almost" all steps are go run dagger.go instead of some custom actions yaml
yeah I read over those. They simplified the yaml by inlining work into a single worker, which won't work in my pipeline.
in your case you'll still have multiple steps orchestrated by buildkite, and each step will itself be a dagger pipeline
this is what timing looks like
the step orchestration complexity is specifically what I'm trying to remove from the YAML 😆 If I wanted to I could already write all my steps in go or ruby or whatever
@indigo birch seemed to think that dagger would let me do orchestration?
I don't think that's the case. Unless he knows something that I don't 🙂. Let's see what he says afterwards
Dagger won't take care of the build infrastructure for you or give you any hooks to handle that
I think he's referring to the same thing I'm talking about, just using "proper" terms
but, unless I'm missing something, you won't be able to replace your buildkite job orchestration logic by using Dagger
you could rewrite your steps, but you'll still be locked in to your CI for caching and immutability, and you'll still need some "glue" code to build your docker images and push them. Dagger solves this natively by allowing you to "progressively" migrate those pipelines to code
Not trying to convince you of anything, just clarifying the current use-case we're solving ATM
well by not solving the infrastructure question, Dagger effectively can't provide caching/immutability 😆 there's no guarantee that the worker that picks up a given step during one build will pick it up on a second build
How come? You don't need stateful machines to do immutable builds
the cache can live in the OCI registry
that's how dagger/buildkit export cache works: https://github.com/moby/buildkit#export-cache
hmm ok, so like, what would be cached in my pipeline? and how does dagger know when cache is valid?
you can decide what's cached and what you want to export. And Dagger does cache invalidation at the DAG vertex level through content addressable artifacts.
i.e: You can export/import an OCI image with your node_modules dependencies and if something changed in your package.json for example, Dagger will re-trigger that DAG node and install the missing dependency. This works the same way as the native buildkit cache works.
This export/import cache is very useful for stateless runners, since it preserves immutability and optimal caching. Optionally, if you have stateful nodes and your CI runner (buildkite in this case) can pin specific Dagger pipelines to the same node each time, you don't need to export/import the cache, as Dagger will use the node's local disk cache efficiently
I mean, node_modules caching is kinda table stakes isn't it? We're already caching that kind of stuff (node_modules, bundler, system deps, etc) via a multistage dockerfile, and using --link to further reduce cache invalidations. The other steps are all dependent on the actual code being run through the pipeline (test runs, inspection) so that cache would necessarily be invalidated whenever the commit hash changed. Are you suggesting I do something like if (File.glob('*/**/*.js').changed?) { run_js_specs() }
> The other steps are all dependent on the actual code being run through the pipeline (test runs, inspection) so that cache would necessarily be invalidated whenever the commit hash changed. Are you suggesting I do something like if (File.glob('*/**/*.js').changed?) { run_js_specs() }
This is exactly where Dagger / Buildkit caching excels. Since your cache is content addressable, it knows about the files and steps involved in generating an OCI layer. In this particular case, Dagger will realize that since something in `*/**/*.js` changed, there will be a checksum mismatch on that DAG vertex, and it'll run that vertex operation (run_js_specs()) again automatically. You don't need to do anything about it, since it'll work the same way as running the Dockerfile locally
cache import/export basically exports all the OCI layers along with the content-addressable SHAs of their inputs, so Dagger can tell whether something changed and needs to be re-computed
so I would have to specify what files a vertex cares about or no?
yes, it's basically the same as doing a Dockerfile COPY or ADD operation. Dagger has the WithFile and WithDirectory operations as well for defining your pipelines.
you basically say container.WithDirectory("/foo", mydir).WithExec("run test") > that's how you write the pipeline. And Dagger will take a content addressable SHA of the OCI layer and inputs to get there. If anything on mydir changes on the next run, it'll run the exec again
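The invalidation rule described here can be illustrated with a toy model in TypeScript. This is a sketch of the general idea of content-addressed caching, not how Dagger or BuildKit is actually implemented: `cacheKey` and `cachedExec` are hypothetical names, the "directory" is just an in-memory map of path to contents, and the cache is a plain `Map`. The key is a hash of the command plus every input file, so the exec re-runs only when one of those changes.

```typescript
import { createHash } from "node:crypto"

// Toy stand-in for a directory: path -> file contents.
type Files = Record<string, string>

// Cache key = SHA-256 over the command plus every input file's path and contents.
// Paths are sorted so the key does not depend on insertion order.
function cacheKey(command: string[], inputs: Files): string {
  const hash = createHash("sha256")
  hash.update(JSON.stringify(command))
  for (const path of Object.keys(inputs).sort()) {
    hash.update(JSON.stringify([path, inputs[path]]))
  }
  return hash.digest("hex")
}

const cache = new Map<string, string>()

// Run `exec` only when no result is cached for this exact command + inputs.
function cachedExec(command: string[], inputs: Files, exec: () => string): string {
  const key = cacheKey(command, inputs)
  const hit = cache.get(key)
  if (hit !== undefined) return hit
  const output = exec()
  cache.set(key, output)
  return output
}
```

In this model, calling `cachedExec(["run", "test"], mydir, runTests)` twice with identical files runs `runTests` once, and editing any file in `mydir` produces a different key and a fresh run. Files that were never passed as inputs cannot affect the key, which is why declaring what a vertex depends on matters.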
@cunning night making more sense now? I mean..we'd love to ultimately solve the CI infra part eventually.. but you know.. finite resources and we needed to start somewhere. Having said that, we're confident that having a great DX, a programmable engine with caching and container native support from the start is a great product. If you think we're missing something at this stage, happy to hear feedback
I mean I think I understand what you're saying. Looking at my pipeline I could maybe cache/skip jest unit tests if none of those files changed... and possibly the same thing on the ruby side. That'd be only if Dagger only hashes WithDirectory etc and doesn't care if the hash of the underlying docker image has changed. It'd also have to invalidate the cache if any of the docker-compose services' hashes changed (e.g. if we only pin a major version of redis and the minor version changes on a subsequent build - ruby specs need to re-run). But those savings seem to be more due to the fact that this project is essentially a mini monorepo so there are some chunks that can be easily skipped. It'd probably be different if this pipeline had more artifacts consumed by subsequent steps, but right now we've got 1 major artifact (the docker image) produced at the start, and all the major caching is already handled there. Not that skipping js tests/ruby tests conditionally is insignificant, but...
> That'd be only if Dagger only hashes WithDirectory etc and doesn't care if the hash of the underlying docker image has changed.
If you mean like using the `--link` flag, yes, Dagger works that way by default.
> It'd also have to invalidate the cache if any of the docker-compose services' hashes changed (e.g. if we only pin a major version of redis and the minor version changes on a subsequent build - ruby specs need to re-run).
This is currently outside the scope of Dagger, but our "services" API is on the way (cc @low cove ) and I wouldn't be surprised if users request this. The good thing is that docker-compose doesn't handle this currently either.