#CI & DAG visualization
There are two core problems solved by our current CI (Gitlab) which Dagger cannot do on its own, forcing us to run dagger in gitlab. Our aim would be to run dagger as an external CI and have it report back status to Gitlab.
These two problems are partial ("hydrated state") re-execution (which I'll create another thread for) and execution visualisation, which this thread will cover.
As a developer, it's important to quickly see at what point a pipeline fails - did it fail to install dependencies (the dev might have e.g. a higher version locally than is supported by the build platform), or in running tests, or in building the final image?
Seeing the execution state and output helps developers debug failures more quickly, and also helps them determine if something failed on the CI side (e.g. the process was OOMKilled, and the pipeline needs to rerun/ops needs to be contacted to resolve).
One thing I've noticed to be different with Dagger compared to our existing CI solution is also my tendency to use more, and more well-contained, containers in execution. For example, I abstract each git action (pull, push, branch, etc) out to a distinct function, as it makes composability easier, and means I (representing platform engineering) can give application developers a toolkit to build pipelines, without having to know the exact commands they want to run.
This greater use of isolation leads to a problem I find to be, if not unique, much more pronounced with dagger than other tooling: I don't want each container to be displayed separately in the visualisation. Instead, I want to group several executions into one "job", and show the entire output.
Example of grouping:
If I'm building Kubernetes manifests, I want to see the following all in one step:
- git pull
- pulumi up
- git commit
- git push
This is my "deploy" step/job/action. It's useful to see these together as bad output from one step will lead to a failure further down the line (garbage in garbage out).
This grouping also closely relates to what I'll refer to as "checkpoints" in the other (to-be-created) thread.
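A minimal sketch of the grouping described above, in plain Go with no Dagger calls. `Step`, `Job`, `succeed` and `fail` are hypothetical names made up for illustration, and the steps are placeholders rather than real containers. It only shows the shape of the request: several isolated executions reported as one "deploy" job with a single combined output, stopping at the first failure.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Step stands in for one isolated execution (one container, in Dagger terms).
type Step struct {
	Name string
	Run  func() (output string, err error)
}

// Job groups several steps so a UI can show them as a single unit.
type Job struct {
	Name  string
	Steps []Step
}

// Execute runs the steps in order and stops at the first failure.
// It returns the combined output of every step that ran.
func (j Job) Execute() (string, error) {
	var out strings.Builder
	for _, s := range j.Steps {
		fmt.Fprintf(&out, "[%s] %s\n", j.Name, s.Name)
		stepOut, err := s.Run()
		out.WriteString(stepOut)
		if err != nil {
			return out.String(), fmt.Errorf("job %q: step %q: %w", j.Name, s.Name, err)
		}
	}
	return out.String(), nil
}

// succeed and fail build placeholder steps; real steps would run containers.
func succeed(name string) Step {
	return Step{Name: name, Run: func() (string, error) { return "ok\n", nil }}
}

func fail(name, reason string) Step {
	return Step{Name: name, Run: func() (string, error) { return "", errors.New(reason) }}
}

func main() {
	deploy := Job{Name: "deploy", Steps: []Step{
		succeed("git pull"),
		fail("pulumi up", "stack is locked"),
		succeed("git commit"),
		succeed("git push"),
	}}
	out, err := deploy.Execute()
	fmt.Print(out)
	fmt.Println("error:", err)
}
```

Because the output of every step lands in one buffer, a failure in "pulumi up" is shown next to the "git pull" output that preceded it, which is the garbage-in-garbage-out context the grouping is meant to preserve.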
Thanks for the detailed notes @brave kernel !
As I mentioned in our earlier thread, improving visibility is quickly becoming a priority. We ourselves are encountering the same limitation in using Dagger for Dagger's CI 🙂 @crisp stone has lots of ideas around this.
On the topic of grouping specifically, that is something we are considering, probably via optional annotations in the GraphQL API (which would then be exposed in all SDKs). Something like WithName or WithDescription. Still TBD.
I'm very curious about your opening statement:
"Our aim would be to run dagger as an external CI and have it report back status to Gitlab."
Is this because you are generally unhappy about Gitlab CI, and are now seeing a way out? Or you were fine with it, but then Dagger came along and you are rethinking your stack?
You also mention "reporting back to Gitlab" so presumably you want to keep Gitlab, just not for CI. Would you then use Gitlab only for git hosting? Or do you use some of its other features as well?
I'm curious because in general Gitlab's approach is to encourage you to use them for everything in the SDLC - they famously have a feature for everything and will argue that it works best when you use it all. What is your opinion of that model, and where does Dagger fit within it in your view?
Well, Dagger is a CI, so why use gitlab's pipelines if we could use only dagger? 😄 would require a small API layer in between ofc
Right now, I have to create steps in Gitlab, which means my Dagger gets split into multiple "commands". I'd really like to express the entire pipeline through Dagger
That's exactly what @crisp stone is running into (on Github Actions) when converting Dagger's own CI to Dagger
I have no issues with Gitlab CI, but I prefer the model of having CI as code rather than YAML. Makes it easier to reuse
@brave kernel do you host your own gitlab git server & CI runners?
Another big thing is dealing with monorepos - I'd love to just have TS look through a services directory and use a dynamic import to pull in whatever pipeline a service uses
Own runners but not server
Right now, I'm struggling with how we should set up Gitlab CI for all services. Ideally, developers shouldn't have to modify the root Gitlab CI file to add a service. Rather, all of it should be contained in the service
So we have started considering 2 approaches to solve this problem - first for ourselves then possibly for everyone else.
Approach 1: continue running in your existing CI; collect engine telemetry in your Dagger Cloud account, then present it back to git server via integration
Approach 2: run pipelines in Dagger-specific runners (not piggybacking on CI runners). Dagger Cloud handles both dispatching jobs + collecting telemetry & reporting status back to Gitlab
In both approaches, Dagger Cloud becomes a "companion" to gitlab.com. But in approach 1 it does less than in approach 2. There's a tradeoff there of course.
Yes, if I understand what you're saying correctly, this is how we do it too. Our goal is to have basically a single mapping of "git event" to "dagger pipeline". No need to filter by directory in the CI configuration, Dagger already handles that. For example if nothing has changed in the source code of service <foo> then all corresponding pipelines will just be cached.
Is this what you mean?
So we can do that in Gitlab CI with the changes keyword, but that would require adding a new job to the .gitlab-ci.yml. I want each service in a monorepo to be "auto-discovered" (I'd write that by traversing a directory) and executed on change
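A rough sketch of the "auto-discovery by traversing a directory" idea, written in Go rather than TS and using only the standard library. The marker file `ci/pipeline.go`, the `services` root and the in-memory `demoRepo` are assumptions made up for this example; a real checkout could be passed in with `os.DirFS(".")` instead.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// pipelineFile is a hypothetical marker: a service opts in to CI by shipping it.
const pipelineFile = "ci/pipeline.go"

// discoverServices returns the names of all directories directly under root
// that contain a pipeline definition, sorted by name.
func discoverServices(repo fs.FS, root string) ([]string, error) {
	entries, err := fs.ReadDir(repo, root) // entries come back sorted by name
	if err != nil {
		return nil, err
	}
	var services []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		if _, err := fs.Stat(repo, path.Join(root, e.Name(), pipelineFile)); err == nil {
			services = append(services, e.Name())
		}
	}
	return services, nil
}

// demoRepo is an in-memory stand-in for a monorepo checkout.
func demoRepo() fs.FS {
	return fstest.MapFS{
		"services/api/ci/pipeline.go": {Data: []byte("package ci")},
		"services/web/ci/pipeline.go": {Data: []byte("package ci")},
		"services/legacy/README.md":   {Data: []byte("no pipeline here")},
		"services/NOTES.txt":          {Data: []byte("not a directory")},
	}
}

func main() {
	services, err := discoverServices(demoRepo(), "services")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range services {
		fmt.Println("would run pipeline for", s)
	}
}
```

With this shape, adding a service means adding a directory with its own pipeline file; nothing at the repository root has to change.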
Yeah I believe you can do that with Dagger today. Ideally we want the CI yaml to be static glue that almost never changes. And we're also wondering if we should remove the yaml altogether and bypass the traditional CI (re: approaches 1 vs 2). But not sure yet.
Yeah from my POV any yaml is superfluous
And yes, it should be possible today already, but due to the other limitations, we're running with Gitlab and dind, so I suspect performance would be pretty rough with that
And node auto scaling wouldn't work well either
One other thing to note with the "don't run without changes" is that what I actually want is "run if the service or one of its dependencies changed". We have internal utility libraries in the same monorepo, and all services relying on that util should rebuild if the util changes
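The "run if the service or one of its dependencies changed" rule can be sketched as a small graph walk. This is plain Go under stated assumptions: the dependency map is written by hand here (a real setup would have to derive it from package manifests), the directory names are invented, and the dependency graph is assumed to be acyclic. The cycle guard only prevents infinite recursion.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// affected returns, sorted, the services that must rebuild: those with a
// changed file in their own directory, plus those that depend (directly or
// transitively) on a directory with changes.
// deps maps a directory to the directories it depends on.
func affected(services []string, deps map[string][]string, changed []string) []string {
	memo := map[string]bool{}
	var dirty func(dir string) bool
	dirty = func(dir string) bool {
		if v, seen := memo[dir]; seen {
			return v
		}
		memo[dir] = false // stops infinite recursion if deps has a cycle
		for _, f := range changed {
			if strings.HasPrefix(f, dir+"/") {
				memo[dir] = true
				return true
			}
		}
		for _, d := range deps[dir] {
			if dirty(d) {
				memo[dir] = true
				return true
			}
		}
		return false
	}

	var out []string
	for _, s := range services {
		if dirty(s) {
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

// demoServices and demoDeps describe a made-up monorepo layout.
var demoServices = []string{"services/api", "services/docs", "services/web"}

var demoDeps = map[string][]string{
	"services/api": {"libs/util"},
	"services/web": {"libs/ui"},
	"libs/ui":      {"libs/util"},
}

func main() {
	fmt.Println(affected(demoServices, demoDeps, []string{"libs/util/log.ts"}))
	fmt.Println(affected(demoServices, demoDeps, []string{"services/docs/index.md"}))
}
```

A change under `libs/util` marks both `services/api` and `services/web` (the latter through `libs/ui`), while a change inside `services/docs` marks only that service.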
Re other features we use Gitlab registry and package registry
@brave kernel in "approach 2" (all CI on Dagger), are there features of Gitlab CI that you'd be worried of losing? Like for example efficient dispatching across multiple worker machines? Or easy configuration of the event/pipeline mapping ("run X on event Y")? Or runners on multiple runners?
Multiple workers is crucial. Take the example above with the monorepo: with gitlab spawning one kubernetes pod per job, we can easily horizontally scale up to handle any workload. This is probably the only hard requirement.
Another feature we currently use, which I don’t know how easy it would be to replicate, is merge request pipelines. I’d assume tag pipelines would really just be using TS to check if the most recent commit has a tag. MRs, however, are not built into Git but are specific to each provider. MR pipelines are heavily used by our frontend, allowing us to test functionality in a per-MR environment, and then merge straight into production
@brave kernel by TS you mean Typescript?
sorry, yeah
My guess is that in this scenario, you would integrate Dagger Cloud in the same way that you would integrate any external CI, for example CircleCI. Presumably Gitlab has the integration facilities to pass on MR-specific information to external CI. Of course that requires Gitlab-specific integration work on our end, but assuming we go for this option, we would just do the work
The main concerns in this scenario for us are 1) do enough Dagger users actually want to use Dagger as their CI runner directly, and 2) if so, are we biting off more than we can chew engineering-wise, to deliver a solid experience.
Yep, I can’t answer those questions 😄 but regardless, the efficient distribution of work across workers is extremely useful, just for the flexibility it allows, particularly with machine types (which is an important factor if you run CI on spot instances like we do, where e.g. a certain larger node might be unavailable, but with distributed work it doesn’t matter much)
@brave kernel so currently your CI setup is:
- Self-hosted Gitlab runners
- Deployed on Kubernetes
- On top of AWS EC2 spot instances
Is that right?
Yep! (The GL agent runs on regular nodes, but workers on spot)
What are your constraints for runner infrastructure?
- Is AWS a must-have?
- Is using your existing Kubernetes cluster a must-have?
- How do CI jobs currently get scheduled on your kubernetes cluster? Does gitlab.com have direct access to your cluster, or agents phone home?
- How are gitlab agents scaled? Do you independently scale the size of your kub deployment to N runners, and each runner phones home to get the same job? Or something else?
Sorry for all the questions 🙂 No obligation to answer them all of course!
Also: do you pay Gitlab? 🙂
Re paid: Yep, we're on the Gitlab premium, so $19/user/mo or whatever it is. In addition our CI env is currently something like 400-600/mo.
Gitlab has a Kubernetes executor that automatically spins up one new pod per job. Each node in the cluster has one runner registered, so the nodes automatically get registered with Gitlab. I don't recall off the top of my head if it's push or pull based.
We have no need to use our current k8s cluster. I like using k8s just to standardise operations. AWS is optional too. The main reason I like using our own runners is price:performance ratio. Using our own runners lets us decide how much resources are available for each job, and lets us pick instances that are appropriate for the jobs (and even do multiple instance pools).
In terms of scheduling, we run instances of "gitlab-runner", which Gitlab has a connection to (this is the part I don't know is pull/push)
This is very useful, thanks!
Are there any pipelines you don’t see yourself running on dagger? Like non-linux runners, or automation that is purely gitlab-specific housekeeping?
re: progress/demarcating "jobs", I really like the idea of doing that in code just like the rest of the API. it might even map well to Buildkit's 'progress groups' feature: https://github.com/moby/buildkit/blob/9624ab4710dd1a63453cc028802c9992b9715f3c/client/graph.go#L18 (added by @trim sky!)
(of course we might come up with our own progress representation in the end anyway, but it's always nice when you can represent things cleanly straight from the source)
one thing I’m curious about @gentle pagoda : with Dagger being a dynamic API rather than static yaml, it’s harder to mirror the usual visualization of DAG definition. We can visualize each pipeline execution and technically each of those is a DAG, but not the same DAG as the full definition. Have you given any thought to that?
yeah, I don't think we'd want to directly expose the lower-level DAG we send to Buildkit - at least, not at the top of the pyramid. it'd be cool as a power user thing, or drilling into an individual build, but too high fidelity otherwise. for CI/CD you probably want something more abstract, representing the flow of dependencies/artifacts, which is something that Dagger doesn't represent today at all.
this came up a lot in Concourse, and there are actually at least two ways people might want to visualize, depending on what they're looking for:
- the "current state of production": https://ci.concourse-ci.org/teams/main/pipelines/concourse - which just shows where things are failing, always rolling forward, stacking green/red statuses on top of prior runs
- the view of all builds downstream of a particular resource version (e.g. commit of a repo): https://ci.concourse-ci.org/teams/main/pipelines/concourse/resources/concourse/causality/74728214/downstream - we called this 'causality'
concourse doesn't really have "pipeline executions" (jobs are scheduled independently, pipeline is just a set of inter-dependent jobs) but the second thing is sort of the closest thing to it, since you're viewing a slice of job builds impacted by the version you're interested in
the first point is derived from the current YAML configuration, so it's static. the second point is generated based on build history, and tracking where a version was used as build inputs/outputs, etc. - so it's dynamic.
I see no reason to run anything in Gitlab. We've only been using it because GL pipelines are really good among the competition 😄
For the visualization I think there could also be a connection to sessions here. My default assumption would be that when visualizing a "build" you are visualizing everything that is happening as a result of one dagger.Connect (which is to say, everything in that session). So even if you submit multiple pipelines across the session then they all appear as one big DAG. And even though they are submitted separately, they can still have edges between them (which we'd be able to figure out via LLB). Of course, it'll also be possible for the DAG to not be fully connected, but that's fine if it reflects the reality of the session.
So I guess this would also imply that if we visualize the DAG for a session, then we need the ability for more vertices to dynamically pop up in the visualization as they get submitted. Once we have extensions, that'll be especially true since that cranks the possibility for dynamic vertices up a notch or two.
TIL! That's awesome
Agree we can't just show every single vertex all at once. You'd have to be able to group them and optionally "expand" (or maybe reveal detail when you zoom in, like digital maps do). I guess extensions will be a natural grouping point once we have them, but agree that exposing this in our API seems like a requirement in addition anyways, so makes sense to start there.
I feel like it might fit well with the "ambient context" thing where a grouping can be set and then passed transparently through to any subvertices. Related to this conversation: #maintainers message
Everything within one connect is exactly what I'd want to visualise 👍
And I think you could build a single DAG regardless; they'd just share the root node if they're not connected
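The "share the root node" idea can be sketched as follows. This is a plain Go toy, not how Dagger or Buildkit represent their graphs; the `session` root name and the vertex names are invented for illustration. Every vertex without a parent is attached to one synthetic root, so disconnected pipelines from the same session still render as a single DAG.

```go
package main

import (
	"fmt"
	"sort"
)

// rootID is a hypothetical name for the synthetic root of a session.
const rootID = "session"

// withSharedRoot returns a copy of edges (parent -> children) in which every
// vertex that has no parent is listed as a child of one synthetic root.
func withSharedRoot(vertices []string, edges map[string][]string) map[string][]string {
	hasParent := map[string]bool{}
	out := map[string][]string{}
	for parent, children := range edges {
		out[parent] = append([]string(nil), children...)
		for _, c := range children {
			hasParent[c] = true
		}
	}
	var roots []string
	for _, v := range vertices {
		if !hasParent[v] {
			roots = append(roots, v)
		}
	}
	sort.Strings(roots)
	out[rootID] = roots
	return out
}

func main() {
	vertices := []string{"build", "test", "publish", "docs"}
	edges := map[string][]string{"build": {"test", "publish"}}
	graph := withSharedRoot(vertices, edges)
	fmt.Println(rootID, "->", graph[rootID])
	fmt.Println("build ->", graph["build"])
}
```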
My guess is that, even without any new API, we can probably approximate each graphql query to be one top-level group. Then the API adds sugar for giving a human-readable name and description to that group.
This is a very helpful discussion! We have a similar setup but in Jenkins hosted on Kubernetes and some of the points discussed here are very relevant to us too. In our current environment we will have to run Dagger in a DinD container (spun up for each build). We won't get any of the caching goodness of buildkit if we don't use a remote runner. We'll also see only a single stage in Jenkins so progress/demarcation of steps and its visualization is also something I am eagerly looking forward to.
Out of the two approaches @turbid ridge outlined, no. 1 seems more desirable to me. Mainly because I work for a company that's in the highly regulated financial industry. We are usually averse to sharing our data with public cloud offerings. Are you planning to offer a self-hostable variant of the dagger cloud?
(@gentle pagoda @trim sky FYI I'm building a small prototype around pgroups as a proof of concept)
Yes we do plan on building a self-hosted version of the control plane, eventually. How quickly we build it depends on customer interest of course 🙂 I think you would probably need this in both options, since they both involve Dagger Cloud
Hey @brave kernel -- regarding your original request regarding "grouping", we just implemented support for that (not yet released).
It's called Pipelines and does what you were describing: https://github.com/dagger/dagger/pull/4248
For now it's a "write only" API (e.g. there's nothing yet making use of those "groups"). but that's about to change.
I demo'd a few weeks ago a CLI tool that makes use of those "groups": https://www.youtube.com/watch?v=bL_9GOCZy3Y
And here's an example by @worldly verge using the new Pipeline syntax: https://github.com/kpenfound/greetings-api/blob/d4gh_demo/ci/ci.go#L56
Would withPipelineName or withPipeline make more sense and align with the other commands? I was following the PR and was trying to think of a good name for the step but it didn't occur to me till I saw this example.
the idea is that it’s more like container, ie the beginning of something (specifically a pipeline 🙂)
@crisp stone can I do query { pipeline { container } } ?
I assumed so but just saw that’s not how @worldly verge ‘s demo uses it
sure
pipeline works on query, container and directory
client.Pipeline("foo").Container().From("alpine").Pipeline("bar").Exec("...")
if I added a pipeline here: https://github.com/kpenfound/greetings-api/blob/d4gh_demo/ci/ci.go#L80 would that make each of the 3 builds that export to it nested under that pipeline?
A typical usecase (our own CI) is like this:
node := client.Pipeline("nodejs").Container().From("node").WithExec([]string{"yarn", "install"}) // <-- do a bunch of node related setup
lint := node.Pipeline("lint").WithExec([]string{"yarn", "lint"})
test := node.Pipeline("test").WithExec([]string{"yarn", "test"})
This creates:
- A parent pipeline for all things Node.js related
- That parent pipeline contains a few things itself (common yarn dependency installation)
- 2 sub-pipelines: one each for lint and test
Neat! So a "pipeline" maps roughly to a traditional CI job?
So buildkit optimisations still work properly on container-level? I.e. if I have two pipelines, and container 1 depends on pipeline1:container2, is that treated the same by buildkit as just depending on the container?
Or I guess to simplify: does the pipeline construct in any way impact buildkit?
@worldly verge so here you could move the Pipeline one notch above, client.Pipeline(...).Container()
That would make the From() be part of docs (otherwise docs will be part of the root pipeline)
No effect whatsoever on caching, etc -- inputs are still evaluated at the low-level operation
It's purely metadata based, comes back in reporting
We use that to "organize" how things are visualized, logs displayed etc
We also (at least in the CLI demo) use those as "virtual operations", e.g. the pipeline's total time is the sum of every operation inside, the cache status depends on all operations, etc
e.g. if you have a build pipeline that does an apk add and then go build, then the build time is going to be the sum of both. And it's flagged as cached only if both operations are cached
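The "virtual operation" roll-up described above could look roughly like this. It is a plain Go sketch, not the CLI demo's actual code: `Operation`, `Pipeline` and `Summary` are hypothetical names, the durations are invented, and including sub-pipelines in the parent's total is an assumption. Note that under this rule a pipeline with no operations counts as cached.

```go
package main

import (
	"fmt"
	"time"
)

// Operation is one low-level step together with how it was reported.
type Operation struct {
	Name     string
	Duration time.Duration
	Cached   bool
}

// Pipeline is a named group of operations and nested sub-pipelines.
type Pipeline struct {
	Name     string
	Ops      []Operation
	Children []Pipeline
}

// Summary returns the total time spent in the pipeline (own operations plus
// sub-pipelines) and whether every operation inside it was cached.
func (p Pipeline) Summary() (total time.Duration, cached bool) {
	cached = true
	for _, op := range p.Ops {
		total += op.Duration
		cached = cached && op.Cached
	}
	for _, child := range p.Children {
		childTotal, childCached := child.Summary()
		total += childTotal
		cached = cached && childCached
	}
	return total, cached
}

// demoBuild mirrors the example above: apk add followed by go build.
func demoBuild() Pipeline {
	return Pipeline{Name: "build", Ops: []Operation{
		{Name: "apk add", Duration: 4 * time.Second, Cached: true},
		{Name: "go build", Duration: 11 * time.Second, Cached: false},
	}}
}

func main() {
	total, cached := demoBuild().Summary()
	fmt.Printf("build: %s, cached=%v\n", total, cached)
}
```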
I have a pipeline scenario that I'm curious if it can already be achieved: only run the publish pipeline if the test pipeline passes.
I suppose it could be manually done by writing an if statement/wrapping in try catch (I'm not very familiar with error handling of dagger executions), but would be neat to instead more explicitly modify the dag by saying "I don't want to run x before y, even though it's not technically required"
I think the thing that's easiest to get confused with in Dagger is what ends up being executed in parallel and what's linear, and that's exacerbated when you put the entire pipeline into one DAG
@brave kernel Yeah, I see. It's definitely a common problem, hopefully visualization can at least make it easier to understand
Usually it's a "non problem" because if e.g. "publish" uses assets from "build", then they're naturally dependent
In your example of "publish" and "test" that's definitely not the case (e.g. wait for this completely disconnected "test" to complete before hitting "publish")
I have a half-baked proposal trying to somewhat address that: https://github.com/dagger/dagger/issues/4205
Yep, that's why it came to mind, it's probably the most obvious example of where you want a node that isn't a dependency to still block
Or rather: it's a logical and not a technical dependency
Right now there are ugly workarounds:
- somehow create a dependency between the two (mount the tests inside publish, for instance)
- OR, make tests synchronous (e.g. wait for an ExitCode() or something), then handle flow directly in code
But yeah, something like #4205 or maybe even an explicit DependsOn() for "logical" dependencies
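The second workaround (make tests synchronous, then handle flow in code) can be sketched without any Dagger calls. In this plain Go example `stage` is a hypothetical stand-in for "a pipeline forced to completion", for instance by waiting on its exit code; `runGated` and the stage bodies are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// stage stands in for a pipeline that has been forced to run to completion.
type stage func() error

// errTestsFailed is a placeholder failure used by the demo below.
var errTestsFailed = errors.New("tests failed")

// runGated runs gate first and runs next only if gate succeeded, which
// expresses a logical dependency that the DAG itself does not contain.
func runGated(gate, next stage) error {
	if err := gate(); err != nil {
		return fmt.Errorf("gate failed, skipping next stage: %w", err)
	}
	return next()
}

func main() {
	test := func() error { return errTestsFailed }
	publish := func() error {
		fmt.Println("publishing")
		return nil
	}
	if err := runGated(test, publish); err != nil {
		fmt.Println("error:", err)
	}
}
```

The cost of this approach is that "publish" cannot start any of its own independent work until "test" has finished, which is what a declarative logical dependency would avoid.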