#CI & DAG visualization


turbid ridge
#

@brave kernel has questions on how to integrate Dagger in CI and visualize the DAG (placeholder)

brave kernel
#

There are two core problems solved by our current CI (Gitlab) which Dagger cannot do on its own, forcing us to run dagger in gitlab. Our aim would be to run dagger as an external CI and have it report back status to Gitlab.

These two problems are partial ("hydrated state") re-execution (which I'll create another thread for) and execution visualisation, which this thread will cover.

As a developer, it's important to quickly see at what point a pipeline fails - did it fail while installing dependencies (the dev might have e.g. a higher version locally than is supported by the build platform), while running tests, or while building the final image?

Seeing the execution state and output helps developers debug failures faster, and also helps them determine whether something failed on the CI side (e.g. the process was OOMKilled, and the pipeline needs to be rerun or ops needs to be contacted to resolve it).

One thing I've noticed to be different with Dagger compared to our existing CI solution is also my tendency to use more, and more well-contained, containers in execution. For example, I abstract each git action (pull, push, branch, etc) out to a distinct function, as it makes composability easier, and means I (representing platform engineering) can give application developers a toolkit to build pipelines, without having to know the exact commands they want to run.

This greater use of isolation leads to a problem I find to be, if not unique, much more pronounced with dagger than other tooling: I don't want each container to be displayed separately in the visualisation. Instead, I want to group several executions into one "job", and show the entire output.

Example of grouping:
If I'm building Kubernetes manifests, I want to see the following all in one step:

  • git pull
  • pulumi up
  • git commit
  • git push

This is my "deploy" step/job/action. It's useful to see these together as bad output from one step will lead to a failure further down the line (garbage in garbage out).

This grouping also closely relates to what I'll refer to as "checkpoints" in the other (to-be-created) thread.

turbid ridge
#

Thanks for the detailed notes @brave kernel !

As I mentioned in our earlier thread, improving visibility is quickly becoming a priority. We ourselves are encountering the same limitation in using Dagger for Dagger's CI 🙂 @crisp stone has lots of ideas around this.

On the topic of grouping specifically, that is something we are considering, probably via optional annotations in the GraphQL API (which would then be exposed in all SDKs). Something like WithName or WithDescription. Still TBD.

#

I'm very curious about your opening statement:

Our aim would be to run dagger as an external CI and have it report back status to Gitlab.

Is this because you are generally unhappy about Gitlab CI, and are now seeing a way out? Or you were fine with it, but then Dagger came along and you are rethinking your stack?

You also mention "reporting back to Gitlab" so presumably you want to keep Gitlab, just not for CI. Would you then use Gitlab only for git hosting? Or do you use some of its other features as well?

I'm curious because in general Gitlab's approach is to encourage you to use them for everything in the SDLC - they famously have a feature for everything and will argue that it works best when you use it all. What is your opinion of that model, and where does Dagger fit within it in your view?

brave kernel
#

Right now, I have to create steps in Gitlab, which means my Dagger gets split into multiple "commands". I'd really like to express the entire pipeline through Dagger

turbid ridge
#

@brave kernel do you host your own gitlab git server & CI runners?

brave kernel
#

Another big thing is dealing with monorepos - I'd love to just have TS look through a services directory and use a dynamic import to pull in whatever pipeline a service uses

brave kernel
#

Right now, I'm struggling with how we should set up Gitlab CI for all services. Ideally, developers shouldn't have to modify the root Gitlab CI file to add a service. Rather, all of it should be contained in the service
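
A minimal sketch of the "contained in the service" idea, written in Go rather than TS to match the SDK snippets later in the thread. It assumes a hypothetical convention where each service opts in by shipping a marker file (here `pipeline.ts`) in its own directory under `services/`; the directory layout, the marker name, and the `discover` helper are all illustrative and not part of Dagger or Gitlab.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sort"
	"testing/fstest"
)

// demoRepo is an in-memory stand-in for a monorepo checkout (hypothetical layout).
var demoRepo = fstest.MapFS{
	"services/api/pipeline.ts":  {Data: []byte("// api pipeline")},
	"services/web/pipeline.ts":  {Data: []byte("// web pipeline")},
	"services/legacy/README.md": {Data: []byte("no pipeline here")},
}

// discover returns the service directories that contain the marker file,
// so adding a service never requires touching a root CI file.
func discover(fsys fs.FS, marker string) ([]string, error) {
	matches, err := fs.Glob(fsys, "services/*/"+marker)
	if err != nil {
		return nil, err
	}
	dirs := make([]string, 0, len(matches))
	for _, m := range matches {
		dirs = append(dirs, path.Dir(m))
	}
	sort.Strings(dirs)
	return dirs, nil
}

func main() {
	dirs, err := discover(demoRepo, "pipeline.ts")
	if err != nil {
		panic(err)
	}
	fmt.Println(dirs) // prints [services/api services/web]
}
```

In a real checkout, `os.DirFS(".")` could be passed instead of the in-memory file system; how each discovered pipeline is then loaded and run is left open here.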

turbid ridge
# brave kernel Own runners but not server

So we have started considering 2 approaches to solve this problem - first for ourselves then possibly for everyone else.

Approach 1: continue running in your existing CI; collect engine telemetry in your Dagger Cloud account, then present it back to git server via integration

Approach 2: run pipelines in Dagger-specific runners (not piggybacking on CI runners). Dagger Cloud handles both dispatching jobs + collecting telemetry & reporting status back to Gitlab

#

In both approaches, Dagger Cloud becomes a "companion" to gitlab.com. But in approach 1 it does less than in approach 2. There's a tradeoff there of course.

brave kernel
#

And node auto scaling wouldn't work well either

#

One other thing to note with the "don't run without changes" is that what I actually want is "run if the service or one of its dependencies changed". We have internal utility libraries in the same monorepo, and all services relying on that util should rebuild if the util changes
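
The "run if the service or one of its dependencies changed" rule can be sketched as a small graph walk. This is only an illustration: the `deps` map, the directory names, and the helper functions are hypothetical, and the list of changed files is assumed to come from something like a diff against the target branch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// deps maps each package directory to the in-repo directories it depends on.
// The layout and names are hypothetical.
var deps = map[string][]string{
	"services/api":    {"libs/util"},
	"services/web":    {"libs/ui"},
	"services/worker": {"libs/util", "libs/queue"},
	"libs/queue":      {"libs/util"},
}

// changedUnder reports whether any changed file lives under dir.
func changedUnder(dir string, changed []string) bool {
	for _, f := range changed {
		if strings.HasPrefix(f, dir+"/") {
			return true
		}
	}
	return false
}

// needsRun reports whether dir or any of its transitive dependencies changed.
// seen guards against revisiting directories (and against dependency cycles).
func needsRun(dir string, changed []string, seen map[string]bool) bool {
	if seen[dir] {
		return false
	}
	seen[dir] = true
	if changedUnder(dir, changed) {
		return true
	}
	for _, d := range deps[dir] {
		if needsRun(d, changed, seen) {
			return true
		}
	}
	return false
}

// affectedServices returns the sorted list of services that should rebuild.
func affectedServices(changed []string) []string {
	var out []string
	for dir := range deps {
		if strings.HasPrefix(dir, "services/") && needsRun(dir, changed, map[string]bool{}) {
			out = append(out, dir)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// A change to the shared util library rebuilds every service relying on it.
	fmt.Println(affectedServices([]string{"libs/util/strings.go"})) // prints [services/api services/worker]
}
```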

#

Re other features we use Gitlab registry and package registry

turbid ridge
#

@brave kernel in "approach 2" (all CI on Dagger), are there features of Gitlab CI that you'd be worried about losing? Like for example efficient dispatching across multiple worker machines? Or easy configuration of the event/pipeline mapping ("run X on event Y")? Or runners on multiple runners?

brave kernel
#

Multiple workers is crucial. Take the example above with the monorepo: with gitlab spawning one kubernetes pod per job, we can easily horizontally scale up to handle any workload. This is probably the only hard requirement.

Another feature we currently use, and which I don’t know how easy it would be to replicate, is merge request pipelines. I’d assume tag pipelines would really just be using TS to check if the most recent commit has a tag. MRs, however, are not built into Git but are specific to each provider. MR pipelines are heavily used by our frontend, allowing us to test functionality in a per-MR environment, and then merge straight into production

turbid ridge
#

@brave kernel by TS you mean Typescript?

brave kernel
#

sorry, yeah

turbid ridge
#

My guess is that in this scenario, you would integrate Dagger Cloud in the same way that you would integrate any external CI, for example CircleCI. Presumably Gitlab has the integration facilities to pass on MR-specific information to external CI. Of course that requires Gitlab-specific integration work on our end, but assuming we go for this option, we would just do the work

#

The main concerns in this scenario for us are 1) do enough Dagger users actually want to use Dagger as their CI runner directly, and 2) if so, are we biting off more than we can chew engineering-wise, to deliver a solid experience.

brave kernel
#

Yep, I can’t answer those questions 😄 but regardless, the efficient distribution of work across workers is extremely useful, just for the flexibility it allows, particularly with machine types (which is an important factor if you run CI on spot instances like we do, where e.g. a certain larger node might be unavailable, but with distributed work it doesn’t matter much)

turbid ridge
#

@brave kernel so currently your CI setup is:

  • Self-hosted Gitlab runners
  • Deployed on Kubernetes
  • On top of AWS EC2 spot instances

Is that right?

turbid ridge
#

What are your constraints for runner infrastructure?

  • Is AWS a must-have?
  • Is using your existing Kubernetes cluster a must-have?
  • How do CI jobs currently get scheduled on your kubernetes cluster? Does gitlab.com have direct access to your cluster, or agents phone home?
  • How are gitlab agents scaled? Do you independently scale the size of your kub deployment to N runners, and each runner phones home to get the same job? Or something else?

Sorry for all the questions 🙂 No obligation to answer them all of course!

#

Also: do you pay Gitlab? 🙂

brave kernel
#

Re paid: Yep, we're on the Gitlab premium, so $19/user/mo or whatever it is. In addition our CI env is currently something like 400-600/mo.

Gitlab has a Kubernetes executor that automatically spins up one new pod per job. Each node in the cluster has one runner registered, so the nodes automatically get registered with Gitlab. I don't recall off the top of my head if it's push or pull based.

We have no need to use our current k8s cluster. I like using k8s just to standardise operations. AWS is optional too. The main reason I like using our own runners is price:performance ratio. Using our own runners lets us decide how many resources are available for each job, and lets us pick instances that are appropriate for the jobs (and even do multiple instance pools).

#

In terms of scheduling, we run instances of "gitlab-runner", which Gitlab has a connection to (this is the part where I don't know whether it's pull or push)

turbid ridge
#

This is very useful, thanks!

#

Are there any pipelines you don’t see yourself running on dagger? Like non-linux runners, or automation that is purely gitlab-specific housekeeping?

gentle pagoda
#

re: progress/demarcating "jobs", I really like the idea of doing that in code just like the rest of the API. it might even map well to Buildkit's 'progress groups' feature: https://github.com/moby/buildkit/blob/9624ab4710dd1a63453cc028802c9992b9715f3c/client/graph.go#L18 (added by @trim sky!)


#

(of course we might come up with our own progress representation in the end anyway, but it's always nice when you can represent things cleanly straight from the source)

turbid ridge
#

one thing I’m curious about @gentle pagoda : with Dagger being a dynamic API rather than static yaml, it’s harder to mirror the usual visualization of DAG definition. We can visualize each pipeline execution and technically each of those is a DAG, but not the same DAG as the full definition. Have you given any thought to that?

gentle pagoda
#

yeah, I don't think we'd want to directly expose the lower-level DAG we send to Buildkit - at least, not at the top of the pyramid. it'd be cool as a power user thing, or drilling into an individual build, but too high fidelity otherwise. for CI/CD you probably want something more abstract, representing the flow of dependencies/artifacts, which is something that Dagger doesn't represent today at all.

#

this came up a lot in Concourse, and there are actually at least two ways people might want to visualize, depending on what they're looking for:

  1. the "current state of production": https://ci.concourse-ci.org/teams/main/pipelines/concourse - which just shows where things are failing, always rolling forward, stacking green/red statuses on top of prior runs
  2. the view of all builds downstream of a particular resource version (e.g. commit of a repo): https://ci.concourse-ci.org/teams/main/pipelines/concourse/resources/concourse/causality/74728214/downstream - we called this 'causality'

concourse doesn't really have "pipeline executions" (jobs are scheduled independently, pipeline is just a set of inter-dependent jobs) but the second thing is sort of the closest thing to it, since you're viewing a slice of job builds impacted by the version you're interested in

the first point is derived from the current YAML configuration, so it's static. the second point is generated based on build history, and tracking where a version was used as build inputs/outputs, etc. - so it's dynamic.

trim sky
#

For the visualization I think there could also be a connection to sessions here. My default assumption would be that when visualizing a "build" you are visualizing everything that is happening as a result of one dagger.Connect (which is to say, everything in that session). So even if you submit multiple pipelines across the session then they all appear as one big DAG. And even though they are submitted separately, they can still have edges between them (which we'd be able to figure out via LLB). Of course, it'll also be possible for the DAG to not be fully connected, but that's fine if it reflects the reality of the session.

So I guess this would also imply that if we visualize the DAG for a session, then we need the ability for more vertices to dynamically pop up in the visualization as they get submitted. Once we have extensions, that'll be especially true since that cranks the possibility for dynamic vertices up a notch or two.

trim sky
# gentle pagoda yeah, I don't think we'd want to directly expose the lower-level DAG we send to ...

Agree we can't just show every single vertex all at once. You'd have to be able to group them and optionally "expand" (or maybe reveal detail when you zoom in, like digital maps do). I guess extensions will be a natural grouping point once we have them, but agree that exposing this in our API seems like a requirement in addition anyways, so makes sense to start there.

I feel like it might fit well with the "ambient context" thing where a grouping can be set and then passed transparently through to any subvertices. Related to this conversation: #maintainers message

green shard
#

This is a very helpful discussion! We have a similar setup but in Jenkins hosted on Kubernetes and some of the points discussed here are very relevant to us too. In our current environment we will have to run Dagger in a DinD container (spun up for each build). We won't get any of the caching goodness of buildkit if we don't use a remote runner. We'll also see only a single stage in Jenkins so progress/demarcation of steps and its visualization is also something I am eagerly looking forward to.

Out of the two approaches @turbid ridge outlined, no. 1 seems more desirable to me. Mainly because I work for a company that's in the highly regulated financial industry. We are usually averse to sharing our data with public cloud offerings. Are you planning to offer a self-hostable variant of the dagger cloud?

crisp stone
# brave kernel There are two core problems solved by our current CI (Gitlab) which Dagger canno...

Hey @brave kernel -- regarding your original request about "grouping", we just implemented support for that (not yet released).

It's called Pipelines and does what you were describing: https://github.com/dagger/dagger/pull/4248

For now it's a "write only" API (i.e. there's nothing yet making use of those "groups"), but that's about to change.

I demo'd a few weeks ago a CLI tool that makes use of those "groups": https://www.youtube.com/watch?v=bL_9GOCZy3Y

green shard
#

Would withPipelineName or withPipeline make more sense and align with the other commands? I was following the PR and was trying to think of a good name for the step but it didn't occur to me till I saw this example.

turbid ridge
#

the idea is that it’s more like container, ie the beginning of something (specifically a pipeline 🙂)

#

@crisp stone can I do query { pipeline { container } } ?

#

I assumed so but just saw that’s not how @worldly verge ‘s demo uses it

crisp stone
#

pipeline works on query, container and directory

#

client.Pipeline("foo").Container().From("alpine").Pipeline("bar").WithExec([]string{"..."})

crisp stone
#

A typical usecase (our own CI) is like this:

node := client.Pipeline("nodejs").Container().From("node").WithExec([]string{"yarn", "install"}) // <-- do a bunch of node-related setup

lint := node.Pipeline("lint").WithExec([]string{"yarn", "lint"})
test := node.Pipeline("test").WithExec([]string{"yarn", "test"})

This creates:

  • A parent pipeline for all things Node.js related
  • That parent pipeline contains a few things itself (common yarn dependency installation)
  • 2 sub-pipelines: one each for lint and test
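
One way to picture the resulting hierarchy is as a tree keyed by pipeline names. The sketch below does not use the Dagger SDK: the `Node` type and its `add` and `render` methods are hypothetical, and it assumes each operation is reported together with the list of pipeline names it was created under.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node is one pipeline in a hypothetical grouping tree.
type Node struct {
	Ops      []string         // operations recorded directly in this pipeline
	Children map[string]*Node // sub-pipelines by name
}

func newNode() *Node { return &Node{Children: map[string]*Node{}} }

// add files an operation under the pipeline path it was labeled with,
// creating intermediate pipelines as needed.
func (n *Node) add(path []string, op string) {
	cur := n
	for _, name := range path {
		child, ok := cur.Children[name]
		if !ok {
			child = newNode()
			cur.Children[name] = child
		}
		cur = child
	}
	cur.Ops = append(cur.Ops, op)
}

// render returns an indented outline: pipeline names end with "/",
// operations are listed as-is, sub-pipelines are sorted by name.
func (n *Node) render(depth int) string {
	var b strings.Builder
	indent := strings.Repeat("  ", depth)
	for _, op := range n.Ops {
		b.WriteString(indent + op + "\n")
	}
	names := make([]string, 0, len(n.Children))
	for name := range n.Children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		b.WriteString(indent + name + "/\n")
		b.WriteString(n.Children[name].render(depth + 1))
	}
	return b.String()
}

func main() {
	root := newNode()
	root.add([]string{"nodejs"}, "yarn install")
	root.add([]string{"nodejs", "lint"}, "yarn lint")
	root.add([]string{"nodejs", "test"}, "yarn test")
	fmt.Print(root.render(0))
}
```

Running it prints `nodejs/` at the top, `yarn install` directly beneath it, and `lint/` and `test/` as nested groups each holding their own command.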
crisp stone
# brave kernel Neat! So a "pipeline" maps roughly to a traditional CI job? So buildkit optimis...

No effect whatsoever on caching, etc -- inputs are still evaluated at the low-level operation

It's purely metadata based, comes back in reporting

We use that to "organize" how things are visualized, logs displayed etc

We also (at least in the CLI demo) use those as "virtual operations", e.g. the pipeline's total time is the sum of every operation inside, the cache status depends on all operations, etc

#

e.g. if you have a build pipeline that does an apk add and then go build, then the build time is going to be the sum of both. And it's flagged as cached only if both operations are cached
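
The "virtual operation" roll-up described above can be sketched in a few lines. The `Op` and `Summary` types and the example durations are made up for illustration and are not the data model used by the CLI demo; treating an empty pipeline as not cached is also just a choice made here.

```go
package main

import (
	"fmt"
	"time"
)

// Op is a hypothetical record for one low-level operation in a pipeline.
type Op struct {
	Name     string
	Duration time.Duration
	Cached   bool
}

// Summary is the "virtual operation" view of a whole pipeline.
type Summary struct {
	Total  time.Duration
	Cached bool
}

// summarize sums the durations and marks the pipeline as cached only if
// every operation in it was cached. An empty pipeline counts as not cached.
func summarize(ops []Op) Summary {
	s := Summary{Cached: len(ops) > 0}
	for _, op := range ops {
		s.Total += op.Duration
		s.Cached = s.Cached && op.Cached
	}
	return s
}

func main() {
	build := []Op{
		{Name: "apk add", Duration: 4 * time.Second, Cached: true},
		{Name: "go build", Duration: 30 * time.Second, Cached: false},
	}
	s := summarize(build)
	fmt.Println(s.Total, s.Cached) // prints 34s false
}
```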

brave kernel
#

I have a pipeline scenario that I'm curious if it can already be achieved: only run the publish pipeline if the test pipeline passes.

I suppose it could be manually done by writing an if statement/wrapping in try catch (I'm not very familiar with error handling of dagger executions), but would be neat to instead more explicitly modify the dag by saying "I don't want to run x before y, even though it's not technically required"

#

I think the thing that's easiest to get confused with in Dagger is what ends up being executed in parallel and what's linear, and that's exacerbated when you put the entire pipeline into one DAG

crisp stone
#

@brave kernel Yeah, I see. It's definitely a common problem, hopefully visualization can at least make it easier to understand

Usually it's a "non problem" because if e.g. "publish" uses assets from "build", then they're naturally dependent

In your example of "publish" and "test" that's definitely not the case (e.g. wait for this completely disconnected "test" to complete before hitting "publish")

I have a half-baked proposal trying to somewhat address that: https://github.com/dagger/dagger/issues/4205

GitHub

I've noticed, either while dogfooding or looking at community demos, that the DX is lacking in synchronization mechanism and people are building workarounds around that. Problem 1: Single p...

brave kernel
#

Yep, that's why it came to mind, it's probably the most obvious example of where you want a node that isn't a dependency to still block

#

Or rather: it's a logical and not a technical dependency

crisp stone
#

Right now there are ugly workarounds:

  • somehow create a dependency between the two (mount the tests inside publish, for instance)
  • OR, make tests synchronous (e.g. wait for an ExitCode() or something), then handle flow directly in code
#

But yeah, something like #4205 or maybe even an explicit DependsOn() for "logical" dependencies