what is a project? | Dagger

vagrant valley Jun 20, 2023, 2:18 AM

#

are there docs/info on what goes in dagger.json right now, or has that not been fully decided yet?

maiden moat Jun 20, 2023, 2:23 AM

#

There probably is, but I’m the wrong person to ask 🙂 Erik can probably fill us in

#

that said, at this stage, the code is still source of truth

#

and the format & contents will probably still evolve

#

it’s basically what you need to load the project’s controller

vagrant valley Jun 20, 2023, 2:26 AM

#

i see, yeah. Y'alls currently looks pretty minimal 😄 https://github.com/dagger/dagger/blob/dc53110836824701291cf8cdde2fe6c0b2171f4f/ci/dagger.json#L3

#

going to leave this discussion here for reference: https://github.com/dagger/dagger/pull/5060#issuecomment-1533283941

maiden moat Jun 20, 2023, 2:28 AM

#

yeah the less in that file the better

#

there is actually an alternative design, which I find very intriguing, and would love to explore further. In that design you don’t add a file to your project directory. Instead, you tell dagger out of band which controller to load for which project directory

#

more like a downstream distro model

#

it breaks your brain initially, but if you let it sit for a while, it opens interesting options. Just a thought experiment for now 🙂

#

I think that model gets more interesting as the repo & team layouts get more complex

#

because the chances increase that different teams want to look at the same directory in different ways at different times

#

but that is hard to do if you must choose one correct way to look at a given directory, and lock that into an exclusive config file

vagrant valley Jun 20, 2023, 2:46 AM

#

when you say "out of band", do you mean the end user tells the dagger CLI how they want to view the directory and that what is displayed to them is not wholly derived at runtime from configuration inputs?

maiden moat Jun 20, 2023, 2:46 AM

#

this relates to the still-unresolved dilemma of “where should the controller code live?”. So far we’ve all defaulted to something like “a subdirectory called .ci”, but with the rise of reusable extensions, that may not actually be where most of the code lives in the end. So instead of splitting controller code between “not reusable stays embedded in this repo” and “reusable lives as an extension in a separate dedicated repo” maybe we reconsider, and make everything external from day one? (still playing out the alternative scenario here)

maiden moat Jun 20, 2023, 2:47 AM

#

vagrant valley when you say "out of band", do you mean the end user tells the dagger CLI how th...

I mean it’s derived by things other than a config file + bespoke code embedded in the target repo

#

again this is the alternative scenario which I’m not even 100% sure makes sense, but it does address unresolved questions in the design

#

and it illustrates my overall point that we are still early days and everything is still up for discussion 🙂

vagrant valley Jun 20, 2023, 3:06 AM

#

that is an interesting alternative. Trying to unwrap it a little bit, this lets end users define what their own "project" is a bit easier and perhaps share that configuration in another location

#

(took me a few minutes to try to fully understand how this might work 😄 )

#

one thing that this makes more "clean" imo, is that I've drifted in my code more towards decoupling the actual "doing" of the thing vs the "invoking" of the thing in order to make things more composable. In my code I end up having build_task that gets reused in multiple commands as I build up my pipelines and the dependency tree between them. Perhaps this model might make that a bit easier

maiden moat Jun 20, 2023, 3:09 AM

#

Yeah, one good reference point is the pattern of a separate “platform” or “infra” repo, owned and maintained by the platform/infra/sre team. With tooling etc. The design I’m describing would basically align perfectly with that pattern.

#

And it would align less with the most common way to daggerize a repo today: embedding small python/go/js files directly in the repo

#

in your case it would be like moving all the ci.py files out of the airbyte repo, and into the aircmd repo

vagrant valley Jun 20, 2023, 3:13 AM

#

yeah, the one single entrypoint. I can see pros and cons to that approach. I will say that the embedded in repo model works really well when you have only one or a few projects

maiden moat Jun 20, 2023, 3:14 AM

#

yes it’s simple and familiar

vagrant valley Jun 20, 2023, 3:14 AM

#

but reusability and composability get really difficult when you scale. Imagine some microservices organization with hundreds of repos that wants to centralize this kind of stuff. Doing what you're suggesting suddenly becomes really appealing for control over all of the pipelines and access to reuse the building blocks

maiden moat Jun 20, 2023, 3:15 AM

#

Yeah I’m glad you’re seeing it. It’s just a hunch at the moment and, honestly I barely discussed it with the rest of the team, it may lead nowhere. What I find appealing is that it feels like a spiritual successor of the linux distro model

#

the distro model lost relevance because of technical and cultural baggage. but the core principles for scaling software maintenance were sound. Collectively they built a cohesive software platform at a scale unmatched since

vagrant valley Jun 20, 2023, 3:16 AM

#

yeah, i was really trying hard to understand that at first, it's becoming a bit clearer now. One subtle implication here is that you're sort of inverting the dependency situation. Now this big ops repo brings in all of the other libraries and stuff in order to be able to run the builds. I'm not sure which way is better but it's food for thought

maiden moat Jun 20, 2023, 3:16 AM

#

this feels like that

#

one example: what if you want to run all tests for all connectors as released 12 months ago?

#

if the target repo defines its own pipelines, well you need to fork that repo & amend those commits to change the pipelines

#

but if you invert the dependency.. as you said, that changes

vagrant valley Jun 20, 2023, 3:20 AM

#

it's definitely counter to the way ci traditionally has worked... which is not necessarily a bad thing 😄

maiden moat Jun 20, 2023, 3:20 AM

#

how about rebuilding your java stack from scratch one day with extra optimizations (fully making shit up now). Building a patched gradle, then running all pipelines for the core app on top of that stack

#

anyway I’m derailing again sorry 😅

vagrant valley Jun 20, 2023, 3:30 AM

#

ha! patching gradle sounds like an adventure all on its own.

One other point I want to bring up, coming back to the aircmd example since i've been playing in this space a LOT for the past few weeks. What you're suggesting makes testing pipelines simpler and honestly it feels like it would make developing pipelines simpler too, at least in my use case. Right now for aircmd, to test different repositories that use it, I have to add each one of them as test dependencies to aircmd. Anytime I change something reusable, I have to go to aircmd to change the code. So i've got this workflow where I'll implement something, decide whether it's worthwhile enough to abstract into something reusable, then I have to go build and release that to be able to use it downstream. Or I'll change something reusable and still have to go build and publish that change. The model of "every project in one place" being suggested here makes that now trivial in comparison.

I'll try to play devil's advocate and see if the reverse is true: if I make an application change, does this model make it harder to see if the pipeline breaks and iterate on it simultaneously? Where does that iteration loop need to be tighter? In other words, this is starting to sound very convenient for the ops guys, but will it hurt developers as a result?

#

it might be interesting to map out a theoretical workflow where we have to make an application code change under this model, and also a pipeline change (fairly common, methinks), and how that would work when the pipeline code lives elsewhere vs in the same repo. I think the solution is still somewhat complex, but already simpler than the existing model where you might have to touch an extension, publish that, and then consume the extension bump in your own repo pipeline

maiden moat Jun 20, 2023, 4:25 AM

#

https://tenor.com/view/math-thinking-zach-galifianakis-formulas-numbers-gif-7715569

#

yes I agree 🙂

#

this discussion is giving me ideas

maiden moat Jun 20, 2023, 7:11 PM

#

@vagrant valley I think in this scenario, the developer experience is basically that ops/platform is giving developers a ready-to-use environment

#

what is a project?

maiden moat Jun 20, 2023, 11:13 PM

#

@wraith nova @fathom iron @plain tapir @honest jacinth for reference, this is the thread about "external vs. embedded"

plain tapir Jun 20, 2023, 11:31 PM

#

Yeah, depends on use case. When I was working in a web agency, sharing a lot of the same tooling and conventions in many repos, it makes sense to have it centralized. I used to manage that with salt, but was looking to move that into containers. Now I only manage 3 websites and they're all different platforms so in this case it makes more sense to keep the pipeline with the project.

I know full well the pain of releasing, bumping and updating a dependency when I need to change some reusable piece of code in order to consume it in a project. That was specifically on shared python apps, but there's no other way there. And it's much better than what they were doing before which was copying the apps from repo to repo. So I agree, if you have multiple similar projects it makes sense to externalize.

When I started thinking about dagger I thought I could offload most of the same boilerplate into a reusable app, keeping the dagger boilerplate inside projects to a minimum. But it could also be completely outside. A lot of projects shared the same conventions so I could just dagger do django:publish --site hotelcaloura or dagger do drupal:cc --name visitazores centrally like I was already doing with salt, but easier to maintain and more accessible to team members.

#

But if I'd expose that to my previous team, I'd make a web UI for it rather than the CLI. This is actually something I've wanted to demo since I started working on the Python SDK. We were a Django shop, I'd easily use Wagtail as an interface for the team to interact with dagger pipelines already developed into this custom platform. Same centralization, just different UI.

vagrant valley Jun 21, 2023, 12:27 AM

#

Feels like there are two distinct user stories here

The developer who wants nice pipelines but doesn’t want to deal with the complexity of spinning up a large-scale system. They need something quick, clean, and efficient that won’t take forever to set up.
The builds-focused engineer who wants a good story for getting a handle on disparate builds in their organization. They need something modular and extensible.

Possible to have cake and eat it too by solving both? Not sure. Maybe with the right interface!

maiden moat Jun 21, 2023, 12:28 AM

#

Yeah I think you can get both

#

developer wants a containerized environment to run their simple builds, tests and deployments in an easy, reproducible, scriptable way.

#

platform engineer wants a containerized environment to run their complex builds, tests and deployments in an easy, reproducible, scriptable way.

fathom iron Jun 21, 2023, 1:22 AM

#

Just catching up here, yeah I 100% agree it's gonna be important to make both those stories and use cases work. The "I just want to automate some stuff in my repo" user who will embed CI targets in their own repo and the platform engineer user who wants to define a platform spanning possibly many different repos.

Lots of thoughts, but I guess the most central is that the direction we've headed so far makes both approaches possible, though there's plenty of work left to make both work well. And I do feel that it is possible to make both work well and it's also important to, since choosing to support exclusively one or the other is going to cause unnecessary pain for different groups of users.

In the current (very early, very WIP) conversion of dagger's CI to be a project, the top level command actually accepts args for which git repo it should run the rest of the targets against:

// Dagger CI targets
func CI(ctx dagger.Context, repo, branch string) (CITargets, error) {
    return CITargets{
        SrcDir: ctx.Client().Git(repo).Branch(branch).Tree(),
    }, nil
}

https://github.com/dagger/dagger/blob/736a68be4640078cb2e67aca4da547d1b0e1e4ba/ci/main.go#L12-L16

We don't have default values for those yet (github.com/dagger/dagger and main would be obvious choices), but that could be added easily too.

The Directory that gets created is passed down to all the subcommands, e.g. here it's mounted in to the ci:sdk:go:lint command for linting the go sdk: https://github.com/dagger/dagger/blob/736a68be4640078cb2e67aca4da547d1b0e1e4ba/ci/go.go#L21

What's interesting is that this approach actually means the repo being tested by the CI targets is technically independent of the definition of the CI targets. E.g. both of these would work:

dagger do -p ci/ ci:sdk:go:lint --repo github.com/dagger/dagger --branch v0.5.2
dagger do -p ci/ ci:sdk:go:lint --repo github.com/sipsma/dagger --branch my-new-feature

For practical reasons we currently have to embed our CI project in the actual dagger repo (related to the fact that we are testing dagger w/ dagger), but for any dagger user besides ourselves, they could follow exactly this sort of approach and enable running their CI targets against external repos.

We'd at minimum want to polish a lot of this up more, I wouldn't want users to have to specify --repo and --branch for every single repository in their platform every time they run dagger do, but that feels do-able in a number of different ways that we can explore some more.
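For illustration, here is a rough Python sketch of the same idea with the defaults filled in. It is not the Dagger SDK: `SourceRef`, `CITargets`, `lint_go_sdk` and `ci` are made-up names, and the target only returns a string describing what it would run.

```python
from dataclasses import dataclass

# Assumed defaults, taken from the "obvious choices" mentioned above.
DEFAULT_REPO = "github.com/dagger/dagger"
DEFAULT_BRANCH = "main"


@dataclass(frozen=True)
class SourceRef:
    """Identifies the source tree that the targets run against."""
    repo: str
    branch: str


@dataclass(frozen=True)
class CITargets:
    """Hypothetical container for targets; every target receives `src`."""
    src: SourceRef

    def lint_go_sdk(self) -> str:
        # A real target would mount the checked-out tree into a container.
        return f"lint go sdk from {self.src.repo}@{self.src.branch}"


def ci(repo: str = DEFAULT_REPO, branch: str = DEFAULT_BRANCH) -> CITargets:
    """Entrypoint: the repo under test is an argument, not an assumption."""
    return CITargets(src=SourceRef(repo=repo, branch=branch))
```

With defaults in place, `ci()` targets the main repo, while `ci("github.com/sipsma/dagger", "my-new-feature")` points the same targets at a fork.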

#

Also, the problem brought up earlier in this thread of possibly needing some sort of hierarchy of dagger.json or project imports, etc. will also come up in our dagger CI dogfooding in that we'd love for each SDK to define its CI targets in its own language, whereas today everything is go (even python and ts CI targets). But that's going to require combining projects from multiple SDKs into one interface.

Another situation with many possible approaches, but I haven't gotten to the issue for it in our backlog quite yet 🙂 But it definitely should be addressed sooner rather than later since it's fundamental

#

As I mentioned before, for practical reasons the dagger CI dogfood pretty much has to be embedded in the dagger repo, at least for now, but it does feel important to get more exploration with the separate repo, platform engineering use-case early on.

So not sure exactly how to do that yet... but we'll need to figure something out.

vagrant valley Jun 21, 2023, 4:30 PM

#

What about something like this? Consider, for the purposes of simplicity and brevity, the Airbyte scenario discussed above, where we have 3 distinct “projects”: “oss”, “cloud”, “infra”

dagger.json remains as-is but stores local and remote paths to source code:

{
    "name": "oss",
    "sdk": "python",
    "local_path": "local_path_to_source_code",
    "remote_path": "remote_path_to_source_code"
}

Now we introduce a top level file, that is entirely managed by the CLI, that is checked in somewhere but that can also be overridden in the user’s home directory, like ~/.dagger/config.json

{
    "projects": {
        "oss": {
            "project_path": "/path/to/oss/dagger.json",
            "overrides": {
                "local_path": "/local/path/to/oss/source/code"
            }
        },
        "cloud": {
            "project_path": "/path/to/cloud/dagger.json"
        },
        "infra": {
            "project_path": "/path/to/infra/dagger.json"
        }
    }
}

This file contains customizations that the end user can add like excluding projects, perhaps managed by “register” and “deregister”. The CLI can discover projects and add/remove them to the config.json this way.

CI/CD uses some default version of this file checked into the project repo to define its entrypoints. Maybe you point at this file when you start dagger or maybe dagger looks for it

Developers have their own paths on local machines, so when they run dagger to initially register the projects (possibly via a recursive directory search or via specifying a file to register on the command line), their local machine configs are updated.

Could make parts of this config.json sharable so end users could share different configurations and overrides.

This config file can be checked in alongside the dagger.json's in the repository or managed in a central "all builds" repository, and is extensible/overridable by the custom configuration specified in the user's home dir
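A rough sketch of how the checked-in file and the home-directory file could be combined, assuming both use the `projects` / `overrides` shape shown above. `merge_configs` is a hypothetical helper, and the merge rule (user values win, override keys are merged rather than replaced) is one possible choice, not something decided in this thread:

```python
import copy


def merge_configs(shared: dict, user: dict) -> dict:
    """Overlay the user's config on the shared one, project by project."""
    merged = copy.deepcopy(shared)
    projects = merged.setdefault("projects", {})
    for name, user_project in user.get("projects", {}).items():
        project = projects.setdefault(name, {})
        for key, value in user_project.items():
            if key == "overrides":
                # Merge override keys instead of replacing the whole block.
                project.setdefault("overrides", {}).update(value)
            else:
                project[key] = value
    return merged


# Shared file, e.g. checked into a central repo.
shared = {
    "projects": {
        "oss": {"project_path": "/path/to/oss/dagger.json"},
        "cloud": {"project_path": "/path/to/cloud/dagger.json"},
    }
}
# Per-user file, e.g. ~/.dagger/config.json.
user = {
    "projects": {
        "oss": {"overrides": {"local_path": "/local/path/to/oss/source/code"}}
    }
}
merged = merge_configs(shared, user)
```

The shared config is deep-copied first, so applying a developer's overrides never mutates the checked-in defaults.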

vagrant valley Jun 21, 2023, 5:15 PM

#

Couple of other QOL things that could be considered under this approach:

If left out of dagger.json, paths to source could work like the current behavior and assume the current directory contains the source files needed to define the commands
remote_path and local_path may be combined to simply "path" with some cleverness under the hood, maybe all you do is override --path on the command line like in erik's suggestion when you want to target something other than normal
maybe "name" is completely arbitrary and isn't even necessary in dagger.json anymore, it's just some unique key that the config.json defines and can be defined on a whim
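To make the first two points concrete, here is a small sketch of one possible lookup order for a single combined "path" setting. `resolve_source_path` and the precedence (command line, then user override, then dagger.json, then the current directory) are assumptions for illustration only:

```python
import os
from typing import Optional


def resolve_source_path(
    cli_path: Optional[str],
    overrides: dict,
    project: dict,
    cwd: Optional[str] = None,
) -> str:
    """Pick the source location, most specific setting first."""
    candidates = [
        cli_path,               # value of a --path flag, if given
        overrides.get("path"),  # per-user override from config.json
        project.get("path"),    # value stored in dagger.json, if any
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    # Nothing configured: fall back to the current directory.
    return cwd if cwd is not None else os.getcwd()
```

For example, `resolve_source_path(None, {}, {})` falls back to the working directory, which matches the current behavior described in the first point.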

maiden moat Jun 21, 2023, 7:13 PM

#

I think that instead of mapping to local paths, you could map to git remotes - it might remove a bunch of headaches, and work flawlessly 99% of the time

#

And you could group everything such that the environment can be defined in a file that is shareable and self-contained

#

{
    "projects": {
        "oss": {
            "sdk": "python",
            "git_remote": "github.com/airbytehq/airbyte",
            "overrides": {
                "local_path": "/local/path/to/oss/source/code"
            }
        },
        ...
    }
}

vagrant valley Jun 22, 2023, 4:28 PM

#

If we did that, would this obviate the need for dagger.json in each repo? meaning you would only need one JSON file (that is overridden in the home dir with a second file)?

maiden moat Jun 22, 2023, 5:31 PM

#

Yes, that is the idea - but to reiterate it is just one idea among several 🙂

In this scenario, yeah you would leave upstream repos untouched. And separately you would develop custom environments, with a dagger.json, automation code using one of the SDKs, etc

#

The environment would be developed with the expectation that it should be "applied" to a repository

#

(or possibly multiple repositories in the future?)

#

there's a whole rabbit hole there, to figure out the mapping.

#

But then again - the current model of mixing automation logic into an upstream repo, and figuring out the mapping, is a rabbit hole too

vagrant valley Jun 28, 2023, 10:03 PM

#

sounds like yall went onto hacking on a spike for this or related, excited to see what comes out of the oven

honest jacinth Jun 29, 2023, 5:21 AM

#

https://discord.com/channels/707636530424053791/1123776136707571793 has a rough proposal of a graphql schema

vagrant valley Jun 29, 2023, 12:33 PM

#

Nice first pass. I'll walk through and see how intuitive it feels with no extra info. Comments inline, contextualizing by mapping a real world use case, Airbyte, on top as before:

"API for your Dagger environment"
type Environment {
  "Command-line tools available in the environment"
  tools: [Tool!] // do these correspond to top level command groups as defined by dagger do syntax ["oss", "cloud", "infra"]
  "Lookup a command-line tool by name"
  tool(name: String!): Tool!

  "Artifacts available in the environment"
  artifacts(filters: [String!]): [Artifact!]

  """
  Checks available in the environment
  This could be unit tests, integration tests, linting, etc.
  """
  checks: [Checks!] // ["ossUnit", "cloudIntegration", "infraLint"] ???
  "Lookup a check by name"
  check(name: String!): Check!

  "Services available in the environment"
  services: [Service!] // ["redis", "postgres"]
  "Lookup a service by name"
  service(name: String!): Service!
  
  "Interactive shells available in the environment"
  shells: [Shell!]
  "Lookup a shell by name"
  shell(name: String!): Shell!

  "The working directory for this environment"
  workdir: Directory! // /path/to/central/builds/monorepo or //path/to/individual/repo ??

  "Direct access to the environment as a data graph. Great for scripting and composition"
  graph: EnvironmentGraph! // makes sense

  """
  Extend this environment with the capabilities of other environments.
  If no namespace is provided, environments are merged - this may cause naming conflicts.
  """
  withExtension(namespace: String, environment: Environment!): Environment!
  "Extensions currently added to the environment"
  extensions: [String!]! // ["Gale"]
  "Lookup an extension by its namespace"
  extension(namespace: String!): Environment!

  "Warm up the environment cache by pre-executing as much of the DAG as possible. This may make subsequent operations considerably faster"
  warmup: Environment! // neat!
}

#


type EnvironmentGraph {
    # This type will be extended by an environment-specific graphql schema
}

type Tool {
    name: String!
    description: String
    commands: [ToolCommand!]! // I see, this is either a command or a group of commands
}

type ToolCommand {
    name: String!
    description: String
    subcommands: [ToolCommand!]
    # invocation? // this should be optional probably, if not supplied perhaps it's a group?
}

type Artifact {
    name: String!
    description: String
    version: String
    labels: [String!]
    contents: ArtifactData!
    sbom: String! // what is sbom here?
}

union ArtifactData {
    Directory
    File
    Container // I saw some discussion about this, what if it's a pypi/npm/jar file for instance? is it just a File then?
}


type Checks {
  name: String!
  description: String
  result: CheckResult! // having a unified api around this is nice and something that it feels like Dagger Cloud needs
  subchecks: [Check!] // what is an example of a subcheck?
}

type CheckResult {
  success: Boolean!
  output: String!
}

type Shell { // I think I understand why you want this, but how does it relate to Tools?
    name: String!
    description: String
    terminal: Stream
}

// A bi-directional byte stream (eg. for terminal emulation)
type Stream {
    // Path to the websocket stream, relative to the current HTTP server
    wsPath: String!
}

#

Couple open questions/comments off the top of my head that I left out of comments:

Chaining commands (do build, then test, then publish) and command dependencies (publish requires test to execute) looks like it was left out of the proposal. Perhaps that is intentional? It gets more into the "orchestration" aspect of things, which is perhaps out of scope here
One environment appears to enforce that only one version of the extension, let's call it Gale, is available. I think this is a good thing
Are "Checks" wholly independent of the concept of what a "Pipeline" is?
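To check my reading of withExtension, here is a toy Python model of the namespacing rule. Everything in it is an assumption: the proposal only says that merging without a namespace "may cause naming conflicts", so raising an error and using ":" as the separator are illustrative choices, and `Environment` here only tracks check names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Environment:
    """Toy stand-in for the proposed type: check names mapped to descriptions."""
    checks: Dict[str, str] = field(default_factory=dict)

    def with_extension(self, other: "Environment",
                       namespace: Optional[str] = None) -> "Environment":
        """Return a new environment that also exposes `other`'s checks."""
        merged = dict(self.checks)
        for name, description in other.checks.items():
            key = f"{namespace}:{name}" if namespace else name
            if key in merged:
                # Assumed behavior: fail loudly rather than silently shadow.
                raise ValueError(f"naming conflict on check {key!r}")
            merged[key] = description
        return Environment(checks=merged)
```

With a namespace, two environments that both define a `lint` check can coexist; without one, the same merge is rejected in this sketch.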

maiden moat Jun 29, 2023, 7:04 PM

#

@vagrant valley CLI tools are meant to be a UI construct: something run by an end-user in the terminal, in a familiar way (like a shell script or npm script). They are not meant to be composed or wrapped by code, so there is no attempt to make that easy

#

However that's what graph is for: that's where an environment can expose an arbitrary data graph, with the full power of GraphQL's type system, for scripting and composition (and therefore chaining)

#

Are "Checks" wholly independent of the concept of what a "Pipeline" is?

Yes. The concept of "pipeline" becomes less important for the end user. They are more of an implementation detail - what happens under the hood when a client queries the DAG

vagrant valley Jun 29, 2023, 7:21 PM

#

makes total sense! Agree that the command should not be what is chained here. So in this approach, would you define your building blocks that actually do something (let's just call them something like oss_build_task, oss_test_task, and oss_publish_task) in the environment and then the commands simply call these building blocks via the GraphQL API?

maiden moat Jun 29, 2023, 7:25 PM

#

The way I see it, you would create an environment (let's call it "airbyte-connector-factory") with everything needed to develop airbyte connectors all packaged together for maximum productivity

#

And that environment would offer both:

A simple CLI tool called aircmd, invokable via dagger do or via graphql query { tool(name: "aircmd") { command(name: "build") { ..} } etc. (of course you can continue shipping it as a standalone binary outside of the dagger environment)
A complete graph for querying and composition, including custom build, test and publish tasks, at whatever level of detail you want for scripting
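A toy Python model of those two layers, assuming the CLI commands are thin wrappers over the graph. `Graph`, `Tool`, the task bodies and the `registry.example` name are all hypothetical; nothing here uses the Dagger API or the proposed schema:

```python
from typing import Callable, Dict


class Graph:
    """Hypothetical composable layer: plain functions that return values."""

    def build(self, connector: str) -> str:
        return f"{connector}-image"

    def test(self, image: str) -> bool:
        return image.endswith("-image")

    def publish(self, image: str) -> str:
        return f"registry.example/{image}"


class Tool:
    """Hypothetical UI layer: named commands that only call into the graph."""

    def __init__(self, name: str, graph: Graph) -> None:
        self.name = name
        self.commands: Dict[str, Callable[[str], str]] = {
            "build": graph.build,
            "publish": lambda c: graph.publish(graph.build(c)),
        }

    def run(self, command: str, connector: str) -> str:
        return self.commands[command](connector)


graph = Graph()
aircmd = Tool("aircmd", graph)

# Scripts chain graph calls directly instead of wrapping CLI commands.
image = graph.build("example")
published = graph.publish(image) if graph.test(image) else None
```

The chaining (build, then test, then publish) lives in code that talks to the graph, while `aircmd.run("build", "example")` stays a simple entry point for people at a terminal.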
