#CI improvements

1 messages ยท Page 1 of 1 (latest)

lusty grove
#

Lets start it here!

Categories I brought up earlier:

  • [ ] dagger-for-github improvements to make it compatible with our own workflows
  • [ ] dagger git improvements. Currently we do some git checkout dance to get a shallow clone that also has some extra bits we need. In an ideal world, we can just run dagger call -m github.com/dagger/dagger@ref instead of doing a checkout which requires some work in dagger such as https://github.com/dagger/dagger/pull/11030
  • [ ] stop building the dev engine in every workflow. Instead, the current set of workflows will run against a release engine. This is ok because they already build their own dev engine as a service in the test pipeline. To make sure we still validate the module loading the same way building a dev engine does, we will create a new workflow that creates a dev engine and runs a simple job against it. See https://github.com/dagger/dagger/pull/9853 for more context on the type of error that this is intended to catch

The hopeful outcome is that each workflow can be a very simple event definition + dagger call

warm anvil
#

CI improvements

#

Kind of a stretch goal, but maybe we take the opportunity to further break up and simplify the monolith? ๐Ÿ™‚

#

If there's an easy target I can take that on

#

I still cringe every time I have to open /.dagger/*.go

glacial moth
#

based on a discussion in #maintainers curious about reworking the changie flow as well? i've wanted to remove it for a while.

#

chatting with @lime kestrel i actually really like the idea of just using labels (like changelog/feature, changelog/fix, etc) to decide which PRs are relevant, and then only extracting the PR title in the changelog.
that forces us to write nice, user-readable PR titles, which ain't a bad thing.
not a huge fan of using an agent, but if we changed the flow to this, it would be a lot easier to give that a try later.

glacial moth
warm anvil
#

another one: you mentioned path filtering @lusty grove .

Hear me out, what if we generated gha path filters from scanning dagger module dependencies? It would get us closer to the goal of "smart events" in native CI (where you don't need explicit path filtering at all).

wdyt?

glacial moth
#

i don't think it's entirely trivial, since we'd need to extract include/exclude filters from dagger.json and from the context args

#

(those are also buildkit-style ignore patterns, which may-or-may-not be convertible to shell-style globs)

warm anvil
#

right. What's interesting is that we need to do that anyway for snart events. So could be a stepping stone. cc @brittle sphinx @opal minnow

glacial moth
#

if we do it, imo, it should be part of the engine functionality - i don't think we should have extra logic that's not part of the engine to detect this.

warm anvil
#

yeah it's going to be in cloud in the end

#

but there could be helpers in the engine, just not sure what the interface would be

#

I guess just a list of paths?

#

I'm guessing you don't want an actual gha config generator in the engine ๐Ÿ™‚

glacial moth
#

yeah, probably a list of paths? i'm not really sure

#

obviously, there might be a super easy way to do this, but my guestimation is that getting something that matches the compat of include/excludes of modules is gonna be fiddly

glacial moth
warm anvil
#

@kpiroddi @lusty grove do you think we can get our CI config so standardized and vanilla, that the YAML could eventually be generated without custom code? ๐Ÿ™‚

(Sorry Kelvin for the accidental ping!)

#

Just look at the modules, boom, spit out some yaml

#

(not as a short term thing, more like an aspirational goal)

#

also a stepping stone to native CI ๐Ÿ™‚

lusty grove
warm anvil
#

I love the idea of providing the primitive in the engine, it makes perfect sense. It's non-trivial, but feels clean, and once we have it, it will make it trivial to generate path filters for all CI systems ๐Ÿ˜

lusty grove
#

it wont actually run until the first 2 TODOs are done

#

the 3rd TODO is a bigger one so maybe we can pair on that Monday @brittle sphinx ?

brittle sphinx
lusty grove
# lusty grove draft: https://github.com/dagger/dagger/pull/11099

@brittle sphinx the key parts of this PR:

The remaining TODO that I mentioned in standup is to update /modules/gha/steps.go:callDaggerStep to use dagger-for-github instead of the bash its currently using, however this can totally be done in a separate PR to keep things simple.

Currently CI is failing due to the ongoing network/docker/ns issues, so I'm not sure how close the current PR is to being ready

brittle sphinx
#

I know you are on PTO, just sharing here for context. I'll take a look at that last change. I was reviewing the PR, one thing we should change is the name of the jobs to not include the dev in their name. This way we can visiblity into the switch on honeycomb

brittle sphinx
#

Switching to dagger-for-github as is means we are loosing the summary attached to each workflow (see screenshot). We could add this capability to dagger-for-github, Its nice to have the trace immediately there without having to search through the logs. WDYT? @glacial moth. In that case I would merge this PR as is and do the callDaggerStep change in a follow up as Kyle suggested

glacial moth
#

uhhh

#

i don't use the summary that much honestly

#

having the links is useful!

#

there's an output from the dagger-for-github actions

#

so wouldn't be that hard to plumb into the summary? we just need output it

brittle sphinx
#

Are the engine logs on engine & cli tests used?

glacial moth
#

yesss - you can find them under the AsService logs span

#

takes a bit of clicking around, but they're there

warm anvil
#

It would make me happy if we didn't throw away the summary feature, it took me a long time to debug ๐Ÿ˜ญ By the way, the generator made the devloop on the summary easier, because I could call the individual dagger function that generates the markdown, with mocked inputs, and locally check that it looked ok.

glacial moth
#

regardless of how long it took to work on, is it something that you use? i find the github summaries are not super useful to navigate to. if we do keep them, i'd rather have them as just a link directly to the trace, instead of having another way to visualize traces.

again, it's the dogfooding argument - if traces isn't good enough, we should make it good enough, but if we're relying on a tool our users aren't using, then we're going to miss those things

warm anvil
#

Well, it's a native feature of Github Actions that makes our integration with it slightly better. I am all in on Dagger Cloud, and the CI integration itself is not something I actively debug and maintain these days, so I go straight to the Dagger Cloud traces. But that won't be true of all our users. Having gone to the trouble of implementing a native summary is a sign of a well-maintained action that worried about the details (IMO)

#

Occasionally when the CI config itself is what I care about, then yes, I absolutely use the summary. Much better than rawdogging the logs

brittle sphinx
brittle sphinx
#

Sharing some more details of today's update re: https://github.com/dagger/dagger/pull/11133 @lusty grove

There's two separate changes in that PR:

  1. Use dagger-for-github instead of the exec script. This removes the install-dagger as well
  2. Using PARC by setting dagger-flags: --cloud and cloud-token: oidc on the dagger-for-github action

For 2. to go out we need tomorrow's release. We could split the PR in two but since we only had to wait until tomorrow I thought it didn't matter that much. This is the change we need before using PARC in CI: https://github.com/dagger/dagger/commit/e669c2a22a319691f82ae2cfbe4d9925ef26e0c0#diff-a779dc2d07845bab6445b177d3ab8e5b5f3ffb6e48d72198b9b92cce6837402eR287. It's a hacky version of scale out, developed as a temporary meassure to start testing the performance of PARC in alternative CI engines.

Next steps:

  1. Develop the v0 scale out we mentioned on today's call. Module + function
  2. Refactor our CI to have separate functions for github actions jobs: check --targets=docs -> check-docs, etc. This way our github actions YAML won't have anything special, it will just call functions with no required arguments
lusty grove
lime kestrel
#

will just run em locally for now

glacial moth
#

i would add changelog/feature, changelog/fix, etc labels to correspond with the changie categories

#

then, when we generate a changelog, simply get all the PRs since the last release, grab em from github, filter by labels, and run the titles + authors through a go template

#

then, implement something like the changie merge functionality to get it into the CHANGELOG

#

that's my vision at least!

#

if we wanna add llm stuff after that that's potentially interesting, but i wouldn't start with that

lime kestrel
glacial moth
glacial moth
lime kestrel
#

well, we're still publishing github releases right? this is just a bool/flag on the github release creation api

#

i guess the question becomes whether we value a CHANGELOG file?

glacial moth
#

@lusty grove was this one of the things you were interested on working on?

lusty grove
warm anvil
#

@lusty grove @lime kestrel I'm polishing my codegen improvements PR... A small issue with node_modules which is now getting pulled into the Changeset. Looking for the cleanest way to omit it. Is it acceptable to mount node_modules in a tmpfs?

lusty grove
#

I think that should be ok in our case. Is it not possible to exclude it from the directory?

warm anvil
lusty grove
#

Got it, I was thinking like Directory.WithoutDirectory("/src/node_modules").Changes() but maybe that doesn't make sense

warm anvil
#

Yeah that's what I'm going to do for now..

#

Just not sure what is more or less efficient..

warm anvil
#

Update: codegen refactor is looking good! will open PR soon

#

I kind of want to attack our version module next... So slow to load..

#

I added a compat layer, so we can upgrade each individual SDK codegen to Changeset, incrementally (the module supports both)

#

Also simplified sdk_all.go which had a lot of unnecessary abstractions

#

@lusty grove how do you feel about porting all the official SDK CI modules back to Go?

#

Right now we're in an awkward middle ground, where:

  1. We still have centralized per-SDK CI in the monolith (.dagger/sdk_xxx.go)

  2. Separately, each SDK has a standalone "dev" module, written in its native language.

  3. Some parts of the monolith do a pass-through to the dev module, for some functions. For example Python and Elixir pass-through to the dev module for generate and lint.

  4. Other parts of the monolith still duplicate what's in the dev modules, so there's drift and dead code.

  5. We get the downsides of multi-language (slowness, more complexity, more boilerplate), but we don't get the benefits

lusty grove
#

the alternative would be to fix number 4 i guess?

warm anvil
#

My proposal (for discussion):

  1. We port the dev modules to Go

  2. Once everything is in Go, much easier to finish the job, and spin out 100% of SDK CI in sub-modules

  3. This is a forcing function for cleaning up the very complex relationship between SDK build and engine build. There's a circular dependency because: engine build bundles SDKs; and SDK tests require engine. Until we clear 1 and 2, we can't touch 3.

#

As a bonus, it would make our entire CI faster

#

everything is so complicated...

#

๐Ÿคฎ

warm anvil
#

Running into a fundamental UX question: different generation functions have different lifecycles

#
  • In devloop: generate SDKs and docs
  • In release loop: generate changelog (because that's how Changie works)

So far we made the assumption that all generation is always safe to run during devloop, and therefore should be required by CI (ie. there's a check that fails if you forgot to generate anything). But we can't do that for changelog... But it's still a generation function! So... how to expose this lifecycle difference in a way that makes sense to the user?

#

cc <@&946480760016207902>

#

Maybe we'll eventually need a first-class concept of "loop" or "gate"?

#

maybe to be reconciled with the Concourse-like declarative model? Here we would be declaring the stages of your lifecycle

#

And how to graduate from one to the other

#

(just spitballing)

warm anvil
warm anvil
lusty grove
#

I think we're ready to simplify the changelog generation so that will go away probably

warm anvil
#

Just playing devil's advocate, I don't think changing how we generate changelog is the lowest-hanging fruit - even though it sounds cool and I'm not a changie fan

#

(raw reaction from spelunking in our monolith... it's pretty gnarly in there still)

lusty grove
#

Definitely not lowest-hanging, but it fits somewhere in the CI improvements roadmap

warm anvil
#

@lusty grove when you're around I'd love to show you around my monolith simplification PR, it's pretty satisfying

warm anvil
#

@lusty grove not urgent - what's the context for this GHA workflow added last week?

        withWorkflow(
            ci.AltRunner,
            true,
            "Dev Engine",
            "check-sdks",
        ).
#

It seems redundant with this other one:

    withSDKWorkflows(
            ci.AltRunnerWithCache,
            "SDKs",
            "python",
            "typescript",
            "go",
            "java",
            "elixir",
            "rust",
            "php",
            "dotnet",
        ).

Except for the different runner

lusty grove
#

When we stopped running everything on dev engines, we agreed to have one small task that still runs against a dev engine to verify our CI module loading against dev and all that. So the specific check is somewhat arbitrary

warm anvil
#

Ah I see.

#

@lusty grove I'm trying to organize the "check matrix" in the monolith..

#

The older workflow (withSDKWorkflows) runs each SDK check in a parallel GHA job. Whereas yours runs all SDK checks together in a single job. I'm wondering if I could nuke the parallel one, and just apply the same pattern as you in both cases

#

At the moment a SDK check is 1) sdk lint 2) sdk test 3) test-publish

I would have assumed it was too big to group all 3 for all sdks in a single gha job. But maybe not?

#

(assumption based on the pre-existing parallel workflow)

#

if it all fits in a single GHA job, I could nuke some complexity in the module

#

will check prod traces to get a sense for the size of these jobs

#

OK .... if I'm not mistaken, your combined "all in one" job is actually WAY FASTER... don't understand how that's possible

#

I guess it's because of the new runners

#

?

#

Based on these numbers it seems very reasonable to nuke the parallel jobs and just run all-in-one SDK checks in a single job, in both old and new runners