#CI improvements
1 messages ยท Page 1 of 1 (latest)
Lets start it here!
Categories I brought up earlier:
- [ ] dagger-for-github improvements to make it compatible with our own workflows
- [ ] dagger git improvements. Currently we do some git checkout dance to get a shallow clone that also has some extra bits we need. In an ideal world, we can just run
dagger call -m github.com/dagger/dagger@refinstead of doing a checkout which requires some work in dagger such as https://github.com/dagger/dagger/pull/11030 - [ ] stop building the dev engine in every workflow. Instead, the current set of workflows will run against a release engine. This is ok because they already build their own dev engine as a service in the test pipeline. To make sure we still validate the module loading the same way building a dev engine does, we will create a new workflow that creates a dev engine and runs a simple job against it. See https://github.com/dagger/dagger/pull/9853 for more context on the type of error that this is intended to catch
The hopeful outcome is that each workflow can be a very simple event definition + dagger call
CI improvements
Kind of a stretch goal, but maybe we take the opportunity to further break up and simplify the monolith? ๐
If there's an easy target I can take that on
I still cringe every time I have to open /.dagger/*.go
SGTM ๐
based on a discussion in #maintainers curious about reworking the changie flow as well? i've wanted to remove it for a while.
chatting with @lime kestrel i actually really like the idea of just using labels (like changelog/feature, changelog/fix, etc) to decide which PRs are relevant, and then only extracting the PR title in the changelog.
that forces us to write nice, user-readable PR titles, which ain't a bad thing.
not a huge fan of using an agent, but if we changed the flow to this, it would be a lot easier to give that a try later.
.dagger/scan.gocould mostly be a resuablemodules/trivysdk_typescript.godoesn't use thesdk/typescript/devmodulesdk_go.goshould mostly be split into a dev module forsdk/go/dev
another one: you mentioned path filtering @lusty grove .
Hear me out, what if we generated gha path filters from scanning dagger module dependencies? It would get us closer to the goal of "smart events" in native CI (where you don't need explicit path filtering at all).
wdyt?
i don't think it's entirely trivial, since we'd need to extract include/exclude filters from dagger.json and from the context args
(those are also buildkit-style ignore patterns, which may-or-may-not be convertible to shell-style globs)
right. What's interesting is that we need to do that anyway for snart events. So could be a stepping stone. cc @brittle sphinx @opal minnow
if we do it, imo, it should be part of the engine functionality - i don't think we should have extra logic that's not part of the engine to detect this.
yeah it's going to be in cloud in the end
but there could be helpers in the engine, just not sure what the interface would be
I guess just a list of paths?
I'm guessing you don't want an actual gha config generator in the engine ๐
yeah, probably a list of paths? i'm not really sure
obviously, there might be a super easy way to do this, but my guestimation is that getting something that matches the compat of include/excludes of modules is gonna be fiddly
yeah I see the appeal.
@lusty grove do you think we can get our CI config so standardized and vanilla, that the YAML could eventually be generated without custom code? ๐@kpiroddi
(Sorry Kelvin for the accidental ping!)
Just look at the modules, boom, spit out some yaml
(not as a short term thing, more like an aspirational goal)
also a stepping stone to native CI ๐
yes i've 100% been thinking of it as a stepping stone. we'll keep pushing for the config to be as minimal as possible
As discussed: https://github.com/dagger/dagger/issues/11098
I love the idea of providing the primitive in the engine, it makes perfect sense. It's non-trivial, but feels clean, and once we have it, it will make it trivial to generate path filters for all CI systems ๐
it wont actually run until the first 2 TODOs are done
the 3rd TODO is a bigger one so maybe we can pair on that Monday @brittle sphinx ?
Sounds good! We should have a bit more data on parc by monday so we could potentially start migrating some workflows as part of this work
@brittle sphinx the key parts of this PR:
- https://github.com/dagger/dagger/pull/11099/files#diff-8faf949e9daf6a3189631345b957f26bf1df9843f7fbbebc5a314e49bc9edfcc
.github/main.go: disables the 3 spots where we specify that we want a dev engine and adds a new workflow that runsdagger call check --targets=SDKswith a dev engine - https://github.com/dagger/dagger/pull/11099/files#diff-45807afeb464794b5fe50db55cfca3642aca0555d095315263f84a45dcab4ed1
modules/gha/steps.go: replace the install-dagger bash step withdagger-for-github@v8.2.0. This is using the action as a "setup" action - it just installs the CLI. In runs where the correct version is already installed, it's a no-op
The remaining TODO that I mentioned in standup is to update /modules/gha/steps.go:callDaggerStep to use dagger-for-github instead of the bash its currently using, however this can totally be done in a separate PR to keep things simple.
Currently CI is failing due to the ongoing network/docker/ns issues, so I'm not sure how close the current PR is to being ready
I know you are on PTO, just sharing here for context. I'll take a look at that last change. I was reviewing the PR, one thing we should change is the name of the jobs to not include the dev in their name. This way we can visiblity into the switch on honeycomb
Switching to dagger-for-github as is means we are loosing the summary attached to each workflow (see screenshot). We could add this capability to dagger-for-github, Its nice to have the trace immediately there without having to search through the logs. WDYT? @glacial moth. In that case I would merge this PR as is and do the callDaggerStep change in a follow up as Kyle suggested
uhhh
i don't use the summary that much honestly
having the links is useful!
there's an output from the dagger-for-github actions
so wouldn't be that hard to plumb into the summary? we just need output it
Are the engine logs on engine & cli tests used?
yesss - you can find them under the AsService logs span
takes a bit of clicking around, but they're there
It would make me happy if we didn't throw away the summary feature, it took me a long time to debug ๐ญ By the way, the generator made the devloop on the summary easier, because I could call the individual dagger function that generates the markdown, with mocked inputs, and locally check that it looked ok.
regardless of how long it took to work on, is it something that you use? i find the github summaries are not super useful to navigate to. if we do keep them, i'd rather have them as just a link directly to the trace, instead of having another way to visualize traces.
again, it's the dogfooding argument - if traces isn't good enough, we should make it good enough, but if we're relying on a tool our users aren't using, then we're going to miss those things
Well, it's a native feature of Github Actions that makes our integration with it slightly better. I am all in on Dagger Cloud, and the CI integration itself is not something I actively debug and maintain these days, so I go straight to the Dagger Cloud traces. But that won't be true of all our users. Having gone to the trouble of implementing a native summary is a sign of a well-maintained action that worried about the details (IMO)
Occasionally when the CI config itself is what I care about, then yes, I absolutely use the summary. Much better than rawdogging the logs
Added a summary to dagger-for-github here: https://github.com/dagger/dagger-for-github/pull/192
I included the actual command being executed (or script if running dagger shell), the link to the trace and the output of dagger version
Sharing some more details of today's update re: https://github.com/dagger/dagger/pull/11133 @lusty grove
There's two separate changes in that PR:
- Use
dagger-for-githubinstead of theexecscript. This removes theinstall-daggeras well - Using PARC by setting
dagger-flags: --cloudandcloud-token: oidcon the dagger-for-github action
For 2. to go out we need tomorrow's release. We could split the PR in two but since we only had to wait until tomorrow I thought it didn't matter that much. This is the change we need before using PARC in CI: https://github.com/dagger/dagger/commit/e669c2a22a319691f82ae2cfbe4d9925ef26e0c0#diff-a779dc2d07845bab6445b177d3ab8e5b5f3ffb6e48d72198b9b92cce6837402eR287. It's a hacky version of scale out, developed as a temporary meassure to start testing the performance of PARC in alternative CI engines.
Next steps:
- Develop the v0 scale out we mentioned on today's call. Module + function
- Refactor our CI to have separate functions for github actions jobs:
check --targets=docs->check-docs, etc. This way our github actions YAML won't have anything special, it will just call functions with no required arguments
left a comment on the PR!
dont want to lose this thread (regarding changelogs / removing changie). What tooling do we need to land on a better flow?
@brittle sphinx looks like the evals job doesn't work anymore since it runs with final engine instead of dev: https://dagger.cloud/dagger/traces/39defc61d65f33469406c3c8e090d2af
will just run em locally for now
i would suggest that use a github label enabled flow
i would add changelog/feature, changelog/fix, etc labels to correspond with the changie categories
then, when we generate a changelog, simply get all the PRs since the last release, grab em from github, filter by labels, and run the titles + authors through a go template
then, implement something like the changie merge functionality to get it into the CHANGELOG
that's my vision at least!
if we wanna add llm stuff after that that's potentially interesting, but i wouldn't start with that
github has basically all that built in btw, there's just a tiny straightforward config file: https://docs.github.com/en/repositories/releasing-projects-on-github/automatically-generated-release-notes#example-configurations
i still would really like the option to manually tidy up changelogs if i don't like it
i use this for bass: https://github.com/vito/bass/blob/main/.github/release.yml
sgtm, if we're happy using a github tool, i thought part of the goal was to detach from a github-centric process?
well, we're still publishing github releases right? this is just a bool/flag on the github release creation api
i guess the question becomes whether we value a CHANGELOG file?
there's https://github.com/rhysd/changelog-from-release which seems ok. but i'm not strongly opinionated. just trying to raise options for minimal effort ๐
@lusty grove was this one of the things you were interested on working on?
yeah I'm happy to carry this one. Sounds like a good discussion topic for next week!
@lusty grove @lime kestrel I'm polishing my codegen improvements PR... A small issue with node_modules which is now getting pulled into the Changeset. Looking for the cleanest way to omit it. Is it acceptable to mount node_modules in a tmpfs?
I think that should be ok in our case. Is it not possible to exclude it from the directory?
In my case, no practical way...
- There's no
Directory.Changes(exclude) - Also no
Container.Directory(exclude)
So it's possible but I have to refactor a bit
Got it, I was thinking like Directory.WithoutDirectory("/src/node_modules").Changes() but maybe that doesn't make sense
Yeah that's what I'm going to do for now..
Just not sure what is more or less efficient..
Update: codegen refactor is looking good! will open PR soon
I kind of want to attack our version module next... So slow to load..
I added a compat layer, so we can upgrade each individual SDK codegen to Changeset, incrementally (the module supports both)
Also simplified sdk_all.go which had a lot of unnecessary abstractions
@lusty grove how do you feel about porting all the official SDK CI modules back to Go?
Right now we're in an awkward middle ground, where:
-
We still have centralized per-SDK CI in the monolith (
.dagger/sdk_xxx.go) -
Separately, each SDK has a standalone "dev" module, written in its native language.
-
Some parts of the monolith do a pass-through to the dev module, for some functions. For example Python and Elixir pass-through to the dev module for generate and lint.
-
Other parts of the monolith still duplicate what's in the dev modules, so there's drift and dead code.
-
We get the downsides of multi-language (slowness, more complexity, more boilerplate), but we don't get the benefits
the alternative would be to fix number 4 i guess?
My proposal (for discussion):
-
We port the dev modules to Go
-
Once everything is in Go, much easier to finish the job, and spin out 100% of SDK CI in sub-modules
-
This is a forcing function for cleaning up the very complex relationship between SDK build and engine build. There's a circular dependency because: engine build bundles SDKs; and SDK tests require engine. Until we clear 1 and 2, we can't touch 3.
As a bonus, it would make our entire CI faster
everything is so complicated...
๐คฎ
Running into a fundamental UX question: different generation functions have different lifecycles
- In devloop: generate SDKs and docs
- In release loop: generate changelog (because that's how Changie works)
So far we made the assumption that all generation is always safe to run during devloop, and therefore should be required by CI (ie. there's a check that fails if you forgot to generate anything). But we can't do that for changelog... But it's still a generation function! So... how to expose this lifecycle difference in a way that makes sense to the user?
cc <@&946480760016207902>
Maybe we'll eventually need a first-class concept of "loop" or "gate"?
maybe to be reconciled with the Concourse-like declarative model? Here we would be declaring the stages of your lifecycle
And how to graduate from one to the other
(just spitballing)
Starting to get there: https://github.com/dagger/dagger/pull/11161
Update, looks like I got it wrong, the check for generated Changelog is called in devloop (and checked in every PR). I was missing the fact that new entries from changie new don't flow directly into the generated changelog - they have to be staged with changie batch first
I think we're ready to simplify the changelog generation so that will go away probably
Just playing devil's advocate, I don't think changing how we generate changelog is the lowest-hanging fruit - even though it sounds cool and I'm not a changie fan
(raw reaction from spelunking in our monolith... it's pretty gnarly in there still)
Definitely not lowest-hanging, but it fits somewhere in the CI improvements roadmap
@lusty grove when you're around I'd love to show you around my monolith simplification PR, it's pretty satisfying
@lusty grove not urgent - what's the context for this GHA workflow added last week?
withWorkflow(
ci.AltRunner,
true,
"Dev Engine",
"check-sdks",
).
It seems redundant with this other one:
withSDKWorkflows(
ci.AltRunnerWithCache,
"SDKs",
"python",
"typescript",
"go",
"java",
"elixir",
"rust",
"php",
"dotnet",
).
Except for the different runner
When we stopped running everything on dev engines, we agreed to have one small task that still runs against a dev engine to verify our CI module loading against dev and all that. So the specific check is somewhat arbitrary
Ah I see.
@lusty grove I'm trying to organize the "check matrix" in the monolith..
The older workflow (withSDKWorkflows) runs each SDK check in a parallel GHA job. Whereas yours runs all SDK checks together in a single job. I'm wondering if I could nuke the parallel one, and just apply the same pattern as you in both cases
At the moment a SDK check is 1) sdk lint 2) sdk test 3) test-publish
I would have assumed it was too big to group all 3 for all sdks in a single gha job. But maybe not?
(assumption based on the pre-existing parallel workflow)
if it all fits in a single GHA job, I could nuke some complexity in the module
will check prod traces to get a sense for the size of these jobs
๐ค
OK .... if I'm not mistaken, your combined "all in one" job is actually WAY FASTER... don't understand how that's possible
I guess it's because of the new runners
?
Based on these numbers it seems very reasonable to nuke the parallel jobs and just run all-in-one SDK checks in a single job, in both old and new runners