#maintainers
talk about perfect timing
@tepid nova @spark cedar signal boosting re: +gitignore, curious what yall think - https://github.com/dagger/dagger/issues/11317#issuecomment-3458882319
🤔 there needs to be a way to do +**/.gitignore if we went that route
that was the "more to bikeshed" i mentioned
but yeah, uh
sneaky thing here too, is each found .gitignore needs to be applied relative to where it was found. tricky, but doable right?
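A minimal sketch of the "applied relative to where it was found" idea, assuming the outer glob system understands `**` (an assumption, not something confirmed in this thread). `rebasePattern` is a hypothetical helper, not an existing Dagger function, and it ignores escapes such as `\#` or `\ `:

```go
package main

import (
	"fmt"
	"strings"
)

// rebasePattern rewrites one line of a .gitignore found in dir (slash-separated,
// relative to the repo root, "" for the root) into a root-relative pattern.
// It returns ok=false for blank lines and comments.
// Hypothetical helper: escape handling is deliberately left out.
func rebasePattern(dir, line string) (string, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", false
	}
	negate := strings.HasPrefix(line, "!")
	line = strings.TrimPrefix(line, "!")

	// A trailing slash means "directories only"; it does not anchor the pattern.
	dirOnly := strings.HasSuffix(line, "/")
	line = strings.TrimSuffix(line, "/")

	// A slash at the start or in the middle anchors the pattern to dir.
	anchored := strings.Contains(line, "/")
	line = strings.TrimPrefix(line, "/")

	prefix := ""
	if d := strings.Trim(dir, "/"); d != "" {
		prefix = d + "/"
	}
	out := prefix + line
	if !anchored {
		// Unanchored patterns match at any depth below dir.
		out = prefix + "**/" + line
	}
	if dirOnly {
		out += "/"
	}
	if negate {
		out = "!" + out
	}
	return out, true
}

func main() {
	for _, line := range []string{"*.log", "/build", "docs/*.tmp", "cache/", "!keep.log", "# comment"} {
		if p, ok := rebasePattern("services/api", line); ok {
			fmt.Printf("%-12s -> %s\n", line, p)
		}
	}
}
```

Ordering still matters after rebasing: patterns from deeper .gitignore files have to come after those from parent directories so that later negations can override earlier matches.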
i guess if they're part of the same thing it implies that they're ordered
jinx lol
uh
on the trickiness level, it would be the result of merging FilterFS and gitignoreFS into one struct
dunno, is undoing gitignores later on in a pattern an actual "legitimate" use case? like, i can't understand why that's useful
because if your module depends on it, you then can't call it in git
(some modules already have this problem, which grrr)
potentially we should take the opportunity to do codegen for the supported pragmas
e.g. for creating the helpers in python/ts
ah right, is the challenge that our gitignore support can't be translated to the outer glob system? in my mind this was a simple preprocessing thing, like 'load globs equivalent to .gitignore contents and inline them into the list', but iirc that's something we tried before and it didn't pan out?
right right
tl;dr the main difference is that git patterns support a trailing / to only match directories. docker ignore patterns don't support anything like this, and you can't really emulate it; see https://github.com/dagger/dagger/pull/10870
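To illustrate the trailing-slash point: the match result depends on whether the entry is a directory, which a plain list of string globs has no way to express. A rough sketch using only the Go standard library; `matchEntry` is a made-up name and only handles slash-free patterns against a single path component:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchEntry reports whether a slash-free gitignore-style pattern matches one
// path component. A trailing "/" restricts the match to directories, so the
// caller must supply isDir. Hypothetical helper for illustration only.
func matchEntry(pattern, name string, isDir bool) (bool, error) {
	if strings.HasSuffix(pattern, "/") {
		if !isDir {
			return false, nil
		}
		pattern = strings.TrimSuffix(pattern, "/")
	}
	return path.Match(pattern, name)
}

func main() {
	dir, _ := matchEntry("build/", "build", true)
	file, _ := matchEntry("build/", "build", false)
	fmt.Println("directory named build:", dir)
	fmt.Println("file named build:", file)
}
```

The same name gives two different answers depending on the entry type, so any translation to a type-unaware pattern list either over-matches or under-matches.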
it feels like you should be able to, right?
yeah 😭
hot take: what if we ONLY did gitignore-style? more people know that than buildkit
My assumption from the very beginning has been that we will not get 100% compatibility between buildkit-style include/exclude, and git ignores. And should be using git's actual implementation to determine what is or isn't ignored
has my vote
but bleurgh, implementing git's pattern matching in go is not incredibly trivial (and i couldn't find anyone who'd done it - although maybe for this variant?)
if we decided that route, i would probably either:
- painstakingly copy git's wildmatch logic into go, as close a translation as possible
- bite the bullet and just use CGO
(one thing to note is that the git docs they reference are actually not what git does)
e.g. character classes are not mentioned! but they are supported
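For a sense of the gap, here is what Go's standard `path.Match` does with a few patterns. The Go behavior is documented in the standard library; the remarks about git in the comments reflect my understanding of wildmatch and are not taken from this thread:

```go
package main

import (
	"fmt"
	"path"
)

// stdlibMatch wraps path.Match and treats a malformed pattern as "no match".
func stdlibMatch(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	// Ranges inside brackets work in path.Match.
	fmt.Println(stdlibMatch("[a-c].txt", "b.txt")) // true

	// "*" never crosses "/" in path.Match, and "**" is just two stars,
	// so this does not match at depth the way a gitignore "**/" would.
	fmt.Println(stdlibMatch("**/foo", "a/b/foo")) // false

	// path.Match negates with "^" only; "!" is a literal member of the class.
	// As far as I know, git's wildmatch treats "[!a]" as a negation.
	fmt.Println(stdlibMatch("[!a].txt", "b.txt")) // false
}
```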
sorry, welcome to a special and unique hell that i spent ages in
perhaps we can elect to not care about character classes 
There is a git plumbing CLI tool
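Assuming the tool meant here is `git check-ignore` (the thread does not name it), shelling out could look roughly like this. With `-q` and a single path it exits 0 when the path is ignored, 1 when it is not, and 128 on a fatal error. `isIgnored` and `interpretExit` are illustrative names:

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// interpretExit maps a `git check-ignore -q` exit code to a result.
func interpretExit(code int) (bool, error) {
	switch code {
	case 0:
		return true, nil // path is ignored
	case 1:
		return false, nil // path is not ignored
	default:
		return false, fmt.Errorf("git check-ignore failed with exit code %d", code)
	}
}

// isIgnored asks git itself whether relPath is ignored inside repoDir.
func isIgnored(repoDir, relPath string) (bool, error) {
	cmd := exec.Command("git", "-C", repoDir, "check-ignore", "-q", "--", relPath)
	err := cmd.Run()
	if err == nil {
		return interpretExit(0)
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return interpretExit(exitErr.ExitCode())
	}
	return false, err // git missing, not executable, etc.
}

func main() {
	ignored, err := isIgnored(".", "node_modules/example/index.js")
	fmt.Println(ignored, err)
}
```

This gets git's real semantics for free, at the cost of needing a git binary and an actual repository on disk, plus one process per query unless paths are batched.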
Does anyone know why there is an explicit finding module configuration step hardcoded in the CLI when loading a module, before actually loading the module?
@still garnet I'm trying to wrap up CI refactor part 2, to prep for toolchains & checks... But later today, could I pick your brain live? I have a creeping fear that checks is using errors wrong... So far I haven't gotten a single useful error check message (because I use the error message, and that is also useless on its own - ie. the error might as well say "read the logs")
yea sure. good to chat whenever
Also wondering if we could converge "checks" and "status API"
--> For when you have a single function that is a check, but you want it to report several "sub-checks" in the telemetry
Guessing at something related - we have a whole Error type in our API that supports attaching data to it (ref), but it's pretty half-baked. For example you can't return one from an SDK - which seems like maybe what you want here?
Honestly not 100% sure what I want... I think what would be very useful, is if you, me and @tidal spire went through the whole UX of toolchains+checks on a real repo (either dagger/dagger or greetings-API) and discuss papercuts as they come up (errors being one of the papercuts that will definitely come up, but there are certainly others)
sounds fun
Code is getting simpler, but somehow things are slower? 😭
It gets even stranger when I zoom into one particular linting task
why is the sum 43s????
Ah I have to sync... Can't have a function be lazy and with nice looking custom spans
Corrected trace: https://dagger.cloud/dagger/traces/d1e85199a8b69d971e1dbaf33600166d#d4ef390eda958473
somehow getting slower and slower
it all traces back to module loading... being very slow.... incredibly slow....
seriously
Is it possible that PARC machines are slower today @wild zephyr @astral zealot ?
20 seconds to remove a mountpoint (just randomly sampled what was running)
still going
I canceled, here's the trace if anyone's morbidly curious https://dagger.cloud/dagger/traces/d1e85199a8b69d971e1dbaf33600166d?listen=6344791acefdf24d&listen=7c5930c08414c4d9&listen=4d894122ab70d394&listen=410102fa78228f13&listen=bf15c59ced22b234&listen=d4ef390eda958473&listen=ddb2286906809cd1&listen=4f60c847dc7cec85&listen=e5d35fd31f0f4b14&listen=4f195a1c013395a6#d4ef390eda958473
have you rebased on main lately? wonder if https://github.com/dagger/dagger/pull/11320 could help
yup I'm freshly rebased
haven't been using PARC a lot today myself to validate. I don't think we have any Namespace metric which tells us if machines might be slow for whatever reason. From our side, we haven't deployed anything which could be generating this particular slowness
Reference run, on a fast non-cloud machine, warm cache: https://dagger.cloud/dagger/traces/9c5e5a5ba864829c94031be1bd599932#f6b1c532570b604b
Actually, looking at traces from our current CI... it's just as slow. So this is "normal"
Not that I know of but I'll take a closer look!
I could very easily be imagining things, take with lots of salt, but I feel like I've noticed CI being fast at night and slower in the mornings. Wonder if noisy neighbors are a factor one way or another
Oh wait... my CI branch is freshly rebased... But I'm running it on a 0.19.3 engine 
And now for some variety... Has anyone seen this crash in the typescript codegen?
2025/10/29 21:27:20 INFO generating SDK library language=typescript
panic: template: pattern matches no files: `src/api.ts.gtpl`
goroutine 1 [running]:
text/template.Must(...)
/usr/lib/go/src/text/template/helper.go:26
github.com/dagger/dagger/cmd/codegen/generator/typescript/templates.New({0xc000673c80, 0x7}, {{0x7ffcd05ddca4, 0xa}, {0x7ffcd05ddcb2, 0x19}, {0x0, 0x0}, {0x0, 0x0}, ...})
/app/cmd/codegen/generator/typescript/templates/templates.go:30 +0x55a
github.com/dagger/dagger/cmd/codegen/generator/typescript.generate({{0x7ffcd05ddca4, 0xa}, {0x7ffcd05ddcb2, 0x19}, {0x0, 0x0}, {0x0, 0x0}, 0xc0002e92f0, 0x0, ...}, ...)
/app/cmd/codegen/generator/typescript/generator.go:65 +0x18e
github.com/dagger/dagger/cmd/codegen/generator/typescript.(*TypeScriptGenerator).GenerateLibrary(0xc000275950?, {0x4bb2a4?, 0x0?}, 0xc000275968?, {0xc000673c80?, 0xc000010f30?})
/app/cmd/codegen/generator/typescript/generator.go:35 +0x8a
main.Generate({0x1265b98, 0xc0002e8480}, {{0x7ffcd05ddca4, 0xa}, {0x7ffcd05ddcb2, 0x19}, {0x0, 0x0}, {0x0, 0x0}, ...}, ...)
/app/cmd/codegen/codegen.go:62 +0x1ca
main.GenerateLibrary(0xc000247000?, {0x101da6f?, 0x4?, 0x101d967?})
/app/cmd/codegen/generate_library.go:39 +0x319
github.com/spf13/cobra.(*Command).execute(0x19e7940, {0xc0000cb9c0, 0x4, 0x4})
/go/pkg/mod/github.com/spf13/cobra@v1.10.1/command.go:1015 +0xb02
github.com/spf13/cobra.(*Command).ExecuteC(0x19e84c0)
/go/pkg/mod/github.com/spf13/cobra@v1.10.1/command.go:1148 +0x465
github.com/spf13/cobra.(*Command).Execute(...)
/go/pkg/mod/github.com/spf13/cobra@v1.10.1/command.go:1071
main.main()
/app/cmd/codegen/main.go:33 +0x1a
haven't, but it looks like that path ended up being ignored or something?
Yes, going through my PR to figure out how/why.. probably a dumb typo
It looks like the codegen binary itself is wiping the source checkout?
ok nevermind. it was my dang module (which builds cmd/codegen) that had too conservative ignore filters. Which resulted in missing files from a go-embed (not caught at build) which in turn caused runtime error
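A guard against this class of failure could be a check that the embedded filesystem actually contains the expected templates, run in a unit test or at startup. A sketch only: `missingPatterns` is a hypothetical helper, and `fstest.MapFS` stands in for the real `embed.FS`:

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// missingPatterns returns every glob pattern that matches nothing in fsys.
// An embedded directory can build fine while being empty of templates if the
// files were filtered out before the build, so this has to be checked at
// test time or at startup.
func missingPatterns(fsys fs.FS, patterns ...string) ([]string, error) {
	var missing []string
	for _, p := range patterns {
		matches, err := fs.Glob(fsys, p)
		if err != nil {
			return nil, err
		}
		if len(matches) == 0 {
			missing = append(missing, p)
		}
	}
	return missing, nil
}

// demoFS simulates a source tree where ignore filters dropped the templates.
func demoFS() fs.FS {
	return fstest.MapFS{
		"src/readme.md": {Data: []byte("docs only, templates were filtered out")},
	}
}

func main() {
	missing, err := missingPatterns(demoFS(), "src/api.ts.gtpl", "src/*.md")
	fmt.Println(missing, err)
}
```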
your dang module or your dang module
@still garnet maybe too late for you? But I'm ready to give it a try
(I think we lost Kyle)
can chat for a bit
I am constantly failing on this gc test in CI: "Expected used bytes to decrease below 1073741824 from gc, got 4308118153" in https://github.com/dagger/dagger/pull/11071 (unable to repro locally)
Any idea on what could be the root cause ? Trace: https://dagger.cloud/dagger/traces/bc8d65b4838ca1c9c22fdbe302802932#f5b0b0622d98e9e8:EL143
FYI, notes from a great convo on Checks, Toolchains, and Native CI UX: https://github.com/dagger/dagger/pull/11211#issuecomment-3465410857
cc @tidal spire
Ok it was a flake. Ready for re-review @fair ermine
Dang supports enums now, if anyone ran into that yet! Dagger SDK, too: https://github.com/vito/dang/blob/main/mod/test-enum/main.dang
@still garnet @tidal spire hear me out: inline dang middleware instead of json arg overrides
It sounds interesting... What could it look like? Would it be unnecessarily confusing for someone unfamiliar with the API that just wants to set some argument values?
in this variation, instead of forbidding any wrapping of toolchains in code to avoid complexity: we would allow wrapping but only with special wrappers written in dang, that are so simple and lightweight that it barely feels like code. just enough to eg. inject values in certain arguments etc.
@civic yacht @still garnet @rocky plume @spark cedar Hey dagger internal experts
Does this error ring a bell for any of you?
Error: make request: input: packageDetection.containerEcho failed to loa
ailed to load runtime: select: failed to load runtime: failed to call sd
ime: select: load host.directory(path: "/work", include: ["./dagger.json
gitignore: true).directory(path: "src"): Directory!: load: load base: f
content hash: failed to get directory: failed to get snapshot: failed t
failed to receive stat message: rpc error: code = NotFound desc = get fu
: rpc error: code = NotFound desc = eval symlinks: lstat /work: no such
ctory
It happens on my TypeScript optimization refactor: https://github.com/dagger/dagger/pull/11309
I don't quite understand why exactly it's happening. It seems that when we are lazy loading the runtime, we are not able to get the modulesource context directory snapshot when multiple calls happen in parallel: https://dagger.cloud/Quartz/traces/79dd59c08a37e61eb9987cf4917a2afd?listen=d0b8fdcaf31b041f&listen=7cabdf90e2c2109e&listen=14877eea57aa17e7&listen=ffe1200d6fa02c8e&listen=0440542341b8a3f1&listen=0ca779a763fb52ca&listen=0f5713370ceace87&listen=52a9060445ed383b&listen=c150f77cf0751ead&listen=ff45543849790c7b
I'm quite confused on that error and I'm not sure how I can debug that, do you guys have ideas?
My guess is that the call to host.Directory in that lazy context isn't quite right but it's very weird, like some traces are missing when loading the runtime (probably because they are cached?)
i've plundered my old live-dev pr: https://github.com/dagger/dagger/pull/11332
this introduces a new type Watcher, and plumbs it into just a manual call. there is currently no service integration.
however, all the host<->engine refreshing is implemented - it shouldn't be too hard to implement a WithMountedWatcher or something for use in Services
this should be possible for someone else to pick up when i'm gone!
@civic yacht @leaden glade @obsidian rover
Hey guys!
My TS optimization PR is pretty much ready, I'm gonna publish some benchmarks tomorrow morning
I'll follow up later with optimizations to the client generation so I can fully clean up the TS SDK module and remove a lot of files
This PR contains a lot of improvements for the TypeScript SDK module support (ModuleRuntime / Codegen)
I may then split that into multiple PRs (or not), but this is a global PR so I can easily shar...
Was wondering why the finding module configuration step was so slow doing filesyncs and added some more spans. Turns out the filesync itself is essentially instantaneous and all the time is spent blocked on buildkit worker cache metadata operations that involve mutexes and writing to boltdb..... the create copy ref, finalize copy ref and release copy ref steps are the bottlenecks by far....
I think that subsystem needs to be the next target for theseus, it just keeps coming up every time I look into pretty much any perf issues cc @charred lotus
you're welcome, sorta
is there an issue for 1password prompting multiple times for multiple secrets? swore someone mentioned it recently. (will create if not, didn't find one)
Hm yes, tugged on this thread a little bit more (disabled sync in boltdb in every db across buildkit+containerd) and just ran the full TestModule suite on my laptop in 9m43s... https://dagger.cloud/dagger/traces/7bc02c44646b3db013cf5835d3ed9bef?listen=b6ee1b64784edaf5&listen=8748a72037aa0dea#b6ee1b64784edaf5
Before this that would take well over 30m locally 
This definitely seems worth pursuing
What the..
...and now scalars! Here's the trace that implemented it - it was a one-shot prompt, and it compacted twice, but it got there in the end. (I was playing Necesse and forgot it was running)
Has anyone else seen connectivity issues to docker hub today, causing CI checks to fail?
--> https://dagger.cloud/dagger/traces/46a616b0ce721319d244d663d7a55579#7d3534187e318842:EL41902
Yes
There have been a lot of jobs getting stuck at loadPackages in the go sdk today too, which also touches the network, which made me wonder if there's something going wrong there (also highly possible that's something else entirely though)
Mmm, and here it's not docker hub but our own registry... https://dagger.cloud/dagger/traces/30343f4a28785ea9b483d1c0004bf665#f3defb7b886ace2f:EL1
I wonder if it's related to the pull-through cache injected by Namespace?
Nope, I reproduced on my local machine (with and without PARC)
That's the only common denominator I can think of between 1) a pull by dagger on docker hub failing with a DNS resolve error, and 2) a pull by the docker CLI on our own registry, failing with a 500 http code
One of the issues seems related specifically to the php unit tests, but only when running dagger-in-dagger??
I ruled out PARC machines.. It happens on my local machine also.
flaky php unit tests
wow, really great work and ready to merge also...it's green
I am trying to ensure that the SDK runtime always gets rebuilt whenever I call a function (never cached in this debug mode), as part of a hidden debug field inside the dagger.json.
I can't seem to find where to bust this very specific cache (inside modulesource)... Does anyone have hints?
Atm, when I call a function and then call it again, the runtime is always cached. Running out of ideas
This comment here makes me believe that there's a caching layer involved that I don't understand 🤣
Any objections to disabling local checkout in our CI? Ie. go full dagger -m $GITHUB_THIS/$GITHUB_THAT
Stupid question: I can see a "runtime" implementation in core/schema/module.go but not a "moduleRuntime" implementation, and yet this piece of code (https://github.com/dagger/dagger/blob/712b3f6d7218f347f73ab25cbd2fbe12f787e0b1/core/sdk/module_runtime.go#L37) is calling "moduleRuntime". Where is the implementation ?
CI REFACTOR PART 2 IS GREEN
Would appreciate an optimistic-minded review so I can merge it today before it goes stale again...
Runtime is simply calling ModuleRuntime.
The runtime struct implements the runtime interface defined in SDK.
We also have code generation and client generator implementations in other files
The moduleRuntime implementation (the one called there) is the one implemented by each SDK
I'm missing the step where the string "moduleRuntime" is matched in core/schema
it's a function in an sdk module, e.g. https://github.com/sipsma/dagger/blob/712b3f6d7218f347f73ab25cbd2fbe12f787e0b1/sdk/python/runtime/main.go#L174-L174
so it's not in core at all
seems like dagger develop and dagger functions don't check anymore that the code in the module actually compiles. Is this part of the optimization we recently shipped with the typedef SDK split @leaden glade ?
mostly checking if we should do something else here, given that I was a bit surprised to see that after upgrading a module to a newer version of Dagger which had a breaking change (removal of container.Build), I ran dagger functions and everything seemed to be working fine. It wasn't until I ran dagger call that I was presented with the actual compilation error
Yes, it's due to the moduleTypes SDK split and more than that to the preparation of self calls.
With self calls, the module code can contain calls to the module itself through the generated code. But this generated code depends on the module code.
We need to know the types and functions exposed by the module to generate the code that the module will use.
This means the body of a function using self calls will never be able to build before the engine receives the types and the SDK uses them to generate the corresponding code.
As a result, the moduleTypes step (at least the implementations we have) only cares about the definitions/signatures of the exposed types and functions. Usually it still requires the code to be syntactically valid.
One of the positive aspects is performance, as it no longer requires a build and doesn't even require the dependencies to be fetched at that point.
Regarding dagger functions, I think that makes sense, since what we want is the list of exposed functions.
But maybe dagger develop or dagger update should go one step further and ensure the module can build. I don't know, but we can discuss/try that.
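To make the "signatures only, but syntactically valid" point concrete, here is a small sketch with Go's standard `go/parser`. It lists the exported methods of a file whose body calls a function that doesn't exist yet, which would fail type-checking but parses fine. This is not how the Go SDK's moduleTypes step is implemented; `exportedMethods` and the sample source are made up for illustration:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src is syntactically valid but would not compile: generatedSelfCall stands
// in for generated code that does not exist until the types are known.
const src = `package mymod

type MyMod struct{}

func (m *MyMod) Hello(name string) string {
	return generatedSelfCall(name)
}

func (m *MyMod) helper() {}
`

// exportedMethods parses source without type-checking and returns the names
// of exported methods, in declaration order.
func exportedMethods(source string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "main.go", source, parser.SkipObjectResolution)
	if err != nil {
		return nil, err // only syntax errors end up here
	}
	var names []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if ok && fn.Recv != nil && fn.Name.IsExported() {
			names = append(names, fn.Name.Name)
		}
	}
	return names, nil
}

func main() {
	names, err := exportedMethods(src)
	fmt.Println(names, err)
}
```

Catching real compilation errors would need a type-checking pass on top of this, and with self calls that pass can only run once the generated code exists.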
I do think we need a built in way to make sure a module (and its dependencies) compile. It necessarily need to be a side effect of some other task like many of us were using dagger functions for
Whatever we have done at the engine/SDK level, we can go back to the same behavior as before at the CLI level. Right now it changed because building is not necessary for the task, but we can change the CLI so that the UX is the same as previously.
Maybe only for dagger functions, that way we keep dagger develop less strict as it can help while working with self calls (and working with code that doesn't build yet)
I can propose something along those lines and open a PR if you think it's better to go back to this behaviour for functions
I like dagger functions being faster, so I'd be in favor of some new command
Maybe something like a dagger verify? (or a name more scoped to module) A command that will ensure dependencies can be fetched, that will build the module to ensure it works, etc. A bit like a call but a call of nothing. No files will be exported on the host (it's the goal of develop) and no function list will be printed (it's the goal of functions)
One main difference with functions is that functions can be used by users of the module, whereas develop/verify (or any other name) is really for the developer of the module.
We discussed this briefly with @fair ermine when he was preparing to merge lazy runtime loading. But I didn't realize the behavior was already present with the SDK interface split.
I worry about adding a random command that you just have to memorize. We already have dagger develop that is already like that... We could add it to develop but then it becomes even more of an "everything command"
For the go sdk, it might be possible to catch compilation errors during the typedefs step because the package we use to parse out the schema is also capable of reporting compilation errors (it's the same package go linters use). Not 100% sure, but may be worth a check if not done already
wouldn't dagger call --help basically work the same way dagger functions used to?
it should be doable yes, but it means having an extra step, again due to self calls. If the body of the function contains self calls, it first needs the generated code, so it first needs the types
Maybe that could be a flag on the develop command? So not a new command. And it's really a command to run in the development phase, so a develop --verify that ensures the module is valid could be nice
I don't know, even as a flag I'm still worried it makes develop even harder to unbundle
guys I need to merge this today https://github.com/dagger/dagger/pull/11262
Merged
Sorry in advance if anything breaks
HEADS UP dagger call dev is now dagger call playground
@still garnet re: pretty-printing checks logs. Should I implement my own custom idtui.Frontend, and hardcode using that in the checks subcommand? Or should I modify / hook into an existing Frontend?
the latter, preferably
Or maybe I don't need a full-blown Frontend, just need to implement a small part of what it does, and call that explicitly?
yeah, i'd start by seeing if you can just extend frontend_pretty.go
But to do that, I need a way for frontendPretty to do something different when called from dagger checks vs from other CLI commands, I'm not seeing an obvious way to do that in the Frontend interface.
My understanding is, the FinalRender does the special "show the logs at the end" logic, but it does it unconditionally
quick suggestion: add a DB.CollectChecks, very similar to CollectErrors, and use it in a very similar way, with whatever UI tweaks are appropriate
can hop on in a sec too
Ah I see, so keep calling the exact same UI code, but have it behave differently if checks data are available (indirectly detecting that dagger checks has been called)
yeah exactly
But how do I detect that a check has been called?
Look for the actual function call CheckGroup.run() or whatever?
default answer would be span attributes
you can look for Call information too yeah (it's available on each span in the DB), but it might be easier if there's something on the spans themselves. where are they created at the moment?
i'd just add dagger.io/ui.check=true or something
OK looks like I need to do that regardless. Looks relevant to our "checks + status API = ❤️" thread
On the frontend though: it still feels very indirect to modify the standard frontend code, instead of just calling a special render function from the checks command
yeah. related, I was wondering if one option could be returning Status (object) instead of a CheckStatus 
especially since I don't want to change the in-call TUI. Only what's printed after
you could add a Frontend method for it, sure
I may not even need Frontend at all, since really I only need to query the events DB, no TUI-related code at all.
I guess I need that DB
well, you just need to make sure you're not printing straight to stdout/stderr
maybe just printing to cmd.Out() is enough?
that'll write to the "primary span" and it's the output that'll be printed on exit
Why not? might break something?
It's the actual "useful" output of the command, eg. I would expect to be able to | grep etc
yeah, while the frontend is running if something else prints to os.Stdout/os.Stderr it'll just be garbled
try this, then
Ah I see, no matter how I integrate, I can't escape the Frontend, so need to play nice with it
yeah, it's downstream of having a TUI at all, really
@still garnet does each implementation of Frontend handle its own DB? Looks like I can't just get the DB from only the interface
Also what does DB.RowsView() do exactly?
Also another issue - I dont want dagger --format=.. to affect the output of dagger checks
yeah. i wouldn't be against just adding a getter for the DB if that helps a lot
Maybe I should bypass the dagui stuff altogether, and register my own otel collector?
sorry the names are really terrible - it just constructs an intermediary phase where you have all the data to display from whatever the zoomed scope is. the tree, basically, with convenience for looking up sub-trees by span ID
i don't think so, this is exactly the sort of thing dagui.DB is for
I am NOT going to cast the first stone on method naming
I guess if I register my own collector, I lose the local queryable span DB?
That's really the only part I need - a way to query spans for the current session. I don't need any of the actual UI stuff
(since I'm just going to print a bunch of logs anyway)
yeah, that's essentially what the current dagui.DB is for
a higher level representation of what we saw from otel
OK, so I guess I just need a way to access that DB without tight coupling to how it's live-rendered to the screen
re: this
claude is suggesting the getter approach it seems
the only real coupling between the DB and live-rendering is the DB assumes locking is handled externally (for perf reasons)
so MAYBE there should be a Frontend.WithDB(func(db *dagui.DB) { ... }) if that becomes an issue
side note - dagui.DB is also the code that's shared between the web UI and the TUI
yeah, could be something to keep in mind when deciding boundaries/interfaces
@still garnet does this look like a good starter prompt?
Subject: Prototype Request - Custom Log Display for dagger checks
Could you prototype adding custom log display to the dagger checks command? Here's the context and approach:
Problem: We want dagger checks to show the logs from each check execution after completion, in addition to the current table output. This should happen after the regular TUI finishes, so it doesn't interfere with --progress flags.
Current Architecture: The frontend (like prettyFrontend) acts as an OpenTelemetry exporter and accumulates all telemetry data in an internal fe.db field during execution. This database contains all spans, logs, and execution traces, but it's currently trapped inside the frontend implementations.
Proposed Solution:
- Extend the Frontend interface in dagql/idtui/frontend.go:
    type Frontend interface {
        // ... existing methods ...
        GetDB() *dagui.DB // Add this
    }
- Implement in frontendPretty (dagql/idtui/frontend_pretty.go):
    func (fe *frontendPretty) GetDB() *dagui.DB { return fe.db }
- Create a standalone LogRenderer utility that can query the database and render logs for specific spans (like checks).
- Modify runChecks() in cmd/dagger/checks.go: After the normal table output, call Frontend.GetDB() and use the LogRenderer to display logs from check spans.
Flow:
dagger checks → withEngine() → Frontend.Run() → [normal TUI] → Frontend.GetDB() → Custom log rendering
This gives us clean separation, doesn't break existing behavior, and provides full access to the telemetry data for custom formatting.
The key files to look at:
- cmd/dagger/checks.go - current implementation
- dagql/idtui/frontend.go - interface definition
- dagql/idtui/frontend_pretty.go - main frontend with the database
- dagql/dagui/db.go - database structure and query methods
(I'll have to implement GetDB() in the other frontends also)
sgtm, i THINK you might need something like this to handle locking though, since this will be running while the frontend is still technically running (assuming the write-to-cmd.Out()strategy)
What does handling locking imply in this context? GetDB() itself should be mutex-protected? (that doesn't even make sense) Or something smarter than that?
the DB doesn't do any locking on its own to handle concurrent reads/writes, so there could be a race condition on accessing internal maps etc
so you might want something that takes a callback, locks, calls fn with db, unlocks
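The "callback, lock, call, unlock" shape could look something like this. `DB` and `Frontend` here are toy stand-ins, not the real dagui.DB or idtui frontends, and where the real lock lives is an assumption:

```go
package main

import (
	"fmt"
	"sync"
)

// DB is a hypothetical stand-in for dagui.DB: it does no locking of its own.
type DB struct {
	spans map[string]string // span ID -> span name
}

// Frontend is a toy frontend that owns both the DB and the lock guarding it.
type Frontend struct {
	mu sync.Mutex
	db *DB
}

func NewFrontend() *Frontend {
	return &Frontend{db: &DB{spans: map[string]string{}}}
}

// record simulates the exporter side writing telemetry into the DB.
func (fe *Frontend) record(id, name string) {
	fe.mu.Lock()
	defer fe.mu.Unlock()
	fe.db.spans[id] = name
}

// WithDB locks, calls fn with the DB, and unlocks. Callers never hold a bare
// *DB outside the callback, which makes unsynchronized access harder.
func (fe *Frontend) WithDB(fn func(db *DB)) {
	fe.mu.Lock()
	defer fe.mu.Unlock()
	fn(fe.db)
}

func main() {
	fe := NewFrontend()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fe.record(fmt.Sprintf("span-%d", i), "check")
		}(i)
	}
	wg.Wait()
	fe.WithDB(func(db *DB) {
		fmt.Println("spans recorded:", len(db.spans))
	})
}
```

A plain getter would hand out the pointer and leave every caller to remember the lock; the callback form keeps that responsibility in one place.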
Ah I see. not give unlimited access to the DB
yeah, or at least make it harder
If it helps, I only need access to the db for post-execution render. Maybe that helps? I can pass a callback, but the implementation is simpler because that callback doesn't need to be called concurrently with live-rendering
Frontend.withPostRun(hook func(*DB))?
Or while we're at it:
Frontend.withPostRun(hook func(*DB, io.Writer))?
hmmmm there's still a chance, i think, with how otel is wired up, but feel free to punt until -race reveals something
oh yeah that'd be clean too
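A rough sketch of the post-run hook variant, with the hook receiving both the DB and a writer. Everything here is hypothetical (the exported `WithPostRun` name, the toy `DB`, the `Run` signature); the real Frontend interface looks different:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// DB is a hypothetical stand-in for the telemetry DB the frontend accumulates.
type DB struct {
	logs []string
}

// Frontend is a toy frontend that runs registered hooks after the main run.
type Frontend struct {
	db      *DB
	postRun []func(*DB, io.Writer)
}

// WithPostRun registers a hook that runs once the main run has finished, so
// it never competes with live rendering.
func (fe *Frontend) WithPostRun(hook func(*DB, io.Writer)) {
	fe.postRun = append(fe.postRun, hook)
}

// Run executes the command body, then every post-run hook in order, writing
// hook output to out (the stand-in for the command's "useful" output).
func (fe *Frontend) Run(out io.Writer, run func(db *DB) error) error {
	err := run(fe.db)
	for _, hook := range fe.postRun {
		hook(fe.db, out)
	}
	return err
}

// runDemo wires a log-printing hook and returns what it wrote.
func runDemo() string {
	fe := &Frontend{db: &DB{}}
	fe.WithPostRun(func(db *DB, w io.Writer) {
		for _, line := range db.logs {
			fmt.Fprintln(w, line)
		}
	})
	var sb strings.Builder
	_ = fe.Run(&sb, func(db *DB) error {
		db.logs = append(db.logs, "lint: ok")
		return nil
	})
	return sb.String()
}

func main() {
	fmt.Print(runDemo())
}
```

Passing the writer in means the hook never touches os.Stdout directly, which fits the earlier warning about garbled output while a frontend is running.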
I'll still need to implement it in all frontends, but I'm guessing pretty is by far the most complex?
(going through my list of questions so I can give this a shot autonomously before you log out)
- I think I can handle setting attributes somewhere in checks (famous last words)
- Not sure how to search for attributes in my post-run hook
I see a dagui.FindResource() that seems to be relevant?
Is it crazy that I want to use an ID from an object I built, and use that ID to get all collected spans emitted by that ID?
checks := mod.Checks()
checkSpans := db.SpansForID(checks.ID())
type SpanSnapshot struct is where all attributes are stored, either conveniently preprocessed into fields, or in ExtraAttributes if it's not known to the web UI
so you should be able to write something like CollectErrors but have it look for spans having a certain attribute, rather than spans that errored
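Something along these lines, as a sketch. The `Span` and `DB` types are simplified stand-ins rather than the real dagui types, `CollectChecks` does not exist yet, and `dagger.io/ui.check` is only the attribute name floated earlier in this thread:

```go
package main

import "fmt"

// Span is a hypothetical, simplified stand-in for a span snapshot: known
// attributes would be real fields, everything else lands in ExtraAttributes.
type Span struct {
	ID              string
	Name            string
	ExtraAttributes map[string]string
}

// DB is a toy stand-in for the span database.
type DB struct {
	Spans []Span
}

// checkAttr is the attribute name suggested in the thread, not an existing one.
const checkAttr = "dagger.io/ui.check"

// CollectChecks returns the spans marked as checks, in insertion order. It
// filters on an attribute where an error collector would filter on failure.
func (db *DB) CollectChecks() []Span {
	var checks []Span
	for _, span := range db.Spans {
		if span.ExtraAttributes[checkAttr] == "true" {
			checks = append(checks, span)
		}
	}
	return checks
}

func main() {
	db := &DB{Spans: []Span{
		{ID: "1", Name: "lint", ExtraAttributes: map[string]string{checkAttr: "true"}},
		{ID: "2", Name: "withExec"},
		{ID: "3", Name: "test", ExtraAttributes: map[string]string{checkAttr: "true"}},
	}}
	for _, span := range db.CollectChecks() {
		fmt.Println(span.ID, span.Name)
	}
}
```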
if you're wondering why SpanSnapshot exists - that's the subset of data that we're able to send over-the-wire from the web UI backend->frontend. in the web UI there's actually a dagui.DB on the backend loading everything from Clickhouse, and then sending only snapshots to a dagui.DB in the frontend, to keep required data transfer low
Reading through the general telemetry flow end-to-end, to get general awareness... wow the complexity of the otel stack.
yeah, otel is pretty dense
could be stockholm, but i can't say it's entirely redundant either
there's just a lot of things to consider 😬 (like batching rates)
On a branch that adds a Debug field to the SdkConfig type, I am having a dagger call go lint linter issue that I'm not sure how to properly fix.
It seems that this linter takes the engine version from the dagger.json (at sdk/python/runtime) instead of the local version to generate the clients. So, when it regenerates the client bindings, it doesn't find my Debug() client method (which is expected, since it's part of my PR)
It feels like a regression. I remember being able to extend the schema and not having such edge case (but maybe it's a new check that we added with this generic go lint function) 
I'm trying to understand the basic "connector" between Frontend and the telemetry stream.
- withEngine wraps everything inside Frontend.Run()
- then within that it calls Frontend.SetClient()
--> 🤔
Getting weird filesync error on very vanilla contextual dir upload
#team message, you can add this to workaround for now:
diff --git a/cmd/dagger/.dagger/main.go b/cmd/dagger/.dagger/main.go
index 9f68298c2..617f340e5 100644
--- a/cmd/dagger/.dagger/main.go
+++ b/cmd/dagger/.dagger/main.go
@@ -6,6 +6,7 @@ import (
"github.com/dagger/dagger/cmd/dagger/.dagger/internal/dagger"
)
+// +cache="session"
func New(
ctx context.Context,
Working on the fix here, hopefully can get merged + release tomorrow
(^ was meant to be reply)
Should we have an AGENTS.md / CLAUDE.md?
inside our codebase ? Yes, totally, ones that actually hint where to look per feature / what to know
@obsidian rover I see the word "deprecated" a lot in this diff... Does this seem related to your recent PR somehow? (not sure how)
https://github.com/dagger/dagger/pull/11232#issuecomment-3488477878
(look at the diff)
I am surprised by the TUI's time spent logic
the TestModule's time seems to rotate and is greater than its parent's time
something don't look right in the most recent release notes there @civic yacht
It's been a slog, but I got directory.File caching working under https://github.com/dagger/dagger/pull/11329
The PR itself is still really rough, and is littered with TODOs and printfs, but the (majority of) tests are passing -- and in one case TestTelemetry was updated for a case where a File op is now cached.
I still need to
- clean it up, and deal with linting issues,
- maybe even refactor it to use a file-equivalent version of maintainContentHashing,
- double check I haven't slowed things down (since there's more content hashing that's occurring now) -- so far it looks like some of the tests have been taking longer; however, I've seen a lot of network-related failures this afternoon.
- remove some hacks related to forcing host directories to copy
- figure out how to return container mounts as deps which then get passed to core.NewDirectoryDagOp
Either way, I wanted to share a brief update before I take off.
Question, how come so much of the frontend code is under dagql/? Superficially it seems like 2 unrelated layers?
purely historical reasons, it should be moved but that's been so low on the totem pole
i wish we had a pkg/ repo layout or something, part of the hesitation is just not wanting to put it in another toplevel dir
I also was looking for pkg/ but isn't util/ basically that?
Or does util/ imply slightly more coupling to our repo?
Like "technically you could import this, but probably shouldn't"
we're in true bikeshedding territory now, but for me util/ implies a grab-bag of packages that could live in a separate repo but we can't be bothered to maintain each of them, vs. pkg/ which to me is like "ALL of the core code lives here"
the role of pkg mainly being to distinguish from other not-even-Go-code directories like docs
Oh. Well at least in the docker repo (dataset of one) pkg/ basically meant the same as your util/
i'd say it's a superset
imo pkg/ also implies "don't import from here, well you can if you want, but no guarantees, if I wanted these things to be for external use you'd be importing from a separate smaller repo"
so it's like a util/ that you just also put all your main code into
that's just what i've landed on for monorepos / application repos, plus cmd/ at the root level too, just to distinguish package main from importable packages
does trivy not have a 'human readable output' mode? 39k lines of JSON seems a bit much
is that from the flag that says show us everything, not just the important stuff?
i guess we are saying --format=json --show-suppressed. y tho
No idea. I recently noticed that during the refactor, but assumed there was a reason so left those flags alone
I would just remove it
What's the worst that can happen
(as in the flag, not the vuln scanner)
Update: cleaned up the checks PR, squashed
There are remaining failed CI checks, where I could use some help
Also the weird otel attribute errors are still there, but that's not blocking for merge (adding to TODO, forgot earlier)
@tepid nova possible rationalization for pragmas: you might want something to be both a check and an artifact - with pragmas you can apply both, with return types you'd have to choose one
Yeah that was one of @tidal spire's arguments, I agree. We actually have that already, the Rust SDK has a "check that it builds" check
(somehow that was in Lint() but let's ignore that)
btw @still garnet I'm doing monolith cleanup, an interesting pattern is emerging for generated files, will show you when the PR is up.
TLDR the module changes its Workspace field in place, and can give you changes on demand. I have this pattern in 2 places now:
- PHP SDK (weird requirement to chain client generation then docs generation from the generated client)
- Go toolchain (generate dagger runtimes, then lint from that)
Kind of a "virtual context"
Also: need to carry a bunch of paths relative to the workspace. Can't just carry Source *dagger.Directory around - instead SourcePath string which is eg. sdk/php
@civic yacht [less of a nerd snipe and more fishing for a quick 'yep', can look further myself later] - is this making the mistake of calling a method (Directory.Without) that normally presumes it's called from within a DagOp? https://github.com/dagger/dagger/blob/0c0028796eb5252e373196dfa4964f07295ab546/core/directory.go#L1338
I added rm/mv tools to Doug but after it calls them any further operations on workspace fail with this error:
select: failed to compute cache key for Query.doug: load contextual arg "source": load contextual directory "/": select: failed to load contextual directory: failed to select env directory: select: failed to remove paths: unlinkat /var/lib/dagger/worker/cachemounts
curious to see!
As the complexity unravels, it's becoming more and more tempting to port big chunks of it to dang
Does anyone have ideas on how to systematically approach isolating the root cause of our ci_in_ci flake (which is quite consistent)? https://dagger.cloud/dagger/traces/235542b1ab326872993b6595a2637bc5
I am inside the container of one of the Python failures, and I get this whenever I try to introspect the schema or talk to the engine:
dagger.ClientConnectionError: Failed to establish client connection to the Dagger session: Failed to build schema from introspection query: get or init client: client "m18nhcf20secz91bs24z3u6xr" already exists with different secret token
Running out of ideas on how to isolate the bug
I'm taking today to do flake squashing cause quite a few have arisen, I'm on the TestContainer/Test.*Containerd ones right now but was gonna get to that one next
I'm not 100% sure what's going wrong, but I highly suspect it's some confusion around dagger-in-dagger-in-dagger
Yeah it's above my level atm I think
You repro'd locally and get that in a debug terminal?
That _contextDirectory span looks new?
Had to add as part of this fix: https://github.com/dagger/dagger/pull/11350, it's not in the public API hence the _ prefix, but still does show up in telemetry like that unfortunately
All good, it looks fine - just was wondering if it was something I did
Generally it's sometimes hard to mentally map a filesync to the line of code causing it
But maybe it's a me problem
Are self-calls available as an opt-in now? And are they still slower, or did all our recent speed improvements actually make them affordable?
No the spans show up in confusing places. Actually now that you mention it, I think it's because we have to load context dirs to compute the cache key, and the cache key computation happens outside of the actual call
Not sure if it's engine or infra/parc...
https://dagger.cloud/dagger/traces/662f0f4f1eba43ddb274d2685a88e833
It's happening again
trying without pARC
granted I've been messing with ignore filters in this branch... But this doesn't look like a misconfigured filter
"written bytes: 0" seems
(also really cool that we have a metric for that, if it is indeed a sign)
ooo I used my new power to connect trace->namespace instance and tracked that down, engine logs show:
2025-11-06 23:15:58.640 dagger panic: runtime error: invalid memory address or nil pointer dereference
2025-11-06 23:15:58.640 dagger [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x19454e2]
2025-11-06 23:15:58.640 dagger
2025-11-06 23:15:58.640 dagger goroutine 17436986 [running]:
2025-11-06 23:15:58.640 dagger github.com/dagger/dagger/engine/filesync.(*localFS).Sync.func5.1()
2025-11-06 23:15:58.640 dagger     /app/engine/filesync/localfs.go:331 +0x762
2025-11-06 23:15:58.640 dagger golang.org/x/sync/errgroup.(*Group).Go.func1()
2025-11-06 23:15:58.640 dagger     /go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:93 +0x50
2025-11-06 23:15:58.641 dagger created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 17434501
2025-11-06 23:15:58.641 dagger     /go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78 +0x95
panic from the new telemetry added for filesync stuff recently
cc @wild zephyr
I think it was supposed to be werr instead of err
Woot, yep
So if the filesync fails for whatever reason, the engine panics. Noice
interestingly I just saw in nvim that a linter named nilderef is complaining about that line, might be worth enabling that in our linter config if possible
I can run an expedited release first thing in my morning tomorrow
@tidal spire hey you mentioned a toolchain can't have toolchains. Would that be a big change to make?
Actually it's fine, I found another way
@tepid nova here's a separate PR for pragmas, in case you want to just pull it into your PR: https://github.com/shykes/dagger/pull/448 - done for Go, Python, and TypeScript - can look into more tomorrow (wanna see how far automating this can go)
tested with dagger check and it seemed to find and be able to run them all - (plus additional tests for TS/Python)
Thank you! And cool that you're building the automation at the same time
should be doable if we prioritize it
nah it can wait
@tidal spire I have a toolchain working in our monolith
Since 0260062 some tests are constantly failing. I thought it was on my PR but in fact it's on main.
I'm not seeing an obvious reason it should fail based on the diff of this commit. So I guess the error is somewhere else.
If anyone has an idea
dagger call test specific --pkg="./core/integration" --run="TestContainer/TestLoadHostContainerd"
@tidal spire FYI I am seeing my toolchains pop up as dependencies in the parent module's code. But honestly I'm glad it's there
Haha yeah it's not tested yet... We can figure out how it should work
@tidal spire 2 toolchains: Go SDK & PHP SDK
one of them in dang
My only papercut so far with toolchains: I want the correct module description in dagger functions and dagger toolchain list
https://tenor.com/view/on-the-top-coaster-force-dollywood-wild-eagle-ride-gif-19706690
Getting reaaallllly close to unraveling the whole monolith
v0.19.6 engine container has 2 HIGH findings (trivy)
https://github.com/dagger/dagger/pull/11376 fix, on the bright side I'm fairly sure this was a timing flake in the test that had always been possible, but only became an issue recently because of the perf improvements, which got rid of the multi-second hangs between operations
Seeing if this fixes the flake in TestContainer in CI. I can't confirm locally because there I get seemingly unrelated errors related to cgroup setup, possibly some kernel-related and/o...
https://github.com/dagger/dagger/pull/11377. It was ok this morning and seems to have popped up right after the release
It's supposed to be there... My dang one wasn't working but I thought it was a dang thing. Will check on that
@wild zephyr thanks for dealing with the releasing issues...
np, it was a team effort. Everyone was there to help
Are there follow-up issues that still need investigation - of the kind a mere mortal can do?
On the flakes etc
I was going to look into the weird changelog generation issue
that's fixed already. Fixed it as part of the releases improvements
not for the moment. Once Erik figures out the dagop thing, we should check how we are in terms of flakiness and go from there
but for the moment nothing that I can think of which has a high priority
Need help getting Checks merged... https://github.com/dagger/dagger/pull/11211
I think all the remaining failed checks are either 1) flakes or 2) already in main.
Could someone confirm, and maybe give me a quick review? The code is experimental, so the bar is lower for merging. We just need to make sure it doesn't break stable codepaths.
I keep getting this error on CI (like a looot) on https://github.com/dagger/dagger/pull/11366:
Error response from daemon: No such image: registry.dagger.io/engine:v0.19.6
connect
starting engine
create container
exec docker pull registry.dagger.io/engine:v0.19.6
failed to run command [docker pull registry.dagger.io/engine:v0.19.6]: exit status 1
Error response from daemon: Head "https://registry.dagger.io/v2/engine/manifests/v0.19.6": Get "https://registry.dagger.io/token?scope=repository%3Aengine%3Apull&service=ghcr.io": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
@wild zephyr Could it be a cached instance on namespace that isn't able to query registry.dagger.io/engine:v0.19.6 ? Seemed to work fine locally
when i retry, does it spin up a new instance with a clean cache ?
2 possible explanations:
- An issue with Namespace's pull-through registry cache
- An issue with our own pull-through registry cache
- seems fine (now)
Wouldn't 1 imply that namespace is mitm'ing our registry? Or do they apply some special dagger engine config that makes their pull-through registry cache a mirror for every registry?
Two things:
- It's possible that I'm misremembering and they don't provide a pull-through cache. I might be confusing with our own. @wild zephyr / @astral zealot will know
- Namespace definitely does inject custom docker credentials in our CI runners at the machine level. So if they do inject a mitm proxy, that's how they would make it work. Or maybe they only inject the credentials and not a pull-through cache - still useful as it avoids their IPs getting rate-limited
@obsidian rover @wild zephyr is this a known issue? https://dagger.cloud/dagger/traces/bb4ef524ee498b77071d3c7c57b68f83#caf3a34df721b9ff:EL2932
(getting it on checks PR)
yes, just figured it out and fix here: https://github.com/dagger/dagger/pull/11380
@civic yacht strange. I recall removing the sidecar engine altogether and tests were still failing for me
I can check really quick if you want
Yeah something about the tests does not work w/ nested execs that are all running in parallel. It works exactly as expected when they actually hit the dev engine they are supposed to hit though
I think part of the issue is that some of the tests purposely close clients and then run other tests in the same container, which is breaking nested execs I think
The Version thing is a little weird but I have an idea on why that might have failed the way it did, checking that out too
Either way, that's all stuff which is in deeply obscure territory, and the tests work when they are running as intended, so nothing super broken thankfully
I see.. it's still strange that they passed using a nested session when I reverted the dagop commit? you think that was a random thing??
Well they would pass sometimes w/out that commit, but not always (hence the frequent flakes we saw before). That commit made them just always fail, but only after the release because of the bug where they weren't hitting the dev engine (they were hitting the "outer" stable engine).
The reason that made them always fail is what I was referring to in
The Version thing is a little weird but I have an idea on why that might have failed the way it did, checking that out too
I have a theory I'm testing, but if it's real it's an extremely obscure problem that would only happen with a very tangled web of nested execs and function calls, for which our CI is the only real use case
yes exactly. Mostly curious why that dagop commit produced that strange Version message. In any case seems like you have it against the ropes. Thx for fixing the flake
merged the fix, can pick up in rebase
dogfooding in hard mode
Saw your comment on how you debugged it. I guess the difference is that I was trying to fix the root cause instead of trying to make the dagger-in-dagger-in-dagger not happen to unlock the pipeline
which I gave up on ahahahah
I guess what I do is:
- Repro
- Poke around to see if there's something plainly obvious right away
- If still lost, just go to the beginning of where everything starts (so in this case, here), trace through exactly what is supposed to happen and compare that against what actually happened, each step-by-step
- In this case, that just meant comparing the code and the trace. But to figure out what actually happened you may need println debugs, etc.
That's pretty much a foolproof algorithm I suppose, it's just that finding the exact point where "expected behavior" deviates from "actual behavior" can vary in effort required. e.g. if the deviation comes from the OS or compiler or something like that it may take a while
a random papercut, dunno who else is bothered by it: https://github.com/dagger/dagger/issues/11382
To clarify, this meant: "yes I have also been cut this way, many times"
Current status: spinning out a 4th SDK toolchain from the monolith (Typescript) cc @fair ermine @tidal spire
Done
almost done carving out toolchains/engine-dev. that's the largest remaining piece of the monolith
could use a review on https://github.com/dagger/dagger/pull/11370 - now has eval coverage
@still garnet I just noticed custom spans from util/parallel are no longer visible in TUI or cloud... super weird. No leads from git blame
(in 0.19.6)
will bisect
weird. got a trace link?
could check &debug
@fair ermine when you have time, suspicious looking error in typescript provisioning tests on main: https://github.com/dagger/dagger/actions/runs/19277594086/job/55121934382
error @trivago/prettier-plugin-sort-imports@6.0.0: The engine "node" is incompatible with this module. Expected version ">= 20". Got "18.20.8"
@tidal spire toolchains spun out so far:
- go sdk
- python sdk
- php sdk
- typescript sdk
- cli dev
- engine dev
- helm dev
- ci
- almost go (missing default arg)
Monolith is so small it can be ported to dang ๐
This is it! The last layer of the onion. Let's break up our CI monolith into cleanly decoupled toolchains.
Goals:
Faster CI
Simpler CI
A golden example of Dagger best practices
TODO
CI: ...
wow!! how do you feel about it so far? have you been able to try out the DX or mostly just working on splitting things out still?
I have a blocker on toolchain config only allowing string values. Otherwise it feels great.
I am embracing project-specific toolchains btw. IMO it's an improvement over littering random subdirs with .dagger.
ambiguity of contextual dirs will be an issue. but solvable
so far monolith did not get faster to load... if anything it's slightly slower. But we haven't harvested all the potential gains yet, in particular we're still paying the price for most SDKs not supporting lazy loading yet. Also the field is wide open to port most of these modules to dang (I ported 2 so far).
I think in another week this thing will fly
Interesting I didn't know configs only worked for strings. I thought a test used a file but I must be thinking about something else
The dang parts look sweet. Were those ported via claude?
asking because I saw the go-sdk one had
pub sourcePath: String! = "sdk/go"
and then
pub name: String! {
"go"
}
yeah go-sdk I used claude. typescript-sdk I did by hand (refactored straight to dang instead of 2-step refactor->dang)
nice, that's cool that claude is able to do that fairly easily. Feel like it's easier for claude to write dang than dagger go sdk
It wasn't perfect on the first try but it was pretty good
I had it write a manual
I didn't do any correction, so whatever's in there is what Claude understood after reading the dang repo & 1 example https://github.com/shykes/dagger/blob/0629780352ec41cf7833d97fe61fda3e08f62c10/docs/dang/go-to-dang.md
is this pretty-printing for core APIs worth it?
(it's all client-side based on call structure)
Above is 100% tightly coupled, just frontend detecting certain fields and rendering them specially. Goes back to normal when you bump the verbosity. Maybe it could be applied to everything with a heuristic: <funcName> <reqArg1> <reqArg2> (opt1: ..., opt2: ...)
TODO: pretty-print addresses, which is a separate kind of beast that wouldn't fit that heuristic
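To make the heuristic concrete, here's a small sketch of `<funcName> <reqArg1> <reqArg2> (opt1: ..., opt2: ...)`. This is not the frontend's actual rendering code; the Arg type and formatCall are made up for illustration, and `opt1` is just a placeholder optional argument.

```go
package main

import (
	"fmt"
	"strings"
)

// Arg is a hypothetical, simplified view of one call argument.
type Arg struct {
	Name     string
	Value    string // already rendered as text
	Required bool
}

// formatCall shows required arguments positionally after the function
// name and groups optional arguments by name inside parentheses.
func formatCall(funcName string, args []Arg) string {
	parts := []string{funcName}
	var opts []string
	for _, a := range args {
		if a.Required {
			parts = append(parts, a.Value)
		} else {
			opts = append(opts, a.Name+": "+a.Value)
		}
	}
	out := strings.Join(parts, " ")
	if len(opts) > 0 {
		out += " (" + strings.Join(opts, ", ") + ")"
	}
	return out
}

func main() {
	fmt.Println(formatCall("withExec", []Arg{
		{Name: "args", Value: "go test ./...", Required: true},
		{Name: "opt1", Value: "x"}, // placeholder optional argument
	}))
}
```

The parenthesized group is dropped entirely when no optional arguments were passed, which keeps the common case short.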
you mean the part with flattened logs & bracketed prefixes? Or the bottom part with regular trace view?
the bottom part, specifically the withExec ... part, which would normally say .withExec(args: ["..."]). but welcoming feedback on anything
just looking for angles to reduce noise, noticed for bracketed logs you opted to semantically inspect the call, so just seeing what else we can buy with that approach
side note: the check path glob syntax is growing on me. only issue i think is you have to quote it, but you have that for regexes too anyway. (example real use: dagger-dev checks 'test/{lang,tree}*')
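For reference, a pattern like 'test/{lang,tree}*' can be handled by expanding the braces first and then doing ordinary glob matching. The sketch below is only a guess at the semantics, not how the checks command is actually implemented: it handles non-nested brace groups, relies on Go's path.Match (where * does not cross a "/"), and the check names in main are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// expandBraces expands the first {a,b,...} group and recurses on the
// result, so several groups multiply out. Nested groups are not supported.
func expandBraces(pattern string) []string {
	open := strings.Index(pattern, "{")
	if open < 0 {
		return []string{pattern}
	}
	end := strings.Index(pattern[open:], "}")
	if end < 0 {
		return []string{pattern} // unbalanced brace: treat literally
	}
	end += open
	var out []string
	for _, alt := range strings.Split(pattern[open+1:end], ",") {
		out = append(out, expandBraces(pattern[:open]+alt+pattern[end+1:])...)
	}
	return out
}

// matchCheck reports whether name matches any expansion of pattern.
func matchCheck(pattern, name string) bool {
	for _, p := range expandBraces(pattern) {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"test/lang-go", "test/tree", "test/lint"} {
		fmt.Println(name, matchCheck("test/{lang,tree}*", name))
	}
}
```

The quoting in the shell is needed either way, since both the braces and the star would otherwise be interpreted before the CLI sees them.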
@still garnet btw I started adding +check pragmas in that PR, if you want to try it out
I like the lightweight span. Yes I agree it helps.
But really I prefer to see only the logs under the checks by default. If there's a key to toggle from this "log view" to a "trace view" which would show the trace and not the logs, and I could easily and rapidly switch back and forth, I think that would be a superior UX IMO
Same as always, my brain struggles with logs & traces woven together in the same view. I think it's because it breaks my understanding of the "time arrow". They're like 2 different visual languages, using the same screen layout to express different things, and the result is garbled (for me at least)
that's what it does, i just toggled it open since I was asking about the stuff inside it
I know I'm a broken record and it's easier said than done
I will have to play with it, rather than imagine.
Visually, both parts look good and feel information-dense
(to be clear I was saying: when a check is expanded, it should either show only its flattened logs, or only its span tree; but not both one after the other like in that screenshot)
ah got it, will look into it!
I think the gains of doing this would grow linearly with the number of checks and their complexity
Trying this out: https://asciinema.org/a/OJc3Nrk1KywmOXPDYhC1hTSmY
@Vasek - Tom C. when you have time,
dang / php caused me some pains when i cloned on Windows, the famous CRLF / LF strikes me again. ran into issues running the dagger functions from the main dagger repo because of this, caught me off guard. not sure how best to resolve that really, but .gitattributes on certain files might help, somehow, windows clones -> files loaded into dagger aren't dos2unix'ified and then blurhghghghg happens. This happened when i was playing around with the dagger sdk builds/tests when making a PR for the dotnet update on the experimental dagger dotnet module.
Any ideas on a smart way to solve this? .gitattributes on the repo, or CRLF conversion at a dagger level when copying files or... hmm
I am tempted to clone dagger into WSL and see if i have the same problems but i have a feeling i wouldnt reproduce the same issues I had.
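For the "convert at copy time" option, the conversion itself is tiny. A minimal sketch, assuming it would be applied to text files only (the function name is made up, and nothing here is an existing dagger feature):

```go
package main

import (
	"bytes"
	"fmt"
)

// normalizeLF converts Windows CRLF line endings to LF.
// Lone CR bytes are left untouched.
func normalizeLF(data []byte) []byte {
	return bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n"))
}

func main() {
	script := []byte("#!/bin/sh\r\necho hi\r\n")
	fmt.Printf("%q\n", normalizeLF(script))
}
```

The hard part is deciding which files count as text, which is what the repo-level option delegates to git: a .gitattributes rule such as `*.sh text eol=lf` makes git check those files out with LF even on Windows.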
how was dang involved? are you trying it out or something?
also curious if you have any traces handy so i can see what happened. can totally imagine Dang doesn't parse CRLF properly atm
@tepid nova quick q: should the top-level 'rolled up logs' spans keep showing their logs if they fail, or collapse when they're done no matter what?
going with collapse-by-default for now, feels tidier and you get the root cause logs at the end anyhow. in -E you can just toggle it open
I'll have to play with it directly to be sure I understand ๐ But seems reasonable
btw my preference is to not get that big ERRORS section at the end at all
- I already know there's an error because the check says "ERROR" at the top.
- I already have the logs (rendered in the usual way)
- I already have the tree span with each span already having a clear error indicator
--> Normally, I should have everything I need from the regular render, without needing to append an ERRORS section at the end which duplicates info
My initial approach for post-execution print was:
- Always show the list of all checks, with red/green
- For failed checks, show all the logs in context
- I didn't address sub-spans, so not sure how that factors in
- No separate "errors" section. Just the above
In the context of dagger checks, IMO the whole output is the error context. There shouldn't be a difference
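A rough sketch of that post-execution print, to make the proposal concrete: every check gets a status line, and logs appear only under the failed ones, with no separate errors section. The CheckResult type and the "go/test" check name are hypothetical, and sub-spans are left out just like in the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckResult is a hypothetical record of one finished check.
type CheckResult struct {
	Name   string
	Passed bool
	Logs   []string
}

// renderSummary lists every check with its status and prints logs only
// under the checks that failed.
func renderSummary(results []CheckResult) string {
	var b strings.Builder
	for _, r := range results {
		status := "PASS"
		if !r.Passed {
			status = "FAIL"
		}
		fmt.Fprintf(&b, "%s %s\n", status, r.Name)
		if r.Passed {
			continue // passing checks stay collapsed
		}
		for _, line := range r.Logs {
			fmt.Fprintf(&b, "    %s\n", line)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderSummary([]CheckResult{
		{Name: "go/lint", Passed: true, Logs: []string{"ok"}},
		{Name: "go/test", Passed: false, Logs: []string{"--- FAIL: TestFoo"}},
	}))
}
```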
well, that goes back to the original question then - if it auto-collapses, you won't have the logs, and if it auto-expands, you'll have possibly way more logs than what you actually want dumped into your scrollback, because that's all the logs for the entire check. so that's the point of giving you the root cause logs - it's much more focused
for Dang where I'm dogfooding it I definitely wouldn't want all the rolled-up logs in my scrollback, because that includes noisy codegen stuff too
I think as a first approximation, it's OK to get all the logs for the failed check. It might be more text than I need, but at least it's presented in a simple overall structure. We have options for filtering that further
What about the subspan-specific logs?
Only show the logs of failed subspans?
The part that bothers me about the ERRORS section is repeating the whole structure twice
wouldn't that be strictly worse / more repetitive than just showing the root cause, considering we have that exact info anyway?
Want to chat live real quick? (or I can do later if you're in flow)
yea sure
no just when calling dagger functions dagger goes bang. due to what i think is windows checking out the code with CRLF and then files i assume being transferred back into dagger engine, theyre invalid and then cause things to go pew pew
': No such file or directory
@tepid nova progress! feels like something more should be done to tidy it up, but not sure what exactly 
@civic yacht weird looking trace... Loading a toolchain takes 19s but nothing seems to happen in the span?? https://dagger.cloud/dagger/traces/04b70f3074463bd1527244e502dca827?span=c6fb0286aae2a2fe
let's dogfood it, I'm sure the answer will come quickly
merge ready?
can get it ready soon, just need to regen telemetry tests once the dust settles
will ping when ready
oh, need to double check the codegen logs fix too
any disagreement on adding dagger check as an alias for dagger checks?
@still garnet is there a standard way for the CLI to say "telemetry for this context should be hidden by default"?
telemetry.Internal()
example: dagger checks -l that queries the list of checks; or dagger toolchain list that queries info about each toolchain
oh i see
At the moment I use the "tuck away" strategy in checks -l
you can create a new span and zoom the TUI in to it
that's what i did for dagger checks, and dagger prompt/shell mode
Ah, so the reverse of tuck away
lol yes
How do I zoom?
there's a bit of a dance to it:
ctx, shellSpan := Tracer().Start(ctx, "checks", telemetry.Passthrough())
defer telemetry.End(shellSpan, func() error { return nil })
Frontend.SetPrimary(dagui.SpanID{SpanID: shellSpan.SpanContext().SpanID()})
slog.SetDefault(slog.SpanLogger(ctx, InstrumentationLibrary))
the last line just wires slog up so the user will still see it, since the TUI shows logs for the zoomed span at the bottom. mostly for debugging
SetPrimary is the key part. probably overdue for some refactoring now that we want it in so many spots
one downside is that if something does fail outside it'll be out of view, i think. may want to test that, can figure out how to address if needed
@still garnet in this pattern, I should keep the old context, to make queries that I don't want visible right?
yep
the alternative is to just make a telemetry.Internal() span and tuck everything in there, but that won't clear progress up until that point
btw this is another boundary of the "otel-driven UI" topic we keep discussing. In dagger toolchains list I don't actually have any spans to display. Just a few "printfs to tabwriter". But in the future, maybe we want even that display code to be owned by the engine? Via a table UI component perhaps?
anyone seen this weird error? https://dagger.cloud/dagger/traces/3f5d6ea3844d71cdf4b8b2c0d0025166
[...] failed to get type defs json during module sdk codegen: select: blob sha256:7ed8bf4612e08c914cf4649cb94180b12fe87078898c14fdf6f252268a5d14bd not found in cache
that would be interesting yeah 
i've also thought about making the 'zoom' mechanism able to be driven by attributes, but it feels a little OP
(kinda like the old .Focus() API, which i don't even remember how exactly that worked anymore, that was pre-otel i think?)
@still garnet if I don't use that inner "display context", does it never get created? eg. is it ok to do _, shellSpan := Tracer().Start(ctx, "checks", telemetry.Passthrough())
not entirely sure what you mean, but if your goal is literally to just clear the screen, you can discard the ctx yeah
it'll clear the screen as soon as you create the span
@tepid nova ah ha - lazy loading means sometimes we don't actually run codegen until the check function itself runs if it calls another module, so those logs slip through. need to guard against that too (easy fix, but was confusing for a bit)
@civic yacht ack - i'm rebased on main and just hit SQLITE_BUSY right after a ./hack/dev, maybe something got worse with those tweaks? never saw this before
full engine logs
just noticed this too, ever figure anything out about it?
I did a git bisect session, but it led to a nonsensical answer (bad commit was supposedly a "post-release version bump" commit)
oh huh, it's getting a noopTracerProvider
ack, figured it out. it's kind of annoying
it's doing trace.SpanFromContext(ctx).TracerProvider(), which is a pretty common thing to do, but in this case it backfires because we don't have an actual span, it's just a container exec inheriting a span context, so SpanFromContext(ctx) returns a noop span
maybe this did a otel.GetTracerProvider() before and it worked?
Looks like it was otel.Tracer().Start() before my last CI refactor.
If anyone on EU timezone is around, I could use a review. This fixes issues with the toolchain feature, surfaced during dogfooding
@still garnet I tried to revert this but after some superficial testing (it was late) it didn't seem to solve the issue. Any luck on your end?
this fixed it for me:
func (job Job) startSpan(ctx context.Context) (context.Context, trace.Span) {
attr := job.attributes
attr = append(attr, attribute.Bool("dagger.io/ui.reveal", true))
return otel.Tracer("dagger.io/util/parallel").
Start(ctx, job.Name, trace.WithAttributes(attr...))
}
but, I think that'll break its engine-side usage since it won't be sending to the per-client telemetry DB anymore
i'm not using it engine-side for checks anymore, so haven't run into that
Ah... So you're saying it either works fine client-side, or it works fine engine-side, but not both?
Just to confirm - the current implementation in main does work engine-side without issues?
yeah, using SpanFromContext is what you want engine-side
we can make it work in both by having the go SDK create a passthrough span in the entrypoint. might be a good idea just to maximize compatibility anyway, SpanFromContext is a somewhat common practice
SpanFromContext is a somewhat common practice
What does that do? I'm a bit lost
FYI on a fresh checkout of your branch, dagger checks go/lint does not show the custom spans emitted from within the module. Is your fix currently applied in that branch?
Repro:
dagger -m github.com/vito/dagger@checks-pragma call \
playground \
with-directory --path=. --source=https://github.com/shykes/dagger#ci-faster-load: \
with-workdir --path=dagger \
terminal
in OTel there are two ways to get a TracerProvider (the thingy that lets you create [the thing that lets you create] spans)
1. otel.GetTracerProvider() gets the singleton globally configured TracerProvider, if there is one
2. trace.SpanFromContext(ctx).TracerProvider() gets the TracerProvider that created the current span, if there is one
did it find one? you don't know! the TracerProvider you get back will just silently be a nopProvider sending your telemetry into a black hole, so you can't even try one-and-then-the-other
(and otel.Tracer(...) is shorthand for otel.GetTracerProvider().Tracer(...))
IMO #2 is always better since it's more local to the code path, in the engine for example this is paramount because we have per-client TracerProviders that write to each client's separate SQLite DB, whereas ostensibly otel.GetTracerProvider() would be more for engine-side system-level traces, like periodic GC or whatever - things that are outside of any individual client
OK I see. So what prevents us from using SpanFromContext() is that connect() does not immediately send a top-level span, so there is no span to retrieve from the context, so we get a nopProvider?
yep - it specifically won't work inside a module function call if it's right on the span boundary, because there's no outer span created yet. The actual trace.Span live object is on the outside - we propagate it in as traceparent but SpanFromContext won't have access to the trace.Span, you'll only have SpanContextFromContext (the propagation metadata, from traceparent, propagated through env vars)
can repro, weird though because I just saw it work locally for scripts/test. investigating
It could be your latest rebase on merge... I just merged tweaks to util/parallel
oh, wait, i think this is because it's using code from your ci-faster-load branch, which has it broken still right?
But, normally those tweaks should not change the default behavior
That part of the code has not changed in the CI module, it's just a regular parallel.New().WithJob()..
Oooh wait it's the module linking against a different version of parallel...
the brain... it hurts...
I'm bookmarking this because it's legitimately hilarious how complicated otel is. Out of context this sentence doesn't just sound complex, it sounds clinically insane
Yeah, though to its credit we're squeezing every ounce of that complexity to support our complex use case, and it's had a knob for everything we've needed to turn lol
</stockholm>
lazy fix: don't use it engine-side (current state of my branch) and just swap it back to the global tracer
general fix: have the Go SDK create a passthrough span so trace.SpanFromContext(ctx) returns a proper trace.Span that the user doesn't see, but, theoretically this is a problem for every SDK too
Couldn't we insert it in the CLI helper since that's a gate for all SDKs? eg. dagger session
i don't think so - the problem is local to each language's OTel SDK, so we can't solve it in one place
maybe the real fix is to just finish the statuses PR
How would that fix it? Isn't it the same otel plumbing underneath?
yeah but it'd be all in the engine. the SDKs wouldn't need to touch the OTel plumbing anymore
OK I will start with a super short term, yet clean-ish, fix: make the behavior configurable in parallel
(except for OTel auto-integration)
@still garnet how does this feel:
// Default: don't use the "contextual tracer". Works client-side
jobs := parallel.New() // implicit
jobs := parallel.New().WithContextualTracer(false) // explicit
// Custom: enable the "contextual tracer". Works engine-side
jobs := parallel.New().WithContextualTracer(true)
Once we've done the "deep fix", I can flip to make contextual tracer the default. Nothing should break
lgtm
@still garnet maybe I'm misunderstanding, but wouldn't this whole problem go away if otel had a trace.TracerProviderFromContext(), that worked whether or not the context had emitted a span? It's not like the context doesn't have a tracer provider configured... It's just not passed through for an arbitrary API design reason. Or am I missing something ?
otel doesn't put tracer providers in context.Context, only spans (which let you get the tracer provider that created them), and span contexts (which are approximate to header data)
i guess that's a pretty circular answer, sorry distracted - if it had that API it'd be more convenient, yes. I added APIs like that for the otel logging + metrics - see telemetry.WithLoggerProvider and telemetry.WithMeterProvider - but that was initially only because those subsystems don't have a globally configured one (perhaps they're moving away from that pattern?)
No worries, it was just idle curiosity + testing my rough understanding of the stack
i guess we could add a WithTracerProvider but it wouldn't buy us much, since that'd be just as Go SDK -specific as creating a passthrough span
(testing my fix now)
I think I'm going to port the monolith to dang
Oh wait no I can't - parallel
Did you have those gifs lined up?
lucky searches lol
gifs/reactions/whats-his-face/
welp it appears my fix did not work
but it should...
oh wait maybe I didn't wait long enough
ah that was it. it's working!
@still garnet ugly error alert
these errors are created by the graphql client the go sdk uses, i think
we clean up the error returned by the function, but don't have that ability for errors that are directly stamped onto custom spans
we should maybe just change the Go SDK to clean all that up - alternatively, this is another thing that could be handled by the statuses PR
or, for now, your parallel package
i can take a swing at that on my branch, to see what it could look like
Breaking down my understanding of issues in this first screenshot:
- Error in sub-check is super verbose (and in my case I don't need it at all, exit code 1 is not useful info for me)
- Logs for all sub-checks are mixed at the top-level (I think), so I don't know what went wrong in this particular sub-check. (each sub-check executes its own golangci-lint process)
- Normally golangci-lint is configured to prefix its error messages with the full path of the file. This would make it easier to use the combined logs. But for some reason it's not working here, perhaps a bug in my module
--> looks like golangci-lint is indeed getting the correct --path-prefix <foo> but choosing to ignore it..
- If you don't need that error, seems solveable at application layer (just return higher level error instead of bubbling up Sync error)
- Maybe we should add context to the line prefixes? e.g. [.dagger/golangci-lint]
- Looks like it is, but maybe they're just relative, and they all run in separate containers anyway (edit: oh)
- If you don't need that error, seems solveable at application layer (just return higher level error instead of bubbling up Sync error)
I might do that, but it does make my module code more complicated, and I have to do it in a lot of places...
- Maybe we should add context to the line prefixes? e.g. [.dagger/golangci-lint]
(In this case it would be hard to do with a heuristic, it would have to be controllable by the module dev somehow. mmm maybe we just use the name of the nearest subcheck name as the prefix, instead of a heuristic? That way the names in the brackets would match the names in the span tree)
Yesterday I think we settled on "don't expand the subcheck logs by default". But now I'm having second thoughts... In my case the top-level check doesn't actually stream any logs of its own... Each sub-check has its own withExec. So for this particular case distributing logs in the subchecks would be strictly better.
But, I know go tests are different, because they share one big withExec.
Is there a way to differentiate those 2 situations somehow?
you could add more attributes to each of those spans to roll-up to them instead of the parent. basically set the boundaries yourself
there's a new Boundary attribute I just added for that sort of thing
i think you'd want RollUpLogs + Boundary on each of them, and RollUpSpans if you also want the fancy dots on each sub-span
Can you explain RollUpLogs and Boundary? Honestly all those attributes (eg. Internal, Reveal) are like random buttons I try to push in different combinations until it looks ok. I don't feel in control of the model at all.
(not a judgement on the buttons, just raw user feedback)
i want to make all those attributes easily accessible in parallel, so it's easier to try them in the first place
RollUpLogs <- when applied to a span, logs from descendant spans will be rolled up to this span's logs in the UI, with prefixes
Boundary <- prevents logs from rolling-up past this span
so in this case I think you want both - RollUpLogs to set each sub-span as the new roll-up target, and Boundary to prevent them from rolling up to the parent (this is also how codegen logs are hidden btw)
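(A rough model of the two attributes as described above, to make the interaction concrete. The Span struct and logTargets function are hypothetical stand-ins and not the actual UI code; the sketch assumes a log line shows up on the span that emitted it plus every RollUpLogs ancestor reached before a Boundary.)

```go
package main

import "fmt"

// Span is a hypothetical stand-in for a span in the trace tree.
type Span struct {
	Name       string
	Parent     *Span
	RollUpLogs bool // descendant logs are rolled up into this span's logs
	Boundary   bool // logs do not roll up past this span
}

// logTargets lists the spans that would display a log line emitted by s:
// s itself, then each RollUpLogs ancestor, stopping once a Boundary is reached.
func logTargets(s *Span) []string {
	targets := []string{s.Name}
	for cur := s; cur != nil; cur = cur.Parent {
		if cur != s && cur.RollUpLogs {
			targets = append(targets, cur.Name)
		}
		if cur.Boundary {
			break
		}
	}
	return targets
}

func main() {
	check := &Span{Name: "go/lint", RollUpLogs: true}
	sub := &Span{Name: "sub-check", Parent: check, RollUpLogs: true, Boundary: true}
	exec := &Span{Name: "withExec", Parent: sub}

	// With Boundary on the sub-check, logs stop there.
	fmt.Println(logTargets(exec))

	// Without Boundary, the same logs also reach the parent check.
	sub.Boundary = false
	fmt.Println(logTargets(exec))
}
```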
So Boundary is a protection in case your parent has set RollUpLogs?
found a fix for the input: foo.bar.baz.buzz: ... crazy errors. it's kinda silly. only happens for exec errors
still long, but less so
My issues with this 2nd screenshot:
- There are too many errors in the picture
- The errors get increasingly more verbose and less useful as they bubble up
--> Solution: only print the error in the leaf?
Second screenshot looks to me like we don't do any error origin tracking for self-calls (like within the engine, dagql.Select), blind guess but if true it could mean there's one spot where we can fix it
@tepid nova check-generated is failing consistently on main and PRs after "dagger toolchain list": faster, less telemetry noise
https://dagger.cloud/dagger/traces/d8ee12b1e6eef4af5348c0c47eba878a
But all the checks passed in the PR???
possibly an issue with not rebasing on head of main before merge? that can matter in cases where generated files are updated both in a PR and in main
that's all I can think of
np, easy to miss and easy fix
What about Reveal?
Reveal means "bubble me up through parent spans so the user sees it" and corresponds to the "Hiding noisy spans" toggle in the web UI. Boundary keeps that contained now, too - we had some tests specifically testing Reveal behavior and it was getting pretty annoying seeing the revealed spans dominate the whole test output, so now that's fixed. I suspect Reveal might be retired at some point in the future though, feel like we're chipping away at the use case, by e.g. adding more semantic flags like 'check'
In the TUI, how is reveal implemented? Not sure what "bubble me up through parent spans" means:
1. Does it mean "by default expand the path down to the revealed span"?
2. Or, does it mean "always show this span regardless of verbosity, if its path happens to be expanded"?
If it's the 2nd, isn't that the default behavior? If a span exists, you see it if you navigate to its place in the tree?
It means "promote this span all the way up as if it were a direct child of either 1. a top-level call, or 2. the nearest Revealed ancestor" - so it skips over intermediate non-Revealed spans
and also auto-expand parents
it's how e.g. when you run our tests, you don't see the withExec or any outer stuff, you just see DaggerDevTest.all > TestContainer | TestDirectory | ...
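(The promotion rule above can be sketched roughly like this. Span and displayParent are hypothetical names for illustration, not the TUI's real data model, and auto-expansion of parents is left out.)

```go
package main

import "fmt"

// Span is a hypothetical stand-in for a span in the TUI tree.
type Span struct {
	Name   string
	Parent *Span
	Reveal bool
}

// displayParent returns the span under which s would be rendered.
// A revealed span skips intermediate non-revealed ancestors and attaches to
// the nearest revealed ancestor, or to the top-level call if there is none.
func displayParent(s *Span) *Span {
	if !s.Reveal || s.Parent == nil {
		return s.Parent
	}
	p := s.Parent
	for p.Parent != nil && !p.Reveal {
		p = p.Parent
	}
	return p
}

func main() {
	all := &Span{Name: "DaggerDevTest.all"}
	exec := &Span{Name: "withExec", Parent: all}
	test := &Span{Name: "TestContainer", Parent: exec, Reveal: true}
	inner := &Span{Name: "inner exec", Parent: test}
	sub := &Span{Name: "subtest", Parent: inner, Reveal: true}

	fmt.Println(displayParent(test).Name) // attaches to the top-level call
	fmt.Println(displayParent(sub).Name)  // attaches to the nearest revealed ancestor
}
```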
do you have a repro for this screenshot, or similar? gonna look into it
ah maybe this is close enough
❯ mv docs/dagger.json{,.bak}
dagger checks-pragma*
❯ dagger-dev functions
✔ connect 0.1s
✘ load module: . 7.9s ERROR
! failed to serve module: failed to load dependencies as modules: failed to load module dependencies: module requires dagger , but support for that version has been removed
├╴✔ finding module configuration 3.7s
╰╴✘ initializing module 4.2s ERROR
  ! failed to load dependencies as modules: failed to load module dependencies: module requires dagger , but support for that version has been removed
  ╰╴✘ ModuleSource.asModule: Module! 4.2s ERROR
    ! failed to load dependencies as modules: failed to load module dependencies: module requires dagger , but support for that version has been removed
    ╰╴✘ load dep modules 4.2s ERROR
      ! failed to load module dependencies: module requires dagger , but support for that version has been removed
      ╰╴✘ ModuleSource.asModule: Module! 0.0s ERROR
        ! module requires dagger , but support for that version has been removed
Error: failed to serve module: failed to load dependencies as modules: failed to load module dependencies: module requires dagger , but support for that version has been removed
@tepid nova progress on screenshot 2 -
this works by replacing telemetry.End(span, func() error { return rerr }) with telemetry.End(span, &rerr) (extremely longstanding TODO), which now also handles error origin tracking, by transparently reassigning rerr to one that's stamped with span
not sure why it didn't work for the last step, but already much better, looking into that now
(back from late lunch)
nice!
got the last one
nit: that ERROR is redundant with the ✘ on the left
it is, but it helps when it's less clear imo, like $ => CACHED
this should fix the check-generated error on main https://github.com/dagger/dagger/pull/11411
Is this a known issue? https://dagger.cloud/dagger/traces/228fb1a8a689b5be4de7dcd1a2430add?span=80dd5f41624ee978
fatal: unable to access 'https://gitlab.com/dagger-modules/private/test/more/dagger-test-modules-private.git/': Error while processing content unencoding: incorrect header check
opened a separate PR for this since it touches so many things: https://github.com/dagger/dagger/pull/11410
just did a self review, went over each change and squinted, not sure how to explain the CI failures yet 
replaces the clunky telemetry.End fn pattern with a simple error pointer
tracks the span as the origin of the error by re-assigning the pointer, if the error does not already have an origin
CI is failing in main btw
I was trying to fix a check-generated failure in engine, but now also getting a go/check-tidy failure in sdk/typescript/runtime
@still garnet I just merged the fix for failed check-generated (thanks for the ✅). Want to rebase and see if it fixes your CI errors?
@still garnet do you think we can merge checks-pragma today? would allow me to start calling actual checks from GHA
i see it's approved already (thanks @civic yacht) - can merge whenever! i re-ran a CI failure to see if it de-flakes, but gotta run to dinner, if it goes ✅ feel free to push the button
I was also gonna address the feedback on the .contributing AI slop but figure we can do that in post
go check-tidy fails on main in CI, but I cannot reproduce it locally on the same commit...
can I get a quick TLDR on this? this fixes a regression in main where we can't see some custom spans anymore https://github.com/dagger/dagger/pull/11409
all green, merged
Still have a failed go check-tidy in main... Still can't reproduce it
can anyone get this command to fail?
dagger call -m github.com/dagger/dagger@main go check-tidy
fails for me:
▼ .checkTidy: Void 1m14s ERROR
▼ Go.checkTidy(exclude: ["docs/**", "core/integration/**", "dagql/idtui/viztest/broken/**", "modules/evals/**", "**/broken*/**"]): Void 1m14s ERROR
! .: 'go mod tidy' must be run
▼ . 35.6s ERROR
! .: 'go mod tidy' must be run
▶ check for dagger runtime 0.1s
▶ .changes/.dagger/ 52.0s
can go mod tidy be flaky somehow? Or perhaps it depends on current engine version?
since check-tidy generates the dagger modules first
I think I got it. The good thing is it's not flaky, there's a reason for that
I tried different ways, and it wasn't clear to me why dagger call -m github.com/dagger/dagger@main go check-tidy was not passing, but a dagger call go check-tidy on the main branch of my checkout was passing
The explanation is some unversioned files I had in my clone.
- on a fresh clone of main, dagger call go check-tidy is not passing
- after a dagger call go tidy it works (the fix is there: https://github.com/dagger/dagger/pull/11412)
- on a fresh clone of main with some unversioned files (in my case some generated files from the python-sdk toolchain), dagger call go check-tidy is passing without the above changes
toolchains
└── python-sdk-dev
    ├── dagger.gen.go
    └── internal
        ├── dagger
        │   └── dagger.gen.go
        ├── querybuilder
        │   └── marshal.go
        └── telemetry
            └── attrs.go
...
@still garnet I haven't followed everything, but is there a way to declare that a function is a check in dang?
there is if you use the checks branch (can't merge until we ship the engine) - https://github.com/vito/dang/blob/436c93fbc7fa0b9eb3499ac40d0d99043dc31b13/.dagger/main.dang#L203
eh actually i can probably merge it now. but yeah, will need newly shipped or dev engine for it to work
gonna tweak go check-tidy to print the diff, which could give an idea - flying blind at the moment. can second @leaden glade's point that usually when this happens it's because of uncommitted files
you can also run go tidy and see the changeset
that said, it could be super nice to be able to browse a changeset in the terminal before applying it, like a git diff
@fair ermine @leaden glade before you log off today, can you tell me which of the kill-monolith tasks I should not touch?
I have finished with rust and dotnet so you can do the rest
I also opened a PR for the eager loading so you can test the monolith
I'm finishing the java-sdk, I think I'll be able to push the commit in a few minutes
Thanks guys
@still garnet more dang questions:
- are regexp available in dang?
- is there a way to raise an error in a specific case?
<@&946480760016207902> any objections to implementing currentModule().checks()? Would be cool if a toolchain could introspect all the checks in the current context - including but not limited to its own. Is that what currentModule() does? (or does it point to the toolchain/blueprint's own module?)
Or, maybe this is when we unshelve the concept of "current env" @still garnet and try again to make it more general, not just for LLMs?
currentEnv().checks()
currentEnv().workspace()
currentEnv().toolchains()
currentEnv().functions() // ?
currentEnv().functions().build() // actual function invocation with codegen? a path to self-calls?
My immediate use case is a GHA config generator: inspect all checks in the current context, auto-generate a config
@vito more dang questions:
@still garnet would it be easy to change the default checks view so that "sub-checks" are visible by default? (but not the other spans)
(sorry if we already had that discussion)
1. define sub-checks
2. anything is possible, though i'm not totally sure you want that, at least I wouldn't want a bunch of checks to push other things offscreen (Dang's suite for example is about 100 tiny scripts so maybe this goes back to 1.)
Sorry I'm using "sub-checks" as an alias for "custom spans emitted by checks, which we hope to formalize as a 'sub-checks' feature later"
Yeah it might be too much since there are so many
It's just that right now if I don't expand the span tree, I only see raw logs but no useful information, including whether any sub-check has failed
but if I expand, I get a firehose
default vs. expanded
@still garnet quick question on a change you made to parallel. From telemetry.Internal() to a new dagger.io/ui.internal attribute.
I notice in the engine code we still use telemetry.Internal(). Is it safe for me to call parallel.New().WithInternal(true) within the engine, knowing that it will use the new attribute and not telemetry.Internal()?
(My goal is to always guarantee that parallel can safely be used both engine-side and client-side)
I moved it away from dagger.io/dagger/telemetry just to avoid it showing up in go.mod for modules, it was probably fine but that's always kind of weird when it happens (vs. modules using their own telemetry package)
those are just helpers for adding attributes so it doesn't matter at the end of the day
Nice! Thanks
(doing a quick pass at making engine traces more readable with a few judiciously placed custom spans)
i do wish for the Go SDK we did something like replace dagger.io/dagger => ./internal/dagger but there might be a good reason not to
TypeScript SDK seems to work that way
Maybe open a quick issue about it, and eventually the right person with the right historical information might run across it and answer?
Currently a Go SDK module gets its own personal Dagger Go SDK client generated to ./internal/dagger which the module code then imports. It also has a locally defined dag variable that is the instan...
@tepid nova I just pushed the java-sdk toolchain to the ci-faster-load branch
I wrote it in dang, it's pretty nice to work with
Thank you!
@still garnet dogfooding screenshot:
- Command: dagger checks go/lint
- Output: post-run (not live TUI viz)
- Dagger version: dagger/dagger@main
- CI module version: running shykes/dagger@ci-faster-load
Issues:
1. Error is too verbose.
- What I need: exit code 1
- What I get: ! input: container.from.withMountedCache.withMountedCache.withMountedCache.withWorkdir.withMountedDirectory.withWorkdir.withExec.sync process "golangci-lint run --path-prefix toolchains/security// --output.tab.path=stderr --output.tab.print-linter-name=true --output.tab.colors=false --show-stats=false --max-issues-per-linter=0 --max-same-issues=0" did not complete successfully: exit code: 1
2. Too much span context.
- What I need: nothing or maybe a link to that span in dagger cloud (I'm not debugging my module, just running a linter)
- What I get: 14 lines per check of low-level dagger calls and their arguments
The same screenshot with only the information I need:
@tepid nova trying to catch up - whats the status of checks + toolchains? is it working on any branch somewhere or should I put up a PR since checks are merged?
Here's the diff of the whole output:
- Actual: 427 lines, 33719 characters
- Desired: 133 lines, 7266 characters
https://gist.github.com/shykes/aa9d9cdb3072968894c8676de33e36ca/revisions?diff=split&w
https://github.com/dagger/dagger/pull/11410 should address a lot of that
Ah nice, I was wondering if there was still an outstanding PR!
missed that one
probably still more to do after, but getting there. I think your parallel package might need to use the telemetry.EndCause helper internally to fix some of it, which gets back to the dagger.io/dagger/telemetry dependency
@still garnet I ran into an interesting situation in that example:
- All my errors except for one were golangci-lint failures. So the context tree was completely useless (don't care how the tool is run, just about its logs). I also decided I don't care about exit code either. If exit code is meaningful, then it makes sense for the module code to expect & emit a custom error. The happy path should be: don't care
- one error was a failed moduleSource().asModule() caused by missing dagger.json files in a few places. First reaction: "oh no in this case I do need the context tree! There are no logs, only the errors and those need context!". Second reaction: "actually if we're leaning on logs for user functions, we should do the same for system functions. Instead of printing an error with "no such file or directory" wrapped with a bunch of noise, wrapped in a complicated context tree, there should be an actual log message saying "no such file or directory" and we should print that to the user"
TLDR if we're embracing logs for user-facing error context, we should embrace it all the way, including core functions. Then IMO we won't need to show special context trees around errors at all. Either you look at the logs, or you want to dig deeper, and you look at the complete trace
It's all in main
ah ok, thanks. I'm on your ci-faster-load branch and noticed checks -l didn't have anything
We're trying to get the monolith refactor ready to merge... It's a big lift but getting close! You can run its checks+toolchains and everything
It should. are you running main?
(not compatible with 0.19.6 checks. need main)
dev build from 96a42e7e0 (head of ci-faster-load)
Ah I haven't rebased on main yet (Alex's PR also changed the CI module so I have to resolve that)
Until I do (today), you should test ci-faster-load with dagger@main
Just try to use it, and as soon as you find issues (you will), add them to the checklist in the issue. That's what I'm doing
oh yeah looks great with a main build
@tidal spire I'm taking a quick break, but in 15 min if you want, we can stress test it together
Could be a good demo too if anyone else is interested
nice i have to step away for a few minutes soon but i'll be back after!
I'm going to schedule an event. Should I schedule it for 12h30 PT? (in an hour)?
(violets home sick and shannon has a meeting at 3 est)
What time works for you
yeah 1230 or 1. If you're demoing I can watch whenever
Too late to run a release?
@still garnet we're live-dogfooding with @tidal spire, there's a weird glitch where the output of dagger checks is.. flaky somehow? Same command will output different things. Like some checks just aren't there
Problem In the experimental dagger checks feature, we use '/' as separator for the check path. This can be confused with a file path (it's actually a function path). The problem will be...
@civic yacht I looked at the code but couldn't figure it out. Is ModuleSource.configExists() cheaper than ModuleSource.sync()?
Or does accessing that field imply sync()?
Accessing that field does imply sync. You could really use either, I think it's going to be arbitrary. I guess using Sync is a little more explicit in terms of what you're trying to do in your code
Will sync() fail if there's no valid dagger.json?
No I don't believe so based on the code, I think it will treat it as a new module to init in the dir. So if you care about asserting its existence then ConfigExists would be better
I'm assuming you are trying just like dag.ModuleSource("...").Sync(ctx). If you chain more calls to ModuleSource then the answer starts to vary
OK thanks! I was looking at the current code in CLI to initialize a module, to choose the most accurate & clear wording for the spans we show users
I'm using validate for ConfigExists which I think is accurate and clear. The only problem is that it takes a long time... Because I guess it triggers some actual file uploads/downloads. So I was thinking of calling it "materialize" or "transfer files" or other sync-ish word.
TLDR seeing ConfigExists taking a long time got me wondering if I really understand what it does and why we call it in the CLI
@civic yacht will it make things faster if I call Sync() and then ConfigExists()?
That way I can say:
transfer files ✔
validate ✔
And not be lying
I know it's a small thing, it just always bothered me that when reading those initialization spans I have no idea what each step actually does
no:
- Sync by itself will just trigger execution of dag.ModuleSource("foo")
- ConfigExists triggers execution of dag.ModuleSource("foo") and then returns a bool field set during that execution
So they are almost the same thing, ConfigExists has probably microseconds of overhead to retrieve the bool field too, but that's it. Calling sync and then configExists would just add a little bit of extra overhead.
I'm using validate for ConfigExists which I think is accurate and clear. The only problem is that it takes a long time... Because I guess it triggers some actual file uploads/downloads. So I was thinking of calling it "materialize" or "transfer files" or other sync-ish word.
Yeah whether it's a local or git source, it's essentially just pulling of sources and reading various pieces of configs from dagger.json. It also does the same for each dependency (recursively). So those words would make sense to me
@tidal spire @still garnet I figured out the root cause of that persistent build error in .dagger. It's not that the dagger runtime was not generated - it's that toolchains are no longer included in the codegen. This is since I rebased on main. So something happened on main that changed the codegen behavior of toolchains... Seems related to the other unexplained issue where dagger functions no longer prints the correct description for toolchains, when it definitely did since I fixed it in main. Maybe same root cause for both mysteries?
(but I can't find any recent commit that does anything suspicious)
Maybe i'm missing something but this looks like a very low hanging fruit: replace docker info with docker version to speed up connect time: https://github.com/dagger/dagger/pull/11421
@still garnet re telemetry.EndWithCause(). Should I just always use it in parallel? Or make it configurable whether to call a) End() or b) EndWithCause() ?
Also should I stop calling span.SetStatus(codes.Error, err.Error()) ?
Answering my own question: EndWithCause() already handles it, so yes I can safely remove that
Yeah, if you use telemetry.EndWithCause in parallel that should fix a lot of the repeated errors. I had some kludgy changes locally to do the same but by copying the necessary code into util/ but honestly probably fine to just import dagger.io/dagger/telemetry for now
I don't mind that import at all. Doing it now
EDIT: go mod tidy complains... but I will figure it out
oh, euch, that leads to hairy dependency issues, since the modules won't have the right dagger.io/dagger (unless you do a go mod replace i guess? if that works in a module? had trouble with that in Dang)
So far:
diff --git a/.dagger/go.mod b/.dagger/go.mod
index 17b249f3e..e0c7807cb 100644
--- a/.dagger/go.mod
+++ b/.dagger/go.mod
@@ -9,6 +9,7 @@ require (
replace (
github.com/dagger/dagger => ..
+ dagger.io/dagger => ../sdk/go
github.com/dagger/dagger/engine/distconsts => ../engine/distconsts
github.com/dagger/dagger/sdk/typescript/runtime => ../sdk/typescript/runtime
)
diff --git a/dagger.json b/dagger.json
index 2e3436cd2..f9ba52f12 100644
--- a/dagger.json
+++ b/dagger.json
@@ -9,7 +9,8 @@
"sdk/typescript/runtime/**/*",
"go.mod",
"go.sum",
- "util/parallel"
+ "util/parallel",
+ "sdk/go"
],
"dependencies": [
{
I have graduated from a runtime build error to a codegen error
Error: load package ".": no packages found in .
! process "codegen generate-typedefs --module-source-path /src/.dagger --module-name dagger-dev --introspection-json-path /schema.json --output typedefs.json" did not
complete successfully: exit code: 1
I've done this sort of "go.mod replace + dagger.json include" tweak many times, to import parallel from various dagger modules. This one is slightly different though
Oh no I have to do the extra replace in every module that imports parallel
yeah
i'm this close to just yeeting error origin metadata into error strings and parsing it out lol
Funny I actually removed parallel from almost every dagger module in my branch
thanks to dagger checks, tons of aggregator functions just aren't needed anymore
btw... we can make doug a toolchain now
(I know technically it's the dev module but doug is too cool a name not to use it)
(welp adding the go.mod replace everywhere did not fix the codegen error..)
Now go mod tidy complains in each module where i added the replace. Some sort of interference between github.com/dagger/dagger and dagger.io/dagger?
diff --git a/.dagger/go.mod b/.dagger/go.mod
index 17b249f3e..e0c7807cb 100644
--- a/.dagger/go.mod
+++ b/.dagger/go.mod
@@ -9,6 +9,7 @@ require (
replace (
github.com/dagger/dagger => ..
+ dagger.io/dagger => ../sdk/go
github.com/dagger/dagger/engine/distconsts => ../engine/distconsts
github.com/dagger/dagger/sdk/typescript/runtime => ../sdk/typescript/runtime
)
$ go mod tidy
go: finding module for package github.com/dagger/dagger/.dagger/internal/dagger
go: github.com/dagger/dagger/.dagger imports
github.com/dagger/dagger/.dagger/internal/dagger: module github.com/dagger/dagger@latest found (v0.19.6, replaced by ..), but does not contain package github.com/dagger/dagger/.dagger/internal/dagger
@still garnet permission to copy-paste? ๐
lol yes
I don't know how to fix this
that's what I ended up doing
just the whole telemetry package?
you can just yoink the one function and handful of other types it needs
it'll just be more code we can triumphantly delete once we figure out the statuses/subchecks API
And it builds ๐
Now to add the logRollup
To confirm my understanding:
- For now, to get log & span rollup in my "sub-checks", I need to set the corresponding attributes in my check function. This is a stopgap (we don't want to require all toolchain devs to do this)
- Soon, we will have an official API for sub-checks, and the engine will handle setting those attributes on behalf of the checks dev.
--> Agree?
yep!
Question for the theseus masters... @civic yacht @rocky plume . How hard would it be to start collecting some sort of "cache hit rate" metric from the engine? I think we need a number to quantify, even roughly, how much an engine is using its cache. If only to see if that number improves or degrades over time.
In the context of scale-out and parc, it's difficult to evaluate whether a given load distribution strategy is working or not without that number.
For example, I'm looking at an engine I'm currently allocated to. It's 7 days old, and has a great variety of modules in its cache. On the one hand keeping instances around longer should give us better cache use. But on the other hand, re-using the same instance for many different modules might hurt cache performance. Hard to tell, without some sort of measure... You get the idea.
Ironically when we slice up an engine for multiple tenants, it's easier to slice up CPU, memory and disk than it is to slice up the engine cache
Getting a weird CI fail... https://dagger.cloud/dagger/traces/d2ac57704e174d2eb824810764bf43e1#2ba45316a22c545e:EL94
It's looking better
oh btw, random idea re: dagger generate-ing from the path you want to regenerate: when we run a // +generator func we can inspect its returned Changeset and record the changed files in dagger.json, something like this:
{
  "generators": {
    "dagql/foo.go": "TypeName.funcName"
  }
}
then when you do dagger generate ./dagql/foo.go or dagger generate ./dagql it'd run TypeName.funcName.
could get spammy in some cases (e.g. elixir/php have a bunch of separate files iirc), but you get the idea
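A minimal sketch of that lookup, assuming the "generators" map from dagger.json has already been decoded into a Go map. The helper name generatorsFor is hypothetical, and the second map entry in main is made up for illustration:

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// generatorsFor returns the generator functions to run for target.
// generators maps a generated file path to "TypeName.funcName".
// target may be a single file or a directory containing generated files.
func generatorsFor(generators map[string]string, target string) []string {
	target = path.Clean(target)
	seen := map[string]bool{}
	var out []string
	for file, fn := range generators {
		file = path.Clean(file)
		inTarget := target == "." || file == target || strings.HasPrefix(file, target+"/")
		if inTarget && !seen[fn] {
			seen[fn] = true
			out = append(out, fn)
		}
	}
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out
}

func main() {
	generators := map[string]string{
		"dagql/foo.go":     "TypeName.funcName",
		"sdk/go/client.go": "GoSDK.generate", // hypothetical entry
	}
	fmt.Println(generatorsFor(generators, "./dagql/foo.go"))
	fmt.Println(generatorsFor(generators, "./dagql"))
}
```

Deduplicating by function name helps with the spam concern: an SDK that records many generated files would still resolve to one generator call.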
a conundrum: how hard would it be to wrap our integration tests in go test?
our go and engine-dev toolchains are competing to run the same test suites
@still garnet There's a problem with the dang SDK, all our CI and local dev build are failing because of:
# dagger/dang/entrypoint
entrypoint/main.go:445:20: funDef.WithCheck undefined (type *"dagger/dang/internal/dagger".Function has no field or method WithCheck)
We are currently investigating with Yves to understand what's going wrong and why the CI has been green 2 days ago while you did your changes on the dang SDK 5 days ago (when adding the checks).
Yves noticed that the issue may be triggered because you are using unreleased API on the Dang SDK (it still doesn't explain the CI issue tho)
I also tried to run dagger functions on your dang module and same error, it's failing (with dagger v0.19.15 and v0.19.16).
@tidal spire can you help me get 11373 merged today? We should be close
I have no idea if this would be the ideal place to do it, but I wonder if we could keep a running total of the ratio each time this line (cached = res.HitCache()) is executed? https://github.com/alexcb/dagger/blob/5ee94abb936d1e9d0aa2f32047579ee7148be887/dagql/session_cache.go#L154
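A rough sketch of what such a running total could look like, assuming it is fed once per lookup next to the cached = res.HitCache() line. CacheStats is a hypothetical type, not something that exists in the engine:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CacheStats keeps a running count of cache lookups and hits.
// Hypothetical helper; safe for concurrent use.
type CacheStats struct {
	hits  atomic.Int64
	total atomic.Int64
}

// Record is meant to be called once per lookup with the hit/miss result.
func (s *CacheStats) Record(hit bool) {
	s.total.Add(1)
	if hit {
		s.hits.Add(1)
	}
}

// HitRate returns hits/total, or 0 when nothing has been recorded yet.
func (s *CacheStats) HitRate() float64 {
	total := s.total.Load()
	if total == 0 {
		return 0
	}
	return float64(s.hits.Load()) / float64(total)
}

func main() {
	var stats CacheStats
	for _, hit := range []bool{true, true, false, true} {
		stats.Record(hit)
	}
	fmt.Printf("cache hit rate: %.2f\n", stats.HitRate())
}
```

A single ratio per engine is coarse, but it would be enough to see whether the number improves or degrades over time.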
@still garnet FYI in 11373, errors still get the unneeded "context tree", even though I changed parallel to use your fix with error hoisting. Is that normal, or a sign of a bug in my parallel implementation?
would need to take a deep dive to figure it out but this is the sort of brittleness i was talking about, each and every layer has to cooperate. i'm trying out my general fix (lovely regexes and metadata error string injecting), if you have the command that repros that I can try it out
Anyone else getting this insanely slow call when building engine?
loadPackage -> 4m30s and counting
@fair ermine there's only Elixir SDK left to merge/spin out right?
For those who are familiar with our release flow. Do we ever call dagger -m releaser publish? Or is that actually dead code?
$ grep -r releaser RELEASING.md .github/
RELEASING.md: dagger call -m releaser bump --version="$ENGINE_VERSION"
RELEASING.md: export CHANGIE_MAINTAINERS=$(dagger call -m releaser get-maintainers --github-org-name dagger --github-token="cmd://gh auth token" --json)
RELEASING.md: dagger develop --recursive -m ./releaser
.github/workflows/publish.yml: module: github.com/${{ github.repository }}/releaser@${{ github.sha }}
.github/workflows/publish.yml: --goreleaser-key=env:GORELEASER_KEY \
--> No relevant reference to it anywhere EDIT: last line is the relevant line
Yeah I thought @fresh harbor was interested to work on it but I can take care of that tomorrow if you need it fast
cc @astral zealot @wild zephyr
Could this possibly be dead code??? nevermind I found it: module: github.com/${{ github.repository }}/releaser@${{ github.sha }}
btw we can use a .env for release...
Note @still garnet : our own publish function hand-rolls the pattern that many users have been asking us for on the tests: "give me a report artifact even if it fails"
Which is only possible if we suppress errors at the app layer ("dagger errors mean something went wrong with dagger")
OR, at some point down the road, we offer a core feature that can produce a report artifact for any dagger call, from its otel trace
(it does feel like that's what our own publish report code is trying to replicate. "We called this publish function, with these arguments, and then we got this error as a result", etc)
--> dagger --export-report=./report.md call ...
or as an intermediary step: allow checks to return a File or Directory? Since we have +check now
hmm even with that you'd need a way to still 'fail and return'
yeah
doesn't work as is
So maybe in the same vein as adding support for a "subcheck" interface-ish, we could also support a "check-result" interface-ish, with a pass(): Bool! and report(): File!
(in lieu of void)
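To make the "fail and still return a report" idea concrete, here is a sketch in Go. CheckResult, lintResult and runCheck are all hypothetical names, and a string stands in for File!:

```go
package main

import "fmt"

// CheckResult is a hypothetical shape for a check's return value,
// mirroring the proposed pass(): Bool! and report(): File!.
type CheckResult interface {
	Pass() bool
	Report() string // stands in for File! in this sketch
}

// lintResult is an example implementation that fails when it has findings.
type lintResult struct {
	failures []string
}

func (r lintResult) Pass() bool { return len(r.failures) == 0 }

func (r lintResult) Report() string {
	if r.Pass() {
		return "lint: ok\n"
	}
	out := "lint: failed\n"
	for _, f := range r.failures {
		out += "- " + f + "\n"
	}
	return out
}

// runCheck always returns the report, plus an error when the check did not pass.
func runCheck(name string, res CheckResult) (string, error) {
	report := res.Report()
	if !res.Pass() {
		return report, fmt.Errorf("check %q failed", name)
	}
	return report, nil
}

func main() {
	report, err := runCheck("go/lint", lintResult{failures: []string{"unused variable x"}})
	fmt.Print(report)
	fmt.Println("error:", err)
}
```

The point of the split is that the caller decides pass/fail from Pass(), so the check function itself never has to error out before producing its report.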
<@&946480760016207902> I'm going to make the following changes to our CI. Any red flags?
- Engine tests default to --race=false, but CI config explicitly sets --race=true. Objections to changing the default to true, so the CI config can be dumber?
- Engine tests default to --parallel=0, CI config sets --parallel=16. OK to set default to 16 for same reason?
Alternatively, I can move those settings to a .env
(but then we need to talk about how to get that .env into CI)
ACTUALLY because of our defective DX, I believe we cannot set the default of a bool argument to true... that effectively makes it impossible to unset (because our generated clients don't differentiate between "set this to false" and "don't set this")
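A generic Go illustration of why "set this to false" and "don't set this" collapse into the same thing when an optional bool is a plain value. This is only an analogy using encoding/json, not a description of how the generated clients are implemented; plainArgs and pointerArgs are made-up types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plainArgs cannot distinguish "false" from "unset": both are the zero value.
type plainArgs struct {
	Race bool `json:"race,omitempty"`
}

// pointerArgs can: nil means unset, a pointer to false means explicitly false.
type pointerArgs struct {
	Race *bool `json:"race,omitempty"`
}

func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	no := false
	fmt.Println(encode(plainArgs{Race: false})) // {}
	fmt.Println(encode(pointerArgs{Race: &no})) // {"race":false}
	fmt.Println(encode(pointerArgs{}))          // {}
}
```

With the plain bool, an explicit false is dropped from the request, so a server-side default of true would always win.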
Engine tests default to --parallel=0, CI config sets --parallel=16. OK to set default to 16 for same reason?
We just want -parallel to be the number of CPUs I believe, so actually leaving it as 0 should be fine for our CI. Setting explicitly to 16 would create the same end effect, except we'd need to update it whenever infra changes
Nice - one less setting to worry about
What about race enabled by default? (setting aside the DX complication)
for our CI in particular, that's fine if it's on by default I suppose. I'm not sure if the bigger context here is figuring out the default for the go toolchain though. If so, then for general purpose usage by anyone I'd probably want it to default off? But not strongly opinionated, there's arguments either way
@still garnet we're redoing the GHA config plumbing on top of dagger checks, and ditching the generator... There's one workflow that is a little more custom, your llm test workflow. I wanted to check with you what we should do:
- Path filters. Easier if we remove it, I'm guessing it will make things slightly less efficient in the short term (the job will trigger when it's not needed), but that all goes in the trash soon and replaced with smart checks anyway. Just checking that it doesn't actually break anything to remove those test filters
"on":
  push:
    paths:
      - core/llm.go
      - core/mcp.go
      - core/env.go
      - core/llm_*.go
      - core/llm_*.md
      - core/schema/llm.go
      - core/schema/env.go
      - modules/evaluator/**
      - modules/evals/**
The shell job:
call: --allow-llm all test specific --env-file file://.env --pkg ./cmd/dagger --run CMD/LLM
--> Not sure what that even does or what to do with it
it will only break our bank account
That's a highly motivating test case for smart checks
if you need to punt for now you can, since that job currently is broken in CI
i just run them locally atm
For now I'll "eject" the yaml to be manually edited if that's ok?
sgtm
what if we change that one to just be dispatch?
that'd be fine too
@tidal spire FYI I had to eject 2 workflows:
- llm.yml
- daggerverse-preview.yml
I'm not actually sure what the purpose of the daggerverse-preview workflow is. I'll check into it but if anyone has context please educate me
Looks like it checks that crawling modules works as expected when we're creating a new release. I think we could find another way to test this that fits better in our test stack
Especially now that the crawling code is in a separate library which wasn't the case when this pipeline was created. The library is private but we can move it if it makes sense
we dont have to make that change now, ejecting is fine. just noting for the future
We're getting closer
This is all that's left of the monolith
To be fair engine-dev is a beast
the monolith within the monolith
But we'll get to that later
You fix 90%, and suddenly the remaining 10% looks huge
Any hesitation with me shortening this error message?
process "go test -ldflags -X github.com/dagger/dagger/engine.Version=v0.19.7-251117095547-480c5d59242f -X github.com/dagger/dagger/engine.Tag=bfeb1a7cec3757a60e9e2c5bf176c1131e9af388 -parallel=24 -timeout=60m -count=1 -run TestModule ./..." did not complete successfully: exit code: 1
=>
exit code: 1
diff --git a/core/container_exec.go b/core/container_exec.go
index 7d8ea6e90..2efe87f8f 100644
--- a/core/container_exec.go
+++ b/core/container_exec.go
@@ -557,7 +557,7 @@ func (container *Container) WithExec(
}
if execErr != nil {
- return nil, fmt.Errorf("process %q did not complete successfully: %w", strings.Join(metaSpec.Args, " "), execErr)
+ return nil, execErr
}
return container, nil
cc @civic yacht
SGTM since I'm presuming telemetry fills in the info about the args. Maybe worth putting in a slog.Error above the return error so that engine logs remain coherent (i.e. you can tell what actually failed in those logs rather than just "something exited 1")
ah good call, will do! yeah it's redundant with telemetry, and SUPER verbose when like 10 things fail
<@&946480760016207902> if you recognize this GHA config, this is your warning that it will get nuked soon, and you should start thinking about how to migrate it off of GHA ๐
- _dagger_on_depot_local_engine.yml
- _dagger_on_depot_remote_engine.yml
- alternative-ci-engines-1.yml
- alternative-ci-runners-1.yml
- benchmark-engine.yml
- benchmark.yml
- changelog.yml
- daggerverse-preview.yml
- deploy-docs.yml
- llm.yml
- publish.yml
- stale.yml
- trace-workflows.yml
"soon" as in: in the next few weeks.
@tidal spire I'm taking the PR live... let's see what happens!
Here we go https://github.com/dagger/dagger/pull/11373
To play with it:
dagger -m github.com/dagger/dagger@main call playground with-directory --path=. --source=https://github.com/shykes/dagger#ci-faster-load terminal
@still garnet would it be easy to get dang sdk to work with 0.19.6 and also fully support checks?
in a hacky kind of way yeah - i could just sidestep the codegen'd WithCheck call and use the underlying graphql client. wouldn't be too bad
assuming theres a way to detect 
Mmm what is WithCheck?
Function.withCheck, it's the typedefs API that the pragmas translate to
Ah! of course
(good time to bikeshed btw, hastily chosen name)
Those CI tests are passing a little too successfully...
I wonder if we remembered to exit 1 when dagger checks fails 🤔
Looks like it!
I had a typo in the new split-test (made that a toolchain)
dagger checks test-split/* -l
There is one issue left... Our glob path matching for checks is confusing our GH actions... It doesn't escape it properly, so if any file matches, it expands it on files 😭
Note that this check succeeds... Because no checks match, so it happily checks nothing
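One possible guard against "matches nothing, so it passes": treat an empty selection as an error. This is a sketch using Go's path.Match as the glob matcher, which is an assumption (the real matcher may differ); selectChecks is a hypothetical helper:

```go
package main

import (
	"fmt"
	"path"
)

// selectChecks returns the check names matching pattern, and fails
// loudly when the pattern selects nothing.
func selectChecks(pattern string, names []string) ([]string, error) {
	var selected []string
	for _, name := range names {
		ok, err := path.Match(pattern, name)
		if err != nil {
			return nil, fmt.Errorf("bad pattern %q: %w", pattern, err)
		}
		if ok {
			selected = append(selected, name)
		}
	}
	if len(selected) == 0 {
		return nil, fmt.Errorf("no checks match %q", pattern)
	}
	return selected, nil
}

func main() {
	names := []string{"go/lint", "go/check-tidy", "test-split/test-base"}
	fmt.Println(selectChecks("test-split/*", names))
	fmt.Println(selectChecks("testsplit/*", names)) // typo in the pattern
}
```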
@tidal spire the test-split checks are run successfully in the matrix, but seeing a bunch of timeouts... are you sure these get scheduled to separate machines?
I might have to move it up a level, I'll check
the other issue I see is the unescaped globbing, tried to fix it in d4gh but didn't seem super easy
other than that, looking promising
i'm happy with the separate test-split pattern, it's a stopgap, but uses toolchains in a way that feels clean
wdyt of having test-split automatically mean test-split/*? i tripped up on that pretty early, expected dagger-dev call test to run my full test/* subtree but instead it just ran the constructor
yes 100% agree
forgot to add that to the todo list
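The "test-split means test-split/*" rule could be as small as a prefix match on the separator. A sketch, with matchesSelector as a hypothetical helper and the separator passed in since it may change from / to ::

```go
package main

import (
	"fmt"
	"strings"
)

// matchesSelector reports whether name is the selected check itself
// or lives anywhere in the selector's subtree.
func matchesSelector(selector, name, sep string) bool {
	return name == selector || strings.HasPrefix(name, selector+sep)
}

func main() {
	names := []string{"go/lint", "test-split/test-base", "test-splitter/foo"}
	for _, n := range names {
		fmt.Println(n, matchesSelector("test-split", n, "/"))
	}
}
```

Matching on selector+sep rather than the bare prefix keeps a sibling such as the made-up test-splitter/foo out of the selection.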
also tomorrow I will switch to : as separator
just pushed up the wip elixir module @tepid nova ! left a comment on the PR too, but it has the same FIXMEs as typescript since we have several spots that need a Strings.replace
happy to attempt that on dang @still garnet but you can probably do it faster
ah fun, yeah lemme take a swing
alternatively: file("foo.txt", content).withReplaced("x", "y", all: true).contents
@tidal spire pushed: "foo".replace("o", "x") returns "fxx", there's a count arg if needed
Wow that was fast! Thanks!!
toolchain devloop is getting faster 💪
I'm logging off for the night, but cleaned up the TODOLIST, if you guys want to keep going:
https://github.com/dagger/dagger/pull/11373
@tidal spire @leaden glade @fair ermine
Sorry I caught a cold and haven't recovered yet. 😷
@civic yacht observed an odd sqlite error: https://dagger.cloud/dagger/traces/1fa66bac40d32526e14ada81efe38014?span=065090c0f98951d1&logs#065090c0f98951d1:L86
export to gx2p3gr00g5iywne4738evh1j: export spans [run]: begin tx: SQL logic error: cannot start a transaction within a transaction (1)
# followed by...
export to gx2p3gr00g5iywne4738evh1j: failed to export resource metrics: insert metrics: database is locked (5) (SQLITE_BUSY)
I just pushed an update to the new elixir-dev toolchain using File.WithReplaced. Going to have a look at toolchain descriptions and then non-string toolchain configs
I think the docs/recorder isn't used, so we can remove that. I'll double check
Yes I have not seen it used somewhere, but best to double check
I know at some point it wasn't working and there was an unsuccessful effort to rewrite it
that's why recorder2?
yeah
I was looking at the toolchain description (timing issue between discord and the issue) But I can jump on something else if you also started on it
no worries, if you want to keep going on that I will start on the toolchain config side. At one point the toolchain descriptions were working fine, so it was broken sometime recently
honestly we can do the recorder toolchain in a followup
(back)
Is the todolist up-to-date @tidal spire ? I don't want to conflict
(looks like it)
@astral zealot @wild zephyr I may need help understanding this timeout on test splitting...
Also is it possible that I accidentally triggered scale-out from GHA @civic yacht ?
not merged yet, so I hope not
yes!
well elixir is done-ish. As much as the typescript one is. They're both missing bump
But did they have bump before?
@tidal spire I'm hanging out in lounge trying to figure out those timeouts. Whenever you're out of the zone
Hello, to access the args of a parent inside the dag, would this be the only way (for non-sensitive args):
parentID := dagql.CurrentID(ctx).Receiver() // same as parent.ID()
if arg := parentID.Arg("url"); arg != nil { // look up by name
	if url, ok := arg.Value().ToInput().(string); ok {
		// use url ...
	}
}
github having issues for anyone else? the site works, but getting 500s when my engine tries to clone/resolve
that'll work, though i wonder what you're trying to do 
I'm stopping there for tonight, but I have something good for the toolchain descriptions + list. I just can't build and test it right now because of the current GitHub incident, so I'll finish that first thing tomorrow morning
One day we'll have caching & lockfiles so good, github outages won't even affect us
just successfully pushed again
same, we're so back
last missing part for sdk-dev toolchains into dang I think is just regexp.replace needed in {sdk}.bump https://github.com/vito/dang/issues/6
it would satisfy IsSemver (regexp.match) too since we're doing a hack with a go package at the moment
I guess we could do a similar hack for replace for now
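For reference, a Go sketch of what such a bump hack could look like with regexp. The version-line pattern mirrors the sed expression quoted later in this log; the semver pattern is a deliberately simplified assumption, and bump is a hypothetical helper:

```go
package main

import (
	"fmt"
	"regexp"
)

// versionLine matches the version attribute in the Elixir SDK's version.ex.
var versionLine = regexp.MustCompile(`@dagger_cli_version "([^"]+)"`)

// semver is a simplified check: MAJOR.MINOR.PATCH with an optional
// leading "v" and an optional pre-release suffix.
var semver = regexp.MustCompile(`^v?\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$`)

// bump rewrites every version attribute in src to the given version.
func bump(src, version string) (string, error) {
	if !semver.MatchString(version) {
		return "", fmt.Errorf("%q is not a semver version", version)
	}
	// ReplaceAllLiteralString avoids treating "$" in the version as a group reference.
	return versionLine.ReplaceAllLiteralString(src, `@dagger_cli_version "`+version+`"`), nil
}

func main() {
	out, err := bump(`@dagger_cli_version "0.19.6"`, "0.19.7")
	fmt.Println(out, err)
}
```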
@tidal spire I'm still seeing a bunch of mysteriously canceled / timed out jobs even after pushing your one-line yml fix
(almost done fixing the last typescript-sdk error)
I think just delete the concurrency group
Not quite there yet
If anyone is curious about our current release process https://gist.github.com/shykes/620acbd37543fe795d1552765d1911a2
@still garnet @civic yacht my git bisect says https://github.com/dagger/dagger/pull/11439 changed something I was relying on. Basically I have a println in HandleChanges() and unlazy and neither are printed when i do this:
foo=foo-$RANDOM
dagger -M -c 'directory | with-new-file bar BAR | changes $(directory | with-new-file '$foo' foo) | added-paths'
Am I missing something obvious ?
@tidal spire picking up kids. Still have timeouts on split tests 🤷
try a rebase on main if you haven't already, there's fixes there for issues that might plausibly cause weird hangs
Yeah I can't imagine it's the workflow file at this point
I had CI configured to run checks on this commit 3807ba08e53f9506a02bbfc564225ff16a571b63 ( telemetry: use at most one open db per client (#11424)). I just updated it to the very latest commit (b5b3d7be0afed1fe949a5494a58cd68ea5be906f). Let's see if it helps!
Rebased on main & running CI in an outer engine built from main. No luck..
I must have been as awake as Cloudflare yesterday 🤦 I put my name on the wrong line yesterday. That's why you weren't able to see I was working on the descriptions
Anyway, it's done and pushed, works for dagger functions, dagger toolchain list, with description for go and dang based modules
thank you! if you have any ideas on why the remaining checks are failing, I'm interested!
looking at them, fixing lint issues, tidy issues, etc
We have:
- dagger functions
- dagger call --help
- dagger toolchain list
- dagger checks -l
Can we/should we align them more?
There's something odd that I don't understand. If anyone has an idea.
I made a commit that fixes the go/check-tidy complaints. I did this by running dagger call go tidy from inside an instance based on main (not from my machine). The commit is pushed on the ci-faster-load branch.
With it, dagger checks go/check-tidy is happy.
But dagger checks test-split/test-base is now complaining saying go mod tidy must be run. And if I revert my commit, then the tests can be run.
So somewhere we have something wrong, as the go.mod/sum changes are good in one case and bad in the other. And the other way around.
do you have a trace of the second one? (test-base saying go mod tidy must be run?)
yeah we need to revisit this after!
I always have timeout errors when trying to run the tests from a playground (local one or using cloud)
Head "https://mirror.gcr.io/v2/library/alpine/manifests/latest?ns=docker.io": dial tcp: lookup mirror.gcr.io: i/o timeout
For instance when running dagger checks test-split/test-base
Is that a known limitation? Or something I'm doing wrong?
@leaden glade @fair ermine since we're all looking at the same thing, I think I have a fix for the tidy issue and i'll push it up in a moment. Just verifying i've caught everything
I just saw your message on the issue. Nice catch. Yeah that makes sense, I haven't thought about the ignore aspect.
I'm currently testing your fix
this could be the dagger-in-dagger-in-dagger issue maybe? how many levels of nested engines are you running?
or, our registry is down
yesterday, to test the dagger cli I was running dagger in dagger in dagger with dagger to run dagger commands, true.
but no, today it's just one single level of playground. Using dagger main to run checks from the PR
The tests are running, nice fix (some are failing but no tidy error so far for testSplit/testBase)
looks like it fails go linting, which is ok. I originally removed the core/integration/** line from the ignore but i'm trying now with core/integration/testdata/** instead
So we're saying the remaining timeouts are because of telemetry getting lost in dagger cloud?
Trying to understand the way forward to merging this thing
Seems like it if telemetry is also hanging outside of ci for those checks. Maybe there's a missing defer somewhere 🤔
Thanks for catching that go mod tidy issue, it was driving me crazy yesterday
Could it be the fact that it's all running on a main engine?
but that would affect other checks also...
Something about dang checks then?
No, some of those have passed
I'll keep poking around. It seems isolated to test-split but i dont know how that matters yet
yeah same
I'll call the linter in the meantime
And try to get those : separators done before release (which we're supposed to do today!)
nice! I'm still working out the ignore thing. I think it might need to be more specific, like core/integration/testdata/modules/**
<@&946480760016207902> how do I "update the tests"? The go tool tells me to run go test -update but I'm assuming we need to run a wrapper equivalent?
which tests?
(we have a few that use the golden output pattern)
TestTelemetry
that one's call test telemetry --update
Ha ha I was so close:
dagger call test update --run TestTelemetry
--> 💥
that's the one we use for the LLM tests, could maybe consolidate
btw @still garnet the reason it took me a while to answer your question: that particular PR of mine seems to really mess with the default trace view in cloud... 😬 I think it's because of parallel's use of reveal
"our module loading telemetry is so user-friendly, we hide everything else so you can really appreciate how user friendly it is"
eh yeah i'm pretty hesitant to use Reveal for internal stuff. is it crucial for the PR?
imo we should be trending towards deprecating it, but i'm not 100% sure yet (and we're not there yet)
Well as you know, my secret technique is I change all the attributes randomly until it looks OK
Jokes aside, it happens to be what parallel does by default, and all I know is when I change any of the attributes, it starts a whole game of whack-a-mole. No opinion on reveal itself though
I could just remove the reveal from parallel completely? Will the custom spans still be visible? Will try
my guess is you'll want reveal when it's used within a module function, and not when it's used within the engine. but, while we're on the topic of deprecation, might as well try removing it entirely and see how it looks in modules too
OK. I just want to make sure it's shown prominently, because the whole point is for the default UX to be more pleasant and clear to end users (speaking of my PR 11423 which adds custom spans via parallel in strategic parts of module loading, to better explain what's going on)
--> Is that consistent with removing reveal?
@still garnet which commit of the dang SDK should I use for it to work with dagger stable?
(trying the vendor-by-blueprint trick to quickly swap the version when I need to run it locally with stable)
0fc10d961e421944c1f3dbff4961dd5b0a59998c
May I request a tag? For easier swapping
lemme see if I can just do that WithCheck workaround I mentioned instead, so you can keep rolling forward everywhere
nice
that would be ideal
(also don't worry about the tag, I have 2 dagger.json and a symlink)
pushed, with a caveat that modules that use the @check directive will still fail because that's an unknown directive. can work around that too if needed
ha ha I just ran into the exact same issue with my workaround
welp
(that's what happens if I load with 0fc10d961e421944c1f3dbff4961dd5b0a59998c
if it's easy to gracefully ignore unknown directives (or at least @check as a special case) that would be ideal
pushed workaround no. 2
it works!! Thank you
Hey Alex, can i take over and add a test for: https://github.com/dagger/dagger/pull/11392 ? To try making it as part of the release
Also vendor-by-blueprint is really nice
sure thing - thanks!
@still garnet running checks on the kill the monolith branch - if I dont expand go/lint it just shows me 1 error at the end when there are actually several. If I expand it, no problem
@tidal spire just fixed that on main
@still garnet another weird one for you https://dagger.cloud/dagger/traces/0a7f187d7a511a3fcc6c8fb6d52002bf?span=9ab3d910b4da192c
all of the withDirectorys look weird but when you click on one of the source: Directory args you can see whats actually going on
oh yeah, that's been a thing for ages, it's from the wolfi module
would be nice if it simplified it to the wolfi call
@still garnet but what's weird is that the call to the wolfi module is not visible in the trace
yeah in the constructor
what's probably happening is: 1. Go module is constructed, calls Wolfi to create that container, stores it as a property, 2. CLI runs checks, engine sets WithRepeatedTelemetry for the ctx, which causes all those cache hits to show up
maybe we just don't need that WithRepeatedTelemetry anymore?
@tepid nova do you remember the original motivation?
I think because CheckGroup.run() needs its own dagql server
@civic yacht a weird one...: https://dagger.cloud/dagger/traces/589323d9cf8e741614d3b9bfbc07c2a9?listen=e0964f87b1161272&listen=071224767ba23292
failed to sync: conflict at "cmd": change kind changed from "delete" to "add" during sync
I get what this is trying to guard against (concurrent changes during upload) but I feel like it's happening when it shouldn't. I wasn't making any changes and don't think any background thing was.
To add to the weirdness (or maybe answer for some), it looks like the filesync tried to touch /home?? Maybe it's looking from / instead of . in some situations?
FYI @Kyle Penfound @still garnet I filed an issue for later https://github.com/dagger/dagger/issues/11445
Problem When configuring my module to use an SDK, if that SDK is itself a blueprint, loading it will fail with this error: SDK does not support defining and executing functions https://dagger.cloud...
@tepid nova it's so we don't get rate-limited by dockerhub and our pulls are authenticated
that config file in our CI is currently injected by namespace
with an account that they own which has some special limits for their fleet
I remember that part, but I thought there was also a pull-through cache that we inject at the engine config level, that made the docker config less important?
follow-up: it's because by default dagql.Select marks telemetry as "internal", assuming it's an implementation detail of some higher level API. WithRepeatedTelemetry gets around that but also repeats telemetry giving us all those cache hits. gonna try a middle ground
yes, namespace also configures that
it's currently using mirror.gcr.io
but whenever the image is not in that mirror, It'll use the config.json to fetch it from dockerhub authenticated
this config is not in the config.json though. That lives in the engine.json file
We're seeing a weird network error, and trying to determine if it's caused by 1) unauthenticated Docker Hub pulls (we haven't re-enabled that docker config feature yet), or 2) a return of the dagger-in-dagger-in-dagger CIDR issue? 3) some other unknown issue...
@wild zephyr @civic yacht do you recognize the error in a way that helps diagnose?
https://dagger.cloud/dagger/traces/86cc7f665d07a2f3140df38f5d247084?span=f5e8ad5cff94ffbe
1) Dagger\Tests\Integration\ClientTest::testContainer
GraphQL\Exception\QueryError: failed to resolve image "docker.io/library/alpine:3.16.2" (platform: "linux/amd64"): failed to resolve source metadata for docker.io/library/alpine:3.16.2: failed to do request: Head "https://registry-1.docker.io/v2/library/alpine/manifests/3.16.2": dial tcp: lookup registry-1.docker.io: i/o timeout [traceparent:86cc7f665d07a2f3140df38f5d247084-29913271c3172612]
the engine service logs just have the same error: https://dagger.cloud/dagger/traces/86cc7f665d07a2f3140df38f5d247084?span=6e7346d3447897a0#6e7346d3447897a0:L139, so I'd suspect something funky is going on with the nested engine network. Seems like the engine service itself is unable to even make a dns request
it's hard to tell what's happening since there's so much nesting but the first thing that comes to mind is that if we're spinning up multiple nested engines, not only does each one need a CIDR that's different than the parent engine's CIDR, they themselves also need different CIDRs from each other
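A sketch of that constraint as a pairwise overlap check using net/netip. The CIDR values in main are made up for illustration, and firstOverlap is a hypothetical helper, not engine code:

```go
package main

import (
	"fmt"
	"net/netip"
)

// firstOverlap returns the first pair of overlapping CIDR ranges, if any.
// The list should contain the parent engine's range and every nested one.
func firstOverlap(cidrs []string) (string, string, bool, error) {
	prefixes := make([]netip.Prefix, len(cidrs))
	for i, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return "", "", false, fmt.Errorf("parse %q: %w", c, err)
		}
		prefixes[i] = p
	}
	for i := range prefixes {
		for j := i + 1; j < len(prefixes); j++ {
			if prefixes[i].Overlaps(prefixes[j]) {
				return cidrs[i], cidrs[j], true, nil
			}
		}
	}
	return "", "", false, nil
}

func main() {
	// Hypothetical ranges: parent engine plus two nested engines.
	fmt.Println(firstOverlap([]string{"10.87.0.0/16", "10.88.0.0/16", "10.89.0.0/16"}))
	fmt.Println(firstOverlap([]string{"10.87.0.0/16", "10.88.0.0/16", "10.88.4.0/24"}))
}
```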
yeah that makes sense. Not sure why we didn't encounter the issue before though...
Separately: maybe a new flake for your collection? https://dagger.cloud/dagger/traces/93301f52a469301d583232c8a4a110fe?span=a988b24b1568c38a
yeah I saw that one last night too, on my list
@still garnet so there's good news and bad news.
- Good news: the TUI showing memory metrics makes it really easy to spot infinite recursions blowing up your stack
- Guess the bad news (it's happening in the dang runtime)
narrowing down a repro
My clean-room repro fails to repro... It only fails in-place in elixir-sdk-dev and I have no idea why
This crashes with infinite stack recursion:
pub bump(version: String!): Changeset! {
#let versionFilePath = "sdk/elixir/lib/dagger/core/version.ex"
# let before = directory().withFile(versionFilePath, workspace.file(versionFilePath))
#let before = directory()
container().
from("alpine:3").
withWorkdir("/app").
withDirectory(".", directory()).
directory(".").
changes(directory())
#withExec(["sed", "-E", "-i", "", "-e", "s/@dagger_cli_version \"([^\"]+)\"/@dagger_cli_version \"" + version + "\"/g", versionFilePath]).
# directory(".").
# changes(directory())
}
pub bump(version: String!): Changeset! {
container().
from("alpine:3").
withWorkdir("/app").
withDirectory(".", directory()).
directory(".").
changes(directory())
}
do you have the stack trace handy?
sorry in a meeting
@still garnet finally found it -> https://dagger.cloud/dagger/traces/8f2db894d6578684158ef913c0909b0f
(most of the other runs I canceled before the end)
hmm kinda looks like an actual stack overflow in the dang code. is it pushed somewhere?
No but I can push. Normally if you just copy-paste that extra function in the elixir-sdk-dev toolchain and run it, it should repro
(to my great confusion)
tried that in my own module but no cigar
do you have another function named directory or container by any chance that depends on bump?
arg I checked for directory() but just realized, there is indeed a container().....
it doesn't look like it should trigger a recursion though. But yeah that can't be a coincidence
could be over-indexing but i wonder if this sort of thing is a point in favor of requiring explicit self.. depends on how common of a footgun it is
it's nice and terse being able to refer to sibling functions without it, but you don't have the dag. escape hatch that you normally would with dag.Container()
i guess alternatively we could be binding the dagger API to a var instead of having all of Query.* in the toplevel scope
tradeoffs all around...
ok I found the recursion I think
container() calls withBase() which calls container()...
that'll do it
fixing
I think I had another instance of shadowing, in another module, whenever I reference source() I get a mystery string instead of my pub source: Directory!. Couldn't track down where that string comes from. Meant to open an issue for you later
Got me wondering if there's a special case source symbol somewhere in the dang runtime
Feels like the paths forward are:
- Instead of Dagger's API providing a global container(), should Dagger's Query be bound as e.g. Dag.container?
- Instead of container() resolving to object-local container field, should it always resolve to the outer container(), and for local you need to do self.container()?
- This is fine

interesting, what was the string value? issue welcome
No idea, just got a bunch of dang error that boiled down to "you can't call this method on a string type"
oh i see, didn't get past typechecking
leaning heavily towards 2 for a few reasons atm, gonna try it out.
- Precedent in Python (familiarity points)
- Much easier to tell what's a local var vs. what's a local method call (readability points)
- Much easier to support copy-on-write semantics. Currently Dang has a carve-out so that self.a = 1, siblingMethod() calls siblingMethod with the latest self (previously it called the lexically bound siblingMethod which resulted in each call having the same self => pollution). If I go this route, self can just go back to being a regular variable, rather than a stand-in for this "dynamic scope" mechanism
Yeah from a familiarity POV, self beats dag, core, query, system, _, dagger, or whatever else we would choose for an explicit top-level query
if we went that route I'd probably opt to have an explicit import at the top. which is already there, it's just half-baked
@still garnet dang feature request: print a string to "stdout" to show up in function logs
there's print(x)
I know it's fixed in the current CI refactor PR, but I noticed we are currently running all SDK tests and lints 2 times, all in parallel. Quick fix in the meantime: https://github.com/dagger/dagger/pull/11449 (also has a small java sdk test improvement that saves like 15s)
dang: escaping shadowing (aka self or import?)
Hallelujah!
Improve test-split toolchain so that new test suites don't silently get dropped in the future
Could anyone explain this in a bit more detail (first ToDo entry of https://github.com/dagger/dagger/pull/11373)? I'm not sure I understand what it means.
If I'm right, as some tests are run with specific and testBase is running everything else, any new test suite should be run by testBase, no? Or is the issue at a different level?
Actually we can remove that line, it was a mistake on my part
I forgot that testBase uses -skip and not -run. So future new test suites will be automatically included.
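A sketch of why a -skip based split picks up new suites automatically. It assumes the base group skips an anchored list of the suites that run separately; TestModule appears earlier in this log, while TestProvision and TestBrandNew are made-up names, and both helpers are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// skipPattern builds an anchored regexp for `go test -skip` from the
// suites that already run in their own split.
func skipPattern(suites []string) string {
	quoted := make([]string, len(suites))
	for i, s := range suites {
		quoted[i] = regexp.QuoteMeta(s)
	}
	return "^(" + strings.Join(quoted, "|") + ")$"
}

// runsInBase reports whether a top-level test would run in the base
// group, i.e. whether the skip pattern does NOT match it.
func runsInBase(pattern, test string) bool {
	return !regexp.MustCompile(pattern).MatchString(test)
}

func main() {
	pattern := skipPattern([]string{"TestModule", "TestProvision"})
	fmt.Println(pattern)
	fmt.Println(runsInBase(pattern, "TestModule"))   // false: has its own split
	fmt.Println(runsInBase(pattern, "TestBrandNew")) // true: new suites land in base
}
```

With -run the base group would need an explicit allow-list, which is where a new suite could have been silently dropped.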
noice. I guess you were able to find what was happening with the dagger-in-dagger-in-dagger network errors I assume?
no I changed the check to run linters instead
networking issue is unresolved but also very niche (only affects our ci) so leaving it as a followup is ok imo
@leaden glade thank you for the extra dangification
Rebasing on main and dealing with last minute merge conflicts...
I wanted to clean up kill-monolith into a few large but clean commits, it's proving difficult

