#maintainers
1 messages · Page 7 of 1
thank you!
https://dl.dagger.io/dagger/main/<git sha>/dagger_<git sha>_linux_amd64.tar.gz
thanks
@tawny flicker Hey, there's a bug in Go codegen for Client.Pipeline:
return &Client{
q: q,
}
it's missing c: r.c so it's nil and panic'ing when I try to use it
Hi team, I got a strange fail on the engine CI in a PR that only updates docs. Anyone have an idea?
https://github.com/dagger/dagger/actions/runs/3898226697/jobs/6657012287#step:7:661
I suspect there's probably a race condition in how we run registries used for those tests, it looks like the registry tried to listen on a port but it was already taken. It'd be worth re-running and seeing if it happens again
The long term solution is to run the registries properly w/ actual services once we have those
Hmm okay, I also spotted a test that fail on my side but I hope everything will work after that
If the failure happens more than just very rarely we can fix it now in different ways probably, it's only worth ignoring if it's very rare. I'm assuming it's rare because I don't think I've seen that before but I could be wrong or something could have changed
Fixed! Thanks 
Hi guys, I am getting a x509 certificate error with dagger/engine (I'm using a private registry for images), however I am sure that buildkit is ok, since with dagger-cue and the same buildkit instance, everything works just fine.
I am running docker login before spawning the engine, and as the docs say, this should grant the container to have the same docker creds as the host machine, but it looks like that's not the case here.
Do you know if there is any workaround for this problem?
Hi, it seems your are using dagger-cue ? Then, I think the auth is different. It has to be speficied in the pull
No, sorry if it wasn't clear, I am using two different GH workflows, one with python and dagger-cli, the other with dagger-cue
They both use the same buildkit instance, but while for dagger-cue everything works with auth for docker.#Pull, the same is not true for dagger-cli/python; after a successful docker login to my private registry, I get the error #1 ERROR: failed to do request: Head "<registry>/v2/python/manifests/3.9-rhel8": x509: certificate signed by unknown authority
FYI, I documented the summarize of the current state of secrets support, what works and what's not working yet: https://github.com/dagger/dagger/issues/4366 as well as indexing all related open issues. If you think it's not complete, feel free to comment.
I don't think it is on a par with CUE, as with CUE we had the ability to mark stdout of host commands as dagger.#Secret
client: commands: sops: {
name: "sops"
args: ["-d", "./secrets.yaml"]
stdout: dagger.#Secret
}
fair, I'll refine
You all are amazing to add this feature in so fast.
https://github.com/dagger/dagger/issues/3830
I'll make some time to test this out for my setup. I'm mainly curious about replacing Buildkit with Dagger engine. Currently, I've got this deployed in K8s (without TLS options) : https://github.com/moby/buildkit/blob/master/examples/kubernetes/deployment%2Bservice.privileged.yaml
Would running Dagger engine be as simple as replacing the image to Dagger engine image and then connecting the SDK via _EXPERIMENTAL_DAGGER_RUNNER_HOST env var?
that’s the idea 🤞
There’s now quite a lot of active secrets issues currently open and I’m not really sure what the expectations are for a potential solution. It seems like the first take on it will be with session attachables, similar to the Earthly approach?
Either that, or fixing the cache leak in File { secret }, whichever comes first.
Quick one: https://github.com/dagger/dagger/pull/4392
@wild zephyr 🙏
Another pre-release of buildkit (apparently... that's what the release notes say...): https://github.com/moby/buildkit/releases/tag/v0.11.1
Welcome to the 0.11.1 release of buildkit!
This is a pre-release of buildkit
Please try out the release binaries and report any issues at
https://github.com/moby/buildkit/issues.
Notable changes
B...
Nice!
@spiral fog This is the one containing @civic yacht's socket fixes (the root cause behind docker exec not outputting stdout)
Yes, already tracking release. When do you think dagger would have a release containing this version of builtkit ?
@spiral fog This week -- I'm currently running tests locally using the new version
Does dagger currently edit out the value of secrets from logs, stdout, stderr, the way it did in 0.2 ?
Nope
Just to confirm, this is true both for values that are decoded from eg. a json file, and also values that map directly to the raw contents of a file or environment variable? Asking because in your issue you mention the former, but unclear on the latter.
True for both
Correct
s := client.Host().EnvVariable("FOO").Secret()
out, err := client.Container().From("alpine").
WithSecretVariable("FOO", s).
WithExec([]string{"sh", "-c", "echo secret env var is $FOO"}).Stdout(ctx)
fmt.Printf("container output is: %s", out)
p, err := s.Plaintext(ctx)
fmt.Printf("secret var is: %s", p)
#4 sh -c echo secret env var is $FOO
#0 0.072 secret env var is THIS_IS_SECRET
container output is: secret env var is THIS_IS_SECRET
secret var is: THIS_IS_SECRET%
Thanks! Updating https://github.com/dagger/dagger/issues/4413 accordingly
Hi, I was trying to connect my Python SDK directly to dagger engine, so that I first run docker run --name dagger_engine dagger/engine:v0.3.7 and in another shell first export DAGGER_RUNNER_HOST=docker-container://dagger_engine and then execute my python script trying to pull Python from a private registry, and then printing its version.
The problem that I am encountering is that I am getting x509: certificate signed by unknown authority.
From what I see from the help, dagger/engine is a wrapper around buildkit, so that I should be able to provide my CA certs and then use the flag --tlscacerts /some/path/certs.pem.
The problem, however, is that locally I cannot mount any volume to docker containers. I though then about two solutions:
- host dagger on a k8s cluster where I can mount volumes (so CA certs), and then port-forward remote 1347 to local 1347 -> PROBLEM: how should I setup
DAGGER_RUNNER_HOST, sincetcp://links are not supported? - force
dagger/engineto use another buildkit instance, i.e., an instance on k8s that I can port-forward to local to another port -> PROBLEM: is that even possible?
Hi I was trying to connect my Python SDK
👋 seems like there's a regression with build().publish(). Just opened an issue here to investigate: https://github.com/dagger/dagger/issues/4420
can't repro in main.
Not sure if this is some engine-related bug, my setup, or something else, but since adopting Dagger in our new monorepo, we frequently run into No space left on device coming out of buildkit:
#18 ERROR: open /var/lib/buildkit/runc-overlayfs/snapshots/snapshots/24/fs/opt/az/lib/python3.6/site-packages/azure/mgmt/network/v2017_10_01/aio/operations/__pycache__/_operations.cpython-36.pyc: no space left on device
Does this require some configuration on our gitlab runners? And is there a way to provide a more helpful error message for this? I wouldn't be surprised if this somehow relates to running docker:dind.
Are your gitlab runners statless or stateful? Do the runners machines have a single partition or multiple? By reading this, I'd assume that the partition where docker /var/lib/docker is located is effectively running out of space
it seems unlikely that it should run out of space – we spin up new instances on demand, so I'd expect them to be fine on space (though I'll have a look later to check thoroughly). Should only be one partition afaik
I assume this might have something to do with using the kubernetes executor for the gitlab runner. Wondering if we need to configure a PVC or something for the container
not sure how the k8s gitlab executor works regarding storate allocation. Have you checked that your k8s nodes are not effectively running out of space?
I believe I just need to tweak one of these https://docs.gitlab.com/runner/executors/kubernetes.html#storage-requests-and-limits
So yeah, not an #maintainers issue I assume 😄
confirmed that it was actually just a full disk… realised when we couldn’t access it because it couldn’t write to the systemd journal 😅
Hey @civic yacht, can I pick your brain on something? 🙂 It's about the error buffer (https://github.com/dagger/dagger/pull/3826).
@civic yacht @still garnet @stray heron I believe https://github.com/dagger/dagger/pull/4340 is ready to go
I did a self-review to point out a few things. I recently switched telemetry to a different target, and noticed some weirdness with the numbers. I believe it's because of how buildkit reports started/completed times, and the fix is to implement the same nasty mess buildkit does (interval list)
hey folks, is there any README file somewhere about how to set the OTEL stack? I wanna try something and I'd be great to this running ❤️
Hey sorry, was not around. We're using grafana cloud (which is basically a hosted Loki/Tempo/etc) to ship/store OTLP data. Do you mean setting up a different account or bringing up the stack on your own?
otel libraries support a standard set of env variables which we're using in #3826: https://opentelemetry.io/docs/concepts/sdk-configuration/otlp-exporter-configuration/
We're sending those straight to Tempo (hosted by grafana labs), but you could also send them to an OTEL Collector instead (https://opentelemetry.io/docs/collector/) and configure it to ship them somewhere else
Thx Andrea. I was hoping to start everything locally. Do we have any guidelines for that? If not, it's not a big deal. I can document it as I set it up
Oh, not really. OTEL/OTLP is a format, there’s different consumers. I’ve used Jaeger in the past and I think we still have 0.2 docs about it (there’s a compose tracing yml file in the repo IIRC)
That’s for traces though. Logs (e.g. events) are a different beast. Never used Loki locally
For the grafana dashboard itself, Gerhard has a json config file somewhere
OTEL has traces and metrics. However in the POC we’re only using traces. For “metrics”, we’re actually sending the events as logs to Loki, then Grafana can query those logs (as opposed to sending otel metrics to Prometheus)
@wild zephyr btw — not sure what your use case is, but you could just send those in the hosted version and build a different dashboard
👍 yeah.. I guess that's enough for now. Thx Andrea
Do we have a planned release this week? 🙏 cc @stray heron @wet mason @civic yacht
Hey, sorry @wild zephyr, forgot to get back to you. I was catching up on the latest RELEASING.md changes, haven't released myself in a long while
I think we should release soon (today ideally), haven't released in a couple of weeks. Is that ok for Python/Node @rancid turret @wild zephyr ?
@fair ermine @mellow bolt are we good to cut a release of the Node SDK?
Yeah 👍 Python → v0.3.1
Thinking about your entrypoints list @still garnet ... And a comment I read in the issue. We've been assuming that we can't use any of the pipeline code as entrypoint, because it's trapped in arbitrary code - basically a black box until we execute it. But I'm wondering if maybe we should revisit that.
Perhaps we could use the new Pipeline() directive, and make those pipelines the entrypoints? Then the session could be configured (via dagger run or otherwise) to selectively run some pipelines, and not others
The client would still run as a whole of course. But that's cheap (or at least it's reasonable to expect the client developer to make it cheap). But each time it makes a Pipeline("foo") call, the engine would make a decision on whether to actually run the following instructions, or silently drop them
yeah, I think we've been talking past each other, since I assumed we wanted those pipelines to be the entrypoints already
just wrote a big old reply that surfaces these questions 😄
I'm open to any design. My starting point is simply what our users do today, eg. with mage.
The de-facto entrypoint today for a project, is a project-specific executable tool, which most commonly takes a pipeline name as its first command-line argument.
@still garnet I think I might steal your list to add meat to https://github.com/dagger/dagger/issues/4414 . Seems like it would benefit from public discussion. wdyt?
Go for it!
@civic yacht Would you have some time to cut a release today? The steps have changed quite a lot since I've last ran them. Or I can do that on my own -- is it pretty smooth overall?
Last release was rough but that was due to all the changes around the CLI/sessions/etc. I ironed out the problems afaik but I can't be 100% certain its smooth again until we do another one. So I don't mind running through it again this time but in general hopefully it's something we can all share the load on.
That being said, I'd like to wait for word on which version to use for the nodejs SDK. Last time I asked last minute and just bumped the patch but later it turned out a minor bump was preferable
Also want to make sure the node SDK doesn't have any half-way done features that need more merges before releasing
100% on sharing the load. Maybe we should have a rotation, so everyone is up to speed with the release process? /cc @twin crow
Also builds incentives to automate all of it 🙂
Yeah, part of it should hopefully be incrementally improving it too so it becomes more and more automated
Thanks Erik :). Node.js version will be v0.3.2. I'll reformat the release note afterwards (like this one: https://github.com/dagger/dagger/releases/tag/sdk%2Fnodejs%2Fv0.3.1)
Remaining opened PRs are mainly about refactoring, so we're good to go 🚢
I'd love to shadow Erik but today I have to finish Early since I have Ollie has a pediatrician appointment :sadmarcos:
@civic yacht, any idea what this error is about? https://github.com/dagger/dagger/actions/runs/4027159246/jobs/6925959573
I think that's a step added for the new telemetry stuff, cc @wet mason
Looking into this now. ETA Monday.
@rancid turret https://github.com/dagger/dagger/pull/4478
@civic yacht @wet mason happily volunteering to release and/or shadow someone for the release this week
Last time went smoothly enough that I think the updated instructions here should be easy enough for someone unfamiliar to follow: https://github.com/dagger/dagger/pull/4461 (cc @wet mason @stray heron need a review on that still)
Happy to be around for questions if anything comes up while you run through it though
Feedback on our own dev workflow. Running ./hack/make lint after not touching the repo for a few weeks is... overwhelming. I just watched what seemed like a massive chain of docker builds, unfold on my terminal for 5 solid minutes. I had no idea what was going on.
I think we should stop relying on the default buildkit output for display to the user. I would prefer if it were disabled by default, and instead the tool gave me an output relevant to the task at hand (linting)
$ ./hack/make lint
<insert output relevant to linting>
@tepid nova that's the frustration that led me to prototype the terminal UI
@tepid nova The tricky part is that pipelines are asynchronous so:
printf(“building the engine”)
WithExec(“go build”)
We’d be printing “building the engine” right away, but it actually gets built by a different function later on!
We rely on async behavior throughout the CI code quite a lot (helper to get the repo code, helper to set up a go env, python, node, …), basically our “targets” combine a bunch of helpers together to assemble a final target, so any message we’d output in helpers would be “too soon”
In case of errors it’s worse:
“building the engine”
[bunch of messages about other things]
[much much later …]
“go build failed” <— referring to the engine
Current system is not human legible. It’s just enough so when it fails, the output is useful enough to figure out why
@tepid nova the best way around this IMO is visualisation “aided” by code (provide hints of what’s going on using Pipeline()). Terminal visualisation for local dev, and possibly someday (wink wink) cloud visualisation for CI
I agree with that, it's hard for the client code to print information about pipeline execution to the user directly.
BUT that's not the only thing I may want to print. For example, in the linting example what I want to print is the output of the linting tool. Right now that is mixed with all the rest of the logs. But from the client tool's perspective, it's not just another log event - it's what I want to print to the user.
So I think there's 2 distinct visualization problems:
-
Problem 1: when you want to visualize the engine logs, it's hard to follow. Solution: better viz UI in the terminal, then in web
-
Problem 2: when a Dagger client wants to control its own output, we don't give clear guidance on how, or whether they should. It's all or nothing: dump the logs, or get no output
An example of problem 2, in the context of the linter: I want to see my linting summary. I don't care about dagger logs. How do I get that?
- Option 1: client tool implements its own output logic.
output, err := lintPipeline.Stdout(ctx)
if err != nil { ... }
fmt.Printf("Linting result:\n%s\n", output)
- Option 2: give pipelines an API to control their own user output (can't just be exec stdout! too noisy)
???
maybe as extra inputs to Pipeline()?
👋 just wondering if there's a reason why we're not currently re-using cache upon engine updates. Since dagger engine volumes are not named, when the engine image changes, all my buildkit cache gets invalidated. cc @civic yacht @wet mason
lintPipeline := c.Pipeline("linter", &PipelineOptions{
// If Info points to a container, use its stdout as a markdown-formatted info field for the pipeline
Info: c.blabla.blabla.Exec("lint-tool")
}
One main reason: sometimes buildkit changes some internal data structure and when reusing the same volume, it starts failing in weird way
Thought the “cache bust on upgrade” is less annoying than “dagger randomly broke I don’t know why”
/cc @civic yacht — not sure if there were other reasons
Interesting and agreed
Perhaps a way to set up custom statuses / outputs per pipeline?
Right. Which we need anyway
But I realize now there's a big design decision underlying this - how are developers supposed to handle output to the user. In that regard, is Dagger A) a library, or B) a framework.
And by default, last operation of a pipeline is the status/output.
Thinking of our internal pipelines, I think that would work 99% of the time
But there may be several concurrent execs in a pipeline right?
But I guess we can have a default rule to decide who wins
Yes, correct. And sub pipelines …
Right
Sub-pipelines are OK assuming each sub-pipeline has its own status field. If a pipeline has lots of sub-pipelines, and no exec of its own, it just has a blank status (no rollup by default)
The problem in this model is that, eventually we're going to want callbacks
Interesting.. I would have expected buildkit to handle this gracefully internally somehow. i.e: if the internal structure changes, buildkit will automatically fallback to a new DB. 🤔
(sorry was in interview, catching up) Yeah buildkit does actually attempt to handle this sort of but it has had multiple problems in the past so I'm not at the point where I fully trust it. There are also cases where dagger might want the cache to be invalidated. E.g. if we change something about the shim behavior, the shim is not (currently) part of the cache key for execs so then execs won't actually re-run if you upgrade and that could cause problems
There are better solutions than just getting rid of the previous cache, but it was simplest to start there
Perhaps it’s worth investigating some more
It used to be a minor problem because we were distributing the buildkit image which is rarely updated. We were not wiping the cache when the engine was updated, only when a new version of buildkit was released
Whereas now we’re using our own engine image which is released often and makes the problem worse
@wet mason @tepid nova one option re: lint-like commands that print output - with bubbletea you can print to an 'alternate screen' that disappears when it's done, so i think that would let us have both?
I think the unknown is on the DX side, like, “this is the command you want”
Like for a linter the output of from nodejs / yarn install that led to the setup of the linter are not the “meaty” part
yeah, i'm thinking all that command output could render in a progress UI on the alternate screen, and then the user's fmt.Println(lintStdout) would go to the regular stdout
not sure if your demo already did that, i remember it taking over the screen but that might be orthogonal to the alt screen: https://pkg.go.dev/github.com/charmbracelet/bubbletea#WithAltScreen
Oh! You mean Println from the pipeline code itself, not from a container. Makes sense!
yeah exactly
Yep yep
Actually I was thinking and had on my to do list: you know how the TUI was showing hierarchical pipelines?
What if the “root node” at the top is the pipeline itself (e.g. go run or dagger run). And the output of that when you look at that node is the stdout of the pipeline code?
Like:
- dagger run foo
- sdk/node/lint
- yarn lint
- sdk/node/lint
If you focus on the root node, you’d only see the output of your pipeline code. Not mixed together with container logs
/cc @tepid nova ?
yup yup, did something similar with bass - but i think you'd still want to support regular stdout, for unix piping or redirecting etc etc
and being able to see output after the pipeline finishes
Yep I was using the alt screen
I was thinking:
- alt screen is the demo I did
- when finished — regular screen would show up a summary
That way while you run the pipeline you have the complex interactive screen
But if you come to the terminal later, you’d only see “12 pipelines executed without error in 48s”. Or like, the full error logs of the one task that failed
Oh absolutely. Yeah that makes a ton of sense
I think it’s more important than the summary bit
Ok you’ve convinced me
So basically:
- while running, it’s an alt screen with complex ui
- among pipeline status you can see your program’s output
- upon completion, you see the exact, raw output of your code
That feels good. As a user you’re free to handle your code’s output as you see fit. However, while it’s computing, you get this “Heads Up Display” that augments your output with useful information. It goes away on exit, you only see what you meant the output to be
yeah, you could also potentially have it decide between summary or raw output based on whether stdout is a tty, but i think you'd want the raw output to not be truncated in the summary either way (just so they don't have to redirect)
and/or show summary if there was an error
Yeah, neat. That would make the experience so much better already without any new concepts
Also I like the simplicity of it
“How do I display extra information to the user?” —> Printf
On the telemetry side, we could also send the code’s stdout/stderr
does that rely on having an entrypoint?
to capture fmt.Print etc
seems like youd need to capture it with a reexec/fork or running it in dagger (dagger-in-dagger)\
I’m thinking (low confidence) that those features (terminal UI, capturing stdout etc) only work if you wrap the execution with the dagger CLI, not if you’re “just” using the SDK
Weird if a library starts taking over the terminal etc
Is the function marshalValue in sdk/go/internal/querybuilder/marshal.go concurrent safe ? I have the TestGit test failing only when running in parallel of all the Go SDK tests
Fake alert 😇 Human error 👼
I resurrected the POC "terminal UI" code, adapted it to the new Pipeline API and ran it through our live CI!
It "works", but there's some very large issues to fix (e.g. with our CI, there's too many tasks and it doesn't support pagination so it's not really usable)
@civic yacht would you be around to give some context on https://github.com/dagger/dagger/blob/7baee302215f5623de0205459d5a48e9a47da227/sdk/go/server.go#L498-L506, and how to test it / debug it ?
That's for the extension code. Are you getting an error with the new enum support? If so send it to me, I can look quickly. But if it's beyond a trivial fix we can just remove that code for now and resurrect it later, not worth the effort at the moment
@tawny flicker did my comment on secrets help you get unstuck?
Yes, I just read it and I think I got it.
So I don't need access to a secret store from the shim, I just read from the env with os.Getenv() and I should be good, right?
os.Getenv() or os.Readfile
Right
depending whether they were passed as ENV or mounted as file
Perfect, easy peasy
Basically in the shim you're "just like the user" -- so you have to act accordingly
Thanks
np
Some news, I did'nt got accepted to devoxx fr, how about you @tawny flicker ?
Thanks again for your awesome contributions ! They got merged today 🎉
I saw that ! Thank you, some good news finally
Is there any objection to supporting embedded dagger engines in the go client?
I keep seeing the download of
github.com/containerd/stargz-snapshotter
Why is it downloaded like 3 or 4 times during a run?
It's like 40MB, so not fun to download 3-4 times, is there some cache issue about it?
So, I got rejected from all my talks to Devoxx UK and MixIT, but, I got accepted at Devoxx FR! I'm quite happy for myself, but I'm bummed for you. Seeing your video makes me think you'd have created a great talk!
Congrats at least we got a dagger talk !
In the go SDK I want to write libraries that don't need a full *dagger.Client but instead something much more minimal like a *dagger.Container.
Problem is, some cases I always will want to call From() on the container. I feel like a *dagger.Container type in the function signature does not convey what the function is expecting to receive.
I'm wondering if there could be something like a *dagger.Scratch type that is basically a *dagger.Container but the type is explicitly a brand new container with nothing and is just a placeholder.
Of course I could be over thinking it and it really doesn't matter.
Though I think a dagger.Scratch could indeed be a special type that supports .From() where as *dagger.Container supporting .From() is kind of weird.
well.. it's tricky since there are some things from *dagger.Container that are useful to chain in multiple froms. ie:
foo := client.CacheVolume("mycache")
c := client.Container().From("alpine").WithMountedCache("/mycache", foo).WithMountedSecret("/secret", client.Host().EnvVariable("VERYSECRET").Secret())
c.From("ubuntu").WithExec([]string{"ls", "-la"}).Stdout(ctx)
}
^ in this case VERYSECRET and mycache will be persisted across Froms. Abstracting that to *dagger.Scratch will require copying a lot of those methods as well.
That's fair. Thinking more on it I could also make an interface in my library code called "Scratch" that has a "From" method that returns a container.
Yes, haha. I thought about it as well but I assumed you already considered it.
👋 I'll bump dagger to latest go 1.20 primarily to validate that our tests run ok since there's been a few users that reported some instabilities after bumping to Go 1.20 which we couldn't replicate. LMK in case you have any comments/questions
interesting CI failures 🤔 https://github.com/dagger/dagger/actions/runs/4117088948/jobs/7107962106
yep, just realized it needs to be 1.20.0 since it's assuming go 1.2 🤷♂️
lmao. yaml
@wild zephyr could alternatively quote it like "1.20" so we still get patch updates automatically
yep
God, I hate YAML sometimes.
my favorite part is the kitchen clock syntax, like 03:15. i had a valid use case for it at one point, but i may have forced the issue
Always
@wild zephyr Looks like you might have triggered it: https://github.com/dagger/dagger/actions/runs/4117218278/jobs/7108251178#step:6:2068
yep, was 👀 at the same
probably a race condition since non-race tests passed. Seems to be random
can't repro locally regardess of how many times I run that test.
I was just trying too and haven't yet, running with -count=1000 and letting it go for a bit
yeah.. re-running in CI to see if it's consistent there.. Next step would be to ssh into the GHA box and fiddle with the test there
k, it passed. So definitely some weird race condition probably
@north jay if I understand correctly, it's having to pass a dagger.Client everywhere that you find too heavyweight?
k it passed So definitely some weird
From a unit testing PoV, yes.
Probably not a good way around it at the moment since I also need cache mounts which requires the client.
Really probably one of those things where it really doesn't matter much.
👋 I'll do an engine release today after merging the PR in this thread. cc @rancid turret@tawny flicker @mellow bolt for the SDK's
Can we make sure to have https://github.com/dagger/dagger.io/pull/1736 merged before ? In case pkg.go.dev gets updated on release triggers
sure, approved! 🚢
ETA for release start ~1h. Last call for any last-minute changes
Starting the release process in 5m
Engine v0.3.12 released and all SDK's bumped accordingly 
Corresponding changelogs can be found in the releases page https://github.com/dagger/dagger/releases
There is a cross-compilation error failing engine publishes in main right now, this will fix it: https://github.com/dagger/dagger/pull/4559
(has no effect on the previous release, this was all merged after it)
@civic yacht I still have a strange error when I try to install Dagger:
package github.com/dagger/dagger/cmd/dagger
imports github.com/dagger/dagger/engine
imports github.com/dagger/dagger/core/schema
imports github.com/dagger/dagger/project
imports github.com/dagger/dagger/router
imports github.com/dagger/dagger/internal/engine
imports github.com/dagger/dagger/engine/remotecache
imports github.com/moby/buildkit/cache/remotecache
imports github.com/moby/buildkit/cache/remotecache/v1
imports github.com/moby/buildkit/worker
imports github.com/moby/buildkit/cache
imports github.com/moby/buildkit/snapshot
imports github.com/containerd/containerd/snapshots/overlay/overlayutils: build constraints exclude all Go files in /Users/tomchauveau/go/pkg/mod/github.com/containerd/containerd@v1.6.14/snapshots/overlay/overlayutils
package github.com/dagger/dagger/cmd/dagger
imports github.com/dagger/dagger/engine
imports github.com/dagger/dagger/core/schema
imports github.com/dagger/dagger/project
imports github.com/dagger/dagger/router
imports github.com/dagger/dagger/internal/engine
imports github.com/dagger/dagger/engine/remotecache
imports github.com/moby/buildkit/cache/remotecache
imports github.com/moby/buildkit/cache/remotecache/v1
imports github.com/moby/buildkit/worker
imports github.com/moby/buildkit/cache
imports github.com/moby/buildkit/snapshot
imports github.com/moby/buildkit/util/overlay: build constraints exclude all Go files in /Users/tomchauveau/go/pkg/mod/github.com/moby/buildkit@v0.11.1/util/overlay
Same with ./hack/make engine:build
Yep I am fixing that now too. It's a problem on macos only, there was a new import in the CLI that added a transitive dep that doesn't build on darwin
Snif 😦
Weird the CI hasn't spotted that
Yeah the fix is simple, I'm adding a workflow that will catch this too. The problem is that in PRs we only ever built for the host platform (to save time). My last PR fixed a problem with the engine image but now have to do the same for the CLI
I got it, may we shall run engine-race-detection on both ubuntu and mac too? With a matrix that should not take that much time
No that'll take too long, the macos runners are too slow, I'm just adding a step that builds the cli for all our supported os/arch combos for now
Okay 
This should fix it: https://github.com/dagger/dagger/pull/4562
I check rn
Approved, merge whenever you want to 😄
Thanks!
just noticed we're leaking goroutines in the SDK, in my testing it's 2 leaked per call to c.Container().From(...).WithEnvVariable(...).WithExec(...).Stdout(ctx)
seems to be leaked HTTP connections, maybe we're not closing the response body? 🤔 edit: we're closing it
going hunting
sidebar: i don't know how i went this long without discovering GODEBUG=tracebackancestors=999 but it's super helpful in cases like this. (it shows who created a goroutine in the stack)
Am I the only one who use some http.Get on a local webserver to trace some behaviour in the engine/shim/container?
not sure I follow Tanguy 👀
I was trying to trace wether my secret scrubber was working or not once integrated in the shim, but logging to the shim sometimes didn't seem to output anything or even would break things (I might mix it up with the provisionning, though).
So in the end, instead of logging to stdout/stderr, I just http.Get to a local ip webserver that the shim, the engine or the container have access to, and I can see what happens in order.
I just unblocked myself doing that.
@tawny flicker one annoyance I've run into is the Dagger client in the tests doesn't flush its logs before the test fails, so pretty often they're cut off. maybe that's what's happening?
Well, it was manual testing
There were different reason for failure:
- first, I was running with the wrong engine
- It seems my new run didn't use the new shim, but I'm convinced it was because of the cache of the CI code. But after dumping the engine + the volume, the new shim got executed
- now that our cache show the previous output, I forgot my CI code didn't change and I thought it was executed with the new shim (I didn't use a cache buster at that time) (even though I could have checked the "CACHED" log print.
- the real bug was: I modified the amount of character written and outputed the newest version with its len. The
cmd.Rundidn't like that I didn't write the amount it sent to me ^^' (as one should). But the laconicshort writeerror brought me into different rabbit holes and fixes (mutex on the writer, …)
anyway, my setup works ok now. But I'm gonna use a cache buster this time :p
oh, I see. I just make the shim append to a file whatever custom thing I want to see outside stdout/err
And how do you get the content of the file? On the dagger engine /var/lib/dagger fs?
Right now when the engine shutsdown it's easy for some remaining logs to get dropped. Hence this silly sleep hack in the nodejs engine wrapper: https://github.com/dagger/cloak/blob/28940446...
@wet mason ran into a codegen limitation, which we actually hit before but worked around, but now I'm wondering if we should fix: it's not possible to have foos: [Foo!]! where Foo is e.g. type Foo { a: Int!, b: String! } because you get this error on x.Foos(ctx):
input:1: Field "exposedPorts" of type "[Port!]!" must have a sub selection.
we previously ran into this with envVariables: [EnvVariable!]! but wrote the test with raw GraphQL to work around it. I think looking up exposed ports might be a more common thing to do though, so it'd feel bad to punt again.
i think we could get away with just selecting all of the fields, at least in these two cases, but I have a feeling the rabbit hole goes a little deeper still; the returned type is a slice of objects which have their own methods for querying fields, like (*Foo).A(ctx). but I don't see how that would work when the foo in question is, say, the 12th object returned from the array. seems like we would instead want "dumb" structs with all their fields pre-populated or something. not sure if what I'm saying makes any sense. 🤔
oh, i found an issue for this already which I commented on with a related point 😅 https://github.com/dagger/dagger/issues/3457#issuecomment-1397574645
I believe @fair ermine was looking for a solution for this
It's tricky because of the reasons you pointed out
Yeah it's really tricky to do :/
Also because the same object (e.g. Env) can be returned pre-filled (e.g. envVariables) or "on demand" (e.g. envVariable)
.EnvVariables[0].Name
vs
EnvVariable("FOO").Name(ctx)
Yeah and the issue is that you don't parse it the same way in GraphQL, so you have to deal with type conversion and stuff to handle it...
Since it's not possible to have
type EnvVariable struct {
Name string
Value string
}
func (e EnvVariable) Name() string {
return e.Name
}
func (e EnvVariable) Value() string {
return e.Value
}
Because EnvVariable cannot have both public field and method with the same name.
And if you make it private, you cannot retrieve data in JSON because the parser cannot access private field to insert data in...
yeah, definitely seems tricky! hmm
So you have to create a intermediate type to retrieve data and fill it in EnvVariable and deal with type conversion
It's really tricky for now and I still haven't found an elegant way to do
Parsing shouldn't be a problem though with querybuilder bind
lots of codegen custom code though
@fair ermine Is it an immediate problem though?
The querybuilder bind works, but if you bind a type with private field, values are not retrieved
Because under the hood we are using json.Unmarshal(marshalled, s.bind)
But unmarshal cannot find private field in struct
We could generate an EnvVariable that only contains public fields
And how do you replace the existing binding?
type EnvVariable struct {
name string
value string
q *querybuilder.Selection
c graphql.Client
}
// The environment variable name.
func (r *EnvVariable) Name(ctx context.Context) (string, error) {
if r.name != "" {
return r.name, nil
}
q := r.q.Select("name")
var response string
q = q.Bind(&response)
return response, q.Execute(ctx, r.c)
}
If name become public, how the user can select with .Name(ctx) ?
i wonder if we could annotate certain types as "dummy" or "eager" or something, and they'd always have regular fields instead of the getter methods? that seems like what I would want for EnvVariable, Port, and also FileInfo (from Directory.Stat)
The solution I tried to make was somehow close to what you think
I created a "dummy" temporary type with public field to catch values and then I convert it to EnvVariable by inserting private fields
yeah, sounds the same, I guess I'm proposing to just not even have the methods for these types, so they're just a struct that we unmarshal into
(not sure how this would translate to the other SDKs)
type envVariable struct {
Name string
Value string
}
var response []envVariable
q = q.Bind(&response)
// Then make the conversion
That what I'm working on for now, I need to handle a proper conversion that will be elegant, I got new ideas that I'm gonna try today
Not that it's a long term fix, but this is inconsistent ...
Container.EnvVariable(name) returns a string whereas Host.EnvVariable(name) returns a EnvVariable object
not sure which pattern we should adopt but they should be consistent
Yeah :/
Depending on which road we take for handling dynamic secrets, Host.EnvVariable may in the future become redundant?
@still garnet asking here to avoid a tangent-on-the-tangent in the github thread: re this: https://github.com/dagger/dagger/blob/c48df04f949a0eb94567fdc75a213ff6b7fb6330/core/integration/services_test.go#L1-L11
--> Perhaps a naive question: should execs never be cached when they have one or more service bindings? And would that solve this particular problem?
(separately, I'm fine with an experimental flag to cut scope without breaking the tests)
hm that's a good question, I'll try to think of an example that makes the answer obvious
oh, actually, one example might be databases for tests
we probably want to cache those test runs
so far I haven't really run into a case where I didn't want it cached, fwiw. the only awkward part is we still start/detach the service redundantly.
Assuming we had a simple cache: bool option in the future, we could change its default behavior depending on service bindings (no service: default to true. servide: default to false), but you could still override explicitly either way?
I may have misunderstood the problem - thought that tricky test situation was caused by indesirable caching of a client exec
it's caching that we still want in the end, but the issue for the tests was that calling Container.Directory/Container.File forces the container to evaluate ahead of the time that the test wanted it to
for example, a test that asserts that we start the service when we try to read the contents of a file that came from a container that uses a service. it would result in a false positive because getting the file forces the service start + container evaluation before we can even try to read its contents
it's a kind of confusing thing to explain 😅
in my limited experience using a feature like this, I think caching by default still makes sense, but I can totally see how there may be cases where you don't want it to be cached once services are involved. but the test situation is more of a special case for the tests
Ok 🙂 Thanks for explaining it to me
np!
anyone else run into this sdk/nodejs failure? https://github.com/dagger/dagger/actions/runs/4206164966/jobs/7299271398 - complaining about a difference in whitespace. rebuilding on my machine doesn't show any diff. 🤔
(reran 3 times, same failure)
Pretty sure I've had that issue in the past, yes. Or at least a similar one. I think there are cases where there is a random order hash or similar used. Can't remember if it was in the JS code or in the Go GraphQL code.
here's the diff for my case - it's just whitespace!
--- sdk/nodejs/api/client.gen.ts
+++ sdk/nodejs/api/client.gen.ts
@@ -409,7 +409,6 @@
*/
Udp,
}
-
/**
* The platform config OS and architecture in a Container.
*
@@ -2605,6 +2604,7 @@
/**
* A port exposed by a container.
*/
+
export class Port extends BaseClient {
/**
* The port description.
super weird
our actions runs don't have any magic caching right?
nope, AFAIK. @still garnet I was trying to run your PR tests locally and getting some what it seems to be ipv6 errors
#13 DONE 0.4s
INFO[0005] detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf
Error: docker run: exit status 125: invalid argument "fe80::1%2" for "--dns" flag: fe80::1%2 is not an ip address
See 'docker run --help'.
exit status 1
marcos:Projects/dagger (services) (⎈ |N/A)$
oo good to know
@wild zephyr would you mind sharing your resolv.conf?
...assuming there's nothing sensitive in there
Not yet
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.
nameserver 127.0.0.53
options edns0 trust-ad
search .
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.
nameserver 192.168.1.1
nameserver fe80::1%2
search .
i guess it's not happy with the "zone" part of the ipv6 address (%2), i'll confess i have no idea what the significance of that part is
which is kind of annoying, because the intention is to just pass these values straight through Docker and into the container's resolv.conf, so stripping it off feels weird. seems like an overzealous CLI flag validation 🤔
apparently the zone refers to the network interface to use, either numeric index or name (e.g. eth2) - so carrying it into the container verbatim could be wrong anyway, since it's a different network namespace with different interfaces. hmm
@wild zephyr what does docker run alpine cat /etc/resolv.conf say for you? wondering which values docker carries in by default
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.
nameserver 192.168.1.1
search .
ok cool, looks like it skips ipv6
I think docker run only looks at /etc/resolv.conf
i think it went through the same logic actually
since 192.168.1.1 comes from the systemd config
yeah.. just noticed
maybe because I didn't enable ipv6? I think there's a flag in the engine for that
yep dockerd --ipv6
interesting
@still garnet docker run with ipv6 enabled:
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.
nameserver 192.168.1.1
nameserver fe80::1%2
search .
^ just edited the snippet. Old clipboard buffer 😄
how about docker run alpine ip addr
curious if that zone (%2) even refers to a valid interface
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
68: eth0@if69: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 2001:db8:1::242:ac11:3/64 scope global flags 02
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe11:3/64 scope link tentative
valid_lft forever preferred_lft forever
seems busted since there's no 2: interface, but what do i know. i'm going to just have it skip ipv6 addresses, we'll see if someone complains
thanks for the debugging 🙏
🚀 . LMK when you update the code and I can help you checking the linting thingy 😛
already pushed 🙂
👍 running lints
of course they pass locally 🤔
@still garnet I'm wondering if it could be a race condition due to GHA runners being slow. I've noticed that the lint mage target does yarn lint and lintGeneratedCode in parallel. 🤔
hmm
I've rerun it a bunch of times and it's consistently failed
but I thought I remembered at least one time where it fixed it
can't find any proof of that in the run history though
we found the issue sdk generation issue. Rebase with main was needed.

Hello, I hope this is the right channel for this question. I am looking to contribute to dagger, and I was wondering if there is any specific label for beginner friendly issues. I tried searching for them but couldn't find one :), thank you!
Thank you for considering a contribution! There are a few labels you can look at for inspiration. But this is something we need to get better at:
- “size/small”: https://github.com/dagger/dagger/issues?q=is%3Aopen+is%3Aissue+label%3Asize%2Fsmall (not super up to date)
Asking here is also an excellent way to get ideas 😁
Okay I will try to work on one of these issues. Thank you 🙂
How did you rebuild it yourself? With client-gen? Or ./hack/make sdk:nodejs:generate?
The crux of it is client-gen will generate right out of Go. But the code might have a little too many whitespace sometimes, and the biggest part is the split of a function when the line is too long.
./hack/make sdk:nodejs:generate will make it OK as it will pass that generated code through a yarn lint --fix that will make it OK.
You need to commit the generated code in your PR as well.
I guess we need to document that and/or add some CI part that will check if that part of the code needs to be regenerated is actually done. (I guess any modification in the GraphQL schema in a PR should check if the SDK code has been submitted/committed as well.)
the problem was that the sdk templates/etc. changed on main since I last merged, but had no conflicts with my PR so I didn't notice it was behind, and GitHub actions runs against a merge of your PR into main, so it was hard to notice where the delta was coming from. in hindsight it's obvious though
Do you see a way we can avoid that in the future?
@civic yacht Do you have any time to chat today? I've got some interesting info from some runs with the new caching code... and I'd love to chat with you about it.
Yep I'm good 12 your time or later
1p MT/12p PT?
Sounds good!
@civic yacht @wet mason whenever y'all find time, https://github.com/dagger/dagger/pull/4505 is good for a technical review now that the API and feature set is stable and the tests are no longer flaky
Started looking a little bit ago!
it's a big one 🙂 so no rush
@civic yacht I'm looking into updating the engine setup docs for services, which I think only requires documenting the resolv.conf changes. but I think I can hide that complexity by pushing it into the entrypoint. seems possible by 1) copying the contents of the current file, 2) unmounting it, and 3) writing it back as a regular file with our DNS server + search domain added. the only catch is it needs --privileged, which we already have/need. does that sound crazy? 🤔 main advantage is we don't need to pass --dns flags to Docker when running the engine.
i guess that could be an issue if/when we want to support rootless
i'm happy to also just update the doc, only iffy part is that we'd start documenting addresses like 10.87.0.1 (which was literally chosen by me asking "someone give me a number from 0 to 255" and @ancient kettle said 87 😂)
i'll try it out and see if anything blows up
Your message gave me flashbacks to weirdness around docker and having to use fsnotify to synchronize resolv.conf on certain distros like Ubuntu that I had to deal with back in the day (may no longer be relevant), so I was doing some refreshing on stuff before answering you, but yeah if you give it a try and it looks good to you then :shipit:!
ah gotcha, thanks for checking. that does raise an interesting point, if someone changes their DNS config on the host I suppose it wouldn't propagate to the dagger engine container anymore, now that it isn't a bind-mount
pushed! all tests pass, seems OK
@civic yacht also: I removed dumb-init - it was only needed before because the CNI plugin kept killing and restarting dnsmasq, but I removed that. I could keep dumb-init, but I figured we might want to notice zombie leaks anyway
Yeah it's hard to piece together exactly what the behavior is with docker since it depends on the version, whether the host is running systemd-resolved, the various knobs of docker container config, etc. It seems like in some situations there is synchronization happening (which the umount might break depending on how docker implemented the sync) but in others that doesn't work anyways: https://github.com/docker/for-linux/issues/889
So in conclusion 🤷 . May just have to ask for a little more beta testing on different platforms before next release than usual. LGTM for now!
Ah cool, yeah all else being equal less stuff in the image is better, so sounds good!
is it time to merge https://github.com/dagger/dagger/pull/4505 then? 👀
soft-pinging @wet mason in case you've been meaning to take a look
Took a quick look last week, I can review post-merge no need to hold up
sounds good - merged! 
Can’t wait to play with this, @still garnet!!
Thank you, 
Can't wait to play with it!
Can't wait to have your feedback on the feature Alvise 😇
next community call idea: compose adapter for Dagger services 🤯
ooh that sounds really fun
think i found a fix for the engine not flushing progress logs -- at least in the Go SDK: https://github.com/dagger/dagger/pull/4615/commits/d737b5f8f1911177815331f35d3b5307ebab0643
tl;dr we were closing stdin but then just letting the session command die via the Pdeathsig: SIGKILL flag. so this commit makes client.Close() wait on the session command after closing stdin.
i removed the go proc.Wait() - I think it should be redundant since folks should already be calling Close() to release resources (I'd consider a zombie process to be a resource). (cc @civic yacht as the go proc.Wait() was a recent fix for leaking zombies)
not sure if we need a similar fix for the Python/NodeJS SDKs (cc @rancid turret @mellow bolt)
I stand by what I said on this one, by the way. 87 is fine, upstanding member of the uint8 community.
it has proven to be a reliable address range 
😄
Hi, is there breaking changes in github.com/dagger/engine@v0.3.13?
Since I upgrade from v0.3.12, network fetch become very slow.
I found a new search dns.dagger in /etc/resolv.conf.
What about this? could I disable it?
Sorry about that, we introduced a major new feature (service API and container-to-container networking) and a regression slipped through. We will fix asap.
sorry! simplest thing for now is to just make this feature opt-in, so I'll have a PR up for that shortly
@civic yacht thanks for the review, I added the GraphQL schema docs but I'm hard-blocked on Docker Hub rate limits 😭 would you mind running sdk:all:generate for https://github.com/dagger/dagger/pull/4657?
(or anyone else that sees this)
sure will do!
not sure why I'm hitting this rate limit, I thought it would use the Docker credential helper, and supposedly I'm not even at my limit for unauthenticated requests 🤔 guess I'll investigate since I'm blocked otherwise
oh, rip, I hit my authenticated limit, not my anonymous one. that's fun. i guess i'll...log out?
ironically my first attempt resulted in Head "https://registry-1.docker.io/v2/library/golang/manifests/1.20.0-alpine": dial tcp: lookup registry-1.docker.io on 10.0.2.3:53: read udp 172.17.0.3:46260->10.0.2.3:53: i/o timeout
But removing my existing dagger-engine.dev and other engine seems to have fixed that, it's running now I think
lol,
. there's definitely a theme for the day. Google also had an outage: https://status.cloud.google.com/incidents/LnvJwfYu3TCyUrcrP7yf#RP1d9aZLNFZEJmTBk8e1
pushed a commit with the updated sdks
merging, also have a follow-up that might fix the network issues, but hard to be sure without a repro. will open a separate PR
follow-up PR: https://github.com/dagger/dagger/pull/4658 - @civic yacht this should also allow changes to propagate to resolv.conf since it relocates the bind mount (which is, apparently, a thing that is possible)
that's re: this concern from a while back
@wild zephyr do you know of any content there for running Dagger pipelines and/or the engine in K8s?
if anyone has ever lied awake at night wondering why kubernetes sets options ndots:5 in /etc/resolv.conf, here you go: https://github.com/kubernetes/kubernetes/issues/33554#issuecomment-266251056
woah. didn't know about this before: https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
customize
ndotswithdnsConfig
yup yup, was just looking into it to see if it makes sense to preserve these options for our engine and found a bunch of posts about it
i think it makes sense to leave alone, so people can reach services/etc. from their dagger pipelines with the normal semantics
right, makes sense. I wonder if the "fully qualified end dot" approach works as advertised...
If you have few static external names...adding a
.at the end.
e.g.google.com.vsgoogle.com
yeah, i've never actually seen that used. like what does that mean for config params, http://example.com./foo?
works 😁
curl https://dagger.io./index.html
haha just tried it, doesn't work in my browser but who knows what's going on there
@stray heron we should probably ship v0.3.14 now that https://github.com/dagger/dagger/pull/4657 is merged, plus https://github.com/dagger/dagger/pull/4658 which is an unverified fix that folks experiencing issues can try by opting-in
I can do the release tomorrow, just dropping a ping in case you're interested in doing it in your timezone or want to use it as an opportunity to test any release process changes you've made 🙂
(trying Discord's new @silent feature to avoid a 1:23AM ping, hopefully it works)
cc @cosmic cove @wet mason
/usr/local/bin/dagger-entrypoint.sh: line 31: can't create /etc/dagger/engine.toml: Read-only file system
mounted configMap is not writable.
@flat lagoon thanks so much for testing it, definitely didn't think of that 😅 working on a fix!
It just for user who have to set registry mirrors.
I try to disable without set _EXPERIMENTAL_DAGGER_SERVICES_DNS=1.
Then it still not work well.
what happens? do you have any error output to share? or is everything just slow?
would you mind sharing /etc/resolv.conf?
It is a cache issue, I pruned with buildctl, it works again.
Here reuse the accident scene of dagger/engine@v0.3.13
@still garnet once you fixed /etc/dagger/engine.toml writable issue,
I can continue to test with _EXPERIMENTAL_DAGGER_SERVICES_DNS=1
just confirming, is the issue still happening with it unset? or is it OK?
Nope, it works well as "v0.3.12"
ok great
will ping when I have a new image!
@flat lagoon ok, you can try sha256:a184e9a44df789ee493af53d9cce531d2cd079e2e9dedc06129306cf60b344d7 (aka vito/dagger-engine:dev2)
here's the code change: https://github.com/dagger/dagger/pull/4666
could not re-define exist inline table or its sub-table : worker.oci
It could not just append to toml file.
people may a full buildkit.toml
oh, can you remove it from your configmap?
I already did it
i was just wondering about this, we probably want user-provided config to be additive
For me, I use the engine shared cross projects,
so i need to increase the gc and max-parallelism settings
https://github.com/moby/buildkit/blob/master/docs/buildkitd.toml.md
oh i see, so you had other config values set under [worker.oci]
hmm, tricky
might be time to abandon bash 🤔
@still garnet when set _EXPERIMENTAL_DAGGER_SERVICES_DNS=1. network issue back
Connection failed [IP: 123.125.16.219 80]
May just load & patch and re-write by golang.
yeah something like that
# cat /etc/resolv.conf
# dagger dnsmasq server
nameserver 10.87.0.1
search gitlab.svc.cluster.local. svc.cluster.local. cluster.local. openstacklocal
options ndots:5
hm, it doesn't seem DNS related at this point
/ # env | grep _EXPERIMENTAL_DAGGER_SERVICES_DNS
_EXPERIMENTAL_DAGGER_SERVICES_DNS=1
so it's saying it can't reach 123.125.16.219:80? maybe this has to do with the bridge network
each container has its own IP address now, maybe it's unable to navigate the network due to a firewall or something? not sure, just guessing
It is just some CDN IP address.
I have been thinking for a while on how to execute a native SDK lang function directly as a step in the pipeline, and I can't find some standard way to execute this without going down to CLI (that should be built by the step), or have some kind of marshalling of sort.
The idea would be to transparently do this (example in Go)
client.Container().From("alpine").
WithFunction(func() error {
// whatever Go logic I want here
return nil
}).ExitCode(ctx)
Or maybe, it's just something that would come naturally with the extensions and the marshalling/query would be GraphQL? WDYT?
cc @wet mason @tepid nova @civic yacht @still garnet
Have you seen this issue? https://github.com/dagger/dagger/issues/3885
Yes, it's more to compose with existing dagger API. But I don't see how to execute actual Go code inside a pipeline (for example, a user yesterday was limited to get a directory files on 1 level, no recursitivity was allowed by the API. So then, he was limited to do a WithExec([]string{ "find", "." }) and possibly parse/filter then.
I was thinking more of a
WithCall(func() error {
filepath.Walk("/", func(path string, info fs.FileInfo, err error) error {
if err != nil {
return err
}
if info.Name() != "myfile" {
continue
}
// do some action on the file
return nil
})
})
Yeah, that sort of thing was requested by a user a few months ago.
Apart from running that go code in a container (and without exporting the dir to a temporary location), at a glance I don't see how it can be done.
I get each WithCall would be with the same language of the SDK. Or we would have some
WithGoCall(), WithPythonCall(). And each would spin some language container to build/execute the code in the right runtime and return afterwards.
Yep
Problem is you need to put that code in a string to inject in the container.
I can do it in Python because it's very dynamic.
But it's a reach, I think.
But then, if we have the extensions, then the current SDK API would be extended with the extension API and whatever input output would be marshalled by graphql, and who cares about the language. So maybe it's kind of a not so interesting topic since extensions will come eventually.
Yeah, and also, what if your code use some imports, how do you transfer that info to the runtime container if your function is from this file.
Yeah, I guess it's too tricky.
Exactly.
Similar question here → https://discord.com/channels/707636530424053791/1042482671844085882
Similar use case as well: "Just looping through some files and checking the content for example. In general executing some repository or environment dependent logic."
Any ETA when extensions would be revived?
No ETA yet but there is a thought by @still garnet and @civic yacht that solving the relatively short term problem of project entrypoints (ISSUE) can be done in a way that we get extensions too. So that’s very interesting to me, getting there in pragmatic incremental steps
@final walrus do you have a specific reason for asking? Idea or deadline?
@tepid nova idea: for Pulumiverse, the OSS community around Pulumi packages (https://github.com/pulumiverse/), I would like to take the CI workflow maintenance away from the package authors. An extension abstracting the whole build setup would be awesome.
@fair ermine Hey -- you pinged me for a PR on Discord but I can't find the message anymore (too many channels syndrome). It's the list return right?
Extends Go template capabilities.
Store scalar type private field to return it if it's set by a list execution.
Add test on list of object.
Add Schema setter and getter to fetch global GQL
type...
I'll improve the description of the PR based on Helder's comment tho
@obsidian rover https://gist.github.com/shykes/7f4b81ae5053d223ea51306d959d4b6d
@civic yacht your privileged exec feature has immediately come in handy: https://gist.github.com/vito/c5f34f2a8f6c9f5993fa60ec43d9c237
needed to find a demo of using services for a test suite and remembered concourse's web ui tests do that, and then needed privileged exec to run the worker
Awesome! Yeah I think there's a lot of use cases that call for both. Another I was thinking about is that it's now possible to fully run a dagger engine actually in dagger, which might be a path to somewhat simplifying the tangled web of our self-referential CI
that'd be nice! bass's tests do something similar, but with buildkit instead of a dagger engine: https://github.com/vito/bass/blob/main/bass/bass.bass#L173-L182
possibly controversial API proposal: https://github.com/dagger/dagger/pull/4723
tl;dr I think _, err := container.ExitCode(ctx) is a massive footgun; it shows up all over the place and when https://github.com/dagger/dagger/issues/3192 is fixed these will all turn into silently discarded failures
it seems worth adding a convenience for "run this container and error if it fails" since it'll come up in nearly every pipeline, so I added err := container.Run(ctx) with a bit of codegen magic to handle a new Void scalar type in the schema
cc @tepid nova
Love it!
I'm all for adding .Run but I never thought we would change the behavior to make exiting non-zero a non-error when 3192 is fixed, I always imagined we'd add a field like AllowFailure, in which case non-zero exit code doesn't result in an error. It's honestly pretty easy to do at this point because of the plumbing we added to be able to get stdout+stderr in the error code. But even still Run will be good to have in place too
hm, that feels a little non-obvious to me - like an artifact of the current progression (ExitCode being what everyone calls for lack of a more obvious alternative => improving it by having it include stderr). I don't think it's something we would have designed intentionally. also worth noting the same issue affects Stdout/Stderr
for example i think calling Run followed by ExitCode/Stdout/Stderr to collect metadata about the failure would be cleaner without having to pass AllowFailure along to all three
i guess it is cleaner to be able to skip Run if your intention is to collect the output (e.g. for parsing) and short-circuit on failure though
that's probably the lesser evil to Stdout silently returning "" on failure 🤔
i think it's at least true that during the initial API design we wanted them to just return whatever the value was (hence the big commented out test for ExitCode on nonzero exit), but yeah, it might be safer to just have it error everywhere
It's subjective but seems like a pattern common in my experience (go exec.Cmd returns non-nil error when exit code is not 0, I think the python lib I used threw an exception in that case, etc.). I agree that having AllowFailure all over would probably be gross.
In go at least, another option would be to just set both non-nil error and the actual other return value (exit code, stdout, stderr). I don't know how that'd work in python with exceptions (genuinely, maybe that's totally normal)
maybe the low level api should just provide all the info (error message and whatever was requested of exit code, stdout, stderr) no matter what and it could be a per-sdk decision on how to most idiomatically handle it
I'm good with just adding AllowFailure 👍
much less dangerous for hastily written things that need to e.g. grab output of a command and parse it
Well this is all theoretical discussion because I don't have time to work on it immediately 😄 and either way having Run is good since it's just more obvious than ExitCode when you don't care about any output and just want side-effects
Not to mention Void, which is far superior than just returning a meaningless bool like we did in the past
I'll just capture the outcome of this convo on the issue + PR for now (if it's not already, haven't skimmed 3192 in a bit)
Reading through this discussion, one potentially confusing part thing I imagine seeing a lot of would be things like chain.chain.chain.Stdout().Run() or .<other leaf node>.Run() since people already get confused with the lazy evaluations on other leaf nodes
not sure what you mean - wouldn't stdout().run() be invalid since stdout() returns a string?
or do you mean generally people sprinkling .run() everywhere to try to un-lazy it?
side tangent: I'm working on a host => container networking PR, which leverages .Run() to also publish ports. felt more natural than doing it on ExitCode/Stdout/Stderr
This. It's probably just a matter of docs/examples, but my thought was that people would think this is "the way to execute a pipeline"
https://github.com/dagger/dagger/pull/4738 this was much easier than I thought it would be 🤯 (modulo API bikeshedding)
I think that intuition is pretty correct though - at some point you'll need to force your pipeline to run, right now the method is calling ExitCode which feels a bit roundabout imo
but yeah, will have to watch out for any "sprinkling" behavior/misunderstandings
it helps that it returns 'void' so it can only possibly be used at the end of a chain
yup that makes sense. I guess I'm not super worried about it since most people that would potentially trip on that will only be using .Run() and not the other nodes anyway. They may still trip on it with things like Publish() but we just have to document it well
@still garnet how is a hostname determined (looking at endpoint specifically)?
it's a hash of the container's vertex digest
well, a hash of the digest just prior to setting the hostname, since there's a chicken-egg problem there
@civic yacht might be onto something with the iptables: Resource temporarily unavailable issue 🤔 https://github.com/dagger/dagger/pull/4711/commits/2f17c1e7f4e273f3d13dc080e9b0dcafe5a18880 (it's pretty stupid simple in hindsight)
can't confirm though, since I'd blow through my Docker Hub rate limit real fast. 😅 hoping Actions just passes or fails immediately, since it failed in just one try last time.
side note: I'm tempted to have our tests use a registry mirror or something soon; I've had occasional days where I end up blocked by the combined authed+unauth'd rate limit (300).
oh neat! I didn't know it was just a lockfile underneath, I figured there was some lock internal to the kernel or something
yeah same 😅
side note: I'm tempted to have our tests use a registry mirror or something soon; I've had occasional days where I end up blocked by the combined authed+unauth'd rate limit (300).
yeah we should switch to gcr or something else without limits. for the dev engine, we could just put the registry mirror config in engine.toml
two runs succeeded so far, which is better than before. first run failed on a seemingly unrelated 10m test timeout. github apparently caps you at 2 single-job reruns so now i'm rerunning all jobs (🤷♂️) (edit: just a weird UI thing, you can still rerun by clicking the ♻️ icon in the sidebar)
hit the test timeout again. We're brushing against the default 10 minute timeout now; here's one that passed in 8.75 mins: https://github.com/dagger/dagger/actions/runs/4378197520/jobs/7662629715#step:5:6437
I figured this might happen with all the new tests for services. I like that I can trust our tests, but it's a shame how slow they're getting. At least it's a decent benchmark and incentive for performance improvements.
edit: hm, 4/7 runs have timed out now, maybe it is my PR? has anyone else hit this?
I got some PR got a lot of tests failing with timeout. It's not very reliable.
just noticed all 27 services tests aren't using t.Parallel(), so that may be some low-hanging fruit
yup, down to 6.3-7.3 minutes now 🎉
Also, I was thinking that we need to make the image version homogeneous.
Right now, we have:
"alpine"
"alpine:3.15.6"
"alpine:3.16"
"alpine:3.16.1"
"alpine:3.17"
"alpine:3.17.2"
"alpine:latest"
Same with
"golang"
"golang:1.18.2-alpine"
"golang:1.18-alpine"
"golang:1.19"
"golang:1.19-alpine"
"golang:1.20.0-alpine"
"golang:latest"
I bet we can shave off some image layer download + apk download thanks to all using the same images.
yeah, we should really pull these out into consts or something
those moments when you're troubleshooting a bug and then you find a new one.. cc @still garnet
@wild zephyr found @civic yacht 's last PR to fix a similar docker socket forwarding issue: https://github.com/moby/buildkit/pull/3431 (but this issue was an EOF, not a hang)
Dynamic Secrets PR looks pretty close ...
Was the group API / pipeline stuff ever demo'd or documented?
The snippets on the homepage seem to show code I can't write myself:
import Client, {connect} from "@dagger.io/dagger"
connect(async (client: Client) => {
const ctr = client
.pipeline("test")
.container()
.from("alpine")
.withExec(["apk", "add", "curl"])
.withExec(["curl", "https://dagger.io"])
const result = await ctr.stdout()
console.log(result.substring(0, 300))
})
Hey 👋 ! Which version of dagger are you using ?
Do you have issues with .pipeline() ?
great 🙂
Was this ever released?
https://discord.com/channels/707636530424053791/1076019767699316776
Not yet, but there is a stale PR on it: https://github.com/dagger/dagger/pull/4522
could use a ✅ for this: https://github.com/dagger/dagger/pull/4764
How do you have this footage of me?
I already prepared a demo of it using gcp secrets manager.
Can I see it, @tawny flicker ? When will it be tagged and released?
I can't confirm nor deny the release date.
If you want, I can demo it to you in the #911305510882513037
Otherwise, the code is here: https://github.com/dolanor/demos/blob/main/secret/gcp/main.go
Awesome. Thanks for the code. Nice and simple API
Hopefully the secret obfustication fix for white space rolls out at the same time
That would be all my blockers gone 😀
That's my next priority 🙂
In the bikeshedding process we trust 😛
👋 I have two esoteric P1 engine assigned issues that I probably won't be able to tackle this week since I'm focused in something else now. Just sharing in case some adventurous traveler decides to pursue that journey.
https://github.com/dagger/dagger/issues/4757
https://github.com/dagger/dagger/issues/4673
cc @cosmic cove @stray torrent
@fair ermine can you take these? ☝️
I can take the #4757 for now 🚀
Hey @still garnet @civic yacht @wet mason
If one of you have 10 minutes free today, could you review the PR https://github.com/dagger/dagger/pull/4683 please?
It's the addition of the Go SDK to fetch list of object and partially solve https://github.com/dagger/dagger/issues/3457
will take a look! sorry for the radio silence 🙏
Had a thought about building libraries for dagger: I'd love to open source some tools I'm building; I do a fair bit of work with monorepos, and so have built patterns for resolving cross-service dependencies and mounting only the deps you need for build.
However, what would be really awesome would be if there was a way to generate these utilities, same way the dagger client itself is generated. Is there some way to take my code, convert to the GQL query itself, and then render out a new client for it somehow? 🤔 that way, I can write my utils in whatever language (I have some in TS, and some in Go), but benefit the entire Dagger community (it can also help us internally if we want to explore porting our dagger code to another lang). (I figured engine-dev was the closest I'd get to codegen)
This is what we’re planning for extensions 😁
Awesome! Got an issue I can follow?
I can't find any that would describe our new version of the extensions. Everything else was more oriented towards the CUE extensions or early idea of the new version, but things might have changed drastically.
@wet mason doing some work on the dagger TUI on the side, do you mind if I switch it to just use the standard ANSI 16-color palette? nerds like me like to have my entire OS using one color scheme
Yep go for it
I'm not really an artist 😄 I slapped some colors together
Would be good at some point to get Camille involved in that
yeah definitely
fixed up the colors, now investigating strange rendering issues related to the log output. I think we need to use a vterm so special characters in the logs don't disrupt the UI. had to do the same for Bass. even a simple \t throws a wrench in the gears, which I don't really understand
Having done a bunch of work on the ttyv2, there is a lot of prediction that can be wrong, based on rune size.
Basically the correct way to calculate size is to use grapheme cluster size.
In the end, it gets really tricky.
Also when displaying logs with rewritten output like secrets
Write() will tell you it wrote X bytes, but in reality, it could be anything displayed afterwards because of the scrubbed secrets etc. Which then will make your output wrap lines weirdly
I noticed this this using the docker cli: https://github.com/dagger/dagger/blob/0100316a0b53c307a1311c29006529ec4f026660/internal/engine/docker.go#L22
Curious why it's not using the API directly?
FYI I'm reverting the "how it works" diagram. Please consult me in the future before changing the fundamentals of how we explain our product.
@civic yacht Hey 😄
Could you review this PR when you have 5 minutes please? It's about the log duplication issue
https://github.com/dagger/dagger/pull/4789
If this PR is merged, logs will no longer be duplicated on multi-stage export:
Example of output with changes applied
#1 resolve image config for docker.io/library/alpine:latest
#1 DONE 2.0s
#1 re...
@civic yacht @still garnet I'm working on spinning up a dev dagger engine as a service inside of stable dagger engine. Does it automatically listen on a particular TCP port? Or do I need to tell it to listen via config? And, if it does listen... which port is that? I'm not finding anything obvious (although it's possibly listening on :1234...)
You need to tell it to via either the config file or via a command line flag, let me find it...
- --addr
- tcp://0.0.0.0:1234
Yep that's it, you can also run go run ./cmd/engine --help from the repo to see all the flags available (just for future reference)
Thanks, @civic yacht!
@ancient kettle i'll be curious to see how this goes, might need to do something similar to test AppArmor
Me too! So far... it's failing. 🙂
i believe in you
@ancient kettle same way you would with any other container, you can call Stdout/Stderr on it. but that might not be very helpful depending on the situation (i.e. if it's not exiting)
you could try running it with dagger-ui 😛
Inside of the service?
wrapping whatever command is running the service - e,g, dagger-ui <cmd>
(you'd need to install dagger-ui from the PR: https://github.com/dagger/dagger/pull/4522)
Ok. So, I'm running a mage command (go run main.go -w $DAGGER_SRC_ROOT engine:test), so basically:
- build
dagger-ui - run
dagger-ui go run main.go -w $DAGGER_SRC_ROOT engine:test
That sounds good. 🙂
I'll try it.
yep, that should work
Here goes nothing. 🙂
oh - shoot, you might need to build the local dagger binary from the dagger-ui branch
since I changed it to use TCP for the journal event stream
hmmm
That would make sense. I'm just getting a blank screen.
try env _EXPERIMENTAL_DAGGER_JOURNAL=/tmp/journal.log dagger-ui - that'll fall back to tailing instead
make sure the journal is empty since it appends by default
@ancient kettle any luck?
Getting there! I had the config wrong... and running it as a service was swallowing errors. I started running it as a straight exec, and I think I've finally got it, and I'll switch back to a service in a moment.
@still garnet I'm getting some odd DNS errors:
#0 0.073 host alias: lookup FM39PVG9VCP9S.dagger.local on 10.87.0.1:53: no such host
#12 ERROR: process "sh -c echo '{container{from(address:\"alpine\"){withExec(args:[\"echo\", \"hello\"]){stdout}}}}' | ${_EXPERIMENTAL_DAGGER_CLI_BIN} query --debug" did not complete successfully: exit code: 1
Is that because the stable and dev dagger engines are fighting over network space or something similar?
Actually... it kinda looks like it's not waiting for the dev engine to start before running the WithExec that's relying on it.
@ancient kettle yeah, very likely. you might need this: https://github.com/dagger/dagger/pull/4711 - but alternatively you could disable services networking for now
did you configure a WithExposedPort?
re: the second part
Oh, I did not.
yea, need that for the health checking/polling/waiting to start
@still garnet Am I holding this wrong?
I'm setting _EXPERIMENTAL_DAGGER_SERVICES_DNS=0, but then it's not letting me use services at all.
you should only need to set it for the inner Dagger that you're running as a service
Ok, that makes sense.
Hmm... that might get tricky, given that I'm trying to run the Dagger CI against it.
On the plus, that totally works. 🙂
It's a bit of a bummer that a service exec is marked as failed in dagger-ui
@ancient kettle good point, might be able to detect that it's interrupted instead
@civic yacht @still garnet So, I'm pretty sure I'm gonna be blocked on this Dagger CI refactor by https://github.com/dagger/dagger/pull/4711
Because we'll need to be able to run dev dagger engine with services and DNS enabled to run our test suite against it. Am I missing anything?
That said, I was able to run a decent number of tests without it.
I'll go review it now
out of curiosity, what're you working on atm? just missing some broader context. didn't know you were trying to run our full suite against it!
I'm working on a refactor of our mage tasks + tests to utilize services, so we don't have to rely on Docker, which will more easily allow us to use Dagger as a k8s Daemonset, which will allow us to play more easily with self-hosting our runners.
Run-on sentences ftw!
Precisely this. 😄
I'm particularly looking at this in order to have faster docs builds in CI. I assume this will let us have persistent cache mounts right?
Potentially, yes. We have some things to sort out/decide there (mostly around cost, honestly).
@still garnet re: this comment: https://github.com/dagger/dagger/pull/4711#discussion_r1143987469
I think I found the issue, it's just that I forgot to close the clients in those test cases before shutting down the engine, so then the engine blocked on GracefulStop: https://github.com/dagger/dagger/blob/d275122c01bcb585b54aa212ebaee9518d4eb438/cmd/engine/main.go#L354
Do you mind if I push a commit with that to your PR just so I can test it there? Just a few lines of changes to remotecache_test.go
nice glad you found it, the hanging tests was a head-scratcher but I didn't find time to dig in. push away!
tests passed 👍 thanks! starting on the review comments now
interesting, apparmor even has some configuration specifically for the dnsname plugin we've vendored (the original is no longer maintained): https://gitlab.com/apparmor/apparmor/-/blob/bba1a023bf146352f4d4d796fd0a8d1dd1e54fa7/profiles/apparmor.d/usr.sbin.dnsmasq#L110-116
hesitant to rely on that, though, since it's unmaintained and these rules might be removed at some point
@civic yacht https://github.com/dagger/dagger/pull/4711 is ready for another look - I pulled in the AppArmor fixes too since they're very closely related
LGTM!
Thanks for getting that over the line so quickly, @still garnet and @civic yacht! I’ll keep pushing forward on the refactor today.
@civic yacht https://github.com/moby/buildkit/pull/3732 is interesting because that should only be in 0.11
Did dagger update to 0.11 before the 0.4 engine release?
Yeah we've been on 0.11 for a while now, we upgraded during one of the v0.11 rcs
Haha no worries at all, it's one of those things that's almost impossible to notice until you hit it in real life. Plus your reproducer script was invaluable for figuring out what was going on, so I appreciate it!
PR to upgrade the engine to buildkit w/ that fix: https://github.com/dagger/dagger/pull/4817 cc @wild zephyr @still garnet or anyone who can take a quick look
@civic yacht I think go build -o /app/bin/buildctl github.com/moby/buildkit/cmd/buildctl would work too.
Thanks yeah I'll try that, I just found this comment on how the error I'm hitting is pointless too lol https://github.com/golang/go/issues/57485#issuecomment-1419841319
what’s Container.withLabel ?
with comments
So this, it looks like: https://docs.docker.com/config/labels-custom-metadata/
@tepid nova the label metadata that gets injected in the resulting OCI artifact?
same thing as the LABEL dockerfile instruction?
Yes, exactly
Quick question:
When reading the operator manual https://github.com/dagger/dagger/blob/main/core/docs/d7yxc-operator_manual.md
Is "engine" interchangeable with "runner" or does the engine also handle the "session" part?
BTW it's possible I'll ask a big amount of stupid questions today around codegen, etc 😅
runner is the machine running the engine. engine is the software being run.
also we are not 100 % consistent on methodology
Yeah it's reached the point of being pretty confusing now, this came up in a conversation yesterday too. Opened an issue for fixing this here: https://github.com/dagger/dagger/issues/4826
@still garnet I'm trying to run this test against dev engine running in stable engine....
func TestContainerPublish(t *testing.T) {
c, ctx := connect(t)
defer c.Close()
const htpasswd = "john:$2y$05$/iP8ud0Fs8o3NLlElyfVVOp6LesJl3oRLYoc3neArZKWX10OhynSC" //nolint:gosec
registryHost := "registry"
registry := c.Container().From("registry:2").
WithExposedPort(5000, dagger.ContainerWithExposedPortOpts{Protocol: dagger.Tcp}).
WithExec(nil)
pushedRef, err := c.Container().
From("alpine:3.16.2").
WithServiceBinding(registryHost, registry).
Publish(ctx, registryHost+":5000"+"/container-publish:latest")
require.NoError(t, err)
}
And I'm getting this error:
input:1: container.from.withServiceBinding.publish failed to solve: failed to push registry:5000/container-publish:latest: failed to do request: Head "https://registry:5000/v2/container-publish/blobs/sha256:3feafd38d98ccbd96b16d14ef277f30a39258b3dcc6e6c8521a00b13ad068f13": dial tcp: lookup registry on 10.88.0.1:53: no such host
It looks like the service comes up, but the alias isn't getting set?
this won't work unfortunately, for a couple of reasons: a) buildkit's push/pull runs in the context of the engine container, so it doesn't have service aliases wired up, and b) pushing/pulling to/from a registry that isn't localhost or 127.0.0.1 will require TLS, even if it were to be magically routed to the right service
Ok... :/
we've talked about supporting a) by munging the address, but then we'll run into b), so yeah it's a chunk of work. something like this would help a lot: https://github.com/moby/buildkit/issues/3366
That makes sense. Bummer. So, I guess I might need another way to do these registry tests in our CI.
Since we're running the dev engine as a dagger service, we could run a registry alongside it, that will just leave the TLS problem
but since we are setting up both the engine and registry ourselves I guess we can just include tls certs with the containers too?
BTW, if you can set the DNS name for the registry, you can override the engine / buildkit conf to allow non-tls access to any registry. Just sharing in case this helps
https://github.com/moby/buildkit/blob/master/docs/buildkitd.toml.md?plain=1#L101
true! probably a lot easier than setting up TLS
Ah, this is even better. Thank you, @wild zephyr 🙏🏻
It’s hard to keep track of all of the things we can/can’t do with Buildkit. I appreciate you remembering this. 😊
Last time I tried using insecure with the image exporter options it straight up didn't work and based on this I'm not sure if the config file works either: https://github.com/moby/buildkit/pull/3382 So worth a shot but probably keep expectations low
Or maybe just be sure to only set http = true and not insecure in the config file, if I'm reading that PR correctly
I'm almost sure I recall testing this a few days ago and managed to make it work with insecure registries. I'm not 100% sure though but I can try it relatively quickly
Yeah when I tried it it was with the image export option registry.insecure: https://github.com/moby/buildkit#imageregistry
Which didn't work at all but is slightly different, it seems like maybe it'll work with the config file as long as you only set http=true and not insecure=true
How should the tests connect to that registry? I'm guessing that the DNS inside of dev engine won't have entries for the registry running in stable engine. Is there a way to get an IP for it? /cc @still garnet
I guess I'm not grokking the CNI setup.
@ancient kettle I think the idea is to run the engine (in Dagger) with the registry as a bound service, which you can give an alias. That same alias can then be configured with http = true in engine.toml
containers running in the dev engine wouldn't be able to reach the registry, but that's OK - only the engine really needs to, afaik
Ah, yes. Thanks, that clicked it into place (about the engine, but not the containers it is running, needing access).
Ok. That all appears to work quite nicely! http = true gave me zero issues, and many of the tests ran great! remotecache_test.go is gonna need some work.
awesome!
Awesome, good to know at least that works! For remotecache, one of the tests just needs a registry, so that should be reusable from the other tests. The other part is minio, which is an s3 implementation. I think you can re-use the exact same idea as the registry except you shouldn't need to worry about disabling tls, it all works over http.
There's some existing code that initializes a bucket in minio, which is sort of annoying because it requires using a different container image, but I think you can just run that code at the beginning of the test suite, same place you set up the registry/minio services. But if any of that gives you any trouble let me know, I can look in more depth too
The part that I'm gonna need to figure out is the running separate dev containers.
Yeah if we want to make those be special case tests that require some specialized setup outside the direct test code, I think that would be totally fine for now.
Another approach would be to allow each test case to optionally get clients for its own separately spun up dev engine when it starts. E.g. most test cases will just call getClient(...) that returns a dagger client for the default dev engine we already spun up, but if they want their own engine they can use a different func. Buildkit uses a similar-ish approach in their integ tests. Not sure about the details there, but conceptually something along those lines might be nice.
Also, feel free to send out the updates for the rest of the test suite and skip those for now. It will be easier to get an idea on how to approach it after looking at the structure of everything else.
❓ What's the current state of debug sessions? (attaching a shell)
hmmm... anyone else seeing this? services don't seem to be stopping, and a bunch of zombie processes are piling up. pictured: htop in dagger engine container after running a few tests that run services.
repro:
git checkout main; git pullgo test ./core/integration/ -run ContainerExecServices -count=1 -v<- not against dev engine, happens against v0.4.2 too
noticed this because tests that ran dockerd as a service were failing waiting for containerd to start, likely because of a flock held by a leaked dockerd service from a prior run
cc @civic yacht @ancient kettle
debug sessions
hmmm anyone else seeing this services
@still garnet I was trying to use the new terminal UI (downloaded the latest from that PR, built it), and ran /Users/joel/src/dagger/dagger-ui/dagger run -- go run main.go -w $DAGGER_SRC_ROOT engine:test
I'm seeing this screen... but then it just hangs. My mage go.mod is using dagger.io/dagger v0.5.2. Is there something else I need to do?
hmm, haven't tried running our Mage stack with it yet, not sure why it would hang. I've run dagger run go test ./core/integration directly though and seen lots of things come in
Good to know! I'll keep playing with it.
@ancient kettle as a sanity check you could run dagger run ls, which takes a bit (needs to start the session) but should at least show some output
hmm, well that's not printing anything for me, so something else is probably up.
trying to reproduce this now, but dagger run ./hack/make engine:test at least seems to work fine 🤔
I’m rather confused. 🤔
I’ll try some more stuff tonight/tomorrow.
definitely don't think it's a "holding it wrong" situation, maybe something platform-specific. I've only tested on Linux
pushed a potential fix, not sure though, just a race I found with viewing command output
Damn, still not working. I wonder if it is a platform-specific thing.
Gonna try a couple more things.
I've rebuilt it (go build + ./hack/make), I've blown away dagger engine and spun it back up via dagger run as well as ./hack/dev. Still nothing from ./cmd/dagger run ls.
Your guess of platform specific makes some sense to me.
I just ran dagger run ./hack/make engine:test and everything passed, and then dagger run ls and that works too, though it takes 2-3 seconds before it shows it. interestingly dagger run go test ./core/integration doesn't work though because they all run against the same session (from dagger run), so things like the workdir are wrong.
I'll try testing on Mac soon, have an old machine sitting on a shelf. (hopefully this isn't extra spooky m1-only behavior or something.)
also: that 2-3 second startup time is begging for some optimization, not sure what's going on there but this likely affects every SDK
@still garnet I spun up a Linux VM on my machine last night/this morning and ./dagger run ls works just great. Your platform-specific hypothesis is looking stronger and stronger to me.
I'm happy to help you debug it, if that's useful for you!
weird! yeah, maybe try this: dagger run ls 2> stderr, let it hang for a minute or two (just to make the stuck goroutines stand out more), then kill -QUIT the process and upload the trace somewhere
@ancient kettle when you press arrow keys, does the top left change from FOLLOW to BROWSE? trying to see if the UI is totally wedged
and if you resize the window, does it update to reflect the new size?
Yeah, it FOLLOW/BROWSE changed. The collapse/expand works, too. The resize works, as well.
I don't think the UI is totally wedged.
@still garnet I mucked around with it some more... really not sure what's happening there. 
my one guess is something is going wrong with the cmd.Run() and the UI isn't reflecting the error. you could try wrapping it in a if err != nil { panic(err) } maybe
I started hooking up a Mac Pro I've had sitting on a shelf collecting dust, but don't have a spare USB cable, so context-switching for now 😄 will get back to it later
Sounds good! I'll try that!
@civic yacht @still garnet Any ideas for how I can get docker hub credentials to dev engine (running in stable engine)? I'm getting blocked on rate limits. 😢
@ancient kettle if you're logged in, log out. if you're logged out, log in. that's been my trick for getting a few more requests, since you get both the anonymous and authorized limit 😆
(using docker login / docker logout)
it should be forwarding your auth already
Pretty sure that's working for stable engine. How would that work for dev engine (running as a service in stable engine)?
oh, doy. hmm
There's currently a workaround for that by mounting in this: https://github.com/sipsma/dagger/blob/11948c1c9a3821b51400ee1e1b84f0d633167513/internal/mage/util/util.go#L163-L163
I think you could re-use the same idea? Devil will be in the details of course
It's also not great in that it should probably use a secret
But probably okay for now
this doesn't help you right now, but last time I looked into this I came away thinking we should just push all of our test images to a private registry. I sunk a bunch of time into trying to limit the API requests (e.g. by running a local registry mirror) but it gets really tricky since we run multiple engines in some tests
I'll try this!
Yeah Buildkit's integ tests do something along those lines. Or another possibility is to just use a mirror that doesn't have the same rate limits (I think I've heard gcr does that)
Damn. It's already being mounted to /root/.docker in the container that's running go test -- which would be the client.
Hm yeah you might just be hitting the limits @still garnet mentioned then... If we just replace all our hardcoded image refs with consts and then update those consts to use the gcr mirror instead, that might work. Or we could use the mirrors configuration in /etc/dagger/engine.toml for the dev engine, which would be a less tedious update (though maybe less ideal long-term)
I thought I had the mirrors set... but apparently it was in the wrong place. Trying again with that fixed.
On a positive note, nearly all of the integration tests pass just fine in this new dagger-in-dagger refactor.
I am so looking forward to having dagger run work on the Mac, @still garnet. Sorting through all of these trying to find the test failures... not my favorite.
I'll spend some more time poking at that tomorrow.
will take a look this evening too, been looking for a reason to rev up the ol trashcan
Nice! Always nice having an a good excuse. 🙂
Of interest, these are the integration tests I'm having trouble with:
TestContainerInsecureRootCapabilitesWithServiceTestDirectoryWithTimestamps.changes file and directory timestamps recursively
Let me know if you have any log output for the TestContainerInsecureRootCapabilitesWithService one, happy to take a look
I'll run that test in a few!
@still garnet Our old nemesis rises again.
@ancient kettle TestContainerInsecureRootCapabilitesWithService currently fails on the second run because the dockerd service leaks (the upstream buildkit issue from yesterday)
Oh. Maybe that's the problem!
Also, looks like I'm getting a panic after 10min.
Might be related to the services issue, as well?
#15 601.3 panic: test timed out after 10m0s
#15 601.3 running tests:
#15 601.3 TestContainerExecServicesChained (2m36s)
... + a bunch of goroutine stacks
That problem with TestContainerInsecureRootCapabilitesWithService only happens if the engine keeps running between between test invocations (it works if you just do ./hack/dev go test -v -count=1 -run=TestContainerInsecureRootCapabilitesWithService ./core/integration/ multiple times in a row, at least for me, since the dev engine will be restarted each time). I would have expected that each run is resetting the dev engine in the new dagger-in-dagger approach too, but if not then yeah that's most likely it
#15 20.54 container_test.go:2998:
#15 20.54 Error Trace: /app/core/integration/container_test.go:2998
#15 20.54 Error: Received unexpected error:
#15 20.54 input:1: container.from.withMountedCache.withServiceBinding.withEnvVariable.withExec.stdout process "docker-entrypoint.sh sh -e -c echo uo3txv448udzdo9j3ko0jg98w-from-outside > /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine cat /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine sh -c 'echo uo3txv448udzdo9j3ko0jg98w-from-inside > /tmp/from-inside'\ncat /tmp/from-inside" did not complete successfully: exit code: 1
#15 20.54 Stdout:
#15 20.54
#15 20.54 Stderr:
#15 20.54 Unable to find image 'alpine:latest' locally
#15 20.54 latest: Pulling from library/alpine
#15 20.54 af6eaf76a39c: Pulling fs layer
#15 20.54 af6eaf76a39c: Download complete
#15 20.54 af6eaf76a39c: Pull complete
#15 20.54 Digest: sha256:ff6bdca1701f3a8a67e328815ff2346b0e4067d32ec36b7992c1fdc001dc8517
#15 20.54 Status: Downloaded newer image for alpine:latest
#15 20.54 cat: can't open '/tmp/from-outside': No such file or directory
#15 20.54
#15 20.54 Please visit https://dagger.io/help#go for troubleshooting guidance.
#15 20.54 Test: TestContainerInsecureRootCapabilitesWithService
#15 20.55 Error: failed to solve: process "docker-entrypoint.sh sh -e -c echo uo3txv448udzdo9j3ko0jg98w-from-outside > /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine cat /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine sh -c 'echo uo3txv448udzdo9j3ko0jg98w-from-inside > /tmp/from-inside'\ncat /tmp/from-inside" did not complete successfully: exit code: 1
#15 20.55 #4 ERROR: process "docker-entrypoint.sh sh -e -c echo uo3txv448udzdo9j3ko0jg98w-from-outside > /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine cat /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine sh -c 'echo uo3txv448udzdo9j3ko0jg98w-from-inside > /tmp/from-inside'\ncat /tmp/from-inside" did not complete successfully: exit code: 1
That test gave me an issue yesterday too, but it just seemed to take a really long time: #1089980300672643092 message
Unsure if it's related to the upstream problem or a different one
Oh okay yeah that's a different issue from service process leak I think (but let me know if that's the same problem you were seeing @still garnet)
Yeah, dev engine should be run once per mage engine:test run and killed afterwards.
I'm running dev engine with:
WithExec(nil, dagger.ContainerWithExecOpts{
InsecureRootCapabilities: true,
ExperimentalPrivilegedNesting: true,
})
Not sure if both of those are necessary or not...
hm nope never seen that before
That looks correct. I don't have any immediate ideas on what would cause the test to fail with that error now... it should just be setting up a shared cache mount in two containers that are both running in the dev engine, so I'm not sure what would be different now that causes that to change behavior.
If you want to just push the code as a draft I can look more closely, it may be some minute detail somewhere
Yeah, I'll do that in a few.
@civic yacht @still garnet https://github.com/dagger/dagger/pull/4848
Don't mind core/integration/remotecache_test.go - I haven't figured out how I should refactor this yet.
Kinda beautiful, isn't @still garnet? 🙏🏻 You made it possible. 🙂
Awesome, sorry been finishing up something else and need to catch up on pings now, will take a look in a bit!
@ancient kettle Okay managed to fix it, and the fix also appears to have gotten TestDirectoryWithTimestamps passing too. Pushed the fix here: https://github.com/dagger/dagger/pull/4848/commits/6570ea111ecd9acd1ee327bbceedf12d434dc73f
Before that change, the engine's state directory was an overlayfs mount, which meant that it couldn't itself create new overlayfs mounts so it had to fallback to the extremely slow copy-based "native" snapshotter. Putting a cache mount at /var/lib/dagger fixes that because cache mounts are just bind mounts from the "outer" buildkit's filesystem, which is itself a bind mount from the actual host (since we run the default engine in docker w/ -v /var/lib/dagger.
So yeah that fix is what we want in general anyways, but the fact that it caused errors in those two test cases in particular is pretty interesting though. I am not totally sure, but I think it may have to do with weirdness of overlay where the listing of directory contents is cached in the kernel in such a way that different processes can get inconsistent results (search readdir here: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt). Because all the mounts we put into containers were actually just bind mounts from a single underlying overlay mount, that could easily explain different containers getting inconsistent views of the filesystem. Doesn't matter in the end, just kind of interesting.
@ancient kettle welp. one macOS upgrade, 7 failed recovery mode reboots, a table flip, and an Xcode upgrade later, I can't reproduce your issue. 😅 caveat: this is an old Intel Mac Pro from 2013. dagger run ls runs successfully, with the same-ish delay
Wow. I’m sorry it was such a trial. I guess I need to see if one of our m1/m2 peeps can repro. If not… well, then I have some ghosts to exorcise, apparently. 👻👻👻
I can try reproducing tomorrow on a m1 mac. Are there instructions somewhere?
@tepid nova 🙏
cd dagger
git remote add aluzzardi https://github.com/aluzzardi/dagger
git fetch aluzzardi ui
git checkout ui
go install ./cmd/dagger
dagger run ls # sanity check (should take a few seconds)
dagger run ./hack/make engine:test # try actually running everything
@civic yacht 👋 How do we stand today on the limits for the engine to process requests? Let's say you send 10000 concurrent requests (pipelines), is the engine smart enough to throttle that or do you need to specifically configure the parallelism? Do you have a list of open issues on this subject, i.e., performance issues related to using "unlimited" concurrency from an SDK?
@rancid turret my understanding is there's no default limit, but you can configure one in the engine.toml: https://github.com/moby/buildkit/blob/master/docs/buildkitd.toml.md (Ctrl+F max-parallelism)
i know one user has done that at least
Panic on M1: https://gist.github.com/grouville/f4873751dd184f1518ae83f1bcfc5ca7, but a rerun seems to work
oo fun, thanks
did you start with a small window and resize it for the second run by any chance?
What @still garnet said, I also recently added a command line flag to the engine that does the same thing, which is slightly more convenient but still requires a custom provisioned engine to use. If we want to make it easy to configure on any engine we can though; no issue for that yet, if it's something needed I can make one
I dont think so (even though I'm not 100% sure). I retried after cleaning everything (including volumes). But it didn't crash. And in the tests, I failed to crash it with window resizes
Ok, you seem to be right @still garnet. Making very small windows while it runs makes it fail. But I don't remember doing that (I might have done it automatically though)
So weird. Clean checkout. I deleted brew dagger just to be safe. Ran dagger run ls... and 🦗🦗🦗
you're not running it in an empty directory right? what about dagger run echo potato? 😄
or what about dagger run say hello?
what about dagger run reboot. you feeling lucky?
Maybe in a bit? Gotta meeting now. 🙂
if you didn't hear anything from say hello that's enough evidence without rebooting 😛
Yeah, not getting anything from that. Just say hello does work.
I'll try that panic right now.
Concurrency limit
@still garnet I changed this code.
diff --git a/cmd/dagger/run.go b/cmd/dagger/run.go
index 5b3527c2..2110d08b 100644
--- a/cmd/dagger/run.go
+++ b/cmd/dagger/run.go
@@ -72,8 +72,11 @@ func Run(cmd *cobra.Command, args []string) error {
eg := new(errgroup.Group)
eg.Go(func() error {
+ subCmd.Stdout.Write([]byte("Hello from Joel! #1\n"))
return withEngine(ctx, sessionToken.String(), journalW, progOutWriter{program}, func(ctx context.Context, api *router.Router) error {
+ subCmd.Stdout.Write([]byte("Hello from Joel! #2\n"))
go http.Serve(sessionL, api) // nolint:gosec
+ subCmd.Stdout.Write([]byte("Hello from Joel! #3\n"))
return subCmd.Run()
})
})
When I run dagger run ls, I now get Hello from Joel! #1.
Looks like withEngine is returning an error.
was just about to ask 🙂
🙂
Huh. err: failed to solve: unknown cache exporter: "dagger"
Gonna check the engine logs.
maybe it's talking to an outdated engine somehow? or a vanilla buildkit?
(should definitely handle errors here better)
I cleared out the dagger engine that was running (including volumes)
Then re-ran.
Same error. 🤔
The image is registry.dagger.io/engine:main
oo maybe that needs to be pulled?
I'm going to delete that image, and see if I can pull it again.
What. The. F.
Hello from Joel! #1
Hello from Joel! #2
Hello from Joel! #3
CONTRIBUTING.md
Dockerfile
LICENSE
NOTICE
README.md
RELEASING.md
TESTING.md
auth
cmd
codegen
core
demos
docs
engine
go.mod
go.sum
hack
internal
network
project
router
sdk
secret
telemetry
tracing
website
Hello from Joel! #4
err: <nil>Hello from Joel! #5
🎉? (is that a good wtf or a bad one)
It works!
I'm just so confused what happened...
I should have captured the sha...
Probably an old image or something.
yeah, maybe that didn't matter until now because you've always used ./hack/make and friends. do you have that set as an env var in your shell somewhere?
No, I don't think so. I ran env | grep DAGGER and verified it was empty.
I must have pulled engine:main a while ago and it kept using that. ¯_(ツ)_/¯
oo, we default to that for dev builds apparently
Ah, yeah that makes sense.
@ancient kettle @obsidian rover pushed fixes for both issues. it'll now synchronously start the engine and command, which makes for better error handling and avoids showing a 'blank' UI that looks stuck
Thanks, @still garnet!
👋 what does anybody think about changing []string Go SDK arguments with (args ...string)?
I generally find myself writing WithExec("foo", "bar") just to realize afterwards that I forgot the []string{} thing
In terms of pure DX, I agree with you
I agree it would be nice, the challenge would be passing the optional dagger.WithExecOpts arg though
true 🤔
we could add sugar to SDKs for things like that, just like Buildkit has llb.Shlex("foo bar baz") which expands to llb.Args([]string{"foo", "bar", "baz"}). There's risk in fragmentation / thicker SDKs, but well, these things would be nice when you're using it all the time
it's a bummer that go doesn't allow adding variadic args of different types. Doing something like func(args ...string, argsb ...bool) doesn't seem a bad idea initially
and should be easy for the compiler to validate that
i think the problem would come with interfaces, e.g. if you had func foo(ws ...io.Writer, rs ...io.Reader) and you called it like foo(writer, readWriter, reader) it wouldn't know where to group the readWriter
@civic yacht if you're still looking into the exit 4294967295 issue, here's what I know so far:
- When @stray heron and @hybrid widget reported it they were actually seeing an error when resolving service aliases ("host: lookup SDFFKJHJKHFDG: no such host" or something), which can only happen after a successful healthcheck, and indicates the service exited afterwrds.
- When I saw it last night it seemed to happen for a random subset of tests in Bass's test suite, which runs against
buildkitdrunning as a service container, so in that case I thinkbuildkitdwas exiting partway through. With enough runs I was able to have all the tests pass.
For one of your tests though it seems to have happened when running the healthcheck itself?! https://github.com/dagger/dagger/actions/runs/4579508241/jobs/8087359112?pr=4879#step:5:3831
Yeah actually let me push a commit quick, just to debug I changed the code to ignore that exit code from proc.Wait and that resulted in the test case passing for me locally. That's making me feel like the process is actually exiting fine but something is going haywire in getting the exit code back to the client.
Need to actually understand before merging though of course
hmm ok - that seems counter to what I've seen so far (I thought services were exiting prematurely somehow) but maybe multiple things are happening
Hm yeah running the full suite of tests in the PR with that change seems to have caused all sorts of other bizarre failures, so almost certainly more to it than just the process exiting fine...
The exit code being the max value of uint32 makes me suspect that something is returning -1 somewhere and that is incorrectly being converted https://go.dev/play/p/9eGFxEgw374
starting a thread for this
Hey, just wondering if this is still the case with Dagger engine?
https://github.com/dagger/dagger/issues/2227
I'm guessing the client-side config is pending?
Yes. This is my workaround: https://github.com/dagger/dagger/issues/2227#issuecomment-1494447042
quick fix to CLI build failures happening in the publish workflow in main: https://github.com/dagger/dagger/pull/4887
@still garnet PageUp/PageDown?! ♥️ https://github.com/dagger/dagger/pull/4522/commits/0bfaad4dbdd57c5b5dcaa93b490b91be588a71df
lol, you noticed that fast, I'm writing up a comment to summarize latest updates
Yes This is my workaround https github
@civic yacht We never implement the Container.With(func(c *dagger.Container) *dagger.Container) pattern, did we...?
@rancid turret has been carrying the torch on that: https://github.com/dagger/dagger/issues/3885#issuecomment-1495879670
Thanks, @still garnet!
@still garnet @civic yacht This Dagger-in-Dagger PR is mostly done. I realized I still need to refactor core/integration/remotecache_test.go, so I'm digging into that now. https://github.com/dagger/dagger/pull/4848
@ancient kettle now I want to add a TUI key to open the details log output in $EDITOR 
just ran your PR and a test failed, but it's pretty hard to peruse the output in the little pane. (one could say it's a paneful experience)
huh, yeah, I could have sworn it respected it, but I guess not
I'll double check, and make sure I've got your latest.
i'm also trying to figure out how I can have services 'marked' somehow so the follow mode doesn't prioritize them. and so they're not marked as 'failed'
nah it definitely doesn't (just tried)
unfortunately the services thing is tricky because pipeline info is baked into containers at withExec time, and containers don't know whether they're a service at that point.

a dumb hack might be to use llb.WithCustomName to prefix the vertex with something like service: . Hopefully there are better ways than that but would technically be an option I think
that's what I was going for, but I don't think there's a point in time where a) I can use llb.WithCustomName and b) the container knows it's going to be a service
I guess I could do it based on whether it has exposed ports? (even though not all services will do that, since sometimes ports are dynamic, i.e. rabbitmq)
ack, sometimes you do WithExposedPort after WithExec, so even that's not perfect
There's a chance it might work if you pass llb.WithCustomName as an arg to .Marshal here: https://github.com/sipsma/dagger/blob/ae656003ba7b7073f0d4d31cdbcc43489cd2dba7/core/container.go#L1156-L1156
The only reason I could see it not working would be some weird stuff with overriding the metadata of a definitionop, in which case you could do something similar to the workaround here: https://github.com/sipsma/dagger/blob/ae656003ba7b7073f0d4d31cdbcc43489cd2dba7/core/container.go#L410-L410
But at that point there's clearly a can of worms forming, so maybe not worth it
Yeah the types used to enable that flexibility are kind of complicated: https://github.com/moby/buildkit/blob/3187d2d056de7e3f976ef62cd548499dc3472a7e/client/llb/state.go#L451 but it is convenient
whoa, handy vim tip: to pretty-format json (i.e. in journal.log), just do :!jq ., i.e. :'<,'>!jq . (for selection) or :%!jq . (for whole file)
i've gone too long without using this...
If you have the json LSP (jsonls is the one that I use) enabled and binding for the format action ( vim.keymap.set('n', '<space>f', function() vim.lsp.buf.format { async = true } end, bufopts)
). you can do set ft=json and use that as well 🙏
woo, now concourse's web UI integration tests show everything passed, instead of saying the services failed
❤️🔥❤️🔥❤️🔥
I'm adding o to open in $EDITOR, I realized it's kind of important because otherwise the output is kind of "lightning in a bottle"
I think it should work while you have a group selected too, not just an item, but I'm trying to figure out what that should actually do 
it'd just be nice to be able to smash "Home" (go to top/root) => "o" to save all the logs
oo, maybe it could create a file/directory tree and then open the folder in your editor...?
I like that idea a lot!
pushed! turns out bubbletea (our TUI library) already had all the necessary tools for this, which is impressive. it's kind of tricky to e.g. open vim from a TUI, since there's only one 'alt' screen. you have to basically halt the background process, otherwise it starts eating user input before it can make it to the editor
@ancient kettle yeah, just had a few questions to make sure we agree on design, but anyone can assign to implement. The issue isn't yet marked ready for work \cc @cosmic cove.
Sounds good, @rancid turret!
I'm getting inconsistent TestContainerExecWithUser/user_and_group_ID failures: https://github.com/dagger/dagger/actions/runs/4630289933/jobs/8192748504#step:5:6423.
Any hint on how to debug ? I don't think the changes of the PR should impact this
@obsidian rover my guess is we're just brushing up against the test timeout again, we've had to optimize some things in the past to bring it back under the 10m threshold. (I'm guessing this because none of the tests running at the time of failure seem to have been stuck for a crazy amount of time, so it's not an individual test causing the timeout)
I had a run hit this yesterday and it passed on rerun
...but that's not really sustainable
Thanks 🙏 I'll try again, but it's my third try
time to rm -rf some tests 
we can maybe just bump the timeout, if we're adding tests all the time it's kind of inevitable. it'd be awesome if we could speed them up instead, but that's not a priority at the moment
@ancient kettle did you notice any performance difference by any chance with the dagger-in-dagger refactor?
Yeah, I'd love to do that. 10mn is kinda long.
I'm sure we can improve some part, make some parallel, but I don't know how much we will gain.
I'm really curious if there's low-hanging fruit in the session startup time. dagger run ls shouldn't take 2-3 seconds
so that's even after bypassing any communication with our registry?
Or I guess, dagger run does communicate with the registry, so that's the likely culprit
yeah, I'm guessing so
Doesn't seem much different (except for engine-race-detection for some reason)... Also note that this doesn't include the caching tests (because I'm still working on getting those working).
mine: https://github.com/dagger/dagger/actions/runs/4621842009
vs
latest main: https://github.com/dagger/dagger/actions/runs/4630354758
@rancid turret @still garnet @civic yacht @tepid nova I was feeling impatient this morning... https://github.com/dagger/dagger/pull/4898
I only applied it to Directory and Container, which is at odds with the NodeJS SDK, but can easily shift that here: https://github.com/dagger/dagger/pull/4898/files#diff-85da13f35530543cd3af1d79ed2aea0a29f58dfd42b4aeeca1a307b27bed58cfR16
Why not let it be added to all object types? 🙂
What do you imagine doing with func (c *CacheVolume).With(...) *CacheVolume?
Do you know where the timeout is set ? @still garnet
@obsidian rover it's the Go default, so we just need to add a -timeout=15m (or something) flag wherever we run go test
looks like that's in internal/mage/engine.go
A colleague just found this, thought it might be worth mentioning to you folks https://github.com/josh-project/josh
just found: https://tart.run/. Interesting product for people requiring fast macOS runners. cc @ancient kettle @civic yacht
@still garnet ... Just gonna drop this here... https://github.com/mlange-42/git-graph
git is a DAG, too... 😉
ooo. I was thinking about this last week! Since dagger run is a bit of a lie, trying to represent everything as a tree. I was thinking about git log --graph couldn't find a good generic tool/library for it
It's a bit of a bummer that tool is in rust... and I couldn't find something similar in go... 😢
time to rewrite everything i guess
this is definitely way easier to read than git log, wonder how it would look for a Dagger graph
looks like it's super tied to git though. maybe i need a... dagger graph to git repo converter.... 
@civic yacht @still garnet I'm getting a really weird error from TestHTTP - https://github.com/dagger/dagger/actions/runs/4660748842/jobs/8249189993?pr=4848#step:5:13742... and I can't replicate locally. 😠
lol wtf
I'm so confused... I didn't change that code at all... (beyond the attempt to fix it)
it's a simple HTTP fetch too 🤔
Yup. 🤷🏻♂️
and it's not a fluke, either; same error in engine-race-detection
only guess I can think of is some kind of weird cache false positive?
Yeah, I've seen it a few times now.
would be interesting if we could identify the byte sequence it's getting back somehow
That's a interesting/good hypothesis, actually.
Based on the first byte being 1F it might be a gzip payload? Very very odd though. It's also sad that the engine output gets clipped here: https://github.com/dagger/dagger/actions/runs/4660748842/jobs/8249189993?pr=4848#step:5:7858
also a good guess, like gzip decompression not happening for some reason?
🤷♂️
@ancient kettle when you try to repro locally are you running the full test suite or just that test in isolation? would be good to see if it repros when you run from a fresh cache and running the full suite
I just ran the single test. I'll try the full suite from a fresh cache.
If it's a bug where it picks up the wrong cache then that might be more likely to trigger it
Yeah... has me wondering if it's the cache mount for dev-dagger-engine biting me.
you could also change the test to print the content base64-encoded
just so we can play with the data it gets
The dev engine uses a docker volume mount already today, which is essentially the same thing, so I would be surprised
Ok, it repros locally with the full test suite.
Go http docs seem to say that if you manually set gzip compression request headers, you need to decompress on your own: https://go.dev/src/net/http/transport.go#L190
Here buildkit sets that header: https://github.com/moby/buildkit/blob/3187d2d056de7e3f976ef62cd548499dc3472a7e/source/http/httpsource.go#L183
But I didn't find anywhere it handles decompressing... could just be missing that or misunderstanding something but that feels suspicious
@civic yacht @still garnet I have another data point...
When I comment out TestRemoteCacheRegistry and TestRemoteCacheS3... it passes.
Maybe dagger-in-dagger-in-dagger is just too much dagger?
That is pretty weird too but did you give a try base64 encoding the bytes so we could see what they are? If it's just the compressed bytes of the actual contents we expect then I would highly suspect this commit (from 2 weeks ago in buildkit) broke it: https://github.com/moby/buildkit/commit/f2f3d8cf94dcf8d560b1d4392a73948a81455e5e
I don't know entirely how it would be inconsistent between different runs, but there's a plausible path since the transport used for the http requests is global to the engine, it's just created on startup, so there could be some strange interactions happening there that vary depending on the order and timing of requests to the engine
Yeah, I'll get that for us. Hopefully gets us more clues.
Base64ed:
H++/vQgAAAAAAAAD77+9VMuO77+9NhDvv73vv70r77+977+9IRnDlu+/ve+/vVwWG08eE2zvv71F77+9QDAYBAtabFPvv71QbO+/ve+/ve+/vShf77+9b8mX77+977+977+977+9WiwWAXzvv73vv73vv73vv73vv73vv70i77+9a++/vWfvv73vv71ySe+/vWlrKe+/vV3vv73vv70/77+9Su+/vSHvv73vv73vv73vv73vv73Rkzrvv73vv70Pd++/ve+/vXXvv71U77+977+9WEJSI++/ve+/vQY377+977+977+977+9XFAt77+977+977+9EFPvv71a77+977+9a++/vWEB77+9Ae+/vUzvv73vv73vv71xAu+/ve+/ve+/ve+/vXZDW++/ve+/vR3vv73vv71777+977+9cO+/vSrvv73vv70t2pJUVgQ9DN61OjsO77+977+977+977+977+977+9RSTvv71QWzJ9UU0h77+9SH7EkH/vv715TFkH77+977+9Ue+/vR3vv70XKH9977+977+9PO+/ve+/ve+/ve+/vQcKC++/vRzvv73vv73vv71RHzt42IFm77+94Ki9OlLvv73vv73vv73vv71277+92qjvv73vv70+CHBW77+9W3wE77+9DH3vv71N77+977+977+9Y++/vUfvv71dHjfvv73bhRMzRe+/ve+/vSDvv70rDhPvv70J77+977+9be+/ve+/vVvvv70OVO+/vR1877+977+977+9Ee+/vciAS3zvv713DO+/ve+/vXhV77+9Ou+/ve+/vWvvv71r77+977+9NO+/vSgDRe+/vdq2Alvvv70iYkYFybF6LkHvv73vv73vv70+77+9GHTvv73vv71n77+9DO+/ve+/vUnvv705Pu+/ve+/ve+/vQpw77+9KFLvv71x77+9fgDSq1h1du+/ve+/ve+/vXfvv70+yY5aTmPvv73vv73vv70qd1JuTh3vv73vv73vv73vv73vv71iQu+/vUQbIzXvv70/X++/vSTvv70QOe+/ve+/vSVBDgvvv73vv73Hsu+/ve+/vWo077+9PlXvv73vv73vv70UAhTvv71L77+9Bu+/ve+/vX3vv73vv73vv71gae+/vWMQce+/vdW8Qu+/vSDvv71o77+977+9AzonXu+/ve+/vV7vv73vv70C77+9W2bvv70QCO+/vSXvv71177+977+9aSQc77+9YGQ9JCg/77+9VF3vv73vv73vv70i77+9Z1hYT0tVQu+/vSg+77+977+9d++/ve+/vQfCg1Tvv70a77+9emMoORvvv73vv73vv73vv71DUgBOHO+/vWzvv73vv71B77+977+9V++/ve+/ve+/vVZ4DQ7vv70dUDpm77+9dEDvv71SG92QJ++/vSpxT++/vVLvv70ZBAYG77+9U++/ve+/vTYgMu+/ve+/ve+/vTUh77+977+977+977+9akvvv73vv71XF/G4va7vv70yBmBaCe+/vTTvv71KbNWF77+9Je+/ve+/vXc1bu+/ve+/ve+/vSHvv71WB++/vXdd77+977+9cu+/vVPvv73vv716LmYH77+9KF9477+977+9M0vvv70l77+977+9fe+/ve+/vXTvv73vv73Fqe+/ve+/vVnvv73vv73vv70lD1It77+977+9B08vMDbvv73vv73vv73vv71f77+9cC5V77+977+9E1JX77+9XmJQfe+/ve+/vT3vv73vv71nZe+/ve+/ve+/vSXvv71v77+9d++/ve+/vTY177+9Ju+/vXHvv71fd3zvv71J77+977+95Z2U77+9SO+/vU/vv73vv71D77+977+9Pe+/ve+/vXfbqXA3F++/ve+/vR9e77+977+9Xu+/vUsWHu+/vWMDMe+/ve+/vXdPc01t77+977+977+9S3YOJu+/vXpsIe+/ve+/vS3vv70Y77+9Uu+/ve+/vTTvv73vv73vv70p77+977+977+977+977+977+9UDF+ZO+/vSEa77+977+9PR/vv73UvO+/ve+/vdOLBe+/vU/vv73vv73vv73vv73vv73vv73vv73vv73JuO+/ve+/vWLvv71FMVYacu+/vWlW77+9Ae+/ve+/vU8H77+9BgAA
This website doesn't like it... https://bugdays.com/gzip-base64
incorrect header check
[nix-shell:~]$ file out
out: Linux rev -1111494879.14013 ext3 filesystem data, UUID=bfbd3206-605a-09ef-bfbd-34efbfbd4a6c, volume name "\325\205\357\277\275%\357\277\275\357\277\275w5n\357\277\275\357\277\275\357\277\275!\357\277\275V\007\357\277\275w]\357\277\275\357\277\275r\357\277\275S\357\277\275\357\277\275z.f\007\357\277\275(_x\357\277\275\357\277\2753K\357\277\275%\357\277\275\357\277\275}\357\277\275\357\277\275t\357\277\275\357\277\275\305\251\357\277\275\357\277\275Y\357\277\275\357\277\275\357\277\275%\017R-\357\277\275\357\277\275\007O/06" (needs journal recovery) (errors) (compressed) (64bit) (huge files)
ah yes of course. a Linux -1111494879.14013 ext3 filesystem. it doesn't seem to be in good health though. maybe that's the problem!
I'm trying it out locally too, this is too bizarre...
fwiw, ChatGPT isn't able to identify a hexdump of it, but it is able to identify an actual hexdup of a gzip'd README
this is certainly an interesting coincidence if nothing else
This is only supposed to happen when you write C...
Ah okay, I think I know what's happening. I think the base64 before was done after our API had already converted []byte->string, which meant the base64 encode include the escape chars like \x. I put a base64 dump earlier in the spot where we actually get the bytes (https://github.com/sipsma/dagger/blob/4be9d6ee11cb5d3a66ec3e0f846436535009fd5e/core/file.go#L92) and can see that it actually is the gzip of the file we were expecting:
echo 'H4sIAAAAAAAAA4VUy47jNhC8+yuI9SEZw5avwVwWG08eE2yQRbJAMBgEC1psU8RQbIGP8Shfk2/Jl6Wa0thaLBYBfLCk7uqq6iLXa/Vnp7NySd1paym+Xa2mP/JKqyGyjbrv9dGTOtzvD3eKgnWBVJa2WEJSI5eoBjeQx/ukXFAth6zxEFOzWq3Xa/VhAYMB9Eyeh6lxAr2264R2Q1vByR2ppHu60HDBKq+DLdqSVFYEPQzetTo7DvO43/+P1kUkvVBbMn1RTSG7SH7EkH//eUxZB6OjUb8d7hcof33b5Tyk2/2eBwoL0Rzt/qZRHzt42IFmguCovTpSoJPL6Xa12qjN5j4IcFaeW3wEiwx9m0399oFj1kfnXR43m9uFEzNF1eog9isOE/cJpNdth69brA5U8R185MGQEYvIgEt8lncM48J4VaM6rvNrgWupqTT+KANFh9q2AlvhImJGBcmxei5Bka/APo4YdNLFZ8wM5rVJnTk+pauGCnDuKFKdceB+ANKrWHV2uavFd9w+yY5aTmPK1N8qd1JuTh32qa/st2JC9UQbIzWZP1/pJOcQOaXdJUEOC4ilx7Krkmo06T5VoJLAFAIUg0v8BvyZfarsuGBppGMQcZ/VvELXIOJosbgDOidenKxej4gC6FtmoxAI0SX8daSmaSQc72BkPSQoP7tUXZyPyyKlZ1hYT0tVQtooPqmHd7++B8KDVIsasHpjKDkb6v6By0NSAE4c3myx9kHA4VekwetWeA0Opx1QOmaXdECuUhvdkCf1KnFPuVLqGQQGBo9T8ZU2IDLE9vM1IfSjq7lqS8rcVxfxuL2u2TIGYFoJ7jTWSmzVhcIlga13NW66ldUhtVYH93ddl9BynlOd23ouZgeJKF94p80zS+4l7ol9kb504bbFqeGnWb6hniUPUi3J6gdPLzA2k52Djl/qcC5V4IwTUlf8XmJQfZC1Pf7MZ2WY6s4l9W+vd4ThNjWmJqBxvF93fN5J6c7lnZTeSP9PlKdDmOE9ma9326lwNxfW5h9etJBe3EsWHpdjAzH7qXdPc01t+IXnS3YOJup6bCGP8y2xGO9Sy9E01u7LKYzf5/a78lAxfmTvIRq7gj0fsdS87MvTiwWDT46lr47W6vGLybjF/GLuRTFWGnK6aVb/Af32TwezBgAA' | base64 -d | gzip -d
## What is Dagger?
...
So that makes me suspect more that it might be the buildkit commit I linked above. I still don't totally get why it would be inconsistent, but I'll try quick with that buildkit commit fixed and see if it fixes it here
Okay yeah that fixed it, can't repro in dagger anymore, will send out a buildkit pr. @ancient kettle no need for this to block the dagger-in-dagger pr, if there's a URL that never returns gzip content that we can use then we could switch to that, but honestly if we even just temporarily comment out that test and have an issue for re-adding once we've picked up a buildkit with the fix in place that's fine, we will still have the other http test that uses a service to give us coverage.
Upstream fix here: https://github.com/moby/buildkit/pull/3788/files
I also realized as I wrote tests for that why we are hitting it in dagger, you can reproduce consistently by just running this test case in isolation:
func TestHTTP(t *testing.T) {
t.Parallel()
c, ctx := connect(t)
defer c.Close()
// do two in a row to ensure each gets downloaded correctly
url := "https://raw.githubusercontent.com/dagger/dagger/main/TESTING.md"
contents, err := c.HTTP(url).Contents(ctx)
require.NoError(t, err)
require.Contains(t, contents, "tests")
url = "https://raw.githubusercontent.com/dagger/dagger/main/README.md"
contents, err = c.HTTP(url).Contents(ctx)
require.NoError(t, err)
require.Contains(t, contents, "Dagger")
}
It has to do with some weird stuff involving how buildkit stores cache metadata. It technically wouldn't result in any false cache positives or anything, but it was unoptimal and coincidentally triggered that buildkit bug. Sent out a PR to improve that in dagger, more details in the description here: https://github.com/dagger/dagger/pull/4927
@ancient kettle the change in that dagger PR works around the issue without needing to remove the test or wait on upstream, pushed it the your dagger-in-dagger PR and tests pass now
@civic yacht excellent sleuthing! i thought the other base64 looked a bit funny. lots of repeated patterns, I guess from all the escape characters
@civic yacht Really appreciate you digging into this bug. I concur with @still garnet: excellent sleuthing. 🙏🏻
No problem! It was a weird one... Glad we don't have to get blocked on it
<matrix I don't even see the code.gif>
This is a better one... https://gfycat.com/miserlylightheartedanteater
@still garnet Turns out... network names have size limits... dev-engine-remote-cache-0 (25 characters) is over that limit.
Stack Overflow says 15 characters: https://stackoverflow.com/questions/54540781/docker-network-create-error-numerical-result-out-of-range
Has anyone gotten this? Here is my command:
docker network create
--opt com.docker.network.bridge.enable_ip_masquerade=true
--opt com.docker.network.bridge.enable_icc=true
--opt com.do...
time for derc-0!
😆
@civic yacht blatant nerd snipe attempt, feel free to ignore, but curious if you've had similar thoughts: SourcePath is really difficult to deal with! It would be really really nice if there was a way to just turn a llb.State + SourcePath back into a llb.State efficiently. Right now we have to go to great lengths to account for them everywhere, and the code is really convoluted. It seems like one option is to just llb.File(llb.Copy(st, sourcePath, ".", &llb.CopyInfo{CopyDirContentsOnly: true})) instead of keeping track of the source path, but I'm assuming that comes with a performance cost. 🤔
Yeah I've wanted that in upstream buildkit for a while, it came up a lot in my mind when working on MergeOp since right now you can only efficiently merge the / of input states, merging a subdir of an input requires a copy.
I would greatly prefer we just deal with the complication internally for now since yeah the cost of doing all those copies can become significant in realistic cases, e.g. if the copy involves a lot of large binaries, node modules, java jars, etc.
yea totally. ok, glad I'm not alone here 😄
When I talked to Tonis about this during MergeOp work, I think the idea was that we could give FileOp copy the same optimizations MergeOp has where the new snapshot is created by hardlinking files instead of actual on disk copies (with exceptions for when chown/chmod/etc. is used, but that's always going to require a real copy anyways)
For context, it's wreaking havoc in https://github.com/dagger/dagger/pull/4932 since now I'm adding more copy operations to change ownership, which requires using a SourcePath in nearly all cases in order to also chown the directory mount itself. I'm inclined to not chown by default at this point due to performance concerns, but I'm following through with it anyway to kick all the cobwebs out of these code paths. I found out Container.WithRootFS doesn't respect the given directory's source path, for example.
That makes sense - so my proposed workaround would "just work"? I can see why he'd want to go that route, since it keeps the API surface area the same
The chown/chmod restriction would be unfortunate though, since overlay has metacopy; theoretically you don't need to do a full copy in that case. Not sure if Buildkit leverages that at the moment
Yeah that would only be a restriction until buildkit supports metacopy, but doing that will be a large effort since there's a lot of code all over the place that interacts with overlay on a fine grain level, it will need to be updated to handle the "metacopied" files+dirs
gotcha, good to know
But yes if Buildkit had that then we could just copy freely, MergeOp would actually become unneeded as it would just be a special case of Copy. Though I just remembered the big wrench in all of this and why we didn't pursue it at the time, which is that MergeOp is lazy (meaning you can merge remote layers in registries/caches without having to actually pull them), but FileOp isn't. FileOp could be made lazy but it would be a huge refactor in all likelihood.
@civic yacht @still garnet I think this is ready, whenever you feel good about it: https://github.com/dagger/dagger/pull/4848
today the dupl linter is my worst enemy. it complained once, and I listened, turning a lot of duplicate test code into a shared suite.
now it's complaining that they're all calling the shared suite. https://github.com/dagger/dagger/actions/runs/4703523206/jobs/8342122580?pr=4932#step:4:491
i don't even know where to put a // nolint: dupl comment for this.
gonna just disable dupl and gosec for *_test.go files
@civic yacht Thanks for your comments. I’ll try to address them this weekend/Monday. Today has been another day of taking care of the family…
No worries at all, I'm glad that the family is higher priority than my dumb PR comments 😁 But actually, the comments are all just minor things, it looks great! So whenever you have the time+energy to look it hopefully shouldn't be too much
Thanks again for the comments. I addressed all of them except for the internal/mage/engine.go vs internal/mage/util/engine.go one. I think that a refactor of all of that is probably a good idea. We're definitely gathering some cruft there. If you're ok with it, I'd like to separate that out as a separate issue, because it will touch a number of other parts of our CI, as well, and I'd love to get this piece over the line.
Let me know what you think.
SGTM, doing another pass now!
@civic yacht Doing some refactoring in core/ and core/schema/ area, do you remember if the motivation behind this marshal => unmarshal dance was to prevent accidental mutation, or was it for something else?
My refactor is to try to eliminate all the *Foo <-> FooID <-> fooIDPayload encoding/decoding, since it seems pretty wasteful and repetitive. Example for File: https://github.com/vito/dagger/commit/cc5031ddfd76e6d78dc4d402eb2ece55ffd452c5
But one thing I had in the back of my head was having to worry about mutation, and potentially switch from pointers to just by-value structs to avoid that (e.g. func (*Foo) WithX() *Foo -> func (Foo) WithX() Foo).
I added a test using parallel named queries, but it looks like this marshal/unmarshal dance is allowing the test to pass. I'm considering switching to a parent, ok := parent.(P); if !ok { marshalUnmarshal } just to skip the marshal/unmarshal if the type already matches, and confirmed that it allows the test to fail, at which point I would probably switch away from pointers, but curious if this is otherwise load-bearing
Here's the test for mutation:
func TestContainerParallel(t *testing.T) {
t.Parallel()
res := struct {
Container struct {
A struct {
EnvVariable string
}
B string
}
}{}
err := testutil.Query(
`{
container {
a: withEnvVariable(name: "FOO", value: "BAR") {
envVariable(name: "FOO")
}
b: envVariable(name: "FOO")
}
}`, &res, nil)
require.NoError(t, err)
require.Equal(t, res.Container.A.EnvVariable, "BAR")
require.Empty(t, res.Container.B) // fails with BAR if there is mutation
}
The motivation for that was just that it's the only way I knew of to be able to reliably convert p.Source, which is of type any, to the generic type P. Basically just offloading the conversion to the magic inside the json package
Oh sorry I just read the last paragraph of your message, if that type assertion works consistently then yeah I don't remember there being any other reason to do the marshal/unmarshal
👍 now I'm curious what other types might flow through there and if any of them need conversion, so I'll add some debug logs
thanks!
One other thing is that even if we pass by-value structs, if any of the fields within are slices or maps, then those are still essentially being passed by reference. I think the situation with slices is sort of complicated, but I'm pretty sure that if you e.g. just changed an existing element of the slice that would also change it for others, so might be risky
I fixed the TestRemoteCacheRegistry issue. I think I was hitting the time limit (10 seconds, iirc - right @still garnet?) between when the dagger engines were getting spun up, which caused the registry to shut down, losing the data, etc. So, I created a cache volume mount to store the data. 🙂
I've also added checks to make sure the SHA values aren't ""
yeah true - I'm also thinking of just adding an explicit Copy() method to deal with those things, and leaving it a pointer. we'd have to be more explicitly careful, but it's probably better to keep our wits about us
if you mean the service shutdown grace period, yep, 10 seconds. you could also explicitly run the container in a separate goroutine if you want to have more control over it
slices are especially tricky with the whole "sometimes appending actually mutates" behavior 😬
Ah, that's a good idea, too. You'd just call cancel on the context you pass it to kill it?
yep!
I'm glad I didn't jump on https://github.com/dagger/dagger/issues/3323 2 weeks ago :p
😄 yeah, pulling on this thread revealed an elephant on the other side
Whoa, think I just found some low-hanging fruit for speeding up dagger session start time, which has a huge impact on test runtime and dagger run. Flamegraph before/after attached.
The subset of core/integration/ I ran went from 187.7s to 63.3s. 😱 (The subset excludes any tests that need to talk to a registry since those no longer work when running directly on the host, post-Dagger-in-Dagger refactor)
Here's the patch, which we can apply to our fork of graphql-go-tools: https://gist.github.com/vito/f28b442d7f3bd3bee854921a7c4c16f9
Seems like this parse-then-print-then-parse dance is to deduplicate type definitions, but we don't have any dupes in the first place, so directly appending seems to work fine.
Those flamegraphs were captured by hacking --cpuprofile into the Go SDK's session start flags, and then collecting them all with go tool pprof *.prof - so that spans all tests
This is also on top of https://github.com/dagger/dagger/pull/4973 - haven't done a benchmark without it yet, I was so surprised by that first flamegraph that I went on a detour to figure that out first. But now I expect this PR to be peanuts in comparison 😅
Wow, 187.7s to 63.3s is amazing! Does the code you patched only run once (or a limited number of times) at session startup? Or was it running every time a query is sent or something like that?
(I think it's just on startup, I'm just double checking)
Yeah, should just be on engine session start when we load up all the schemas
It seems like dagger is not cleaning up processes?
Possibly also not killing containers when the client goes away (below I just cancelled a run)
root 99943 0.0 0.1 723092 19500 ? Sl 18:34 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 9c6ba9ce66366de52df135e29f5edb7e02c10e3f85d691d4ac538730b5dd068c -address /run/containerd/containerd.
root 99964 0.6 0.4 961188 80904 ? Ssl 18:34 0:06 \_ /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --debug
root 100031 0.0 0.0 0 0 ? Zs 18:34 0:00 \_ [dnsmasq] <defunct>
root 100032 0.0 0.0 1160 756 ? S 18:34 0:00 \_ /usr/sbin/dnsmasq -u root --conf-file=/var/run/containers/cni/dnsname/dagger/dnsmasq.conf
root 119007 0.0 0.0 0 0 ? Z 18:48 0:00 \_ [runc] <defunct>
root 119031 0.0 0.0 0 0 ? Zs 18:48 0:00 \_ [_shim] <defunct>
root 122623 0.0 0.0 0 0 ? Z 18:50 0:00 \_ [runc] <defunct>
root 122635 0.0 0.1 731736 19292 ? Ssl 18:50 0:00 \_ /_shim /bin/sh -ec ???export DEBIAN_FRONTEND=noninteractive ???UPDATED_PATH=/var/cache/apt/moby/updated ???UPDATED_MAX_AGE=60 ???if [ -z "$(find
root 122645 0.0 0.0 2888 924 ? S 18:50 0:00 \_ /bin/sh -ec ???export DEBIAN_FRONTEND=noninteractive ???UPDATED_PATH=/var/cache/apt/moby/updated ???UPDATED_MAX_AGE=60 ???if [ -z "$(find ${U
root 122647 1.3 0.3 70632 64748 ? S 18:50 0:00 \_ apt-get install -y qemu qemu-system qemu-utils openssh-client linux-image-kvm
systemd+ 122654 0.0 0.0 18656 8548 ? S 18:50 0:00 \_ /usr/lib/apt/methods/http
@wet mason when you have a moment, 🙏 https://github.com/dagger/graphql-go-tools/pull/3 (I think permissions on this repo are pretty narrow btw)
Sweet!
@still garnet LGTM and permissions fixed
@still garnet it's weird, looks like they were forcing a plain text export just to re-import back. Looks like a cheaply implemented deep copy?
re-import as in parse back
yeah, I don't quite get what it protected against, I even forced a duplicate type definition in our own schemas and nothing seemed to break even with just naively appending 🤷♂️
the tests also don't cover duplicate types
maybe there's some funky pointers?
Shouldn't improve life for people on optic fiber, but still nicer than without: https://github.com/dagger/dagger/pull/4975
And very easy to review
nice!
@still garnet in any case, I'd merge and keep it in mind if we see some weird behavior in the future. Probably will never happen. Only possible reason I can think of is those doc.Definitions being pointer-altered somewhere else, and this code was making sure to get an immutable copy
Maybe a change made out of despair at some point? 😅
yeah definitely. it's suspiciously "cheap" win
- I tried to figure things out with git blame but it was all added at once
all those pointer values are created within the method (either parsed from text or the ast.NewDocument(nil) right above) so that all seems reasonably safe to me
I wouldn't be surprised if this was just YOLO'd and they didn't anticipate it becoming so load-bearing 😂
there are other issues upstream about performance not being a focus "yet" (edit: nevermind that's in graphql-go/graphql, not this tools repo: https://github.com/graphql-go/graphql/issues/119)
Wouldn't be surprised it was useless. Quality of this library is very low, @civic yacht and I spent a lot of time dealing with weird behavior at the beginning
Both graphql-go (almost abandonware) and graphql-go-tools (24 stars project)
It's the only way to do dynamic graphql in Go though (for extensions), so that's why it is the way it is
makes sense - I was considering a codegen approach but figured extensions might get in the way of that
Yeah, gqlgen+codegen would be the best way. Initial version of dagger had extensions and that required dynamic, which prevented us from using gqlgen
Yeah I don't think they anticipated the use case in our test suite, which boils down to spinning up probably close to 100 graphql servers in rapid succession 🙂
we even tried to tweak gqlgen to allow for dynamic schema, but that was a much worse rabbit hole
each one merging 10-ish graphql schemas together, so probably close to 1k schemas being handled
https://github.com/dagger/dagger/pull/4977 engine tests passed in 5 minutes 🎉
looks like it was ~8-10 minutes before, not quite the 3x boost I saw locally but I'll take it
Yeah that's still really nice for something so easy. Did this also fix all the startup lag you've mentioned previously when using the TUI?
yeah, dagger run ls is pretty snappy now
Some improvement for the secrets: you can now use them with Dockerfile builds: https://github.com/dagger/dagger/pull/4971
I don't think this code is doing what it is intended to do: https://github.com/dagger/dagger/blob/c4db48ef6e5c2332c7668ea6e73a419574d83dde/cmd/shim/main.go#L395-L409
That cmd is run in a new goroutine so it will run in a different thread.
Could be wrong... but at least in my experience its pretty difficult to manage threads for cmd's.
Took a look, it seems like exec.Cmd.Start never leaves the caller's goroutine:
- https://github.com/golang/go/blob/7c47a6b15782b13ecb76fd3c6c18e5f1edc34733/src/os/exec/exec.go#L693
- https://github.com/golang/go/blob/7c47a6b15782b13ecb76fd3c6c18e5f1edc34733/src/os/exec_posix.go#L54
- https://github.com/golang/go/blob/7c47a6b15782b13ecb76fd3c6c18e5f1edc34733/src/syscall/exec_unix.go#L209
- https://github.com/golang/go/blob/7c47a6b15782b13ecb76fd3c6c18e5f1edc34733/src/syscall/exec_linux.go#L130
- https://github.com/golang/go/blob/7c47a6b15782b13ecb76fd3c6c18e5f1edc34733/src/syscall/exec_linux.go#L291
Let me know if I'm missing something though. Either way, would be nice if there was some official guarantee of this upstream so it doesn't just change out from us someday: https://github.com/golang/go/issues/27505
Yeah I was hunting around (as you did) and a little surprised hence the disclaimer.
I know (or at least am pretty sure!) I've had issues with setting namespaces for the cmd in the past. Maybe it was old issues with goroutine locking.
Oh totally, trying to do any of this with Go is always a shit show... I found a bug in runtime.LockOSThread a few years ago after having to diagnose why go routines were suddenly no longer able to find files (they had randomly teleported to a different mount namespace). And I at one point could consistently crash the runtime by just trying to set a pid namespace on a exec.Cmd, but I didn't even bother reporting because it seemed just not worth it
This is one case where the whole "rewrite it in Rust" meme has some degree of validity, even if it's not actually practical 🙂
Hey @civic yacht and @still garnet, can you guys confirm if fixing exitCode (https://github.com/dagger/dagger/issues/3192) is feasible today, including not messing up the cache on failure? I mean even as a new field, so ignore breaking changes.
yeah, I think if you find the code that fetches the stderr/stdout/exit status for annotating the error response, we should also be able to use that code to implement those APIs without affecting the cache
PSA: I've noticed a few of my PR's test builds seem to fail with seemingly unrelated service test failures/timeouts. Investigating now, let me know if you've seen the same
Hanging/flaky service tests
Commented here: https://github.com/dagger/dagger/issues/3192#issuecomment-1515352716
tl;dr: yes 🙂

