#maintainers

1 messages · Page 7 of 1

civic yacht
#

Yeah I just realized that we only have support for specifying a path to the CLI binary, so it's a bit annoying, you'll have to curl the CLI archive and unpack it yourself if you aren't using the one specified in the sdk...

#

Totally possible but kind of annoying

#

Let me find the url format, one sec

wet mason
#

thank you!

civic yacht
#

https://dl.dagger.io/dagger/main/<git sha>/dagger_<git sha>_linux_amd64.tar.gz

wet mason
#

thanks

wet mason
#

@tawny flicker Hey, there's a bug in Go codegen for Client.Pipeline:

    return &Client{
        q: q,
    }

it's missing c: r.c so it's nil and panic'ing when I try to use it

fair ermine
civic yacht
#

The long term solution is to run the registries properly w/ actual services once we have those

fair ermine
#

Hmm okay, I also spotted a test that fail on my side but I hope everything will work after that

civic yacht
median charm
#

Hi guys, I am getting a x509 certificate error with dagger/engine (I'm using a private registry for images), however I am sure that buildkit is ok, since with dagger-cue and the same buildkit instance, everything works just fine.
I am running docker login before spawning the engine, and as the docs say, this should grant the container to have the same docker creds as the host machine, but it looks like that's not the case here.
Do you know if there is any workaround for this problem?

obsidian rover
median charm
#

No, sorry if it wasn't clear, I am using two different GH workflows, one with python and dagger-cli, the other with dagger-cue

#

They both use the same buildkit instance, but while for dagger-cue everything works with auth for docker.#Pull, the same is not true for dagger-cli/python; after a successful docker login to my private registry, I get the error #1 ERROR: failed to do request: Head "<registry>/v2/python/manifests/3.9-rhel8": x509: certificate signed by unknown authority

twin crow
#

FYI, I documented the summarize of the current state of secrets support, what works and what's not working yet: https://github.com/dagger/dagger/issues/4366 as well as indexing all related open issues. If you think it's not complete, feel free to comment.

GitHub

This serves as an index for all the issues related to secrets. If someone takes ownership on this issue, this should start by triaging the issues list below, corresponding to a proposal to add supp...

bronze hollow
#

I don't think it is on a par with CUE, as with CUE we had the ability to mark stdout of host commands as dagger.#Secret

#
    client: commands: sops: {
        name: "sops"
        args: ["-d", "./secrets.yaml"]
        stdout: dagger.#Secret
    }
wild zephyr
oak sandal
#

You all are amazing to add this feature in so fast.
https://github.com/dagger/dagger/issues/3830
I'll make some time to test this out for my setup. I'm mainly curious about replacing Buildkit with Dagger engine. Currently, I've got this deployed in K8s (without TLS options) : https://github.com/moby/buildkit/blob/master/examples/kubernetes/deployment%2Bservice.privileged.yaml

Would running Dagger engine be as simple as replacing the image to Dagger engine image and then connecting the SDK via _EXPERIMENTAL_DAGGER_RUNNER_HOST env var?

bronze hollow
#

There’s now quite a lot of active secrets issues currently open and I’m not really sure what the expectations are for a potential solution. It seems like the first take on it will be with session attachables, similar to the Earthly approach?

tepid nova
wet mason
wet mason
#

@wild zephyr 🙏

ancient kettle
wet mason
spiral fog
wet mason
#

@spiral fog This week -- I'm currently running tests locally using the new version

tepid nova
#

Does dagger currently edit out the value of secrets from logs, stdout, stderr, the way it did in 0.2 ?

tepid nova
# bronze hollow Nope

Just to confirm, this is true both for values that are decoded from eg. a json file, and also values that map directly to the raw contents of a file or environment variable? Asking because in your issue you mention the former, but unclear on the latter.

tidal spire
# tepid nova Just to confirm, this is true both for values that are decoded from eg. a json f...

Correct

s := client.Host().EnvVariable("FOO").Secret()
out, err := client.Container().From("alpine").
    WithSecretVariable("FOO", s).
    WithExec([]string{"sh", "-c", "echo secret env var is $FOO"}).Stdout(ctx)

fmt.Printf("container output is: %s", out)

p, err := s.Plaintext(ctx)
fmt.Printf("secret var is: %s", p)
#4 sh -c echo secret env var is $FOO
#0 0.072 secret env var is THIS_IS_SECRET
container output is: secret env var is THIS_IS_SECRET
secret var is: THIS_IS_SECRET%
tepid nova
median charm
#

Hi, I was trying to connect my Python SDK directly to dagger engine, so that I first run docker run --name dagger_engine dagger/engine:v0.3.7 and in another shell first export DAGGER_RUNNER_HOST=docker-container://dagger_engine and then execute my python script trying to pull Python from a private registry, and then printing its version.

The problem that I am encountering is that I am getting x509: certificate signed by unknown authority.
From what I see from the help, dagger/engine is a wrapper around buildkit, so that I should be able to provide my CA certs and then use the flag --tlscacerts /some/path/certs.pem.

The problem, however, is that locally I cannot mount any volume to docker containers. I though then about two solutions:

  1. host dagger on a k8s cluster where I can mount volumes (so CA certs), and then port-forward remote 1347 to local 1347 -> PROBLEM: how should I setup DAGGER_RUNNER_HOST, since tcp:// links are not supported?
  2. force dagger/engine to use another buildkit instance, i.e., an instance on k8s that I can port-forward to local to another port -> PROBLEM: is that even possible?
median charm
#

Hi I was trying to connect my Python SDK

wild zephyr
white pilot
#

Not sure if this is some engine-related bug, my setup, or something else, but since adopting Dagger in our new monorepo, we frequently run into No space left on device coming out of buildkit:

#18 ERROR: open /var/lib/buildkit/runc-overlayfs/snapshots/snapshots/24/fs/opt/az/lib/python3.6/site-packages/azure/mgmt/network/v2017_10_01/aio/operations/__pycache__/_operations.cpython-36.pyc: no space left on device

Does this require some configuration on our gitlab runners? And is there a way to provide a more helpful error message for this? I wouldn't be surprised if this somehow relates to running docker:dind.

wild zephyr
white pilot
white pilot
wild zephyr
white pilot
rancid turret
wet mason
#

@civic yacht @still garnet @stray heron I believe https://github.com/dagger/dagger/pull/4340 is ready to go

I did a self-review to point out a few things. I recently switched telemetry to a different target, and noticed some weirdness with the numbers. I believe it's because of how buildkit reports started/completed times, and the fix is to implement the same nasty mess buildkit does (interval list)

wild zephyr
wet mason
# wild zephyr hey folks, is there any README file somewhere about how to set the OTEL stack? I...

Hey sorry, was not around. We're using grafana cloud (which is basically a hosted Loki/Tempo/etc) to ship/store OTLP data. Do you mean setting up a different account or bringing up the stack on your own?

otel libraries support a standard set of env variables which we're using in #3826: https://opentelemetry.io/docs/concepts/sdk-configuration/otlp-exporter-configuration/

We're sending those straight to Tempo (hosted by grafana labs), but you could also send them to an OTEL Collector instead (https://opentelemetry.io/docs/collector/) and configure it to ship them somewhere else

OpenTelemetry

Environment variables for configuring your OTLP Exporter.

OpenTelemetry

Vendor-agnostic way to receive, process and export telemetry data.

wild zephyr
wet mason
#

Oh, not really. OTEL/OTLP is a format, there’s different consumers. I’ve used Jaeger in the past and I think we still have 0.2 docs about it (there’s a compose tracing yml file in the repo IIRC)

#

That’s for traces though. Logs (e.g. events) are a different beast. Never used Loki locally

#

For the grafana dashboard itself, Gerhard has a json config file somewhere

wet mason
#

@wild zephyr btw — not sure what your use case is, but you could just send those in the hosted version and build a different dashboard

wild zephyr
wild zephyr
#

Do we have a planned release this week? 🙏 cc @stray heron @wet mason @civic yacht

wet mason
#

Hey, sorry @wild zephyr, forgot to get back to you. I was catching up on the latest RELEASING.md changes, haven't released myself in a long while

#

I think we should release soon (today ideally), haven't released in a couple of weeks. Is that ok for Python/Node @rancid turret @wild zephyr ?

wild zephyr
rancid turret
tepid nova
#

Thinking about your entrypoints list @still garnet ... And a comment I read in the issue. We've been assuming that we can't use any of the pipeline code as entrypoint, because it's trapped in arbitrary code - basically a black box until we execute it. But I'm wondering if maybe we should revisit that.

#

Perhaps we could use the new Pipeline() directive, and make those pipelines the entrypoints? Then the session could be configured (via dagger run or otherwise) to selectively run some pipelines, and not others

#

The client would still run as a whole of course. But that's cheap (or at least it's reasonable to expect the client developer to make it cheap). But each time it makes a Pipeline("foo") call, the engine would make a decision on whether to actually run the following instructions, or silently drop them

still garnet
#

yeah, I think we've been talking past each other, since I assumed we wanted those pipelines to be the entrypoints already

#

just wrote a big old reply that surfaces these questions 😄

tepid nova
#

I'm open to any design. My starting point is simply what our users do today, eg. with mage.

#

The de-facto entrypoint today for a project, is a project-specific executable tool, which most commonly takes a pipeline name as its first command-line argument.

tepid nova
wet mason
#

@civic yacht Would you have some time to cut a release today? The steps have changed quite a lot since I've last ran them. Or I can do that on my own -- is it pretty smooth overall?

civic yacht
# wet mason <@949034677610643507> Would you have some time to cut a release today? The steps...

Last release was rough but that was due to all the changes around the CLI/sessions/etc. I ironed out the problems afaik but I can't be 100% certain its smooth again until we do another one. So I don't mind running through it again this time but in general hopefully it's something we can all share the load on.

That being said, I'd like to wait for word on which version to use for the nodejs SDK. Last time I asked last minute and just bumped the patch but later it turned out a minor bump was preferable

#

Also want to make sure the node SDK doesn't have any half-way done features that need more merges before releasing

wet mason
#

100% on sharing the load. Maybe we should have a rotation, so everyone is up to speed with the release process? /cc @twin crow

#

Also builds incentives to automate all of it 🙂

civic yacht
mellow bolt
mellow bolt
wild zephyr
rancid turret
civic yacht
wet mason
#

Ah. Permission issues, it's because it runs from a PR ...

#

/cc @stray heron

stray heron
#

Looking into this now. ETA Monday.

stray heron
wild zephyr
#

@civic yacht @wet mason happily volunteering to release and/or shadow someone for the release this week

civic yacht
#

Happy to be around for questions if anything comes up while you run through it though

tepid nova
#

Feedback on our own dev workflow. Running ./hack/make lint after not touching the repo for a few weeks is... overwhelming. I just watched what seemed like a massive chain of docker builds, unfold on my terminal for 5 solid minutes. I had no idea what was going on.

#

I think we should stop relying on the default buildkit output for display to the user. I would prefer if it were disabled by default, and instead the tool gave me an output relevant to the task at hand (linting)

#
$ ./hack/make lint
<insert output relevant to linting>
wet mason
#

@tepid nova that's the frustration that led me to prototype the terminal UI

wet mason
#

@tepid nova The tricky part is that pipelines are asynchronous so:

printf(“building the engine”)
WithExec(“go build”)

We’d be printing “building the engine” right away, but it actually gets built by a different function later on!

We rely on async behavior throughout the CI code quite a lot (helper to get the repo code, helper to set up a go env, python, node, …), basically our “targets” combine a bunch of helpers together to assemble a final target, so any message we’d output in helpers would be “too soon”

In case of errors it’s worse:

“building the engine”
[bunch of messages about other things]
[much much later …]
“go build failed” <— referring to the engine

Current system is not human legible. It’s just enough so when it fails, the output is useful enough to figure out why

#

@tepid nova the best way around this IMO is visualisation “aided” by code (provide hints of what’s going on using Pipeline()). Terminal visualisation for local dev, and possibly someday (wink wink) cloud visualisation for CI

tawny flicker
tepid nova
#

I agree with that, it's hard for the client code to print information about pipeline execution to the user directly.

BUT that's not the only thing I may want to print. For example, in the linting example what I want to print is the output of the linting tool. Right now that is mixed with all the rest of the logs. But from the client tool's perspective, it's not just another log event - it's what I want to print to the user.

#

So I think there's 2 distinct visualization problems:

  • Problem 1: when you want to visualize the engine logs, it's hard to follow. Solution: better viz UI in the terminal, then in web

  • Problem 2: when a Dagger client wants to control its own output, we don't give clear guidance on how, or whether they should. It's all or nothing: dump the logs, or get no output

#

An example of problem 2, in the context of the linter: I want to see my linting summary. I don't care about dagger logs. How do I get that?

  • Option 1: client tool implements its own output logic.
output, err := lintPipeline.Stdout(ctx)
if err != nil { ... }
fmt.Printf("Linting result:\n%s\n", output)
  • Option 2: give pipelines an API to control their own user output (can't just be exec stdout! too noisy)
???
#

maybe as extra inputs to Pipeline()?

wild zephyr
#

👋 just wondering if there's a reason why we're not currently re-using cache upon engine updates. Since dagger engine volumes are not named, when the engine image changes, all my buildkit cache gets invalidated. cc @civic yacht @wet mason

tepid nova
#
lintPipeline := c.Pipeline("linter", &PipelineOptions{
  // If Info points to a container, use its stdout as a markdown-formatted info field for the pipeline
  Info: c.blabla.blabla.Exec("lint-tool")
}
wet mason
wet mason
tepid nova
#

Right. Which we need anyway

#

But I realize now there's a big design decision underlying this - how are developers supposed to handle output to the user. In that regard, is Dagger A) a library, or B) a framework.

wet mason
tepid nova
#

But I guess we can have a default rule to decide who wins

wet mason
#

Right

tepid nova
#

Sub-pipelines are OK assuming each sub-pipeline has its own status field. If a pipeline has lots of sub-pipelines, and no exec of its own, it just has a blank status (no rollup by default)

#

The problem in this model is that, eventually we're going to want callbacks

wild zephyr
civic yacht
# wild zephyr Interesting.. I would have expected buildkit to handle this gracefully internall...

(sorry was in interview, catching up) Yeah buildkit does actually attempt to handle this sort of but it has had multiple problems in the past so I'm not at the point where I fully trust it. There are also cases where dagger might want the cache to be invalidated. E.g. if we change something about the shim behavior, the shim is not (currently) part of the cache key for execs so then execs won't actually re-run if you upgrade and that could cause problems

#

There are better solutions than just getting rid of the previous cache, but it was simplest to start there

wet mason
still garnet
#

@wet mason @tepid nova one option re: lint-like commands that print output - with bubbletea you can print to an 'alternate screen' that disappears when it's done, so i think that would let us have both?

wet mason
#

Like for a linter the output of from nodejs / yarn install that led to the setup of the linter are not the “meaty” part

still garnet
#

yeah, i'm thinking all that command output could render in a progress UI on the alternate screen, and then the user's fmt.Println(lintStdout) would go to the regular stdout

wet mason
#

Oh! You mean Println from the pipeline code itself, not from a container. Makes sense!

still garnet
#

yeah exactly

wet mason
#

Yep yep

#

Actually I was thinking and had on my to do list: you know how the TUI was showing hierarchical pipelines?

What if the “root node” at the top is the pipeline itself (e.g. go run or dagger run). And the output of that when you look at that node is the stdout of the pipeline code?

Like:

  • dagger run foo
    • sdk/node/lint
      • yarn lint

If you focus on the root node, you’d only see the output of your pipeline code. Not mixed together with container logs

/cc @tepid nova ?

still garnet
#

yup yup, did something similar with bass - but i think you'd still want to support regular stdout, for unix piping or redirecting etc etc

#

and being able to see output after the pipeline finishes

wet mason
wet mason
#

I think it’s more important than the summary bit

#

Ok you’ve convinced me

So basically:

  • while running, it’s an alt screen with complex ui
  • among pipeline status you can see your program’s output
  • upon completion, you see the exact, raw output of your code
#

That feels good. As a user you’re free to handle your code’s output as you see fit. However, while it’s computing, you get this “Heads Up Display” that augments your output with useful information. It goes away on exit, you only see what you meant the output to be

still garnet
#

yeah, you could also potentially have it decide between summary or raw output based on whether stdout is a tty, but i think you'd want the raw output to not be truncated in the summary either way (just so they don't have to redirect)

#

and/or show summary if there was an error

wet mason
#

Yeah, neat. That would make the experience so much better already without any new concepts

#

Also I like the simplicity of it

“How do I display extra information to the user?” —> Printf

On the telemetry side, we could also send the code’s stdout/stderr

still garnet
#

does that rely on having an entrypoint?

#

to capture fmt.Print etc

#

seems like youd need to capture it with a reexec/fork or running it in dagger (dagger-in-dagger)\

wet mason
obsidian rover
#

Is the function marshalValue in sdk/go/internal/querybuilder/marshal.go concurrent safe ? I have the TestGit test failing only when running in parallel of all the Go SDK tests

#

Fake alert 😇 Human error 👼

wet mason
#

I resurrected the POC "terminal UI" code, adapted it to the new Pipeline API and ran it through our live CI!

It "works", but there's some very large issues to fix (e.g. with our CI, there's too many tasks and it doesn't support pagination so it's not really usable)

https://github.com/dagger/dagger/pull/4522

obsidian rover
civic yacht
wet mason
#

@tawny flicker did my comment on secrets help you get unstuck?

tawny flicker
wet mason
#

os.Getenv() or os.Readfile

tawny flicker
#

Right

wet mason
#

depending whether they were passed as ENV or mounted as file

tawny flicker
#

Perfect, easy peasy

wet mason
#

Basically in the shim you're "just like the user" -- so you have to act accordingly

tawny flicker
#

Thanks

wet mason
#

np

hot smelt
#

Some news, I did'nt got accepted to devoxx fr, how about you @tawny flicker ?

obsidian rover
hot smelt
north jay
#

Is there any objection to supporting embedded dagger engines in the go client?

tawny flicker
#

I keep seeing the download of

github.com/containerd/stargz-snapshotter

Why is it downloaded like 3 or 4 times during a run?

#

It's like 40MB, so not fun to download 3-4 times, is there some cache issue about it?

tawny flicker
hot smelt
north jay
#

In the go SDK I want to write libraries that don't need a full *dagger.Client but instead something much more minimal like a *dagger.Container.
Problem is, some cases I always will want to call From() on the container. I feel like a *dagger.Container type in the function signature does not convey what the function is expecting to receive.

I'm wondering if there could be something like a *dagger.Scratch type that is basically a *dagger.Container but the type is explicitly a brand new container with nothing and is just a placeholder.

Of course I could be over thinking it and it really doesn't matter.
Though I think a dagger.Scratch could indeed be a special type that supports .From() where as *dagger.Container supporting .From() is kind of weird.

wild zephyr
# north jay In the go SDK I want to write libraries that don't need a full `*dagger.Client` ...

well.. it's tricky since there are some things from *dagger.Container that are useful to chain in multiple froms. ie:

    foo := client.CacheVolume("mycache")

    c := client.Container().From("alpine").WithMountedCache("/mycache", foo).WithMountedSecret("/secret", client.Host().EnvVariable("VERYSECRET").Secret())

    c.From("ubuntu").WithExec([]string{"ls", "-la"}).Stdout(ctx)
}

^ in this case VERYSECRET and mycache will be persisted across Froms. Abstracting that to *dagger.Scratch will require copying a lot of those methods as well.

north jay
wild zephyr
wild zephyr
#

👋 I'll bump dagger to latest go 1.20 primarily to validate that our tests run ok since there's been a few users that reported some instabilities after bumping to Go 1.20 which we couldn't replicate. LMK in case you have any comments/questions

wild zephyr
still garnet
#

lmao. yaml

#

@wild zephyr could alternatively quote it like "1.20" so we still get patch updates automatically

ancient kettle
#

God, I hate YAML sometimes.

still garnet
#

my favorite part is the kitchen clock syntax, like 03:15. i had a valid use case for it at one point, but i may have forced the issue

north jay
civic yacht
wild zephyr
wild zephyr
civic yacht
wild zephyr
wild zephyr
tepid nova
#

@north jay if I understand correctly, it's having to pass a dagger.Client everywhere that you find too heavyweight?

civic yacht
#

k it passed So definitely some weird

north jay
#

Probably not a good way around it at the moment since I also need cache mounts which requires the client.
Really probably one of those things where it really doesn't matter much.

wild zephyr
obsidian rover
wild zephyr
#

ETA for release start ~1h. Last call for any last-minute changes

wild zephyr
wild zephyr
civic yacht
fair ermine
#

@civic yacht I still have a strange error when I try to install Dagger:

package github.com/dagger/dagger/cmd/dagger
        imports github.com/dagger/dagger/engine
        imports github.com/dagger/dagger/core/schema
        imports github.com/dagger/dagger/project
        imports github.com/dagger/dagger/router
        imports github.com/dagger/dagger/internal/engine
        imports github.com/dagger/dagger/engine/remotecache
        imports github.com/moby/buildkit/cache/remotecache
        imports github.com/moby/buildkit/cache/remotecache/v1
        imports github.com/moby/buildkit/worker
        imports github.com/moby/buildkit/cache
        imports github.com/moby/buildkit/snapshot
        imports github.com/containerd/containerd/snapshots/overlay/overlayutils: build constraints exclude all Go files in /Users/tomchauveau/go/pkg/mod/github.com/containerd/containerd@v1.6.14/snapshots/overlay/overlayutils
package github.com/dagger/dagger/cmd/dagger
        imports github.com/dagger/dagger/engine
        imports github.com/dagger/dagger/core/schema
        imports github.com/dagger/dagger/project
        imports github.com/dagger/dagger/router
        imports github.com/dagger/dagger/internal/engine
        imports github.com/dagger/dagger/engine/remotecache
        imports github.com/moby/buildkit/cache/remotecache
        imports github.com/moby/buildkit/cache/remotecache/v1
        imports github.com/moby/buildkit/worker
        imports github.com/moby/buildkit/cache
        imports github.com/moby/buildkit/snapshot
        imports github.com/moby/buildkit/util/overlay: build constraints exclude all Go files in /Users/tomchauveau/go/pkg/mod/github.com/moby/buildkit@v0.11.1/util/overlay
#

Same with ./hack/make engine:build

civic yacht
fair ermine
#

Snif 😦
Weird the CI hasn't spotted that

civic yacht
# fair ermine Snif 😦 Weird the CI hasn't spotted that

Yeah the fix is simple, I'm adding a workflow that will catch this too. The problem is that in PRs we only ever built for the host platform (to save time). My last PR fixed a problem with the engine image but now have to do the same for the CLI

fair ermine
#

I got it, may we shall run engine-race-detection on both ubuntu and mac too? With a matrix that should not take that much time

civic yacht
fair ermine
#

Okay party_blob

fair ermine
civic yacht
still garnet
#

just noticed we're leaking goroutines in the SDK, in my testing it's 2 leaked per call to c.Container().From(...).WithEnvVariable(...).WithExec(...).Stdout(ctx)

#

seems to be leaked HTTP connections, maybe we're not closing the response body? 🤔 edit: we're closing it

#

going hunting

#

sidebar: i don't know how i went this long without discovering GODEBUG=tracebackancestors=999 but it's super helpful in cases like this. (it shows who created a goroutine in the stack)

still garnet
tawny flicker
#

Am I the only one who use some http.Get on a local webserver to trace some behaviour in the engine/shim/container?

tawny flicker
# wild zephyr not sure I follow Tanguy 👀

I was trying to trace wether my secret scrubber was working or not once integrated in the shim, but logging to the shim sometimes didn't seem to output anything or even would break things (I might mix it up with the provisionning, though).
So in the end, instead of logging to stdout/stderr, I just http.Get to a local ip webserver that the shim, the engine or the container have access to, and I can see what happens in order.

#

I just unblocked myself doing that.

still garnet
#

@tawny flicker one annoyance I've run into is the Dagger client in the tests doesn't flush its logs before the test fails, so pretty often they're cut off. maybe that's what's happening?

tawny flicker
#

Well, it was manual testing

#

There were different reason for failure:

  • first, I was running with the wrong engine
  • It seems my new run didn't use the new shim, but I'm convinced it was because of the cache of the CI code. But after dumping the engine + the volume, the new shim got executed
  • now that our cache show the previous output, I forgot my CI code didn't change and I thought it was executed with the new shim (I didn't use a cache buster at that time) (even though I could have checked the "CACHED" log print.
  • the real bug was: I modified the amount of character written and outputed the newest version with its len. The cmd.Run didn't like that I didn't write the amount it sent to me ^^' (as one should). But the laconic short write error brought me into different rabbit holes and fixes (mutex on the writer, …)
#

anyway, my setup works ok now. But I'm gonna use a cache buster this time :p

wild zephyr
tawny flicker
#

And how do you get the content of the file? On the dagger engine /var/lib/dagger fs?

wet mason
still garnet
#

@wet mason ran into a codegen limitation, which we actually hit before but worked around, but now I'm wondering if we should fix: it's not possible to have foos: [Foo!]! where Foo is e.g. type Foo { a: Int!, b: String! } because you get this error on x.Foos(ctx):

input:1: Field "exposedPorts" of type "[Port!]!" must have a sub selection.

we previously ran into this with envVariables: [EnvVariable!]! but wrote the test with raw GraphQL to work around it. I think looking up exposed ports might be a more common thing to do though, so it'd feel bad to punt again.

i think we could get away with just selecting all of the fields, at least in these two cases, but I have a feeling the rabbit hole goes a little deeper still; the returned type is a slice of objects which have their own methods for querying fields, like (*Foo).A(ctx). but I don't see how that would work when the foo in question is, say, the 12th object returned from the array. seems like we would instead want "dumb" structs with all their fields pre-populated or something. not sure if what I'm saying makes any sense. 🤔

wet mason
#

It's tricky because of the reasons you pointed out

fair ermine
#

Yeah it's really tricky to do :/

wet mason
#

Also because the same object (e.g. Env) can be returned pre-filled (e.g. envVariables) or "on demand" (e.g. envVariable)

#

.EnvVariables[0].Name

vs

EnvVariable("FOO").Name(ctx)

fair ermine
#

Yeah and the issue is that you don't parse it the same way in GraphQL, so you have to deal with type conversion and stuff to handle it...
Since it's not possible to have

type EnvVariable struct {
   Name string
   Value string
}

func (e EnvVariable) Name() string {
   return e.Name
}

func (e EnvVariable) Value() string {
   return e.Value
}

Because EnvVariable cannot have both public field and method with the same name.
And if you make it private, you cannot retrieve data in JSON because the parser cannot access private field to insert data in...

still garnet
#

yeah, definitely seems tricky! hmm

fair ermine
#

So you have to create a intermediate type to retrieve data and fill it in EnvVariable and deal with type conversion
It's really tricky for now and I still haven't found an elegant way to do

wet mason
#

lots of codegen custom code though

#

@fair ermine Is it an immediate problem though?

fair ermine
wet mason
#

We could generate an EnvVariable that only contains public fields

fair ermine
#

And how do you replace the existing binding?

type EnvVariable struct {
    name  string
    value string

    q *querybuilder.Selection
    c graphql.Client
}

// The environment variable name.
func (r *EnvVariable) Name(ctx context.Context) (string, error) {
    if r.name != "" {
        return r.name, nil
    }

    q := r.q.Select("name")

    var response string
    q = q.Bind(&response)
    return response, q.Execute(ctx, r.c)
}
#

If name become public, how the user can select with .Name(ctx) ?

still garnet
#

i wonder if we could annotate certain types as "dummy" or "eager" or something, and they'd always have regular fields instead of the getter methods? that seems like what I would want for EnvVariable, Port, and also FileInfo (from Directory.Stat)

fair ermine
#

The solution I tried to make was somehow close to what you think

I created a "dummy" temporary type with public field to catch values and then I convert it to EnvVariable by inserting private fields

still garnet
#

yeah, sounds the same, I guess I'm proposing to just not even have the methods for these types, so they're just a struct that we unmarshal into

#

(not sure how this would translate to the other SDKs)

fair ermine
#
    type envVariable struct {
        Name  string
        Value string
    }

    var response []envVariable
    q = q.Bind(&response)

    // Then make the conversion
#

That what I'm working on for now, I need to handle a proper conversion that will be elegant, I got new ideas that I'm gonna try today

wet mason
#

not sure which pattern we should adopt but they should be consistent

tepid nova
#

Depending on which road we take for handling dynamic secrets, Host.EnvVariable may in the future become redundant?

tepid nova
#

@still garnet asking here to avoid a tangent-on-the-tangent in the github thread: re this: https://github.com/dagger/dagger/blob/c48df04f949a0eb94567fdc75a213ff6b7fb6330/core/integration/services_test.go#L1-L11

--> Perhaps a naive question: should execs never be cached when they have one or more service bindings? And would that solve this particular problem?

GitHub

A programmable CI/CD engine that runs your pipelines in containers - dagger/services_test.go at c48df04f949a0eb94567fdc75a213ff6b7fb6330 · dagger/dagger

#

(separately, I'm fine with an experimental flag to cut scope without breaking the tests)

still garnet
#

hm that's a good question, I'll try to think of an example that makes the answer obvious

#

oh, actually, one example might be databases for tests

#

we probably want to cache those test runs

#

so far I haven't really run into a case where I didn't want it cached, fwiw. the only awkward part is we still start/detach the service redundantly.

tepid nova
#

Assuming we had a simple cache: bool option in the future, we could change its default behavior depending on service bindings (no service: default to true. servide: default to false), but you could still override explicitly either way?

#

I may have misunderstood the problem - thought that tricky test situation was caused by indesirable caching of a client exec

still garnet
#

it's caching that we still want in the end, but the issue for the tests was that calling Container.Directory/Container.File forces the container to evaluate ahead of the time that the test wanted it to

#

for example, a test that asserts that we start the service when we try to read the contents of a file that came from a container that uses a service. it would result in a false positive because getting the file forces the service start + container evaluation before we can even try to read its contents

#

it's a kind of confusing thing to explain 😅

#

in my limited experience using a feature like this, I think caching by default still makes sense, but I can totally see how there may be cases where you don't want it to be cached once services are involved. but the test situation is more of a special case for the tests

tepid nova
#

Ok 🙂 Thanks for explaining it to me

still garnet
#

np!

still garnet
#

(reran 3 times, same failure)

ancient kettle
still garnet
#

here's the diff for my case - it's just whitespace!

--- sdk/nodejs/api/client.gen.ts
+++ sdk/nodejs/api/client.gen.ts
@@ -409,7 +409,6 @@
    */
   Udp,
 }
-
 /**
  * The platform config OS and architecture in a Container.
  *
@@ -2605,6 +2604,7 @@
 /**
  * A port exposed by a container.
  */
+
 export class Port extends BaseClient {
   /**
    * The port description.
#

super weird

ancient kettle
#

Hmm...

#

That is weird.

#

🤷🏻‍♂️

still garnet
#

our actions runs don't have any magic caching right?

wild zephyr
wild zephyr
still garnet
#

oo good to know

#

@wild zephyr would you mind sharing your resolv.conf?

#

...assuming there's nothing sensitive in there

ancient kettle
wild zephyr
# still garnet <@336241811179962368> would you mind sharing your `resolv.conf`?
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search .
still garnet
#

huh, where's it getting fe80::1%2?

#

oh, how about /run/systemd/resolve/resolv.conf

wild zephyr
# still garnet oh, how about `/run/systemd/resolve/resolv.conf`
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 192.168.1.1
nameserver fe80::1%2
search .
still garnet
#

i guess it's not happy with the "zone" part of the ipv6 address (%2), i'll confess i have no idea what the significance of that part is

#

which is kind of annoying, because the intention is to just pass these values straight through Docker and into the container's resolv.conf, so stripping it off feels weird. seems like an overzealous CLI flag validation 🤔

#

apparently the zone refers to the network interface to use, either numeric index or name (e.g. eth2) - so carrying it into the container verbatim could be wrong anyway, since it's a different network namespace with different interfaces. hmm

#

@wild zephyr what does docker run alpine cat /etc/resolv.conf say for you? wondering which values docker carries in by default

wild zephyr
# still garnet <@336241811179962368> what does `docker run alpine cat /etc/resolv.conf` say for...
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 192.168.1.1
search .
still garnet
#

ok cool, looks like it skips ipv6

wild zephyr
#

I think docker run only looks at /etc/resolv.conf

still garnet
#

i think it went through the same logic actually

#

since 192.168.1.1 comes from the systemd config

wild zephyr
#

yeah.. just noticed

#

maybe because I didn't enable ipv6? I think there's a flag in the engine for that

#

yep dockerd --ipv6

still garnet
#

interesting

wild zephyr
# still garnet interesting

@still garnet docker run with ipv6 enabled:


# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 192.168.1.1
nameserver fe80::1%2
search .
still garnet
#

weird, it didn't follow through to the systemd config?

#

oh nvm (saw edit)

wild zephyr
#

^ just edited the snippet. Old clipboard buffer 😄

still garnet
#

how about docker run alpine ip addr

#

curious if that zone (%2) even refers to a valid interface

wild zephyr
#
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
68: eth0@if69: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2001:db8:1::242:ac11:3/64 scope global flags 02 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:3/64 scope link tentative 
       valid_lft forever preferred_lft forever
still garnet
#

seems busted since there's no 2: interface, but what do i know. i'm going to just have it skip ipv6 addresses, we'll see if someone complains

#

thanks for the debugging 🙏

wild zephyr
#

🚀 . LMK when you update the code and I can help you checking the linting thingy 😛

still garnet
#

already pushed 🙂

wild zephyr
#

👍 running lints

wild zephyr
#

@still garnet I'm wondering if it could be a race condition due to GHA runners being slow. I've noticed that the lint mage target does yarn lint and lintGeneratedCode in parallel. 🤔

still garnet
#

hmm

#

I've rerun it a bunch of times and it's consistently failed

#

but I thought I remembered at least one time where it fixed it

#

can't find any proof of that in the run history though

wild zephyr
#

we found the issue sdk generation issue. Rebase with main was needed.

still garnet
frank wren
#

Hello, I hope this is the right channel for this question. I am looking to contribute to dagger, and I was wondering if there is any specific label for beginner friendly issues. I tried searching for them but couldn't find one :), thank you!

tepid nova
#

Asking here is also an excellent way to get ideas 😁

frank wren
tawny flicker
# still garnet anyone else run into this sdk/nodejs failure? <https://github.com/dagger/dagger/...

How did you rebuild it yourself? With client-gen? Or ./hack/make sdk:nodejs:generate?

The crux of it is client-gen will generate right out of Go. But the code might have a little too many whitespace sometimes, and the biggest part is the split of a function when the line is too long.
./hack/make sdk:nodejs:generate will make it OK as it will pass that generated code through a yarn lint --fix that will make it OK.
You need to commit the generated code in your PR as well.

I guess we need to document that and/or add some CI part that will check if that part of the code needs to be regenerated is actually done. (I guess any modification in the GraphQL schema in a PR should check if the SDK code has been submitted/committed as well.)

still garnet
tawny flicker
ancient kettle
#

@civic yacht Do you have any time to chat today? I've got some interesting info from some runs with the new caching code... and I'd love to chat with you about it.

civic yacht
ancient kettle
civic yacht
still garnet
civic yacht
still garnet
#

it's a big one 🙂 so no rush

still garnet
#

@civic yacht I'm looking into updating the engine setup docs for services, which I think only requires documenting the resolv.conf changes. but I think I can hide that complexity by pushing it into the entrypoint. seems possible by 1) copying the contents of the current file, 2) unmounting it, and 3) writing it back as a regular file with our DNS server + search domain added. the only catch is it needs --privileged, which we already have/need. does that sound crazy? 🤔 main advantage is we don't need to pass --dns flags to Docker when running the engine.

#

i guess that could be an issue if/when we want to support rootless

#

i'm happy to also just update the doc, only iffy part is that we'd start documenting addresses like 10.87.0.1 (which was literally chosen by me asking "someone give me a number from 0 to 255" and @ancient kettle said 87 😂)

#

i'll try it out and see if anything blows up

civic yacht
# still garnet i'll try it out and see if anything blows up

Your message gave me flashbacks to weirdness around docker and having to use fsnotify to synchronize resolv.conf on certain distros like Ubuntu that I had to deal with back in the day (may no longer be relevant), so I was doing some refreshing on stuff before answering you, but yeah if you give it a try and it looks good to you then :shipit:!

still garnet
#

ah gotcha, thanks for checking. that does raise an interesting point, if someone changes their DNS config on the host I suppose it wouldn't propagate to the dagger engine container anymore, now that it isn't a bind-mount

still garnet
#

pushed! all tests pass, seems OK

#

@civic yacht also: I removed dumb-init - it was only needed before because the CNI plugin kept killing and restarting dnsmasq, but I removed that. I could keep dumb-init, but I figured we might want to notice zombie leaks anyway

civic yacht
# still garnet ah gotcha, thanks for checking. that does raise an interesting point, if someone...

Yeah it's hard to piece together exactly what the behavior is with docker since it depends on the version, whether the host is running systemd-resolved, the various knobs of docker container config, etc. It seems like in some situations there is synchronization happening (which the umount might break depending on how docker implemented the sync) but in others that doesn't work anyways: https://github.com/docker/for-linux/issues/889

So in conclusion 🤷 . May just have to ask for a little more beta testing on different platforms before next release than usual. LGTM for now!

civic yacht
still garnet
wet mason
#

Took a quick look last week, I can review post-merge no need to hold up

still garnet
#

sounds good - merged! party_blob

ancient kettle
tawny flicker
obsidian rover
wild zephyr
still garnet
still garnet
#

think i found a fix for the engine not flushing progress logs -- at least in the Go SDK: https://github.com/dagger/dagger/pull/4615/commits/d737b5f8f1911177815331f35d3b5307ebab0643

tl;dr we were closing stdin but then just letting the session command die via the Pdeathsig: SIGKILL flag. so this commit makes client.Close() wait on the session command after closing stdin.

i removed the go proc.Wait() - I think it should be redundant since folks should already be calling Close() to release resources (I'd consider a zombie process to be a resource). (cc @civic yacht as the go proc.Wait() was a recent fix for leaking zombies)

not sure if we need a similar fix for the Python/NodeJS SDKs (cc @rancid turret @mellow bolt)

ancient kettle
still garnet
#

it has proven to be a reliable address range salute

ancient kettle
#

😄

flat lagoon
#

Hi, is there breaking changes in github.com/dagger/engine@v0.3.13?
Since I upgrade from v0.3.12, network fetch become very slow.

I found a new search dns.dagger in /etc/resolv.conf.
What about this? could I disable it?

https://github.com/dagger/dagger/issues/4652

GitHub

What is the issue? May cause by #4505 Here a new search in /etc/resolv.conf added before k8s searches search dns.dagger search gitlab.svc.cluster.local. svc.cluster.local. cluster.local. openstackl...

tepid nova
still garnet
still garnet
still garnet
#

@civic yacht thanks for the review, I added the GraphQL schema docs but I'm hard-blocked on Docker Hub rate limits 😭 would you mind running sdk:all:generate for https://github.com/dagger/dagger/pull/4657?

(or anyone else that sees this)

still garnet
#

not sure why I'm hitting this rate limit, I thought it would use the Docker credential helper, and supposedly I'm not even at my limit for unauthenticated requests 🤔 guess I'll investigate since I'm blocked otherwise

#

oh, rip, I hit my authenticated limit, not my anonymous one. that's fun. i guess i'll...log out?

civic yacht
#

ironically my first attempt resulted in Head "https://registry-1.docker.io/v2/library/golang/manifests/1.20.0-alpine": dial tcp: lookup registry-1.docker.io on 10.0.2.3:53: read udp 172.17.0.3:46260->10.0.2.3:53: i/o timeout

But removing my existing dagger-engine.dev and other engine seems to have fixed that, it's running now I think

still garnet
civic yacht
still garnet
#

merging, also have a follow-up that might fix the network issues, but hard to be sure without a repro. will open a separate PR

still garnet
#

follow-up PR: https://github.com/dagger/dagger/pull/4658 - @civic yacht this should also allow changes to propagate to resolv.conf since it relocates the bind mount (which is, apparently, a thing that is possible)

still garnet
still garnet
#

@wild zephyr do you know of any content there for running Dagger pipelines and/or the engine in K8s?

still garnet
still garnet
#

yup yup, was just looking into it to see if it makes sense to preserve these options for our engine and found a bunch of posts about it

#

i think it makes sense to leave alone, so people can reach services/etc. from their dagger pipelines with the normal semantics

hasty basin
#

right, makes sense. I wonder if the "fully qualified end dot" approach works as advertised...

If you have few static external names...adding a . at the end.
e.g. google.com. vs google.com

still garnet
#

yeah, i've never actually seen that used. like what does that mean for config params, http://example.com./foo?

hasty basin
still garnet
#

haha just tried it, doesn't work in my browser but who knows what's going on there

still garnet
#

@stray heron we should probably ship v0.3.14 now that https://github.com/dagger/dagger/pull/4657 is merged, plus https://github.com/dagger/dagger/pull/4658 which is an unverified fix that folks experiencing issues can try by opting-in

I can do the release tomorrow, just dropping a ping in case you're interested in doing it in your timezone or want to use it as an opportunity to test any release process changes you've made 🙂

(trying Discord's new @silent feature to avoid a 1:23AM ping, hopefully it works)

cc @cosmic cove @wet mason

flat lagoon
#

mounted configMap is not writable.

still garnet
#

@flat lagoon thanks so much for testing it, definitely didn't think of that 😅 working on a fix!

flat lagoon
#

I try to disable without set _EXPERIMENTAL_DAGGER_SERVICES_DNS=1.
Then it still not work well.

still garnet
#

what happens? do you have any error output to share? or is everything just slow?

#

would you mind sharing /etc/resolv.conf?

flat lagoon
#

@still garnet once you fixed /etc/dagger/engine.toml writable issue,
I can continue to test with _EXPERIMENTAL_DAGGER_SERVICES_DNS=1

still garnet
#

just confirming, is the issue still happening with it unset? or is it OK?

flat lagoon
still garnet
#

ok great

#

will ping when I have a new image!

#

@flat lagoon ok, you can try sha256:a184e9a44df789ee493af53d9cce531d2cd079e2e9dedc06129306cf60b344d7 (aka vito/dagger-engine:dev2)

flat lagoon
#

people may a full buildkit.toml

still garnet
#

oh, can you remove it from your configmap?

flat lagoon
still garnet
#

i was just wondering about this, we probably want user-provided config to be additive

flat lagoon
still garnet
#

oh i see, so you had other config values set under [worker.oci]

#

hmm, tricky

#

might be time to abandon bash 🤔

flat lagoon
#

@still garnet when set _EXPERIMENTAL_DAGGER_SERVICES_DNS=1. network issue back
Connection failed [IP: 123.125.16.219 80]

flat lagoon
still garnet
#

yeah something like that

flat lagoon
#
# cat /etc/resolv.conf
# dagger dnsmasq server
nameserver 10.87.0.1
search gitlab.svc.cluster.local. svc.cluster.local. cluster.local. openstacklocal
options ndots:5
still garnet
flat lagoon
#
/ # env | grep _EXPERIMENTAL_DAGGER_SERVICES_DNS
_EXPERIMENTAL_DAGGER_SERVICES_DNS=1
still garnet
#

each container has its own IP address now, maybe it's unable to navigate the network due to a firewall or something? not sure, just guessing

flat lagoon
still garnet
#

network troubleshooting

#

gerhard 1272 we should probably ship v0

tawny flicker
#

I have been thinking for a while on how to execute a native SDK lang function directly as a step in the pipeline, and I can't find some standard way to execute this without going down to CLI (that should be built by the step), or have some kind of marshalling of sort.

The idea would be to transparently do this (example in Go)

client.Container().From("alpine").
  WithFunction(func() error {
    // whatever Go logic I want here
    return nil
  }).ExitCode(ctx)

Or maybe, it's just something that would come naturally with the extensions and the marshalling/query would be GraphQL? WDYT?
cc @wet mason @tepid nova @civic yacht @still garnet

rancid turret
tawny flicker
# rancid turret Have you seen this issue? https://github.com/dagger/dagger/issues/3885

Yes, it's more to compose with existing dagger API. But I don't see how to execute actual Go code inside a pipeline (for example, a user yesterday was limited to get a directory files on 1 level, no recursitivity was allowed by the API. So then, he was limited to do a WithExec([]string{ "find", "." }) and possibly parse/filter then.
I was thinking more of a

WithCall(func() error {
  filepath.Walk("/", func(path string, info fs.FileInfo, err error) error {
    if err != nil {
      return err
    }
    if info.Name() != "myfile" {
      continue
    }
    // do some action on the file
    
    return nil
  })
})
rancid turret
#

Yeah, that sort of thing was requested by a user a few months ago.

#

Apart from running that go code in a container (and without exporting the dir to a temporary location), at a glance I don't see how it can be done.

tawny flicker
#

I get each WithCall would be with the same language of the SDK. Or we would have some
WithGoCall(), WithPythonCall(). And each would spin some language container to build/execute the code in the right runtime and return afterwards.

rancid turret
#

Yep

#

Problem is you need to put that code in a string to inject in the container.

#

I can do it in Python because it's very dynamic.

#

But it's a reach, I think.

tawny flicker
#

But then, if we have the extensions, then the current SDK API would be extended with the extension API and whatever input output would be marshalled by graphql, and who cares about the language. So maybe it's kind of a not so interesting topic since extensions will come eventually.

tawny flicker
#

Yeah, I guess it's too tricky.

rancid turret
#

Exactly.

#

Similar use case as well: "Just looping through some files and checking the content for example. In general executing some repository or environment dependent logic."

final walrus
tepid nova
#

No ETA yet but there is a thought by @still garnet and @civic yacht that solving the relatively short term problem of project entrypoints (ISSUE) can be done in a way that we get extensions too. So that’s very interesting to me, getting there in pragmatic incremental steps

#

@final walrus do you have a specific reason for asking? Idea or deadline?

final walrus
wet mason
#

@fair ermine Hey -- you pinged me for a PR on Discord but I can't find the message anymore (too many channels syndrome). It's the list return right?

fair ermine
#

I'll improve the description of the PR based on Helder's comment tho

tepid nova
still garnet
civic yacht
still garnet
still garnet
#

possibly controversial API proposal: https://github.com/dagger/dagger/pull/4723

tl;dr I think _, err := container.ExitCode(ctx) is a massive footgun; it shows up all over the place and when https://github.com/dagger/dagger/issues/3192 is fixed these will all turn into silently discarded failures

it seems worth adding a convenience for "run this container and error if it fails" since it'll come up in nearly every pipeline, so I added err := container.Run(ctx) with a bit of codegen magic to handle a new Void scalar type in the schema

cc @tepid nova

civic yacht
# still garnet possibly controversial API proposal: <https://github.com/dagger/dagger/pull/4723...

I'm all for adding .Run but I never thought we would change the behavior to make exiting non-zero a non-error when 3192 is fixed, I always imagined we'd add a field like AllowFailure, in which case non-zero exit code doesn't result in an error. It's honestly pretty easy to do at this point because of the plumbing we added to be able to get stdout+stderr in the error code. But even still Run will be good to have in place too

still garnet
#

hm, that feels a little non-obvious to me - like an artifact of the current progression (ExitCode being what everyone calls for lack of a more obvious alternative => improving it by having it include stderr). I don't think it's something we would have designed intentionally. also worth noting the same issue affects Stdout/Stderr

#

for example i think calling Run followed by ExitCode/Stdout/Stderr to collect metadata about the failure would be cleaner without having to pass AllowFailure along to all three

#

i guess it is cleaner to be able to skip Run if your intention is to collect the output (e.g. for parsing) and short-circuit on failure though

#

that's probably the lesser evil to Stdout silently returning "" on failure 🤔

#

i think it's at least true that during the initial API design we wanted them to just return whatever the value was (hence the big commented out test for ExitCode on nonzero exit), but yeah, it might be safer to just have it error everywhere

civic yacht
# still garnet hm, that feels a little non-obvious to me - like an artifact of the current prog...

It's subjective but seems like a pattern common in my experience (go exec.Cmd returns non-nil error when exit code is not 0, I think the python lib I used threw an exception in that case, etc.). I agree that having AllowFailure all over would probably be gross.

In go at least, another option would be to just set both non-nil error and the actual other return value (exit code, stdout, stderr). I don't know how that'd work in python with exceptions (genuinely, maybe that's totally normal)

#

maybe the low level api should just provide all the info (error message and whatever was requested of exit code, stdout, stderr) no matter what and it could be a per-sdk decision on how to most idiomatically handle it

still garnet
#

I'm good with just adding AllowFailure 👍

#

much less dangerous for hastily written things that need to e.g. grab output of a command and parse it

civic yacht
#

Well this is all theoretical discussion because I don't have time to work on it immediately 😄 and either way having Run is good since it's just more obvious than ExitCode when you don't care about any output and just want side-effects

#

Not to mention Void, which is far superior than just returning a meaningless bool like we did in the past

still garnet
#

I'll just capture the outcome of this convo on the issue + PR for now (if it's not already, haven't skimmed 3192 in a bit)

tidal spire
#

Reading through this discussion, one potentially confusing part thing I imagine seeing a lot of would be things like chain.chain.chain.Stdout().Run() or .<other leaf node>.Run() since people already get confused with the lazy evaluations on other leaf nodes

still garnet
#

or do you mean generally people sprinkling .run() everywhere to try to un-lazy it?

#

side tangent: I'm working on a host => container networking PR, which leverages .Run() to also publish ports. felt more natural than doing it on ExitCode/Stdout/Stderr

tidal spire
still garnet
still garnet
#

but yeah, will have to watch out for any "sprinkling" behavior/misunderstandings

#

it helps that it returns 'void' so it can only possibly be used at the end of a chain

tidal spire
#

yup that makes sense. I guess I'm not super worried about it since most people that would potentially trip on that will only be using .Run() and not the other nodes anyway. They may still trip on it with things like Publish() but we just have to document it well

rancid turret
#

@still garnet how is a hostname determined (looking at endpoint specifically)?

still garnet
#

well, a hash of the digest just prior to setting the hostname, since there's a chicken-egg problem there

still garnet
#

@civic yacht might be onto something with the iptables: Resource temporarily unavailable issue 🤔 https://github.com/dagger/dagger/pull/4711/commits/2f17c1e7f4e273f3d13dc080e9b0dcafe5a18880 (it's pretty stupid simple in hindsight)

can't confirm though, since I'd blow through my Docker Hub rate limit real fast. 😅 hoping Actions just passes or fails immediately, since it failed in just one try last time.

side note: I'm tempted to have our tests use a registry mirror or something soon; I've had occasional days where I end up blocked by the combined authed+unauth'd rate limit (300).

civic yacht
still garnet
#

yeah same 😅

civic yacht
#

side note: I'm tempted to have our tests use a registry mirror or something soon; I've had occasional days where I end up blocked by the combined authed+unauth'd rate limit (300).
yeah we should switch to gcr or something else without limits. for the dev engine, we could just put the registry mirror config in engine.toml

still garnet
#

two runs succeeded so far, which is better than before. first run failed on a seemingly unrelated 10m test timeout. github apparently caps you at 2 single-job reruns so now i'm rerunning all jobs (🤷‍♂️) (edit: just a weird UI thing, you can still rerun by clicking the ♻️ icon in the sidebar)

still garnet
#

hit the test timeout again. We're brushing against the default 10 minute timeout now; here's one that passed in 8.75 mins: https://github.com/dagger/dagger/actions/runs/4378197520/jobs/7662629715#step:5:6437

I figured this might happen with all the new tests for services. I like that I can trust our tests, but it's a shame how slow they're getting. At least it's a decent benchmark and incentive for performance improvements.

edit: hm, 4/7 runs have timed out now, maybe it is my PR? has anyone else hit this?

tawny flicker
#

I got some PR got a lot of tests failing with timeout. It's not very reliable.

still garnet
#

just noticed all 27 services tests aren't using t.Parallel(), so that may be some low-hanging fruit

still garnet
#

yup, down to 6.3-7.3 minutes now 🎉

tawny flicker
#

Also, I was thinking that we need to make the image version homogeneous.
Right now, we have:

"alpine"
"alpine:3.15.6"
"alpine:3.16"
"alpine:3.16.1"
"alpine:3.17"
"alpine:3.17.2"
"alpine:latest"

Same with

"golang"
"golang:1.18.2-alpine"
"golang:1.18-alpine"
"golang:1.19"
"golang:1.19-alpine"
"golang:1.20.0-alpine"
"golang:latest"

I bet we can shave off some image layer download + apk download thanks to all using the same images.

still garnet
#

yeah, we should really pull these out into consts or something

wild zephyr
#

those moments when you're troubleshooting a bug and then you find a new one.. cc @still garnet

still garnet
bronze hollow
#

Dynamic Secrets PR looks pretty close ...

#

Was the group API / pipeline stuff ever demo'd or documented?

#

The snippets on the homepage seem to show code I can't write myself:

https://dagger.io/

#
import Client, {connect} from "@dagger.io/dagger"

connect(async (client: Client) => {
  const ctr = client
    .pipeline("test")
    .container()
    .from("alpine")
    .withExec(["apk", "add", "curl"])
    .withExec(["curl", "https://dagger.io"])

  const result = await ctr.stdout()

  console.log(result.substring(0, 300))
})
mellow bolt
#

Do you have issues with .pipeline() ?

bronze hollow
#

Restarting TypeScript language server fixed it 🙂

mellow bolt
#

great 🙂

bronze hollow
still garnet
bronze hollow
tawny flicker
tawny flicker
bronze hollow
#

Can I see it, @tawny flicker ? When will it be tagged and released?

tawny flicker
bronze hollow
#

Awesome. Thanks for the code. Nice and simple API

#

Hopefully the secret obfustication fix for white space rolls out at the same time

#

That would be all my blockers gone 😀

wild zephyr
cosmic cove
#

@fair ermine can you take these? ☝️

fair ermine
#

I can take the #4757 for now 🚀

still garnet
white pilot
#

Had a thought about building libraries for dagger: I'd love to open source some tools I'm building; I do a fair bit of work with monorepos, and so have built patterns for resolving cross-service dependencies and mounting only the deps you need for build.

However, what would be really awesome would be if there was a way to generate these utilities, same way the dagger client itself is generated. Is there some way to take my code, convert to the GQL query itself, and then render out a new client for it somehow? 🤔 that way, I can write my utils in whatever language (I have some in TS, and some in Go), but benefit the entire Dagger community (it can also help us internally if we want to explore porting our dagger code to another lang). (I figured engine-dev was the closest I'd get to codegen)

tepid nova
white pilot
tawny flicker
# white pilot Awesome! Got an issue I can follow?

I can't find any that would describe our new version of the extensions. Everything else was more oriented towards the CUE extensions or early idea of the new version, but things might have changed drastically.

still garnet
#

@wet mason doing some work on the dagger TUI on the side, do you mind if I switch it to just use the standard ANSI 16-color palette? nerds like me like to have my entire OS using one color scheme

wet mason
#

Yep go for it

#

I'm not really an artist 😄 I slapped some colors together

#

Would be good at some point to get Camille involved in that

still garnet
#

yeah definitely

#

fixed up the colors, now investigating strange rendering issues related to the log output. I think we need to use a vterm so special characters in the logs don't disrupt the UI. had to do the same for Bass. even a simple \t throws a wrench in the gears, which I don't really understand

tawny flicker
#

Having done a bunch of work on the ttyv2, there is a lot of prediction that can be wrong, based on rune size.
Basically the correct way to calculate size is to use grapheme cluster size.
In the end, it gets really tricky.

#

Also when displaying logs with rewritten output like secrets

#

Write() will tell you it wrote X bytes, but in reality, it could be anything displayed afterwards because of the scrubbed secrets etc. Which then will make your output wrap lines weirdly

north jay
tepid nova
#

FYI I'm reverting the "how it works" diagram. Please consult me in the future before changing the fundamentals of how we explain our product.

fair ermine
ancient kettle
#

@civic yacht @still garnet I'm working on spinning up a dev dagger engine as a service inside of stable dagger engine. Does it automatically listen on a particular TCP port? Or do I need to tell it to listen via config? And, if it does listen... which port is that? I'm not finding anything obvious (although it's possibly listening on :1234...)

civic yacht
ancient kettle
#
            - --addr
            - tcp://0.0.0.0:1234
civic yacht
still garnet
#

@ancient kettle i'll be curious to see how this goes, might need to do something similar to test AppArmor

ancient kettle
#

Me too! So far... it's failing. 🙂

still garnet
#

i believe in you

ancient kettle
#

@still garnet is there any way to get logs from a service?

still garnet
#

@ancient kettle same way you would with any other container, you can call Stdout/Stderr on it. but that might not be very helpful depending on the situation (i.e. if it's not exiting)

#

you could try running it with dagger-ui 😛

ancient kettle
#

Inside of the service?

still garnet
#

wrapping whatever command is running the service - e,g, dagger-ui <cmd>

ancient kettle
#

Ok. So, I'm running a mage command (go run main.go -w $DAGGER_SRC_ROOT engine:test), so basically:

  • build dagger-ui
  • run dagger-ui go run main.go -w $DAGGER_SRC_ROOT engine:test
#

That sounds good. 🙂

#

I'll try it.

still garnet
#

yep, that should work

ancient kettle
#

Here goes nothing. 🙂

still garnet
#

oh - shoot, you might need to build the local dagger binary from the dagger-ui branch

#

since I changed it to use TCP for the journal event stream

#

hmmm

ancient kettle
#

That would make sense. I'm just getting a blank screen.

still garnet
#

try env _EXPERIMENTAL_DAGGER_JOURNAL=/tmp/journal.log dagger-ui - that'll fall back to tailing instead

#

make sure the journal is empty since it appends by default

still garnet
#

@ancient kettle any luck?

ancient kettle
#

Getting there! I had the config wrong... and running it as a service was swallowing errors. I started running it as a straight exec, and I think I've finally got it, and I'll switch back to a service in a moment.

#

@still garnet I'm getting some odd DNS errors:

#0 0.073 host alias: lookup FM39PVG9VCP9S.dagger.local on 10.87.0.1:53: no such host
#12 ERROR: process "sh -c echo '{container{from(address:\"alpine\"){withExec(args:[\"echo\", \"hello\"]){stdout}}}}' | ${_EXPERIMENTAL_DAGGER_CLI_BIN} query --debug" did not complete successfully: exit code: 1
#

Is that because the stable and dev dagger engines are fighting over network space or something similar?

#

Actually... it kinda looks like it's not waiting for the dev engine to start before running the WithExec that's relying on it.

still garnet
#

did you configure a WithExposedPort?

#

re: the second part

ancient kettle
#

Oh, I did not.

still garnet
#

yea, need that for the health checking/polling/waiting to start

ancient kettle
#

@still garnet Am I holding this wrong?

I'm setting _EXPERIMENTAL_DAGGER_SERVICES_DNS=0, but then it's not letting me use services at all.

still garnet
#

you should only need to set it for the inner Dagger that you're running as a service

ancient kettle
#

Hmm... that might get tricky, given that I'm trying to run the Dagger CI against it.

#

On the plus, that totally works. 🙂

#

It's a bit of a bummer that a service exec is marked as failed in dagger-ui

still garnet
#

@ancient kettle good point, might be able to detect that it's interrupted instead

ancient kettle
#

@civic yacht @still garnet So, I'm pretty sure I'm gonna be blocked on this Dagger CI refactor by https://github.com/dagger/dagger/pull/4711

Because we'll need to be able to run dev dagger engine with services and DNS enabled to run our test suite against it. Am I missing anything?

GitHub

Follow-up to #4697
This might be a bit overkill, but it felt like a useful exercise just to get us away from global assumptions around the network stack.

The engine has two new flags --network-nam...

#

That said, I was able to run a decent number of tests without it.

still garnet
ancient kettle
#

Run-on sentences ftw!

still garnet
#

got it, ty

ancient kettle
dense dust
ancient kettle
civic yacht
#

@still garnet re: this comment: https://github.com/dagger/dagger/pull/4711#discussion_r1143987469
I think I found the issue, it's just that I forgot to close the clients in those test cases before shutting down the engine, so then the engine blocked on GracefulStop: https://github.com/dagger/dagger/blob/d275122c01bcb585b54aa212ebaee9518d4eb438/cmd/engine/main.go#L354

Do you mind if I push a commit with that to your PR just so I can test it there? Just a few lines of changes to remotecache_test.go

still garnet
still garnet
still garnet
#

hesitant to rely on that, though, since it's unmaintained and these rules might be removed at some point

still garnet
ancient kettle
# civic yacht LGTM!

Thanks for getting that over the line so quickly, @still garnet and @civic yacht! I’ll keep pushing forward on the refactor today.

north jay
civic yacht
north jay
#

Oh cool.

#

Fitting that the bug I was hit by was caused by me.

civic yacht
civic yacht
still garnet
#

@civic yacht I think go build -o /app/bin/buildctl github.com/moby/buildkit/cmd/buildctl would work too.

civic yacht
tepid nova
#

what’s Container.withLabel ?

wild zephyr
#

same thing as the LABEL dockerfile instruction?

obsidian rover
dense dust
#

BTW it's possible I'll ask a big amount of stupid questions today around codegen, etc 😅

tepid nova
#

also we are not 100 % consistent on methodology

civic yacht
ancient kettle
#

@still garnet I'm trying to run this test against dev engine running in stable engine....

func TestContainerPublish(t *testing.T) {
    c, ctx := connect(t)
    defer c.Close()


    const htpasswd = "john:$2y$05$/iP8ud0Fs8o3NLlElyfVVOp6LesJl3oRLYoc3neArZKWX10OhynSC" //nolint:gosec

    registryHost := "registry"

    registry := c.Container().From("registry:2").
        WithExposedPort(5000, dagger.ContainerWithExposedPortOpts{Protocol: dagger.Tcp}).
        WithExec(nil)

    pushedRef, err := c.Container().
        From("alpine:3.16.2").
        WithServiceBinding(registryHost, registry).
        Publish(ctx, registryHost+":5000"+"/container-publish:latest")
    require.NoError(t, err)

}

And I'm getting this error:

                            input:1: container.from.withServiceBinding.publish failed to solve: failed to push registry:5000/container-publish:latest: failed to do request: Head "https://registry:5000/v2/container-publish/blobs/sha256:3feafd38d98ccbd96b16d14ef277f30a39258b3dcc6e6c8521a00b13ad068f13": dial tcp: lookup registry on 10.88.0.1:53: no such host
#

It looks like the service comes up, but the alias isn't getting set?

still garnet
#

this won't work unfortunately, for a couple of reasons: a) buildkit's push/pull runs in the context of the engine container, so it doesn't have service aliases wired up, and b) pushing/pulling to/from a registry that isn't localhost or 127.0.0.1 will require TLS, even if it were to be magically routed to the right service

ancient kettle
#

Ok... :/

still garnet
ancient kettle
#

That makes sense. Bummer. So, I guess I might need another way to do these registry tests in our CI.

civic yacht
#

but since we are setting up both the engine and registry ourselves I guess we can just include tls certs with the containers too?

ancient kettle
#

I like that.

#

I've gotta go do dinner, but I'll check back in later.

wild zephyr
still garnet
#

true! probably a lot easier than setting up TLS

ancient kettle
civic yacht
civic yacht
wild zephyr
civic yacht
ancient kettle
#

I guess I'm not grokking the CNI setup.

still garnet
#

@ancient kettle I think the idea is to run the engine (in Dagger) with the registry as a bound service, which you can give an alias. That same alias can then be configured with http = true in engine.toml

#

containers running in the dev engine wouldn't be able to reach the registry, but that's OK - only the engine really needs to, afaik

ancient kettle
#

Ah, yes. Thanks, that clicked it into place (about the engine, but not the containers it is running, needing access).

ancient kettle
#

Ok. That all appears to work quite nicely! http = true gave me zero issues, and many of the tests ran great! remotecache_test.go is gonna need some work.

still garnet
#

awesome!

civic yacht
# ancient kettle Ok. That all appears to work quite nicely! `http = true` gave me zero issues, an...

Awesome, good to know at least that works! For remotecache, one of the tests just needs a registry, so that should be reusable from the other tests. The other part is minio, which is an s3 implementation. I think you can re-use the exact same idea as the registry except you shouldn't need to worry about disabling tls, it all works over http.

There's some existing code that initializes a bucket in minio, which is sort of annoying because it requires using a different container image, but I think you can just run that code at the beginning of the test suite, same place you set up the registry/minio services. But if any of that gives you any trouble let me know, I can look in more depth too

ancient kettle
civic yacht
# ancient kettle The part that I'm gonna need to figure out is the running separate dev container...

Yeah if we want to make those be special case tests that require some specialized setup outside the direct test code, I think that would be totally fine for now.

Another approach would be to allow each test case to optionally get clients for its own separately spun up dev engine when it starts. E.g. most test cases will just call getClient(...) that returns a dagger client for the default dev engine we already spun up, but if they want their own engine they can use a different func. Buildkit uses a similar-ish approach in their integ tests. Not sure about the details there, but conceptually something along those lines might be nice.

Also, feel free to send out the updates for the rest of the test suite and skip those for now. It will be easier to get an idea on how to approach it after looking at the structure of everything else.

tidal spire
#

❓ What's the current state of debug sessions? (attaching a shell)

still garnet
#

hmmm... anyone else seeing this? services don't seem to be stopping, and a bunch of zombie processes are piling up. pictured: htop in dagger engine container after running a few tests that run services.

repro:

  • git checkout main; git pull
  • go test ./core/integration/ -run ContainerExecServices -count=1 -v <- not against dev engine, happens against v0.4.2 too

noticed this because tests that ran dockerd as a service were failing waiting for containerd to start, likely because of a flock held by a leaked dockerd service from a prior run

cc @civic yacht @ancient kettle

#

debug sessions

civic yacht
#

hmmm anyone else seeing this services

ancient kettle
#

@still garnet I was trying to use the new terminal UI (downloaded the latest from that PR, built it), and ran /Users/joel/src/dagger/dagger-ui/dagger run -- go run main.go -w $DAGGER_SRC_ROOT engine:test

I'm seeing this screen... but then it just hangs. My mage go.mod is using dagger.io/dagger v0.5.2. Is there something else I need to do?

still garnet
#

hmm, haven't tried running our Mage stack with it yet, not sure why it would hang. I've run dagger run go test ./core/integration directly though and seen lots of things come in

ancient kettle
still garnet
#

@ancient kettle as a sanity check you could run dagger run ls, which takes a bit (needs to start the session) but should at least show some output

ancient kettle
still garnet
ancient kettle
still garnet
#

definitely don't think it's a "holding it wrong" situation, maybe something platform-specific. I've only tested on Linux

still garnet
ancient kettle
#

Gonna try a couple more things.

#

I've rebuilt it (go build + ./hack/make), I've blown away dagger engine and spun it back up via dagger run as well as ./hack/dev. Still nothing from ./cmd/dagger run ls.

#

Your guess of platform specific makes some sense to me.

still garnet
#

I just ran dagger run ./hack/make engine:test and everything passed, and then dagger run ls and that works too, though it takes 2-3 seconds before it shows it. interestingly dagger run go test ./core/integration doesn't work though because they all run against the same session (from dagger run), so things like the workdir are wrong.

#

I'll try testing on Mac soon, have an old machine sitting on a shelf. (hopefully this isn't extra spooky m1-only behavior or something.)

#

also: that 2-3 second startup time is begging for some optimization, not sure what's going on there but this likely affects every SDK

ancient kettle
#

@still garnet I spun up a Linux VM on my machine last night/this morning and ./dagger run ls works just great. Your platform-specific hypothesis is looking stronger and stronger to me.

#

I'm happy to help you debug it, if that's useful for you!

still garnet
still garnet
#

@ancient kettle when you press arrow keys, does the top left change from FOLLOW to BROWSE? trying to see if the UI is totally wedged

#

and if you resize the window, does it update to reflect the new size?

ancient kettle
#

Yeah, it FOLLOW/BROWSE changed. The collapse/expand works, too. The resize works, as well.

#

I don't think the UI is totally wedged.

ancient kettle
#

@still garnet I mucked around with it some more... really not sure what's happening there. thinkies

still garnet
#

my one guess is something is going wrong with the cmd.Run() and the UI isn't reflecting the error. you could try wrapping it in a if err != nil { panic(err) } maybe

#

I started hooking up a Mac Pro I've had sitting on a shelf collecting dust, but don't have a spare USB cable, so context-switching for now 😄 will get back to it later

ancient kettle
ancient kettle
#

@civic yacht @still garnet Any ideas for how I can get docker hub credentials to dev engine (running in stable engine)? I'm getting blocked on rate limits. 😢

still garnet
#

@ancient kettle if you're logged in, log out. if you're logged out, log in. that's been my trick for getting a few more requests, since you get both the anonymous and authorized limit 😆

#

(using docker login / docker logout)

#

it should be forwarding your auth already

ancient kettle
still garnet
#

oh, doy. hmm

civic yacht
#

I think you could re-use the same idea? Devil will be in the details of course

#

It's also not great in that it should probably use a secret

#

But probably okay for now

still garnet
#

this doesn't help you right now, but last time I looked into this I came away thinking we should just push all of our test images to a private registry. I sunk a bunch of time into trying to limit the API requests (e.g. by running a local registry mirror) but it gets really tricky since we run multiple engines in some tests

civic yacht
ancient kettle
civic yacht
ancient kettle
#

On a positive note, nearly all of the integration tests pass just fine in this new dagger-in-dagger refactor.

#

I am so looking forward to having dagger run work on the Mac, @still garnet. Sorting through all of these trying to find the test failures... not my favorite.

#

I'll spend some more time poking at that tomorrow.

still garnet
#

will take a look this evening too, been looking for a reason to rev up the ol trashcan

ancient kettle
ancient kettle
civic yacht
ancient kettle
ancient kettle
still garnet
#

@ancient kettle TestContainerInsecureRootCapabilitesWithService currently fails on the second run because the dockerd service leaks (the upstream buildkit issue from yesterday)

ancient kettle
#

Also, looks like I'm getting a panic after 10min.

#

Might be related to the services issue, as well?

#
#15 601.3 panic: test timed out after 10m0s
#15 601.3 running tests:
#15 601.3     TestContainerExecServicesChained (2m36s)
... + a bunch of goroutine stacks
civic yacht
#

That problem with TestContainerInsecureRootCapabilitesWithService only happens if the engine keeps running between between test invocations (it works if you just do ./hack/dev go test -v -count=1 -run=TestContainerInsecureRootCapabilitesWithService ./core/integration/ multiple times in a row, at least for me, since the dev engine will be restarted each time). I would have expected that each run is resetting the dev engine in the new dagger-in-dagger approach too, but if not then yeah that's most likely it

ancient kettle
#
#15 20.54     container_test.go:2998:
#15 20.54             Error Trace:    /app/core/integration/container_test.go:2998
#15 20.54             Error:          Received unexpected error:
#15 20.54                             input:1: container.from.withMountedCache.withServiceBinding.withEnvVariable.withExec.stdout process "docker-entrypoint.sh sh -e -c echo uo3txv448udzdo9j3ko0jg98w-from-outside > /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine cat /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine sh -c 'echo uo3txv448udzdo9j3ko0jg98w-from-inside > /tmp/from-inside'\ncat /tmp/from-inside" did not complete successfully: exit code: 1
#15 20.54                             Stdout:
#15 20.54
#15 20.54                             Stderr:
#15 20.54                             Unable to find image 'alpine:latest' locally
#15 20.54                             latest: Pulling from library/alpine
#15 20.54                             af6eaf76a39c: Pulling fs layer
#15 20.54                             af6eaf76a39c: Download complete
#15 20.54                             af6eaf76a39c: Pull complete
#15 20.54                             Digest: sha256:ff6bdca1701f3a8a67e328815ff2346b0e4067d32ec36b7992c1fdc001dc8517
#15 20.54                             Status: Downloaded newer image for alpine:latest
#15 20.54                             cat: can't open '/tmp/from-outside': No such file or directory
#15 20.54
#15 20.54                             Please visit https://dagger.io/help#go for troubleshooting guidance.
#15 20.54             Test:           TestContainerInsecureRootCapabilitesWithService
#15 20.55 Error: failed to solve: process "docker-entrypoint.sh sh -e -c echo uo3txv448udzdo9j3ko0jg98w-from-outside > /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine cat /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine sh -c 'echo uo3txv448udzdo9j3ko0jg98w-from-inside > /tmp/from-inside'\ncat /tmp/from-inside" did not complete successfully: exit code: 1
#15 20.55 #4 ERROR: process "docker-entrypoint.sh sh -e -c echo uo3txv448udzdo9j3ko0jg98w-from-outside > /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine cat /tmp/from-outside\ndocker run --rm -v /tmp:/tmp alpine sh -c 'echo uo3txv448udzdo9j3ko0jg98w-from-inside > /tmp/from-inside'\ncat /tmp/from-inside" did not complete successfully: exit code: 1
civic yacht
civic yacht
ancient kettle
ancient kettle
#

Not sure if both of those are necessary or not...

still garnet
civic yacht
#

If you want to just push the code as a draft I can look more closely, it may be some minute detail somewhere

ancient kettle
ancient kettle
#

Don't mind core/integration/remotecache_test.go - I haven't figured out how I should refactor this yet.

ancient kettle
still garnet
civic yacht
civic yacht
#

@ancient kettle Okay managed to fix it, and the fix also appears to have gotten TestDirectoryWithTimestamps passing too. Pushed the fix here: https://github.com/dagger/dagger/pull/4848/commits/6570ea111ecd9acd1ee327bbceedf12d434dc73f

Before that change, the engine's state directory was an overlayfs mount, which meant that it couldn't itself create new overlayfs mounts so it had to fallback to the extremely slow copy-based "native" snapshotter. Putting a cache mount at /var/lib/dagger fixes that because cache mounts are just bind mounts from the "outer" buildkit's filesystem, which is itself a bind mount from the actual host (since we run the default engine in docker w/ -v /var/lib/dagger.

So yeah that fix is what we want in general anyways, but the fact that it caused errors in those two test cases in particular is pretty interesting though. I am not totally sure, but I think it may have to do with weirdness of overlay where the listing of directory contents is cached in the kernel in such a way that different processes can get inconsistent results (search readdir here: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt). Because all the mounts we put into containers were actually just bind mounts from a single underlying overlay mount, that could easily explain different containers getting inconsistent views of the filesystem. Doesn't matter in the end, just kind of interesting.

still garnet
#

@ancient kettle welp. one macOS upgrade, 7 failed recovery mode reboots, a table flip, and an Xcode upgrade later, I can't reproduce your issue. 😅 caveat: this is an old Intel Mac Pro from 2013. dagger run ls runs successfully, with the same-ish delay

ancient kettle
tepid nova
#

I can try reproducing tomorrow on a m1 mac. Are there instructions somewhere?

still garnet
#

@tepid nova 🙏

cd dagger
git remote add aluzzardi https://github.com/aluzzardi/dagger
git fetch aluzzardi ui
git checkout ui
go install ./cmd/dagger
dagger run ls                      # sanity check (should take a few seconds)
dagger run ./hack/make engine:test # try actually running everything
rancid turret
#

@civic yacht 👋 How do we stand today on the limits for the engine to process requests? Let's say you send 10000 concurrent requests (pipelines), is the engine smart enough to throttle that or do you need to specifically configure the parallelism? Do you have a list of open issues on this subject, i.e., performance issues related to using "unlimited" concurrency from an SDK?

still garnet
#

i know one user has done that at least

still garnet
#

oo fun, thanks

still garnet
civic yacht
obsidian rover
obsidian rover
ancient kettle
still garnet
#

you're not running it in an empty directory right? what about dagger run echo potato? 😄

#

or what about dagger run say hello?

ancient kettle
#

I'm not... That's not working either.

#

I'm so confused. 😛

still garnet
#

what about dagger run reboot. you feeling lucky?

ancient kettle
#

Maybe in a bit? Gotta meeting now. 🙂

still garnet
#

if you didn't hear anything from say hello that's enough evidence without rebooting 😛

ancient kettle
ancient kettle
rancid turret
#

Concurrency limit

ancient kettle
#

@still garnet I changed this code.

diff --git a/cmd/dagger/run.go b/cmd/dagger/run.go
index 5b3527c2..2110d08b 100644
--- a/cmd/dagger/run.go
+++ b/cmd/dagger/run.go
@@ -72,8 +72,11 @@ func Run(cmd *cobra.Command, args []string) error {
     eg := new(errgroup.Group)
     eg.Go(func() error {
+        subCmd.Stdout.Write([]byte("Hello from Joel! #1\n"))
         return withEngine(ctx, sessionToken.String(), journalW, progOutWriter{program}, func(ctx context.Context, api *router.Router) error {
+            subCmd.Stdout.Write([]byte("Hello from Joel! #2\n"))
             go http.Serve(sessionL, api) // nolint:gosec
+            subCmd.Stdout.Write([]byte("Hello from Joel! #3\n"))
             return subCmd.Run()
         })
     })

When I run dagger run ls, I now get Hello from Joel! #1.

#

Looks like withEngine is returning an error.

still garnet
#

was just about to ask 🙂

ancient kettle
#

🙂

#

Huh. err: failed to solve: unknown cache exporter: "dagger"

#

Gonna check the engine logs.

still garnet
#

maybe it's talking to an outdated engine somehow? or a vanilla buildkit?

#

(should definitely handle errors here better)

ancient kettle
#

I cleared out the dagger engine that was running (including volumes)

#

Then re-ran.

#

Same error. 🤔

#

The image is registry.dagger.io/engine:main

still garnet
#

oo maybe that needs to be pulled?

ancient kettle
#

I'm going to delete that image, and see if I can pull it again.

#

What. The. F.

#
Hello from Joel! #1
Hello from Joel! #2
Hello from Joel! #3
CONTRIBUTING.md
Dockerfile
LICENSE
NOTICE
README.md
RELEASING.md
TESTING.md
auth
cmd
codegen
core
demos
docs
engine
go.mod
go.sum
hack
internal
network
project
router
sdk
secret
telemetry
tracing
website
Hello from Joel! #4
err: <nil>Hello from Joel! #5
still garnet
#

🎉? (is that a good wtf or a bad one)

ancient kettle
#

It works!

#

I'm just so confused what happened...

#

I should have captured the sha...

#

Probably an old image or something.

still garnet
ancient kettle
#

I must have pulled engine:main a while ago and it kept using that. ¯_(ツ)_/¯

still garnet
#

oo, we default to that for dev builds apparently

ancient kettle
#

Ah, yeah that makes sense.

still garnet
#

@ancient kettle @obsidian rover pushed fixes for both issues. it'll now synchronously start the engine and command, which makes for better error handling and avoids showing a 'blank' UI that looks stuck

wild zephyr
#

👋 what does anybody think about changing []string Go SDK arguments with (args ...string)?

I generally find myself writing WithExec("foo", "bar") just to realize afterwards that I forgot the []string{} thing

obsidian rover
still garnet
still garnet
#

we could add sugar to SDKs for things like that, just like Buildkit has llb.Shlex("foo bar baz") which expands to llb.Args([]string{"foo", "bar", "baz"}). There's risk in fragmentation / thicker SDKs, but well, these things would be nice when you're using it all the time

wild zephyr
#

it's a bummer that go doesn't allow adding variadic args of different types. Doing something like func(args ...string, argsb ...bool) doesn't seem a bad idea initially

#

and should be easy for the compiler to validate that

still garnet
#

i think the problem would come with interfaces, e.g. if you had func foo(ws ...io.Writer, rs ...io.Reader) and you called it like foo(writer, readWriter, reader) it wouldn't know where to group the readWriter

still garnet
#

@civic yacht if you're still looking into the exit 4294967295 issue, here's what I know so far:

  • When @stray heron and @hybrid widget reported it they were actually seeing an error when resolving service aliases ("host: lookup SDFFKJHJKHFDG: no such host" or something), which can only happen after a successful healthcheck, and indicates the service exited afterwrds.
  • When I saw it last night it seemed to happen for a random subset of tests in Bass's test suite, which runs against buildkitd running as a service container, so in that case I think buildkitd was exiting partway through. With enough runs I was able to have all the tests pass.
civic yacht
#

Need to actually understand before merging though of course

still garnet
#

hmm ok - that seems counter to what I've seen so far (I thought services were exiting prematurely somehow) but maybe multiple things are happening

civic yacht
#

The exit code being the max value of uint32 makes me suspect that something is returning -1 somewhere and that is incorrectly being converted https://go.dev/play/p/9eGFxEgw374

#

starting a thread for this

oak sandal
civic yacht
ancient kettle
still garnet
#

lol, you noticed that fast, I'm writing up a comment to summarize latest updates

ancient kettle
#

🙂

#

It was at the top of my GH notifications. 😉

oak sandal
#

Yes This is my workaround https github

ancient kettle
#

@civic yacht We never implement the Container.With(func(c *dagger.Container) *dagger.Container) pattern, did we...?

ancient kettle
#

@still garnet @civic yacht This Dagger-in-Dagger PR is mostly done. I realized I still need to refactor core/integration/remotecache_test.go, so I'm digging into that now. https://github.com/dagger/dagger/pull/4848

still garnet
#

@ancient kettle now I want to add a TUI key to open the details log output in $EDITOR thinkspin

#

just ran your PR and a test failed, but it's pretty hard to peruse the output in the little pane. (one could say it's a paneful experience)

ancient kettle
#

That'd be nice!

#

Also, Tab-Up/Down in the log details pane would be clutch.

still garnet
#

huh, yeah, I could have sworn it respected it, but I guess not

ancient kettle
still garnet
#

i'm also trying to figure out how I can have services 'marked' somehow so the follow mode doesn't prioritize them. and so they're not marked as 'failed'

still garnet
#

unfortunately the services thing is tricky because pipeline info is baked into containers at withExec time, and containers don't know whether they're a service at that point.

ancient kettle
civic yacht
still garnet
#

that's what I was going for, but I don't think there's a point in time where a) I can use llb.WithCustomName and b) the container knows it's going to be a service

#

I guess I could do it based on whether it has exposed ports? (even though not all services will do that, since sometimes ports are dynamic, i.e. rabbitmq)

#

ack, sometimes you do WithExposedPort after WithExec, so even that's not perfect

civic yacht
#

There's a chance it might work if you pass llb.WithCustomName as an arg to .Marshal here: https://github.com/sipsma/dagger/blob/ae656003ba7b7073f0d4d31cdbcc43489cd2dba7/core/container.go#L1156-L1156

The only reason I could see it not working would be some weird stuff with overriding the metadata of a definitionop, in which case you could do something similar to the workaround here: https://github.com/sipsma/dagger/blob/ae656003ba7b7073f0d4d31cdbcc43489cd2dba7/core/container.go#L410-L410

But at that point there's clearly a can of worms forming, so maybe not worth it

still garnet
#

oh nifty, I didn't know you could pass those at marshal time

#

will try that, ty!

civic yacht
still garnet
#

whoa, handy vim tip: to pretty-format json (i.e. in journal.log), just do :!jq ., i.e. :'<,'>!jq . (for selection) or :%!jq . (for whole file)

#

i've gone too long without using this...

wild zephyr
still garnet
#

woo, now concourse's web UI integration tests show everything passed, instead of saying the services failed

ancient kettle
still garnet
#

I'm adding o to open in $EDITOR, I realized it's kind of important because otherwise the output is kind of "lightning in a bottle"

#

I think it should work while you have a group selected too, not just an item, but I'm trying to figure out what that should actually do thinkies

#

it'd just be nice to be able to smash "Home" (go to top/root) => "o" to save all the logs

#

oo, maybe it could create a file/directory tree and then open the folder in your editor...?

still garnet
# ancient kettle I like that idea a lot!

pushed! turns out bubbletea (our TUI library) already had all the necessary tools for this, which is impressive. it's kind of tricky to e.g. open vim from a TUI, since there's only one 'alt' screen. you have to basically halt the background process, otherwise it starts eating user input before it can make it to the editor

rancid turret
obsidian rover
still garnet
#

@obsidian rover my guess is we're just brushing up against the test timeout again, we've had to optimize some things in the past to bring it back under the 10m threshold. (I'm guessing this because none of the tests running at the time of failure seem to have been stuck for a crazy amount of time, so it's not an individual test causing the timeout)

#

I had a run hit this yesterday and it passed on rerun

#

...but that's not really sustainable

obsidian rover
still garnet
#

time to rm -rf some tests chef_kiss

#

we can maybe just bump the timeout, if we're adding tests all the time it's kind of inevitable. it'd be awesome if we could speed them up instead, but that's not a priority at the moment

#

@ancient kettle did you notice any performance difference by any chance with the dagger-in-dagger refactor?

tawny flicker
still garnet
#

I'm really curious if there's low-hanging fruit in the session startup time. dagger run ls shouldn't take 2-3 seconds

tepid nova
#

so that's even after bypassing any communication with our registry?

#

Or I guess, dagger run does communicate with the registry, so that's the likely culprit

still garnet
#

yeah, I'm guessing so

ancient kettle
ancient kettle
ancient kettle
rancid turret
ancient kettle
obsidian rover
still garnet
#

@obsidian rover it's the Go default, so we just need to add a -timeout=15m (or something) flag wherever we run go test

#

looks like that's in internal/mage/engine.go

oak sandal
wild zephyr
#

just found: https://tart.run/. Interesting product for people requiring fast macOS runners. cc @ancient kettle @civic yacht

ancient kettle
still garnet
# ancient kettle `git` is a DAG, too... 😉

ooo. I was thinking about this last week! Since dagger run is a bit of a lie, trying to represent everything as a tree. I was thinking about git log --graph couldn't find a good generic tool/library for it

ancient kettle
still garnet
#

time to rewrite everything i guess

#

this is definitely way easier to read than git log, wonder how it would look for a Dagger graph

#

looks like it's super tied to git though. maybe i need a... dagger graph to git repo converter.... thinkies

ancient kettle
still garnet
#

lol wtf

ancient kettle
#

I'm so confused... I didn't change that code at all... (beyond the attempt to fix it)

still garnet
#

it's a simple HTTP fetch too 🤔

ancient kettle
#

Yup. 🤷🏻‍♂️

still garnet
#

and it's not a fluke, either; same error in engine-race-detection

#

only guess I can think of is some kind of weird cache false positive?

ancient kettle
#

Yeah, I've seen it a few times now.

still garnet
#

would be interesting if we could identify the byte sequence it's getting back somehow

ancient kettle
civic yacht
still garnet
civic yacht
#

🤷‍♂️

#

@ancient kettle when you try to repro locally are you running the full test suite or just that test in isolation? would be good to see if it repros when you run from a fresh cache and running the full suite

ancient kettle
#

I just ran the single test. I'll try the full suite from a fresh cache.

civic yacht
#

If it's a bug where it picks up the wrong cache then that might be more likely to trigger it

ancient kettle
#

Yeah... has me wondering if it's the cache mount for dev-dagger-engine biting me.

still garnet
#

you could also change the test to print the content base64-encoded

#

just so we can play with the data it gets

civic yacht
ancient kettle
civic yacht
ancient kettle
#

@civic yacht @still garnet I have another data point...

#

When I comment out TestRemoteCacheRegistry and TestRemoteCacheS3... it passes.

#

Maybe dagger-in-dagger-in-dagger is just too much dagger?

civic yacht
# ancient kettle Maybe dagger-in-dagger-in-dagger is just too much dagger?

That is pretty weird too but did you give a try base64 encoding the bytes so we could see what they are? If it's just the compressed bytes of the actual contents we expect then I would highly suspect this commit (from 2 weeks ago in buildkit) broke it: https://github.com/moby/buildkit/commit/f2f3d8cf94dcf8d560b1d4392a73948a81455e5e

I don't know entirely how it would be inconsistent between different runs, but there's a plausible path since the transport used for the http requests is global to the engine, it's just created on startup, so there could be some strange interactions happening there that vary depending on the order and timing of requests to the engine

ancient kettle
ancient kettle
#

Base64ed:

H++/vQgAAAAAAAAD77+9VMuO77+9NhDvv73vv70r77+977+9IRnDlu+/ve+/vVwWG08eE2zvv71F77+9QDAYBAtabFPvv71QbO+/ve+/ve+/vShf77+9b8mX77+977+977+977+9WiwWAXzvv73vv73vv73vv73vv73vv70i77+9a++/vWfvv73vv71ySe+/vWlrKe+/vV3vv73vv70/77+9Su+/vSHvv73vv73vv73vv73vv73Rkzrvv73vv70Pd++/ve+/vXXvv71U77+977+9WEJSI++/ve+/vQY377+977+977+977+9XFAt77+977+977+9EFPvv71a77+977+9a++/vWEB77+9Ae+/vUzvv73vv73vv71xAu+/ve+/ve+/ve+/vXZDW++/ve+/vR3vv73vv71777+977+9cO+/vSrvv73vv70t2pJUVgQ9DN61OjsO77+977+977+977+977+977+9RSTvv71QWzJ9UU0h77+9SH7EkH/vv715TFkH77+977+9Ue+/vR3vv70XKH9977+977+9PO+/ve+/ve+/ve+/vQcKC++/vRzvv73vv73vv71RHzt42IFm77+94Ki9OlLvv73vv73vv73vv71277+92qjvv73vv70+CHBW77+9W3wE77+9DH3vv71N77+977+977+9Y++/vUfvv71dHjfvv73bhRMzRe+/ve+/vSDvv70rDhPvv70J77+977+9be+/ve+/vVvvv70OVO+/vR1877+977+977+9Ee+/vciAS3zvv713DO+/ve+/vXhV77+9Ou+/ve+/vWvvv71r77+977+9NO+/vSgDRe+/vdq2Alvvv70iYkYFybF6LkHvv73vv73vv70+77+9GHTvv73vv71n77+9DO+/ve+/vUnvv705Pu+/ve+/ve+/vQpw77+9KFLvv71x77+9fgDSq1h1du+/ve+/ve+/vXfvv70+yY5aTmPvv73vv73vv70qd1JuTh3vv73vv73vv73vv73vv71iQu+/vUQbIzXvv70/X++/vSTvv70QOe+/ve+/vSVBDgvvv73vv73Hsu+/ve+/vWo077+9PlXvv73vv73vv70UAhTvv71L77+9Bu+/ve+/vX3vv73vv73vv71gae+/vWMQce+/vdW8Qu+/vSDvv71o77+977+9AzonXu+/ve+/vV7vv73vv70C77+9W2bvv70QCO+/vSXvv71177+977+9aSQc77+9YGQ9JCg/77+9VF3vv73vv73vv70i77+9Z1hYT0tVQu+/vSg+77+977+9d++/ve+/vQfCg1Tvv70a77+9emMoORvvv73vv73vv73vv71DUgBOHO+/vWzvv73vv71B77+977+9V++/ve+/ve+/vVZ4DQ7vv70dUDpm77+9dEDvv71SG92QJ++/vSpxT++/vVLvv70ZBAYG77+9U++/ve+/vTYgMu+/ve+/ve+/vTUh77+977+977+977+9akvvv73vv71XF/G4va7vv70yBmBaCe+/vTTvv71KbNWF77+9Je+/ve+/vXc1bu+/ve+/ve+/vSHvv71WB++/vXdd77+977+9cu+/vVPvv73vv716LmYH77+9KF9477+977+9M0vvv70l77+977+9fe+/ve+/vXTvv73vv73Fqe+/ve+/vVnvv73vv73vv70lD1It77+977+9B08vMDbvv73vv73vv73vv71f77+9cC5V77+977+9E1JX77+9XmJQfe+/ve+/vT3vv73vv71nZe+/ve+/ve+/vSXvv71v77+9d++/ve+/vTY177+9Ju+/vXHvv71fd3zvv71J77+977+95Z2U77+9SO+/vU/vv73vv71D77+977+9Pe+/ve+/vXfbqXA3F++/ve+/vR9e77+977+9Xu+/vUsWHu+/vWMDMe+/ve+/vXdPc01t77+977+977+9S3YOJu+/vXpsIe+/ve+/vS3vv70Y77+9Uu+/ve+/vTTvv73vv73vv70p77+977+977+977+977+977+9UDF+ZO+/vSEa77+977+9PR/vv73UvO+/ve+/vdOLBe+/vU/vv73vv73vv73vv73vv73vv73vv73vv73JuO+/ve+/vWLvv71FMVYacu+/vWlW77+9Ae+/ve+/vU8H77+9BgAA
#

incorrect header check

still garnet
#
[nix-shell:~]$ file out
out: Linux rev -1111494879.14013 ext3 filesystem data, UUID=bfbd3206-605a-09ef-bfbd-34efbfbd4a6c, volume name "\325\205\357\277\275%\357\277\275\357\277\275w5n\357\277\275\357\277\275\357\277\275!\357\277\275V\007\357\277\275w]\357\277\275\357\277\275r\357\277\275S\357\277\275\357\277\275z.f\007\357\277\275(_x\357\277\275\357\277\2753K\357\277\275%\357\277\275\357\277\275}\357\277\275\357\277\275t\357\277\275\357\277\275\305\251\357\277\275\357\277\275Y\357\277\275\357\277\275\357\277\275%\017R-\357\277\275\357\277\275\007O/06" (needs journal recovery) (errors) (compressed) (64bit) (huge files)

ah yes of course. a Linux -1111494879.14013 ext3 filesystem. it doesn't seem to be in good health though. maybe that's the problem!

civic yacht
#

I'm trying it out locally too, this is too bizarre...

still garnet
#

fwiw, ChatGPT isn't able to identify a hexdump of it, but it is able to identify an actual hexdup of a gzip'd README

still garnet
civic yacht
#

This is only supposed to happen when you write C...

civic yacht
# ancient kettle Base64ed: ``` H++/vQgAAAAAAAAD77+9VMuO77+9NhDvv73vv70r77+977+9IRnDlu+/ve+/vVwWG...

Ah okay, I think I know what's happening. I think the base64 before was done after our API had already converted []byte->string, which meant the base64 encode include the escape chars like \x. I put a base64 dump earlier in the spot where we actually get the bytes (https://github.com/sipsma/dagger/blob/4be9d6ee11cb5d3a66ec3e0f846436535009fd5e/core/file.go#L92) and can see that it actually is the gzip of the file we were expecting:

echo 'H4sIAAAAAAAAA4VUy47jNhC8+yuI9SEZw5avwVwWG08eE2yQRbJAMBgEC1psU8RQbIGP8Shfk2/Jl6Wa0thaLBYBfLCk7uqq6iLXa/Vnp7NySd1paym+Xa2mP/JKqyGyjbrv9dGTOtzvD3eKgnWBVJa2WEJSI5eoBjeQx/ukXFAth6zxEFOzWq3Xa/VhAYMB9Eyeh6lxAr2264R2Q1vByR2ppHu60HDBKq+DLdqSVFYEPQzetTo7DvO43/+P1kUkvVBbMn1RTSG7SH7EkH//eUxZB6OjUb8d7hcof33b5Tyk2/2eBwoL0Rzt/qZRHzt42IFmguCovTpSoJPL6Xa12qjN5j4IcFaeW3wEiwx9m0399oFj1kfnXR43m9uFEzNF1eog9isOE/cJpNdth69brA5U8R185MGQEYvIgEt8lncM48J4VaM6rvNrgWupqTT+KANFh9q2AlvhImJGBcmxei5Bka/APo4YdNLFZ8wM5rVJnTk+pauGCnDuKFKdceB+ANKrWHV2uavFd9w+yY5aTmPK1N8qd1JuTh32qa/st2JC9UQbIzWZP1/pJOcQOaXdJUEOC4ilx7Krkmo06T5VoJLAFAIUg0v8BvyZfarsuGBppGMQcZ/VvELXIOJosbgDOidenKxej4gC6FtmoxAI0SX8daSmaSQc72BkPSQoP7tUXZyPyyKlZ1hYT0tVQtooPqmHd7++B8KDVIsasHpjKDkb6v6By0NSAE4c3myx9kHA4VekwetWeA0Opx1QOmaXdECuUhvdkCf1KnFPuVLqGQQGBo9T8ZU2IDLE9vM1IfSjq7lqS8rcVxfxuL2u2TIGYFoJ7jTWSmzVhcIlga13NW66ldUhtVYH93ddl9BynlOd23ouZgeJKF94p80zS+4l7ol9kb504bbFqeGnWb6hniUPUi3J6gdPLzA2k52Djl/qcC5V4IwTUlf8XmJQfZC1Pf7MZ2WY6s4l9W+vd4ThNjWmJqBxvF93fN5J6c7lnZTeSP9PlKdDmOE9ma9326lwNxfW5h9etJBe3EsWHpdjAzH7qXdPc01t+IXnS3YOJup6bCGP8y2xGO9Sy9E01u7LKYzf5/a78lAxfmTvIRq7gj0fsdS87MvTiwWDT46lr47W6vGLybjF/GLuRTFWGnK6aVb/Af32TwezBgAA' | base64 -d | gzip -d 
## What is Dagger?
...
#

So that makes me suspect more that it might be the buildkit commit I linked above. I still don't totally get why it would be inconsistent, but I'll try quick with that buildkit commit fixed and see if it fixes it here

civic yacht
#

Okay yeah that fixed it, can't repro in dagger anymore, will send out a buildkit pr. @ancient kettle no need for this to block the dagger-in-dagger pr, if there's a URL that never returns gzip content that we can use then we could switch to that, but honestly if we even just temporarily comment out that test and have an issue for re-adding once we've picked up a buildkit with the fix in place that's fine, we will still have the other http test that uses a service to give us coverage.

civic yacht
#

Upstream fix here: https://github.com/moby/buildkit/pull/3788/files

I also realized as I wrote tests for that why we are hitting it in dagger, you can reproduce consistently by just running this test case in isolation:

func TestHTTP(t *testing.T) {
    t.Parallel()

    c, ctx := connect(t)
    defer c.Close()

    // do two in a row to ensure each gets downloaded correctly
    url := "https://raw.githubusercontent.com/dagger/dagger/main/TESTING.md"
    contents, err := c.HTTP(url).Contents(ctx)
    require.NoError(t, err)
    require.Contains(t, contents, "tests")

    url = "https://raw.githubusercontent.com/dagger/dagger/main/README.md"
    contents, err = c.HTTP(url).Contents(ctx)
    require.NoError(t, err)
    require.Contains(t, contents, "Dagger")
}

It has to do with some weird stuff involving how buildkit stores cache metadata. It technically wouldn't result in any false cache positives or anything, but it was unoptimal and coincidentally triggered that buildkit bug. Sent out a PR to improve that in dagger, more details in the description here: https://github.com/dagger/dagger/pull/4927

civic yacht
#

@ancient kettle the change in that dagger PR works around the issue without needing to remove the test or wait on upstream, pushed it the your dagger-in-dagger PR and tests pass now

still garnet
#

@civic yacht excellent sleuthing! i thought the other base64 looked a bit funny. lots of repeated patterns, I guess from all the escape characters

ancient kettle
#

@civic yacht Really appreciate you digging into this bug. I concur with @still garnet: excellent sleuthing. 🙏🏻

civic yacht
#

No problem! It was a weird one... Glad we don't have to get blocked on it

civic yacht
ancient kettle
#

@still garnet Turns out... network names have size limits... dev-engine-remote-cache-0 (25 characters) is over that limit.

still garnet
#

time for derc-0!

ancient kettle
#

😆

celest totem
still garnet
#

@civic yacht blatant nerd snipe attempt, feel free to ignore, but curious if you've had similar thoughts: SourcePath is really difficult to deal with! It would be really really nice if there was a way to just turn a llb.State + SourcePath back into a llb.State efficiently. Right now we have to go to great lengths to account for them everywhere, and the code is really convoluted. It seems like one option is to just llb.File(llb.Copy(st, sourcePath, ".", &llb.CopyInfo{CopyDirContentsOnly: true})) instead of keeping track of the source path, but I'm assuming that comes with a performance cost. 🤔

civic yacht
# still garnet <@949034677610643507> blatant nerd snipe attempt, feel free to ignore, but curio...

Yeah I've wanted that in upstream buildkit for a while, it came up a lot in my mind when working on MergeOp since right now you can only efficiently merge the / of input states, merging a subdir of an input requires a copy.

I would greatly prefer we just deal with the complication internally for now since yeah the cost of doing all those copies can become significant in realistic cases, e.g. if the copy involves a lot of large binaries, node modules, java jars, etc.

still garnet
#

yea totally. ok, glad I'm not alone here 😄

civic yacht
#

When I talked to Tonis about this during MergeOp work, I think the idea was that we could give FileOp copy the same optimizations MergeOp has where the new snapshot is created by hardlinking files instead of actual on disk copies (with exceptions for when chown/chmod/etc. is used, but that's always going to require a real copy anyways)

still garnet
#

For context, it's wreaking havoc in https://github.com/dagger/dagger/pull/4932 since now I'm adding more copy operations to change ownership, which requires using a SourcePath in nearly all cases in order to also chown the directory mount itself. I'm inclined to not chown by default at this point due to performance concerns, but I'm following through with it anyway to kick all the cobwebs out of these code paths. I found out Container.WithRootFS doesn't respect the given directory's source path, for example.

still garnet
#

The chown/chmod restriction would be unfortunate though, since overlay has metacopy; theoretically you don't need to do a full copy in that case. Not sure if Buildkit leverages that at the moment

civic yacht
still garnet
#

gotcha, good to know

civic yacht
# still garnet That makes sense - so my proposed workaround would "just work"? I can see why he...

But yes if Buildkit had that then we could just copy freely, MergeOp would actually become unneeded as it would just be a special case of Copy. Though I just remembered the big wrench in all of this and why we didn't pursue it at the time, which is that MergeOp is lazy (meaning you can merge remote layers in registries/caches without having to actually pull them), but FileOp isn't. FileOp could be made lazy but it would be a huge refactor in all likelihood.

ancient kettle
still garnet
#

gonna just disable dupl and gosec for *_test.go files

ancient kettle
#

@civic yacht Thanks for your comments. I’ll try to address them this weekend/Monday. Today has been another day of taking care of the family…

civic yacht
ancient kettle
# civic yacht No worries at all, I'm glad that the family is higher priority than my dumb PR c...

Thanks again for the comments. I addressed all of them except for the internal/mage/engine.go vs internal/mage/util/engine.go one. I think that a refactor of all of that is probably a good idea. We're definitely gathering some cruft there. If you're ok with it, I'd like to separate that out as a separate issue, because it will touch a number of other parts of our CI, as well, and I'd love to get this piece over the line.

Let me know what you think.

still garnet
#

@civic yacht Doing some refactoring in core/ and core/schema/ area, do you remember if the motivation behind this marshal => unmarshal dance was to prevent accidental mutation, or was it for something else?

https://github.com/dagger/dagger/commit/66f2655bc7fe5724e863580edc61a4f8609d569b#diff-50e0105c0556e86026a3120c974380c5cfbb87994f70146fc7281824a5d9cd05R94-R98

My refactor is to try to eliminate all the *Foo <-> FooID <-> fooIDPayload encoding/decoding, since it seems pretty wasteful and repetitive. Example for File: https://github.com/vito/dagger/commit/cc5031ddfd76e6d78dc4d402eb2ece55ffd452c5

But one thing I had in the back of my head was having to worry about mutation, and potentially switch from pointers to just by-value structs to avoid that (e.g. func (*Foo) WithX() *Foo -> func (Foo) WithX() Foo).

I added a test using parallel named queries, but it looks like this marshal/unmarshal dance is allowing the test to pass. I'm considering switching to a parent, ok := parent.(P); if !ok { marshalUnmarshal } just to skip the marshal/unmarshal if the type already matches, and confirmed that it allows the test to fail, at which point I would probably switch away from pointers, but curious if this is otherwise load-bearing

#

Here's the test for mutation:

func TestContainerParallel(t *testing.T) {
    t.Parallel()

    res := struct {
        Container struct {
            A struct {
                EnvVariable string
            }
            B string
        }
    }{}

    err := testutil.Query(
        `{
            container {
                a: withEnvVariable(name: "FOO", value: "BAR") {
                    envVariable(name: "FOO")
                }
                b: envVariable(name: "FOO")
            }
        }`, &res, nil)
    require.NoError(t, err)
    require.Equal(t, res.Container.A.EnvVariable, "BAR")
    require.Empty(t, res.Container.B) // fails with BAR if there is mutation
}
civic yacht
civic yacht
still garnet
#

👍 now I'm curious what other types might flow through there and if any of them need conversion, so I'll add some debug logs

#

thanks!

civic yacht
ancient kettle
# civic yacht SGTM, doing another pass now!

I fixed the TestRemoteCacheRegistry issue. I think I was hitting the time limit (10 seconds, iirc - right @still garnet?) between when the dagger engines were getting spun up, which caused the registry to shut down, losing the data, etc. So, I created a cache volume mount to store the data. 🙂

ancient kettle
still garnet
still garnet
still garnet
ancient kettle
still garnet
#

😄 yeah, pulling on this thread revealed an elephant on the other side

still garnet
#

Whoa, think I just found some low-hanging fruit for speeding up dagger session start time, which has a huge impact on test runtime and dagger run. Flamegraph before/after attached.

The subset of core/integration/ I ran went from 187.7s to 63.3s. 😱 (The subset excludes any tests that need to talk to a registry since those no longer work when running directly on the host, post-Dagger-in-Dagger refactor)

Here's the patch, which we can apply to our fork of graphql-go-tools: https://gist.github.com/vito/f28b442d7f3bd3bee854921a7c4c16f9

Seems like this parse-then-print-then-parse dance is to deduplicate type definitions, but we don't have any dupes in the first place, so directly appending seems to work fine.

#

Those flamegraphs were captured by hacking --cpuprofile into the Go SDK's session start flags, and then collecting them all with go tool pprof *.prof - so that spans all tests

#

This is also on top of https://github.com/dagger/dagger/pull/4973 - haven't done a benchmark without it yet, I was so surprised by that first flamegraph that I went on a detour to figure that out first. But now I expect this PR to be peanuts in comparison 😅

GitHub

(Based on #4932 just to avoid merge conflicts, will clean up later.)
Currently every schema type has a set of types like this:
type File struct {
ID FileID json:"id"
}

type FileID st...

civic yacht
#

(I think it's just on startup, I'm just double checking)

still garnet
north jay
#

It seems like dagger is not cleaning up processes?
Possibly also not killing containers when the client goes away (below I just cancelled a run)

root       99943  0.0  0.1 723092 19500 ?        Sl   18:34   0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 9c6ba9ce66366de52df135e29f5edb7e02c10e3f85d691d4ac538730b5dd068c -address /run/containerd/containerd.
root       99964  0.6  0.4 961188 80904 ?        Ssl  18:34   0:06  \_ /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --debug
root      100031  0.0  0.0      0     0 ?        Zs   18:34   0:00      \_ [dnsmasq] <defunct>
root      100032  0.0  0.0   1160   756 ?        S    18:34   0:00      \_ /usr/sbin/dnsmasq -u root --conf-file=/var/run/containers/cni/dnsname/dagger/dnsmasq.conf
root      119007  0.0  0.0      0     0 ?        Z    18:48   0:00      \_ [runc] <defunct>
root      119031  0.0  0.0      0     0 ?        Zs   18:48   0:00      \_ [_shim] <defunct>
root      122623  0.0  0.0      0     0 ?        Z    18:50   0:00      \_ [runc] <defunct>
root      122635  0.0  0.1 731736 19292 ?        Ssl  18:50   0:00      \_ /_shim /bin/sh -ec  ???export DEBIAN_FRONTEND=noninteractive ???UPDATED_PATH=/var/cache/apt/moby/updated ???UPDATED_MAX_AGE=60 ???if [ -z "$(find 
root      122645  0.0  0.0   2888   924 ?        S    18:50   0:00          \_ /bin/sh -ec  ???export DEBIAN_FRONTEND=noninteractive ???UPDATED_PATH=/var/cache/apt/moby/updated ???UPDATED_MAX_AGE=60 ???if [ -z "$(find ${U
root      122647  1.3  0.3  70632 64748 ?        S    18:50   0:00              \_ apt-get install -y qemu qemu-system qemu-utils openssh-client linux-image-kvm
systemd+  122654  0.0  0.0  18656  8548 ?        S    18:50   0:00                  \_ /usr/lib/apt/methods/http
still garnet
wet mason
#

@still garnet LGTM and permissions fixed

#

@still garnet it's weird, looks like they were forcing a plain text export just to re-import back. Looks like a cheaply implemented deep copy?

#

re-import as in parse back

still garnet
#

yeah, I don't quite get what it protected against, I even forced a duplicate type definition in our own schemas and nothing seemed to break even with just naively appending 🤷‍♂️

#

the tests also don't cover duplicate types

wet mason
#

maybe there's some funky pointers?

tawny flicker
wet mason
#

@still garnet in any case, I'd merge and keep it in mind if we see some weird behavior in the future. Probably will never happen. Only possible reason I can think of is those doc.Definitions being pointer-altered somewhere else, and this code was making sure to get an immutable copy

#

Maybe a change made out of despair at some point? 😅

still garnet
#

all those pointer values are created within the method (either parsed from text or the ast.NewDocument(nil) right above) so that all seems reasonably safe to me

#

I wouldn't be surprised if this was just YOLO'd and they didn't anticipate it becoming so load-bearing 😂

wet mason
#

Wouldn't be surprised it was useless. Quality of this library is very low, @civic yacht and I spent a lot of time dealing with weird behavior at the beginning

#

Both graphql-go (almost abandonware) and graphql-go-tools (24 stars project)

#

It's the only way to do dynamic graphql in Go though (for extensions), so that's why it is the way it is

still garnet
#

makes sense - I was considering a codegen approach but figured extensions might get in the way of that

wet mason
#

Yeah, gqlgen+codegen would be the best way. Initial version of dagger had extensions and that required dynamic, which prevented us from using gqlgen

civic yacht
#

Yeah I don't think they anticipated the use case in our test suite, which boils down to spinning up probably close to 100 graphql servers in rapid succession 🙂

wet mason
#

we even tried to tweak gqlgen to allow for dynamic schema, but that was a much worse rabbit hole

wet mason
still garnet
#

looks like it was ~8-10 minutes before, not quite the 3x boost I saw locally but I'll take it

civic yacht
still garnet
#

yeah, dagger run ls is pretty snappy now

tawny flicker
north jay
north jay
civic yacht
# north jay Yeah I was hunting around (as you did) and a little surprised hence the disclaim...

Oh totally, trying to do any of this with Go is always a shit show... I found a bug in runtime.LockOSThread a few years ago after having to diagnose why go routines were suddenly no longer able to find files (they had randomly teleported to a different mount namespace). And I at one point could consistently crash the runtime by just trying to set a pid namespace on a exec.Cmd, but I didn't even bother reporting because it seemed just not worth it

#

This is one case where the whole "rewrite it in Rust" meme has some degree of validity, even if it's not actually practical 🙂

rancid turret
#

Hey @civic yacht and @still garnet, can you guys confirm if fixing exitCode (https://github.com/dagger/dagger/issues/3192) is feasible today, including not messing up the cache on failure? I mean even as a new field, so ignore breaking changes.

still garnet
#

PSA: I've noticed a few of my PR's test builds seem to fail with seemingly unrelated service test failures/timeouts. Investigating now, let me know if you've seen the same

#

Hanging/flaky service tests