# v0.12.6 - 28th August 2024
hello! with the ssh auth stuff in, i'm now motivated by the idea of doing another release soon
context dirs feels like it's in the final stages? so maybe we should aim for early next week? maybe tuesday?
SGTM!
pinging <@&946480760016207902> (just so everyone's in the thread)
i'm gonna suggest next wednesday (monday is a bank holiday in the uk, and would like at least one day next week to test/get things in order)
we'll include the ssh private modules, --interactive-command from https://github.com/dagger/dagger/pull/8171, and a bunch of misc fixes
opened https://github.com/dagger/dagger/pull/8234 which fixes #go message and the telemetry.Close() breakage
there's a few remaining prs in the milestone, that would be good to get some eyes on: https://github.com/dagger/dagger/milestone/57 (will be going through all of not-mine in there today)
SGTM! I am aiming to get the resolution of https://github.com/dagger/dagger/pull/8212 merged too, so that we can test the effectiveness of the split with this new release.
Just added it to the milestone so that we track it all together.
I would really like to get this in the Helm chart release: https://github.com/dagger/dagger/pull/8219. I have added it to the milestone & pushed the commit that makes it OK to merge from my POV. Go for it @scarlet nacelle if it works for you too.
I also approved some of your PR so you can merge them tomorrow!
Btw it seems there's something weird with the CI, all my PRs are extra red after a rebase, I'm confused
@coral pilot Context directory CI is green, can I merge it?
(sorry to spam ping you, I know it's a big PR so I don't want to make any mistake)
yep LGTM!
Yep!
There's also a test for it
I'm so happy that this PR is finally merged, almost 2 months of joint work haha
I can focus on docs and extending ignore metadata later, I'll work on interface support on TS first because Helder started to work on Python's support
I don't know if doc update should be part of the release but here's an update of the TS doc with package manager config & runtime version config: https://github.com/dagger/dagger/pull/8251
/cc @prime spruce
@fiery bison so we're aligned on dep/$(jq -r .source dep/dagger.json)?
I didn't write a test for that but you should try it
And let me know
Yeah but is it the way it's implemented?
Can you re-explain what you mean by dep/$(jq -r .source dep/dagger.json), so I'm sure I can answer your question accurately?
When defaultPath is a relative path, for example +defaultPath="./foo" then it should be relative to the source directory of the callee module, as defined in that module's source field in dagger.json.
For example:
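$ cat ./dep/dagger.json
{
  "name": "dep",
  "source": "./src"
}
$ ls ./dep/src
go.mod
go.sum
main.go
countries.txt
// Countries returns the contents of the file passed in; when the caller
// omits the argument, the engine loads it from the +defaultPath location.
func (m *Dep) Countries(
	ctx context.Context,
	// +defaultPath="./countries.txt"
	file *dagger.File,
) (string, error) {
	// Contents returns (string, error), so the function returns both.
	return file.Contents(ctx)
}
$ dagger call -m ./dep countries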
In this example, the function countries from module ./dep with +defaultPath="./countries.txt" should open ./dep/src/countries.txt and not ./dep/countries.txt (source dir and not root dir)
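To make the rule concrete, here is a minimal Go sketch of the resolution being discussed. It is not the engine's implementation: `resolveDefaultPath` and its parameters are hypothetical names, and the absolute-path behaviour (anchored at the context directory root) is taken from the thread rather than from the code.

```go
package main

import (
	"fmt"
	"path"
)

// resolveDefaultPath is a hypothetical helper, not Dagger's actual code.
// It returns the path inside the context directory that a +defaultPath
// value should point to.
//
//	moduleRoot:  directory holding dagger.json, relative to the context root
//	source:      the "source" field from that dagger.json
//	defaultPath: the value of the +defaultPath pragma
func resolveDefaultPath(moduleRoot, source, defaultPath string) string {
	if path.IsAbs(defaultPath) {
		// Assumption from the thread: absolute paths start at the context dir root.
		return path.Clean(defaultPath)
	}
	// Relative paths start at the module's source dir, not its root dir.
	// Joining under "/" also means ".." can never climb above the context root.
	return path.Join("/", moduleRoot, source, defaultPath)
}

func main() {
	// The example above: ./dep/dagger.json has "source": "./src".
	fmt.Println(resolveDefaultPath("dep", "./src", "./countries.txt")) // /dep/src/countries.txt

	// A module whose source lives in .dagger, reaching up to the parent dir.
	fmt.Println(resolveDefaultPath(".", ".dagger", "../README.md")) // /README.md
}
```

Under these assumptions a relative path with `../` still works when the source is in `.dagger`, and the result stays inside the context directory.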
Oh no, right now it's pointing to the root dir, but I can update that really quick, should be a one liner fix
However I thought we all agreed that absolute path = context dir & relative path = root dir :/
That's why I was checking
Okay, I need to fix that and also fix tests then
@fiery bison re-reading the comments in https://github.com/dagger/dagger/issues/7647, it's definitely source dir, unless there were more discussions later outside of the issue.
I see Helder suggesting source dir, and Alex agreeing
I remember that I switched from source dir to root dir with that one
Maybe you forgot to specify which dir (source or root), and I didn't ask any questions :/ My bad
In the spec I actually say "source file" which would be the dream, but not technically possible sadly. So in the comments Helder and Alex propose the next best thing which is source dir
Thank you! The end is in sight
I added the fix, tests are in progress
Relative to the source makes sense, but just to check, can a relative path also be "../" if I want to go up to the parent if the source is in .dagger for instance?
if anyone's around can i get a review on https://github.com/dagger/dagger/pull/8217 ?
oops, we also need to bump goreleaser after bumping go to 1.23 https://github.com/dagger/dagger/pull/8256
moving the ci test split out of the milestone - https://github.com/dagger/dagger/pull/8212
since this just affects our test suite/ci, this shouldn't block the release
@fiery bison what's the timeline on the ts optimization work you've been doing? as in, will it land in the next couple hours?
otherwise, i'd like to go ahead with the release even if it's not in there, there's a pretty critical go fix (pinning otel deps) we should be getting out asap
It should land soon, I'm fixing context dir first
@little veldt Fix of the context dir: https://github.com/dagger/dagger/pull/8260
just saw this thread and excited. Are we planning to release 0.12.6 today? If so, I need to update my demo for tomorrow to the new version because of this ssh support.
would love to get a quick approval on https://github.com/dagger/dagger/pull/8251 btw
/cc @silent notch
@little veldt CI's green, ready to be merged: https://github.com/dagger/dagger/pull/8260
somehow managed to mess up the linting about a couple weeks back: https://github.com/dagger/dagger/pull/8261
Fixes these issues on main: https://github.com/dagger/dagger/actions/runs/10596511181/job/29364633106, which was introduced in #8151.
Some of these lints had stopped working, since the paths had go...
Looking now.
It needs a few quick fixes, but it's 95% there. Making them now, then approving & merging.
Does this include v3, or does it mean Yarn v4 and above?
yarn v3 and above
so yep, it includes v3
i'm about half an hour from my eod, so don't have time to cut the release - if anyone else is around and wants to, they can, otherwise i'll pick it up first thing tomorrow
Thursday is a good day to release
Where is this property configured?
Same question for
I explained it just above in the doc page, (it will generate a xx file)
Good to merge from my side.
Thanks!
Unlikely to happen today, most likely tomorrow.
Looking at this now.
https://github.com/dagger/dagger/pull/8236 CI is green, waiting for an approval to be merged
/cc @fluid yew
Btw where is your benchmark workflow? Or is there a way for me to try it?
reviewed
it runs off of main every night (or manually), not out of PRs
it's not doing anything fancy, just an init and then 3 dagger functions in a row:
- the first one to try performance out of the box after dagger init
- the second one to try caching
- the third one after changing main.ts to check the performance after a code change
you can do the same locally and just look at a trace
Just saw this was merged, sorry that I didn't have a chance to review it until just now. I had some questions which I added in the PR. Let's discuss further in https://discord.com/channels/707636530424053791/1278425034586587300
Okay
hey @fiery bison i'm going to push out https://github.com/dagger/dagger/pull/8236 into the next release
andrea is on holiday, and i'm not fully aware of the context of the pr to approve today - since it's not fixing a regression in the last release, i don't think there's a rush and we can wait till next week?
Yes sure! We can keep it for next week
There's already plenty of changes for this release
prep pr (release notes, sdk updates, etc): https://github.com/dagger/dagger/pull/8268
could also get this super minor little typo fix in https://github.com/dagger/dagger/pull/8267 (mostly ci flake debugging related)
going for lunch now, once these are merged, i can tag and release
hm the wolfi publishing seems to have broken
re-running, maybe it's an infra fluke?
hm, nope, not a fluke
we're getting io errors?
that's an odd one. maybe an infra thing? bad disk?
cc @silent notch
(38/38) Installing go-1.23 (1.23.0-r0)
2 errors; 703 MiB in 50 packages
Stderr:
ERROR: Failed to create usr/lib/libisl.so.23.3.0: Input/output error
ERROR: isl-0.26-r4: IO ERROR
ERROR: Failed to create usr/bin/ld.gold: Input/output error
ERROR: binutils-gold-2.43.1-r0: IO ERROR
yeah, i can't repro this locally
wolfi container builds and runs fine
hm, doesn't look like we're running out of space
/dev/nvme1n1 1.8T 40G 1.7T 3% /host/var/lib/dagger
from df -h for that node
hm, okay, the job just passed
maybe the bad node is gone
okay, i'm gonna tag main now
hmmm
no it happened again, on an entirely different node
node uptime is 6 minutes
suspiciously, this seems to only happen on the wolfi publish?
That is rare, but possible. Two different nodes, highly unlikely. Which nodes are these?
perhaps there's some new update? unfortunately, does wolfi have patch notes?
there's a currently running job on ip-192-168-109-156.us-east-2.compute.internal
job is dagger-g2-v0-12-5-16c-od-wxc9f-runner-l8hss
looking now
looking at the wolfi/os history, i see no indication of what might have changed in the last few hours
we have a few half done engine builds in https://github.com/dagger/dagger/pkgs/container/engine
does it make sense to delete those and delete the tag while we investigate?
yes
i don't have permissions to delete packages i don't think
yeah, that is an odd one. I don't think that Wolfi is pinned, but it should be. this may help us: https://github.com/dagger/dagger/pull/7782/files#diff-71da6ba00bc676920605e36f15ded23d1da35d117eff8c671c7dc0aa62e7b539R33-R34
I can see that we had issues publishing here too: https://github.com/dagger/dagger/actions/runs/10615397944/attempts/2 , and then it eventually worked.
on it
i'll do the tag
should be doable from this page i think: https://github.com/dagger/dagger/pkgs/container/engine/versions?filters[version_type]=tagged
wolfi published earlier, just not the gpu variant:
mm, but the cli didn't run at all, since it's dependent on everything succeeding
done, all those versions are now gone
awesome
right okay, we could try pinning wolfi, but what do we pin it to?
ideally we want a hash from something like yesterday
We keep all COMMIT-wolfi & COMMIT-wolfi-gpu images, so I would pick one of those.
checking now.
okay, so the history of c5687d86a6ba78ec2fffcb46be5caaa73f561b54-wolfi pulls in cgr.dev/chainguard/wolfi-base:latest@sha256:72c8bfed3266b2780243b144dc5151150015baf5a739edbbde53d154574f1607
dive registry.dagger.io/engine:ff17731b8ca5e2a86850fef100cb88cf8d239955-wolfi for https://github.com/dagger/dagger/commit/ff17731b8ca5e2a86850fef100cb88cf8d239955 is showing cgr.dev/chainguard/wolfi-base:latest@sha256:72c8bfed3266b2780243b144dc5151150015baf5a739edbbde53d154574f1607
nice, jinx
Another thing that we should do is run this locally and confirm that latest wolfi does indeed fail
doing that now
dagger -m . call --source=.:default engine with-base --image=wolfi container terminal seems to work for me
hm, this is currently what :latest points to - will look back a bit further
hm, yeah, this is the same version that appeared to be fine for the last commit merged yesterday
> docker buildx imagetools inspect --raw ghcr.io/dagger/engine:ff17731b8ca5e2a86850fef100cb88cf8d239955-wolfi@sha256:03731c12adeff0682a5a3a4ffb8f74624f1d272bc08baa3b943c84e1011275e2 | jq '.history[0]'
{
  "created": "2024-08-28T16:57:36.26278684Z",
  "created_by": "pulled from cgr.dev/chainguard/wolfi-base:latest@sha256:72c8bfed3266b2780243b144dc5151150015baf5a739edbbde53d154574f1607",
  "comment": "buildkit.exporter.image.v0"
}
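For anyone wanting to pull a pin candidate out of that history entry programmatically, a small Go sketch follows. The helper name `pinnedBase` and its parsing rules are assumptions for illustration only; it relies on nothing beyond the "pulled from <ref>@sha256:<digest>" shape visible in the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// pinnedBase is a hypothetical helper: it extracts a digest-pinned image
// reference from a history "created_by" string shaped like
// "pulled from <ref>@sha256:<64 hex chars>". It reports false when the
// entry has a different shape or the reference carries no sha256 digest.
func pinnedBase(createdBy string) (string, bool) {
	ref, ok := strings.CutPrefix(createdBy, "pulled from ")
	if !ok {
		return "", false
	}
	_, digest, ok := strings.Cut(ref, "@sha256:")
	if !ok || len(digest) != 64 {
		return "", false
	}
	return ref, true
}

func main() {
	entry := "pulled from cgr.dev/chainguard/wolfi-base:latest@sha256:72c8bfed3266b2780243b144dc5151150015baf5a739edbbde53d154574f1607"
	if ref, ok := pinnedBase(entry); ok {
		// This is the reference a pinned base image could use instead of :latest.
		fmt.Println(ref)
	}
}
```

Since the same digest shows up for both the last good commit and the current :latest, pinning to it would not have changed what was pulled, which matches the conclusion in the thread.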
some googling indicates that this could potentially also be a networking error
OK, so these could be both disk or network issues. While one bad disk is possible, these failed across 3 different machines, each using the local disk, which makes it very unlikely. I suspect network issues, which are usually transient.
yes, that is my assumption too.
potentially an issue in the wolfi registry? https://status.cgr.dev/
no issue reported yet though
I checked, nothing there.
Small blips usually go unnoticed. All systems should assume 99.9% reliability, which I suspect is the case here.
Want to try again?
i can tag again, yup
Running this locally too.
watching the node via dmesg too
imo, we can take this opportunity to update this process and daggerize more - ideally we should build all the images, and then push all the images
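A rough Go sketch of that "build everything, then push everything" ordering, so a failed build cannot leave half-published packages behind. This is not the actual release pipeline: `publishAll`, the `image` type, the variant names, and the simulated error are all placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// image stands in for a built engine image variant (placeholder type).
type image struct{ variant string }

// errSimulatedIO mimics the intermittent failure seen while building.
var errSimulatedIO = errors.New("simulated Input/output error")

// publishAll builds every variant first and only starts pushing once all
// builds have succeeded, so a build failure publishes nothing at all.
func publishAll(
	variants []string,
	build func(variant string) (image, error),
	push func(img image) error,
) error {
	built := make([]image, 0, len(variants))
	for _, v := range variants {
		img, err := build(v)
		if err != nil {
			return fmt.Errorf("build %s: %w", v, err)
		}
		built = append(built, img)
	}
	for _, img := range built {
		if err := push(img); err != nil {
			return fmt.Errorf("push %s: %w", img.variant, err)
		}
	}
	return nil
}

func main() {
	var pushed []string
	build := func(v string) (image, error) {
		if v == "wolfi-gpu" {
			return image{}, errSimulatedIO
		}
		return image{variant: v}, nil
	}
	push := func(img image) error {
		pushed = append(pushed, img.variant)
		return nil
	}
	err := publishAll([]string{"alpine", "wolfi", "wolfi-gpu"}, build, push)
	fmt.Println("error:", err, "pushed:", len(pushed))
}
```

This only protects against build failures. A push that fails partway through can still leave a partial set in the registry, so cleanup of tags and packages may still be needed in that case.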
i'll queue that work for tomorrow
looks like it's happened again
Indeed. This dmesg output makes me suspect that the networking on the AWS EC2 instances themselves is dropping:
more context:
[ 22.945923] IPv6: ADDRCONF(NETDEV_CHANGE): enia46631b86c3: link becomes ready
[ 22.948513] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 22.991230] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 23.043544] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 24.211421] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 26.797440] pci 0000:00:06.0: [1d0f:ec20] type 00 class 0x020000
[ 26.799595] pci 0000:00:06.0: reg 0x10: [mem 0x00000000-0x00001fff]
[ 26.801762] pci 0000:00:06.0: reg 0x14: [mem 0x00000000-0x00001fff]
[ 26.803937] pci 0000:00:06.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[ 26.806354] pci 0000:00:06.0: enabling Extended Tags
[ 26.808626] pci 0000:00:06.0: BAR 2: assigned [mem 0xc0000000-0xc00fffff pref]
[ 26.811112] pci 0000:00:06.0: BAR 0: assigned [mem 0xc0100000-0xc0101fff]
[ 26.813418] pci 0000:00:06.0: BAR 1: assigned [mem 0xc0102000-0xc0103fff]
[ 26.815775] ena 0000:00:06.0: enabling device (0000 -> 0002)
[ 26.826648] ena 0000:00:06.0: ENA device version: 0.10
[ 26.828435] ena 0000:00:06.0: ENA controller version: 0.0.1 implementation version 1
[ 26.932621] ena 0000:00:06.0: Forcing large headers and decreasing maximum TX queue size to 512
[ 26.938067] ena 0000:00:06.0: ENA Large LLQ is enabled
[ 26.953325] ena 0000:00:06.0: Elastic Network Adapter (ENA) found at mem c0100000, mac addr 02:95:49:22:53:e7
[ 27.225434] ena 0000:00:06.0 eth1: Local page cache is disabled for less than 16 channels
[ 37.200361] xfs filesystem being remounted at /var/lib/kubelet/pods/0604b692-52ad-4c9b-8682-4dda5f4ef5d0/volume-subpaths/dagger-engine-config/dagger-engine/2 supports timestamps until 2038 (0x7fffffff)
[ 37.311880] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 192.113385] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 327.349581] IPv6: ADDRCONF(NETDEV_CHANGE): eniee287a2af97: link becomes ready
[ 327.353455] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
and the node just went away
hm, if we still suspect something ephemeral, is it worth pausing the process until tomorrow/next week?
eth0 should not change state
yes, that sounds reasonable. we should also be able to switch CI runners in cases like this one.
i'll delete the tag? can you delete the engine packages?
tag done too

working on this now
we are re-running the job which failed and watching the instance. we couldn't find any networking issues. no disk issues either. things are pointing to overlayfs. maybe a new kernel?
jumping in team so that others can join us if interested.
also our test publish github job should now attempt to build all the variants as well, so we should hopefully catch the weird EOF issue in PRs now as well if it keeps happening
That bpfcc package installed in the sysadmin pod now comes with a ton of tracers besides just mountsnoop, can see them with ls /sbin/*-bpfcc or in the README here: https://github.com/iovisor/bcc
Suspect if we can get the right one running while that IO error is hit we might find something useful. Some that stick out as potentially related (depending on whether the IO error is coming from a filesystem syscall or network syscall): https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py and https://github.com/iovisor/bcc/blob/master/tools/tcplife.py
Might be tricky with the timing but getting a random IO error while installing packages with apk is otherwise kind of a hopeless situation, so just throwing out as a possibility
I would really like to try that out. Care to show us how to use it? @scarlet nacelle is still looking into this, I am unlikely to have more time today, but would really like to circle back to this tomorrow if you're up for it. Finding 30 mins in your calendar so that we don't forget.
Yeah happy to meet, have to go to a dr appt w/ Alice at 9:45 Pacific, so maybe 9? It's also pretty much just a matter of opening a shell and running opensnoop-bpfcc or tcplife-bpfcc and watching the output, so not a ton to go over
It's the unexpected things that we find out which I am most interested in. I am sure that we will find things to improve in the 30 mins that we get together.
While trying to repro flakes I think I may have hit a similar-ish looking situation: https://github.com/dagger/dagger/actions/runs/10621505762/job/29443632581?pr=8203#step:3:1244
In the testdev workflow, it took like ~20m to build the dev engine (seemed to take a long time to install apk packages) and then when it was trying to get the server version got internal HTTP2 stream errors while downloading go deps (for modules)
So I would err on the side of this being some deep networking problem rather than disk, if this is indeed the same problem manifesting elsewhere
I didn't catch it while it was happening so didn't get to run any of the trace tools, hopefully I can catch it on re-runs
(If there's more updates I'll make a separate thread so we don't overly pollute the release thread)
One of the questions we had was whether this could be related to some of the commits we added lately. We think not, but to validate we did the following: start a pod with the same wolfi image variant (cgr.dev/chainguard/wolfi-base:latest) and run the command that is giving us the issue: apk add --no-cache git openssh pigz xz iptables ip6tables. Doing that shows the exact same failure we see in our pipelines. Doing the same thing with a regular alpine image works every time. First screenshot is alpine, second screenshot is wolfi. Could there be some issue related to https://packages.wolfi.dev/os/x86_64/APKINDEX.tar.gz? I'm not sure. I'll run the tools suggested by @coral pilot while reproing the issue and see if something interesting pops up
see my messages right above, I'm seeing extreme slowness installing apk packages from alpine (not wolfi) and then weird internal networking errors from go later. If it's the same problem, then it doesn't seem specific to wolfi
Possible it's unrelated
I'm not able to repro the slowness in alpine, at least when executing it outside of the dagger engine
Both screenshots above are containers running directly on the host
Were you able to use the bcc tools in a sysadmin pod? Getting failures at the moment
Yeah I just hit that too, it worked yesterday when I manually installed the package. I figured out the right symlinks to get the headers in place, let me grab the commands from the shell history, one sec
In case it's useful: right now we have a host that we are keeping around for this investigation. It won't go away until we want it to. The host is 192-168-187-242. If you want to add containers on that host, the easiest way is to do something like this: kubectl debug -n dagger-runners -it dagger-od-engines-v0-12-5-engine-lkg6j --image=cgr.dev/chainguard/wolfi-base:latest --target=dagger-engine
mkdir /lib/modules
ln -s /host/lib/modules/5.10.223-211.872.amzn2.x86_64/ /lib/modules/5.10.223-211.872.amzn2.x86_64
ln -s /host/usr/src/kernels /usr/src/kernels
that should put the headers in the right place in the container and get those commands to work ^
To repro what I'm seeing right now, I just am pushing empty commits to a PR: https://github.com/dagger/dagger/pull/8203
Last two pushes:
- First push - Extreme slowness in installing alpine apks in one of the testdev workflows, followed by internal http2 stream errors https://github.com/dagger/dagger/actions/runs/10621505762/job/29443632581?pr=8203#step:3:1244
- Second push - Extreme slowness in installing alpine apks in testdev, but no failures ultimately https://github.com/dagger/dagger/actions/runs/10621995772/job/29445265710?pr=8203
It's inconsistent though, other testdev workflows are fine
I'm 100% getting the slowness on wolfi now
Yeah, it's working okay now
Trying to make it fail now with opensnoop on the side and it works every time
This potentially related issue has some good suggestions on root causes (which could all be extremely ephemeral issues, and later comments suggest the problem just went away): https://github.com/moby/buildkit/issues/746#issuecomment-447311499
None of those suggestions are actually that specific to docker afaict either, they could all happen outside docker
The path MTU thing mentioned in particular is something that could result in both slowness and/or bizarre errors
if you happen to get it to happen again, other things to try:
- apt install -y traceroute; traceroute packages.wolfi.dev - could be helpful if this is some weird network path thing. may need to run a few times since you won't always get the same path
- since you can repro sometimes with a one-off command, plain old strace might be easier than the bpf tools (those tools are mainly helpful when you don't even know what process is gonna break and just need to trace everything):
strace -f --seccomp-bpf <command>
Still getting super random network errors every time I push to that PR, just saw a brand new one: https://github.com/dagger/dagger/actions/runs/10622859102/job/29448066758?pr=8203#step:3:57
It's a "fun" game because I need to know the node before the gh job dies otherwise I won't know which eks node to pop a shell on
They seem to have dissipated now... I also just tried to re-run the publish job on main and wolfi built successfully: https://github.com/dagger/dagger/actions/runs/10619063519/job/29448999619
I really wouldn't be surprised if this was just a very ill-timed networking blip either with AWS or some other intermediate network that packets get commonly routed through, especially since it seemed to intermittently affect multiple endpoints besides wolfi. Path MTU issues in particular triggered memories of similar problems when I was actually working on this stuff at AWS, you can just randomly lose packets in a black hole if anywhere in the route has a misconfiguration
Separate from above, Helm CI fails everywhere right now because it's looking for a v0.12.6 engine? e.g. https://github.com/dagger/dagger/actions/runs/10624216581/job/29452098917
Side effect of starting the release today but then hitting those errors and stopping?
Yeah woops, I'm gonna follow up after the release to try and get this to install the :main engine instead, like the rest of the provisioning tests
hopefully the weird wolfi errors are gone now
i've merged the pr to build all the variants before pushing as well, so hopefully even if it is happening, we won't have half-published packages
so i'm gonna go ahead and tag
happy now
some slight issues in the sdk automated release notes - python accidentally had the elixir ones, php tried to publish them to the wrong repo (both easy to fix, just took a bit of manual intervention, will fix for next time)
engine + sdks successful, published docs as well now
cc @daring pier @scarlet nacelle @silent notch dagger playground can now be updated
cc @daring pier @scarlet nacelle @leaden hollow likewise with the daggerverse
dagger-for-github pr here: https://github.com/dagger/dagger-for-github/pull/144 (cc @silent notch @vagrant loom)