#🚀 v0.12.6 - 28th August 2024

little veldt
#

👋 hello! with the ssh auth stuff in, i'm now motivated by the idea of doing another release soon 🙂

#

context dirs feels like it's in the final stages? so maybe we should aim for early next week? maybe tuesday?

daring pier
#

SGTM!

little veldt
#

pinging <@&946480760016207902> 👋 (just so everyone's in the thread)

#

i'm gonna suggest next wednesday (monday is a bank holiday in the uk, and would like at least one day next week to test/get things in order)
we'll include the ssh private modules, --interactive-command from https://github.com/dagger/dagger/pull/8171, and a bunch of misc fixes

#

✨ v0.12.6 - 28th August 2024

tough vessel
little veldt
silent notch
#

Just added it to the milestone so that we track it all together.

#

I would really like to get this in the Helm chart release: https://github.com/dagger/dagger/pull/8219. I have added it to the milestone & pushed the commit that makes it OK to merge from my POV. Go for it @scarlet nacelle if it works for you too.

fiery bison
#

Btw it seems there's something weird with the CI, all my PRs are extra red after a rebase, I'm confused

fiery bison
#

@coral pilot Context directory CI is green, can I merge it? 😄

#

(sorry to spam ping you, I know it's a big PR so I don't want to make any mistake)

rustic gorge
#

@fiery bison did we resolve your question of "relative path in dep"?

fiery bison
#

Yep!

#

There's also a test for it

#

I'm so happy that this PR is finally merged, almost 2 months of common work haha

#

I can focus on docs and extending ignore metadata later, I'll work on interface support on TS first because Helder started to work on Python's support

rustic gorge
#

@fiery bison so we're aligned on dep/$(jq -r .source dep/dagger.json)?

fiery bison
#

And let me know

rustic gorge
#

Yeah but is it the way it's implemented?

fiery bison
#

Can you re-explain to me what you mean by dep/$(jq -r .source dep/dagger.json) so I'm sure I can answer your question accurately

rustic gorge
# fiery bison Can you re explain to me what you mean by `dep/$(jq -r .source dep/dagger.json` ...

When defaultPath is a relative path, for example +defaultPath="./foo" then it should be relative to the source directory of the callee module, as defined in that module's source field in dagger.json.

For example:

$ cat ./dep/dagger.json
{
  "name": "dep",
  "source": "./src"
}
$ ls ./dep/src
go.mod
go.sum
main.go
countries.txt

func (m *Dep) Countries(
	ctx context.Context,
	// +defaultPath="./countries.txt"
	file *dagger.File,
) (string, error) {
	// Contents returns (string, error), so the function returns both.
	return file.Contents(ctx)
}

$ dagger call -m ./dep countries

In this example, the function countries from module ./dep with +defaultPath="./countries.txt" should open ./dep/src/countries.txt and not ./dep/countries.txt (source dir and not root dir)
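The resolution rule described above can be sketched in shell. This is only an illustration of the rule, not how Dagger implements it: `resolve_default_path` is a hypothetical helper, it assumes `jq` and GNU `realpath` are installed, and the fallback to `.` when `source` is missing is a guess.

```shell
#!/usr/bin/env bash
# Sketch only: resolve a relative +defaultPath against the callee module's
# source directory (the "source" field of its dagger.json).
# Assumptions: jq and GNU realpath are installed; a missing "source" field
# falls back to "." (a guess, not confirmed in this thread).
resolve_default_path() {
  local module_root="$1" default_path="$2" source_dir
  source_dir="$(jq -r '.source // "."' "$module_root/dagger.json")" || return 1
  # -m normalizes "./" segments without requiring the file to exist.
  realpath -m "$module_root/$source_dir/$default_path"
}
```

With the example module above, `resolve_default_path ./dep ./countries.txt` prints an absolute path ending in `dep/src/countries.txt`, not `dep/countries.txt`.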

fiery bison
rustic gorge
fiery bison
#

Okay, I need to fix that and also fix tests then

rustic gorge
#

I see Helder suggesting source dir, and Alex agreeing

fiery bison
#

I remember that I switched from source dir to root dir with that one

#

Maybe you forgot to specify source dir vs. root dir, and I didn't ask any questions :/ My bad

rustic gorge
#

In the spec I actually say "source file" which would be the dream, but not technically possible sadly. So in the comments Helder and Alex propose the next best thing which is source dir

fiery bison
#

Yeah I know, I got confused by your comment on my PR, that's my bad

#

I'll fix that

rustic gorge
#

Thank you! The end is in sight 🙂

fiery bison
#

I added the fix, tests are in progress 😄

little veldt
#

Relative to the source makes sense, but just to check, can a relative path also be "../" if I want to go up to the parent if the source is in .dagger for instance?

little veldt
little veldt
little veldt
little veldt
#

@fiery bison what's the timeline on the ts optimization work you've been doing? as in, will it land in the next couple hours?
otherwise, i'd like to go ahead with the release even if it's not in there, there's a pretty critical go fix (pinning otel deps) we should be getting out asap

fiery bison
eager river
#

just saw this thread and excited pepe_hands. Are we planning to release 0.12.6 today? If so, I need to update my demo for tomorrow to the new version because of the ssh support.

fiery bison
little veldt
silent notch
silent notch
fiery bison
#

so yep, it includes v3

little veldt
#

i'm about half an hour from my eod, so don't have time to cut the release - if anyone else is around and wants to, they can, otherwise i'll pick it up first thing tomorrow

silent notch
silent notch
#

Same question for

fiery bison
silent notch
fiery bison
#

Thanks!

silent notch
fiery bison
#

/cc @fluid yew

#

Btw where is your benchmark workflow? Or is there a way for me to try it?

fluid yew
fluid yew
#

it's not doing anything fancy, just an init and then 3 dagger functions in a row:

  1. the first one to try performance out of the box after dagger init
  2. the second one to try caching
  3. the third one after changing main.ts to check the performance after a code change
#

you can do the same locally and just look at a trace
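A rough way to reproduce those steps locally with timings, as a sketch. `time_step` and `MAIN_TS` are hypothetical names, and the commented usage assumes `dagger init --sdk=typescript` and `dagger functions` are the commands meant above.

```shell
#!/usr/bin/env bash
# Sketch only: time one step and report wall-clock seconds.
time_step() {
  local label="$1" start end status=0
  shift
  start="$(date +%s)"
  "$@" || status=$?
  end="$(date +%s)"
  echo "$label: $((end - start))s (exit $status)"
  return "$status"
}

# Intended usage, not run here. MAIN_TS is a hypothetical variable pointing at
# the module's main TypeScript file.
#   time_step "init"              dagger init --sdk=typescript
#   time_step "first run"         dagger functions
#   time_step "cached run"        dagger functions
#   echo "// touch" >> "$MAIN_TS"
#   time_step "after code change" dagger functions
```

Second-level resolution is coarse; the trace mentioned above gives a finer breakdown.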

prime spruce
little veldt
#

andrea is on holiday, and i'm not fully aware of the context of the pr to approve today - since it's not fixing a regression in the last release, i don't think there's a rush and we can wait till next week?

fiery bison
#

Yes sure! We can keep it for the next week 😄

#

There's already plenty of changes for this release

little veldt
#

going for lunch now, once these are merged, i can tag and release

little veldt
#

hm the wolfi publishing seems to have broken

#

re-running, maybe it's an infra fluke?

#

hm, nope, not a fluke

#

we're getting io errors?

tough vessel
#

that's an odd one. maybe an infra thing? bad disk? thinkies cc @silent notch

#
(38/38) Installing go-1.23 (1.23.0-r0)
2 errors; 703 MiB in 50 packages
Stderr:
ERROR: Failed to create usr/lib/libisl.so.23.3.0: Input/output error
ERROR: isl-0.26-r4: IO ERROR
ERROR: Failed to create usr/bin/ld.gold: Input/output error
ERROR: binutils-gold-2.43.1-r0: IO ERROR
little veldt
#

yeah, i can't repro this locally

#

wolfi container builds and runs fine

#

hm, doesn't look like we're running out of space

/dev/nvme1n1    1.8T   40G  1.7T   3% /host/var/lib/dagger

from df -h for that node

#

hm, okay, the job just passed 🤔

#

maybe the bad node is gone 👀

#

okay, i'm gonna tag main now

#

hmmm

#

no it happened again, on an entirely different node

#

node uptime is 6 minutes

#

suspiciously, this seems to only happen on the wolfi publish?

silent notch
little veldt
#

perhaps there's some new update? unfortunately, does wolfi have patch notes?

#

there's a currently running job on ip-192-168-109-156.us-east-2.compute.internal

#

job is dagger-g2-v0-12-5-16c-od-wxc9f-runner-l8hss

silent notch
#

looking now

little veldt
#

looking at the wolfi/os history, i see no indication of what might have changed in the last few hours 🤔

silent notch
#

yes

little veldt
#

i don't have permissions to delete packages i don't think

silent notch
little veldt
#

i'll do the tag

little veldt
silent notch
#

wolfi published earlier, just not the gpu variant:

little veldt
#

mm, but the cli didn't run at all, since it's dependent on everything succeeding

silent notch
#

done, all those versions are now gone

little veldt
#

awesome 🎉

#

right okay, we could try pinning wolfi, but what do we pin it to?

#

ideally we want a hash from something like yesterday

silent notch
#

We keep all COMMIT-wolfi & COMMIT-wolfi-gpu images, so I would pick one of those.

#

checking now.

little veldt
#

okay, so the history of c5687d86a6ba78ec2fffcb46be5caaa73f561b54-wolfi pulls in cgr.dev/chainguard/wolfi-base:latest@sha256:72c8bfed3266b2780243b144dc5151150015baf5a739edbbde53d154574f1607

silent notch
little veldt
#

nice 😄 jinx

silent notch
#

Another thing that we should do is run this locally and confirm that latest wolfi does indeed fail

#

doing that now

little veldt
#

dagger -m . call --source=.:default engine with-base --image=wolfi container terminal seems to work for me 😢

little veldt
#

hm, yeah, this is the same version that appeared to be fine for the last commit merged yesterday

#
> docker buildx imagetools inspect --raw ghcr.io/dagger/engine:ff17731b8ca5e2a86850fef100cb88cf8d239955-wolfi@sha256:03731c12adeff0682a5a3a4ffb8f74624f1d272bc08baa3b943c84e1011275e2 | jq '.history[0]'

{
  "created": "2024-08-28T16:57:36.26278684Z",
  "created_by": "pulled from cgr.dev/chainguard/wolfi-base:latest@sha256:72c8bfed3266b2780243b144dc5151150015baf5a739edbbde53d154574f1607",
  "comment": "buildkit.exporter.image.v0"
}
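To turn a history entry like the one above into something to pin to, a small filter is enough. This is a sketch: `pinned_base_ref` is a hypothetical helper, it assumes `jq` is available, and it assumes `created_by` always starts with `pulled from `.

```shell
#!/usr/bin/env bash
# Sketch only: read one image-history entry (JSON on stdin, shaped like the
# output above) and print the digest-pinned base image it was pulled from.
# Assumption: created_by always starts with "pulled from ".
pinned_base_ref() {
  jq -r '.created_by' | sed -n 's/^pulled from //p'
}
```

Piping the `jq '.history[0]'` output from the command above into `pinned_base_ref` yields the `cgr.dev/chainguard/wolfi-base:latest@sha256:...` reference.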
#

some googling indicates that this could potentially also be a networking error

silent notch
#

OK, so these could be both disk or network issues. While one bad disk is possible, these failed across 3 different machines, each using the local disk, which makes it very unlikely. I suspect network issues, which are usually transient.

#

yes, that is my assumption too.

little veldt
#

no issue reported yet though

silent notch
#

I checked, nothing there.

#

Small blips usually go unnoticed. All systems should assume 99.9% reliability, which I suspect is the case here.
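If blips like this really are transient, a common mitigation is to retry with backoff. A minimal sketch, where `retry` is a hypothetical helper and not something the Dagger CI is known to use:

```shell
#!/usr/bin/env bash
# Sketch only: run a command up to ATTEMPTS times, doubling the delay between
# tries. Usage: retry ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}
```

For example, `retry 3 5 apk add --no-cache git openssh pigz xz iptables ip6tables` would try the install up to three times. Retrying hides a flaky network; it does not explain it.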

#

Want to try again?

little veldt
#

i can tag again, yup

silent notch
#

Running this locally too.

little veldt
silent notch
#

watching the node via dmesg too

little veldt
#

imo, we can take this opportunity to update this process and daggerize more - ideally we should build all the images, and then push all the images
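The build-everything-then-push-everything idea can be sketched as two loops. `build_variant` and `push_variant` are hypothetical placeholders, since the real build and push steps are not shown in this thread; `wolfi` and `wolfi-gpu` are the variants named above.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: the real build and push steps are not shown in
# this thread, so these only print what they would do.
build_variant() { echo "build $1"; }
push_variant() { echo "push $1"; }

# Build every variant first; push only if all builds succeeded, so a failed
# build never leaves some variants published and others missing.
publish_all() {
  local variant
  for variant in "$@"; do
    build_variant "$variant" || {
      echo "build failed for $variant; nothing was pushed" >&2
      return 1
    }
  done
  for variant in "$@"; do
    push_variant "$variant" || return 1
  done
}
```

Called as `publish_all wolfi wolfi-gpu`. A push that fails partway through can still leave a partial release, so this narrows the window rather than closing it.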

#

i'll queue that work for tomorrow

#

looks like it's happened again 🤔

silent notch
#

Indeed. This dmesg output makes me suspect that the networking on the AWS EC2 instances themselves is dropping:

#

more context:

[   22.945923] IPv6: ADDRCONF(NETDEV_CHANGE): enia46631b86c3: link becomes ready
[   22.948513] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   22.991230] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   23.043544] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   24.211421] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   26.797440] pci 0000:00:06.0: [1d0f:ec20] type 00 class 0x020000
[   26.799595] pci 0000:00:06.0: reg 0x10: [mem 0x00000000-0x00001fff]
[   26.801762] pci 0000:00:06.0: reg 0x14: [mem 0x00000000-0x00001fff]
[   26.803937] pci 0000:00:06.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[   26.806354] pci 0000:00:06.0: enabling Extended Tags
[   26.808626] pci 0000:00:06.0: BAR 2: assigned [mem 0xc0000000-0xc00fffff pref]
[   26.811112] pci 0000:00:06.0: BAR 0: assigned [mem 0xc0100000-0xc0101fff]
[   26.813418] pci 0000:00:06.0: BAR 1: assigned [mem 0xc0102000-0xc0103fff]
[   26.815775] ena 0000:00:06.0: enabling device (0000 -> 0002)
[   26.826648] ena 0000:00:06.0: ENA device version: 0.10
[   26.828435] ena 0000:00:06.0: ENA controller version: 0.0.1 implementation version 1
[   26.932621] ena 0000:00:06.0: Forcing large headers and decreasing maximum TX queue size to 512
[   26.938067] ena 0000:00:06.0: ENA Large LLQ is enabled
[   26.953325] ena 0000:00:06.0: Elastic Network Adapter (ENA) found at mem c0100000, mac addr 02:95:49:22:53:e7
[   27.225434] ena 0000:00:06.0 eth1: Local page cache is disabled for less than 16 channels
[   37.200361] xfs filesystem being remounted at /var/lib/kubelet/pods/0604b692-52ad-4c9b-8682-4dda5f4ef5d0/volume-subpaths/dagger-engine-config/dagger-engine/2 supports timestamps until 2038 (0x7fffffff)
[   37.311880] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[  192.113385] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[  327.349581] IPv6: ADDRCONF(NETDEV_CHANGE): eniee287a2af97: link becomes ready
[  327.353455] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
#

and the node just went away

little veldt
#

hm, if we still suspect something ephemeral, is it worth pausing the process until tomorrow/next week?

silent notch
#

eth0 should not change state

#

yes, that sounds reasonable. we should also be able to switch CI runners in cases like this one.

little veldt
#

i'll delete the tag? can you delete the engine packages?

silent notch
#

yes

#

done

little veldt
#

tag done too

tough vessel
little veldt
silent notch
#

we are re-running the job which failed and watching the instance. we couldn't find any networking issues. no disk issues either. things are pointing to overlayfs. maybe a new kernel?

#

jumping in team so that others can join us if interested.

little veldt
#

also our test publish github job should now attempt to build all the variants as well, so we should hopefully catch the weird EOF issue in PRs now as well if it keeps happening

coral pilot
#

That bpfcc package installed in the sysadmin pod now comes with a ton of tracers besides just mountsnoop, can see them with ls /sbin/*-bpfcc or in the README here: https://github.com/iovisor/bcc

Suspect if we can get the right one running while that IO error is hit we might find something useful. Some that stick out as potentially related (depending on whether the IO error is coming from a filesystem syscall or network syscall): https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py and https://github.com/iovisor/bcc/blob/master/tools/tcplife.py

Might be tricky with the timing but getting a random IO error while installing packages with apk is otherwise kind of a hopeless situation, so just throwing out as a possibility

silent notch
coral pilot
#

Yeah happy to meet, have to go to a dr appt w/ Alice at 9:45 Pacific, so maybe 9? It's also pretty much just a matter of opening a shell and running opensnoop-bpfcc or tcplife-bpfcc and watching the output, so not a ton to go over

silent notch
#

It's the unexpected things that we find out which I am most interested in. I am sure that we will find things to improve in the 30 mins that we get together.

coral pilot
#

So I would err on the side of this being some deep networking problem rather than disk, if this is indeed the same problem manifesting elsewhere

#

I didn't catch it while it was happening so didn't get to run any of the trace tools, hopefully I can catch it on re-runs

#

(If there's more updates I'll make a separate thread so we don't overly pollute the release thread)

scarlet nacelle
#

One of the questions we had was whether this could be related to some of the commits we added lately. We think not, but we did the following to validate: start a pod with the same wolfi image variant (cgr.dev/chainguard/wolfi-base:latest) and run the command that is giving us the issue: apk add --no-cache git openssh pigz xz iptables ip6tables. Doing that shows the exact same failure we see on our pipelines. Doing the same thing but using a regular alpine image works every time. First screenshot is alpine, second screenshot is wolfi. Could there be some issue related to https://packages.wolfi.dev/os/x86_64/APKINDEX.tar.gz? I'm not sure. I'll run the tools suggested by @coral pilot while reproing the issue and see if something interesting pops up

coral pilot
#

Possible it's unrelated

scarlet nacelle
#

I'm not able to repro the slowness in alpine, at least when executing it outside of the dagger engine

#

Both screenshots above are containers running directly on the host

#

Were you able to use the bcc tools in a sysadmin pod? Getting failures at the moment

coral pilot
scarlet nacelle
#

In case it's useful: right now we have a host that we are keeping around for this investigation. It won't go away until we want it to. The host is 192-168-187-242. If you want to add containers on that host, the easiest way is to do something like this: kubectl debug -n dagger-runners -it dagger-od-engines-v0-12-5-engine-lkg6j --image=cgr.dev/chainguard/wolfi-base:latest --target=dagger-engine

coral pilot
#
mkdir /lib/modules
ln -s /host/lib/modules/5.10.223-211.872.amzn2.x86_64/ /lib/modules/5.10.223-211.872.amzn2.x86_64
ln -s /host/usr/src/kernels /usr/src/kernels

that should put the headers in the right place in the container and get those commands to work ^

coral pilot
# scarlet nacelle I'm not able to repro the slowness in alpine, at least when executing it outside...

To repro what I'm seeing right now, I just am pushing empty commits to a PR: https://github.com/dagger/dagger/pull/8203

Last two pushes:

  1. First push - Extreme slowness in installing alpine apks in one of the testdev workflows, followed by internal http2 stream errors https://github.com/dagger/dagger/actions/runs/10621505762/job/29443632581?pr=8203#step:3:1244
  2. Second push - Extreme slowness in installing alpine apks in testdev, but no failures ultimately https://github.com/dagger/dagger/actions/runs/10621995772/job/29445265710?pr=8203
#

It's inconsistent though, other testdev workflows are fine

scarlet nacelle
#

👍. I'm 100% getting the slowness on wolfi now

#

Yeah, it's working okay now

#

Trying to make it fail now with opensnoop on the side and it works every time 😆

coral pilot
#

This potentially related issue has some good suggestions on root causes (which could all be extremely ephemeral issues, and later comments suggest the problem just went away): https://github.com/moby/buildkit/issues/746#issuecomment-447311499

GitHub

We enabled buildkit in our project by adding DOCKER_BUILDKIT=1 to our builds in docker-ce, and for one build we consistently get this error with it enabled (and no errors when buildkit is disabled)...

#

None of those suggestions are actually that specific to docker afaict either, they could all happen outside docker

#

The path MTU thing mentioned in particular is something that could result in both slowness and/or bizarre errors
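One way to check for a path MTU problem is to ping with the "don't fragment" bit set and a payload sized for a given MTU. A sketch, assuming Linux iputils `ping` and IPv4; `probe_mtu` and `icmp_payload_for_mtu` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only, assuming Linux iputils ping and IPv4.
# An ICMP echo adds 28 bytes of headers (20 IPv4 + 8 ICMP) to its payload.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Send one packet with the "don't fragment" bit set. If a hop on the route has
# a smaller MTU, the ping fails or times out instead of being fragmented.
probe_mtu() {
  local host="$1" mtu="$2"
  ping -M do -c 1 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

For example, `probe_mtu packages.wolfi.dev 1500`, then lower values such as 1400 if that fails. Some hosts drop ICMP entirely, so a failure on its own is not conclusive.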

coral pilot
# scarlet nacelle Trying to make it fail now with opensnoop on the side and it works every time 😆

if you happen to get it to happen again, other things to try:

  • apt install -y traceroute; traceroute packages.wolfi.dev - could be helpful if this is some weird network path thing. may need to run a few times since you won't always get the same path
  • since you can repro sometimes with a one-off command, plain old strace might be easier than the bpf tools (those tools are mainly helpful when you don't even know what process is gonna break and just need to trace everything): strace -f --seccomp-bpf <command>
coral pilot
coral pilot
#

They seem to have dissipated now... I also just tried to re-run the publish job on main and wolfi built successfully: https://github.com/dagger/dagger/actions/runs/10619063519/job/29448999619

I really wouldn't be surprised if this was just a very ill-timed networking blip either with AWS or some other intermediate network that packets get commonly routed through, especially since it seemed to intermittently affect multiple endpoints besides wolfi. Path MTU issues in particular triggered memories of similar problems when I was actually working on this stuff at AWS, you can just randomly lose packets in a black hole if anywhere in the route has a misconfiguration 😵‍💫

coral pilot
little veldt
little veldt
#

hopefully the weird wolfi errors are gone now 🤞

#

i've merged the pr to build all the variants before pushing as well, so hopefully even if it is happening, we won't have half-published packages

#

so i'm gonna go ahead and tag 🎉

#

happy now 😄

little veldt
#

some slight issues in the sdk automated release notes - python accidentally had the elixir ones, php tried to publish them to the wrong repo (both easy to fix, just took a bit of manual intervention, will fix for next time)

#

engine + sdks successful, published docs as well now

#

cc @daring pier @scarlet nacelle @silent notch dagger playground can now be updated 🎉

#

cc @daring pier @scarlet nacelle @leaden hollow likewise with the daggerverse