weird new v0.12 hangs | Dagger | Page 1

glass depot Jul 17, 2024, 2:28 PM

#

Context mostly in https://github.com/dagger/dagger/issues/7954

GitHub

Jobs hang strangely after upgrading CI to v0.12.0 · Issue #7954 · d...

After applying #7904, we've seen jobs hang weirdly. See an example: https://github.com/dagger/dagger/actions/runs/9970573229/job/27549768418?pr=7936#step:3:89 https://dagger.cloud/dagger/traces...

#

and a wall of text in #1254757484057464975 message

#

tl;dr - jobs seem to get into hung states, often when they're fully completed - and have either errored/succeeded

#

cc @timber path @solar coyote

#

weird new v0.12 hangs

solar coyote Jul 17, 2024, 2:55 PM

#

@glass depot is there any situation where plain progress will wait client side for things to finish? Similar to draining?

glass depot Jul 17, 2024, 2:57 PM

#

i don't think so

#

yeah, nothing in the plain shutdown implies that to me

solar coyote Jul 17, 2024, 3:05 PM

#

if you catch it in the act, we can probably check engine logs for draining in argocd

glass depot Jul 17, 2024, 3:05 PM

#

good call

#

lemme start up a ton of them

solar coyote Jul 17, 2024, 3:06 PM

#

I'll add a Cloud UI + CLI flag to disable accounting, don't like having that as a confounding variable

glass depot Jul 17, 2024, 3:08 PM

#

okay, so the sdk jobs should show up here:

#

https://argocd.ci.dagger.cloud/applications/dagger-nvme-runners-v0-12-0-4c?resource=

#

which is where we see most of the failures

solar coyote Jul 17, 2024, 3:12 PM

#

waiting should be a good Ctrl+F

glass depot Jul 17, 2024, 3:16 PM

#

https://tenor.com/view/stick-figure-do-something-gif-14811013

Tenor

solar coyote Jul 17, 2024, 3:20 PM

#

ok, well, if I disable accounting the exec codegen failure just hops from withExec to directory, so that part is at least behaving properly - and for some reason directory just doesn't propagate that failure?

glass depot Jul 17, 2024, 3:20 PM

#

hm weird that i can't make it happen by just re-running it

solar coyote Jul 17, 2024, 3:20 PM

#

solar coyote ok, well, if I disable accounting the `exec codegen` failure just hops from `wit...

maybe there's a if errors.Is(err, context.Canceled) in there somewhere...?

#

(also why was the context canceled?!)

glass depot Jul 17, 2024, 3:21 PM

#

🤔 who's sending the cancelled

#

yeah exactly

#

well

#

here's one thought

#

what if it's another job

solar coyote Jul 17, 2024, 3:21 PM

#

yeah lol

#

edge merging?

#

😱

glass depot Jul 17, 2024, 3:21 PM

#

mayyyybe, this would be flightcontrol

#

it's technically correct to edge merge here, what would be weird would be the cancellation of one job affecting this like this

solar coyote Jul 17, 2024, 3:23 PM

#

i recall @timber path pointing out that DagQL had that unfortunate behavior (sharing canceled errors) with its caching implementation, but that flightcontrol was meant to handle that sort of thing

#

(gonna go check if dagql fixed that)

#

(nope, looks like it'd still happen, but since this is an exec foo span this should be in Buildkit, not DagQL)

glass depot Jul 17, 2024, 3:27 PM

#

you know how we changed the services otel thing? could we be ending up with a weird thing like a service being attached in a weird way?

#

other note - i wonder if we should be getting context.Cause for that error

solar coyote Jul 17, 2024, 3:30 PM

#

glass depot you know how we changed the services otel thing? could we be ending up with a we...

what services otel thing?

#

also, the otel tcp proxy listen: might be interesting there, I wonder if this error is happening during cleanup

glass depot Jul 17, 2024, 3:30 PM

#

https://github.com/dagger/dagger/pull/7735

#

that was for v0.12.0

#

i was gonna suggest it could be a red herring, and might have just been that run

#

but i found it again in https://github.com/dagger/dagger/actions/runs/9975355590/job/27564888280

#

tbf, i don't see it in https://github.com/dagger/dagger/actions/runs/9975355581/job/27564888642

solar coyote Jul 17, 2024, 3:33 PM

#

hm, what are you finding in there that points to it?

glass depot Jul 17, 2024, 3:33 PM

#

"process "codegen --output /src --module-context-path /src/bass --module-name bass-sdk --introspection-json-path /schema.json" did not complete successfully: otel tcp proxy listen: context canceled: context canceled"

#

line 209

#

but it does look like those two jobs shared the same dagger runner

solar coyote Jul 17, 2024, 3:34 PM

#

glass depot "process "codegen --output /src --module-context-path /src/bass --module-name ba...

maybe i'm pre-coffee but I don't see the relation to services, or do you just mean that set of changes in general

glass depot Jul 17, 2024, 3:34 PM

#

so potentially there's some multiple client concurrency happening

glass depot Jul 17, 2024, 3:34 PM

#

solar coyote maybe i'm pre-coffee but I don't see the relation to services, or do you just me...

oh sorry, nope, i misunderstood too

solar coyote Jul 17, 2024, 3:34 PM

#

i thought runners only ran one job at a time? (might be wrong)

glass depot Jul 17, 2024, 3:35 PM

#

each runner does, but each runner is on a massive k8s worker, and runners on the same worker share one big dagger instance

solar coyote Jul 17, 2024, 3:35 PM

#

oic

glass depot Jul 17, 2024, 3:35 PM

#

i think you can tell it's the same worker by the 6ztrj

#

in the runner name

glass depot Jul 17, 2024, 3:36 PM

#

solar coyote maybe i'm pre-coffee but I don't see the relation to services, or do you just me...

tbf, i was just going through every commit in v0.12.0 and looking at things that had an impact on otel/etc

#

but also these jobs are noticable in that they all spin up an internal dagger engine to talk to and do their codegen for

solar coyote Jul 17, 2024, 3:36 PM

#

there's possibly relevant changes in there for sure, just not seeing anything yet notsureif

#

oh wait interesting

#

that otel tcp proxy listen: error is during container setup, not cleanup. it never even ran the codegen

glass depot Jul 17, 2024, 3:38 PM

#

🤔 how is it able to do anything lol

solar coyote Jul 17, 2024, 3:38 PM

#

so 1) why was it canceled, 2) why did that cancellation not result in an error for the caller?

#

https://github.com/dagger/dagger/blob/76859d7210715fcc1fbf9a2d805f2aa0b58def60/engine/buildkit/executor_spec.go#L652

#

should see executor setup failed: in the logs for that too

#

@frank bison is there a way to search historical logs for CI runners/engines? Maybe that's in Grafana instead of ArgoCD?

#

whenever I try to find that I end up on https://grafana.ci.dagger.cloud/d/logs/loki-logs?orgId=1&refresh=10s which is empty

solar coyote Jul 17, 2024, 3:46 PM

#

glass depot but i found it again in https://github.com/dagger/dagger/actions/runs/9975355590...

think i found this one in argocd: https://argocd.ci.dagger.cloud/applications/argocd/dagger-nvme-engines-v0-12-0?rollback=-1&view=list&resource=name%3Adagger-nvme-engines-v0-12-0-engine-w47xp&node=%2FPod%2Fdagger-runners%2Fdagger-nvme-engines-v0-12-0-engine-w47xp%2F0&tab=logs

#

you can use the NAME filter in the left sidebar, and then change to the last view here

#

oh, but these logs are truncated

glass depot Jul 17, 2024, 3:47 PM

#

mmmm and no executor setup failed: 😢

#

the pr in https://github.com/dagger/dagger/pull/7960 did it!

solar coyote Jul 17, 2024, 3:58 PM

#

#

poking around in grafana with a one-off dashboard

glass depot Jul 17, 2024, 3:58 PM

#

huh

solar coyote Jul 17, 2024, 3:59 PM

#

if you give me a runner name I can filter to that

glass depot Jul 17, 2024, 3:59 PM

#

dagger-v0-12-0-4c-nvme-6ztrj-runner-lfgzs

#

oooh it's actually currently running a job if that's interesting to you

#

dagger-v0-12-0-16c-nvme-v7r9z-runner-s8j88 is ongoing

glass depot Jul 17, 2024, 4:03 PM

#

glass depot other note - i wonder if we should be getting `context.Cause` for that error

hm, nope, we're already doing this

#

apparently context.Cancelled is the best it gets

#

hm, i'm also still super confused as to why we never see the end of the initialize span in the logs

#

potentially there's a bug somewhere, where we're erroring on the span, but not propagating the error, and instead never ending the span

#

which could explain some of what we're seeing i guess

solar coyote Jul 17, 2024, 4:12 PM

#

but then you'd see the span unfinished in Cloud

glass depot Jul 17, 2024, 4:12 PM

#

hmmm you probably would yes

solar coyote Jul 17, 2024, 4:14 PM

#

can't find any logs for lfgzs in grafana but it's possible i'm messing up the filter somehow

#

this grafana query builder UI is insanely janky

#

flickers and disregards input constantly

#

i just want a giga grep 🙃

#

also timing out a lot, logs are hard tbf

glass depot Jul 17, 2024, 4:18 PM

#

solar coyote but then you'd see the span unfinished in Cloud

the logic that considers a child span terminated if a parent one terminated... could that be relevant here (does cloud have it's own version of this)

solar coyote Jul 17, 2024, 4:19 PM

#

unless you mean something else, that logic was only for draining, so we stop waiting on children if the parent completed

solar coyote Jul 17, 2024, 4:21 PM

#

glass depot `dagger-v0-12-0-16c-nvme-v7r9z-runner-s8j88` is ongoing

can't find this in argocd at all 🤔

glass depot Jul 17, 2024, 4:21 PM

#

ah, okay, aha

glass depot Jul 17, 2024, 4:21 PM

#

solar coyote can't find this in argocd at all 🤔

it might have terminated

solar coyote Jul 17, 2024, 7:13 PM

#

just adding to the crazy pile: https://github.com/dagger/dagger/actions/runs/9975355581/job/27564889749?pr=7955#step:3:128

20  :     exec codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json
20  :     [0.1s] | generating go module: docusaurus
20  :     [0.1s] | runtime: failed to create new OS thread (have 10 already; errno=12)
20  :     [0.1s] | fatal error: newosproc
20  :     exec codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json ERROR [0.1s]
20  :     ! process "codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json" did not complete successfully: exit code: 137: context canceled: context canceled

wat

#

two weird errors in this run, yet it succeeded:

https://github.com/dagger/dagger/actions/runs/9977285205/job/27581104432?pr=7958#step:3:103 => same otel tcp proxy listen: context canceled: context canceled issue from above convo

but this one's new: https://github.com/dagger/dagger/actions/runs/9977285205/job/27581104432?pr=7958#step:3:124

21  :     Container.withExec(args: ["go", "build", "-o", "/runtime", "."]): Container!
22  :     exec go build -o /runtime .
22  :     exec go build -o /runtime . ERROR [0.0s]
22  :     ! failed to commit z08zs7919phieyyjjocun7dir to dwnewoi79q73dppxn6ynhdlec during finalize: context canceled: context canceled

#

I still wonder if this is spot instances related somehow... this same PR had a bunch of its jobs canceled, after I just re-ran them for being canceled before. I have no idea where else a context canceled would be coming from

solar coyote Jul 17, 2024, 8:47 PM

#

I put up this goofy PR just to have a stable spot for trying to repro this: https://github.com/dagger/dagger/pull/7965 - nothing so far, only a few runs. (Notably, we're not using spot instances anymore.)

solar coyote Jul 17, 2024, 9:38 PM

#

Think I caught one! Thanks a ton for the engine logs bucket @frank bison @rugged pilot - it helped immediately. 🙏 The "Last Modified" times on each file in the bucket actually helped a lot, in lieu of clearer filenames.

From the engine logs for https://github.com/dagger/dagger/actions/runs/9981275585/job/27584543245?pr=7965, I can see this:

time="2024-07-17T21:00:03Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=2 client=8r5ggzwpletc6t94ncasj6514 ctxErr="context canceled" draining=true immediate=false spanIDs="[abc2036272d57dee e08c7085d4daf182]" spanNames="[Function.withDescription TypeDef.withFunction]"
time="2024-07-17T21:00:03Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=2 client=8r5ggzwpletc6t94ncasj6514 ctxErr="context canceled" draining=true immediate=false spanIDs="[abc2036272d57dee e08c7085d4daf182]" spanNames="[Function.withDescription TypeDef.withFunction]"
time="2024-07-17T21:00:04Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=1 client=5ynv7sx3qabs6tui47hmyqapa ctxErr="context canceled" draining=true immediate=false spanIDs="[36dd05401b560b9d]" spanNames="[TypeDef.withFunction]"
time="2024-07-17T21:00:04Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=1 client=5ynv7sx3qabs6tui47hmyqapa ctxErr="context canceled" draining=true immediate=false spanIDs="[36dd05401b560b9d]" spanNames="[TypeDef.withFunction]"

This was pretty easy to correlate by downloading the raw GHA logs and comparing timestamps. Would be nice if our logs had the trace ID so I could be 100% sure, but considering it's ctxErr="context canceled" it's at least a case where it's interrupted, and the interesting thing is which spans are left behind. I can keep digging from here.

#

Can confirm those spans didn't make it to Cloud, either. Now to figure out why

https://dagger.cloud/dagger/traces/5f6ba78ea08e7c46a0e4bc14149683fb#36dd05401b560b9d

#

Those first two done waiting logs are from a different trace in the same PR: https://dagger.cloud/dagger/traces/7125da6b26fa42452efbbaa699a64d0f#abc2036272d57dee

glass depot Jul 17, 2024, 11:10 PM

#

So just to check - are the errors we saw about the context cancelled otel socket, related to otel draining and "fixed" by the timeout issue? Or is that somewhat separate, and not at all related?

#

If we can even answer that at this point

solar coyote Jul 17, 2024, 11:18 PM

#

Those are still a mystery to me but could be related to cancellation in general, either from spot instance interruptions or hanging -> GHA timeout

frank bison Jul 17, 2024, 11:19 PM

#

solar coyote Think I caught one! Thanks a ton for the engine logs bucket <@628392087880073217...

Made the change to use dates as filename instead and things are not working correctly. There is a very strange panic happening the vector's codebase. I'll roll it back and look into it more tomorrow, you might have lost a few logs in the process!

solar coyote Jul 17, 2024, 11:19 PM

#

One thing I'm not sure of is how aggressively spot instances are reclaimed - will we even see context canceled?

frank bison Jul 17, 2024, 11:22 PM

#

They typically do a graceful termination similar to doing Stop / Terminate instance. But I have seen cases where everything just dies immediately

solar coyote Jul 17, 2024, 11:22 PM

#

@frank bison also is there an easy way to provision runners that use the main engine image? Would be nice to confirm the 10s timeout fix and stabilize CI prior to release

solar coyote Jul 17, 2024, 11:23 PM

#

frank bison They typically do a graceful termination similar to doing Stop / Terminate insta...

Gotcha so it depends on what the engine does during shutdown, and how merciful aws is feeling

frank bison Jul 17, 2024, 11:23 PM

#

For sure! You can target them with the main version

#

In the new generation we also have a main-wolfi variant if you need that specific base image

#

solar coyote Jul 17, 2024, 11:45 PM

#

@frank bison No luck provisioning, but not a huge deal if it's after hours for you: https://github.com/dagger/dagger/actions/runs/9982896887/job/27589476052?pr=7965
Here's my change: https://github.com/dagger/dagger/pull/7965/commits/aae70480bce9d9e0bdd319bee367966a6c37a3c8

frank bison Jul 18, 2024, 12:19 PM

#

@solar coyote sorry for the inconvenience. I thought we had merged a changed to dagger's install.sh script that we have not. This change is a requirement for main based runners to be correctly provisioned. I'm disabling them until we can merge that, should be later today. Will let you know as soon as they are ready!

frank bison Jul 18, 2024, 12:48 PM

#

It would be straigthforward to add runners that target a specific DAGGER_COMMIT from main. So if that helps you debug the hangs let me know and we can set it up

solar coyote Jul 18, 2024, 2:06 PM

#

frank bison It would be straigthforward to add runners that target a specific DAGGER_COMMIT ...

Sounds good - how about 16c1ccf0c2abf055675d616ebe7aec0588b0016e?

frank bison Jul 18, 2024, 2:18 PM

#

done

solar coyote Jul 18, 2024, 2:45 PM

#

frank bison done

Trying it out - nothing yet but maybe still provisioning: https://github.com/dagger/dagger/actions/runs/9993540740/job/27621131295?pr=7965

GitHub

DNM: PR for repro'ing v0.12 hangs · dagger/dagger@1f22588

An engine to run your pipelines in containers. Contribute to dagger/dagger development by creating an account on GitHub.

frank bison Jul 18, 2024, 2:51 PM

#

Just checked. I was missing a nodepool. Just added it, they should be starting

solar coyote Jul 18, 2024, 2:54 PM

#

lol, dang, well that worked but now we're failing on compatibility checks. @glass depot any quick way to bypass this? https://github.com/dagger/dagger/actions/runs/9993540740/job/27621131295?pr=7965

#

oh that check is just in the action .yml, I can rip it out

glass depot Jul 18, 2024, 3:06 PM

#

yeaa

solar coyote Jul 18, 2024, 3:06 PM

#

https://tenor.com/based.gif

Tenor

#

hmm now it seems to get stuck after that check: https://github.com/dagger/dagger/actions/runs/9993959678/job/27622618389?pr=7965 - digging...

#

though, it looks like something is running: https://dagger.cloud/dagger/ci?branch=repro-0.12&ref=36b194cd7cc327fd6d081be94cb73fa90fbb3013 - but only testdev, which is probably a significant difference since it uses its own dev engine

#

ah, it's stuck on echo "{version}" | dagger --silent query thinkspin

rugged pilot Jul 23, 2024, 12:04 PM

#

frank bison <@108011715077091328> sorry for the inconvenience. I thought we had merged a cha...

Finishing this is my first priority for the day, so that you can target main Engines. Will ping you both once the PR is open. I have most of the changes locally, need to wrap them up and open the PR. cc @solar coyote

rugged pilot Jul 23, 2024, 12:32 PM

#

https://github.com/dagger/dagger/pull/8012

rugged pilot Jul 23, 2024, 2:36 PM

#

👆 this is now ready to review. Would love to get it merged today so that our pipelines can start using runs-on: dagger-g2-main-16c cc @solar coyote @frank bison

solar coyote Jul 23, 2024, 2:38 PM

#

rugged pilot 👆 this is now ready to review. Would love to get it merged today so that our pi...

✅

frank bison Jul 24, 2024, 7:42 PM

#

solar coyote Sounds good - how about `16c1ccf0c2abf055675d616ebe7aec0588b0016e`?

@solar coyote Now that the fix of install.sh has been merged, are you okay if I remove the custom-commit runner and add back main and main-wolfi?

#weird new v0.12 hangs