#weird new v0.12 hangs

1 messages ยท Page 1 of 1 (latest)

glass depot
#

tl;dr - jobs seem to get into hung states, often when they're fully completed - and have either errored/succeeded

#

cc @timber path @solar coyote

#

weird new v0.12 hangs

solar coyote
#

@glass depot is there any situation where plain progress will wait client side for things to finish? Similar to draining?

glass depot
#

i don't think so

#

yeah, nothing in the plain shutdown implies that to me

solar coyote
#

if you catch it in the act, we can probably check engine logs for draining in argocd

glass depot
#

good call

#

lemme start up a ton of them

solar coyote
#

I'll add a Cloud UI + CLI flag to disable accounting, don't like having that as a confounding variable

glass depot
#

okay, so the sdk jobs should show up here:

#

which is where we see most of the failures

solar coyote
#

waiting should be a good Ctrl+F

solar coyote
#

ok, well, if I disable accounting the exec codegen failure just hops from withExec to directory, so that part is at least behaving properly - and for some reason directory just doesn't propagate that failure?

glass depot
#

hm weird that i can't make it happen by just re-running it

solar coyote
#

(also why was the context canceled?!)

glass depot
#

๐Ÿค” who's sending the cancelled

#

yeah exactly

#

well

#

here's one thought

#

what if it's another job

solar coyote
#

yeah lol

#

edge merging?

#

๐Ÿ˜ฑ

glass depot
#

mayyyybe, this would be flightcontrol

#

it's technically correct to edge merge here, what would be weird would be the cancellation of one job affecting this like this

solar coyote
#

i recall @timber path pointing out that DagQL had that unfortunate behavior (sharing canceled errors) with its caching implementation, but that flightcontrol was meant to handle that sort of thing

#

(gonna go check if dagql fixed that)

#

(nope, looks like it'd still happen, but since this is an exec foo span this should be in Buildkit, not DagQL)

glass depot
#

you know how we changed the services otel thing? could we be ending up with a weird thing like a service being attached in a weird way?

#

other note - i wonder if we should be getting context.Cause for that error

solar coyote
#

also, the otel tcp proxy listen: might be interesting there, I wonder if this error is happening during cleanup

glass depot
#

that was for v0.12.0

#

i was gonna suggest it could be a red herring, and might have just been that run

solar coyote
#

hm, what are you finding in there that points to it?

glass depot
#

"process "codegen --output /src --module-context-path /src/bass --module-name bass-sdk --introspection-json-path /schema.json" did not complete successfully: otel tcp proxy listen: context canceled: context canceled"

#

line 209

#

but it does look like those two jobs shared the same dagger runner

solar coyote
glass depot
#

so potentially there's some multiple client concurrency happening

glass depot
solar coyote
#

i thought runners only ran one job at a time? (might be wrong)

glass depot
#

each runner does, but each runner is on a massive k8s worker, and runners on the same worker share one big dagger instance

solar coyote
#

oic

glass depot
#

i think you can tell it's the same worker by the 6ztrj

#

in the runner name

glass depot
#

but also these jobs are noticable in that they all spin up an internal dagger engine to talk to and do their codegen for

solar coyote
#

there's possibly relevant changes in there for sure, just not seeing anything yet notsureif

#

oh wait interesting

#

that otel tcp proxy listen: error is during container setup, not cleanup. it never even ran the codegen

glass depot
#

๐Ÿค” how is it able to do anything lol

solar coyote
#

so 1) why was it canceled, 2) why did that cancellation not result in an error for the caller?

#

should see executor setup failed: in the logs for that too

#

@frank bison is there a way to search historical logs for CI runners/engines? Maybe that's in Grafana instead of ArgoCD?

glass depot
#

mmmm and no executor setup failed: ๐Ÿ˜ข

solar coyote
#

poking around in grafana with a one-off dashboard

glass depot
#

huh

solar coyote
#

if you give me a runner name I can filter to that

glass depot
#

dagger-v0-12-0-4c-nvme-6ztrj-runner-lfgzs

#

oooh it's actually currently running a job if that's interesting to you

#

dagger-v0-12-0-16c-nvme-v7r9z-runner-s8j88 is ongoing

glass depot
#

apparently context.Cancelled is the best it gets

#

hm, i'm also still super confused as to why we never see the end of the initialize span in the logs

#

potentially there's a bug somewhere, where we're erroring on the span, but not propagating the error, and instead never ending the span

#

which could explain some of what we're seeing i guess

solar coyote
#

but then you'd see the span unfinished in Cloud

glass depot
#

hmmm you probably would yes

solar coyote
#

can't find any logs for lfgzs in grafana but it's possible i'm messing up the filter somehow

#

this grafana query builder UI is insanely janky

#

flickers and disregards input constantly

#

i just want a giga grep ๐Ÿ™ƒ

#

also timing out a lot, logs are hard tbf

glass depot
solar coyote
#

unless you mean something else, that logic was only for draining, so we stop waiting on children if the parent completed

solar coyote
glass depot
#

ah, okay, aha

glass depot
solar coyote
#

just adding to the crazy pile: https://github.com/dagger/dagger/actions/runs/9975355581/job/27564889749?pr=7955#step:3:128

20  :     exec codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json
20  :     [0.1s] | generating go module: docusaurus
20  :     [0.1s] | runtime: failed to create new OS thread (have 10 already; errno=12)
20  :     [0.1s] | fatal error: newosproc
20  :     exec codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json ERROR [0.1s]
20  :     ! process "codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json" did not complete successfully: exit code: 137: context canceled: context canceled

wat

#

two weird errors in this run, yet it succeeded:

but this one's new: https://github.com/dagger/dagger/actions/runs/9977285205/job/27581104432?pr=7958#step:3:124

21  :     Container.withExec(args: ["go", "build", "-o", "/runtime", "."]): Container!
22  :     exec go build -o /runtime .
22  :     exec go build -o /runtime . ERROR [0.0s]
22  :     ! failed to commit z08zs7919phieyyjjocun7dir to dwnewoi79q73dppxn6ynhdlec during finalize: context canceled: context canceled
#

I still wonder if this is spot instances related somehow... this same PR had a bunch of its jobs canceled, after I just re-ran them for being canceled before. I have no idea where else a context canceled would be coming from

solar coyote
solar coyote
#

Think I caught one! Thanks a ton for the engine logs bucket @frank bison @rugged pilot - it helped immediately. ๐Ÿ™ The "Last Modified" times on each file in the bucket actually helped a lot, in lieu of clearer filenames.

From the engine logs for https://github.com/dagger/dagger/actions/runs/9981275585/job/27584543245?pr=7965, I can see this:

time="2024-07-17T21:00:03Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=2 client=8r5ggzwpletc6t94ncasj6514 ctxErr="context canceled" draining=true immediate=false spanIDs="[abc2036272d57dee e08c7085d4daf182]" spanNames="[Function.withDescription TypeDef.withFunction]"
time="2024-07-17T21:00:03Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=2 client=8r5ggzwpletc6t94ncasj6514 ctxErr="context canceled" draining=true immediate=false spanIDs="[abc2036272d57dee e08c7085d4daf182]" spanNames="[Function.withDescription TypeDef.withFunction]"
time="2024-07-17T21:00:04Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=1 client=5ynv7sx3qabs6tui47hmyqapa ctxErr="context canceled" draining=true immediate=false spanIDs="[36dd05401b560b9d]" spanNames="[TypeDef.withFunction]"
time="2024-07-17T21:00:04Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=1 client=5ynv7sx3qabs6tui47hmyqapa ctxErr="context canceled" draining=true immediate=false spanIDs="[36dd05401b560b9d]" spanNames="[TypeDef.withFunction]"

This was pretty easy to correlate by downloading the raw GHA logs and comparing timestamps. Would be nice if our logs had the trace ID so I could be 100% sure, but considering it's ctxErr="context canceled" it's at least a case where it's interrupted, and the interesting thing is which spans are left behind. I can keep digging from here.

glass depot
#

So just to check - are the errors we saw about the context cancelled otel socket, related to otel draining and "fixed" by the timeout issue? Or is that somewhat separate, and not at all related?

#

If we can even answer that at this point

solar coyote
#

Those are still a mystery to me but could be related to cancellation in general, either from spot instance interruptions or hanging -> GHA timeout

frank bison
solar coyote
#

One thing I'm not sure of is how aggressively spot instances are reclaimed - will we even see context canceled?

frank bison
#

They typically do a graceful termination similar to doing Stop / Terminate instance. But I have seen cases where everything just dies immediately

solar coyote
#

@frank bison also is there an easy way to provision runners that use the main engine image? Would be nice to confirm the 10s timeout fix and stabilize CI prior to release

solar coyote
frank bison
#

For sure! You can target them with the main version

#

In the new generation we also have a main-wolfi variant if you need that specific base image

solar coyote
frank bison
#

@solar coyote sorry for the inconvenience. I thought we had merged a changed to dagger's install.sh script that we have not. This change is a requirement for main based runners to be correctly provisioned. I'm disabling them until we can merge that, should be later today. Will let you know as soon as they are ready!

frank bison
#

It would be straigthforward to add runners that target a specific DAGGER_COMMIT from main. So if that helps you debug the hangs let me know and we can set it up

solar coyote
frank bison
solar coyote
frank bison
#

Just checked. I was missing a nodepool. Just added it, they should be starting

solar coyote
#

oh that check is just in the action .yml, I can rip it out

glass depot
#

yeaa

solar coyote
#

ah, it's stuck on echo "{version}" | dagger --silent query thinkspin

rugged pilot
rugged pilot
rugged pilot
#

๐Ÿ‘† this is now ready to review. Would love to get it merged today so that our pipelines can start using runs-on: dagger-g2-main-16c cc @solar coyote @frank bison

frank bison