#weird new v0.12 hangs
1 messages ยท Page 1 of 1 (latest)
Context mostly in https://github.com/dagger/dagger/issues/7954
After applying #7904, we've seen jobs hang weirdly. See an example: https://github.com/dagger/dagger/actions/runs/9970573229/job/27549768418?pr=7936#step:3:89 https://dagger.cloud/dagger/traces...
and a wall of text in #1254757484057464975 message
tl;dr - jobs seem to get into hung states, often when they're fully completed - and have either errored/succeeded
cc @timber path @solar coyote
weird new v0.12 hangs
@glass depot is there any situation where plain progress will wait client side for things to finish? Similar to draining?
if you catch it in the act, we can probably check engine logs for draining in argocd
I'll add a Cloud UI + CLI flag to disable accounting, don't like having that as a confounding variable
okay, so the sdk jobs should show up here:
which is where we see most of the failures
waiting should be a good Ctrl+F
ok, well, if I disable accounting the exec codegen failure just hops from withExec to directory, so that part is at least behaving properly - and for some reason directory just doesn't propagate that failure?
hm weird that i can't make it happen by just re-running it
maybe there's a if errors.Is(err, context.Canceled) in there somewhere...?
(also why was the context canceled?!)
๐ค who's sending the cancelled
yeah exactly
well
here's one thought
what if it's another job
mayyyybe, this would be flightcontrol
it's technically correct to edge merge here, what would be weird would be the cancellation of one job affecting this like this
i recall @timber path pointing out that DagQL had that unfortunate behavior (sharing canceled errors) with its caching implementation, but that flightcontrol was meant to handle that sort of thing
(gonna go check if dagql fixed that)
(nope, looks like it'd still happen, but since this is an exec foo span this should be in Buildkit, not DagQL)
you know how we changed the services otel thing? could we be ending up with a weird thing like a service being attached in a weird way?
other note - i wonder if we should be getting context.Cause for that error
what services otel thing?
also, the otel tcp proxy listen: might be interesting there, I wonder if this error is happening during cleanup
that was for v0.12.0
i was gonna suggest it could be a red herring, and might have just been that run
but i found it again in https://github.com/dagger/dagger/actions/runs/9975355590/job/27564888280
tbf, i don't see it in https://github.com/dagger/dagger/actions/runs/9975355581/job/27564888642
hm, what are you finding in there that points to it?
"process "codegen --output /src --module-context-path /src/bass --module-name bass-sdk --introspection-json-path /schema.json" did not complete successfully: otel tcp proxy listen: context canceled: context canceled"
line 209
but it does look like those two jobs shared the same dagger runner
maybe i'm pre-coffee but I don't see the relation to services, or do you just mean that set of changes in general
so potentially there's some multiple client concurrency happening
oh sorry, nope, i misunderstood too
i thought runners only ran one job at a time? (might be wrong)
each runner does, but each runner is on a massive k8s worker, and runners on the same worker share one big dagger instance
oic
tbf, i was just going through every commit in v0.12.0 and looking at things that had an impact on otel/etc
but also these jobs are noticable in that they all spin up an internal dagger engine to talk to and do their codegen for
there's possibly relevant changes in there for sure, just not seeing anything yet 
oh wait interesting
that otel tcp proxy listen: error is during container setup, not cleanup. it never even ran the codegen
๐ค how is it able to do anything lol
so 1) why was it canceled, 2) why did that cancellation not result in an error for the caller?
should see executor setup failed: in the logs for that too
@frank bison is there a way to search historical logs for CI runners/engines? Maybe that's in Grafana instead of ArgoCD?
whenever I try to find that I end up on https://grafana.ci.dagger.cloud/d/logs/loki-logs?orgId=1&refresh=10s which is empty
you can use the NAME filter in the left sidebar, and then change to the last view here
oh, but these logs are truncated
mmmm and no executor setup failed: ๐ข
the pr in https://github.com/dagger/dagger/pull/7960 did it!
huh
if you give me a runner name I can filter to that
dagger-v0-12-0-4c-nvme-6ztrj-runner-lfgzs
oooh it's actually currently running a job if that's interesting to you
dagger-v0-12-0-16c-nvme-v7r9z-runner-s8j88 is ongoing
hm, nope, we're already doing this
apparently context.Cancelled is the best it gets
hm, i'm also still super confused as to why we never see the end of the initialize span in the logs
potentially there's a bug somewhere, where we're erroring on the span, but not propagating the error, and instead never ending the span
which could explain some of what we're seeing i guess
but then you'd see the span unfinished in Cloud
hmmm you probably would yes
can't find any logs for lfgzs in grafana but it's possible i'm messing up the filter somehow
this grafana query builder UI is insanely janky
flickers and disregards input constantly
i just want a giga grep ๐
also timing out a lot, logs are hard tbf
the logic that considers a child span terminated if a parent one terminated... could that be relevant here (does cloud have it's own version of this)
unless you mean something else, that logic was only for draining, so we stop waiting on children if the parent completed
can't find this in argocd at all ๐ค
ah, okay, aha
it might have terminated
just adding to the crazy pile: https://github.com/dagger/dagger/actions/runs/9975355581/job/27564889749?pr=7955#step:3:128
20 : exec codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json
20 : [0.1s] | generating go module: docusaurus
20 : [0.1s] | runtime: failed to create new OS thread (have 10 already; errno=12)
20 : [0.1s] | fatal error: newosproc
20 : exec codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json ERROR [0.1s]
20 : ! process "codegen --output /src --module-context-path /src/docusaurus/dagger --module-name docusaurus --introspection-json-path /schema.json" did not complete successfully: exit code: 137: context canceled: context canceled

two weird errors in this run, yet it succeeded:
- https://github.com/dagger/dagger/actions/runs/9977285205/job/27581104432?pr=7958#step:3:103 => same
otel tcp proxy listen: context canceled: context canceledissue from above convo
but this one's new: https://github.com/dagger/dagger/actions/runs/9977285205/job/27581104432?pr=7958#step:3:124
21 : Container.withExec(args: ["go", "build", "-o", "/runtime", "."]): Container!
22 : exec go build -o /runtime .
22 : exec go build -o /runtime . ERROR [0.0s]
22 : ! failed to commit z08zs7919phieyyjjocun7dir to dwnewoi79q73dppxn6ynhdlec during finalize: context canceled: context canceled
I still wonder if this is spot instances related somehow... this same PR had a bunch of its jobs canceled, after I just re-ran them for being canceled before. I have no idea where else a context canceled would be coming from
I put up this goofy PR just to have a stable spot for trying to repro this: https://github.com/dagger/dagger/pull/7965 - nothing so far, only a few runs. (Notably, we're not using spot instances anymore.)
Think I caught one! Thanks a ton for the engine logs bucket @frank bison @rugged pilot - it helped immediately. ๐ The "Last Modified" times on each file in the bucket actually helped a lot, in lieu of clearer filenames.
From the engine logs for https://github.com/dagger/dagger/actions/runs/9981275585/job/27584543245?pr=7965, I can see this:
time="2024-07-17T21:00:03Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=2 client=8r5ggzwpletc6t94ncasj6514 ctxErr="context canceled" draining=true immediate=false spanIDs="[abc2036272d57dee e08c7085d4daf182]" spanNames="[Function.withDescription TypeDef.withFunction]"
time="2024-07-17T21:00:03Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=2 client=8r5ggzwpletc6t94ncasj6514 ctxErr="context canceled" draining=true immediate=false spanIDs="[abc2036272d57dee e08c7085d4daf182]" spanNames="[Function.withDescription TypeDef.withFunction]"
time="2024-07-17T21:00:04Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=1 client=5ynv7sx3qabs6tui47hmyqapa ctxErr="context canceled" draining=true immediate=false spanIDs="[36dd05401b560b9d]" spanNames="[TypeDef.withFunction]"
time="2024-07-17T21:00:04Z" level=debug msg="done waiting" activeLogs="map[]" activeSpans=1 client=5ynv7sx3qabs6tui47hmyqapa ctxErr="context canceled" draining=true immediate=false spanIDs="[36dd05401b560b9d]" spanNames="[TypeDef.withFunction]"
This was pretty easy to correlate by downloading the raw GHA logs and comparing timestamps. Would be nice if our logs had the trace ID so I could be 100% sure, but considering it's ctxErr="context canceled" it's at least a case where it's interrupted, and the interesting thing is which spans are left behind. I can keep digging from here.
Can confirm those spans didn't make it to Cloud, either. Now to figure out why
Those first two done waiting logs are from a different trace in the same PR: https://dagger.cloud/dagger/traces/7125da6b26fa42452efbbaa699a64d0f#abc2036272d57dee
So just to check - are the errors we saw about the context cancelled otel socket, related to otel draining and "fixed" by the timeout issue? Or is that somewhat separate, and not at all related?
If we can even answer that at this point
Those are still a mystery to me but could be related to cancellation in general, either from spot instance interruptions or hanging -> GHA timeout
Made the change to use dates as filename instead and things are not working correctly. There is a very strange panic happening the vector's codebase. I'll roll it back and look into it more tomorrow, you might have lost a few logs in the process!
One thing I'm not sure of is how aggressively spot instances are reclaimed - will we even see context canceled?
They typically do a graceful termination similar to doing Stop / Terminate instance. But I have seen cases where everything just dies immediately
@frank bison also is there an easy way to provision runners that use the main engine image? Would be nice to confirm the 10s timeout fix and stabilize CI prior to release
Gotcha so it depends on what the engine does during shutdown, and how merciful aws is feeling
For sure! You can target them with the main version
In the new generation we also have a main-wolfi variant if you need that specific base image
@frank bison No luck provisioning, but not a huge deal if it's after hours for you: https://github.com/dagger/dagger/actions/runs/9982896887/job/27589476052?pr=7965
Here's my change: https://github.com/dagger/dagger/pull/7965/commits/aae70480bce9d9e0bdd319bee367966a6c37a3c8
@solar coyote sorry for the inconvenience. I thought we had merged a changed to dagger's install.sh script that we have not. This change is a requirement for main based runners to be correctly provisioned. I'm disabling them until we can merge that, should be later today. Will let you know as soon as they are ready!
It would be straigthforward to add runners that target a specific DAGGER_COMMIT from main. So if that helps you debug the hangs let me know and we can set it up
Sounds good - how about 16c1ccf0c2abf055675d616ebe7aec0588b0016e?
done
Trying it out - nothing yet but maybe still provisioning: https://github.com/dagger/dagger/actions/runs/9993540740/job/27621131295?pr=7965
Just checked. I was missing a nodepool. Just added it, they should be starting
lol, dang, well that worked but now we're failing on compatibility checks. @glass depot any quick way to bypass this? https://github.com/dagger/dagger/actions/runs/9993540740/job/27621131295?pr=7965
oh that check is just in the action .yml, I can rip it out
yeaa
hmm now it seems to get stuck after that check: https://github.com/dagger/dagger/actions/runs/9993959678/job/27622618389?pr=7965 - digging...
though, it looks like something is running: https://dagger.cloud/dagger/ci?branch=repro-0.12&ref=36b194cd7cc327fd6d081be94cb73fa90fbb3013 - but only testdev, which is probably a significant difference since it uses its own dev engine
ah, it's stuck on echo "{version}" | dagger --silent query 
Finishing this is my first priority for the day, so that you can target main Engines. Will ping you both once the PR is open. I have most of the changes locally, need to wrap them up and open the PR. cc @solar coyote
๐ this is now ready to review. Would love to get it merged today so that our pipelines can start using runs-on: dagger-g2-main-16c cc @solar coyote @frank bison
โ
@solar coyote Now that the fix of install.sh has been merged, are you okay if I remove the custom-commit runner and add back main and main-wolfi?