#Container Publish idle at the final step


heavy slate
#

Hi,

I am currently building a Dagger pipeline to dockerize Node.js apps, and on one of the apps the last step, which is a simple npm install, sits in an idle state.

I can run the pipeline on my laptop with no issues at all, but on our GitHub runner it just pauses, without any error or message.

I am new to Dagger and I feel like there is a simple issue that I don't see.

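// Build publishes the container to each address in turn,
// stopping at the first failure.
func (i *Image) Build(ctx context.Context, addresses []string) error {
    for _, addr := range addresses {
        // Publish returns the published image reference, which is not needed here.
        if _, err := i.Ctr.Publish(ctx, addr); err != nil {
            return err
        }
    }

    return nil
}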

Thanks in advance 🙂

hazy prairie
#

hm, how many addresses do you have?

heavy slate
#

3

#

The weird part is that I don't even have the logs of the npm install

hazy prairie
#

hm, do you have any logs from your gitlab pipeline at all? what step does it hang at?

heavy slate
#

It is on self-hosted Actions runners, no logs or errors. Btw the CPU usage goes to 0 at the moment the pipeline pauses

hazy prairie
#

hmmm that does look strange - how are your dagger runners set up?

#

or is it just using the dagger-for-github action?

heavy slate
#

They run in Kubernetes with a dind sidecar, and the action runner is in another container with the DOCKER_HOST variable set

gusty cosmos
#

Jed, this seems pseudo-related to what we are seeing over here at Motion on our GitLab runners in K8s. Every so often, the dagger engine just locks up and CPU goes to 0.

hazy prairie
heavy slate
#

Sadly not...
I launch dagger with this command: dagger --progress plain run go run main.go

hazy prairie
#

ugh

#

ok, i'm not sure then 🤔 if there's any way you can kill the dagger engine with SIGQUIT and grab the logs, that'll give us a stack trace - then we can see what the engine is actually doing

#

which is a pain, but also, that would definitely help track down what's actually stalling

heavy slate
#

Okay amazing it works haha, here is the panic message:
panic: Post "http://127.0.0.1:38831/query": EOF

hazy prairie
#

woah

#

well that's... not supremely useful honestly 😄

gusty cosmos
#

We've also seen 502s when hitting the dagger query endpoint

heavy slate
#

I guess it's just a random call to the GraphQL API?

hazy prairie
#

@edgy vector @pine widget any idea what these are from? i do see them occasionally in our own ci, not sure if you have any more insight than i do

pine widget
#

sounds like we'd still need the goroutine dump from the engine on SIGQUIT right? that EOF error looks like it's just hitting the session's forwarded endpoint from dagger run (guessing by the random port)

heavy slate
hazy prairie
#

ah, i think that's the stack trace for the cli

#

i think we'd need the engine, which would be running in a docker container

edgy vector
#

docker kill -s SIGQUIT "$(docker ps -a -q -f 'name=dagger-engine-')" will get it for the engine in the container

hazy prairie
#

aha thanks ❤️

heavy slate
#

okay soooo this is the end of the stack trace

goroutine 7822 [select]:
runtime.gopark(0xc002ea0fa0?, 0x3?, 0x30?, 0xec?, 0xc002ea0f6a?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc002ea0e08 sp=0xc002ea0de8 pc=0x43ebce
runtime.selectgo(0xc002ea0fa0, 0xc002ea0f64, 0xc002ea0e90?, 0x0, 0xaea8d2?, 0x1)
    /usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc002ea0f28 sp=0xc002ea0e08 pc=0x44f085
github.com/moby/buildkit/util/progress.(*progressReader).Read.func1()
    /go/pkg/mod/github.com/moby/buildkit@v0.13.0-beta3/util/progress/progress.go:123 +0xb9 fp=0xc002ea0fe0 sp=0xc002ea0f28 pc=0xae96f9
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc002ea0fe8 sp=0xc002ea0fe0 pc=0x471e21
created by github.com/moby/buildkit/util/progress.(*progressReader).Read in goroutine 7768
    /go/pkg/mod/github.com/moby/buildkit@v0.13.0-beta3/util/progress/progress.go:120 +0x13b

hazy prairie
#

do you have the whole thing?

heavy slate
#

Yes

edgy vector
#

(Just leaving stream of consciousness notes looking through the stack trace for posterity)

One thing that's extremely odd looking atm is goroutines 339 and 340 are both created by buildkit's newScheduler but there should only be 1 scheduler in the process :thinkspin: ... Maybe I'm misreading or there's an obscure possibility the stack trace is somehow inaccurate (highly unlikely, but who knows) but I currently have no explanation for how that could happen. If that is indeed somehow happening, there's a possibility it could lead to inconsistent graph state errors. Not sure if the deadlock is related to that or not.

edgy vector
#

@hazy prairie if online and available, does the above make any sense to you? I can't see how we'd end up creating two schedulers. My instinct is I'm somehow misreading what the stack trace is saying, so would appreciate a second pair of eyes if you're around

#

I can actually repro it on a dev engine (just ./hack/dev and SIGQUIT) so will do some debugging on that

hazy prairie
#

uh yeah, that's bizarre

#

how can newScheduler be called twice?

edgy vector
hazy prairie
#

🤔 why do we have two solvers in the first place?

edgy vector
hazy prairie
#

i can't imagine this is causing the index at state X issue (since that's also seen in upstream buildkit)
but potentially could cause inconsistent graph state

edgy vector
#

Yeah agreed

hazy prairie
#

hm or maybe not... the graph states are internal to the solver

#

but maybe somewhere we're mixing usage of things between the solvers expecting that this would work

edgy vector
#

Okay, it looks like the only places where llbSolver calls its internal generic solver are methods that we never use. So I'm going to rule this as "odd and worth fixing out of caution but probably not the source of our woes"

hazy prairie
#

is there a reason we need the job? can we not just call Solve/Status on the llbsolver?

#

ohhh i see we need the bridge for the job 😛 to even call solve

edgy vector
#

Yeah, also llbSolver.Solve doesn't return the cache ref, which we want

hazy prairie
#

argh yeah - maybe we just need to upstream a hack so we can pass a custom Solver into the LLBSolver opts

edgy vector
hazy prairie
#

splitting hairs, but feels like a getter func is probably the route of least friction, since there's no public fields on that struct atm, but bleh

#

anyways, back to bank holiday for me :salute:

#

i'll make a note to find time to do an upstream pr once i'm back next week 🎉

#

(if you don't beat me to it)

edgy vector
hazy prairie
#

love a funky stack trace 😄

edgy vector
hazy prairie
#

it feels like because of nesting this option is just entirely broken 😒

edgy vector
hazy prairie
#

agh, yeah 😦 i guess that's the best (only) option, but it also confusingly means that max-parallelism isn't really "max" anymore

edgy vector
hazy prairie
#

(unless you're doing an errgroup or something)

#

potentially we could "transfer" the resource - so the first concurrent nested call is treated as not a +1, but other ones are

edgy vector
#

Yeah was just going to say that, we may just need to accept that. If all the goroutines are making calls it still works, but if one is making a call and another is doing busy work then it's not super accurate anymore

hazy prairie
#

some insane internal logic required 😛

heavy slate
#

Oh wow that's some serious investment! Thanks a lot to both of you 🫶

heavy slate
#

Hi guys, I'm coming back to share some good news. We found out what the issue was. It is as simple as this: the yarn install process in buildkit was sitting idle (because of a firewall blacklist, the process was failing silently). So yep... the issue is not in Dagger 🙂

unique locust