#hmmm anyone else seeing this services

1 messages ยท Page 1 of 1 (latest)

limpid fiber
#

I did git checkout v0.4.1, ./hack/dev bash and then go test ./core/integration/ -run ContainerExecServices -count=1 -v to see if it was there too.

Interestingly, I got stuck on a health check for 5 minutes, but then it finally passed? Here's what ps aux shows when run inside the dagger-engine.dev container:

/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:41 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --debug
  171 root      0:00 [dnsmasq]
  172 root      0:00 /usr/sbin/dnsmasq -u root --conf-file=/run/containers/cni/dnsname/dagger/dnsmasq.conf
 1232 root      0:00 [runc]
 1246 root      0:00 [_shim]
 3982 root      0:00 sh
 5535 root      0:00 ps aux
#

So yeah I get zombies for the runc+_shim procs, but nothing still running leaked in that case

#

But the 5 minutes for the health check to pass is weird

gilded juniper
#

hm yeah strange. worth noting -run ContainerExecServices actually runs a few tests, not just one, in case that's related. but I haven't seen it take that long

limpid fiber
#

It looks like the 5 minutes was due to just this test: PASS: TestContainerExecServicesChained (349.42s)

#

All the other tests passed in a reasonable amount of time

gilded juniper
#

gotcha, that makes more sense; a 5 minute runtime is still pretty slow, but that's definitely one of the heavier tests, and I wonder if it's affected by that accidentally-quadratic slowness you found. I'll keep looking into this and see if it regressed or something ๐Ÿ‘

gilded juniper
#

super odd. added a time.Sleep(15 * time.Second) to the end of the simplest services test, and I can see the 10 second grace period go by and cancel the vertex. the logs show 'CANCELED', but the process tree goes from screenshot A to B - the outer dagger-shim process dies, but the inner runc process lives on. despite us setting Pdeathsig: SIGKILL and forwarding all signals. wat

viscid spear
gilded juniper
#

can't reproduce on v0.4.1, trying v0.4.2 now

gilded juniper
#

@limpid fiber this is looking like a v0.4.2 regression. super easy to reproduce, at least. wonder if it's related to the Buildkit bump? maybe something changed with how they handle interrupts? (will keep looking, just an async heads-up)

#

going to try reverting the bump (locally)

limpid fiber
#

And they are definitely highly interrelated with the issue at hand

gilded juniper
#

yeah, reverting the bump fixes it

#

kicking myself for not adding a test that services actually stop. I didn't want to have to deal with the 10 second grace period ๐Ÿ˜…

limpid fiber
#

Thankfully that's just a matter of a go.mod override now

gilded juniper
limpid fiber
#

Awesome, that makes it much simpler to debug then!

#

I usually just go right to trying to repro bugs like this in a buildkit integ test. If you just report a problem upstream the response is pretty much always just "needs a repro"

gilded juniper
#

makes sense. i'll whittle it down to a minimal repro and PR a failing test first. did something similar for the scheduler issue, though this one's hopefully much simpler

#

i wonder if this is something that only affects us, since we wrap runc?

limpid fiber
#

Also Cory (the author of that PR) is responsive on the buildkit slack if asking him anything would be helpful

limpid fiber
#

We end up more or less just calling runc the same way buildkit does, so I can't think of anything extremely obvious that would go wrong due to our shim, but it's definitely in the cards

gilded juniper
#

whoa, weird, they changed it from calling runc kill <id> to sending SIGKILL to the runc process, which in our case is our shim wrapping runc. those seem like very different things...? (even without our wrapper)

#

(hopping into buildkit slack)

viscid spear
gilded juniper
#

just confirmed this issue happens with Bass too, and I'm not wrapping runc there so yeah, the scope of this breakage seems beyond Dagger

gilded juniper
floral mist