#hmmm anyone else seeing this services
1 messages ยท Page 1 of 1 (latest)
I did git checkout v0.4.1, ./hack/dev bash and then go test ./core/integration/ -run ContainerExecServices -count=1 -v to see if it was there too.
Interestingly, I got stuck on a health check for 5 minutes, but then it finally passed? Here's what ps aux shows when run inside the dagger-engine.dev container:
/ # ps aux
PID USER TIME COMMAND
1 root 0:41 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml --debug
171 root 0:00 [dnsmasq]
172 root 0:00 /usr/sbin/dnsmasq -u root --conf-file=/run/containers/cni/dnsname/dagger/dnsmasq.conf
1232 root 0:00 [runc]
1246 root 0:00 [_shim]
3982 root 0:00 sh
5535 root 0:00 ps aux
So yeah I get zombies for the runc+_shim procs, but nothing still running leaked in that case
But the 5 minutes for the health check to pass is weird
hm yeah strange. worth noting -run ContainerExecServices actually runs a few tests, not just one, in case that's related. but I haven't seen it take that long
It looks like the 5 minutes was due to just this test: PASS: TestContainerExecServicesChained (349.42s)
All the other tests passed in a reasonable amount of time
gotcha, that makes more sense; a 5 minute runtime is still pretty slow, but that's definitely one of the heavier tests, and I wonder if it's affected by that accidentally-quadratic slowness you found. I'll keep looking into this and see if it regressed or something ๐
super odd. added a time.Sleep(15 * time.Second) to the end of the simplest services test, and I can see the 10 second grace period go by and cancel the vertex. the logs show 'CANCELED', but the process tree goes from screenshot A to B - the outer dagger-shim process dies, but the inner runc process lives on. despite us setting Pdeathsig: SIGKILL and forwarding all signals. 
What is the issue? From @RonanQuigley Dagger does not respect SIGINT (ctrl + c). Able to replicate on Mac and Linux. Log output n/a Steps to reproduce run a pipeline and hit ctrl-c. It seems like S...
can't reproduce on v0.4.1, trying v0.4.2 now
doubtful, not sure what's even being Ctrl+C'd there actually, it's up to the program using Dagger to trap/handle signals appropriately
@limpid fiber this is looking like a v0.4.2 regression. super easy to reproduce, at least. wonder if it's related to the Buildkit bump? maybe something changed with how they handle interrupts? (will keep looking, just an async heads-up)
going to try reverting the bump (locally)
Yeah I am pretty sure (maybe double check though) that these commits were newly introduced when we did the most recent upgrade:
https://github.com/moby/buildkit/pull/3722
https://github.com/moby/buildkit/pull/3658
And they are definitely highly interrelated with the issue at hand
yeah, reverting the bump fixes it
kicking myself for not adding a test that services actually stop. I didn't want to have to deal with the 10 second grace period ๐
Might be worth trying against a buildkit where you've reverted just those two commits
Thankfully that's just a matter of a go.mod override now
reverting just https://github.com/moby/buildkit/pull/3722/commits/b76f8c02482e11b3b480e0c6ddf54cc91a667730 fixes it
Awesome, that makes it much simpler to debug then!
I usually just go right to trying to repro bugs like this in a buildkit integ test. If you just report a problem upstream the response is pretty much always just "needs a repro"
makes sense. i'll whittle it down to a minimal repro and PR a failing test first. did something similar for the scheduler issue, though this one's hopefully much simpler
i wonder if this is something that only affects us, since we wrap runc?
Also Cory (the author of that PR) is responsive on the buildkit slack if asking him anything would be helpful
Yeah that's absolutely possible
We end up more or less just calling runc the same way buildkit does, so I can't think of anything extremely obvious that would go wrong due to our shim, but it's definitely in the cards
whoa, weird, they changed it from calling runc kill <id> to sending SIGKILL to the runc process, which in our case is our shim wrapping runc. those seem like very different things...? (even without our wrapper)
(hopping into buildkit slack)
The issue I linked might actually just be a bug with our nodejs sdk in the connect function. Definitely unrelated
just confirmed this issue happens with Bass too, and I'm not wrapping runc there so yeah, the scope of this breakage seems beyond Dagger
I guess this also probably affects the playground since spawned containers are not likely to be killed if we close the user session due to timeouts ๐