#exit 4294967295
1 messages · Page 1 of 1 (latest)
https://github.com/sipsma/buildkit/blob/3187d2d056de7e3f976ef62cd548499dc3472a7e/executor/runcexecutor/executor.go#L314-L314 That is one suspect place where a uint32 conversion is being made @still patio
how reliably does this happen for you? does it happen locally?
Yeah it happens every time I try locally
well at least we have that!
Haha yeah with enough printlns the truth will be revealed
@tall drum it seems like this can only come from here, right? https://github.com/sipsma/buildkit/blob/3187d2d056de7e3f976ef62cd548499dc3472a7e/vendor/github.com/containerd/go-runc/monitor.go#L66
seems to be the only place a *runc.ExitError is ever constructed.
but everywhere seems to set 255 as a default, so where are we even getting -1 or 4294967295? 
correction, the *runc.ExitError is constructed by the caller from the value returned by Wait, but I think it's still just this one path...
I'm so mad right now... I added a println above the uint32 conversion and now I can't get it to repro anymore........
But yeah I was looking at those lines, the only possibility is that ws.ExitStatus() returns -1 if it is not actually exited, but that's not possible (right?)
there could of course be other uint32 conversions somewhere
I hope it's not something stupid involving command context with the context getting canceled and causing the process wait to return an error when the process has not exited yet
hmmm.....
yeah this seems like a possibility too. I wonder what happens if an overridden Cancel returns an error and you call Wait()
// (If the command exits with a non-success status, or Cancel
// returns an error that wraps os.ErrProcessDone, Wait and similar methods
// continue to return the command's usual exit status.)
oh, they don't set Cancel anyway
omg I finally repro'd it again, yeah runcExitError.Status is -1 on this line when it happens: https://github.com/sipsma/buildkit/blob/3187d2d056de7e3f976ef62cd548499dc3472a7e/executor/runcexecutor/executor.go#L314-L314
nice!
is this a SIGCHLD bubbling up in a weird way or something? is does it possible for Wait() to be interrupted without the process actually exiting? sounds strange, but we're at the layer where we do strange things
time for chatgpt
Well it looks like buildkit's dep on go-runc is v1.0.0 which is extremely old: https://github.com/containerd/go-runc/tree/v1.0.0
So I was looking through commits made since then, found at least one interesting one: https://github.com/containerd/go-runc/commit/1caf1a73de233bdfe8bbdd860c42b0d263fabe12
In Linux, the wait() system call is used to make a parent process wait for its child processes to complete. Under normal circumstances, wait() will only return when a child process exits, either by terminating normally or by being killed by a signal.
However, there are certain situations in which wait() may return without the child process actually exiting:
- Signal interruption: If a signal is delivered to the parent process while it is waiting, and the parent process has a signal handler installed for that signal, the wait() system call may be interrupted and return prematurely. In this case, the return value would be -1, and errno would be set to EINTR. You can handle this case by checking for EINTR and restarting the wait() call if necessary.
- Using waitpid(): If you are using the waitpid() function with the WNOHANG option, it will return immediately if no child process has exited. In this case, the return value will be 0, indicating that no child process has changed state. To properly handle this situation, you can use a loop to continuously call waitpid() until a valid return value is obtained.
In summary, while it is possible for wait() to return without a child process actually exiting, this typically occurs due to signal interruptions or specific options used with waitpid(). By checking the return value and errno, you can handle these cases and ensure your parent process waits for the child process to exit as intended.
seems like those situations would be handled for us 🤔
oh interesting. need to keep that in mind whenever I use Pdeathsig now huh
so that fixes a case where you exec() with pdeathsig in thread A, something else runs in that thread, and the thread exits while the rest of the program runs?
Yeah I had no idea either
i don't remember the rules for when threads spawn/exit in Go, I want to say it could fluctuate with I/O though?
to keep it nonblocking
Yeah the runtime will make new threads and throw old ones away as it needs, for i/o and some internals like gc (I think)
Also, we picked this commit up with the upgrade too 🙂 https://github.com/moby/buildkit/pull/3738#discussion_r1147894969
Which basically means that the number of threads that the runtime has to throw away is probably much higher when CNI is in use....
oo good find
Well I pushed a commit to upgrade go-runc to the latest version in our go.mod and CI just passed for the first time... https://github.com/dagger/dagger/actions/runs/4580696812/jobs/8089648119?pr=4879
I'll give it a few more go's since it seemed to be somewhat random locally
But promising...
nice. brb grepping for Pdeathsig in all of my code
Apparently the go runtime will actually only kill threads in the specific case where you leave the thread locked: https://github.com/golang/go/issues/27505
Okay after re-running both engine test and test-race 3 times each, I think the problem is probably gone 🤞, should be good to merge on Monday: https://github.com/dagger/dagger/pull/4879
Also sent an upstream PR for fixes, but we don't need to wait on that to merge ours: https://github.com/moby/buildkit/pull/3765
awesome!
oh I see the connection now - the CNI change in buildkit never unlocks the OS thread. (the go-runc change does.)
and this explains why people could see it randomly before - it always could have happened, because our go-runc dependency didn't lock the OS thread. and then the bump just made it way more likely to happen because way more threads were exiting.
This was great to read! Will add my comment in the PR - Discord is ephemeral.
great job on this fix, I validated it with Bass over the weekend. 👍
(after going on a huge adventure to support pointing Bass to Buildkit forks, which involved adding support for Dockerfiles and a bunch of other stuff. I'm now in an extremely deep rabbit hole, but a lot of features are coming out, and the amount of Bash in my repo is decreasing, so whee)
Thanks, good to hear it's fixed there too! Surprised this has been lurking for so long barely noticed
The PR needs an approval if it looks good https://github.com/dagger/dagger/pull/4879