#Engine not flushing all events on sigint

1 messages ยท Page 1 of 1 (latest)

dusk terrace
#

Hello core engineers ๐Ÿ˜‡

I am currently exploring this bug: https://linear.app/dagger/issue/DEV-3286/services-timeout-or-run-forever-in-dagger-cloud

Basically, whenever we run a pipeline: dagger run go main.go, with such pipeline:

package main

import (
    "context"
    "os"

    "dagger.io/dagger"
)

func main() {
    ctx := context.Background()

    // create Dagger client
    client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))

    if err != nil {
        panic(err)
    }
    defer client.Close()

    _, err = client.Container().From("alpine").WithExec([]string{"sleep", "123"}).Sync(ctx)
    if err != nil {
        panic(err)
    }

}

When we sigint before the sleep ends, we seem to exit prior flushing all events, leading to this output:

dagger run go run .                  
โ–ˆ [0.48s] connect
โ”ฃ [0.20s] starting engine
...
โ”‚ โ”ป โ”‚ 
โ–ˆ   โ”‚ [14.26s] ERROR go run .
โ–“โ—€โ”€โ”€โ•ฏ [11.82s] exec sleep 123
โ”ป     
โ€ข Cloud URL: https://dagger.cloud/runs/a04410bb-c785-49e0-b09f-83f0e4c5db78
โ€ข Engine: 622e40844c39 (version devel ())
โง— 14.78s โœ” 10 โœ˜ 1
signal: terminated

and to an incorrect state on dagger.cloud. The journal looks like this, with events that were properly completed

{"vertexes":[{"id":"[root-vtx]","name":"go run .","started":{"seconds":1707353206,"nanos":615720000},"error":"signal: terminated"}],"sent":{"seconds":1707353220,"nanos":737305000}}
{"vertexes":[{"id":"[root-vtx]","name":"go run .","started":{"seconds":1707353206,"nanos":615720000},"completed":{"seconds":1707353220,"nanos":737359000},"error":"signal: terminated"}],"sent":{"seconds":1707353220,"nanos":737360000}}
{"groups":[{"id":"_root","started":{"seconds":1707353206,"nanos":74872000},"completed":{"seconds":1707353220,"nanos":763963000}}],"sent":{"seconds":1707353220,"nanos":763970000}}
#

However, one doesn't receive the completed status:

{"vertexes":[{"id":"sha256:ed57004cab309ee67073955605d5d614fe471741f30ebea3333d637127ee6e19","name":"exec sleep 137","inputs":["sha256:5b26c7493396a25190a99c8178e9d514967548bfb9aea000dd00eb0919944ea3","sha256:d992cee8196052d190c4e3ab98c46199d4fbf1f5cc301c778732bde899deddfd"],"started":{"seconds":1707353208,"nanos":809862793}}],"sent":{"seconds":1707353208,"nanos":810253501}}

However, when I go inside the docker mac vm and directly send the sigint to the sleep process (in this case), the completed vertex is correctly populated:

{"vertexes":[{"id":"sha256:c302c130f0f8c377ae559b6497492d8310672567a96e70138d904e4faca7af06","name":"exec sleep 137","inputs":["sha256:5b26c7493396a25190a99c8178e9d514967548bfb9aea000dd00eb0919944ea3","sha256:d992cee8196052d190c4e3ab98c46199d4fbf1f5cc301c778732bde899deddfd"],"started":{"seconds":1707353466,"nanos":465541509}}],"sent":{"seconds":1707353466,"nanos":466022551}}
{"vertexes":[{"id":"sha256:c302c130f0f8c377ae559b6497492d8310672567a96e70138d904e4faca7af06","name":"exec sleep 137","inputs":["sha256:5b26c7493396a25190a99c8178e9d514967548bfb9aea000dd00eb0919944ea3","sha256:d992cee8196052d190c4e3ab98c46199d4fbf1f5cc301c778732bde899deddfd"],"started":{"seconds":1707353466,"nanos":465541509},"completed":{"seconds":1707353478,"nanos":115074251},"error":"process \"sleep 137\" did not complete successfully: exit code: 255"}],"sent":{"seconds":1707353478,"nanos":115178417}}

There seems to be a mismatch in behavior when we sigint from the terminal.

Now, as this part of the code is a bit blurry to me, how would you print / debug step by step ? Shall I check the session first ?

#

The aim would be able to learn how to track such bugs, especially because we cannot put prints everywhere to track the behavior (can we ?) ๐Ÿ‘ผ ๐Ÿ™

hasty sinew
#

potentially related

hasty sinew
# dusk terrace However, one doesn't receive the completed status: ``` {"vertexes":[{"id":"sha2...

Ok, so a few things happening here (doing a knowledge core dump):

  • CTRL+C at the terminal is going to do a few things: dagger run should receive SIGINT - however, SIGINT is also sent directly to the session I think (since it's in the same progress group). I raised something similar to this recently in #1201911337744339046 message. This then should trigger the session to hang-up: https://github.com/dagger/dagger/blob/2700c93ebbc9978dd5e1c7f1c48e63ce5c647981/cmd/dagger/session.go#L60-L64 (and then the defer triggers, and we stop our session call to dagger). This then triggers the dagger server shutdown.
  • Sending SIGINT directly to the sleep process is completely different - this isn't going through dagger run/etc, we're just killing the spawned process, so the progress is expected to work perfectly fine in this case.

A couple things that could be going wrong:

  • For some reason, we might never get the completed progress event in the dagger server at any point - you can confirm this by adding some logging messages in the server, and watching to see if it never appears in the server logs, just like it doesn't make it to the progrock journal (which is written by the client). If this is happening, then we're probably not correctly getting all the events from the buildkit solver (maybe closing our channel too early) - some logic in the dagger server/controller is wrong, and we need to work out why that event never gets found.
  • For some reason, we might get the event, but it never reaches the client (this actually feels kinda likely to me). The client has probably already hung up from the session, which means it never gets those final progress events. If this is happening, then the fix would be to re-order the code in EngineSession so that we close the session before disconnecting the cli client. (but also, a couple things: we need to make sure that we wouldn't run cache exporters here, and i'm not sure this code actually blocks and waits for the progress stream to flush, we would need to do that - but you could temporarily hack it and test by adding some time.Sleeps in to see if that fix would be effective)
dusk terrace