#Dagger CI Logs Confusing

opaque terrace
#

I rolled out Dagger (with Depot) to speed up our CI checks, but the main comment I'm getting from developers is that the actual errors are hard to track down.

For some context on how we have things set up:

  • we have a single github action that calls dagger call ci
  • that ci function in dagger does an await Promise.all with ~17 other dagger functions to perform the various checks, using .exitCode()
  • we take heavy advantage of the cache, to avoid re-running checks if they previously passed with the same input files

When something fails, the last few lines of output are something like:

Full trace at https://dagger.cloud/redacted


Error: input: foobar.ci process "tsx --no-deprecation --tsconfig /src/dagger/tsconfig.json /src/dagger/src/__dagger.entrypoint.ts" did not complete successfully: exit code: 1

Stderr:
Error: process "mix ash.codegen --check" did not complete successfully: exit code: 1

The Error: input: bit has been confusing devs, but the last line usually tells you the actual command that failed. Finding why it failed is trickier, though. You need to scroll up, sometimes quite a bit, and the relevant output is generally interleaved with other logs, so it's hard to tell whether you're looking at the real failure or just output from the other jobs cancelling. Dagger Cloud makes it somewhat easier, but it's still tricky since there's a lot going on.

So I guess my questions are:

  • Is there a better way to have things set up? We're using a single github action job mostly to make sure we're taking full advantage of the cache + doing as much as we can in parallel. Makes for less boilerplate on the github actions side of things too.
  • Are there any configs or flags I'm missing that could get that failed check's output printed again at the end of the run?
  • Or any workarounds in the dagger code that could do similar?
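
One possible workaround in the Dagger code, sketched here under assumptions: catch each check's rejection individually, then throw a single summary once every check has finished, so the failing check's name and message are the last thing in the output. The `Check` and `Failure` types and the `runChecks` and `formatFailures` helpers are hypothetical names for illustration, not part of the Dagger SDK.

```typescript
// Hypothetical types: a named check and a recorded failure.
type Check = { name: string; run: () => Promise<unknown> };
type Failure = { name: string; message: string };

// Run all checks in parallel, catching each rejection individually so the
// first failure does not reject the whole batch.
async function runChecks(checks: Check[]): Promise<Failure[]> {
  const outcomes = await Promise.all(
    checks.map(async (check): Promise<Failure | null> => {
      try {
        await check.run();
        return null;
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { name: check.name, message };
      }
    }),
  );
  // Keep only the checks that failed.
  return outcomes.filter((o): o is Failure => o !== null);
}

// Build one summary block, meant to be thrown at the very end of the run.
function formatFailures(failures: Failure[]): string {
  return failures
    .map((f) => `--- ${f.name} failed ---\n${f.message}`)
    .join("\n\n");
}
```

Inside `ci`, this would replace the bare `Promise.all`: call `runChecks` with one entry per check, then `throw new Error(formatFailures(failures))` if the returned list is non-empty. The trade-off is that a failing run no longer stops at the first error, so it takes as long as the slowest check.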
clever marsh
#

@steel thistle @quartz jetty @fathom matrix FYI

fathom matrix
#

If you have any example where the Cloud viz is not helping feel free to DM me the trace ID so we can take a look please

opaque terrace
#

It helps a bit, but there are still a few identical-looking blocks of logs, with the one they'd care about being in the middle.
Some thoughts:

  • is this worse because I'm using .exitCode() instead of .stdout() maybe?
  • is there any way to parse dagger output programmatically? then maybe I could grab the initial error and comment it directly on the PR, for example
#

Looks like .stdout() doesn't really change things

fathom matrix
# opaque terrace Looks like `.stdout()` doesn't really change things

is this worse because I'm using .exitCode() instead of .stdout() maybe?

Sure, you can probably structure the output of your functions to add more context there, like stdout, stderr, and exitCode. Having said that, what we do ourselves is mostly use Cloud, given that sometimes it's not straightforward to understand where the error is coming from with a single output

#

e.g. if your pipeline uses services and for some reason a service failed to start, that's generally difficult to see from the exec output

#

is there any way to parse dagger output programmatically? then maybe I could grab the initial error and comment it directly on the PR, for example

Not currently. There's no core API for this, since all the pipeline telemetry is produced in OTEL format.

opaque terrace
#

Is there an example of how I could add more context?
For example, with these:

  @func()
  async fooGraphqlSchema(): Promise<number> {
    return this.fooBuildEnv()
      .withExec(["mix", "ash.codegen", "--check"])
      .exitCode();
  }
  @func()
  async ci(): Promise<void> {
    await Promise.all([
      this.fooGraphqlSchema(),
    ]);
  }
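
A minimal sketch of one way to add that context, assuming the error thrown by a failed exec carries `stdout`, `stderr`, and `exitCode` fields (the Dagger TypeScript SDK's `ExecError` is believed to, but check your SDK version). `ExecLikeError`, `describeFailure`, and `withContext` are hypothetical helpers, not SDK APIs.

```typescript
// Assumed shape of a failed-exec error; every field except message is optional.
interface ExecLikeError {
  message: string;
  stdout?: string;
  stderr?: string;
  exitCode?: number;
}

// Keep only the last maxLines lines of a log.
function tail(text: string, maxLines: number): string {
  return text.trimEnd().split("\n").slice(-maxLines).join("\n");
}

// Turn an error into a labelled, self-contained failure report.
function describeFailure(label: string, err: unknown, maxLines = 40): string {
  const e = (typeof err === "object" && err !== null
    ? err
    : { message: String(err) }) as ExecLikeError;
  const parts = [`[${label}] ${e.message}`];
  if (e.exitCode !== undefined) parts.push(`exit code: ${e.exitCode}`);
  if (e.stdout) parts.push(`stdout (last ${maxLines} lines):\n${tail(e.stdout, maxLines)}`);
  if (e.stderr) parts.push(`stderr (last ${maxLines} lines):\n${tail(e.stderr, maxLines)}`);
  return parts.join("\n");
}

// Wrap a check so any failure is rethrown with its label and log tail attached.
async function withContext<T>(label: string, run: () => Promise<T>): Promise<T> {
  try {
    return await run();
  } catch (err) {
    throw new Error(describeFailure(label, err));
  }
}
```

With this, the body of `fooGraphqlSchema` could become `return withContext("fooGraphqlSchema", () => this.fooBuildEnv().withExec(["mix", "ash.codegen", "--check"]).exitCode());`, so the message that surfaces at the end names the check and includes the tail of its own output.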
steel thistle
#

I just ran into this myself, and got a similar error via the TS SDK. However, in Cloud, when I click the dropdown cell where the error occurred, I see the actual logs

Are you seeing something else (or nothing) in the log section?

#

Oh hm, actually I have an even bigger issue. My exit 1 is being swallowed up. It does not show up as an error in the main trace page. (I think this is what you are saying as well)

Would be great to dig in here @quartz jetty or @fathom matrix because I have a fresh example that we can easily dissect

opaque terrace
#

My issue is less that an error isn't there, and more that there are so many unrelated errors/logs that it's really hard to find what actually caused the whole thing to die.
(By 'what actually', I mean more the relevant logs, like specific test/lint/whatever issues, rather than just the process call)

steel thistle