#Intermittent failures when loading dependencies?

1 messages ยท Page 1 of 1 (latest)

marsh dirge
#

We've been seeing intermittent failures in different pipelines which we find difficult to reproduce and I'm hitting my head against a wall trying to find out what's going on.. Any pointers on how to debug would be very welcomed ๐Ÿ™‚

The code that seems to trigger the issue, is when a dependent module is trying to get loaded (a dependent private module referenced by a git url).
I'm attaching what seems to be the engine logs of the session having the issue:.

What the CLI reports is always something similar to the second attachment

#

I've seen this failure across many different pipelines in random codepaths, same error though. Just now I've managed to dig into the engine logs to capture the debug/error logs from the session.

Is there anything I could be looking deeper to understand what's erroring here?
Seems like it's git returning an error when trying to access the remote dep?
git error: exit status 128

past coyote
marsh dirge
#

Oh! That's great. ty ty ty

#

I'll upgrade our engines as soon as it's out ๐Ÿ™‚
I'm sure it's something on our end somehow... but quite hard to get anything out of the logs ๐Ÿ˜…

past coyote
marsh dirge
#

Some more data as I was finalizing some dashboards to help me pinpoint the issue, I just found out it started to happen way more often since we upgraded to 0.19.4 - all those spikes started after we upgraded, 03/11 ๐Ÿค” (dashboard is showing how many times git error: exit status 128 appears in the engine logs)

past coyote
past coyote
#

here's an example I'm currently getting in my console output:

9   : git(url: "github.com/pepe/pepa"): GitRepository!
10  : โ”† git(url: "https://github.com/pepe/pepa"): GitRepository!
12  : โ”† โ”† git(
12  : โ”† โ”† โ”† url: "https://github.com/pepe/pepa"
12  : โ”† โ”† โ”† keepGitDir: true
12  : โ”† โ”† โ”† httpAuthUsername: "marcosnils"
12  : โ”† โ”† โ”† httpAuthToken: setSecret(name: "d00eb6557060f6f3a4468408177f2aa0098e507e194610b79e30cdaa5f3df128"): Secret!
12  : โ”† โ”† ): GitRepository!
13  : โ”† โ”† git ls-remote --symref https://github.com/pepe/pepa
13  : โ”† โ”† git ls-remote --symref https://github.com/pepe/pepa ERROR [0.3s]
13  : โ”† โ”† ! git error: exit status 128
12  : โ”† โ”† git ERROR [0.3s]
12  : โ”† โ”† ! git error: exit status 128
10  : โ”† git ERROR [0.6s]
10  : โ”† ! select: git error: exit status 128
9   : git ERROR [0.6s]
9   : ! load: select: git error: exit status 128
12  : โ”† git(
12  : โ”† โ”† url: "https://github.com/pepe/pepa"
12  : โ”† โ”† keepGitDir: true
12  : โ”† โ”† httpAuthUsername: "marcosnils"
12  : โ”† โ”† httpAuthToken: setSecret(name: "d00eb6557060f6f3a4468408177f2aa0098e507e194610b79e30cdaa5f3df128"): Secret!
12  : โ”† ): GitRepository!
13  : โ”† โ”† git ls-remote --symref https://github.com/pepe/pepa ERROR [0.3s]
13  : โ”† โ”† ! git error: exit status 128
13  : โ”† โ”† [0.3s] | remote: Repository not found.
13  : โ”† โ”† [0.3s] | fatal: repository 'https://github.com/pepe/pepa/' not found
marsh dirge
#

Mmm, I can't see anything else.
The output I pasted above is the workflow logs. These workflows run with the --progress dots when running on CI (Github Actions), would that be hiding some of these spans?

#

I'll enable the --progress plain for a workflow that seems to recurringly have the issue and double check the outputs again ๐Ÿ™

past coyote
# marsh dirge I'll enable the `--progress plain` for a workflow that seems to recurringly have...

@marsh dirge in the --progress dots output seems like the logs get printed at the very top instead of the end.

i.e:

dagger --progress dots -M -c 'git github.com/pepe/pepa'
โ–ถ ...........

โ–ผ exec docker ps -a --format {{.Names}}
dagger-engine.dev
dagger-engine-v0.19.4
api-inngest-1
api-db-1
api-clickhouse-1

โ–ผ exec docker start dagger-engine-v0.19.4
dagger-engine-v0.19.4

โ–ถ ..

โ–ผ connecting to engine
01:00:45 INF connected name=f8f0e8482a5d client-version=v0.19.4 server-version=v0.19.4

โ–ถ ...........

XXXXX.X

โ–ผ git ls-remote --symref https://github.com/pepe/pepa
remote: Repository not found.
fatal: repository 'https://github.com/pepe/pepa/' not found
โ–ถ connect 0.4s
โœ” loading type definitions 0.4s
โœ˜ git 0.6s ERROR
! load: select: git error: exit status 128

โ–ผ git(url: "github.com/pepe/pepa"): GitRepository! 0.6s ERROR
! load: select: git error: exit status 128
โ•ฐโ•ดโ–ผ git(url: "https://github.com/pepe/pepa"): GitRepository! 0.6s ERROR
  ! select: git error: exit status 128
  โ•ฐโ•ดโ–ผ git(
    โ”‚ โ”† url: "https://github.com/pepe/pepa"
    โ”‚ โ”† keepGitDir: true
    โ”‚ โ”† httpAuthUsername: "marcosnils"
    โ”‚ โ”† httpAuthToken: setSecret(name: "d00eb6557060f6f3a4468408177f2aa0098e507e194610b79e30cdaa5f3df128"): Secret!
    โ”‚ ): GitRepository! 0.3s ERROR
    ! git error: exit status 128
    โ•ฐโ•ดโ–ผ git ls-remote --symref https://github.com/pepe/pepa 0.3s ERROR
      ! git error: exit status 128

Full trace at https://dagger.cloud/marcos-test/traces/8aff83c119c69f1e856d6f2652c407b0
Error: load: select: git error: exit status 128

^ as you can see form that output, the repository not found error is at the very top instead of being at the end.

marsh dirge
#

Mmmm. There's no trace of the actual error... In our case, it does not have to do with implicitly calling any git functionality....
It happens when trying to load dependencies, not sure if that makes any difference

past coyote
#

@marsh dirge ok, thx. Now I know where to look better. This code path is different than the one I was trying since as you mentioned, this happens at the module runtime phase ๐Ÿ™

#

@marsh dirge just to confirm. Is the last trace from 19.4 or 19.5?

#

@marsh dirge also, have you checked if --progress plain shows anything different?

marsh dirge
#

was 0.19.4

#

and I haven't had a case reproduced with --progress plain yet ๐Ÿ™

#

I'll post here as soon as I have one with more logs

sterile bay
#

๐Ÿ‘‹ We've finally been able to get some better logs about this!

A recap of our setup, we have a single dagger engine that ephemeral GHA runners connect to.

Here's what we are seeing:

  1. A new job starts, we start an ssh agent in the runner pod with our ssh key:
Agent pid 358
Identity added: /home/runner/.ssh/identity (/home/runner/.ssh/identity)
  1. We can see the engine use the socket (357) to collect the shas of each refs in our github repo
30  : โ”† โ”† โ”† git(
30  : โ”† โ”† โ”† โ”† url: "ssh://git@github.com/<myorg>/<myrepo>"
30  : โ”† โ”† โ”† โ”† keepGitDir: true
30  : โ”† โ”† โ”† โ”† sshAuthSocket: Host.unixSocket(path: "/tmp/ssh-o8ZYwEurcTJ2/agent.357"): Socket!
30  : โ”† โ”† โ”† โ”† commit: "4b425bc12b572f7d4049cc51dabf131b95de00b3"
30  : โ”† โ”† โ”† ): GitRepository!
32  : โ”† โ”† โ”† โ”† git ls-remote --symref ssh://git@github.com/<myorg>/<myrepo>
32  : โ”† โ”† โ”† โ”† [1.2s] | ref: refs/heads/main    HEAD
32  : โ”† โ”† โ”† โ”† [1.2s] | 0c54915dd571055a20ff6f68097a4e3133de82ea    HEAD
32  : โ”† โ”† โ”† โ”† [1.2s] | e2867a3d5aa29e01a094d435b8b5deb84741ff4e    refs/pull/1/head
32  : โ”† โ”† โ”† โ”† [1.2s] | 00f28a1689c25103570bf5d569bc2ea717408980    refs/pull/10/head
#
  1. Finally the call fails as the engine tries to fetch the refs using what appears to be an old socket (306) which was cached:
93  : โ”† โ”† โ”† git(
93  : โ”† โ”† โ”† โ”† url: "ssh://git@github.com/<myorg>/<myrepo>"
93  : โ”† โ”† โ”† โ”† keepGitDir: true
93  : โ”† โ”† โ”† โ”† sshAuthSocket: Host.unixSocket(path: "/tmp/ssh-CA2Sm7RydmH2/agent.306"): Socket!
93  : โ”† โ”† โ”† โ”† commit: "4b425bc12b572f7d4049cc51dabf131b95de00b3"
93  : โ”† โ”† โ”† ): GitRepository!
94  : โ”† โ”† โ”† โ”† Host.unixSocket(path: "/tmp/ssh-CA2Sm7RydmH2/agent.306"): Socket!
94  : โ”† โ”† โ”† โ”† Host.unixSocket DONE [0.0s]
95  : โ”† โ”† โ”† โ”† git remote metadata
96  : โ”† โ”† โ”† โ”† git ls-remote --symref ssh://git@github.com/<myorg>/<myrepo> ERROR [0.5s]
96  : โ”† โ”† โ”† โ”† ! git error: exit status 128
96  : โ”† โ”† โ”† โ”† [0.5s] | git@github.com: Permission denied (publickey).
96  : โ”† โ”† โ”† โ”† [0.5s] | fatal: Could not read from remote repository.
96  : โ”† โ”† โ”† โ”† [0.5s] | 
96  : โ”† โ”† โ”† โ”† [0.5s] | Please make sure you have the correct access rights
96  : โ”† โ”† โ”† โ”† [0.5s] | and the repository exists.

Since the 306 socket does not exist anymore the git call is not authorized.

This seems like a bug where the 306 socket should not have been cached?
It took around 30 minutes for the engine to stop trying to use this old socket.

Is there anything we can do locally to mitigate this? These particular runs were on 0.19.8 (we are now running 0.19.10, since yesterday, we got reports of this every other day)

past coyote
# sterile bay ๐Ÿ‘‹ We've finally been able to get some better logs about this! A recap of our ...

hey Vincent! thx for reporting. Would you mind opening an issue if possible?

This reminds me a recent thread here #1463115662808449105 message where a user was having some caching related questions when the SSH socket changes. What I'd advise as a temporary fix is to change your GHA workflows so you always create or symlink the SSH socket to the same path in the machine. This will not only fix your issue but should also enable better caching in the case of multiple workflows cloning the same repo + commit since the engine will be able to re-use the cache

sterile bay
#

I've implemented your suggested workaround! Will see if you get some reports in the next few days ๐Ÿ™‚ I'll create an issue with the details from above in the meantime !

sterile bay
clever mountain
clever mountain
clever mountain
#

Ok yeah I'm still unable to repro after trying for a while, so I'll wait for you to get back on the above ^ I'm sure there's just some very particular ordering or setup required, just can't seem to find it yet

sterile bay
#

I've commented some more information on the issue! We've just rolled out the workaround you proposed with a single socket name. I'll keep an eye for reports over the next few days.

clever mountain
#

Just to update, I know exactly what's going wrong here. It wasn't quite fixed by the earlier work, but it's all connected to my current project on the caching system.

Have you continued to see the issue after the workaround?

If yes, I can try to fast track a patch to fix.
If not, I will just continue working on the "real" fix, which should solve this and quite a few related problems at once

clever mountain
sterile bay
#

We've not seen the issue since using a predictable socket name by the way! Thank you for being so on top of things!