#failed to add socket from source client


sour juniper
#

When we upgraded the Dagger engine from v0.18.12 to v0.18.16, we started to see the errors below:

unexpected status 200: get or init client: initialize client: failed to add client resources from ID: failed to add socket from source client yk2roxtw4nq7kau0lz5gzstu5: socket xxh3:cf078eddb126861b not found in other store

failed to return error: get or init client: initialize client: failed to add client resources from ID: failed to add socket from source client yk2roxtw4nq7kau0lz5gzstu5: socket xxh3:cf078eddb126861b not found in other store 
original error: get parent name: get or init client: initialize client: failed to add client resources from ID: failed to add socket from source client yk2roxtw4nq7kau0lz5gzstu5: socket xxh3:cf078eddb126861b not found in other store

It seems to be related to calls to the dag.Git function where the Git URL points to a private repository:

type Failing struct{}

func (f *Failing) Failure(socket *dagger.Socket) <returnType> {
    // Clone a private repository over SSH, authenticating via the passed-in agent socket.
    dag.Git("git@github.com:org/private-repo.git", dagger.GitOpts{
        SSHAuthSocket: socket,
    })
}

We have only managed to reproduce this in our CI (GitHub Actions on a custom Linux runner hosted in k8s).
Locally, on a MacBook Pro M2, we cannot reproduce it.

Does this make sense to any of you?

sour juniper
rustic bear
sour juniper
#

Thanks. We are reusing the k8s node and its disk so this might be it.

rustic bear
sour juniper
#

Will do. Let me get back to you tomorrow when I have some data.

sour juniper
#

We unfortunately still see the same error. I have tried running v0.18.16 and also v0.18.17.

I can confirm that the engine was provisioned on a new k8s node with an attached NVMe disk.

This is the log we see in the engine:

time="2025-09-05T06:54:55Z" level=error msg="failed to serve request" error="get or init client: initialize client: failed to add client resources from ID: failed to add socket from source client myy3re2s7wt8lkjj9ukm6xmk7: socket xxh3:1531e81c51d6de01 not found in other store" method=POST path=/query
rustic bear
#

It seems to be related to calls to the dag.Git function where the git url is a private repository:

were you able to isolate it to this call and reproduce the error consistently in your CI?

sour juniper
#

We are having some trouble isolating it and are working on that. I will hopefully have something useful tomorrow.

rustic bear
radiant gust
#

Picking up from Martin here. We have not been successful in reproducing it locally. However, we have gained one insight: the problem surfaces when jobs are executed concurrently. In our real-world setup there is no problem as long as only one job executes at a time. That is rarely the case, so we see ~90% of pipelines fail.

I have created this minimal example which, when run on our k8s-hosted engine and GitHub Actions runners, fails consistently on v0.18.13 and upwards.

I realize it is difficult when it cannot be reproduced locally, but I hope this is somewhat useful for you.

#

This corresponds to our module doing some validations.

package main

import (
    "context"
    "dagger/reproduce/internal/dagger"
)

type Reproduce struct {
    Container *dagger.Container
}

func New(
    ctx context.Context,
) *Reproduce {

    return &Reproduce{
        Container: dag.Container().From("alpine:latest"),
    }
}

func (s *Reproduce) Reproduce(
    ctx context.Context,
    dir *dagger.Directory,
) (*dagger.Container, error) {
    ctr := s.Container.
        WithDirectory("/dir", dir).
        WithExec([]string{"ls"})

    return ctr, nil
}
#

This module checks out something dynamic that the above module will use in validation (in our case semgrep rules).

package main

import (
    "context"
    "reproduce/tests/internal/dagger"
)

type Tests struct {
}

func (t *Tests) ReproduceSocketError(ctx context.Context, socket *dagger.Socket) error {
    repo := dag.Git("git@github.com:dagger/dagger.git", dagger.GitOpts{
        SSHAuthSocket: socket,
    })
    root := repo.Branch("main").Tree(dagger.GitRefTreeOpts{
        DiscardGitDir: true,
    })

    _, err := dag.Reproduce().
        Reproduce(root).Sync(ctx)

    return err
}
#

This is the GitHub Actions workflow that reproduces it by executing two builds concurrently.

name: dagger
on:
  push:
    branches:
      - "**"
  pull_request:
  workflow_dispatch:

concurrency:
  group: ${{ github.event_name }}-${{ github.ref }}

jobs:
  dagger:
    name: reproduce-${{ matrix.run }}
    runs-on:
      group: selfhosted-hosted-runner-in-k8s
    strategy:
      matrix:
        run: [1, 2]
    steps:
      - uses: actions/checkout@v4
        name: Checkout
      - uses: private-action/setup-ssh # This starts an ssh-agent and adds the SSH key to the agent and exports SSH_AUTH_SOCK
        with:
          key: ${{ secrets.SSH_PRIVATE_KEY }}
      - name: Test reproduce run ${{ matrix.run }}
        uses: dagger/dagger-for-github@8.0.0
        with:
          # renovate: datasource=github-releases depName=dagger/dagger versioning=semver
          version: 0.18.16
          engine-stop: false
          verb: call
          module: ./tests/reproduce
          args: reproduce-socket-error --socket="$(echo $SSH_AUTH_SOCK)"
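
For anyone trying this outside of GitHub Actions, the two concurrent matrix jobs could be roughly approximated with a small Go launcher like the one below. This is only a sketch under assumptions: it assumes the `dagger` CLI is on the PATH, that each socket path passed on the command line belongs to a separately started ssh-agent, and that running both calls against one engine is enough to hit the same code path, which is not confirmed. `callArgs` and `runConcurrently` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

// callArgs builds the arguments for one `dagger call`, mirroring the
// module and args used in the workflow above.
func callArgs(sockPath string) []string {
	return []string{
		"call", "-m", "./tests/reproduce",
		"reproduce-socket-error", "--socket=" + sockPath,
	}
}

// runConcurrently starts one `dagger call` per socket path at the same
// time and returns each run's error (nil on success), in input order.
func runConcurrently(sockPaths []string) []error {
	errs := make([]error, len(sockPaths))
	var wg sync.WaitGroup
	for i, p := range sockPaths {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			cmd := exec.Command("dagger", callArgs(p)...)
			cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
			errs[i] = cmd.Run()
		}(i, p)
	}
	wg.Wait()
	return errs
}

func main() {
	// Usage (assumed): go run . /path/to/agent1.sock /path/to/agent2.sock
	failed := false
	for i, err := range runConcurrently(os.Args[1:]) {
		if err != nil {
			failed = true
			fmt.Fprintf(os.Stderr, "run %d failed: %v\n", i+1, err)
		}
	}
	if failed {
		os.Exit(1)
	}
}
```

Each run gets its own agent socket path here, which matches the CI setup where every job starts its own ssh-agent.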
rustic bear
radiant gust
#

Hi @rustic bear,

Is there somewhere I can follow along with the progress on this? Is it better if I create an issue in github?

rustic bear
#

it's next on my issue queue 🙏

radiant gust
#

Sounds great 🙏

rustic bear
#

ok, I was able to repro locally

#

I have a rough idea where this might be coming from

rustic bear
#

@radiant gust @sour juniper we found what's happening here. A combination of operation and digest caching introduces a very particular edge case with Unix sockets in this scenario. The fix is not trivial, since it involves changing some internals of how this caching system works.

Having said that, there is a solid stopgap you can use to prevent this: make a small change in your private-action/setup-ssh action so that it always creates the SSH_AUTH_SOCK socket at the same path instead of picking a random one, as it does by default. If you start the agent with ssh-agent -a /tmp/dagger_pipeline.sock or something similar, that should fix the concurrency issues you're seeing.
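
A minimal sketch of that stopgap, written in Go to match the modules in this thread. It assumes `ssh-agent` is on the PATH and that nothing else already occupies the socket path; the path is just the example from the message above, and `agentCommand` is a hypothetical helper name. How the private setup action actually starts its agent is not known here.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// fixedSockPath is the example path from the message above; any stable
// path should do, as long as every job uses the same one.
const fixedSockPath = "/tmp/dagger_pipeline.sock"

// agentCommand builds an ssh-agent invocation bound to sockPath via -a,
// instead of letting ssh-agent pick a random path under a temp directory.
func agentCommand(sockPath string) *exec.Cmd {
	return exec.Command("ssh-agent", "-a", sockPath)
}

func main() {
	// ssh-agent forks into the background and prints shell commands that
	// set SSH_AUTH_SOCK and SSH_AGENT_PID; here they are simply echoed.
	out, err := agentCommand(fixedSockPath).Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "starting ssh-agent failed:", err)
		os.Exit(1)
	}
	fmt.Print(string(out))
}
```

The key would still need to be added with ssh-add, and SSH_AUTH_SOCK exported to later workflow steps, the same way the action does today.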

#

I'm opening an issue in the Dagger repo to track this one. cc @plush hawk

radiant gust
#

Thank you very much. This sounds like a very reasonable workaround. Will test it asap!

radiant gust
#

I can confirm that the workaround works!