#Secret not found in source store


undone pecan
#

Getting an error only in Gitlab CI with a remote engine via _EXPERIMENTAL_DAGGER_RUNNER_HOST:

unexpected status 200: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client c1r2j4sfm4tz7bykfgc8ia0q5: secret xxh3:79be3d891a92c7d9 not found in source store
  1. I don't get this error running locally with a local engine.
  2. For some reason one of three jobs ran fine on Gitlab, the other two got this error. Is there a race condition or locking mechanism or something for secrets?
undone pecan
#

This really does seem like the engine will only allow a secret to be used once at a time. Three jobs, first succeeded two failed. Re-ran one of the failed jobs after the first finished and it worked. Re-ran the third during the second, it failed. Re-ran the third after the second finished and it succeeded?!

undone pecan
#

I should add that these three jobs are all being provided the same AWS creds as secrets... So possibly the engine stores them as one secret despite three jobs making requests to the engine?

undone pecan
#

👆 Still hitting this on every pipeline, is this a known issue?

undone pecan
#

It's just hit me at 11:30pm that this is probably caused by the fact I've used a static key for the secret - all three jobs that run in parallel use the same key. If I put something dynamic in that key there's a good chance this just vanishes?

undone pecan
#

That was not the problem. The problem is with secrets provided as parameters to functions (in this case AWS creds) and the way the engine stores/retrieves them.

undone pecan
#

@onyx cradle Unsure if you're all back from your retreat, but you fixed the user default issues so drawing your attention to this one. Perhaps you know who might be best to look at this?

onyx cradle
#

@undone pecan we got back today. Sorry

undone pecan
#

Is anyone able to point me in the right direction here? After user defaults, this is the last bump before Dagger can replace existing infrastructure pipelines

fast gazelle
undone pecan
#

Yep, every pipeline without fail.

#

Same error each time, same secret ID in the failed jobs

#

Haven't noticed it until now because our first Dagger module/pipeline deploys to one env at a time, this one deploys with a matrix job simultaneously

fast gazelle
undone pecan
#

Yep, though I'll be heading out in 50-60 mins

fast gazelle
fast gazelle
#

@undone pecan are you using v0.19.3?

undone pecan
#

Yep

#

Just updated to it today

#
1   : connect
1   : [0.0s] | cloud url=https://dagger.cloud/traces/setup
2   : ┆ starting engine
2   : ┆ starting engine DONE [0.0s]
3   : ┆ connecting to engine
3   : ┆ [0.0s] | 16:15:09 INF connected name=26d1ab60ef73 client-version=v0.19.3 server-version=v0.19.3
fast gazelle
fast gazelle
#

@undone pecan any chance you can check if this still happens in v0.19.4? Haven't had the time to really go deep in this one but I'm just wondering if by any chance 19.4 might have fixed something here 🙏

undone pecan
#

It'll have to be Monday morning. I doubt I could get the MRs approved to update the engine now, it's gone 5pm in Europe. Will get back to you

#

Does 0.19.4 contain any changes to secrets management?

fast gazelle
undone pecan
#

Agreed, will test on Monday morning

undone pecan
#

@fast gazelle Precisely the same error on fetching secrets on 0.19.4

#

Secret IDs:

114 : ┆ accessKey: Address.secret: Secret! = q9woxakv8c7e8f8uewt1psxw9
114 : ┆ secretKey: Address.secret: Secret! = lkv8k10158azuzcatlldu1r3s
114 : ┆ sessionToken: Address.secret: Secret! = jd6wrtw282q7m2a0x28ma7wqs

Error:

failed to return error: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client qlxskvyy47ria9wkjhmtqs6be: secret xxh3:772dbdf7915b4096 not found in source store 
original error: get parent name: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client qlxskvyy47ria9wkjhmtqs6be: secret xxh3:772dbdf7915b4096 not found in source store
Stderr:
unexpected status 200: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client qlxskvyy47ria9wkjhmtqs6be: secret xxh3:772dbdf7915b4096 not found in source store

This ID (772dbdf7915b4096) appears nowhere else in the entire log (with -vvvv), only in that error message

undone pecan
#

In both failed jobs, the secret ID is the same (772dbdf7915b4096) but the source client ID differs: q1fbhwluuvf2cwp4omx6hm9dg and qlxskvyy47ria9wkjhmtqs6be. Both jobs run against the same remotely-hosted engine.
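A small sketch of how that comparison can be done across job logs, rather than by eye. It assumes the error text has the shape quoted above; the function name `clients_by_secret` is made up for illustration, and the second sample line is reconstructed from the IDs quoted in this message, not copied from a log.

```python
import re
from collections import defaultdict

# Matches the tail of the error quoted in this thread.
ERROR_RE = re.compile(
    r"failed to add secret from source client (?P<client>\w+): "
    r"secret (?P<secret>xxh3:[0-9a-f]+) not found in source store"
)


def clients_by_secret(log_lines):
    """Map each missing secret ID to the set of source client IDs that reported it."""
    seen = defaultdict(set)
    for line in log_lines:
        for match in ERROR_RE.finditer(line):
            seen[match.group("secret")].add(match.group("client"))
    return dict(seen)


PREFIX = ("unexpected status 200: get or init client: initialize client: "
          "failed to add client resources from ID: ")
# One error line per failed job (second line reconstructed, see note above).
sample = [
    PREFIX + "failed to add secret from source client qlxskvyy47ria9wkjhmtqs6be: "
             "secret xxh3:772dbdf7915b4096 not found in source store",
    PREFIX + "failed to add secret from source client q1fbhwluuvf2cwp4omx6hm9dg: "
             "secret xxh3:772dbdf7915b4096 not found in source store",
]

result = clients_by_secret(sample)
for secret, clients in result.items():
    print(secret, sorted(clients))
```

One secret ID mapped to several client IDs is the pattern described here: different clients all tripping over the same missing secret.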

Running two jobs locally at the same time against the same engine, the second instance seems to use the cached secrets from the first. Even after dagger core engine local-cache prune there are cached secrets in the second running instance, but both finish successfully. Aside from a split-second difference in start times, the only difference I can see is that the local runs use the same secret values in both instances, whereas the gitlab matrix jobs don't

fast gazelle
undone pecan
#

I've tried a fair bit to repro this locally without success. A minimum repro module is fairly quick to do but I suspect you'll need a Gitlab runner and a matrix job to see the same error?

undone pecan
#

@fast gazelle Any time to progress this? The window to make Dagger part of this process is closing, and we're unable to progress whilst this is an issue

bronze panther
#

Also, do you happen to have a cloud trace where this error happened? could be very helpful if so

bronze panther
#

@undone pecan I tried a couple things with invoking functions in parallel and passing secrets around to see if I could trick anything into a scenario like what you're hitting, but no success so far.

Things that would be helpful

  1. cloud trace; one where the error is hit would be best but even one where it succeeded would help (by showing what the call flow is supposed to be)
  2. if the code in question happens to be open source, links to it
  3. if the above is not possible, a detailed description of the functions that are invoked (including any dependency functions they call), the arguments of those functions, where secrets are passed/stored, etc. Basically as many details as possible to fill in the lack of trace/code
undone pecan
#

I'm quite certain you're going to need a Gitlab runner to recreate this and exactly the setup we're using, I've been unable to reproduce any other way.

Secrets are provided with env://.

The code isn't open source, but I can provide traces to both failed and successful executions.

The short version is >1 job executed at the same time pointed at the same engine. Each job uses (pre-configured) OIDC to get temporary AWS credentials, and each passes those credentials to Dagger with env://. One job succeeds, the others fail. The failed jobs succeed if run one at a time afterwards.

I'll try and produce a minimal reproduction of the error today, and if I can I'll share the module with you, as well as the traces from our actual Gitlab jobs.

undone pecan
#

Ok had some time for this again this afternoon @bronze panther. Job ran with -vvvv so trace logs should contain IDs.

Three jobs run in a gitlab matrix job:

  1. (Success) https://dagger.cloud/mjb/traces/d8515f1e1cec2fdab0b9140118e090db
  2. (Failure) https://dagger.cloud/mjb/traces/a9cbbaef617a8b1daf903de751cb55bf
  3. (Failure) https://dagger.cloud/mjb/traces/ac5dbfeff2799b885c5a62372daaa2fa

If I rerun 2&3 one at a time they will succeed:

  1. (Success) https://dagger.cloud/mjb/traces/715460be283f108df1c204e0e322017b
  2. (Success) https://dagger.cloud/mjb/traces/360e04eccce4846f3c795a7c165afb8d

No changes are made before re-running. They just cannot simultaneously succeed.

I can send you full plaintext gitlab job log files (also with -vvvv) if you need them, DM me on Discord.

#

For your last point, there are four secrets: three AWS cred vars (env://) and one git cred file (file://). I don't know which is problematic, I can get rid of some operations and remove the git creds file and see if I still hit the same error.

#

Ok, tried that: removed the file:// parameter and the mounted secret, and still the same outcome, one success and two failures. It's seemingly random which one succeeds; this time it was a different one than last time.

#

Same error though:

Stdout:
failed to return error: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client 2fndwhv60193q0fdrh28nl17a: secret xxh3:c4c02a02c91008e5 not found in source store 
original error: get parent name: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client 2fndwhv60193q0fdrh28nl17a: secret xxh3:c4c02a02c91008e5 not found in source store
Stderr:
unexpected status 200: get or init client: initialize client: failed to add client resources from ID: failed to add secret from source client 2fndwhv60193q0fdrh28nl17a: secret xxh3:c4c02a02c91008e5 not found in source store
fast gazelle
# undone pecan I'm quite certain you're going to need a Gitlab runner to recreate this and exac...

I'm quite certain you're going to need a Gitlab runner to recreate this and exactly the setup we're using, I've been unable to reproduce any other way.

this is the strangest part. It seems as if this only manifests when the clients are on different hosts, since I've also tried reproducing it locally with parallel calls without success. My next step was going to be setting up github runners with a shared engine, but that requires a bit more work since I need to provision an engine in a VM and then connect all the pieces together

undone pecan
#

Yeah I think that's going to be necessary, but I've only encountered this with exactly that setup - multiple hosts running simultaneously to a shared engine. I'm also unable to rule out something funky with how we use OIDC to get AWS creds because I don't know enough about how the engine stores them, but is it possible they're overwriting from each job so the only success is the last job to overwrite the secret (because it has the right secret ID, the one it created)?

#

That each job succeeds fine one at a time makes me think something odd is happening where secrets are overwriting somewhere

bronze panther
# undone pecan Ok had some time for this again this afternoon @bronze panther. Job ran wi...

Thank you! Figured out one first important point: the secret not being found is actually not any of the AWS secrets, it's the secret that's intrinsically created by dagger for your gitlab auth token when we pull https://gitlab.com/plg-tech/plg/sandbox/poc/infra-deployer.git (as automatically read from your host config)... (this one in the trace)

I bet that helps explain why this was only happening on your CI infra and not locally. I will work on a repro and fix

bronze panther
# fast gazelle is it this span?

I realized it's actually not quite the module git ref, it's the git ref for the directory passed as an argument to .withConfig. The secret created in both cases is the same in a single run but different across different runs (guessing each machine has a different auth token for the same repo).

The problem is that even though the auth tokens differ, the git repo is the same. So the second/third client get a cache hit from the first, but then eventually encounter a reference to the auth token from the other machine that they don't know about and error out.

e.g.

#

Just thinking through cleanest fix possible short term

#

It honestly may be to just ignore the error in this particular case, but gotta make sure that doesn't make it possible to leak secrets. I don't think it would because by the time this error is hit, you already have access to the git repo based on its content hash. And if you know its content hash you know its content, which is as good as having the secret used to pull it
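A toy model of the scenario described above, to make the failure and the "ignore the error" idea concrete. This is not Dagger's implementation: `ToyEngine`, `fetch_repo`, `optional_secret` and the job and token names are all invented for illustration. The only assumptions taken from the thread are that the cache is keyed by the repo rather than by the auth token, and that the cached result carries a reference to the token that was used to fetch it. It does not try to model why sequential reruns succeed.

```python
class SecretNotFound(LookupError):
    """Raised when a client is handed a reference to a secret it never registered."""


class ToyEngine:
    """Toy stand-in for a shared engine with per-client secret stores."""

    def __init__(self):
        self.stores = {}  # client id -> {secret id: value}
        self.cache = {}   # repo URL -> cached result (keyed by repo only)

    def connect(self, client, secrets):
        self.stores[client] = dict(secrets)

    def fetch_repo(self, client, repo, token_id, optional_secret=False):
        # The first caller populates the cache; later callers get a hit,
        # including the reference to the first caller's token.
        result = self.cache.setdefault(repo, {"repo": repo, "secret_ref": token_id})
        ref = result["secret_ref"]
        if ref in self.stores[client]:
            return result
        if optional_secret:
            # Tolerate the unknown token: hand back the content without the reference.
            return {"repo": repo, "secret_ref": None}
        raise SecretNotFound(f"secret {ref} not found in store of client {client}")


engine = ToyEngine()
engine.connect("job-a", {"token-a": "aaa"})  # each machine has its own token
engine.connect("job-b", {"token-b": "bbb"})

first = engine.fetch_repo("job-a", "git.example/repo", "token-a")  # cache miss
try:
    engine.fetch_repo("job-b", "git.example/repo", "token-b")      # cache hit
    strict_error = None
except SecretNotFound as err:
    strict_error = str(err)

lenient = engine.fetch_repo("job-b", "git.example/repo", "token-b", optional_secret=True)
print(strict_error)
print(lenient)
```

In the strict path the second job fails on a token it never supplied, even though its own token is valid. The lenient path returns the same cached content and drops the foreign reference, which is the trade-off being weighed in this message.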

fast gazelle
bronze panther
#

In theory anyways, there's nuance to it

#

i.e. I suppose if a user had a token that gave god access to literally everything and then used it to pull a repo that's empty, you could guess the content hash and possibly get the token somehow? Although I can't even think through how the last step would be possible

fast gazelle
bronze panther
fast gazelle
bronze panther
fast gazelle
bronze panther
#

we have support for marking secrets as optional during that, which may be what I go with here

#

just trying to set up a repro in our integ tests right now by adding a second auth token to our private gitlab repo

bronze panther
undone pecan
#

Well I've just caught up, extremely glad to hear you've found an issue!

bronze panther
undone pecan
bronze panther
undone pecan
#

@bronze panther I think your fix has resolved one problem and created another. The jobs no longer error out immediately, but they run indefinitely once they reach the point where they try to use the git creds to fetch from private repositories

undone pecan
#

Only an issue in Gitlab, not locally, so it's likely still something to do with directory cache hits and the secret?

fast gazelle
#

@undone pecan mind sharing a trace please?

undone pecan
#

Will do shortly

undone pecan
#

May have spoken too soon, seems to be working now... Potentially another cache issue, just retrying some jobs

#

Running another test now

#

Yep, happened again:

│ Error: Failed to install provider
│ 
│ Error while installing hashicorp/aws v6.15.0: open
│ .terraform/providers/registry.terraform.io/hashicorp/aws/6.15.0/linux_arm64/terraform-provider-aws_v6.15.0_x5:
│ text file busy

Trace for 👆: https://dagger.cloud/mjb/traces/284454c51b5db6d2071a086c0084bd6d

#

This might be a problem with how I've set up the Terraform cache rather than a Dagger issue. I'm caching the local .terraform directory rather than a global provider cache, because the local cache also covers modules

fast gazelle
undone pecan
#

I incorrectly assumed that would be the default

#

I'll test

undone pecan
fast gazelle
undone pecan
#

Is the only consequence of "private" that I'll use more disk space?

#

Each Dagger job will still cache hit/miss as normal, just with its own separate cache?

fast gazelle
#

but it won't always create a new cache volume for each client, if it already has an existing one, it'll re-use it

undone pecan
#

I'll try that first

#

Hmmm, going to test locked first

fast gazelle
#

if the cache volume is usually not super large and the operations against it are generally lightweight, I'd personally start with PRIVATE
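A rough timeline model of the trade-off between the sharing modes being discussed, assuming the usual reading of the three names: shared means one volume used concurrently, private means a separate volume per concurrent client, locked means one volume with clients taking turns. It is a simplification, not how the engine schedules anything; the `schedule` and `max_overlap` helpers and the job names are made up, and it ignores the point above that an existing private volume gets re-used.

```python
from enum import Enum


class Sharing(Enum):
    SHARED = "shared"    # one volume, clients may use it at the same time
    PRIVATE = "private"  # each concurrent client gets its own volume
    LOCKED = "locked"    # one volume, clients take turns


def schedule(clients, mode, duration=1):
    """Return {client: (volume, start, end)} for clients that all arrive at t=0."""
    plan = {}
    for i, client in enumerate(clients):
        if mode is Sharing.SHARED:
            plan[client] = ("vol-0", 0, duration)
        elif mode is Sharing.PRIVATE:
            plan[client] = (f"vol-{i}", 0, duration)
        else:  # LOCKED: same volume, serialized in arrival order
            plan[client] = ("vol-0", i * duration, (i + 1) * duration)
    return plan


def max_overlap(plan):
    """Largest number of clients using the same volume at the same instant."""
    worst = 0
    for vol, start, _ in plan.values():
        users = sum(1 for v, s, e in plan.values() if v == vol and s <= start < e)
        worst = max(worst, users)
    return worst


jobs = ["dev", "staging", "prod"]  # hypothetical matrix jobs
plans = {mode.name: schedule(jobs, mode) for mode in Sharing}
overlaps = {name: max_overlap(plan) for name, plan in plans.items()}
volumes = {name: len({v for v, _, _ in plan.values()}) for name, plan in plans.items()}
finish = {name: max(e for _, _, e in plan.values()) for name, plan in plans.items()}

print("max concurrent users per volume:", overlaps)
print("volumes used:", volumes)
print("time until last job finishes:", finish)
```

In this model only the shared mode ever has more than one job touching the same volume at once, which is the situation where an error like "text file busy" can show up. Private pays for that with extra volumes (disk space), locked pays with wall-clock time.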

undone pecan
#

It's providers/modules for Terraform. About 300 MB, but a lot of small files from the modules

#

I was hoping Locked would be performant and not cause errors

#

I was right, all three succeeded

#

Thanks for the suggestion! Hadn't seen that before

#

I'll do some more testing Monday but this may finally be over the line