#0.20.6 & constructor args
@heady haven note that if 0.20.5 didn't have the issue, it's a much narrower set of possible causes (workspace-plumbing was already merged in 0.20.5)
cc @daring sapphire @ebon brook
I swore I reproduced this locally on my box, but when I retried it things worked. It's possible I repro'd wrong, but the CI running in-cluster definitely failed. I've been running around and fighting flux, so I haven't verified in-cluster yet
@pallid matrix yeah if you can provide a repro that would help us tons! I'll try it myself and see if i can repro
Ok, I have repro'd in our cluster. I'm not using --help in cluster though. I'm trying to run the command dagger call <args> all and the initialization args that used to work in 20.5 aren't recognized in 20.6
Could this be incompatibility between client 20.5 and engine 20.6? That is a difference in cluster vs my local.
could be. Just verified constructor arguments work with both v0.20.6 client and server. Checking with v0.20.5 client now
just to double check you're calling your function like dagger --constructor-arg value call $myfn correct?
yep, seems like it breaks with client v0.20.5 and server v0.20.6
I'd upgrade since both v0.20.5 and v0.20.4 were bumpy releases @pallid matrix
they had a few regressions from the workspaces work which were fixed in v0.20.6
@pallid matrix Do you mind sharing the actual names of the args? Wondering if there's a naming conflict maybe
It doesn't seem to. My arg is called --api-src
(sorry I meant @pallid matrix )
Mine was like --pipeline-repositry
in any case I was able to repro @gentle cradle
bumping the client to match v0.20.6 in the engine fixes it
20.4 and 20.5 were bumpy releases and we accidentally introduced some regressions there
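Since the breakage above came down to a v0.20.5 client talking to a v0.20.6 engine, a CI guard that refuses to run on a mismatch might catch this earlier. A minimal sketch follows. The function names are hypothetical, and how you obtain the two version strings (pinned variables in your pipeline, for example) is left as an assumption.

```shell
#!/bin/sh
# Hypothetical helpers: compare a client version and an engine version string.

# Strip an optional leading "v" so "v0.20.6" and "0.20.6" compare equal.
normalize_version() {
  printf '%s\n' "${1#v}"
}

# Succeed only when the client and engine versions are identical.
versions_match() {
  client="$(normalize_version "$1")"
  engine="$(normalize_version "$2")"
  if [ "$client" != "$engine" ]; then
    echo "version mismatch: client $client, engine $engine" >&2
    return 1
  fi
}
```

In a pipeline this could be called as versions_match "$CLIENT_VERSION" "$ENGINE_VERSION" || exit 1, where both variables are placeholders for wherever your versions are pinned.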
will 20.6 client work with a 20.6 engine where a module has asked for like 20.3 functionality?
yes, we do have backwards compatibility
you should be able to verify it locally though
Ah, good point. It has worked locally with older engine defs in my local env.
I think I'm seeing a pretty massive performance degradation in 0.20.6. I have a set of modules with tests, and a CI job that runs through all tests in a kube cluster with dagger-engine deployed (all tests use that common deployed engine). In 0.20.5, I could run through the entire suite in about 30m if cache was stale, <10m if cache existed. Sometimes as fast as 3-4m. Since upgrading to 0.20.6, I can't get a run through the suite at all, as it takes >1 hr and times out in our CI. Worse, I would expect to see some benefit from caching if I re-run the job, but I don't seem to, as even a re-run takes >1 hr and fails.
Am I missing something in config related to cache or performance changes in helm charts maybe?
cc @woeful stream and @past stirrup
Can you share the command used to trigger the degraded test run?
(Not for repro, but looking for clues as to which codepaths are activated, that might be responsible for the regression)
I have a test script that does 2 things:
pushd module/tests
dagger install --progress=report ..
dagger call --progress=dots all
popd
And iterates over all our modules. What I have noticed is that when I run this locally, the dagger.json and go.[mod|sum] of every module's tests are getting updated (I noticed changes in my local git), whereas before that did not happen. Currently, most modules have a dagger version of 0.20.3 in their dagger.json.
It feels like more is occurring in this process than used to, given the changes I'm noticing in local files.
I split up our CI run into multiple separate jobs, 1 for each module, and a single module's test suite is taking between 7-15 minutes. Even modules where the testing complexity is low.
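One way to confirm which step rewrites dagger.json and go.[mod|sum] is to checksum those files before and after running it. This is only a sketch. The helper names snapshot_deps and deps_unchanged_by are hypothetical, and it assumes the three files sit directly in the directory being checked.

```shell
#!/bin/sh
# Print one checksum line per dependency file that exists in the directory.
snapshot_deps() {
  dir="$1"
  for f in dagger.json go.mod go.sum; do
    if [ -f "$dir/$f" ]; then
      # cksum reads stdin here, so its output carries no path.
      printf '%s %s\n' "$f" "$(cksum < "$dir/$f")"
    fi
  done
}

# Run a command inside the directory and report whether it touched the files.
# Returns 0 if nothing changed, 1 if something changed, 2 if the command failed.
deps_unchanged_by() {
  dir="$1"
  shift
  before="$(snapshot_deps "$dir")"
  ( cd "$dir" && "$@" ) || return 2
  after="$(snapshot_deps "$dir")"
  [ "$before" = "$after" ]
}
```

For the script in this thread that would look like deps_unchanged_by module/tests dagger install --progress=report .. followed by a check of the exit status.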
@pallid matrix any chance you could share a dagger cloud trace URL? That would help enormously
Unfortunately not. It's all run local
My gut feeling, based on the speed I'm seeing locally, is that the dagger install step is updating modules and dagger.json, and those dep updates prevent cache hits and slow things down overall.
You can still export traces to dagger cloud even when running locally
Oh wait you dagger install on each run?
Yes, for each module's tests to ensure it is running against the latest module version.
So these are tests for dagger modules? If so, wouldn't you want each module's tests to live inside that module's repo?
They do. We have a repo, and each module is in a dir in that repo with its tests:
<module>/tests
The tests are a separate dagger module that uses the module at level ... I've run across code changes in the module (api changes usually) not being recognized in the tests unless I tell dagger to go update the dep (ie the module itself). I'm currently doing that with dagger install ..
This is new also. In my parallel job invocation, I got this failure:
ERROR: Job failed (system failure): pod "dagger/runner-dgmlauop4-project-474-concurrent-2-thv3idzy" is disrupted: reason "EvictionByEvictionAPI", message "Eviction API: evicting"
Never seen that before either.
I'm also seeing really high load on the engine.
If a trace would help, how do I send one to dagger cloud from my local setup?
So these are local deps? module A imports module B in the same repo?
Correct.
dagger login -> once logged in, it happens automatically. You'll see a trace URL. You can also press w from the TUI, or call dagger -w
So these are never pinned, you always get A and B for the same snapshot of the repo. So you absolutely do not need to run dagger install each time
(probably not the root cause of this regression, but you never know)
Generally running dagger install at runtime is a bad idea and indicative of a problem somewhere. In some cases, the problem might be on our side - ie. you actually have no choice but to do it that way as a workaround for a limitation of dagger. Hard to tell without seeing your code
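For reference, a version of the test loop with the install step dropped could look roughly like this. It is a sketch that assumes the <module>/tests layout described in this thread. The function name run_module_tests is hypothetical, and the dagger call invocation is copied from the script shared earlier.

```shell
#!/bin/sh
# Run each module's test suite without re-installing the parent module first.
# Assumes the layout <root>/<module>/tests.
run_module_tests() {
  root="$1"
  failed=""
  for tests_dir in "$root"/*/tests; do
    [ -d "$tests_dir" ] || continue
    echo "==> $tests_dir"
    # The subshell keeps the cd from leaking into the next iteration.
    if ! ( cd "$tests_dir" && dagger call --progress=dots all ); then
      failed="$failed $tests_dir"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed" >&2
    return 1
  fi
}
```

Calling run_module_tests . from the repo root runs every suite and returns non-zero if any of them failed, listing the failing directories on stderr.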
I can take out the install step and see if:
A) it solves the speed issue
B) If no issues arise down the line
If that solves the problem and no issues pop up, I'm happy to chalk this up to PEBKAC.
That is what I thought, but I do recall having issues at one point with changes not making it to the test module. That was local devel though, where I probably had to do a develop/install because I had local generated code.
Hey, if you could put together a self-contained repro, that would greatly help in tracking it down. Something standalone showing the regression between 0.20.5 and 0.20.6
I'll try to see if I can create one. So far, the dagger install step does not seem to be a cause. I am seeing the engine under heavy load and I'm not sure why. It could be that is the issue.
Hrm, I'm no longer sure there is a performance degradation in 0.20.6. I reverted to 0.20.5 and ran the same parallel jobs I ran in 0.20.6, and it does not do well at all. 5/15 pass, with the failures being various timeouts/broken pipes/etc.
We're running dagger as a statefulset behind a kubernetes service with cache on ebs gp3. There are no requests/limits set in the dagger-engine helm chart we apply, and the engine should be on its own node (aside from cluster-required pods like istio). I discovered that even though limits were not set, defaults of requests 100m/256Mi and limits of 500m/1Gi were applied. I bumped those to requests 3000m/5Gi and limits 3500m/10Gi and I saw the benefits of caching again (the 5 jobs that passed before completed almost instantly), but the engine is struggling under the load from the other 10 failed jobs.
The limits increase test was done under 0.20.5 and I still need to do the same with 0.20.6. Is there a resource sizing guide for the dagger engine? I would've thought 3 CPUs would be more performant running 10 jobs in parallel. How do you size the resources for the engine?
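For anyone else checking whether defaults were silently applied to their engine pod, something like the following shows the requests/limits that actually ended up on the pod, plus any LimitRange in the namespace (one possible source of such defaults, not confirmed as the cause here). The function names are hypothetical and the namespace and pod name are placeholders.

```shell
#!/bin/sh
# Print the resource requests/limits actually applied to each container of a pod.
# Both arguments are placeholders for your own namespace and pod name.
show_effective_resources() {
  namespace="$1"
  pod="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}

# List LimitRange objects in the namespace; these can inject default
# requests/limits when the chart itself sets none.
show_limit_ranges() {
  kubectl get limitrange -n "$1" -o yaml
}
```

For example, show_effective_resources my-namespace my-engine-pod-0 prints one line per container with its effective resources.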
Could you try on 0.20.3? My guess is that we might have introduced something on 0.20.4
With the original resource limits (100/500, 256/1) or the expanded limits?
As you want, wouldn't be surprised for it to work with the original resources
0.20.3 is performing much better. I dropped to the original resources (100/500, etc) and 13/15 passed and 2 failed in about 30m runtime. The runtime is looking like it would be similar to when I just had a single job iterating through all the module tests one by one. Once this finishes I'll try again and see if caches hit and failures are retried successfully.
I didn't see the load in the dagger engine pod (measured via uptime) hit as high a number as it did in 0.20.5 or 0.20.6.
We have a theory. If you had a way to send us a before / after trace on dagger cloud, that would be super helpful
I can probably open things up to do that. 0.20.3 and which other version?
You mentioned i just have to do a dagger login to send a trace. I assume i need an account or something?
Yes. If you don't have an account, just call dagger -w; it will send you to the web UI, with the option to set one up if needed (also works by pressing w while inside the TUI)
All traces are private by default, so you can safely share a URL, only your org and Dagger employees with support "super-admin" access will be able to access the trace
Do you want a trace from a single one of the parallel jobs, traces for all the parallel jobs, or a trace of when I ran all of the tests in one sequential run?
@gentle cradle ^^
Anything works as long as they're a slower one and a faster one, and they're the same command
Update: I'm following my educated guess on a possible root cause, while waiting for the traces.
If anyone has a theory on the root cause, or is already working on a fix, please let me know!
I created some traces, but the performance wasn't the same as what I reported. Trying again this morning with the original process
Ok, I've got 2 series of traces, one using 20.6 and another using 20.3. I don't see a way to share the set, only specific traces though.
@gentle cradle ^^
Yes individual trace URLs are the available mechanism for sharing at the moment
0.20.3 traces:
https://dagger.cloud/robert-rati/traces/4a4066cc215c9df08341e6f2dd44238f
https://dagger.cloud/robert-rati/traces/90bec48cd1417e3950572e459e0652aa
https://dagger.cloud/robert-rati/traces/1622f12966f59b3733b913a565e7e71c
https://dagger.cloud/robert-rati/traces/90309e0b697e04817208ec5f4a791bf3
https://dagger.cloud/robert-rati/traces/74bae58fb5e2760ec992878412c04c9d
https://dagger.cloud/robert-rati/traces/c70deaa6dc5453f40b6934e422899c8a
https://dagger.cloud/robert-rati/traces/787fd5898fa7bf80b4536400e9fffe26
https://dagger.cloud/robert-rati/traces/976f15b781ef8dbb163a11f8645fe37c
https://dagger.cloud/robert-rati/traces/28e73aa4462f9dcdcda9d4674dc3cf3f
https://dagger.cloud/robert-rati/traces/0e96fd99511f86ae0749fbc866f37caf
https://dagger.cloud/robert-rati/traces/ace815b8fa43c15de0196a84e51ceec7
https://dagger.cloud/robert-rati/traces/d6d46dba864bc00554dce2f4bfdab847
https://dagger.cloud/robert-rati/traces/16f763150d75fdefcfa0c12882bde393
https://dagger.cloud/robert-rati/traces/965189aca689bd04d92be5a82e9fa33e
https://dagger.cloud/robert-rati/traces/3151480bfaf13038647de7be2b884ce8
0.20.6 traces:
https://dagger.cloud/robert-rati/traces/3cf919aff4a454d3b98cf7890826e751
https://dagger.cloud/robert-rati/traces/8afc6e3bcb6e3362cc14549290bd2cf9
https://dagger.cloud/robert-rati/traces/36c869e3ec28ce3421d3cd4acd1defd6
https://dagger.cloud/robert-rati/traces/0845e62f51772dddbb14ebf89dcba2e0
https://dagger.cloud/robert-rati/traces/e514c45e1cbae79df8ac7ed817ed8e9d
https://dagger.cloud/robert-rati/traces/654f6d6c7b13748d2ab945c35e66b616
https://dagger.cloud/robert-rati/traces/cf3c5f89d4e8e7894eee09eb48475249
https://dagger.cloud/robert-rati/traces/bc2402f7f94d6ffdf9e6c6546c9a3731
https://dagger.cloud/robert-rati/traces/459004d3f127e16b4348152c6214c31f
https://dagger.cloud/robert-rati/traces/fcd3f832ca9a7466b19979bb6ac74f22
https://dagger.cloud/robert-rati/traces/2b5f353fa82651b63ce442cd537c5e38
https://dagger.cloud/robert-rati/traces/3b176b71938f2f0fde249b4c3797c058
https://dagger.cloud/robert-rati/traces/92709738cf467c6bd7c8c7701d17b680
https://dagger.cloud/robert-rati/traces/09eab8e5a90afe3dba415aeb9f4646c2
https://dagger.cloud/robert-rati/traces/d17d288be7aed1729c2cd65da0a09d95
@gentle cradle ^^
Erik has made a PR that could fix one of the root causes: https://github.com/dagger/dagger/pull/13117
@woeful stream Were the traces I provided of any use?
I'll triple check later today or Monday. We've hit some internal things that we dug into, and I'll correlate later. Thanks for providing those!