#0.20.6 & constructor args


gentle cradle
#

🧵

#

@heady haven note that if 0.20.5 didn't have the issue, it's a much narrower set of possible causes (workspace-plumbing was already merged in 0.20.5)

#

cc @daring sapphire @ebon brook

pallid matrix
#

I swore I reproduced this locally on my box, but when I retried it things worked. It's possible I repro'd wrong, but the CI running in cluster definitely didn't work. I've been running around and fighting flux, so I haven't verified in-cluster yet.

daring sapphire
#

@pallid matrix yeah if you can provide a repro that would help us tons! I'll try it myself and see if I can repro

ebon brook
#

I can't repro in v0.20.6 FWIW

#

I see the constructor args in --help

pallid matrix
#

Ok, I have repro'd in our cluster. I'm not using --help in cluster though. I'm trying to run the command dagger call <args> all and the initialization args that used to run in 20.5 aren't recognized in 20.6

#

Could this be incompatibility between client 20.5 and engine 20.6? That is a difference in cluster vs my local.

ebon brook
#

just to double check you're calling your function like dagger --constructor-arg value call $myfn correct?

ebon brook
#

I'd upgrade since both v0.20.5 and v0.20.4 were bumpy releases @pallid matrix

#

they had a few regressions from the workspaces work which were fixed in v0.20.6

gentle cradle
#

@pallid matrix Do you mind sharing the actual names of the args? Wondering if there's a naming conflict maybe 🤔

ebon brook
gentle cradle
#

(sorry I meant @pallid matrix )

pallid matrix
#

Mine was like --pipeline-repositry

ebon brook
#

in any case I was able to repro @gentle cradle

#

bumping the client to match v0.20.6 in the engine fixes it
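
A quick way to spot this kind of client/engine skew before running a pipeline is to compare the two version strings. The sketch below is illustrative only: `normalize_version`, `versions_match`, and `check_versions` are hypothetical helper names, and how the two version strings are obtained (for example from `dagger version` on the client and from the engine image tag in the cluster) is left as an assumption.

```shell
#!/usr/bin/env bash
# Sketch: compare a client version with an engine version.
# All function names here are hypothetical helpers, not part of dagger.

normalize_version() {
  # Strip an optional leading "v" so "v0.20.6" and "0.20.6" compare equal.
  printf '%s\n' "${1#v}"
}

versions_match() {
  # Succeed (exit 0) only when both versions are identical after normalizing.
  [ "$(normalize_version "$1")" = "$(normalize_version "$2")" ]
}

check_versions() {
  # Print a one-line verdict for a client/engine pair.
  if versions_match "$1" "$2"; then
    echo "ok: client and engine are both $(normalize_version "$1")"
  else
    echo "mismatch: client $(normalize_version "$1"), engine $(normalize_version "$2")"
  fi
}
```

For the situation in this thread, `check_versions 0.20.5 v0.20.6` prints `mismatch: client 0.20.5, engine 0.20.6`.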

#

20.4 and 20.5 were bumpy releases and we accidentally introduced some regressions there 😬

pallid matrix
#

will 20.6 client work with a 20.6 engine where a module has asked for like 20.3 functionality?

ebon brook
#

you should be able to verify it locally though

pallid matrix
#

Ah, good point. It has worked locally with older engine defs in my local env.

pallid matrix
#

I think I'm seeing a pretty massive performance degradation in 0.20.6. I have a set of modules with tests, and a CI job that runs through all tests in a kube cluster with dagger-engine deployed (all tests use that common deployed engine). In 0.20.5, I could run through the entire suite in about 30m if cache was stale, <10m if cache existed. Sometimes as fast as 3-4m. Since upgrading to 0.20.6, I can't get a run through the suite at all as it takes >1 hr and times out in our CI. Worse, I would expect to see some benefits of caching if I re-run the job, but I don't seem to, as even a re-run takes >1 hr and fails.

Am I missing something in config related to cache or performance changes in helm charts maybe?

ebon brook
gentle cradle
#

(Not for repro, but looking for clues as to which codepaths are activated, that might be responsible for the regression)

pallid matrix
#

I have a test script that does 2 things:

pushd module/tests
dagger install --progress=report ..
dagger call --progress=dots all
popd

And iterates over all our modules. What I have noticed is that when I run this locally, all of the module tests' dagger.json and go.[mod|sum] files are getting updated (I noticed changes in my local git), whereas before that did not happen. Currently, most modules have a dagger version of 0.20.3 in their dagger.json.

It feels like more is occurring in this process than used to, given the changes I'm noticing in local files.
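
For readers following along, the loop described above might look roughly like the sketch below. The `<module>/tests` layout and the two dagger commands are taken from this thread; the actual script was not shared, so the function names and the `DAGGER` override variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a per-module test loop. Function names and the DAGGER
# override are hypothetical; only the two dagger commands come from the thread.
DAGGER="${DAGGER:-dagger}"

run_module_tests() {
  # Run inside a subshell so the caller's working directory is untouched
  # (this stands in for the pushd/popd pair).
  (
    cd "$1" || exit 1
    "$DAGGER" install --progress=report .. &&
      "$DAGGER" call --progress=dots all
  )
}

run_all() {
  # Visit every <module>/tests directory under the given root.
  # Returns non-zero if any suite failed, but still runs the rest.
  local root="$1" dir failed=0
  for dir in "$root"/*/tests; do
    [ -d "$dir" ] || continue
    run_module_tests "$dir" || failed=1
  done
  return "$failed"
}
```

Calling `run_all .` from the repository root would then run each suite in turn.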

pallid matrix
#

I split up our CI run into multiple separate jobs, 1 for each module, and a single module's test suite is taking between 7-15 minutes. Even modules where the testing complexity is low.

gentle cradle
#

@pallid matrix any chance you could share a dagger cloud trace URL? That would help enormously

pallid matrix
#

Unfortunately not. It's all run local

#

My gut, based upon what I'm seeing speed-wise locally, is that the dagger install step is updating modules and dagger.json, and those dep updates prevent cache hits and overall slow things down.
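
One way to test that hunch is to ask git whether a step left any module files changed. This is a minimal sketch that assumes the modules live in a git checkout; `changed_module_files` is a hypothetical helper name, not something from the thread.

```shell
#!/usr/bin/env bash
# Sketch: report module files that a command left changed in a git checkout.
# changed_module_files is a hypothetical helper name.

changed_module_files() {
  # List dagger.json / go.mod / go.sum entries that git reports as changed
  # anywhere under the given repository path. Empty output means no
  # reported changes to those files.
  git -C "$1" status --porcelain -- '*dagger.json' '*go.mod' '*go.sum'
}
```

Running `changed_module_files .` right after the install step, and again after the call step, would show which of the two is rewriting the files.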

gentle cradle
#

Oh wait you dagger install on each run?

pallid matrix
#

Yes, for each module's tests to ensure it is running against the latest module version.

gentle cradle
#

So these are tests for dagger modules? If so, wouldn't you want each module's tests to live inside that module's repo?

pallid matrix
#

They do. We have a repo, and each module is in a dir in that repo with its tests:

<module>/tests

The tests are a separate dagger module that uses the module one level up (..). I've run across code changes in the module (API changes usually) not being recognized in the tests unless I tell dagger to go update the dep (i.e. the module itself). I'm currently doing that with dagger install ..

#

This is new also. In my parallel job invocation, I got this failure:

ERROR: Job failed (system failure): pod "dagger/runner-dgmlauop4-project-474-concurrent-2-thv3idzy" is disrupted: reason "EvictionByEvictionAPI", message "Eviction API: evicting"

Never seen that before either.

#

I'm also seeing really high load on the engine.

#

If a trace would help, how do I send one to dagger cloud from my local setup?

gentle cradle
gentle cradle
gentle cradle
# pallid matrix Correct.

So these are never pinned, you always get A and B for the same snapshot of the repo. So you absolutely do not need to run dagger install each time

#

(probably not the root cause of this regression, but you never know)

#

Generally running dagger install at runtime is a bad idea and indicative of a problem somewhere. In some cases, the problem might be on our side, i.e. you actually have no choice but to do it that way as a workaround for a limitation of dagger. Hard to tell without seeing your code

pallid matrix
#

I can take out the install step and see if:
A) it solves the speed issue
B) If no issues arise down the line

If that solves the problem and no issues pop up, I'm happy to chalk this up to PEBKAC. 😄

pallid matrix
woeful stream
pallid matrix
#

I'll try to see if I can create one. So far, the dagger install step does not seem to be a cause. I am seeing the engine under heavy load and I'm not sure why. It could be that is the issue.

pallid matrix
#

Hrm, I'm no longer sure there is a performance degradation in 0.20.6. I reverted to 0.20.5 and ran the same parallel jobs I ran in 0.20.6, and it does not do well at all. 5/15 pass, with the failures being various timeouts/broken pipes/etc.

We're running dagger as a statefulset behind a kubernetes service with cache on EBS gp3. There are no requests/limits on the dagger-engine helm chart we apply, and the engine should be on its own node (aside from cluster-required pods like istio). I discovered that even though limits were not set, defaults of requests 100m/256Mi and limits of 500m/1Gi were applied. I bumped those to requests of 3000m/5Gi and limits of 3500m/10Gi and I saw the benefits of caching again (the 5 prior jobs completed almost instantly), but the engine is struggling under the load from the past 10 failed jobs.

The limits increase test was done under 0.20.5 and I still need to do the same with 0.20.6. Is there a resource sizing guide for the dagger engine? I would've thought 3 CPUs would be more performant running 10 jobs in parallel. How do you size the resources for the engine?
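
For anyone checking the same thing, the requests and limits a pod actually ended up with can be read back from the cluster, and a LimitRange in the namespace is one common source of defaults that appear when a chart sets none (whether that was the source here is not stated in the thread). The sketch below assumes kubectl access; the function names, the `KUBECTL` override, and the namespace and pod names in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: inspect effective resources on an engine pod.
# Function names and the KUBECTL override are hypothetical.
KUBECTL="${KUBECTL:-kubectl}"

engine_resources() {
  # Print the requests/limits of every container in pod $2, namespace $1.
  "$KUBECTL" -n "$1" get pod "$2" -o jsonpath='{.spec.containers[*].resources}'
}

namespace_defaults() {
  # Show any LimitRange objects in namespace $1, which can inject
  # default requests and limits into pods that specify none.
  "$KUBECTL" -n "$1" get limitrange -o yaml
}
```

Usage would be along the lines of `engine_resources <namespace> <engine-pod-name>` followed by `namespace_defaults <namespace>`.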

woeful stream
pallid matrix
#

With the original resource limits (100/500, 256/1) or the expanded limits?

woeful stream
pallid matrix
#

0.20.3 is performing much better. I dropped to the original resources (100/500, etc) and 13/15 passed and 2 failed in about 30m runtime. The runtime is looking like it would be similar to when I just had a single job iterating through all the module tests one by one. Once this finishes I'll try again and see if caches hit and failures are retried successfully.

I didn't see the load in the dagger engine pod (measured via uptime) hit as high a number as it did in 0.20.5 or 0.20.6.

gentle cradle
pallid matrix
#

I can probably open things up to do that. 0.20.3 and which other version?

You mentioned I just have to do a dagger login to send a trace. I assume I need an account or something?

gentle cradle
#

All traces are private by default, so you can safely share a URL, only your org and Dagger employees with support "super-admin" access will be able to access the trace

pallid matrix
#

Do you want a trace from a single one of the parallel jobs, traces for all the parallel jobs, or a trace of when I ran all of the tests sequentially in one job?

pallid matrix
#

@gentle cradle ^^

gentle cradle
gentle cradle
#

Update: I'm following my educated guess on a possible root cause, while waiting for the traces.

If anyone has a theory on the root cause, or is already working on a fix, please let me know!

pallid matrix
#

I created some traces, but the performance wasn't the same as what I reported. Trying again this morning with the original process

pallid matrix
#

Ok, I've got 2 series of traces, one using 20.6 and another using 20.3. I don't see a way to share the set, only specific traces though.

pallid matrix
#

@gentle cradle ^^

gentle cradle
pallid matrix
#
#
#

@gentle cradle ^^

woeful stream
pallid matrix
#

@woeful stream Were the traces I provided of any use?

woeful stream