#Unexpected cache invalidation


river hatch
#

Hi, I’m having trouble understanding how Dagger caching works.

In the snippet below, the only input is my-module/package.json, but the cache is invalidated (and npm install re-run) whenever I modify any file in my-module. Is this expected?

dagger <<EOF
container |
  from node |
  with-workdir /work |
  with-file package.json \$(
    directory |
    with-directory / my-module |
    file package.json
  ) |
  with-exec npm install
EOF

Are there examples of achieving a behavior similar to Docker, where npm dependencies are only downloaded when package*.json changes?

Thanks

cursive zinc
#

Hey @river hatch! When you use with-directory, we are essentially mounting the whole directory, even if you only request package.json later. To make sure the step doesn't re-run, we recommend that you mount only the package manager's files (package.json and package-lock.json), run npm install, and mount your code after that. This is what we do, for example, for our internal Go applications. In your case, you could change your script to:

dagger <<EOF
container |
  from node |
  with-workdir /work |
  with-file package.json my-module/package.json |
  with-exec npm install
EOF

I'm assuming that you are running the dagger shell in a directory where there is a subfolder called my-module
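
A toy model may help picture the behavior described above. The Go sketch below is not Dagger's implementation. It only assumes that a step's cache key is derived from the command plus a digest of its inputs, and the names Dir, Digest, Only, and StepKey are hypothetical. Under that assumption, passing the whole directory makes the key change on any edit, while passing only package.json keeps it stable.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Dir is a toy in-memory directory: path -> file contents (hypothetical type).
type Dir map[string]string

// Digest hashes every path and its contents in sorted order, so a change
// to any file in the directory yields a different digest.
func (d Dir) Digest() string {
	paths := make([]string, 0, len(d))
	for p := range d {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	h := sha256.New()
	for _, p := range paths {
		// Length prefixes keep adjacent fields from running together.
		fmt.Fprintf(h, "%d:%s%d:%s", len(p), p, len(d[p]), d[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Only returns a new Dir holding just the named files that exist in d.
func (d Dir) Only(paths ...string) Dir {
	out := Dir{}
	for _, p := range paths {
		if c, ok := d[p]; ok {
			out[p] = c
		}
	}
	return out
}

// StepKey derives a cache key from a command and the digests of its inputs.
// Assumption: this stands in for whatever key a real engine computes.
func StepKey(command string, inputDigests ...string) string {
	h := sha256.New()
	fmt.Fprintf(h, "%d:%s", len(command), command)
	for _, d := range inputDigests {
		fmt.Fprintf(h, "%d:%s", len(d), d)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := Dir{"package.json": `{"name":"my-module"}`, "src/index.js": "console.log(1)"}
	after := Dir{"package.json": `{"name":"my-module"}`, "src/index.js": "console.log(2)"}

	// Whole directory as input: editing src/index.js changes the key.
	fmt.Println(StepKey("npm install", before.Digest()) ==
		StepKey("npm install", after.Digest())) // false

	// Only package.json as input: the key stays the same.
	fmt.Println(StepKey("npm install", before.Only("package.json").Digest()) ==
		StepKey("npm install", after.Only("package.json").Digest())) // true
}
```

In this model, the fix suggested above corresponds to the second comparison: the install step sees only the package manager's files, so edits elsewhere in my-module leave its key untouched.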

#

Quick example with a hello.out file in the working directory

river hatch
#

Thanks, I was able to cache much more effectively with fine-grained dagger.File/Directory arguments using +defaultPath and +ignore comments.

That said, I'm not sure how to organize my code. When each function declares its inputs using +defaultPath/+ignore comments, it becomes hard to compose them.

For example, I have CheckX() that operates on directory X, CheckY() that operates on directory Y, and CheckZ() that operates on directory Z. I then have a CheckAll() that calls each individual check function. Thus, it must declare a +defaultPath/+ignore combination to select directories X, Y, and Z and then pass subdirectories to the individual Check*() functions. I feel like this code is hard to maintain and duplicates information.

#

An alternative is to specify that in the constructor of the module (example below). This lets functions access the files/directories they operate on, regardless of whether they are called by the user or by another function.

func New(
    // +defaultPath="."
    // +ignore=["*", "!/X/"]
    CheckXSource *dagger.Directory,
    // +defaultPath="."
    // +ignore=["*", "!/Y/"]
    CheckYSource *dagger.Directory,
    // +defaultPath="."
    // +ignore=["*", "!/Z/"]
    CheckZSource *dagger.Directory,
) *Module {
    return &Module{
        checkXSource: CheckXSource,
        checkYSource: CheckYSource,
        checkZSource: CheckZSource,
    }
}

The downsides of this method are that the constructor quickly becomes complex, that the relationship between a function and the files it operates on is less clear and harder to keep consistent (as things are declared far apart), and that all *Sources are populated, whether or not they are used. I feel like this last point affects performance with larger DAGs, but I haven't measured it.

Do I understand correctly? Are there alternative ways to write that kind of code? Is this pattern (fine-grained inputs + composing functions) common, or should I structure things differently? Thanks again for your help.

cursive zinc
#

Great questions!

I personally like the approach of a CheckAll function receiving via parameter the directories it needs to run all the checks:

func (m *Module) CheckAll(
    // +defaultPath="X"
    xSource *dagger.Directory,
    // +defaultPath="Y"
    ySource *dagger.Directory,
    // +defaultPath="Z"
    zSource *dagger.Directory,
) ...

Dagger functions are designed to be fully isolated. This is why they are so reproducible and why "it works locally and in CI" is true. It's hard to achieve that without this kind of isolation. However, it does come at a cost. Because of this isolation, module developers need to do a little bit of extra work to make sure the function accepts the data it needs to run. This meant that the DX of calling those functions was quite verbose, which is where +defaultPath and local defaults (https://github.com/dagger/dagger/pull/11034) come in. They make calling the function as easy as dagger call <function>. But it is still up to module developers to specify what the function needs. In your example, it means that if CheckAll is in charge of running all checks, then it's also in charge of understanding what each of the checks needs.

signal pond
#

@river hatch the solution in your case is probably to make X, Y and Z different modules. Then each module can encapsulate its domain-specific paths and ignore rules, and you can express the dependencies as Dagger module dependencies.

#

cc @cursive zinc

signal pond
#

Then you can have a top-level module which installs X, Y and Z as dependencies:

dagger install ./X
dagger install ./Y
dagger install ./Z

func (m *TopLevelModule) CheckAll() {
  dag.X().Check()
  dag.Y().Check()
  dag.Z().Check()
}

river hatch
#

Thank you both for your answers, they helped me understand the possibilities and tradeoffs better