# Data science use case

vocal glade · 2022-09-23T22:05:43.261Z

vocal glade Sep 23, 2022, 10:05 PM

Hi there! I will start a thread here to answer your questions one by one

vocal glade Sep 23, 2022, 10:51 PM

How do I manage concurrency in CI with Dagger? For example, I'm using CircleCI with separate jobs for linting, building the docs, and testing the code. They all run in parallel in different containers, each with its own resource pool. How would that fare if I bring Dagger in? Could I still benefit from multiple resource pools?

At the moment, each Dagger pipeline is executed on a single host. Dagger automatically detects which steps can run concurrently, and runs them that way without any extra configuration. But by default all steps run on the same machine, so if one step maxes out the CPU or I/O on that machine, you won't get the full benefit of concurrency. There are many improvements we can make on this front, but they are not yet available out of the box.
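To make the idea of "steps that can run concurrently" concrete, here is a minimal sketch in plain Python. It is not Dagger's API or its actual scheduler: the `PIPELINE` graph, the step names, and `run_pipeline` are all made up for illustration. It only shows the general technique of starting each step as soon as its dependencies have finished, on a single machine. It assumes Python 3.9+ for `graphlib`.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical pipeline: step name -> the steps it depends on.
PIPELINE = {
    "fetch": [],
    "lint": ["fetch"],
    "docs": ["fetch"],
    "test": ["fetch"],
    "report": ["lint", "docs", "test"],
}


def run_pipeline(deps, run_step, max_workers=4):
    """Run each step once all of its dependencies are done.

    Steps with no unmet dependencies are submitted together, so they
    share the same worker pool (and therefore the same machine).
    Returns the step names in the order they completed.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    completed = []
    pending = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies are all satisfied.
            for name in sorter.get_ready():
                pending[pool.submit(run_step, name)] = name
            # Block until at least one running step finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise if the step failed
                name = pending.pop(future)
                completed.append(name)
                sorter.done(name)  # may unlock downstream steps
    return completed


# Demo: each "step" just sleeps briefly instead of doing real work.
order = run_pipeline(PIPELINE, lambda name: time.sleep(0.1))
print(order)
```

In this toy graph, `lint`, `docs`, and `test` all become ready once `fetch` completes, so they run side by side, but they still compete for the same CPU and I/O, which is the limitation described above.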

However, since Dagger can be embedded in your existing CI system (including CircleCI), it might be possible to piggyback on CircleCI's resource pool system, so that each Dagger pipeline is in fact running on a different pool. I don't know enough about how CircleCI does this, but would be very interested in learning more.

Further down the road, we plan to add clustering capabilities: first to load-balance pipelines across multiple machines, then eventually to split the individual steps of each pipeline across machines. This will take a while, though.

How do I cache stuff (like deps, venv, etc.)? Do I still have to use the CI provider's caching mechanism? To be honest, that's 1/3 of everything I have in the workflow (alongside fetching code, secrets, and recording artifacts).

This is one of the killer features of Dagger compared to "regular" CI systems. You don't need custom caching logic: every step of every Dagger pipeline benefits from Dagger's caching out of the box.

So in practice, you only need to configure Dagger's caching once (i.e. where to store and retrieve the cache data). Once that's done, all your pipelines get cached automatically and you can remove that 1/3 of your workflow 🙂
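For intuition about why per-step caching can replace hand-written cache rules, here is a rough sketch of the general idea: key each step on a hash of its command and the content of its inputs, and reuse the stored result when the key matches. This is an illustration only, not how Dagger implements its cache; `StepCache`, `run`, and the `execute` callback are hypothetical names.

```python
import hashlib
import json


class StepCache:
    """Toy in-memory cache keyed by a step's command and input contents."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def run(self, command, inputs, execute):
        # The key covers everything that can change the result:
        # the command itself and the content of each input.
        payload = json.dumps(
            {"command": command, "inputs": inputs}, sort_keys=True
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]  # same key: skip the work
        result = execute(command, inputs)
        self._store[key] = result
        return result


# Demo: a fake "install dependencies" step keyed on requirements.txt.
calls = []


def fake_install(command, inputs):
    calls.append(command)  # record that real work happened
    return "venv for " + inputs["requirements.txt"]


cache = StepCache()
reqs = {"requirements.txt": "pandas==1.5.0\n"}
cache.run("pip install -r requirements.txt", reqs, fake_install)
cache.run("pip install -r requirements.txt", reqs, fake_install)  # reused
print(len(calls), cache.hits)
```

Changing the content of `requirements.txt` produces a different key, so the step runs again. That is the behavior you would otherwise encode by hand in a CI provider's cache keys.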

If I disregard the CI completely, would Dagger be a better fit to replace Airflow and other schedulers?

It depends on what you use Airflow for today.

I was about to say that the CUE language was really putting me off, but I read that cloak could make it available in other languages, so everything's fine 🙂

Ha ha, yes that is common feedback. CUE is very powerful but it's not for everyone. As we learned, a lot of people want to replace their "artisanal scripts" with modern pipelines, but would rather use a language they already know. Hence project cloak!