Hey everyone, we are close to our production launch of Dagger as our main CI build system, running between 800 and 900 jobs per day. For context, we run Dagger on GitHub Actions self-hosted runners on EKS with a remote Dagger engine architecture. I'm looking for an efficient way to load-test our architecture so we can choose the appropriate instance types for the remote engines. Has anyone been through this before? Keen to hear about your experience. Thanks!
#Dagger cache sizing before production launch
cc @karmic tapir, who will very likely give you some pointers here.
Hey @verbal oasis!!
We went, and continue to go, through that process here at Dagger. Choosing an instance type has mostly been done via trial and error. We don't group our workloads in any special way, so we are always looking for a unified instance type that can serve the majority of workloads. This resulted in going with compute-optimized machines that have built-in NVMe storage; right now we are using mostly c6id instances. These are the main factors we look into when choosing the instance type:
- Job parallelization: there is a sweet spot for the overall size of the instance that depends on the jobs themselves. For us, it has worked well to go with relatively big instances that can accommodate almost all jobs of a given commit. We have some jobs that are too heavy, so we run them in isolation for better performance.
- Resource requests and limits: on each node we run a single Dagger engine that has no resource limits. What we do instead is define resource requests on the runner pods and let Karpenter choose the instance size based on all the requests that the runner pods have. The engine is then able to run free with all the resources of that machine. Not sure if by remote engines you mean connecting via TCP to a different host; in that case the story might be a bit different.
- Money: our CI is quite heavy and active, we try to go with instances that might not be the absolute top of the line but that are good enough
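To make the resource-requests point concrete, here is a rough sketch of the idea, not Karpenter's actual algorithm: sum the requests of the runner pods and pick the smallest instance shape that fits. The `smallest_fit` helper and the reserved-overhead parameters are hypothetical names for illustration, and real nodes expose less allocatable capacity than the raw instance size, so treat the result as an upper bound on what fits.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Shape:
    name: str
    vcpus: int
    memory_gib: int


# c6id sizes: 2 GiB of memory per vCPU.
C6ID: Sequence[Shape] = (
    Shape("c6id.large", 2, 4),
    Shape("c6id.xlarge", 4, 8),
    Shape("c6id.2xlarge", 8, 16),
    Shape("c6id.4xlarge", 16, 32),
    Shape("c6id.8xlarge", 32, 64),
    Shape("c6id.12xlarge", 48, 96),
    Shape("c6id.16xlarge", 64, 128),
    Shape("c6id.24xlarge", 96, 192),
    Shape("c6id.32xlarge", 128, 256),
)


def smallest_fit(
    requests: Iterable[Tuple[float, float]],
    shapes: Sequence[Shape] = C6ID,
    reserved_cpu: float = 0.0,      # assumption: system/daemon overhead in vCPUs
    reserved_mem_gib: float = 0.0,  # assumption: system/daemon overhead in GiB
) -> Optional[Shape]:
    """Return the smallest shape covering the summed (cpu, memory_gib) requests.

    Simplification: everything is packed onto one node. A real scheduler
    bin-packs pods across several nodes and accounts for allocatable capacity.
    """
    reqs = list(requests)
    cpu = sum(r[0] for r in reqs) + reserved_cpu
    mem = sum(r[1] for r in reqs) + reserved_mem_gib
    for shape in sorted(shapes, key=lambda s: s.vcpus):
        if shape.vcpus >= cpu and shape.memory_gib >= mem:
            return shape
    return None  # nothing in the family is big enough for a single node


if __name__ == "__main__":
    # Hypothetical example: six runner pods, each requesting 4 vCPUs and 8 GiB.
    print(smallest_fit([(4, 8)] * 6))
```

Since the engine itself has no limits, the runner pod requests are effectively the only knob that controls how large a node gets.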
The most significant performance improvement we've had came from relying on instances with the fastest possible storage. Using instances with NVMe will make a big difference no matter what. As for compute, it's quite likely that you'll want to leverage fast CPUs. Right now we are using Intel, but we are going to try AMD; we have a feeling that, given the faster per-core frequency, it will work better for us. Will report back on that.
When it comes to load testing, nothing beats running the real thing in production. Would that be possible for you? What we ended up doing is configuring alternative runners that run all the jobs once per day using GitHub scheduled workflows. This definitely causes some annoying YAML duplication and increases spending, but in my experience CI has always been difficult to predict, so running the real thing for a few days to have enough data points to compare is the best bet you can make.
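Once a few days of runs exist on both the regular and the alternative runners, the comparison itself can stay simple. Below is a minimal sketch, assuming job durations in seconds have already been exported per runner type; the `summarize` and `compare` helpers are hypothetical names, not part of Dagger or GitHub Actions.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(durations: Sequence[float]) -> Dict[str, float]:
    """Median and nearest-rank p95 of job durations (seconds)."""
    if not durations:
        raise ValueError("need at least one duration")
    ordered = sorted(durations)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "n": float(len(ordered)),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


def compare(baseline: Sequence[float], candidate: Sequence[float]) -> Dict[str, float]:
    """Relative change of the candidate versus the baseline (negative means faster)."""
    b, c = summarize(baseline), summarize(candidate)
    return {key: (c[key] - b[key]) / b[key] for key in ("median", "p95")}
```

Looking at p95 alongside the median helps because the heavy jobs, not the typical ones, tend to decide whether an instance type is good enough.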
Hi, I'm on the team with @verbal oasis. For now our solution is to run the engine on a dedicated host (we already chose the c6id instance type) and a bunch of GHA hot runners on classic hardware (mostly c6). For now it's complicated to run under real load (we are moving from another CI, workflows are being redefined...). Thanks for your insight 🙂