#Dagger connection is slow


gritty lily
#

I'm sharing some screenshots of the steps where I saw some strange waiting time

#

checkout step: most of the time is spent waiting on Dagger

#

notify step: half of the step time is spent on the Dagger connection:

#

generate step plan:

#

Some metrics from node and pod for the pipeline run
Nodes:

#

pod running on the node:

#

metrics of the dagger engine:

#

autoscaling-runner-set-29-l-... only run "light steps" like s3 sync to repo, terraform plan + apply, and generating a summary from small files

the issue also occurs on other worker sizes

novel crow
#

cc @sharp garden

gritty lily
#

Currently our pipelines run with Dagger 0.19.2

#

The different CI steps I shared run on a self-hosted GitHub Actions runner, and all these steps are in the same job, so they use the same Dagger engine

gritty lily
#

Another point I forgot to share: when a job starts, a prehook script waits for the Dagger engine to be up with this function:

function wait_for_dagger() {
  while ! dagger core container > /dev/null 2>&1
  do
    echo 'Waiting for dagger to be up'
    sleep 1
  done
}
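
The loop above never gives up if the engine stays unreachable. A minimal sketch of a bounded variant follows, assuming a timeout in seconds is acceptable for the prehook. The name `wait_for_cmd` and its arguments are hypothetical and not part of Dagger; the probe command is passed in so the same `dagger core container` check can be reused.

```shell
# Sketch only: wait_for_cmd is a hypothetical helper, not a Dagger command.
# Usage: wait_for_cmd <timeout_seconds> <probe command...>
wait_for_cmd() {
  timeout_s="$1"; shift
  start=$(date +%s)
  # Retry the probe once per second until it succeeds.
  until "$@" > /dev/null 2>&1; do
    now=$(date +%s)
    if [ $(( now - start )) -ge "$timeout_s" ]; then
      echo "Timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    echo 'Waiting for dagger to be up'
    sleep 1
  done
  # Report how long the wait took, so slow engine starts show up in job logs.
  echo "Ready after $(( $(date +%s) - start ))s"
}

# Example prehook call, using the same probe as the original function:
# wait_for_cmd 120 dagger core container
```

Printing the elapsed time also makes it possible to tell a slow engine start apart from time spent later in the job.
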
gritty lily
#

👋 Bumping this thread.
Also, a question: does the Dagger engine expose a Prometheus endpoint, and if so, do you think it could help to better understand the situation?

gritty lily
#

👋 I'm trying to upgrade to 0.19.8, and I often lose about 30s on the connection to the engine. Over a whole run, that adds up to many minutes.
I will have more time to work on this later, but if you can help or point me to where to look, that would be awesome.
I can try to share more information.

gritty lily
#

Could the issue be a kind of module cold start?

gritty lily
#

I enabled more logs. I can see that the part that actually takes time is the module loading, which often takes 30-40s

#

I tried to simplify some modules by removing external dependencies, but it doesn't change much
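
To compare module load times before and after a change like this, each invocation can be wrapped in a small timer. This is a rough sketch, assuming wall-clock seconds are precise enough for a 30-40s delay; `time_step` is a hypothetical helper name, and it measures the whole command, not the module loading phase alone.

```shell
# Sketch only: time_step is a hypothetical helper, not a Dagger command.
# Usage: time_step <label> <command...>
time_step() {
  label="$1"; shift
  t0=$(date +%s)
  rc=0
  "$@" || rc=$?          # run the command and keep its exit code
  t1=$(date +%s)
  # Timing goes to stderr so the command's own stdout stays untouched.
  echo "[timing] ${label}: $(( t1 - t0 ))s (exit ${rc})" >&2
  return "$rc"
}

# Example, with whatever dagger invocation the step already runs:
# time_step "checkout" dagger call ...
```
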

livid dock
#

do you have a trace URL you can share? Would make it very easy to see the root cause of the issue

#

connecting to the engine should not take 30s, if that's really the cause then it's specific to your infra configuration

#

loading a module on a cold cache can definitely feel slow. but that would not be under "connect to engine" in the telemetry

gritty lily
#

I don't have one, but I can try to set it up.
I think I misinterpreted the Dagger output. Now that I've enabled the plain output mode instead of dots, I can see the issue is in the module loading

livid dock
#

Do you have a warm cache in between runs?

gritty lily
#

Not yet, I'm starting to explore that. I have several modules, and I thought warming up one module that uses the Python SDK would reduce the loading time for the others, but it doesn't, or the warm-up should be done differently

#

I tried something like this: dagger -m /home/runner/ functions > /dev/null 2>&1 || true
but it only works for that module. I mention it because all my modules are written in Python, so I imagined loading would be faster once the SDK is loaded/cached

#

Currently I have some standalone modules that could be grouped under a "runner toolkit" category. Then, once the repo (a monorepo) is cloned, there is one Dagger module per app, or in some cases a shared Dagger module (e.g. for our libraries)

#

each of these Dagger modules for apps or libraries uses a common module for testing / building our Python apps

#

Is there a recommended way to warm the cache?
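
One way to extend the single-module warm-up to the monorepo layout described above is to find every directory containing a `dagger.json` and run the same `functions` call in each. This is a sketch under assumptions, not a recommended Dagger procedure. The helper names `list_dagger_modules` and `warm_dagger_modules` are hypothetical, and whether this shortens later loads depends on the engine cache still being present when the real steps run.

```shell
# Sketch only: both helper names are hypothetical, not Dagger commands.

# Print every directory under $1 that contains a dagger.json file.
list_dagger_modules() {
  find "$1" -type f -name dagger.json -exec dirname {} \; | sort
}

# Run the warm-up command from the thread once per discovered module.
warm_dagger_modules() {
  list_dagger_modules "$1" | while IFS= read -r module_dir; do
    echo "Warming ${module_dir}"
    # stdin is redirected so the command cannot consume the directory list.
    dagger -m "$module_dir" functions < /dev/null > /dev/null 2>&1 || true
  done
}

# Example, after the monorepo has been cloned into the current directory:
# warm_dagger_modules .
```
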

livid dock
#

@gritty lily the most important question is whether you have persistent storage to keep the cache

gritty lily
#

No, currently it runs on ephemeral nodes managed by Karpenter, and the Dagger data is on the node's local NVMe

novel crow
#

@gritty lily are you around? is there any chance you can show me this really quick?

#

I have a few minutes so we can check it out together

gritty lily
#

I can be available in 1 hour or I can block some time another day
