#Hello! The first step is to upgrade, we'
Hi, I just upgraded 4 days ago, so the Python SDK init is on the latest release, and I tried rerunning some commands locally
OK - so it's still slow then, sorry about that
copying <@&946480760016207902> since performance is something we all care about
@tight blade next question: is it also very slow on warm cache? Or only on cold cache?
In other words what's the time difference between 1st and 2nd run
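A minimal sketch of how the cold vs. warm comparison could be measured, assuming the pipeline can be started from a Python callable. The helper names `time_runs` and `cold_warm_gap` are hypothetical and not part of any Dagger tooling.

```python
import time
from typing import Callable


def time_runs(run: Callable[[], object], repeats: int = 2) -> list[float]:
    """Call `run` `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return durations


def cold_warm_gap(durations: list[float]) -> float:
    """Difference between the first (cold) run and the fastest later (warm) run."""
    if len(durations) < 2:
        raise ValueError("need at least two runs to compare cold and warm")
    return durations[0] - min(durations[1:])
```

In practice `run` might wrap something like `subprocess.run(["dagger", "call", "test"], check=True)`, where `test` stands in for whatever function the module actually exposes. A large gap points at caching; a small one points at work that is repeated on every run.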
Yes, I can still observe 30-40s between the CLI connecting to the engine and seeing the first commands / actions from my dagger functions
A warm cache fixes the issue, but when I tried it, it did not work in our case because we have different modules for our monorepo, so if I remember well it's avoiding to have the sdk in cache
If it makes things easier to understand, I can try to draw a kind of ASCII diagram of how our dagger modules are organized
Sure that would be great. You say "avoiding to have the sdk in cache" -> you mean preventing? Or is your system intentionally not keeping sdk in cache?
yes sorry I mean preventing
As long as you're loading the same version of the SDK from all your modules, there's no reason for the SDK to not be cached.
If your engine is running many different kinds of modules, at high volume, you could have a situation where the different modules evict each other from the cache more rapidly
If I have a GHA pipeline that triggers 5 pipelines, and all of them use the Python SDK, is there a way to speed up the SDK initialization?
Yes, normally in this situation Dagger will cache the python sdk loading for all 5 pipelines. But of course the cache has to be available on the next run - which depends on your CI infra configuration
Yes, in my case nodes are ephemeral. Maybe I could have something to warm the SDK, but I think I tried that a few months ago and the warming only worked at the level of a single module in my repo, not across modules
I generated this, if it helps to understand
warming is only useful if the cache is persisted
Yes, I was wondering whether it's better to initialize the SDK once at boot instead of X times per job
because sometimes we have 5-6 pipelines per node; after that I think Karpenter boots a new node
The answer to that depends on your CI infra config, so I don't know
Looks like your real problem is lack of cache persistence
OK, so without cache this kind of waiting time is normal? The only thing to do is find a way to cache?
Currently we are running in EKS using the local NVMe of the instance
Do you have anything to recommend or explore to reduce the init time?
do you have a dagger.cloud trace link for this run or a similar one?
No, I can try to add a personal token, but currently we are not using Dagger Cloud in the company
One option is to use the dang scripting language instead of Python. Dang is optimized for speed, there is no codegen phase, and no third party tools to install and configure. If your modules are only orchestrating dagger API calls, and not relying on native python libraries, then it's pretty easy to switch (and you can switch one module at a time)
It won't solve your caching problem, but it will remove the python sdk loading time
that would help quite a bit since it would show more definitively what specifically is slow, it's hard to tell from just that output in the screenshot
Also @tight blade Dagger Cloud has experimental engine hosting, with automatic scale-out and cache persistence. If you want to evaluate that, we can give you early access (we use it in prod for our own CI)
For dang it seems difficult, because sometimes we use native Python libraries or have some code logic, and the company develops in Python
@fossil owl I will set a token temporarily in order to get some traces and share something with more information
@weak scaffold yes, that could be interesting, as we were thinking of moving away from GitHub Actions due to some limitations in the workflow design possibilities (I would just need a rough idea of the future cost, and of what I'd have to change between my current setup and using the engine hosting)
FYI you can also use Dagger Cloud as a complete Github Actions replacement. End-to-end dagger-native CI infra, from git event to check execution. It's an extra layer on top of cloud-hosted engines.
How about we give you a demo, and we can talk about your use case at the same time? I'll DM you a scheduling link
Ok for me
@fossil owl I have a small issue in my GHA workflow: I don't know why the dagger token is not available in all my steps. I have to go, I will investigate tomorrow and share some links
OK, now I have traces, and looking at the details, if I'm reading the trace correctly, the issue is not in the SDK init
I have one trace here: https://dagger.cloud/Dudesons/traces/72f14068730a442f4499f7b2e812f7ff?listen=5c76e0c88cf9a2b7 (everything seems good at least at the beginning)
but from the GHA UI I still see a significant delay
line 28->29
I just realized maybe I misread the GitHub Actions output: all the dots printed between lines 28->29 are commands / processing being done, and the engine is just working on different actions like installing packages, setting env vars, etc.
yeah I wonder if it's just that there are no logs to print during that time period, and GHA doesn't print any dots until the line completes? (i.e. their UI doesn't flush on ..... - only on .....\n)
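A toy illustration of that hypothesis, assuming a log viewer that only displays completed lines. `LineFlushedLog` is a hypothetical stand-in and not a model of how the GitHub Actions UI is actually implemented.

```python
class LineFlushedLog:
    """Toy log viewer that only displays text once a newline arrives."""

    def __init__(self) -> None:
        self._pending = ""
        self.displayed: list[str] = []

    def write(self, chunk: str) -> None:
        self._pending += chunk
        # Only complete lines are moved to the display; the tail stays hidden.
        *complete, self._pending = self._pending.split("\n")
        self.displayed.extend(complete)


log = LineFlushedLog()
for _ in range(5):
    log.write(".")           # progress dots, no newline yet
hidden_before = list(log.displayed)
log.write(" done\n")         # the newline releases the whole line at once
```

Under that assumption the dots are produced steadily while work is happening, but a reader of the UI sees nothing until the line ends, which looks like one long stall.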
Yeah I don't believe this is the SDK loading being slow, the load module . step took 1.9s there. Seems like almost all the time is spent running your actual code, this pytest step https://dagger.cloud/Dudesons/traces/72f14068730a442f4499f7b2e812f7ff?listen=5c76e0c88cf9a2b7&listen=385225002e642a75&listen=1ec49988d9fab757#1ec49988d9fab757
But I do see what you're saying about the various withEnv steps before that taking quite a bit, though it seems like each of those is bottlenecked by various uv sync calls. Actually, that's just one of them; the others don't seem to do much and still take 3s
So that overhead might not be the SDK loading per-se, but just the overhead of invoking a python module function at runtime
cc @twin wren @snow zodiac I have a vague memory of talking about how the python SDK re-does a bunch of work every time it gets invoked and that there were theories on how to fix it? I could be totally misremembering though
The python SDK is analyzing the code at runtime yes. Contrary to Go for instance where we are generating the entrypoint during codegen, meaning all the future calls will use the generated code instead of doing it again. I don't know if we want to go that way for python, but that could be something interesting.
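To make the difference concrete, here is a simplified sketch of "analyze on every call" versus "analyze once and reuse". It is not the Dagger SDK's implementation: `analyze`, `analyze_once`, and the standalone `with_env` function are hypothetical, and an in-process cache only stands in for an entrypoint generated ahead of time.

```python
import inspect
from functools import lru_cache
from typing import Callable

analysis_count = 0


def analyze(fn: Callable) -> tuple[str, ...]:
    """Inspect a function's parameters (stand-in for runtime code analysis)."""
    global analysis_count
    analysis_count += 1
    return tuple(inspect.signature(fn).parameters)


@lru_cache(maxsize=None)
def analyze_once(fn: Callable) -> tuple[str, ...]:
    """Same analysis, but computed once and reused afterwards."""
    return analyze(fn)


def with_env(name: str, value: str) -> str:
    return f"{name}={value}"


# Re-analyzing on every invocation: the cost is paid each time.
for _ in range(3):
    analyze(with_env)
per_call = analysis_count

# Analyzing once up front: later invocations reuse the stored result.
analysis_count = 0
for _ in range(3):
    analyze_once(with_env)
cached = analysis_count
```

Here `per_call` ends up at 3 and `cached` at 1. Codegen gets the same effect across separate processes, where an in-memory cache would not survive between invocations.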
That's quite confusing because the trace shows that the job took 5m28 with the test function taking 5m20.
So is there some time that isn't recorded?
As I understand from the thread, the Python SDK loading takes too long, but based on the traces it's only 2 seconds
Something very weird that I'm seeing in the trace is this:
Yes, the trace revealed that the python SDK loading is not slow, it's all cached. I'm asking about the withEnv steps where it seems like potentially not much is happening but they all take 3s each anyways
Why does the load sdk runtime step keep appearing?
because of lazy loading
it is cached and doesn't do anything, but it hits that codepath each time
Ohhh okay, makes sense
@tight blade do those withEnv steps do something expensive directly in your code (i.e. not calling dagger APIs but instead calling some python library or similar)? Just wondering whether the 3s overhead is that vs. just the python function being slow to invoke
Yes, I spotted the WithEnv call which takes a lot of time. It's a function from our internal Python module:
@function
async def with_env(
    self,
    name: Annotated[str, Doc("the targeted container")],
    value: Annotated[str, Doc("the targeted container")] | None,
    value_secret: Annotated[dagger.Secret, Doc("the targeted container")] | None,
) -> Self:
    """Add environment variable for a container"""
    if not value and not value_secret:
        raise Exception("value or value_secret should be set")
    if value:
        self._ctr = (
            self.
            _ctr.
            with_env_variable(name, value)
        )
    else:
        self._ctr = (
            self.
            _ctr.
            with_secret_variable(name, value_secret)
        )
    return self
and we were using it like this in the apps pipeline:
ctr = await (
    dag.
    python(
        self.pipeline_id,
        self.source,
        base_workdir="/applications",
        sub_path=self._app_name,
        worskpace_pyproject=self.worskpace_pyproject,
        worskpace_uv_lock=self.worskpace_uv_lock,
        libraries=self.libraries_source,
        build_system_packages=self.__SYSTEM_PACKAGES,
    ).
    with_discover_python_version().
    install().
    with_env(name="MONGO_MAIN_URI", value_secret=mongo.uri()).
    with_env(name="CORE_ROUTE_BASE", value="http://core-api.service:9000/").
    with_env(name="CATALOG_APP_URL", value="http://catalog-app-url").
    with_env(name="CASHER_HOST", value="http://casher_url").
    with_env(name="INDUS_ROUTE_BASE", value="http://core-api.service:9000/indus").
    with_env(name="CORE_API_ENV_NAME", value="some-env").
    with_env(name="CORE_API_SERVICE_NAME", value="some-service").
    with_env(name="CORE_API_SERVICE_VERSION", value="some-version").
    with_env(name="DISABLE_AWS_AUTHENTICATION", value="1").
    container().
    sync()
)
now I changed it to a new call in our python module:
@function
async def with_envs(
    self,
    source: Annotated[dagger.EnvFile, Doc("Collection of environment variables to set")],
) -> Self:
    """Add multiple environment variables at once from an EnvFile.
    This is more efficient than calling with_env() multiple times,
    as it applies all variables in a single operation.
    """
    self._ctr = (
        self.
        _ctr.
        with_env_file_variables(source)
    )
    return self
and the pipeline looks like this:
ctr = await (
    dag.
    python(
        self.pipeline_id,
        self.source,
        base_workdir="/applications",
        sub_path=self._app_name,
        worskpace_pyproject=self.worskpace_pyproject,
        worskpace_uv_lock=self.worskpace_uv_lock,
        libraries=self.libraries_source,
        build_system_packages=self.__SYSTEM_PACKAGES,
        python_version=python_version.strip()
    ).
    install().
    with_envs(
        dag.
        env_file().
        with_variable("CORE_ROUTE_BASE", "http://core-api.service:9000/").
        with_variable("CATALOG_APP_URL", "http://catalog-app-url").
        with_variable("CASHER_HOST", "http://casher_url").
        with_variable("INDUS_ROUTE_BASE", "http://core-api.service:9000/indus").
        with_variable("CORE_API_ENV_NAME", "some-env").
        with_variable("CORE_API_SERVICE_NAME", "some-service").
        with_variable("CORE_API_SERVICE_VERSION", "some-version").
        with_variable("DISABLE_AWS_AUTHENTICATION", "1").
        with_variable("MONGO_MAIN_URI", await mongo.uri().plaintext())
    ).
    container().
    sync()
)
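The long `with_variable` chain could also be built from a plain mapping. This is only a sketch: `add_variables` is a hypothetical helper, and it assumes nothing about the object it receives beyond a `with_variable(name, value)` method that returns the next object in the chain, as used in the pipeline above.

```python
from typing import Any, Mapping


def add_variables(env_file: Any, variables: Mapping[str, str]) -> Any:
    """Apply one with_variable() call per entry, in mapping order."""
    for name, value in variables.items():
        # Each call returns a new object, so keep rebinding the result.
        env_file = env_file.with_variable(name, value)
    return env_file
```

With it, the pipeline would pass something like `add_variables(dag.env_file(), {"CASHER_HOST": "http://casher_url", ...})` to `with_envs`, which is still a single call into the custom module.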
I have this trace: https://dagger.cloud/Dudesons/traces/0a3484164d363b2680c345d013b58637?listen=0d086806be80f5ed&listen=1b395ece412ee816
where I can see this reduces the time spent on env var setup
but to be honest I don't really understand why my initial function had a side effect like that
Okay nice, yeah that's a good workaround. Now it's just that single .withEnvs taking ~3s rather than a bunch of individual ones taking ~3s.
The fact that the combined step also takes 3s is pretty good evidence that it is indeed just inherent overhead of python functions. So not the loading step but the actual runtime invocation overhead.
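A back-of-the-envelope version of that reasoning, assuming the overhead is a fixed cost per module function invocation and scales linearly with the number of calls. The linear model and the `chain_overhead` helper are assumptions for illustration; the ~3s figure and the nine `with_env` calls come from the pipeline and traces in this thread.

```python
def chain_overhead(calls: int, per_call_seconds: float) -> float:
    """Estimated fixed overhead for a chain of module function calls."""
    if calls < 0:
        raise ValueError("calls must be non-negative")
    return calls * per_call_seconds


# ~3s per invocation is the figure observed in the traces discussed here.
PER_CALL_SECONDS = 3.0

before = chain_overhead(9, PER_CALL_SECONDS)  # nine with_env() calls
after = chain_overhead(1, PER_CALL_SECONDS)   # one with_envs() call
saved = before - after
```

Under these assumptions the env setup drops from roughly 27s to roughly 3s. Other custom module calls in the chain (`install()`, `container()`) would each carry their own overhead and are not counted here.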
Like we were discussing above there's definitely some improvements we can make in this area, but in the meantime I think your workaround is good
OK, is there somewhere in the documentation or an issue where I can read about overhead like that in the Python SDK, so I can review our internal modules?
Because previously I was developing modules in Go, but at my current company people code in Python, so I'm trying to stay on the Python SDK
Not currently, sorry, it's mostly an implementation detail at this point and one that will improve as we get bandwidth to address it
OK no problem, I just wanted to be sure.
So is the idea to reduce the number of chained function calls on my custom Python modules, or do I keep things as they are and occasionally apply a workaround like the one I did?