#CI/CD Pipeline Performance

fresh wraith
#

My team is in the process of migrating our CI/CD pipelines to using Dagger.

When running integration tests, we use the as_service method to spin up the required services (e.g. mongo, minio, etc.). The CI/CD also runs on self-hosted EC2 runners (within a GitHub Action) with quite a bit of resources.

When comparing run durations of some of our tests against the old way (spinning up the services using docker compose), our execution times are significantly longer (e.g. an action that takes ~10 seconds with the docker compose method takes ~60 seconds with the as_service method).

Are there tunable parameters to ensure that the dagger container has access to the relevant resources from the EC2 instance (is it using the appropriate cores + memory)? Are there any steps I can take to help triage the performance issues I am having?

I previously submitted this: https://discord.com/channels/707636530424053791/1413316800950567072 in order to tease out some of these issues. One of the tools we use is Ray which prefers the /dev/shm size to be large enough otherwise there could be performance issues. I have used the recommended actions in that thread and it appears to not complain about /dev/shm anymore (not sure if the cache volume mounting would fix the performance related issues though).

I am using the Python SDK as an FYI (0.18.16 is the dagger engine version).

fresh wraith
proud osprey
fresh wraith
#

I sent a trace privately to minimize any unwanted logs being shown.

I don't know how helpful the spans will be, I am most curious about the giant span of the testing being run and why the logic within that test is running slower than if it was tested against the infrastructure spun-up from docker-compose.

i.e. in the last span, you'll see a log

Ray Job Status CREATE_SYNTHESIS_PLAN_EX ea8b2af5-ccde-4767-877d-52c2000bda1b: SUCCEEDED in 70.31369185447693 seconds

That is what I am trying to tease out: why exactly 70 seconds versus the normal 10 seconds from the docker-compose infrastructure?

Another side note, I did kill that test early so it didn't run all of the tests. The 70 second time duration was an indicator that it needs to be re-attempted with a different solution.
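
For comparing the two setups run by run, a minimal sketch that pulls job durations out of log lines shaped like the one quoted above. The pattern is modelled only on that single line, and `STATUS_LINE` and `job_durations` are illustrative names, not part of Ray or Dagger.

```python
import re

# Pattern modelled on the single log line quoted above; other Ray or
# application log formats may differ.
STATUS_LINE = re.compile(
    r"Ray Job Status (?P<job>\S+) (?P<id>[0-9a-f-]+): "
    r"(?P<status>\w+) in (?P<seconds>\d+(?:\.\d+)?) seconds"
)


def job_durations(log_text: str) -> dict[str, float]:
    """Map each job name found in log_text to its reported duration in seconds."""
    return {
        m.group("job"): float(m.group("seconds"))
        for m in STATUS_LINE.finditer(log_text)
    }


log = (
    "Ray Job Status CREATE_SYNTHESIS_PLAN_EX "
    "ea8b2af5-ccde-4767-877d-52c2000bda1b: SUCCEEDED in 70.31369185447693 seconds"
)
durations = job_durations(log)
baseline = 10.0  # approximate docker-compose figure quoted in the thread
slowdown = round(durations["CREATE_SYNTHESIS_PLAN_EX"] / baseline, 1)
print(slowdown)  # 7.0
```

Running the same extraction over the docker-compose logs would give a per-job baseline instead of the rough 10-second figure used here.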

proud osprey
# fresh wraith My team is in the process of migrating our CI/CD pipelines to using Dagger. Whe...

there's another aspect impacting the performance here considerably: in the stopgap thread that you've shared above, /dev/shm is backed by a cache volume. On most systems that translates to using the machine's primary attached disk, which on most CI runners means some sort of remote network block storage.

One thing that you can try is changing the snippet so it uses a withMountedTemp instead so it's effectively backed by a tmpfs device

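// Test mounts a tmpfs at /cache, bind-mounts it over /dev/shm in the
// entrypoint, and prints the resulting /dev/shm size.
func (m *Shm) Test(ctx context.Context) (string, error) {
    return dag.Container().
        From("alpine:latest").
        // The entrypoint runs on every WithExec that sets UseEntrypoint.
        WithNewFile("/entrypoint.sh", `#!/bin/sh
set -e
mount --bind /cache /dev/shm
exec "$@"`, dagger.ContainerWithNewFileOpts{Permissions: 0755}).
        WithEntrypoint([]string{"/entrypoint.sh"}).
        // 2 << 30 bytes = 2 GiB of tmpfs.
        WithMountedTemp("/cache", dagger.ContainerWithMountedTempOpts{Size: 2 << 30}).
        // Root capabilities are needed for the mount call in the entrypoint.
        WithExec([]string{"df", "-h", "/dev/shm"}, dagger.ContainerWithExecOpts{UseEntrypoint: true, InsecureRootCapabilities: true}).
        Stdout(ctx)
}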

^ this has the caveat though that /dev/shm won't be persistent across WithExec executions. On each WithExec the /dev/shm tmpfs will be reset

#

in my example above, I've assigned 2GB to the tmpfs device so you'd need to adjust that to your needs

fresh wraith
#

That was my next thread to pull on. I did attempt that in an earlier iteration with a 5GB allocation of space, but it errored out due to the Int32 limit. I didn't go back to re-attempt it with a 2GB limit.

#

On the note about persistence across withExec,

If I call a withExec command using useEntrypoint: true on the above snippet, won't it re-mount the cache via the entrypoint shell script, thus persisting the mount? Or is the concern about the data that is held in /dev/shm across execs? I don't necessarily need to persist that data across withExec, I just want to ensure that the cache is large enough

proud osprey
#

Hopefully 2GB is enough for your test. We can fix that and make that argument a GraphQL float so it allows for larger values at that layer

#

In the engine it is an int value which can definitely hold a larger value

proud osprey
# fresh wraith That was my next thread to pull on. I did attempt that in an earlier iteration w...

Oddly enough in Go I was able to set a number larger than 2GB

// Test is the same pipeline as before, with a tmpfs larger than 2 GiB.
func (m *Shm) Test(ctx context.Context) (string, error) {
    return dag.Container().
        From("alpine:latest").
        WithNewFile("/entrypoint.sh", `#!/bin/sh
set -e
mount --bind /cache /dev/shm
exec "$@"`, dagger.ContainerWithNewFileOpts{Permissions: 0755}).
        WithEntrypoint([]string{"/entrypoint.sh"}).
        // 10 << 30 bytes = 10 GiB, above the signed 32-bit limit.
        WithMountedTemp("/cache", dagger.ContainerWithMountedTempOpts{Size: 10 << 30}).
        WithExec([]string{"df", "-h", "/dev/shm"}, dagger.ContainerWithExecOpts{UseEntrypoint: true, InsecureRootCapabilities: true}).
        Stdout(ctx)
}

Was python complaining in your case that the type couldn't be larger than Int32?

fresh wraith
#

Correct, the error is something like this:

graphql.error.graphql_error.GraphQLError: Int cannot represent non 32-bit signed integer value: 5368709120
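
The error above comes from the GraphQL `Int` scalar being limited to signed 32-bit values. A minimal sketch of that range check, useful for sizing a temp mount before calling the SDK. The helper names here are illustrative and are not part of the Dagger SDK or graphql-core.

```python
# Illustrative helpers only; not part of the Dagger SDK or graphql-core.
GIB = 1024**3
INT32_MAX = 2**31 - 1  # largest value a GraphQL Int can carry
INT32_MIN = -(2**31)


def fits_graphql_int(size_bytes: int) -> bool:
    """Return True if size_bytes is representable as a signed 32-bit Int."""
    return INT32_MIN <= size_bytes <= INT32_MAX


def clamp_to_graphql_int(size_bytes: int) -> int:
    """Cap a requested size at the largest value that passes the Int check."""
    return min(size_bytes, INT32_MAX)


requested = 5 * GIB  # 5368709120, the value from the error message
print(fits_graphql_int(requested))      # False
print(fits_graphql_int(2 * GIB))        # False: 2 GiB is one byte over the limit
print(clamp_to_graphql_int(requested))  # 2147483647
```

Note that an exact 2 GiB (2147483648 bytes) is also rejected, so the largest size that passes this check is 2147483647 bytes.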
fresh wraith
fervent steeple
proud osprey
fresh wraith
#

Unfortunately the performance has not changed.

I did have to change the entrypoint to include sudo

`#!/bin/sh
set -e
sudo mount --bind /cache /dev/shm
exec "$@"`

Otherwise I see mount: /dev/shm: must be superuser to use mount.

I don't believe that would impact performance.

fresh wraith
#

This is the response from the df command

/dev/shm mount response: Filesystem     1K-blocks     Used Available Use% Mounted on
overlay        101430960 31125652  70288924  31% /
overlay        101430960 31125652  70288924  31% /.init
/dev/root      101430960 31125652  70288924  31% /etc/hosts
overlay        101430960 31125652  70288924  31% /data/ray
tmpfs              65536        0     65536   0% /dev
tmpfs            2097152        0   2097152   0% /cache

Not sure if this helps. /data/ray is the spill-over disk space (which I mounted). Data will be written to disk in the case that the data is too large. It will spill essentially everything to disk in the case that /dev/shm is small.
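
To read the listing above at a glance, here is a small sketch that parses `df` output into mount point, filesystem and size in GiB. `parse_df` is an illustrative helper, and it assumes the default 1K-block column layout shown above.

```python
# Sample copied from the df output above; sizes are in 1K blocks.
DF_OUTPUT = """\
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay        101430960 31125652  70288924  31% /
overlay        101430960 31125652  70288924  31% /.init
/dev/root      101430960 31125652  70288924  31% /etc/hosts
overlay        101430960 31125652  70288924  31% /data/ray
tmpfs              65536        0     65536   0% /dev
tmpfs            2097152        0   2097152   0% /cache
"""


def parse_df(text: str) -> dict[str, tuple[str, float]]:
    """Map each mount point to (filesystem, size in GiB)."""
    mounts = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed rows
        filesystem, blocks_1k, mount_point = fields[0], int(fields[1]), fields[5]
        mounts[mount_point] = (filesystem, blocks_1k / 1024**2)
    return mounts


mounts = parse_df(DF_OUTPUT)
print(mounts["/cache"])        # ('tmpfs', 2.0)
print(mounts["/data/ray"][0])  # overlay
print("/dev/shm" in mounts)    # False
```

In this output /cache is a 2 GiB tmpfs, /data/ray sits on the overlay filesystem, and /dev/shm has no row of its own.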

proud osprey
#

since as you can see from the image, it's currently an overlay mount which will have a performance penalty as part of the overlay CoW writes

fresh wraith
#

I am calling it using the cache_volume method already

#

with_mounted_cache I mean

#
container = container.with_mounted_cache(
    "/data/ray",
    dagger.dag.cache_volume("ray_data"),
    owner="1000:100",
)

(The ray image uses the ray user that has user ID of 1000 and group id of 100)

proud osprey
fresh wraith
#

It doesn't show /dev/ray, but I assume you mean /data/ray, which is shown as an overlay. It is being mounted with that cache call above.

fresh wraith
#

After removing the ownership of the cache, it now shows up as /dev/root instead of overlay.

proud osprey
fresh wraith
#

Yes, performance is still a problem.

proud osprey
#

In practice performance should be the same as running via docker-compose. At the same time, it's hard for us to do any kind of deeper investigation without having a way we can repro this. I'm wondering if there's anything we could do @fresh wraith so we could get a hold of some sort of repro so we can help you by checking things from our side

#

not sure if the performance issues are related to /dev/shm itself or maybe to the way you're running your stack's dependent services

fresh wraith
#

Actually, I misspoke about the /dev/root. I had switched it to be with_mounted_temp instead of with_mounted_cache which is why it is appearing as /dev/root instead of overlay.

I still believe it is something with /dev/shm.

Let me get back to you on a repo, it might just be too much time lost producing a proof of concept repo with a comparable infrastructure & CI/CD.

proud osprey
fresh wraith
#

Is it possible to make a feature request to make the /dev/shm size configurable like what can be done in docker?

As well as another feature request to remove the int32 restriction when specifying the size of the temp mount?

I think for now I am attempting to see if I can reduce the durations without going through the process of producing a repo for you to test with, but it's possible I might give up and start making one.

proud osprey
#

having said that, the limitation of the int32 still persists. I'll check with @meager wedge and @astral tangle tomorrow and address it ASAP 🙏

fresh wraith
#

I thought the mount --bind was because you can't mount to /dev?

Do I need to change /cache to be /dev/shm by using the following?

WithMountedTemp("/dev/shm", dagger.ContainerWithMountedTempOpts{Size: 2 << 30}).

#

But correct, I don't need it to be shared across different dagger services.

proud osprey
#

now checking if I can somehow monkeypatch the python graphql library to allow you to pass a larger int value until we make the changes in the core SDK

proud osprey
#

@fresh wraith not sure how big your pipeline currently is, but one quick stopgap you could try which unblocks this 2GB limit is migrating some part of the pipeline to Go instead of Python. In Go, this int32 check is not enforced, so you can set a higher value for /dev/shm

#

if you're using AI, it's actually quite good at these kinds of tasks

#

I've spent some time trying to make this monkeypatch thing work but python is kind of giving me a hard time 😬

proud osprey
#

@fresh wraith ok, was able to vendor it using python. Here's how to do it.

  1. Get the graphql-core dependency: git clone --depth=1 git@github.com:graphql-python/graphql-core.git --branch v3.2.6 _vendor/graphql-core
  2. Modify the src/graphql/type/scalars.py of graphql-core and set the GRAPHQL_MAX_INT constant to your preferred value
  3. Modify your module's pyproject.toml as follows:
[project]
name = "test"
version = "0.1.0"
requires-python = ">=3.13"
dependencies = ["dagger-io", "graphql-core==3.2.6"] # >> add dep here

[build-system]
requires = ["uv_build>=0.8.4,<0.9.0"]
build-backend = "uv_build"

[tool.uv.sources]
graphql-core = { path = "_vendor/graphql-core" } # >>> add source folder here
dagger-io = { path = "sdk", editable = true }

^ this will allow dagger to use your vendored version of the graphql-core dependency instead of the one it installs 🙏
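
Step 2 above can also be scripted rather than edited by hand. A minimal sketch, assuming the constant is defined on a single line of the form `GRAPHQL_MAX_INT = <integer literal>` (check this against your checkout). `raise_max_int` and the new limit used below are illustrative choices, not something graphql-core or Dagger prescribes.

```python
import re

# Assumption: the constant is defined on its own line as
# `GRAPHQL_MAX_INT = <integer literal>`; verify against your checkout.
_MAX_INT_LINE = re.compile(r"^GRAPHQL_MAX_INT\s*=\s*.+$", flags=re.MULTILINE)


def raise_max_int(source: str, new_max: int) -> str:
    """Return source with the GRAPHQL_MAX_INT assignment set to new_max."""
    patched, count = _MAX_INT_LINE.subn(f"GRAPHQL_MAX_INT = {new_max}", source)
    if count != 1:
        raise ValueError(
            f"expected exactly one GRAPHQL_MAX_INT assignment, found {count}"
        )
    return patched


# Stand-in for the relevant lines of scalars.py (assumed shape).
sample = "GRAPHQL_MAX_INT = 2_147_483_647\nGRAPHQL_MIN_INT = -2_147_483_648\n"
new_limit = 2**53 - 1  # example value only; pick what your mounts need
patched_sample = raise_max_int(sample, new_limit)
print(patched_sample)
```

To apply it to the vendored copy, read `_vendor/graphql-core/src/graphql/type/scalars.py` with `pathlib.Path.read_text()`, pass the text through `raise_max_int`, and write the result back. Failing when the line is not found exactly once keeps the script from silently doing nothing if a later graphql-core version changes the definition.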

fresh wraith
#

Thanks @proud osprey trying that out now!

proud osprey
fresh wraith
#

I was able to vendor it (lots of diffs in my PR haha). Looking forward to removing the vendored source code if float is available as a data-type for the cache size.

I think one thing that was causing issues is that we use OpenTelemetry in our repository (and it looks like Dagger uses OpenTelemetry as well). We don't actually stand up an OTel collector, though. It looks like it was doing a lot of retrying to send events/logs, so I updated my services to stand up a basic collector and things appear to be faster. I tried setting env vars to avoid emitting OTel data, but I always ended up still seeing OTel failure-to-export logs.

I did also end up increasing the /dev/shm size to 10GB mixed with some other things (reserving more RAM for object memory to avoid spilling to disk), so it's hard to verify which one ended up reducing the test duration. The test duration was essentially cut in half though, so it's manageable now.
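
On the OpenTelemetry point above, one cheap triage step is to check whether anything is listening at the collector endpoint before the tests start, since export retries against a missing collector can add up. A minimal sketch using only the standard library follows. `endpoint_reachable` is an illustrative helper, and the demo uses a throwaway local listener as a stand-in for a collector.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a throwaway local listener standing in for a collector.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = endpoint_reachable("127.0.0.1", demo_port)
listener.close()
reachable_after_close = endpoint_reachable("127.0.0.1", demo_port)

print(reachable_while_listening)  # True
print(reachable_after_close)
```

In a real pipeline you would pass your collector's host and port instead. OTLP defaults to port 4317 for gRPC and 4318 for HTTP. A TCP connect only shows that something is listening, not that it accepts OTLP data.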

proud osprey
fresh wraith
#

Sounds good!

Is as_service essentially calling a docker run command under the hood? Could be useful if I could pass the optional --shm-size argument for those services to tune it that way.

wooden vapor
fresh wraith
#

Ahhh, thank you!