#Really strange bugs started happening in CI, like `apt-get` not in PATH on a `ubuntu:22.04` image.


jade cypress
#

Curious if anyone has seen anything like this before lol. Seemed to show up out of nowhere and haven't been able to reproduce locally, so my suspicions are around the docker daemon...

unique hatch
#

👋 does it happen consistently? Or is it just random?

unique hatch
#

by any chance, in the engine logs of this pipeline, could you see the digest that got used for that image?

#

maybe ubuntu:22.04 pushed an incorrect image for a moment?

#

particularly since you're using a 22.04 tag, which is mutable and could have been replaced with a bad image

jade cypress
#

Thanks for responding, sorry I missed it.

It's been happening pretty consistently all day, yeah, but not locally.

#

by any chance, in the engine logs of this pipeline, could you see the digest that got used for that image?

What would I look for?

#

I should clarify: It is happening all day but we have like 8 or so different dagger pipelines running in parallel in 1 dagger run, and it happens in some but not others. So it's.. persistent but not consistent.

unique hatch
# jade cypress > by any change in the engine logs of this pipeline could you see the digest tha...

So, the dagger logs while the pipeline is running should output something like:

11: resolve docker.io/library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48 
11: resolve docker.io/library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48 [0.01s]
11: pull docker.io/library/alpine:latest DONE

11: pull docker.io/library/alpine:latest
11: > in from > from alpine
11: > in sync
11: pull docker.io/library/alpine:latest DONE

11: pull docker.io/library/alpine:latest CACHED
11: > in from > from alpine
11: > in sync
11: pull docker.io/library/alpine:latest CACHED

11: pull docker.io/library/alpine:latest
11: > in from > from alpine
11: > in sync
11: sha256:661ff4d9561e3fd050929ee5097067c34bafc523ee60f5294a37fd08056a73ca 0B / 3.251MiB 
11: sha256:661ff4d9561e3fd050929ee5097067c34bafc523ee60f5294a37fd08056a73ca 1MiB / 3.251MiB 
11: sha256:661ff4d9561e3fd050929ee5097067c34bafc523ee60f5294a37fd08056a73ca 2MiB / 3.251MiB 
11: sha256:661ff4d9561e3fd050929ee5097067c34bafc523ee60f5294a37fd08056a73ca 3.251MiB / 3.251MiB [1.88s]
11: pull docker.io/library/alpine:latest DONE

As you can see there, the sha256 is the image digest. It'd be great if we could get that so we can validate the digest of the image being pulled.
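If the logs are long, something like the following could pull the digests out for you. This is only a rough sketch: it assumes the log lines look like the `resolve <image>@sha256:<hex>` lines above, and `find_digests` is a made-up helper name, not anything dagger ships.

```python
import re

# Assumption: digests appear on lines shaped like
#   "11: resolve docker.io/library/alpine@sha256:<hex> [0.01s]"
# Layer download lines (which start with "sha256:") are not matched on purpose.
RESOLVE_RE = re.compile(r"resolve\s+(\S+?)@(sha256:[0-9a-f]+)")


def find_digests(log_text):
    """Return the sorted, de-duplicated (image, digest) pairs found in resolve lines."""
    return sorted(set(RESOLVE_RE.findall(log_text)))


if __name__ == "__main__":
    sample = (
        "11: resolve docker.io/library/alpine@sha256:"
        "51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48 [0.01s]\n"
        "11: pull docker.io/library/alpine:latest DONE\n"
    )
    for image, digest in find_digests(sample):
        print(image, digest)
```

You could feed it the saved CI output and compare the digest it reports with the one you get when pulling the same tag locally.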

jade cypress
#

Ah I'll take a look real quick. We run it in our drone with --silent because it crashes if we don't 🥲. So hopefully those are in dagger.cloud

jade cypress
#

It doesn't look like there's a way for us to see that unfortunately.

paper arch
#

I've seen a number of times that CI will just about always pull a new image but your local environment might be retaining a cached image with the same tag. I might try force pulling a new image. I think you can just run docker pull ubuntu:22.04 and see if you can reproduce it locally. Just a thought

jade cypress
#

Maybe. I'll try that. The 22.04 image that's in the official repo is 6 days old; if there was an issue with it I think it would have been revealed by now.

There's been a couple other logs where it seems like a container / step that ran very late in the dag had some kind of filesystem issue, like a path suddenly not being writeable.

jade cypress
#

I'll try to gather more information and report back.

jade cypress
unique hatch
jade cypress
#

Yeah I think that's fair. This is also a really resource intensive part of the pipeline, too.

#

In that link there are other containers that are imported that are able to run "/run.sh", and they were all built from the same Dockerfile

#

If I went and downloaded those images they'd have it in there too. I'm out right now but I wonder if I can figure out a way to get this to fail consistently.

jade cypress
#

I'm really struggling to reproduce this but it is still happening in our CI. These runners are ephemeral.

I ran exactly what CI was running locally (same program, same arguments).

The differences that I can think of currently:

  • OS: these are running on AWS EC2 runners, probably on AWS linux, but the job itself is running in the docker container drone/drone-runner-docker:1.8.2.
  • Docker daemon or docker version?

I'll try disabling --silent and seeing if I can get any better information from CI, but I'd appreciate any suggestions on anything I can try to reproduce this locally.

jade cypress
#

Here's the result of running that without --silent:


1926: resolve image config for docker.io/library/ubuntu:22.04 DONE
1926: > in from ubuntu:22.04
1926: resolve image config for docker.io/library/ubuntu:22.04 DONE
1934: pull docker.io/library/ubuntu:22.04
1934: > in from ubuntu:22.04
1934: resolve docker.io/library/ubuntu:22.04@sha256:e6173d4dc55e76b87c4af8db8821b1feae4146dd47341e4d431118c7dd060a74 [0.00s]
1934: pull docker.io/library/ubuntu:22.04 DONE
...
1933: exec apt-get update -yq
1933: [0.12s] panic: exec: "apt-get": executable file not found in $PATH
1933: [0.12s] 
1933: [0.12s] goroutine 1 [running, locked to thread]:
1933: [0.12s] main.shim()
1933: [0.12s]     /app/cmd/shim/main.go:267 +0xfd8
1933: [0.12s] main.main()
1933: [0.12s]     /app/cmd/shim/main.go:64 +0x5d
1933: exec apt-get update -yq ERROR: process "apt-get update -yq" did not complete successfully: exit code: 2

And locally:

[~/Work/Grafana/grafana-build] (main*) [i] » docker run --rm -it ubuntu:22.04@sha256:e6173d4dc55e76b87c4af8db8821b1feae4146dd47341e4d431118c7dd060a74 which apt-get
/usr/bin/apt-get
[~/Work/Grafana/grafana-build] (main*) [i] » docker run --rm -it ubuntu:22.04@sha256:e6173d4dc55e76b87c4af8db8821b1feae4146dd47341e4d431118c7dd060a74 apt-get update -yq
Get:1 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]

The common theme with the few errors that we're seeing is:

  1. They're pretty deep into the dag. We can get around them by reducing the size / complexity of the dag. The current dag has about 1000 steps.
  2. They're filesystem related; the container is missing a file or folder, or the folder / path isn't writeable.
#

regarding point 1, it makes sense why we would be seeing this now despite code changes, because we just enabled building Grafana for more architectures (150-200 extra steps each)

jade cypress
#

Looking more closely, our Docker container that has our dagger pipeline is using the docker client version 23.0, while our server version in our CI runners is now on 25.0. I'll see if using this combination locally causes it to fail the same way.
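For anyone wanting to check for the same kind of client/server skew, here is a minimal sketch. It assumes you have captured the output of `docker version --format '{{json .}}'` and that the JSON has `Client.Version` and `Server.Version` fields (with `Server` missing or null when the daemon is unreachable). `compare_docker_versions` is a hypothetical helper name, and the version strings in the example are illustrative, not copied from our runners.

```python
import json


def compare_docker_versions(version_json):
    """Return (client_version, server_version, same_major).

    version_json is the text printed by `docker version --format '{{json .}}'`.
    server_version is None and same_major is False if no server info is present.
    """
    info = json.loads(version_json)
    client = info["Client"]["Version"]
    server = (info.get("Server") or {}).get("Version")
    if server is None:
        return client, None, False
    # Compare only the major component, e.g. "23" vs "25".
    same_major = client.split(".")[0] == server.split(".")[0]
    return client, server, same_major


if __name__ == "__main__":
    # Illustrative input only; real output contains many more fields.
    sample = '{"Client": {"Version": "23.0.0"}, "Server": {"Version": "25.0.1"}}'
    print(compare_docker_versions(sample))
```

A differing major version doesn't prove anything is broken by itself; it's just a cheap signal for which combination to try reproducing locally.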

jade cypress
#

That did it.

I updated my docker daemon to 25.0.1 and ran it and that caused the failure.

maybe. double checking.

jade cypress