Dagger Engine health check | Dagger

dire totem Jan 10, 2025, 12:24 PM

#

Hello!

Is there a health check endpoint that I can point an AWS ALB at for a Dagger engine hosted in AWS ECS? The ALB health check is quite flexible, except that it does not support treating 500 response codes as healthy.

Currently querying the URL responds with 500 and:

failed to get client metadata: failed to JSON-unmarshal X-Dagger-Client-Metadata: unexpected end of JSON input

The ALB health check can accept response codes in the 200-499 range, but it does not allow adding an extra body/JSON payload to the check.

Thanks,

ionic nebula Jan 10, 2025, 1:06 PM

#

👋 this should work @dire totem

 curl -H "x-dagger-client-metadata: e30=" localhost:12345/
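For context, the header value `e30=` in the command above is the base64 encoding of an empty JSON object (`{}`). Here is a minimal sketch of how such a value can be produced. The helper name `client_metadata_header` is made up for illustration, and the idea that the engine wants base64-encoded JSON in this header is inferred from the error message and the suggested value, not taken from Dagger documentation:

```python
import base64
import json
from typing import Optional


def client_metadata_header(metadata: Optional[dict] = None) -> str:
    """Return base64-encoded JSON for use as an x-dagger-client-metadata value.

    Hypothetical helper: the encoding is an assumption based on the thread.
    """
    # Serialize compactly; an empty dict becomes the two characters "{}".
    payload = json.dumps(metadata or {}, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


header_value = client_metadata_header()
print(header_value)
```

This only explains where the value comes from. It does not help with the ALB, which cannot send custom headers on health checks.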

dire totem Jan 10, 2025, 1:24 PM

#

Hey! Thanks for the reply. Is there a way to do it without setting a header? That is, is there an endpoint that doesn't expect it, or does Dagger expect that header on all requests?

dire totem Jan 10, 2025, 2:07 PM

#

Also, I tried using that header just to test, and I now get the following error:

dagger-engine-1  | time="2025-01-10T14:05:59Z" level=debug msg="handling http request" client_hostname= client_id= contentType=application/grpc method=POST path=/moby.buildkit.v1.Control/Info session_id= span=bd98860c0ee915e4 spanID=bd98860c0ee915e4 trace=48069fa7577df0ff20f4afdfa319c67b traceID=48069fa7577df0ff20f4afdfa319c67b upgradeHeader=
dagger-engine-1  | time="2025-01-10T14:05:59Z" level=error msg="failed to serve request" error="get or init client: session ID is required" method=POST path=/moby.buildkit.v1.Control/Info

#

I am trying to set up nginx as a proxy in front of the Dagger engine. That would let me configure a health check on the nginx side that always returns 200, and proxy the rest of the requests to the Dagger engine.
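A lighter variant of this idea is a small health-check sidecar that does not proxy anything. It answers HTTP 200 on its own port only while the engine's TCP port accepts connections. This is a rough sketch under assumptions, not a tested deployment. The engine address `127.0.0.1:12345` is borrowed from the curl example earlier in the thread, the names `serve_health` and `engine_is_listening` are made up, and it relies on the ALB target group health check being pointed at the sidecar's port instead of the traffic port:

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the engine listens on this address inside the task.
ENGINE_ADDR = ("127.0.0.1", 12345)


def engine_is_listening(addr, timeout=1.0):
    """Return True if a TCP connection to addr succeeds within timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def make_handler(engine_addr):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 200 while the engine port accepts connections, 503 otherwise.
            healthy = engine_is_listening(engine_addr)
            body = b"ok\n" if healthy else b"engine unreachable\n"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Keep health-check polling out of the logs.
            pass

    return HealthHandler


def serve_health(listen_port, engine_addr=ENGINE_ADDR):
    """Start the health endpoint in a background thread and return the server."""
    server = HTTPServer(("0.0.0.0", listen_port), make_handler(engine_addr))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_health(8080)` and keeping the process alive would expose the check on port 8080 (an arbitrary choice). This only covers the health check; it says nothing about whether the ALB forwards the engine's gRPC and HTTP traffic correctly.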

ionic nebula Jan 10, 2025, 2:12 PM

#

If you're using Dagger behind an ALB, why not just use a TCP probe?

#

as soon as the engine listens on the TCP port, it's ready to start accepting requests
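In practice a TCP probe is just an attempt to open a connection to the port. Here is a minimal sketch of the idea; the function name `tcp_probe` is hypothetical, and `localhost:12345` is taken from the curl example earlier in the thread:

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if host:port accepts a TCP connection within timeout."""
    try:
        # A successful connect means something is listening; no data is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


print(tcp_probe("localhost", 12345))
```

A load balancer's TCP health check does roughly the same thing from its side, which is why no header or response code is involved.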

#

cc @dire totem

dire totem Jan 10, 2025, 2:14 PM

#

Unfortunately ALB does not support simple TCP health checks 😅. On the side, I will look at the NLB configuration, which does support plain TCP checks.

#

The engine itself works. The problem is that AWS ECS keeps restarting it every few minutes because it considers it unhealthy: all endpoints return 500 if the header is not set, and ALB does not let you configure that header on health checks.

ionic nebula Jan 10, 2025, 2:16 PM

#

well.. yes, NLB seems like a better solution here, since it doesn't make sense to load-balance across multiple engines at the per-request level

#

you'll have to set up some sort of client-ip:port stickiness or something like that

dire totem Jan 10, 2025, 2:17 PM

#

I will keep that in mind, appreciate you taking a look - thanks!

rigid thorn Jan 10, 2025, 6:03 PM

#

cc @soft mantle

soft mantle Jan 10, 2025, 7:29 PM

#

Hi @dire totem, are you using AWS ECS with EC2 or Fargate? I tried to deploy Dagger Engine on Fargate but it is not possible due to the runtime (see this https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-security-considerations.html)

If you are using EC2, you certainly can run Dagger Engine on the host with privileged mode. However, ALB health checks are going to be a pain.

I ended up deploying the Dagger Engine to EC2 and auto-scaling groups with a Network LB to work around the issue. Granted you lose things like security groups when switching to NLBs but if the LB is internal it should not be too much of an issue.

Also, that 500 might be a sign of another error, do you have any logs from the engine output?

dire totem Jan 10, 2025, 10:16 PM

#

Hey @soft mantle I am deploying it to ECS on EC2 for the reasons you highlighted. The engine itself runs well in the container. I can connect to the container IP directly and everything is working. It’s only when the access is proxied via the ALB that I start running into the issues. I will explore NLB to see if that’s a viable option for us. Thanks for the input and your reference!

soft mantle Jan 10, 2025, 10:39 PM

#

Actually, not sure if you can share your container definitions from your task definition, but if you are exposing the engine directly - it might be easy to expose another container in the definition file and have it act as a reverse proxy to the engine using its hostname (think like docker compose services - web can ping db when in the container).

If you are comfortable with Caddy, it would be easier to use their image and pass these args like this example https://caddyserver.com/docs/quick-starts/reverse-proxy#command-line. That would make it easier than trying to stuff a configuration file into a custom image just to configure nginx. Happy to give you a better example if you need it!

dire totem Jan 10, 2025, 10:42 PM

#

Awesome! This is what I was thinking about doing with NGINX. I will give it a try with Caddy. Essentially, if I can get the LB health check to receive a 200 (served by Caddy) and have the rest of the requests reverse proxied to the engine in the same task, it should work. I will share my task def when I can. Currently on mobile

#

When I was playing around earlier, I had some issues with the NGINX reverse proxy because some requests from the Dagger CLI are gRPC, some are HTTP, and (I think) they require HTTP/2. The only way I was able to successfully proxy a session from the CLI was to use a stream.

Do you have any extra details on this? All good if not, I can try to set this up with Caddy

dire totem Jan 13, 2025, 3:19 PM

#

@soft mantle just FYI - I was not able to get the Caddy or NGINX configuration set up to route both the HTTP and gRPC calls properly. In the end I just switched from ALB to NLB so that we can do a simple TCP health check instead

ionic nebula Jan 13, 2025, 3:36 PM

#

@dire totem are you planning to put multiple engines behind the same LB?

#

How are you planning to load balance across them?

dire totem Jan 13, 2025, 4:24 PM

#

@ionic nebula Hey, we are not planning on load balancing between multiple dagger engines (until there is a way we can use shared caching). This is just so that we don't have to reference the IP address of the container directly and use the load balancer's DNS name

#

Is there a way to load balance across multiple engines or would we need to ensure that we have sticky sessions for clients?

ionic nebula Jan 13, 2025, 5:40 PM

#

no, there's no way to load balance across engines, so you'd need to find a way to have sticky sessions per client
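To illustrate what per-client stickiness means here, this is a small sketch that always maps the same client IP to the same engine by hashing the address. It is purely illustrative: the function name `pick_engine` and the engine addresses are invented, and a real load balancer's stickiness feature would do this for you rather than application code:

```python
import hashlib

# Hypothetical engine addresses, for illustration only.
ENGINES = ["10.0.1.10:12345", "10.0.1.11:12345", "10.0.1.12:12345"]


def pick_engine(client_ip: str, engines: list) -> str:
    """Deterministically map a client IP to one engine from the list."""
    # A stable hash (unlike Python's built-in hash()) gives the same
    # result across processes and restarts.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(engines)
    return engines[index]


print(pick_engine("192.0.2.15", ENGINES))
```

Note that with this simple modulo scheme, adding or removing an engine can remap existing clients, which would break their sessions.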

#

if you don't have the requirement to load-balance across engines, why the ALB / NLB in the first place?

dire totem Jan 13, 2025, 6:16 PM

#

Mainly it is for abstracting the container IP away in case of restarts and redeployments/upgrades of the engine

#

Now locally and on CI, I can configure the dagger engine env var and not need to worry about it changing
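As a sketch of what this can look like when the CLI is launched from a script: the environment variable name `_EXPERIMENTAL_DAGGER_RUNNER_HOST` and the `tcp://` scheme are my assumption about which variable is meant here (the thread does not name it), the host name is a placeholder, and port 12345 is borrowed from the curl example earlier in the thread:

```python
import os


def dagger_env(engine_host: str, port: int = 12345) -> dict:
    """Return a copy of the current environment that points at a remote engine.

    Assumption: the variable name and tcp:// scheme are not confirmed
    by the thread itself.
    """
    env = dict(os.environ)
    env["_EXPERIMENTAL_DAGGER_RUNNER_HOST"] = f"tcp://{engine_host}:{port}"
    return env


# Placeholder DNS name standing in for the load balancer's address.
env = dagger_env("dagger-engine.internal.example.com")
print(env["_EXPERIMENTAL_DAGGER_RUNNER_HOST"])
```

The resulting mapping could be passed as `env=` to `subprocess.run` when invoking the CLI, so the address lives in one place even if the container behind the load balancer is replaced.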

#

(This is all because we don’t use k8s within our org but rely on AWS ECS)

soft mantle Jan 13, 2025, 6:39 PM

#

bummer, I know Matt Holt (Caddy maintainer) has a layer4 module for Caddy (https://github.com/mholt/caddy-l4) that he open sourced a while ago. If the NLB works for you then that should work.

I did a similar thing but ran Dagger Engine on EC2 with Fedora CoreOS (for auto updates and config) and then put the LB in front of those instances. Not sure that would make things easier, but might be worth documenting how to do something like that for future reference (e.g. using Dagger in limited environments)

ionic nebula Jan 13, 2025, 7:37 PM

#

Mainly it is for abstracting the container IP away in case of restarts and redeployments/upgrades of the engine

maybe it's easier to attach an ENI to the ECS task and assign an EIP to it? I think that would also be cheaper than the NLB approach

ionic nebula Jan 13, 2025, 8:00 PM

#

in any case, NLB will just work as it's quite common to use it with ECS tasks as well
