# Connection via dlb to dagger engine failing


midnight spire
#

Hello Team,
I recently upgraded the dagger engine from 0.6.2 to 0.11.8.
I have 14 dagger fleet servers. My application can connect directly to these servers using tcp://server-dns:8080,
but when I use a dlb pool (all dagger fleet servers are in this list) with tcp://dlb-dns-url:8080, the client throws an error.

The error comes from the actual production box, where the dagger cli calls dlb-dagger-fleet-dns. Things proceed fine up to some point, but in the end it fails:

container.from resolve: failed to resolve image example-image-registry: failed to resolve source metadata for example-image-registry: DeadlineExceeded: failed to get main client caller: no active session for i7u5i2x31lwni5thyi2iqo4ch: context deadline exceeded

For example, this works fine:

```
{
  container {
    from(address:"alpine") {
      withExec(args:["echo","hello!! this is test"]) {
        stdout
      }
    }
  }
}
EOF```

but this does not
```_EXPERIMENTAL_DAGGER_RUNNER_HOST=tcp://dlb-dns:8080 /home/gitlab-runner/.local/bin/dagger -v query <<EOF
{
  container {
    from(address:"alpine") {
      withExec(args:["echo","hello!! this is test"]) {
        stdout
      }
    }
  }
}
EOF```

The same setup works with the old engine 0.6.2 but not with 0.11.8.

Could you please guide me on how to fix this?
lament silo
#

so the internal mechanism by which dagger communicates between the client and the server changed somewhat recently - it now uses http directly, instead of grpc tunneling

#

could there potentially be something in the configuration there?

#

sorry, i'm not particularly familiar with dlb

midnight spire
#

We are already in production with this problem; otherwise we have to roll back. Please, if anyone really knows what has changed and what the best thing I can do here is 🙏

#

this is the engine.toml file:

insecure-entitlements = ["security.insecure"]

[worker.containerd]
  enabled = false

[grpc]
  address = ["tcp://0.0.0.0:8080"]

[worker.oci]
  enabled = true
  gc = true
  snapshotter = "overlayfs"
  max-parallelism = 64
  cniPoolSize = 32

  [[worker.oci.gcpolicy]]
    keepBytes = "100000MB"
    keepDuration = "0"
    all = false
lament silo
#

do you have configuration for dlb?

midnight spire
#

I am not allowed to share the exact configuration,
but if you tell me specifically what you would like to see, I can check.
I can share a few settings below:
port - 8080
mode - TCP
SSL - NO
Healthcheck Timeout - 10

lament silo
#

do you have session affinity enabled?

dry yoke
#

@midnight spire I assume that by dlb you mean dynamic load balancer? Are you using any known load balancer, or is it something internal to your org only?

midnight spire
#

we use a wrapper around haproxy

dry yoke
#

ok so I assume that if it works with regular haproxy it should work for your team, right?

#

I can try this tomorrow

midnight spire
#

@dry yoke let me know how this goes for you?

midnight spire
#

I worked with our haproxy team and they helped me fix it: we had to change the setting from balance: leastconn to balance: source

#

so that means, in theory, the dagger client is expected to make more than one connection to the same server. With the earlier setting this stickiness was not present: when the client makes its second connection (maybe due to its internal logic), it can land on a different server, which does not have the session
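
The stickiness point above can be illustrated with a small toy model. This is only a sketch under assumptions: the engine names, the `Fleet` class, and the use of round robin as a stand-in for a non-sticky policy such as leastconn are all hypothetical, and the CRC32 hash is not HAProxy's actual source-hash function.

```python
from itertools import cycle
import zlib

# Hypothetical fleet of 14 engines, matching the count in the thread.
SERVERS = [f"engine-{i}" for i in range(14)]


class Fleet:
    """Toy model: each engine only knows the sessions opened on it."""

    def __init__(self, servers):
        self.sessions = {s: set() for s in servers}

    def open_session(self, server, session_id):
        self.sessions[server].add(session_id)

    def has_session(self, server, session_id):
        return session_id in self.sessions[server]


def round_robin(servers):
    """Non-sticky policy (stand-in for leastconn on an idle pool)."""
    it = cycle(servers)
    return lambda client_ip: next(it)


def source_hash(servers):
    """Sticky policy: the same client IP always maps to the same server."""
    return lambda client_ip: servers[zlib.crc32(client_ip.encode()) % len(servers)]


def second_connection_finds_session(pick, fleet, client_ip, session_id):
    """Open a session on the first connection, then look it up on the second."""
    first = pick(client_ip)
    fleet.open_session(first, session_id)
    second = pick(client_ip)
    return fleet.has_session(second, session_id)


non_sticky = second_connection_finds_session(
    round_robin(SERVERS), Fleet(SERVERS), "10.0.0.1", "session-a")
sticky = second_connection_finds_session(
    source_hash(SERVERS), Fleet(SERVERS), "10.0.0.1", "session-a")
print("non-sticky finds session:", non_sticky)
print("sticky finds session:", sticky)
```

In this model the non-sticky policy sends the second connection to a server that never saw the session, which is the same shape as the "no active session" error above.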

#

but it has now created a hotspot problem: with balance: source, haproxy hashes the client IP and always assigns the same server from the pool, so the traffic split is not fair,
and we don't have many client IPs, as the dagger cli only runs from a few servers
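
The hotspot effect can be sketched the same way. This is a rough illustration, not HAProxy's real algorithm: the client IPs, engine names, and the CRC32-modulo hash are assumptions made up for the example.

```python
from collections import Counter
import zlib


def assign(client_ip, servers):
    """Source-style hashing: one client IP always maps to one server."""
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]


def distribution(client_ips, servers, connections_per_client=100):
    """Count how many connections each server receives."""
    counts = Counter({s: 0 for s in servers})
    for ip in client_ips:
        counts[assign(ip, servers)] += connections_per_client
    return counts


servers = [f"engine-{i}" for i in range(14)]   # hypothetical fleet of 14
clients = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical CI hosts

counts = distribution(clients, servers)
busy = [s for s, n in counts.items() if n > 0]
print(f"{len(busy)} of {len(servers)} servers receive traffic")
```

With only a handful of client IPs, no more than that many servers can ever receive traffic, however large the pool is.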

#

Need your thoughts on how to set the load-balancing parameters for the new engine.

midnight spire
#

Now, with this fix, builds are intermittently failing with 502. It looks like the dlb timeout needs to be increased.

The dagger context gets cancelled:

[2024-06-28T13:19:14Z | DEBUG | debug] 86  : ! context canceled
returned error 502 Bad Gateway: http do: Post "http://dagger/query": unexpected EOF
midnight spire
#

another thing I observed is that the dagger client sometimes closes the connection during a long-running process when going through haproxy, but I am not quite sure what's happening 😦

dry yoke
# midnight spire Need your thoughts on how to set the load-balancing parameters for the new engine.

we can't really advise a "recommended" way of doing this @midnight spire, since this seems to be a setup quite specific to what you're trying to achieve in your custom infra. What we generally recommend to companies and teams that need to scale their engines is to use ephemeral runners and leverage our distributed caching solution to optimize for efficiency

dry yoke
#

we have pipelines that run for hours without issues

#

seems to me these connection errors are quite likely to come from your dlb setup

dry yoke
#

@midnight spire I saw that you updated from v0.6.x to v0.11.x recently. A lot changed in how the client connects to the engine. @alpine reef shared a lot about these changes in a recent community call. You can check out the recording here: #1248166646321778738 message

midnight spire
#

Hey @dry yoke 👋 reiterating on this again:
even after upgrading to 0.13.0, this issue still persists.
If possible, could you please try this on your side with haproxy connected to a fleet of servers (at least two)? Please also try a long-running pipeline, because simple use cases work with the cli.
Maybe it will be reproduced on your side too.