#Dagger servers are not sshable


brittle storm
#

Hello everyone 🙂
We are running Dagger Engine 0.13.0 on a remote, dedicated fleet of high-end servers.
Some of the servers occasionally lose connectivity; on investigation we found that these servers are not even SSH-able (they lose SSH access entirely).

We suspect that the Docker IP range is exhausted, or maybe Dagger is not releasing addresses cleanly.

Any thoughts or ideas? We urgently need to fix this issue.

honest latch
#

do you have any server logs or metrics that could help debug the issue? e.g. memory, disk space, or other resource usage, plus the trailing logs?

#

is there a pattern to the connection losses, e.g. is it always on the same dagger call command? also, which language SDK are you using?

brittle storm
#

we are using the Go SDK 0.13.0. Through the SDK we call dagger.Connect once and then just create multiple containers (without modules). Here are some logs captured from the Dagger engine:

```
level=warning msg="failed to release network namespace \"lajeaobvk5wqcpo68kg5a7gar\" left over from previous run: plugin type=\"loopback\" failed (delete): unknown FS magic on \"/var/lib/dagger/net/cni/lajeaobvk5wqcpo68kg5a7gar\": ef53"
buildkitd: failed to create engine: failed to create network providers: CNI setup error: plugin type="bridge" failed (add): failed to list chains: running [/usr/sbin/iptables -t nat -S --wait]: exit status 3: iptables v1.8.10 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

level=warning msg="failed to release network namespace \"330nr9iuuac1uy61gr8mwot5v\" left over from previous run: plugin type=\"loopback\" failed (delete): unknown FS magic on \"/var/lib/dagger/net/cni/330nr9iuuac1uy61gr8mwot5v\": ef53"
level=warning msg="failed to release network namespace \"4putf26xs0qfx52e8rgstwcbh\" left over from previous run: plugin type=\"loopback\" failed (delete): unknown FS magic on \"/var/lib/dagger/net/cni/4putf26xs0qfx52e8rgstwcbh\": ef53"
```
teal spear
#

@brittle storm can you try running sudo modprobe iptable_nat on your servers and check if that message goes away?
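
To confirm the module is still loaded later on (for example after a reboot), a small check along these lines might help. This is only a sketch: `module_loaded` is a hypothetical helper name, and it assumes the module is built as a loadable module, since built-in modules do not show up in /proc/modules.

```shell
# Return success if the named kernel module appears in a /proc/modules-style
# listing. The optional second argument lets the check run against a saved
# copy of the listing; it defaults to the live /proc/modules.
module_loaded() {
    mod="$1"
    listing="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^${mod} " "$listing"
}

# Usage:
#   module_loaded iptable_nat && echo "loaded" || echo "missing"
```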

#

generally, when you get the buildkitd: failed to create engine: failed to create error message, the engine doesn't start, so I'm curious how it still seems to be working for you 🤔

brittle storm
#

those logs are just some of the things I find suspicious. Basically, Dagger cannot connect whenever the servers are not SSH-able

#

> can you try running sudo modprobe iptable_nat on your servers and check if that message goes away?
we are already doing this on our servers

hardy yoke
#

@brittle storm could you look at disk usage for the server? It's possible that the disk has filled up.

brittle storm
#

Our storage usage always stays below 70% thanks to fine-tuned GC settings

teal spear
#

@brittle storm would it be possible to check the number of connections and open files on some of your servers?

#

If they're constantly growing that'd mean that there's probably something we're not releasing correctly
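
One way to see whether the numbers are actually growing is to sample them periodically and compare the first and last readings. The following is a rough sketch, not a recommended monitoring setup: `count_tcp` and `net_change` are hypothetical helper names, and the log path and 60-second interval are arbitrary assumptions.

```shell
# Count lines mentioning "tcp" in `netstat -an`-style output read from stdin.
# `|| true` keeps the exit status clean when grep finds no match (it still prints 0).
count_tcp() {
    grep -c 'tcp' || true
}

# Read one numeric sample per line on stdin and print last minus first.
net_change() {
    awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR > 0) print last - first }'
}

# Example collection loop (left commented out; the log path is hypothetical):
#   while true; do
#       printf '%s %s %s\n' "$(date +%s)" "$(netstat -an | count_tcp)" "$(sudo lsof | wc -l)"
#       sleep 60
#   done >> /tmp/dagger-fd-trend.log
#
# Net change in open files over the collected window (third column of the log):
#   awk '{ print $3 }' /tmp/dagger-fd-trend.log | net_change
```

A steadily positive net change across several hours would point towards a leak; a flat or oscillating value would not.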

#

We've tried to get a quick repro with @hardy yoke but we weren't able to reproduce a similar behavior

brittle storm
#

> would it be possible to check the number of connections and open files on some of your servers?
@teal spear do you mean TCP connections on the server and open file descriptors?

brittle storm
#

From one of those Dagger servers:

netstat -an | grep 'tcp' | wc -l
110

sudo lsof | wc -l
70061
teal spear
#

mind sharing what most of those lsofs are waiting on?
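
A quick way to summarize this, rather than sharing the whole output, could be to group the lsof rows by command name. This is a sketch under the assumption that lsof is run with its default column layout, where the first column is COMMAND; `top_fd_holders` is a hypothetical helper name. Keep in mind that lsof rows also include entries such as cwd, txt, and mem, so the row count is not the same as the number of open file descriptors.

```shell
# Read `lsof` output on stdin and print the top N command names by number of
# rows (default 10). The header line is skipped; column 1 is COMMAND.
top_fd_holders() {
    n="${1:-10}"
    awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn | head -n "$n"
}

# Usage:
#   sudo lsof | top_fd_holders 5
```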

#

cc @feral jetty

#

looks like a potential dagger/dagger issue

brittle storm
#

@teal spear @feral jetty do you mean the whole lsof output, or is there a specific option you'd like me to run the command with? Let me know

#

Our servers are hitting these problems almost every day, so we urgently need a fix. Please treat this as a priority 🙏

teal spear
#

are you closing the dagger sessions after you use them? What SDK are you currently using?

brittle storm
teal spear
#

FWIW, @brittle storm shared an lsof output with me, and it seems Dagger is not the process holding the file descriptors, so there's nothing so far indicating that this is the issue

#

@brittle storm mind running this command also in one of your engines to verify how many network namespaces have been created?
docker exec $engine_container ls -la /var/lib/dagger/net/cni | wc -l

#

this should give us a hint about whether they're being released correctly
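
Note that `ls -la | wc -l` typically also counts the "total" line plus the `.` and `..` entries, so the number it reports is usually three higher than the actual number of namespaces. A variant that counts only real entries might look like the sketch below; `count_entries` is a hypothetical helper name, and the docker exec form assumes the engine image ships a shell and an `ls` that supports `-A`.

```shell
# Count entries in a directory, including dotfiles but excluding "." and "..".
# `tr -d ' '` strips the padding some wc implementations add.
count_entries() {
    ls -A "$1" | wc -l | tr -d ' '
}

# Usage inside the engine container ($engine_container as in the command above):
#   docker exec "$engine_container" sh -c 'ls -A /var/lib/dagger/net/cni | wc -l'
```

Running it a few times while pipelines start and finish should show whether the count falls back down or only ever climbs.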

brittle storm
#

I ran the same command on some of the servers (4 of them); it returned 35

teal spear