#Last ditch effort to get Matter/Thread working on Docker

ivory light
#

Hi folks - I recently decided to try to move my HA environment over to Docker, underestimating the challenges of getting the Matter server working. I'll post more details in the thread, but in short I'm running Docker on Ubuntu with host mode networking. I have 1 Matter over Thread device, 4 border routers (mostly my Eero APs), and I have the device working...I can communicate with it...but every so often (sometimes 5 minutes, sometimes 5 hours) I get timeouts in my Matter logs and it has to re-subscribe.

I've tried everything I can think of, but nothing seems to help. I spun up a VM on the same host and it had 0 problems...so maybe it's "just Docker" but figured I'd ask here and see if anybody had any last ditch ideas.

#

The one device is an Aqara lock. I can ping it "ok", but the response times are pretty suboptimal (that said, this doesn't seem to impact HAOS, which had the same ping stats)

rtt min/avg/max/mdev = 834.004/1383.780/1941.100/523.315 ms,
#

The route to the device shows the 4 border routers:

fda0:939:eab7:1::/64 proto ra metric 100 expires 1597sec pref medium
    nexthop via fe80::a91:a3ff:fe4a:f773 dev enp2s0 weight 1 
    nexthop via fe80::22be:cdff:fe1d:8812 dev enp2s0 weight 1 
    nexthop via fe80::9e0b:5ff:fe85:bdf2 dev enp2s0 weight 1 
    nexthop via fe80::9e0b:5ff:fe84:bb92 dev enp2s0 weight 1 
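
For anyone who wants to track this over time rather than eyeball it, here is a minimal sketch that groups the `nexthop via ...` lines under their route prefix. It assumes the text comes from `ip -6 route show` in the format pasted above; the function name `parse_routes` is made up for illustration.

```python
import re

# Matches both "nexthop via X dev Y" lines and single-hop "PREFIX via X dev Y" lines.
VIA_RE = re.compile(r"\bvia (\S+) dev (\S+)")


def parse_routes(text):
    """Map each route prefix to a list of (next_hop, device) tuples."""
    routes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new route; its first token is the prefix.
            current = line.split()[0]
            routes.setdefault(current, [])
        match = VIA_RE.search(line)
        if match and current is not None:
            routes[current].append(match.groups())
    return routes


# The route from the post above.
SAMPLE = """\
fda0:939:eab7:1::/64 proto ra metric 100 expires 1597sec pref medium
    nexthop via fe80::a91:a3ff:fe4a:f773 dev enp2s0 weight 1
    nexthop via fe80::22be:cdff:fe1d:8812 dev enp2s0 weight 1
    nexthop via fe80::9e0b:5ff:fe85:bdf2 dev enp2s0 weight 1
    nexthop via fe80::9e0b:5ff:fe84:bb92 dev enp2s0 weight 1
"""

if __name__ == "__main__":
    for prefix, hops in parse_routes(SAMPLE).items():
        print(prefix, "->", len(hops), "next hops")
```

Logging the hop count every minute or so would show whether a border router drops out of the route around the time of a timeout.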
#

On the python-matter-server, I have ipv6 connectivity (the snippets above are from that container) and can run tcpdump to see inbound neighbor discovery packets.

#

I wrote a small scapy Python script to capture the IPv6 NDP RIO packets in that container and I consistently see those packets as well.
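
This is not the scapy script mentioned above, just a standard-library sketch of what such a check looks for: it walks the options of a raw ICMPv6 Router Advertisement and pulls out any Route Information Options (option type 24, RFC 4191). It assumes you already have the ICMPv6 payload bytes from some capture tool; `parse_rios` and the sample packet are invented for illustration.

```python
import ipaddress
import struct

RA_TYPE = 134        # ICMPv6 Router Advertisement
RIO_OPTION = 24      # Route Information Option (RFC 4191)
RA_HEADER_LEN = 16   # fixed RA fields before the options start


def parse_rios(icmpv6):
    """Return (prefix, preference_bits, lifetime_seconds) for each RIO in an RA."""
    if len(icmpv6) < RA_HEADER_LEN or icmpv6[0] != RA_TYPE:
        return []
    rios = []
    offset = RA_HEADER_LEN
    while offset + 2 <= len(icmpv6):
        opt_type = icmpv6[offset]
        opt_len = icmpv6[offset + 1] * 8  # option length is in units of 8 bytes
        if opt_len == 0 or offset + opt_len > len(icmpv6):
            break  # malformed option, stop parsing
        if opt_type == RIO_OPTION:
            prefix_len, flags, lifetime = struct.unpack_from("!BBI", icmpv6, offset + 2)
            if prefix_len <= 128:
                # The prefix field may be 0, 8 or 16 bytes; pad it to a full address.
                raw = icmpv6[offset + 8:offset + opt_len].ljust(16, b"\x00")[:16]
                network = ipaddress.IPv6Network((raw, prefix_len), strict=False)
                # Preference is the 2-bit field in the middle of the flags byte.
                rios.append((str(network), (flags >> 3) & 0b11, lifetime))
        offset += opt_len
    return rios


# Hand-built RA carrying one RIO for the prefix from the route table above,
# with an assumed lifetime of 1800 seconds.
SAMPLE_RA = (
    bytes([RA_TYPE, 0, 0, 0, 64, 0, 0, 0]) + bytes(8)
    + bytes([RIO_OPTION, 2, 64, 0]) + struct.pack("!I", 1800)
    + bytes.fromhex("fda00939eab70001")
)

if __name__ == "__main__":
    print(parse_rios(SAMPLE_RA))
```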

#

On the host system (Ubuntu 24.04 kernel 6.11.0-18-generic), I've tried pretty much every network config I can think of:

  • Set all the relevant ipv6 sysctl settings for all, default, and my enp2s0 interfaces
  • Disabled ip6tables firewall rules and set them to allow all
  • Even tried switching to NetworkManager

But alas, none of that has mattered (hah). I'm not sure if it even would since I'm running Docker in host mode.
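
For reference, a rough sketch of checking the sysctls that usually matter here. It relies on the kernel's documented semantics: `accept_ra=1` only accepts RAs while forwarding is off, `accept_ra=2` accepts them regardless, and `accept_ra_rt_info_max_plen` caps the prefix length of route info the kernel will accept from an RA. The helper names and the choice of 64 as the needed prefix length (the Thread route above is a /64) are my assumptions.

```python
from pathlib import Path

KEYS = ("forwarding", "accept_ra", "accept_ra_rt_info_max_plen")


def read_ipv6_sysctls(interface, root="/proc/sys/net/ipv6/conf"):
    """Read the relevant per-interface IPv6 sysctls; missing files become None."""
    values = {}
    for key in KEYS:
        try:
            values[key] = int((Path(root) / interface / key).read_text().strip())
        except (OSError, ValueError):
            values[key] = None
    return values


def ra_warnings(values, needed_plen=64):
    """Return human-readable warnings for settings that would drop RA routes."""
    warnings = []
    forwarding = values.get("forwarding")
    accept_ra = values.get("accept_ra")
    max_plen = values.get("accept_ra_rt_info_max_plen")
    if accept_ra == 0:
        warnings.append("accept_ra=0: router advertisements are ignored")
    elif accept_ra == 1 and forwarding == 1:
        warnings.append("accept_ra=1 with forwarding=1: RAs are ignored, accept_ra=2 is needed")
    if max_plen is not None and max_plen < needed_plen:
        warnings.append(
            f"accept_ra_rt_info_max_plen={max_plen}: /{needed_plen} route info is dropped"
        )
    return warnings


if __name__ == "__main__":
    # enp2s0 is the interface from this thread; substitute your own.
    for iface in ("all", "default", "enp2s0"):
        values = read_ipv6_sysctls(iface)
        print(iface, values, ra_warnings(values))
```

With host mode networking the container shares the host's network namespace, so these are the values the Matter server sees too.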

#

So maybe this is just "HAOS has been optimized for matter connectivity, so just use that you dummy"...but I figured I'd ask here before I go and reset everything. 🙂

ivory light
#

For reference, these are the error messages that show up in my matter server logs:

2025-03-05 17:10:29.716 (Dummy-2) CHIP_ERROR [chip.native.EM] <<5 [E:23048r with Node: <0000000000000001, 1> S:828 M:255440303] (S) Msg Retransmission to 1:0000000000000001 failure (max retries:4)
2025-03-05 17:15:48.624 (Dummy-2) CHIP_ERROR [chip.native.DMG] Subscription Liveness timeout with SubscriptionID = 0xf37ab654, Peer = 01:0000000000000001
2025-03-05 17:15:48.626 (MainThread) INFO [matter_server.server.device_controller] <Node:1> Subscription failed with CHIP Error 0x00000032: Timeout, resubscription attempt 0
...
2025-03-05 17:15:53.627 (MainThread) INFO [matter_server.server.device_controller] <Node:1> Re-Subscription succeeded
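
To put numbers on "every so often", a small sketch that pairs each liveness timeout with the next successful re-subscription and reports how long the gap was. It assumes the log format shown above (timestamp at the start of each line); the function name is made up.

```python
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def subscription_outages(log_text):
    """Return (timeout_time, seconds_until_resubscribed) for each outage."""
    outages = []
    pending = None
    for line in log_text.splitlines():
        match = TS_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "Subscription Liveness timeout" in line:
            pending = stamp
        elif "Re-Subscription succeeded" in line and pending is not None:
            outages.append((pending, (stamp - pending).total_seconds()))
            pending = None
    return outages


# Two of the log lines from above.
SAMPLE_LOG = """\
2025-03-05 17:15:48.624 (Dummy-2) CHIP_ERROR [chip.native.DMG] Subscription Liveness timeout with SubscriptionID = 0xf37ab654, Peer = 01:0000000000000001
2025-03-05 17:15:53.627 (MainThread) INFO [matter_server.server.device_controller] <Node:1> Re-Subscription succeeded
"""

if __name__ == "__main__":
    for started, seconds in subscription_outages(SAMPLE_LOG):
        print(started.isoformat(sep=" "), f"recovered after {seconds:.3f}s")
```

Lining the outage times up against route table snapshots should make it easier to tell a routing problem from a radio one.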

south steppe
#

Ping will be slow for battery powered devices fwiw

ivory light
#

Makes sense.

south steppe
#

The radio gets switched off to save power, and the nearest mains powered router buffers messages until it checks in.

#

You can even see messages get batched together, so responses arrive in batches too

ivory light
#

Ah, ok. I've definitely observed that.

south steppe
#

Do you have an mdns browser to hand? Can you see any trel records from your border routers?

#

You either need TREL, or all the APs need to be meshing over Thread. If they aren’t, you might find that turning off one of the APs (trial and error) helps.

south steppe
#

Your route table looks good. If your sysctls were wrong you wouldn’t have any nexthop. The fact that your device is pingable tells me that the basics are good.

#

The thing you might struggle with is stale routes. That’s why HAOS carries kernel patches.

#

When your host is configured for IPv6 forwarding, it gets treated as a router under the kernel’s reading of the RFC, and the kernel disables some of its dead neighbour detection

#

So routes might be used for up to half an hour after they become stale

#

If the kernel has chosen that stale route over the others, packets just won’t get delivered until it chooses a different one, which without dead neighbour detection will take a while.

#

I mitigate that in my environment by just running the matter server with a macvlan interface rather than host mode networking. Forwarding is disabled inside the container. All the sysctls to make matter work are applied inside the container. Then I can still use forwarding on the host’s network.

#

There are other mitigations, iirc. But I can’t advise on that.

#

You would see extra routes for a 30 minute window around any outage if that was the cause, though.

#

(Stale ones and their replacements)
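
One way to look for those extra routes is to diff next hops between two snapshots. A minimal sketch, assuming each snapshot is the text of `ip -6 route show` saved a few minutes apart; the function names and the `fe80::1234` address are made up for the example.

```python
import re


def nexthops(route_text):
    """Collect every next-hop address that appears after 'via' in route output."""
    return set(re.findall(r"\bvia (\S+) dev", route_text))


def nexthop_changes(before_text, after_text):
    """Report which next hops appeared or disappeared between two snapshots."""
    before, after = nexthops(before_text), nexthops(after_text)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


BEFORE = """\
fda0:939:eab7:1::/64 proto ra metric 100 expires 1597sec pref medium
    nexthop via fe80::a91:a3ff:fe4a:f773 dev enp2s0 weight 1
    nexthop via fe80::22be:cdff:fe1d:8812 dev enp2s0 weight 1
"""

# Second snapshot: one hop replaced by a hypothetical new address.
AFTER = """\
fda0:939:eab7:1::/64 proto ra metric 100 expires 1790sec pref medium
    nexthop via fe80::a91:a3ff:fe4a:f773 dev enp2s0 weight 1
    nexthop via fe80::1234 dev enp2s0 weight 1
"""

if __name__ == "__main__":
    print(nexthop_changes(BEFORE, AFTER))
```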

ivory light
#

an mdns browser
I've got Flame on my mac - I see trel records on 3 of my 4 border routers (3 eeros, 1 random Amazon device). I don't know much about thread so just ticked the "enable thread" box on my eero network.

south steppe
#

NetworkManager - not a bad idea, but you need to be sure it has the HAOS patch, otherwise that will also not work. The patch was merged relatively recently, so I couldn’t be sure it's in 24.04. Without that patch, multi border router setups aren’t really tenable.

#

I’d back it out at the moment, tbh

#

Is the non eero device anywhere near the eeros / aqara?

ivory light
#

running the matter server with a macvlan interface rather than host mode networking
That was one route I started to go down as well - do you have details on how you set that up? I wasn't sure what exact settings I needed there, or whether DHCP was working right when I first tried it, as the device got an IP of .2 on my network, which seemed odd.
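
On the .2 address: Docker's macvlan driver does not use the LAN's DHCP server. Docker hands out addresses itself from the subnet you give it, so the first container most likely just got the first free address after the gateway. A rough sketch of building a `docker network create` command that confines Docker to a small range instead; the network name, subnet, gateway and range below are placeholders, and this only covers the IPv4 side, not the IPv6 routing.

```python
import ipaddress
import shlex


def macvlan_network_args(name, parent, subnet, gateway, ip_range):
    """Build the argument list for creating a macvlan network with a fixed IP range."""
    network = ipaddress.ip_network(subnet)
    if ipaddress.ip_address(gateway) not in network:
        raise ValueError("gateway must be inside the subnet")
    if not ipaddress.ip_network(ip_range).subnet_of(network):
        raise ValueError("ip_range must be inside the subnet")
    return [
        "docker", "network", "create",
        "-d", "macvlan",
        "-o", f"parent={parent}",   # physical interface the macvlan attaches to
        "--subnet", subnet,
        "--gateway", gateway,
        "--ip-range", ip_range,     # keep Docker's allocations out of the DHCP pool
        name,
    ]


if __name__ == "__main__":
    # enp2s0 is the interface from this thread; everything else is a placeholder.
    args = macvlan_network_args(
        "matter_lan", "enp2s0", "192.168.1.0/24", "192.168.1.1", "192.168.1.240/28"
    )
    print(shlex.join(args))
```

Picking a range outside the router's DHCP pool avoids address clashes with other devices on the LAN.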

south steppe
#

(Assuming it’s the eeros that have thread and the rando device that doesn’t)

ivory light
#

The non eero device is, I think, right next to the lock. Actually bought that device specifically for it as the lock is in the garage/away from the house and wanted something close for it to communicate with.

south steppe
#

I’m on k8s, with multus to configure multihomed containers and custom things to bypass multus bugs, and also on my phone, so not right now and not in a fork that’d be immediately useful.

ivory light
#

Ah, ok.

#

I'll dig more into the macvlan stuff then, knowing that's at least a workable route.

south steppe
#

so the eeros are all “far” from the lock

ivory light
#

The closest eero is probably 15-20ft (5-6m).

south steppe
#

What has it got to go through?

#

Wood? Plaster?

#

Thick walls, old walls etc

ivory light
#

House wall + wood, maybe some chicken wire (1940s house). 😢

south steppe
#

Any other 2.4GHz interference possible? Zigbee? Bluetooth?

ivory light
#

No zigbee devices. Def some bluetooth likely. Also have an ambient weather device on 2.4GHz.

south steppe
#

The chicken wire is a concern for sure; we’ve got mesh on our shed (fox proofing) and it’s a pita for getting any signals through.

#

OK, what I’d be tempted to do is turn Thread off on your eeros and see if it helps

#

Then you’ll force traffic to pop out the rando device near the aqara.

ivory light
#

Interesting - might need to figure out how/if to create a different thread network. I do have 2 thread networks showing up in HA.

#

@south steppe Thank you much - this at least gives me a couple new things to try. 😅

south steppe
#

Ping me if it helps or not; uk TZ

ivory light
#

Tried the macvlan approach for a bit, but I can't for the life of me get the ipv6 routing to work right (probably a "me" issue...just don't know what things to plug in).

Now I'm trying a quick k3s install...was able to get the lock added without too much of an issue (I apparently hit the controller limit on the lock with all my attempts at adding/removing it to different HA installs)...will see how this goes.

ivory light
#

1 hour later...no connection issues in k3s. 😔

ivory light
#

So after k3s worked well, I uninstalled it and re-added the device in my old docker environment while I investigated spinning that up.

I don't know if it was removing an old controller from the lock, rebooting the lock, or the fact that the k3s install actually enabled ipv6 forwarding ... but now my docker setup is working better than it has in the past week. 😢 🤷‍♂️

south steppe
#

Weee

#

Pretty confident it’s not a forwarding problem. If forwarding was off, you would not have got as far as pinging the aqara.