#Thread devices lose comms every 6 hours on the hour


brave trout
#

Not sure what the "right" way of raising this is, but have also logged a GitHub ticket for the same problem: https://github.com/home-assistant/addons/issues/4003. Please refer to this for more information.

TL;DR: at 5am, 11am, 5pm and 11pm, all my Thread devices will drop out and take 30-45 minutes to regain connection. The time intervals are like clockwork, and I cannot for the life of me work out what is causing the supposed 6 hour schedule.

Network: Ubiquiti
HAOS: RPI4B

Issue occurs on:

  • SkyConnect ZBT-1 on RCP only firmware
  • SMLIGHT SLZB-06MG24 on RCP only firmware
  • Google Streamer 4K (tested as of an hour ago, and all Thread devices drop as of 5pm).

Matter over WiFi devices are unaffected, it is ONLY Thread devices that fall over (which unfortunately make up 95% of my lights, sensors and power plugs).

Any help or support direction with this would be greatly appreciated.


glass bridge
#

The thread where this is being discussed is titled "Adding second POE OTBR?" tagged with "matter" as well. It doesn't finally solve the issue but shows some nice graphs where the breakdown is well visible. Maybe you can find some similarities to your setup giving some hints.

EDIT: Maybe this link works:
https://discord.com/channels/330944238910963714/1355964645436817540

brave trout
#

Ah yes, had seen that thread (second link worked).

The intervals are not WiFi AP related as if I reboot my RPI at a different time, the 6 hour interval correlates to when the RPI was booted.

#

To me, this limits the behaviour to something within the RPI HAOS implementation or something Thread specific (again, this does not affect Matter over WiFi devices in any shape or form).
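The boot-time correlation described above can be checked by projecting the expected dropout times from a boot timestamp and comparing them with when the devices actually went unavailable. A minimal sketch; the function name and the boot time are hypothetical values chosen for illustration:

```python
from datetime import datetime, timedelta

def expected_dropouts(boot_time, interval_hours=6, count=4):
    """Return the next `count` timestamps spaced `interval_hours` apart after boot."""
    step = timedelta(hours=interval_hours)
    return [boot_time + step * i for i in range(1, count + 1)]

# Hypothetical boot time; replace with the real boot time of the HAOS host.
boot = datetime(2025, 1, 1, 5, 0)
for t in expected_dropouts(boot):
    print(t.strftime("%Y-%m-%d %H:%M"))
```

If the observed dropouts line up with these projections after each reboot, that supports the idea that the trigger is a timer on the host rather than something on the wider network.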

hard sentinel
#

@brave trout so there is a known issue with the Google TV Streamer, where it doesn't respond properly to mDNS requests when there are too many devices on the Thread network. In my case, the devices didn't come back online until another BR got selected as SRP server (handling mDNS on behalf of devices). Does the problem go away if you remove the Google TV Streamer?

brave trout
#

@hard sentinel I don't normally use the Thread component of the Streamer, but did for troubleshooting purposes (as I forgot they had a Thread capability). I primarily use the ZBT-1, though am wanting to move away from it and use the SLZB-06MG24.

#

That said, the issue has somehow degraded further. Thread is now dropping anywhere from every 90 minutes to every 6 hours.

#

I only have the graph measuring my Matter lights (as Home Assistant seems to freak out when the device total breaches 50, but I use auto-entities to track the other MoT devices).

#

I've also migrated my HAOS instance away from the RPI4B I was using onto a VM (Unraid), and the behaviour persists.

hard sentinel
#

@brave trout so what is the Thread integration dialog showing for your Thread network, is the Google TV Streamer part of your Thread network or not? 🤔

brave trout
#

please don't get hung up on the Streamer. It is not in use for Thread purposes.

#

For the time it was in use, yes it was part of the home-assistant Thread network

#

as is/was the ZBT-1

#

and the SLZB

#

I have tried a combination of Thread devices for OTBR use, however several forum posts suggest that multiple OTBRs make things worse (even those supporting TREL). My preference is redundancy, but Thread clearly isn't a fan.

#

as it stands, it is only the SLZB in use.

hard sentinel
#

Yeah, I mean the Streamer is known to cause problems with devices, so that is why I just want to make sure it is no longer part of the network.

I personally don't have experience with the SLZB. I guess that means you use the regular OTBR add-on and tunnel the RCP protocol over Ethernet, is that correct? Do you have any errors in the OTBR logs (specifically radio tx timeout 🤔)?

brave trout
#

the RPI4B has a bug (which has been reported, but no one has looked into it on GitHub) where the dummy TTY devices the OTBR addon needs in order to start do not appear. For a time, I still had to have the ZBT-1 connected in order to have the SLZB do RCP (without a USB device selected, the OTBR addon does not start)

#

I don't know whether that would constitute two 15.4 radios running at the same time and causing interference; however, now that HAOS is on a VM, the dummy TTY devices are available, so that limits that part (for what it's worth)

hard sentinel
#

Yeah the dummy tty shouldn't matter at all, it's not used. It's just because we can't make it optional in the add-on config. Maybe it would be better to have a separate OTBR add-on for networked radios 🤔

hard sentinel
#

But anyhow, can you check the logs? Anything which coincides with the timing of the Thread device drop-offs you've posted above (like around ~14:00 yesterday)?

#

You can get logs with timing information in the console using -v, e.g. ha addons logs core_openthread_border_router -v.

brave trout
#

the OTBR logs at the moment are littered with 'failed to send IPv6 UDP messages', however most of the MoT devices are working (I say most as after a full drop, the network never fully recovers). This is the same across the ZBT-1, SLZB and Streamer. When the addon first starts, it shows sending/receiving without issue.

Before you say it's probably network related, I won't rule it out but I would like some validation it is network as the devices work for hours at a time before it completely dies.

#

if I restart the OTBR addon, the "failed" messages disappear

#

example:

05:39:55.566 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:90, chksum:2393, ecn:no, to:0xd400, sec:yes, error:NoAck, prio:low, radio:15.4
05:39:55.566 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329
05:39:55.566 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:cbb6:786a:c4c9:fb19]:5540
05:40:53.489 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:90, chksum:8185, ecn:no, to:0x2c00, sec:yes, error:NoAck, prio:low, radio:15.4
05:40:53.489 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329
05:40:53.489 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d]:5540
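To see whether the NoAck failures cluster on a few destinations or are spread across the whole mesh, the log can be tallied per destination. A rough sketch, assuming the three-line entry layout shown in the example above (a "Failed to send" line followed by `src:` and `dst:` lines); the function name is made up for illustration:

```python
import re
from collections import Counter

# Pair each "Failed to send" entry with the first dst:[...] that follows it.
FAILED = re.compile(r"Failed to send.*?dst:\[([0-9a-f:]+)\]:(\d+)", re.S)

def failed_destinations(log_text):
    """Count 'Failed to send' entries per (destination address, port)."""
    return Counter((addr, int(port)) for addr, port in FAILED.findall(log_text))

log = (
    "05:39:55.566 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:90, chksum:2393, "
    "ecn:no, to:0xd400, sec:yes, error:NoAck, prio:low, radio:15.4\n"
    "05:39:55.566 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329\n"
    "05:39:55.566 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:cbb6:786a:c4c9:fb19]:5540\n"
    "05:40:53.489 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:90, chksum:8185, "
    "ecn:no, to:0x2c00, sec:yes, error:NoAck, prio:low, radio:15.4\n"
    "05:40:53.489 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329\n"
    "05:40:53.489 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d]:5540\n"
)
counts = failed_destinations(log)
for (addr, port), n in counts.most_common():
    print(n, addr, port)
```

The destination addresses can then be looked up in the Zeroconf browser to map them back to Matter nodes.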

hard sentinel
#

Is fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8 an IPv6 of your HA host?

brave trout
#

nope. fdf0:d64f:d104:fdd1:255c:e1ab:e09a:ef38 is.

hard sentinel
#

That is an entirely different network. So what is fd7c.. then? 🤔

brave trout
#

there are several ULA subnets coming from wpan0, which my understanding is the Thread networks doing

#

fd6a:9a1c:a68c:d6bb::/64 dev wpan0 metric 64
fd7c:4664:bcba:1::/64 dev wpan0 metric 64
fdf0:d64f:d104:fdd1::/64 dev enp1s0 metric 100

#

enp1s0 is the NIC

#

the rest of the v6 addresses are all fe80

#

expanded wpan0 routes (some are deprecated):

wpan0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 state UNKNOWN qlen 500
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc38/64 scope global deprecated flags 02
       valid_lft forever preferred_lft 0sec
    inet6 fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8/64 scope global flags 02
       valid_lft forever preferred_lft forever
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc11/64 scope global deprecated flags 02
       valid_lft forever preferred_lft 0sec
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc10/64 scope global deprecated flags 02
       valid_lft forever preferred_lft 0sec
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:a800/64 scope global deprecated flags 02
       valid_lft forever preferred_lft 0sec
    inet6 fd6a:9a1c:a68c:d6bb:89f7:8c6d:59d9:f172/64 scope global deprecated flags 02
       valid_lft forever preferred_lft 0sec
    inet6 fe80::984b:ab29:b109:14e4/64 scope link flags 02
       valid_lft forever preferred_lft forever

#

ip -6 n from the HAOS terminal:

fe80::c6e7:aeff:fe07:2727 dev enp1s0 lladdr <redacted MAC> used 0/0/0 probes 1 STALE
fdf0:d64f:d104:fdd1:f4b4:c761:c6f0:cc76 dev enp1s0 lladdr <redacted MAC> used 0/0/0 probes 1 STALE
fe80::c6e7:aeff:fe07:28d4 dev enp1s0 lladdr <redacted MAC> ref 1 used 0/0/0 probes 1 DELAY
fe80::60f6:beff:fe9c:53b5 dev enp1s0 lladdr <redacted MAC> used 0/0/0 probes 1 STALE
fdf0:d64f:d104:fdd1:9683:c4ff:fe3d:8012 dev enp1s0 lladdr <redacted MAC> router used 0/0/0 probes 1 STALE
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:2650 dev enp1s0 lladdr <redacted MAC> used 0/0/0 probes 1 STALE
fe80::9683:c4ff:fe3d:8012 dev enp1s0 lladdr <redacted MAC> router used 0/0/0 probes 1 STALE
fe80::c6e7:aeff:fe07:2650 dev enp1s0 lladdr <redacted MAC> ref 1 used 0/0/0 probes 1 REACHABLE
fe80::c6e7:aeff:fe07:21a1 dev enp1s0 lladdr <redacted MAC> ref 1 used 0/0/0 probes 2 DELAY
fe80::12e4:c2ff:fe62:e356 dev enp1s0 lladdr <redacted MAC> used 0/0/0 probes 0 STALE
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:28d4 dev enp1s0 lladdr <redacted MAC> ref 1 used 0/0/0 probes 1 REACHABLE
fe80::d273:d5ff:fe89:aa90 dev enp1s0 lladdr <redacted MAC> ref 1 used 0/0/0 probes 1 REACHABLE

hard sentinel
#

Ok, so yeah fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8 is an address of your HA host.
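A quick way to confirm which interface an address belongs to is to test it against the on-link prefixes from the route output quoted earlier. A small sketch using Python's standard `ipaddress` module; the `ROUTES` table is copied from that output and `interface_for` is a hypothetical helper name:

```python
import ipaddress

# Prefixes and interfaces as listed in the route output above.
ROUTES = {
    "fd6a:9a1c:a68c:d6bb::/64": "wpan0",
    "fd7c:4664:bcba:1::/64": "wpan0",
    "fdf0:d64f:d104:fdd1::/64": "enp1s0",
}

def interface_for(address):
    """Return the interface whose prefix contains `address`, or None."""
    addr = ipaddress.ip_address(address)
    for prefix, dev in ROUTES.items():
        if addr in ipaddress.ip_network(prefix):
            return dev
    return None

print(interface_for("fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8"))     # wpan0
print(interface_for("fdf0:d64f:d104:fdd1:255c:e1ab:e09a:ef38"))  # enp1s0
```

Both addresses sit on the same host: one on the Thread-facing wpan0 interface, the other on the Ethernet NIC.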

#

5540 is the default Matter port, so that is definitely the Matter controller attempting to talk to two independent devices (different dst IPs), which the Thread border router seems unable to send IPv6 messages to.

brave trout
#

but is that internal to the HAOS ecosystem? or is that stemming from the host network?

hard sentinel
#

I assume the Matter Server runs on the same installation as your OTBR add-on right?

brave trout
#

yes, it uses the Matter addon installed the same way as OTBR

#

(through the addon store)

hard sentinel
#

So yeah, the OTBR as well as the Matter controller run on "host" network (so share the same network namespace on your HAOS installation). The Matter server essentially can reach out to the Thread network directly.

brave trout
#

hmm; today I learned HAOS has a builtin Zeroconf browser (was using a third party app to check whether _matter._tcp services appeared). Guessing it's been there a while and I just didn't notice.

point being, most of the fd7c entries correlate to Matter devices...except 9aa8. That doesn't seem to exist anywhere, so what would OTBR be trying to talk to?

#

00:25:25.988 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:90, chksum:d1f9, ecn:no, to:0x2c00, sec:yes, error:NoAck, prio:low, radio:15.4
00:25:25.988 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329
00:25:25.988 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:63e5:6d92:8f4f:331c]:5540

{"name":"524D1A0448D07C81-127989949AF9BFF3._matter._tcp.local.","type":"_matter._tcp.local.","port":5540,"properties":{"SII":"800","SAI":"800","SAT":"4000","T":"0"},"ip_addresses":["fd7c:4664:bcba:1:63e5:6d92:8f4f:331c"]}

fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8 is active and traceable, I just can't seem to determine what it is....

PING fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8 (fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8): 56 data bytes
64 bytes from fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8: seq=0 ttl=64 time=0.069 ms
64 bytes from fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8: seq=1 ttl=64 time=0.095 ms
64 bytes from fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8: seq=2 ttl=64 time=0.126 ms
64 bytes from fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8: seq=3 ttl=64 time=0.092 ms

traceroute fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8
traceroute to fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8 (fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8), 30 hops max, 72 byte packets
 1  fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8 (fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8)  0.006 ms  0.005 ms  0.004 ms

#

found it; answered my own question - it's the host network for wpan0:

inet6 fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8/64 scope global flags 02 valid_lft forever preferred_lft forever

to my limited knowledge, the wpan0 is the radio component for Thread on 802.15.4... so does that error suggest the radio can't talk with itself? 😕

hard sentinel
#

But yeah this discovery information is quite useful 🤩

#

Especially that you can search by IP, this is super useful.

#

There should be multiple entries for fd7c:4664:bcba:1:63e5:6d92:8f4f:331c actually

hard sentinel
#

The one looks like a non Home Assistant fabric

brave trout
#

other than the name, they both contain the same information

#

{"name":"11434629F767C934-000000000000001C._matter._tcp.local.","type":"_matter._tcp.local.","port":5540,"properties":{"SII":"800","SAI":"800","SAT":"4000","T":"0"},"ip_addresses":["fd7c:4664:bcba:1:63e5:6d92:8f4f:331c"]}

{"name":"524D1A0448D07C81-127989949AF9BFF3._matter._tcp.local.","type":"_matter._tcp.local.","port":5540,"properties":{"SII":"800","SAI":"800","SAT":"4000","T":"0"},"ip_addresses":["fd7c:4664:bcba:1:63e5:6d92:8f4f:331c"]}

hard sentinel
#

So yeah that particular node is 1C, so 28.

#

Can you ping fd7c:4664:bcba:1:63e5:6d92:8f4f:331c?

brave trout
#

ooh, as in the Matter node?

hard sentinel
#

Yes

brave trout
# hard sentinel Can you ping `fd7c:4664:bcba:1:63e5:6d92:8f4f:331c`?

yep, I can:

PING fd7c:4664:bcba:1:63e5:6d92:8f4f:331c (fd7c:4664:bcba:1:63e5:6d92:8f4f:331c): 56 data bytes
64 bytes from fd7c:4664:bcba:1:63e5:6d92:8f4f:331c: seq=0 ttl=64 time=69.928 ms
64 bytes from fd7c:4664:bcba:1:63e5:6d92:8f4f:331c: seq=1 ttl=64 time=51.663 ms
64 bytes from fd7c:4664:bcba:1:63e5:6d92:8f4f:331c: seq=2 ttl=64 time=73.305 ms
64 bytes from fd7c:4664:bcba:1:63e5:6d92:8f4f:331c: seq=3 ttl=64 time=72.269 ms

#

hmm, so the ULA allocations do not seem to be uniform; trying to find a device that is reporting as offline at the moment, which in this example is Node 26, or 1A. Can the last hextet just include 1a? or does it have to end in the node ID?

#

if it can just be included in the last hextet, then Node 26 in this case correlates to fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2 and I am unable to ping that

#

.. I stand corrected; I can ping it, but the latency is excessive

#

ping fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2
PING fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2 (fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2): 56 data bytes
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=0 ttl=63 time=1312.187 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=1 ttl=63 time=339.155 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=2 ttl=63 time=1808.867 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=3 ttl=63 time=922.612 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=4 ttl=63 time=2320.479 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=5 ttl=63 time=1346.268 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=6 ttl=63 time=356.849 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=7 ttl=63 time=1857.883 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=8 ttl=63 time=868.783 ms
64 bytes from fd7c:4664:bcba:1:ce34:82a2:9a2:c1a2: seq=9 ttl=63 time=2367.704 ms

hard sentinel
#

So Node 26 really should be 000000000000001A.
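The node ID is the second half of the Matter operational instance name (the part after the hyphen, in hexadecimal), not part of the IPv6 address. A short sketch of the conversion in both directions; `parse_matter_instance` is a hypothetical helper name:

```python
def parse_matter_instance(name):
    """Split '<fabric>-<node>._matter._tcp.local.' into (fabric hex, node ID as int)."""
    instance = name.split(".", 1)[0]
    fabric_hex, node_hex = instance.split("-", 1)
    return fabric_hex, int(node_hex, 16)

fabric, node = parse_matter_instance(
    "11434629F767C934-000000000000001C._matter._tcp.local."
)
print(fabric, node)  # 11434629F767C934 28

# Going the other way: node 26 as it would appear in the instance name.
print(format(26, "016X"))  # 000000000000001A
```

So to find node 26, search the Zeroconf browser for an instance name ending in `-000000000000001A` on the Home Assistant fabric, then read the IP address from that entry.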

hard sentinel
#

Yeah then it is relatively high indeed.

brave trout
#

it's also reporting as offline

#

so cannot be turned on/off in any way through HA

hard sentinel
#

But it appears in the Zeroconf browser?

brave trout
#

yes

hard sentinel
#

What are the matter logs saying?

#

(related to that device?)

brave trout
#

annoyingly, it has come back online as of 12 minutes ago

#

the latency definitely has not improved though

#

but there's no packet loss

#

and while the device is "online", the Matter entry for this light has no IP address assigned to it

hard sentinel
#

So all devices online currently?

brave trout
#

at this point in time, yes

#

it has however taken almost 90 minutes

#

from 4 hours ago:

#

90 minutes ago:

#

I expect them to all fall over within the next couple of hours

#

and this cycle continues

#

I also have a bar graph tracking when the lights fall over (this is just today):

#

this is as of right now

#

for what it's worth, the OTBR addon is no longer reporting any failed entries (it stopped reporting entirely?)

#

I don't know the exact time it stopped, as the OTBR addon does not track local time but only time when the addon was started, which makes it difficult to correlate failures.

#

OTBR drop logs have restarted, but lights are still reachable for the moment:

01:30:44.997 [N] MeshForwarder-: Dropping (reassembly queue) IPv6 UDP msg, len:1236, chksum:b53a, ecn:no, sec:yes, error:ReassemblyTimeout, prio:normal, rss:-74.0, radio:15.4
01:30:44.997 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:c856:d188:bdb:578a]:5540
01:30:44.997 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329
01:30:45.999 [N] MeshForwarder-: Dropping (reassembly queue) IPv6 UDP msg, len:1236, chksum:5e25, ecn:no, sec:yes, error:ReassemblyTimeout, prio:normal, rss:-75.0, radio:15.4
01:30:45.999 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:4208:f599:a239:3120]:5540
01:30:45.999 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329
01:30:45.999 [N] MeshForwarder-: Dropping (reassembly queue) IPv6 UDP msg, len:1236, chksum:b53a, ecn:no, sec:yes, error:ReassemblyTimeout, prio:normal, rss:-73.5, radio:15.4
01:30:45.999 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:c856:d188:bdb:578a]:5540
01:30:45.999 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329
01:30:47.002 [N] MeshForwarder-: Dropping (reassembly queue) IPv6 UDP msg, len:1236, chksum:b53a, ecn:no, sec:yes, error:ReassemblyTimeout, prio:normal, rss:-75.0, radio:15.4
01:30:47.002 [N] MeshForwarder-: src:[fd7c:4664:bcba:1:c856:d188:bdb:578a]:5540
01:30:47.002 [N] MeshForwarder-: dst:[fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8]:52329

#

also starting to get these:

01:31:33.147 [N] MeshForwarder-: Dropping rx frag frame, error:Drop, len:88, src:0xd51b, dst:0xa800, sec:yes, tag:42219, offset:816, dglen:1244
01:31:33.153 [N] MeshForwarder-: Dropping rx frag frame, error:Drop, len:88, src:0xd51b, dst:0xa800, sec:yes, tag:42219, offset:904, dglen:1244
01:31:33.164 [N] MeshForwarder-: Dropping rx frag frame, error:Drop, len:88, src:0xd51b, dst:0xa800, sec:yes, tag:42219, offset:992, dglen:1244
01:31:33.171 [N] MeshForwarder-: Dropping rx frag frame, error:Drop, len:88, src:0xd51b, dst:0xa800, sec:yes, tag:42219, offset:1080, dglen:1244
01:31:33.197 [N] MeshForwarder-: Dropping rx frag frame, error:Drop, len:76, src:0xd51b, dst:0xa800, sec:yes, tag:42219, offset:1168, dglen:1244

hard sentinel
#

Some messages are expected, e.g. in this case it can be that a half transmitted packet just gets lost. The Matter protocol will retry up to 4 times on a higher level if such drops happen.

hard sentinel
#

Hm, yeah probably all UTC 😢

brave trout
#

yup

#

I can at least add my TZ to that, so thank you for that.

hard sentinel
#

Not sure yet how helpful this really becomes 😅

#

But the ot-ctl command is useful if you want to dive into the mesh

brave trout
#

for the moment, everything is quiet and working, so I'm not touching a damn thing.

#

I will keep an eye on it and see if anything changes over the next day or so

#

thanks @hard sentinel !

glass bridge
#

Put differently: I have my OTBR running outside of HA, would still be interested in using the script. Would this be possible?

brave trout
#

it's interesting; I identified two of the lights that weren't working from the 4 listed here from that yellow line, and it seems as soon as I started to ping them, they came back online.

brave trout
# hard sentinel Are the devices still listed in the Zeroconf browser? How about ping?

It's that time again 🎉

All Matter over Thread devices are offline, but showing in the Zeroconf browser.

Spot checking a couple of devices though; the unavailable ones are NOT PINGABLE:

➜ ~ ping fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d
PING fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d (fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d): 56 data bytes
^C
--- fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d ping statistics ---
23 packets transmitted, 0 packets received, 100% packet loss

#

{"name":"11434629F767C934-0000000000000021._matter._tcp.local.","type":"_matter._tcp.local.","port":5540,"properties":{"SII":"800","SAI":"800","SAT":"4000","T":"0"},"ip_addresses":["fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d"]}

{"name":"524D1A0448D07C81-FBAB01C0D608328D._matter._tcp.local.","type":"_matter._tcp.local.","port":5540,"properties":{"SII":"800","SAI":"800","SAT":"4000","T":"0"},"ip_addresses":["fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d"]}

brave trout
#

20 minutes later, the same light above is now pingable, but has ~40% packet loss (still offline in the Matter integration):

PING fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d (fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d): 56 data bytes
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=2 ttl=64 time=140.426 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=3 ttl=64 time=307.690 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=4 ttl=64 time=104.249 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=5 ttl=64 time=282.879 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=6 ttl=64 time=138.539 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=7 ttl=64 time=72.458 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=10 ttl=64 time=340.585 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=11 ttl=64 time=482.847 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=16 ttl=64 time=196.769 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=17 ttl=64 time=303.410 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=18 ttl=64 time=136.298 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=19 ttl=64 time=51.183 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=20 ttl=64 time=104.964 ms
64 bytes from fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d: seq=21 ttl=64 time=51.841 ms
^C
--- fd7c:4664:bcba:1:4ee3:c6d3:f1b7:bd7d ping statistics ---
23 packets transmitted, 14 packets received, 39% packet loss
round-trip min/avg/max = 51.183/193.867/482.847 ms
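The gaps in the `seq=` numbers show where the loss happened, which can hint at whether packets are lost in bursts or at random. A minimal sketch that recovers the missing sequence numbers from ping output in the format shown above; the sample reuses the sequence numbers from that run, while the host name and round-trip times in it are placeholders:

```python
import re

def ping_loss(output):
    """Return (transmitted, received, loss_percent, missing_seqs) from ping output."""
    m = re.search(r"(\d+) packets transmitted, (\d+) packets received", output)
    sent, received = int(m.group(1)), int(m.group(2))
    seen = {int(s) for s in re.findall(r"seq=(\d+)", output)}
    missing = sorted(set(range(sent)) - seen)
    loss = round(100 * (sent - received) / sent)
    return sent, received, loss, missing

# Sequence numbers taken from the ping run above; host and times are placeholders.
received_seqs = [2, 3, 4, 5, 6, 7, 10, 11, 16, 17, 18, 19, 20, 21]
sample = "\n".join(
    f"64 bytes from host: seq={s} ttl=64 time=1.0 ms" for s in received_seqs
)
sample += "\n23 packets transmitted, 14 packets received, 39% packet loss"

sent, received, loss, missing = ping_loss(sample)
print(loss, missing)  # 39 [0, 1, 8, 9, 12, 13, 14, 15, 22]
```

In this run the losses come in short consecutive runs (0-1, 8-9, 12-15) rather than as isolated packets.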

Not sure if that's just because there's a bunch more Thread traffic because devices are trying to resubscribe

brave trout
#

@hard sentinel new update; OTBR logs show a 'radio tx timeout' around the time Thread fails. However, this log is present in other times where the MoT devices are working as expected (sometimes the log appears every 15 minutes, sometimes every few hours). However, this time around was a rapid succession RCP failure/radio tx timeout within 15 seconds of each other. The log doesn't look any different, but as soon as that second failure occurred, everything Thread went haywire:

05:53:00.492 [W] P-RadioSpinel-: radio tx timeout
05:53:00.492 [W] P-RadioSpinel-: RCP failure detected
05:53:00.492 [W] P-RadioSpinel-: Trying to recover (1/2)
05:53:00.612 [C] P-RadioSpinel-: RCP => [C] Platform------: Reset info: 0x3 (EXT)
05:53:00.612 [C] P-RadioSpinel-: RCP => [C] Platform------: Extended Reset info: 0x301 (PIN)
05:53:00.612 [N] P-RadioSpinel-: RCP recovery is done
05:53:16.919 [W] P-RadioSpinel-: radio tx timeout
05:53:16.919 [W] P-RadioSpinel-: RCP failure detected
05:53:16.919 [W] P-RadioSpinel-: Trying to recover (1/2)
05:53:17.039 [C] P-RadioSpinel-: RCP => [C] Platform------: Reset info: 0x3 (EXT)
05:53:17.039 [C] P-RadioSpinel-: RCP => [C] Platform------: Extended Reset info: 0x301 (PIN)
05:53:17.039 [N] P-RadioSpinel-: RCP recovery is done
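Since single `radio tx timeout` entries also show up while everything works, it may help to flag only the timeouts that follow each other closely. A rough sketch, assuming the log format shown above; the 30-second window and the function names are arbitrary choices for illustration, and the timestamps are treated as time of day, so a run that crosses midnight is not handled:

```python
import re
from datetime import timedelta

TIMEOUT = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) \[W\] P-RadioSpinel-: radio tx timeout"
)

def tx_timeouts(log_text):
    """Return the time of day of every 'radio tx timeout' entry as a timedelta."""
    return [
        timedelta(hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms))
        for h, m, s, ms in TIMEOUT.findall(log_text)
    ]

def rapid_pairs(times, window_seconds=30):
    """Return consecutive timeout pairs that are at most `window_seconds` apart."""
    limit = timedelta(seconds=window_seconds)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a <= limit]

log = (
    "05:53:00.492 [W] P-RadioSpinel-: radio tx timeout\n"
    "05:53:00.492 [W] P-RadioSpinel-: RCP failure detected\n"
    "05:53:16.919 [W] P-RadioSpinel-: radio tx timeout\n"
    "05:53:16.919 [W] P-RadioSpinel-: RCP failure detected\n"
)
pairs = rapid_pairs(tx_timeouts(log))
for first, second in pairs:
    print(f"timeouts {(second - first).total_seconds():.3f}s apart")
```

Running this over a longer log would show whether every full Thread dropout is preceded by such a back-to-back pair.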

This is using the SLZB. The ZBT-1 does not appear to have this entry, however at the time where Thread falls over on the ZBT-1, I get 'CCA_FAILURE' instead, and after this the Thread network again goes haywire:

05:27:48.674 [D] P-SpinelDrive-: Received spinel frame, flg:0x2, iid:0, tid:11, cmd:PROP_VALUE_IS, key:LAST_STATUS, status:CCA_FAILURE

Same outcome, different causes? Not sure what to make of it.

glass bridge
#

@brave trout the question here is indeed whether or not the tx errors or CCA_FAILURES are a consequence of the thread devices trying to reconnect, hence consuming too much of the bandwidth of your thread channel OR if they are the cause why the devices start to drop.

Generally, if that is available (as in "compiled in") in your OTBR, you can run the channel monitor:

ot-ctl channel monitor
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 2569
occupancies:
ch 11 (0x04a6)   1.81% busy
ch 12 (0x0230)   0.85% busy
ch 13 (0x0776)   2.91% busy
ch 14 (0x0767)   2.89% busy
ch 15 (0x1e78)  11.90% busy
ch 16 (0x3019)  18.78% busy
ch 17 (0x2fae)  18.62% busy
ch 18 (0x2d3c)  17.66% busy
ch 19 (0x2934)  16.09% busy
ch 20 (0x133d)   7.51% busy
ch 21 (0x1c65)  11.09% busy
ch 22 (0x2ab9)  16.68% busy
ch 23 (0x26ca)  15.15% busy
ch 24 (0x1d85)  11.53% busy
ch 25 (0x0d40)   5.17% busy
ch 26 (0x1c74)  11.11% busy

Done

See https://github.com/openthread/openthread/blob/main/src/cli/README.md
for what you can do with the cli (you probably have to switch to the branch/commit id of your OTBR build to get the right docs for your specific OTBR build).
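To compare channels across several samples, the `occupancies` section of the channel monitor output above can be parsed into numbers. A minimal sketch, assuming the line format shown in that output; the sample below is a subset of those lines and the function name is made up:

```python
import re

LINE = re.compile(r"ch (\d+) \(0x[0-9a-f]{4}\)\s+([\d.]+)% busy")

def occupancies(output):
    """Map channel number to busy percentage from `ot-ctl channel monitor` output."""
    return {int(ch): float(pct) for ch, pct in LINE.findall(output)}

sample = """\
ch 11 (0x04a6)   1.81% busy
ch 12 (0x0230)   0.85% busy
ch 13 (0x0776)   2.91% busy
ch 15 (0x1e78)  11.90% busy
ch 16 (0x3019)  18.78% busy
"""
busy = occupancies(sample)
quietest = min(busy, key=busy.get)
print(quietest, busy[quietest])  # 12 0.85
```

Logging this once a minute alongside device availability would show whether the channel gets busy before or after the devices drop.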

#

So, I have set my TBRs to use channel 13, which had the lowest usage at the time I chose it. Also, my WiFi on the 2.4G spectrum is set to use a constant channel (11)
(see here for a reference: https://www.metageek.com/training/resources/zigbee-wifi-coexistence/ )

#

So, it might be worthwhile monitoring how busy your channel is just before your devices start to drop.
Should the channel become busy before your devices start to drop, there might be something in your environment that is causing congestion/interference on your Thread channel. Maybe you even have Zigbee devices running on a nearby channel.

#

Should the channel become busy only right after the devices drop, there might be some other issue causing it:
What about your WiFi (I assume your RPI uses WiFi). Do you have band steering enabled in your AP or something else that causes your RPI to change APs, channels or even switch from 2.4G to 5G?
Would it be possible (just to exclude this as a cause) to connect the RPI via ethernet?

#

Also, when you use the SLZB-6, do you plug it into your RPI directly via USB or do you "tunnel" the RCP's serial protocol via WiFi/ethernet into your OTBR?
I would strongly recommend not to do the latter because of the sensitive timing between the border agent and the RCP. (Although that might be the reason why you chose the SLZB-6 over an SLZB-7 in the first place).

#

Also, with devices that have ~40% packet loss when being pinged after a reconnect, does that get back to normal after a while?

#

Also, you could use mtr on a) your local machine, b) on the HA to see if the packet loss happens on the WiFi or thread part of the network.

brave trout
# glass bridge @brave trout the question here is indeed whether or not the tx errors o...

I'm using the native OTBR addon from the HAOS add-on store. By default, ot-ctl is not available in HAOS; however, with @hard sentinel's GitHub install script for thread debug, it becomes available. Unfortunately, ot-ctl channel monitor reports that it is not enabled:

```
Updating existing installation...
HEAD is now at 5bb2719 Adjust width
Adding /root/.local/share/thread-debug/bin to PATH in /root/.bashrc
Please restart your terminal or run: source /root/.bashrc
✅ Installation complete. You can now use the scripts from thread-debug!
➜  ~ source /root/.bashrc
➜  ~ ot-ctl channel monitor
enabled: 0
Done
```
#

I have also seen the reference to co-existence and have tried moving the Thread channel around (by default it was created on 15). I moved it to 25 while shifting my WiFi channel to 1 (for testing) and it seemed to make it worse. It's also possible that it's because I have Zigbee on 25 as well, but changing Zigbee's channel will cause a complete reset of the devices, which I'm not doing. The silver lining is that with Thread, when I changed channels, all 63 MoT devices I have came across without an issue, so that's a positive.

#

It has been set back to 15

brave trout
# glass bridge Also, when you use the SLZB-6, do you plug it into your RPI directly via USB or ...

I tunnel the RCP via the SLZB's Zigbee-over-Ethernet option. I had tried via USB initially (setting the SLZB back to Zigbee-over-USB; yes, it references Zigbee, however the FW flashed is MoT, and the stick can do both), pulling the ZBT-1 out and replacing it with the SLZB, and it made zero difference to the intervals of dropouts. So I've gone back to RCP tunneling, as I can then have the SLZB anywhere in the house next to a PoE switch, which is much preferred in my use case.

brave trout
#

I also have other OTBRs I have tested in the past (all Ethernet as well), as I've seen on several occasions from forum posts that WiFi can have a significant impact. I would like to think Ethernet does not have the same issues. The devices I reference are two GL.iNet S20s and one GL.iNet S200, all flashed with firmware that supports OTBR use with Home Assistant. They have however made no difference to this problem, so they remain disconnected for the time being (that, and as mentioned previously, I've found running multiple OTBRs only seems to make things worse).

#

Even the Streamer 4k (when it was being tested) was running via Ethernet, just to rule out WiFi interference (or attempt to).

#

I've also started to notice in the OTBR logs that it seems to be trying to talk to an fe80 address on port 19788. A brief Google of that port suggests it's used for Mesh Link Establishment protocol for IEEE 802.15.4 radio mesh networks. the fe80 source address is on the same wpan0 interface as the ULA subnet is on, but the address is not resolvable by traceroute and also cannot be pinged, so I'm not sure what destination device it is looking for:

```
09:58:17.645 [N] MeshForwarder-:     src:[fe80:0:0:0:984b:ab29:b109:14e4]:19788
09:58:17.645 [N] MeshForwarder-:     dst:[fe80:0:0:0:a069:6788:a970:ba51]:19788
09:58:18.715 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:138, chksum:da9c, ecn:no, to:a2696788a970ba51, sec:no, error:NoAck, prio:net, radio:15.4
09:58:18.716 [N] MeshForwarder-:     src:[fe80:0:0:0:984b:ab29:b109:14e4]:19788
09:58:18.716 [N] MeshForwarder-:     dst:[fe80:0:0:0:a069:6788:a970:ba51]:19788
09:59:45.855 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:86ef, ecn:no, to:a2696788a970ba51, sec:no, error:NoAck, prio:net, radio:15.4
09:59:45.855 [N] MeshForwarder-:     src:[fe80:0:0:0:984b:ab29:b109:14e4]:19788
09:59:45.855 [N] MeshForwarder-:     dst:[fe80:0:0:0:a069:6788:a970:ba51]:19788
```

```28: wpan0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 state UNKNOWN qlen 500
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc11/64 scope global deprecated flags 02 
       valid_lft forever preferred_lft 0sec
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc10/64 scope global deprecated flags 02 
       valid_lft forever preferred_lft 0sec
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc38/64 scope global deprecated flags 02 
       valid_lft forever preferred_lft 0sec
    inet6 fd6a:9a1c:a68c:d6bb:0:ff:fe00:2800/64 scope global deprecated flags 02 
       valid_lft forever preferred_lft 0sec
    inet6 fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8/64 scope global flags 02 
       valid_lft forever preferred_lft forever
    inet6 fd6a:9a1c:a68c:d6bb:89f7:8c6d:59d9:f172/64 scope global deprecated flags 02 
       valid_lft forever preferred_lft 0sec
    >>>>> inet6 fe80::984b:ab29:b109:14e4/64 scope link flags 02 <<<<<
       valid_lft forever preferred_lft forever```

`traceroute fe80:0:0:0:a069:6788:a970:ba51                                    
traceroute: can't connect to remote host: Invalid argument`

```➜  ~ ping fe80:0:0:0:a069:6788:a970:ba51
PING fe80:0:0:0:a069:6788:a970:ba51 (fe80::a069:6788:a970:ba51): 56 data bytes
--- fe80:0:0:0:a069:6788:a970:ba51 ping statistics ---
11 packets transmitted, 0 packets received, 100% packet loss```
brave trout
#

I have now swapped the SLZB back to USB mode and reconfigured the OTBR addon to accept it.

OTBR Addon settings:

brave trout
#

Changing to USB mode on the SLZB did not fix the Thread network dropout; dropped at exactly a 6 hour interval like before.

brave trout
#

@hard sentinel @glass bridge So I've had some Aqara T2 bulbs delivered today (they come with both a Zigbee and a Thread radio and can be swapped between the two via BLE). Onboarded two (of 8) devices, both worked successfully. Had the usual 6 hour blip, and the newly added Thread Aqara bulb disappeared like every other MoT device, but the Zigbee Aqara bulb did NOT disappear and is working exactly as expected. To me that validates that something within the Thread ecosystem is completely falling on its face, as it is the only protocol affected. To my knowledge Zigbee and Thread also share the same frequencies, so if it was an external influence I would have expected it to kill BOTH protocols at the same time, but it has not.

#

^ logs for both addons, both in debug mode.

#

The Thread setup is also using the SLZB-06MG24 in USB mode.

hard sentinel
#

This all smells more like a temporary routing failure or something along those lines.

hard sentinel
#

Can you have a look at your routing table using `ip -6 route` when things are ok, and then again when they go south? I wonder if routes are missing or changed during that time. Specifically the routes to the Thread network, so the ones starting with fd7c:4664:bcba:1.

If you use the advanced Terminal, install iproute2 first (using `apk add iproute2`)

#

Also I wonder if your Thread network maybe goes through an address change for some reason? 🤔

brave trout
# hard sentinel We've disabled the channel monitor by default since it causes the radio to frequ...

ok I've enabled it and am starting to get some data from the channel monitor. At first ~5 channels went to 100% busy but have since settled:

```
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 4
occupancies:
ch 11 (0x3332)  19.99% busy
ch 12 (0x0000)   0.00% busy
ch 13 (0x3fff)  24.99% busy
ch 14 (0x3fff)  24.99% busy
ch 15 (0x3332)  19.99% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x3fff)  24.99% busy
ch 18 (0x0000)   0.00% busy
ch 19 (0x0000)   0.00% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x0000)   0.00% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x3fff)  24.99% busy
ch 25 (0x3fff)  24.99% busy
ch 26 (0x0000)   0.00% busy```
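The hex value in brackets appears to be the raw occupancy figure behind each percentage. A minimal Python sketch, assuming `0xffff` corresponds to 100% busy (which is consistent with the numbers printed above, e.g. `0x3332` against 19.99%), could turn a pasted capture into numbers that are easier to compare between snapshots. The function names are made up for illustration.

```python
import re

# Matches lines such as "ch 11 (0x3332)  19.99% busy".
LINE_RE = re.compile(r"ch\s+(\d+)\s+\(0x([0-9a-fA-F]{4})\)")


def parse_occupancy(text):
    """Return {channel: percent busy}, computed from the raw hex value."""
    result = {}
    for match in LINE_RE.finditer(text):
        channel = int(match.group(1))
        raw = int(match.group(2), 16)
        # Assumption: 0xffff means every sample was above the threshold.
        result[channel] = raw * 100 / 0xFFFF
    return result


def quietest(occupancy, n=3):
    """Channels sorted by occupancy, least busy first (ties by channel number)."""
    return sorted(occupancy, key=lambda ch: (occupancy[ch], ch))[:n]


# Three lines copied from the output above.
sample = """\
ch 11 (0x3332)  19.99% busy
ch 12 (0x0000)   0.00% busy
ch 13 (0x3fff)  24.99% busy
"""
occ = parse_occupancy(sample)
for channel, percent in occ.items():
    print(channel, percent)
print("quietest:", quietest(occ, 1))
```

With only a handful of samples (`count: 4` here) the figures are coarse, so a comparison is probably more meaningful once the count has grown.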
brave trout
# hard sentinel Can you have a look at your routing table using `ip -6 route` when things are o...

have installed iproute2, though the output before I installed it and after from advanced terminal looks the same:

```
fd6a:9a1c:a68c:d6bb::/64 dev wpan0 proto kernel metric 64 pref medium
fd7a:115c:a1e0::9c01:4820 dev tailscale0 proto kernel metric 256 pref medium
fd7c:4664:bcba:1::/64 dev wpan0 proto kernel metric 64 pref medium
fdf0:d64f:d104:fdd1::/64 dev enp1s0 proto ra metric 100 pref medium
fe80::/64 dev veth23cbcc9 proto kernel metric 256 pref medium
fe80::/64 dev hassio proto kernel metric 256 pref medium
fe80::/64 dev docker0 proto kernel metric 256 pref medium
fe80::/64 dev veth2e6d269 proto kernel metric 256 pref medium
fe80::/64 dev veth0af4f7d proto kernel metric 256 pref medium
fe80::/64 dev vethe945e89 proto kernel metric 256 pref medium
fe80::/64 dev veth97fdbb5 proto kernel metric 256 pref medium
fe80::/64 dev veth870e0b7 proto kernel metric 256 pref medium
fe80::/64 dev veth536cb30 proto kernel metric 256 pref medium
fe80::/64 dev tailscale0 proto kernel metric 256 pref medium
fe80::/64 dev vethd3765ee proto kernel metric 256 pref medium
fe80::/64 dev veth3f3633b proto kernel metric 256 pref medium
fe80::/64 dev veth7836812 proto kernel metric 256 pref medium
fe80::/64 dev vethad019cc proto kernel metric 256 pref medium
fe80::/64 dev veth3e927be proto kernel metric 256 pref medium
fe80::/64 dev veth502f10a proto kernel metric 256 pref medium
fe80::/64 dev wpan0 proto kernel metric 256 pref medium
fe80::/64 dev enp1s0.1 proto kernel metric 1024 pref medium
fe80::/64 dev enp1s0 proto kernel metric 1024 pref medium```
#

for what it's worth, my networking gear does not have IPv6 enabled globally on this portion of the network, so any v6 activities should all be contained within Home Assistant.

#

following the current pattern, the lights will fail again in ~3.5 hours. I'll update the thread then.

brave trout
#

@hard sentinel Thread network has fallen apart just now.

Before the drops:

```
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 324
occupancies:
ch 11 (0x10c8)   6.55% busy
ch 12 (0x2bb5)  17.07% busy
ch 13 (0x3d5d)  23.97% busy
ch 14 (0x2fb6)  18.63% busy
ch 15 (0x1424)   7.86% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x0a29)   3.96% busy
ch 18 (0x035a)   1.30% busy
ch 19 (0x0352)   1.29% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x035c)   1.31% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x0358)   1.30% busy
ch 25 (0x0d96)   5.30% busy
ch 26 (0x0a2f)   3.97% busy```

```➜  ~ ip -6 route
fd6a:9a1c:a68c:d6bb::/64 dev wpan0 proto kernel metric 64 pref medium
fd7a:115c:a1e0::9c01:4820 dev tailscale0 proto kernel metric 256 pref medium
fd7c:4664:bcba:1::/64 dev wpan0 proto kernel metric 64 pref medium
fdf0:d64f:d104:fdd1::/64 dev enp1s0 proto ra metric 100 pref medium
fe80::/64 dev veth23cbcc9 proto kernel metric 256 pref medium
fe80::/64 dev hassio proto kernel metric 256 pref medium
fe80::/64 dev docker0 proto kernel metric 256 pref medium
fe80::/64 dev veth2e6d269 proto kernel metric 256 pref medium
fe80::/64 dev veth0af4f7d proto kernel metric 256 pref medium
fe80::/64 dev vethe945e89 proto kernel metric 256 pref medium
fe80::/64 dev veth97fdbb5 proto kernel metric 256 pref medium
fe80::/64 dev veth870e0b7 proto kernel metric 256 pref medium
fe80::/64 dev veth536cb30 proto kernel metric 256 pref medium
fe80::/64 dev tailscale0 proto kernel metric 256 pref medium
fe80::/64 dev vethd3765ee proto kernel metric 256 pref medium
fe80::/64 dev veth3f3633b proto kernel metric 256 pref medium
fe80::/64 dev veth7836812 proto kernel metric 256 pref medium
fe80::/64 dev vethad019cc proto kernel metric 256 pref medium
fe80::/64 dev wpan0 proto kernel metric 256 pref medium
fe80::/64 dev veth280fe00 proto kernel metric 256 pref medium
fe80::/64 dev veth36ebdd4 proto kernel metric 256 pref medium
fe80::/64 dev enp1s0.1 proto kernel metric 1024 pref medium
fe80::/64 dev enp1s0 proto kernel metric 1024 pref medium```
#

during the dropouts (right now):

```
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 337
occupancies:
ch 11 (0x10c8)   6.55% busy
ch 12 (0x2bb5)  17.07% busy
ch 13 (0x3d5d)  23.97% busy
ch 14 (0x2fb6)  18.63% busy
ch 15 (0x1424)   7.86% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x0a29)   3.96% busy
ch 18 (0x035a)   1.30% busy
ch 19 (0x0352)   1.29% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x035c)   1.31% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x0358)   1.30% busy
ch 25 (0x0d96)   5.30% busy
ch 26 (0x0a2f)   3.97% busy```

```➜  ~ ip -6 route
fd6a:9a1c:a68c:d6bb::/64 dev wpan0 proto kernel metric 64 pref medium
fd7a:115c:a1e0::9c01:4820 dev tailscale0 proto kernel metric 256 pref medium
fd7c:4664:bcba:1::/64 dev wpan0 proto kernel metric 64 pref medium
fdf0:d64f:d104:fdd1::/64 dev enp1s0 proto ra metric 100 pref medium
fe80::/64 dev veth23cbcc9 proto kernel metric 256 pref medium
fe80::/64 dev hassio proto kernel metric 256 pref medium
fe80::/64 dev docker0 proto kernel metric 256 pref medium
fe80::/64 dev veth2e6d269 proto kernel metric 256 pref medium
fe80::/64 dev veth0af4f7d proto kernel metric 256 pref medium
fe80::/64 dev vethe945e89 proto kernel metric 256 pref medium
fe80::/64 dev veth97fdbb5 proto kernel metric 256 pref medium
fe80::/64 dev veth870e0b7 proto kernel metric 256 pref medium
fe80::/64 dev veth536cb30 proto kernel metric 256 pref medium
fe80::/64 dev tailscale0 proto kernel metric 256 pref medium
fe80::/64 dev vethd3765ee proto kernel metric 256 pref medium
fe80::/64 dev veth3f3633b proto kernel metric 256 pref medium
fe80::/64 dev veth7836812 proto kernel metric 256 pref medium
fe80::/64 dev vethad019cc proto kernel metric 256 pref medium
fe80::/64 dev wpan0 proto kernel metric 256 pref medium
fe80::/64 dev veth280fe00 proto kernel metric 256 pref medium
fe80::/64 dev veth36ebdd4 proto kernel metric 256 pref medium
fe80::/64 dev enp1s0.1 proto kernel metric 1024 pref medium
fe80::/64 dev enp1s0 proto kernel metric 1024 pref medium```
#

routes don't look any different to me
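Comparing two long `ip -6 route` dumps by eye is error-prone, so a small set difference can confirm the impression. The following is a sketch only: it assumes both captures were saved as plain text, and the two sample strings are abbreviated excerpts of captures posted in this thread rather than the full tables.

```python
def route_set(text):
    """Turn `ip -6 route` output into a set of whitespace-normalised lines."""
    return {" ".join(line.split()) for line in text.splitlines() if line.strip()}


def diff_routes(before, during):
    """Return (routes only in `before`, routes only in `during`), both sorted."""
    b, d = route_set(before), route_set(during)
    return sorted(b - d), sorted(d - b)


# Abbreviated excerpt of the earliest capture in the thread.
first_capture = """\
fd7c:4664:bcba:1::/64 dev wpan0 proto kernel metric 64 pref medium
fe80::/64 dev veth3e927be proto kernel metric 256 pref medium
fe80::/64 dev wpan0 proto kernel metric 256 pref medium
"""
# Abbreviated excerpt of the capture taken during the dropout.
during_dropout = """\
fd7c:4664:bcba:1::/64 dev wpan0 proto kernel metric 64 pref medium
fe80::/64 dev wpan0 proto kernel metric 256 pref medium
fe80::/64 dev veth280fe00 proto kernel metric 256 pref medium
"""

removed, added = diff_routes(first_capture, during_dropout)
for line in removed:
    print("-", line)
for line in added:
    print("+", line)
```

In these excerpts only `veth` interfaces differ, while both `wpan0` routes are present in each capture.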

#

curiously some IPv6 neighbour entries are showing FAILED (`ip -6 n`; I forgot to run this one before everything went haywire):

```
fe80::d273:d5ff:fe89:aa90 dev enp1s0 lladdr d0:73:d5:89:aa:90 REACHABLE
fe80::c6e7:aeff:fe07:21a1 dev veth23cbcc9 FAILED 
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:28d4 dev enp1s0 lladdr c4:e7:ae:07:28:d4 REACHABLE 
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:2727 dev enp1s0 FAILED 
fe80::9ff4:5413:bae3:909e dev enp1s0 FAILED 
fdf0:d64f:d104:fdd1:9683:c4ff:fe3d:8012 dev enp1s0 lladdr 94:83:c4:3d:80:12 router STALE 
fe80::c6e7:aeff:fe07:28d4 dev enp1s0 lladdr c4:e7:ae:07:28:d4 REACHABLE 
fe80::c6e7:aeff:fe07:2727 dev enp1s0 lladdr c4:e7:ae:07:27:27 REACHABLE 
fe80::9683:c4ff:fe3d:8012 dev enp1s0 lladdr 94:83:c4:3d:80:12 router STALE 
fe80::c6e7:aeff:fe07:28d4 dev veth23cbcc9 FAILED 
fe80::c6e7:aeff:fe07:2650 dev enp1s0 lladdr c4:e7:ae:07:26:50 STALE 
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:21a1 dev enp1s0 lladdr c4:e7:ae:07:21:a1 STALE 
fe80::decc:e6ff:fe9d:4235 dev enp1s0 lladdr dc:cc:e6:9d:42:35 STALE 
fe80::c6e7:aeff:fe07:21a1 dev enp1s0 lladdr c4:e7:ae:07:21:a1 REACHABLE ```
#

huh.... the fd7c: network isn't listed

#

I'll redo the same command again once everything stabilises (~half hour)

brave trout
#

@hard sentinel got somewhat distracted but it seems the fd7c routes do not appear in ip -6 n even when Thread is working as expected:

```
fe80::d273:d5ff:fe89:aa90 dev enp1s0 lladdr d0:73:d5:89:aa:90 REACHABLE
fe80::c6e7:aeff:fe07:21a1 dev veth23cbcc9 FAILED 
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:28d4 dev enp1s0 lladdr c4:e7:ae:07:28:d4 STALE 
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:2727 dev enp1s0 FAILED 
fe80::9ff4:5413:bae3:909e dev enp1s0 FAILED 
fe80::e56:5cff:fe9f:8b90 dev enp1s0 lladdr 0c:56:5c:9f:8b:90 STALE 
fdf0:d64f:d104:fdd1:9683:c4ff:fe3d:8012 dev enp1s0 lladdr 94:83:c4:3d:80:12 router STALE 
fe80::c6e7:aeff:fe07:28d4 dev enp1s0 lladdr c4:e7:ae:07:28:d4 STALE 
fe80::c6e7:aeff:fe07:2727 dev enp1s0 lladdr c4:e7:ae:07:27:27 STALE 
fe80::9683:c4ff:fe3d:8012 dev enp1s0 lladdr 94:83:c4:3d:80:12 router STALE 
fe80::c6e7:aeff:fe07:28d4 dev veth23cbcc9 FAILED 
fe80::c6e7:aeff:fe07:2650 dev enp1s0 lladdr c4:e7:ae:07:26:50 REACHABLE 
fdf0:d64f:d104:fdd1:c6e7:aeff:fe07:21a1 dev enp1s0 lladdr c4:e7:ae:07:21:a1 STALE 
fe80::decc:e6ff:fe9d:4235 dev enp1s0 lladdr dc:cc:e6:9d:42:35 STALE 
fe80::c6e7:aeff:fe07:21a1 dev enp1s0 lladdr c4:e7:ae:07:21:a1 REACHABLE ```
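To check a neighbour table for entries in the Thread prefix without scanning it by hand, something like the sketch below might help. It assumes the `ip -6 n` output has been saved as text and that each line has the form `ADDRESS dev INTERFACE ... STATE`, as in the listings above. The prefix is the one mentioned earlier in the thread, and the function name is hypothetical.

```python
import ipaddress
from collections import Counter

# Thread prefix discussed in this thread.
THREAD_PREFIX = ipaddress.ip_network("fd7c:4664:bcba:1::/64")


def parse_neighbours(text):
    """Return a list of (address, device, state) tuples from `ip -6 n` output."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        # Expect at least: ADDRESS dev INTERFACE STATE
        if len(tokens) < 4 or tokens[1] != "dev":
            continue
        entries.append((ipaddress.ip_address(tokens[0]), tokens[2], tokens[-1]))
    return entries


# Three lines copied from the listing above.
sample = """\
fe80::d273:d5ff:fe89:aa90 dev enp1s0 lladdr d0:73:d5:89:aa:90 REACHABLE
fe80::c6e7:aeff:fe07:21a1 dev veth23cbcc9 FAILED
fdf0:d64f:d104:fdd1:9683:c4ff:fe3d:8012 dev enp1s0 lladdr 94:83:c4:3d:80:12 router STALE
"""
entries = parse_neighbours(sample)
states = Counter(state for _, _, state in entries)
thread_entries = [entry for entry in entries if entry[0] in THREAD_PREFIX]

print(dict(states))
print("entries in Thread prefix:", len(thread_entries))
```

Running the same script on a capture from a working period and one from a dropout would show whether the state counts shift at all.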

```➜  ~ ip -6 route           
fd6a:9a1c:a68c:d6bb::/64 dev wpan0 proto kernel metric 64 pref medium
fd7a:115c:a1e0::9c01:4820 dev tailscale0 proto kernel metric 256 pref medium
fd7c:4664:bcba:1::/64 dev wpan0 proto kernel metric 64 pref medium
fdf0:d64f:d104:fdd1::/64 dev enp1s0 proto ra metric 100 pref medium
fe80::/64 dev veth23cbcc9 proto kernel metric 256 pref medium
fe80::/64 dev hassio proto kernel metric 256 pref medium
fe80::/64 dev docker0 proto kernel metric 256 pref medium
fe80::/64 dev veth2e6d269 proto kernel metric 256 pref medium
fe80::/64 dev veth0af4f7d proto kernel metric 256 pref medium
fe80::/64 dev vethe945e89 proto kernel metric 256 pref medium
fe80::/64 dev veth97fdbb5 proto kernel metric 256 pref medium
fe80::/64 dev veth870e0b7 proto kernel metric 256 pref medium
fe80::/64 dev veth536cb30 proto kernel metric 256 pref medium
fe80::/64 dev tailscale0 proto kernel metric 256 pref medium
fe80::/64 dev vethd3765ee proto kernel metric 256 pref medium
fe80::/64 dev veth3f3633b proto kernel metric 256 pref medium
fe80::/64 dev veth7836812 proto kernel metric 256 pref medium
fe80::/64 dev vethad019cc proto kernel metric 256 pref medium
fe80::/64 dev wpan0 proto kernel metric 256 pref medium
fe80::/64 dev veth280fe00 proto kernel metric 256 pref medium
fe80::/64 dev veth36ebdd4 proto kernel metric 256 pref medium
fe80::/64 dev enp1s0.1 proto kernel metric 1024 pref medium
fe80::/64 dev enp1s0 proto kernel metric 1024 pref medium```
#

the ot-ctl channel monitor output also looks hardly any different to ~4 hours ago:

```
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 641
occupancies:
ch 11 (0x10c8)   6.55% busy
ch 12 (0x2bb5)  17.07% busy
ch 13 (0x3d5d)  23.97% busy
ch 14 (0x2fb6)  18.63% busy
ch 15 (0x1424)   7.86% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x0a29)   3.96% busy
ch 18 (0x035a)   1.30% busy
ch 19 (0x0352)   1.29% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x035c)   1.31% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x0358)   1.30% busy
ch 25 (0x0d96)   5.30% busy
ch 26 (0x0a2f)   3.97% busy```
hard sentinel
#

Yeah, the channel monitor only samples every so often; it just gives you an overview of channel utilization over time, so it's probably not all that helpful. What channel is your Thread network on?

#

@brave trout I guess the devices aren't reachable again?

You can get insight into the topology of the Thread network using

ot-ctl meshdiag topology
# or
ot-ctl meshdiag topology children ip6-addrs

The latter gives a longer output with IP addresses etc.

When the Thread devices fail, are absolutely none of them reachable (ping)? It would be interesting to see whether that topology view changes when they become unreachable.

I wonder if just one Thread border router capable device ends up "eating" all traffic for some reason. Did you try making your Thread network smaller, e.g. only a couple of Thread devices from the same brand?

#

Also `ot-ctl netdata show` would be interesting, to see if it's different between the working and failed states.

brave trout
#

devices at this particular point in time are working as expected

#

the Thread fallover event isn't for another couple of hours

#

yes when the network falls on its face, nothing is pingable. The Eve Motion sensors recover the quickest, followed by the Eve Power plugs then the Aqara door sensors. The Nanoleaf lights always take the longest, but I'm in the midst of replacing all E27/GU10 lights with Aqara T2s.

#

```
Prefixes:
fd7c:4664:bcba:1::/64 paos low 1000
Routes:
fc00::/7 sa med 1000
Services:
44970 01 5c000500000e10 s 1000 0
44970 5d fd6a9a1ca68cd6bb89f78c6d59d9f172d127 s 1000 1
Contexts:
fd7c:4664:bcba:1::/64 1 c
Commissioning:
32000 - - -
Done```
#

^ is as of right now

#

```
id:04 rloc16:0x1000 ext-addr:9a4bab29b10914e4 ver:4 - me - br
    3-links:{ 03 05 10 14 18 22 38 46 52 53 55 62 }
    2-links:{ 07 09 17 19 }
    1-links:{ 21 40 }
id:03 rloc16:0x0c00 ext-addr:face50964fbadb24 ver:4
    3-links:{ 04 05 07 09 10 14 17 18 19 22 38 46 52 53 55 62 }
    2-links:{ 21 40 }
id:05 rloc16:0x1400 ext-addr:d215af600ba4bd32 ver:4
    3-links:{ 03 04 07 10 14 18 22 38 53 55 62 }
    2-links:{ 09 17 46 52 }
id:07 rloc16:0x1c00 ext-addr:c28810c2eba9c6b8 ver:4
    3-links:{ 05 14 18 22 38 53 62 }
    2-links:{ 03 04 }
    1-links:{ 46 55 }
id:09 rloc16:0x2400 ext-addr:6ed7b79c47d28469 ver:4
    3-links:{ 03 17 38 46 53 55 }
    2-links:{ 04 05 22 40 52 }
    1-links:{ 10 14 18 21 62 }
id:10 rloc16:0x2800 ext-addr:2e78893990c331c4 ver:4
    3-links:{ 03 04 05 22 38 46 53 55 }
    2-links:{ 14 18 }
    1-links:{ 09 17 52 62 }
id:14 rloc16:0x3800 ext-addr:62d6dda3cac1b584 ver:4
    3-links:{ 03 04 05 07 18 38 53 55 62 }
    2-links:{ 10 22 }
    1-links:{ 09 17 46 52 }
id:18 rloc16:0x4800 ext-addr:a2696788a970ba51 ver:4
    3-links:{ 03 04 05 07 14 22 38 53 55 62 }
    2-links:{ 10 46 52 }
    1-links:{ 09 17 19 40 }
id:17 rloc16:0x4400 ext-addr:c2bfed0aa7b0dcab ver:4
    3-links:{ 03 09 19 21 22 46 52 }
    2-links:{ 04 05 10 14 40 55 62 }
    1-links:{ 18 38 53 }
id:19 rloc16:0x4c00 ext-addr:d2ef365e665867ce ver:4
    3-links:{ 03 17 21 46 52 }
    2-links:{ 04 22 38 40 55 }
    1-links:{ 18 }
id:21 rloc16:0x5400 ext-addr:eafebab1130e434f ver:4
    3-links:{ 17 19 40 46 52 }
    2-links:{ 03 22 38 }
    1-links:{ 04 09 53 55 }
id:22 rloc16:0x5800 ext-addr:fe17eca420eead73 ver:4 - leader
    3-links:{ 03 04 05 07 09 10 14 17 18 21 38 40 46 52 53 55 62 }
    2-links:{ 19 }
id:38 rloc16:0x9800 ext-addr:ae2e7ae04b5cd545 ver:4
    3-links:{ 03 04 05 07 09 10 14 18 21 22 46 52 53 55 62 }
    2-links:{ 17 19 40 }
id:40 rloc16:0xa000 ext-addr:3eee988b72ce267e ver:4
    3-links:{ 21 46 52 }
    2-links:{ 09 19 22 }
    1-links:{ 03 17 38 53 55 }
id:46 rloc16:0xb800 ext-addr:faef426c3ccf40e4 ver:4
    3-links:{ 03 04 09 10 17 19 21 22 38 40 52 55 62 }
    2-links:{ 05 14 18 53 }
    1-links:{ 07 }
id:52 rloc16:0xd000 ext-addr:72041b4338864840 ver:4
    3-links:{ 03 04 09 17 19 21 22 38 40 46 55 }
    2-links:{ 05 10 14 18 62 }
    1-links:{ 53 }
id:53 rloc16:0xd400 ext-addr:1a95794d818423da ver:4
    3-links:{ 03 04 05 07 09 10 14 18 22 38 55 62 }
    2-links:{ 46 52 }
    1-links:{ 17 21 40 }
id:55 rloc16:0xdc00 ext-addr:42e4df307f040591 ver:4
    3-links:{ 03 04 05 09 10 14 18 22 38 46 52 53 62 }
    2-links:{ 07 17 19 40 }
    1-links:{ 21 }
id:62 rloc16:0xf800 ext-addr:864b45d648c3a4fb ver:4
    3-links:{ 03 04 05 07 14 18 22 38 53 55 }
    2-links:{ 17 46 52 }
    1-links:{ 09 10 }
Done```
#

you'll have to tell me what's good or bad from that output as I'm not seeing any red flags per se
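One way to make the topology dump easier to judge is to parse it and compare a working snapshot against a failed one. The sketch below assumes the `meshdiag topology` format shown in this thread (an `id:` line per router followed by `N-links:{ ... }` lines) and, as far as I understand OpenThread link quality, treats 3 as the best quality and 1 as the worst. The function names are made up, and the samples are shortened copies of outputs posted in the thread, including the truncated one that ends in `Error 28: ResponseTimeout`.

```python
import re

ROUTER_RE = re.compile(r"^id:(\d+)\s+rloc16:(0x[0-9a-fA-F]+)\s+ext-addr:([0-9a-fA-F]+)")
LINKS_RE = re.compile(r"^\s*([123])-links:\{([^}]*)\}")


def parse_topology(text):
    """Map router id -> {"rloc16", "flags", "links": {quality: [neighbour ids]}}."""
    routers = {}
    current = None
    for line in text.splitlines():
        router = ROUTER_RE.match(line)
        if router:
            # Trailing markers such as "- me - br" or "- leader".
            flags = [part.strip() for part in line.split(" - ")[1:]]
            current = {"rloc16": router.group(2), "flags": flags, "links": {}}
            routers[int(router.group(1))] = current
            continue
        links = LINKS_RE.match(line)
        if links and current is not None:
            neighbours = [int(n) for n in links.group(2).split()]
            current["links"][int(links.group(1))] = neighbours
    return routers


def missing_routers(working, failed):
    """Router ids present in the working snapshot but absent from the failed one."""
    return sorted(set(working) - set(failed))


working_text = """\
id:04 rloc16:0x1000 ext-addr:9a4bab29b10914e4 ver:4 - me - br
    3-links:{ 03 05 10 14 18 22 38 46 52 53 55 62 }
    2-links:{ 07 09 17 19 }
    1-links:{ 21 40 }
id:22 rloc16:0x5800 ext-addr:fe17eca420eead73 ver:4 - leader
    3-links:{ 03 04 05 07 09 10 14 17 18 21 38 40 46 52 53 55 62 }
    2-links:{ 19 }
"""
failed_text = """\
id:04 rloc16:0x1000 ext-addr:9a4bab29b10914e4 ver:4 - me - br
    3-links:{ 03 05 09 10 14 18 22 38 46 52 53 55 62 }
    2-links:{ 17 }
    1-links:{ 21 }
Error 28: ResponseTimeout
"""

working = parse_topology(working_text)
failed = parse_topology(failed_text)
print("routers that stopped answering:", missing_routers(working, failed))
```

A router missing from the failed snapshot only means it did not answer the diagnostic query in time, not necessarily that it left the mesh.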

#

also making my network smaller isn't really a viable option at the moment.

#

I am technically doing that by replacing the Nanoleaf E27/GU10s with Aqara T2s, but that's due to commissioning via Zigbee

#

the plan eventually is everything back on Thread once this fiasco is resolved.

brave trout
#

@hard sentinel so tried the ot-ctl meshdiag topology during the current crash and now get this:

```
id:04 rloc16:0x1000 ext-addr:9a4bab29b10914e4 ver:4 - me - br
    3-links:{ 03 05 09 10 14 18 22 38 46 52 53 55 62 }
    2-links:{ 17 }
    1-links:{ 21 }
Error 28: ResponseTimeout```
#

expanded version:

```
id:04 rloc16:0x1000 ext-addr:9a4bab29b10914e4 ver:4 - me - br
    3-links:{ 03 05 09 10 14 18 38 46 52 53 55 62 }
    2-links:{ 17 }
    1-links:{ 21 }
    ip6-addrs:
        fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc11
        fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc38
        fd7c:4664:bcba:1:5e79:a5f1:6f1d:9aa8
        fd6a:9a1c:a68c:d6bb:0:ff:fe00:fc10
        fd6a:9a1c:a68c:d6bb:0:ff:fe00:1000
        fd6a:9a1c:a68c:d6bb:89f7:8c6d:59d9:f172
        fe80:0:0:0:984b:ab29:b109:14e4
    children: none
Error 28: ResponseTimeout```
#

netdata and channel monitor look the same:

```
Prefixes:
fd7c:4664:bcba:1::/64 paos low 1000
Routes:
fc00::/7 sa med 1000
Services:
44970 01 5c000500000e10 s 1000 0
44970 5d fd6a9a1ca68cd6bb89f78c6d59d9f172d127 s 1000 1
Contexts:
fd7c:4664:bcba:1::/64 1 c
Commissioning:
32000 - - -
Done```

```➜  ~ ot-ctl channel monitor                     
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 860
occupancies:
ch 11 (0x10c8)   6.55% busy
ch 12 (0x2bb5)  17.07% busy
ch 13 (0x3d5d)  23.97% busy
ch 14 (0x2fb6)  18.63% busy
ch 15 (0x1424)   7.86% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x0a29)   3.96% busy
ch 18 (0x035a)   1.30% busy
ch 19 (0x0352)   1.29% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x035c)   1.31% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x0358)   1.30% busy
ch 25 (0x0d96)   5.30% busy
ch 26 (0x0a2f)   3.97% busy

Done```
brave trout
#

weird; I've moved the Thread channel from 15 (HAOS default) to 22 for testing purposes (as the scans kept coming up as 0% utilised), and now the channel monitor reports no activity on any channel:

```
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 11
occupancies:
ch 11 (0x0000)   0.00% busy
ch 12 (0x0000)   0.00% busy
ch 13 (0x0000)   0.00% busy
ch 14 (0x0000)   0.00% busy
ch 15 (0x0000)   0.00% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x0000)   0.00% busy
ch 18 (0x0000)   0.00% busy
ch 19 (0x0000)   0.00% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x0000)   0.00% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x0000)   0.00% busy
ch 25 (0x0000)   0.00% busy
ch 26 (0x0000)   0.00% busy```
#

OTBR addon logs do validate the channel change was successful though:

00:32:27.981 [D] P-SpinelDrive-: Sent spinel frame, flg:0x2, iid:0, tid:9, cmd:PROP_VALUE_SET, key:STREAM_RAW, len:21, channel:22, maxbackoffs:4, maxretries:0 ...

And all Thread devices are currently working as expected.

#

Is the OTBR scanning mechanism only designed to work on certain channels?

hard sentinel
#

I don't think so. I mean it has only 11 samples, could be that things were just not busy 🤔 Or maybe it's just some firmware issue. Can you try disabling and re-enabling the monitor?

hard sentinel
#

I wonder if your radio experiences complete outages of some kind at times? 🤔 But then, you did change the radio and it had the same outages with the same timing, no?

hard sentinel
#

At least for ZBT-1, I can guarantee that there is no firmware issue which makes the radio inoperable every 6h. I run ZBT-1 here, and it runs fine for days/weeks.. without a single device dropping off.

So it must be either your link between the radio and your HA host (USB hanging or something?) or something in the air which kills RF completely, on either the sending or the receiving side. There are options to sniff 802.15.4 packets and inspect them in Wireshark, but this would need a separate system set up for it. It's been a while since I last did that 🙈

brave trout
#

So we're really no closer to determining the cause. I've swapped just about everything I can and nothing has really made a difference. Hardware, software, infrastructure, the works. Nothing explains the intervals of drop-outs and I've already validated that wifi and zigbee devices work during the thread outages.

glass bridge
#

Hmm, maybe the border-agent just dies and then is brought up by the watchdog again. But such a thing should be visible in the logs as well.

I would also rule out USB, because it behaves the same when you tunnel the RCP over Ethernet.

hard sentinel
#

Just reached out to some Google Thread devs about this. They'll look into the logs a bit.

#

But from my perspective this seems to be a radio issue. Thread is probably a bit more susceptible to interference, which makes the mesh temporarily fail. It would be interesting to know whether, at the very moment the Thread devices fail, you also see Zigbee being slow to react, or retransmits in the logs (I think at least ZHA logs these things these days).

delicate geyser
# brave trout So we're really no closer to determining the cause. I've swapped just about ever...

It's been several months now. Did you ever figure this out?

I have a completely Apple Thread network with 100 Matter over Thread devices. Only two devices are battery powered. My every-six-hour issue started sometime after either 1) I tried installing a ZBT-2 (but uninstalled it a couple of hours later after it wreaked havoc on my network) or 2) I installed Inovelli White beta firmware for my 72 switches (but then deleted the JSON file from my matter_core_server updates folder and restarted the Matter Server).

I have a feeling that the failure is part of the "looking for updates" mechanism and I may have corrupted something in that process. But I've put everything back to how it was and I'm still seeing a big MoT failure every 6 hours.

I'm wondering if updating to Matter 8.2.0 might help, or even the matter.js beta. Hate to install beta on a household though.

Any ideas? I can give more details if necessary.

brave trout
# delicate geyser It's been several months now. Did you ever figure this out? I have a completel...

Not specifically, no, as no one could pinpoint the problem.

I ended up falling back to half Zigbee/half Thread (50 of each) and it's been stable for the last 6 months. FWIW, I ended up changing my ZBT-1's Thread radio channel to 22 (defaults to 25 in HA) and that may have helped, but it was so long ago that I don't recall if it made much of a difference. I do know that on the default channel I had a myriad of problems with onboarding, so I'm assuming that was due to radio interference from Zigbee and/or WiFi (given they all share the same frequencies, so co-existence is a thing).

#

I may at some point in the future bring the Zigbee devices I have back into Thread (as they have dual radios), but while it's working (and my household aren't yelling at me for basic stuff not working), I'm leaving it alone.

glass bridge
#

@delicate geyser 8.2.0 is essentially 8.1.2 plus the ability to enable matter.js when you toggle the "beta" switch in the configuration.
If it doesn't work for you, just turn the beta off, and you are back to 8.1.2. It does import all the settings from 8.1.2 to matter.js. It doesn't sync changes back to python though, I guess (like new devices added). For my 160 devices on HA Green, the import took about 6-8 minutes. Initialization another 20. I have restarted the server a couple of times since. Each time the server gets started with beta switch set to "on", it will download the latest beta (again). That alone takes a few minutes. Once done, it takes <4 minutes for all devices to come back online. That alone is a huge improvement. During the two days I have been testing, it lost two devices, and I had to re-pair them. Other than that, it is working solid for me.

glass bridge
#

It actually persists the IP addresses from the previous session and reuses them while doing updates in the background or when communication with the old known IP fails (don't know the exact heuristics). That makes it way faster for devices to come online again.

brave trout
# glass bridge It actually persists the IP addresses from the previous session and reuses them ...

I see. What would the general definition of "way faster" be in this sense though? For the longest time I've been trying to work out what the expected average for a device subscription for Home Assistant to be able to use a Matter device would be, but there's no clear information on this. Is it seconds? minutes? an hour?

Because of the 6 hour "everything is cooked" issues I had, it took an hour every time it broke to resub all devices back to being manageable. That didn't fly very well, which is why I went back to Zigbee. On a restart of the ZBT-1, all Zigbee devices are available within 5 seconds. Every single one of them. I was under the impression Matter would behave in a similar manner, as to my knowledge Matter shares some of its libraries with Zigbee, so there must be some implementation components that are the same or closely similar.

I've just migrated to the beta JS for Matter and not really noticing any increase in speed with the resubscriptions.

iron tartan
#

The "faster" is mostly anecdotal and is being reported back in real time by beta testers now. For me and a few others with large setups, after the initial slower first Matter-js.server startup, reconnect for some 90 plus Matter devices (vast majority Thread) went from 30-60 minutes with the Python server to 5 minutes with the Matter-js.server beta.

brave trout
#

first answer I've got that's given me some kind of a ball park figure. Thank you.

brave trout
# iron tartan The faster is mostly anecdotal and being reported back in real time by beta test...

I stand corrected on not noticing the speed difference in the beta; it is very much visible and wow. All 57 of my thread devices (across 7 vendors) came up from a cold start of the Matter Server addon in less than 2 minutes, vs the 15-30 minutes in the Python version. It now makes me want to convert the 55 Zigbee devices with Thread Radios in them back to Thread because the time delay on Matter was just too much at the time....

tiny osprey
#

Hi, I just found this post and now see the pattern. On my network, my devices also lose connectivity every 6 hours.
I have tried so many things over the last months to no resolution of the issue.
This is my running topic:
https://discord.com/channels/330944238910963714/1458531230910775550

Quick noob question, how did you make the graph of the dropping devices?

tiny osprey
# brave trout weird; I've moved the Thread channel from 15 (HAOS default) to 22 (as the scans ...

I can confirm, switching channels at some point also broke my channel monitor functionality.
I cannot get it to produce the stats like it did before.
No HA restart or hardware reboot will make a change, numbers still all stay 0

```
➜  ~ docker exec addon_core_openthread_border_router ot-ctl channel monitor
enabled: 1
interval: 41000
threshold: -75
window: 960
count: 129
occupancies:
ch 11 (0x0000)   0.00% busy
ch 12 (0x0000)   0.00% busy
ch 13 (0x0000)   0.00% busy
ch 14 (0x0000)   0.00% busy
ch 15 (0x0000)   0.00% busy
ch 16 (0x0000)   0.00% busy
ch 17 (0x0000)   0.00% busy
ch 18 (0x0000)   0.00% busy
ch 19 (0x0000)   0.00% busy
ch 20 (0x0000)   0.00% busy
ch 21 (0x0000)   0.00% busy
ch 22 (0x0000)   0.00% busy
ch 23 (0x0000)   0.00% busy
ch 24 (0x0000)   0.00% busy
ch 25 (0x0000)   0.00% busy
ch 26 (0x0000)   0.00% busy

Done```
brave trout
#

@tiny osprey I did mine with a template sensor that tracks Matter devices and when they turn unavailable:

```
      {{ states.light
        | selectattr('entity_id', 'in', matter_entities)
        | selectattr('state', 'eq', 'unavailable')
        | list | count }}```

Then used Apexcharts card to map them in the graph above:

```type: custom:apexcharts-card
graph_span: 24h
update_interval: 1m
span:
  start: day
series:
  - entity: sensor.unavailable_matter_devices_count
    name: Unavailable Matter over Thread Lights
    extend_to: now
    stroke_width: 2
    type: line```
tiny osprey
#

Thanks, I will try to make something like that for my devices.
I am still searching for a solution to all these dropping devices.

brave trout
#

I initially thought Nanoleaf was the cause of my problems (the lights are terrible), but I no longer have any of their lights and it was solid for months... now this issue has reared its ugly head again and I'm essentially back to square one to find a fix.

tiny osprey
#

I don't own, and never have owned, any Nanoleaf.
I think it is a config change I once made: I added a device, or had an error while adding a device, or maybe something Apple HomeKit related.
Maybe I have since reversed the change but the software still hasn't completely cleared it, or something.

brave trout
#

Any chance your networking gear is Ubiquiti?

tiny osprey
#

It is just now 11 o'clock, all devices dropped off again

#

Yes, Ubiquiti

#

UniFi OS Version: 4.4.6
U6+ AP Device version: 6.7.31
2x AC Lite AP Device version: 6.7.35
USW Lite 16 POE Device version: 7.2.123
2x USW Lite 8 POE Device version: 7.2.123
USW Flex Mini Device version: 2.1.6
Site Manager: 5.0.1

brave trout
#

hmm, mine is also.

tiny osprey
#

Did you migrate to the new Beta Matter JS server?
Also to the Beta Thread 1.4?
I am not so keen to move to Beta releases, so as not to stack issues on issues and end up with even more problems

brave trout
#

Yeah, I did both. However I had a power blackout overnight and my SLZB is no longer doing thread duties. Home Assistant can't see the OTBR radio.

#

Not sure if it's related.

#

Rolling back the OTBR version seems to have fixed my reported issue for now, but not sure if it was the beta toggle that bricked everything or not.

violet meteor
#

Just adding a comment that I am now also experiencing "storms" every 6 hours. I run a very vanilla install: HA Green, ZBT-2 TBR, about 70 devices (mostly Inovelli dimmers). I am using the matter.js beta.

#

I believe fairly strongly that whatever the problem is, it is internal to HA. As whenever I restart matter the 6 hour clock starts from that moment.
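If the 6 hour clock really does start at the restart, the next dropouts can be predicted and checked against what is observed. The sketch below is only an illustration of that arithmetic: the restart time is a made-up example, the 6 hour interval is taken from the reports in this thread, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

SIX_HOURS = timedelta(hours=6)


def predicted_dropouts(start, count=4, interval=SIX_HOURS):
    """Times at which the next `count` dropouts would fall, counted from `start`."""
    return [start + interval * k for k in range(1, count + 1)]


def minutes_off_schedule(start, observed, interval=SIX_HOURS):
    """Distance in minutes between `observed` and the nearest multiple of `interval`."""
    elapsed = (observed - start).total_seconds()
    period = interval.total_seconds()
    remainder = elapsed % period
    return min(remainder, period - remainder) / 60


# Hypothetical restart time, chosen only for the example.
restart = datetime(2025, 1, 1, 5, 0)
for when in predicted_dropouts(restart):
    print(when.isoformat(sep=" ", timespec="minutes"))

# An observed dropout three minutes after a predicted slot.
print(minutes_off_schedule(restart, datetime(2025, 1, 1, 17, 3)))
```

Small offsets across several dropouts would support the idea that the timer is tied to the restart rather than to wall-clock time.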

#

Anyone ever figure this out? It seems not?

lone roost
#

Is the ZBT-2 OTBR your only TBR?

My solution is to disable OTBR, to exclusively use my 7 Apple TBRs. I also have 75 Matter over Thread devices.

Did you try the OTBR beta? I never tried it.

violet meteor
#

Ya, my ZBT-2 is my only OTBR

#

I tried the OTBR beta once and something wasn't working correctly (not sure what) so I went back to the release version. Figured I'd wait for things to get a bit more stable and then try again at some point.

delicate geyser
#

For me the storms got smaller and smaller til they disappeared on their own. That happened one time to me before too. Although I'm about to update everything after being out of town a few weeks and I wonder if those storms will come back.

violet meteor
#

I think I tested it around 3ish weeks ago.

#

For some reason when I tried it, my thread network didn't come up on boot. Possibly I need some beta ZBT-2 firmware too? I'm not sure? I just saw it broke and reverted back.

delicate geyser
#

So, I just updated my Matter Server today without doing anything else, and 6 hours later on the dot I got a Matter storm. So I think the update started these every-6-hour storms again.