#Thread suddenly down completely

burnt light
#

I've been using Matter-over-Thread with Home Assistant for over a year now with only minor issues with individual devices from time to time. I run HAOS in a VM, and use a SkyConnect with OTBR.
Since a (clean) restart of the system yesterday, all my Thread devices are suddenly completely down (unavailable in HA and offline in the Matter add-on). Restarting the add-ons, the HAOS VM and even the host machine didn't help. I have no Matter-over-Wifi devices, so I can't tell if those would still work or not.
I see no particularly suspicious logs (but I'm not an expert to judge the matter/otbr logs), except maybe this one from otbr, which I'm not sure if it was there before or not:
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::5086:6aff:fe89:85ea/vethf1d52f6/14
As far as I can tell, nothing changed with my network, hardware or anything else when it went down... I simply restarted the machine cleanly, like I have done many times before without any issues.
Could it be that the SkyConnect died? It still shows in lsusb on both the host machine and HAOS VM fine. Is there any easy thing I could try to test that? Do you have any other ideas what I could try to fix the issue? Thanks!

river belfry
#

sometimes removing usb from vm and re-adding AFTER reboot (not restart) helps. Also, is your ipv6 / dns up&running?

burnt light
#

I have to admit I'm not very experienced with networking and IPv6 in particular. I have enabled 6-to-4 (or what that is called) on my router, and my machines do have IPv6 addresses. I can ping one of them with ping6 from one of the other machines in my network, so in general it seems to be working. More importantly (maybe), nothing should have changed from when it worked.
I checked the logs more thoroughly, and indeed it seems to me that this log line:
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::c07:e6ff:fe9a:bf2b/veth9336ec7/14
is new since the issue arrived. Could it be that my router assigned that prefix elsewhere, or something like this (but I'm not really using any IPv6 actively apart from OTBR)? I've restarted my router and HAOS, but that also didn't fix it.

#

If I try to ping6 that address listed in the log line (fe80::c07:e6ff:fe9a:bf2b), I get "address unreachable". I can ping6 another address such as fe80::2d8:61ff:fe79:8c00 which is one of the other machines in my network.
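
For what it's worth, here is a small Python sketch of how those addresses can be read. It assumes the interface identifiers were derived from a MAC address via modified EUI-64 (the `ff:fe` in the middle suggests that, but it is an inference), and the helper names are made up for illustration. Addresses in fe80::/10 are link-local, so they are not taken from a prefix the router hands out, and they are only reachable from the same link. A locally administered MAC hints at a virtual interface such as the veth named in the log line, which would fit with it being unreachable from another machine.

```python
import ipaddress
from typing import Optional


def mac_from_eui64(addr: str) -> Optional[str]:
    """Recover the MAC behind a modified-EUI-64 IPv6 address, if there is one."""
    iid = ipaddress.IPv6Address(addr).packed[8:]  # lower 64 bits
    if iid[3:5] != b"\xff\xfe":
        return None  # interface identifier was not built from a MAC
    # Modified EUI-64 inverts the universal/local bit (0x02) of the first octet.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


def is_locally_administered(mac: str) -> bool:
    """True if the MAC has the locally administered bit set (typical for virtual NICs)."""
    return bool(int(mac.split(":")[0], 16) & 0x02)


if __name__ == "__main__":
    # Addresses copied from the messages above.
    for addr in ("fe80::c07:e6ff:fe9a:bf2b", "fe80::2d8:61ff:fe79:8c00"):
        ip = ipaddress.IPv6Address(addr)
        mac = mac_from_eui64(addr)
        local = is_locally_administered(mac) if mac else None
        print(addr, "link-local:", ip.is_link_local, "mac:", mac, "locally administered:", local)
```

If the first address really belongs to a container-side veth, not being able to ping it from the desktop would be expected and is probably not the cause of the outage by itself.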

river belfry
#

not a friend of ipv6 either, i guess you can disable ipv6 because the devices will talk v6 to each other with link-local addresses (fe80::), you could try that. Check if igmp snooping / multicast filter is turned off in your router because it could block mdns requests
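
A rough way to check whether a host can send mDNS multicast at all is sketched below in Python. It is only an illustration: the function names are made up, `eth0` in the usage note is a placeholder interface name, and `_meshcop._udp.local` is used because that is the service type Thread border routers normally advertise. The sketch builds a minimal DNS PTR question and sends it to ff02::fb on port 5353 through one interface.

```python
import socket
import struct

MDNS_GROUP = "ff02::fb"  # IPv6 mDNS multicast group, as seen in the log line
MDNS_PORT = 5353


def build_mdns_query(name: str, qtype: int = 12) -> bytes:
    """Build a minimal DNS query packet (one question, PTR by default, class IN)."""
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)  # id, flags, qdcount, an/ns/ar counts
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!2H", qtype, 1)


def send_probe(ifname: str, name: str = "_meshcop._udp.local") -> None:
    """Send one mDNS question out of the given interface; raises OSError on failure."""
    ifindex = socket.if_nametoindex(ifname)
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_IF, ifindex)
        sock.sendto(build_mdns_query(name), (MDNS_GROUP, MDNS_PORT, 0, ifindex))
```

Calling `send_probe("eth0")` and getting an `OSError` with errno 99 would mirror the "Cannot assign requested address" message from the log. A successful send only shows that the local interface is usable; it says nothing about whether the router forwards or filters the multicast traffic.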

#

i also think the 6to4 tunnel isn't needed.

#

I hope someone corrects me if i am wrong.

burnt light
# river belfry tried this?

Yes, I just tried it. I also disabled an option that says "IPv6 is secured" on my router (although as I said, it was working fine for a long time before, I didn't do any changes at all when it stopped working). Now that "error 99" is gone interestingly, but I still see a bunch of these in the OTBR logs:

00:01:15.422 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:129, chksum:4d20, ecn:no, to:36ad7376d15f656d, sec:no, error:NoAck, prio:net, radio:all
00:01:15.422 [N] MeshForwarder-: src:[fe80:0:0:0:2cf9:919c:d33e:c95a]:19788
00:01:15.422 [N] MeshForwarder-: dst:[fe80:0:0:0:34ad:7376:d15f:656d]:19788

And the Matter add-on still can't find any device at all.
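
To see whether these failures are spread over many neighbours or concentrated on a few, the log lines can be tallied with a short Python sketch like the one below. It assumes the log format matches the three lines quoted above, and the function names are hypothetical. As far as I know, the `to:` field is the neighbour's extended address and the `dst:` link-local address is the same value with one bit of the first byte inverted (36 vs. 34 here); port 19788 is the one Thread uses for MLE, so these look like failed mesh link messages rather than Matter traffic.

```python
import ipaddress
import re
from collections import Counter

# Matches the "Failed to send" lines quoted above; other line formats are ignored.
FAIL_RE = re.compile(r"Failed to send .*?to:(?P<to>[0-9a-f]{16}).*?error:(?P<error>\w+)")


def count_failures(lines):
    """Count send failures per (destination extended address, error) pair."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            counts[(match.group("to"), match.group("error"))] += 1
    return counts


def link_local_from_ext_addr(ext_addr: str) -> ipaddress.IPv6Address:
    """Derive the link-local address from an extended address (first byte XOR 0x02)."""
    raw = bytes.fromhex(ext_addr)
    iid = bytes([raw[0] ^ 0x02]) + raw[1:]
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + iid)


SAMPLE = [
    "00:01:15.422 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:129, chksum:4d20, "
    "ecn:no, to:36ad7376d15f656d, sec:no, error:NoAck, prio:net, radio:all",
    "00:01:15.422 [N] MeshForwarder-: src:[fe80:0:0:0:2cf9:919c:d33e:c95a]:19788",
    "00:01:15.422 [N] MeshForwarder-: dst:[fe80:0:0:0:34ad:7376:d15f:656d]:19788",
]

if __name__ == "__main__":
    for (dest, error), n in count_failures(SAMPLE).most_common():
        print(dest, link_local_from_ext_addr(dest), error, n)
```

If nearly every destination shows `NoAck`, that would point at the border router's radio link to its neighbours in general rather than at one misbehaving device.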

#

Generally speaking, my IPv6 seems to be up and running. I can ping6 the address HA shows in the network screen from my desktop computer over the LAN.

#

The OTBR web interface shows this for network topology... interestingly there is no leader (which I think there was when I looked at it in the past). Otherwise the status page seems to be showing fine, from what I can tell

river belfry
#

what happens if you clear browser cache?

burnt light
#

The actual topology shown changes on every reload (including full reload with cleared cache) but otherwise it is the same qualitatively

river belfry
#

hmm, then we need bigger brainz for help here.

burnt light
#

FYI, I also tried running the HAOS VM based on two backups I have (one from yesterday and one from two weeks ago, where everything definitely still worked), and it shows the same result.

river belfry
#

Have you installed the Silicon Labs Multiprotocol add-on? If so, remove it. If not, good. I read in the dark void of the AI world that it could help to switch the radio type from Thread to Zigbee (or vice versa), wait a minute, and then set the radio type back to the original setting

humble merlin
#

So, you can see the network topology, I would take that as an indication that Sky Connect didn't die. Then you also see a net split. And it shows that all but one of your devices are happily connected to one router. If the isolated router is your Sky Connect that would explain why nothing works.
One thing that is really strange though is, that your TBR did not promote itself into the leader role...
So what you could do is test the above hypothesis. If possible, power off all your thread devices (except for the HA TBR). Then turn them on again one by one, sequentially. Wait/see between every step until/if the device turned on last will appear in HA as online...

#

Unfortunately I have seen it numerous times that whole segments of thread devices just split from the "connected" part of the thread network, and the top-level router just does not "realize" that it is "top-level" and would have to re-connect or give up its role. And without intervention devices can stay in that split condition for days (or maybe forever? didn't try, haven't been patient enough).

river belfry
#

learning..

burnt light
#

@humble merlin That's an interesting hypothesis. Unfortunately it is probably quite hard for me to turn most or all of the devices off, as I have quite a few relays and other such mains-powered devices throughout my house and would have to go around, uninstall power outlets to get behind them and things like that... although if that is the only option, I might try.
Is there no other way to directly tell my OTBR to join and become the leader of the network, without doing what you suggest?

#

OTBR offers me to join these networks (all with the same ID so I guess that's multiple shards of the original network). Would that work? I can try out things as I have a backup of the VM state that I can easily revert to, as long as I don't mess with the devices in my home (I would rather not re-commission all of them as they are hard to reach and quite a few).

#

One thing I will try (since I can easily restore the state of the VM) is reset the border router and Matter as a whole, and see if I can commission a spare device I have to the new network, and if it works then... that would also confirm that my networking setup and the SkyConnect are fine.

burnt light
#

Ok, so I tried it - completely reinstalled the OTBR and Matter add-ons and integrations from scratch, created a new thread network. OTBR made itself the leader there. Commissioned a spare device (Eve Energy) to the new network and Matter controller. All worked fine and the device was correctly working.
So the issue really seems to be the Thread network / OTBR not correctly joining it. If there is nothing I can do to explicitly tell it to join as a leader / promote itself, @humble merlin you say there is a chance it will just fix itself in a day or two? I'd rather try that than reset and recommission everything from scratch. I can also try turning off and turning back on a few Thread devices around my SkyConnect (I cannot do all but I can do some by turning off and back on one or more fuses in my home). Will that help with OTBR not being a leader?

surreal pivot
#

thread devices on a partitioned network should repair the partitioning and turn back into a unified network at some point if they can see devices from other partitions and have the same network data (pan id, channel, encryption keys, etc).

#

relies on you having a sufficient number of routing-eligible end devices that can see each-other tho, and buggy thread implementations could be problematic :(

humble merlin
#

How many BRs do you have? If your HA OTBR is the only one, it should always promote itself into the leader role. So I think it would be worth a try to just join with one of the suggested nodes above.

(And coming back to the topology graph above: If you could identify the router all other devices are connected to, and were able to turn just that one off, that could also suffice for the other devices to look for a different upstream router -- BUT of course you would also have to make your OTBR part of the original PAN again beforehand for this to work.)

surreal pivot
#

note that there's no particular reason that the leader in a thread network would be a border router; can be any device.

burnt light
#

I have just the one border router. My network is rather large (about 70 devices total, and perhaps half of that router-capable). I did what Boris suggested, turned off some devices around my SkyConnect (that were easy to do), restarted the OTBR add-on, and it became a router. Then I turned the devices back on, and I got the network back in working state. So far it seems to be all working again. Thanks for the help!

#

I hope it won't happen again... I have particularly seen issues (although isolated to those devices in the past) with Nanoleaf Essentials bulbs. They were always flaky, and some of them are in critical positions at a chokepoint of my network. I will replace them with Aqara T2 bulbs soon, which may help stabilise the network.

humble merlin
#

@burnt light please share your experience with the T2s. I have - for the same reason - just finished replacing 8 Nanoleaf GU10 bulbs with Philips Hue's new MoT capable bulbs. First impression: The Hues are already by spec 55lm less bright than the Nanoleafs, which already makes >10%. Also, the Nanoleafs' LEDs are located in a circle closer to the center of the bulb, whereas the Hues' LEDs are located more towards the edge. You would think that this makes the Hues a bit brighter, but no: because of the LEDs' edge location they get partially covered by the GU10 fixing ring. So in total, in fully mounted position they are 20% less bright than the Nanoleafs.
Also, the Nanoleaf bulbs are a bit concave, whereas the Hue bulbs are flat, so they are more on spot, where the Nanoleafs cover a larger area.

#

Other than that: The Philips bulbs are at least on Matter 1.4.1 while the Nanoleafs are - afaik - on 1.3. That alone doesn't make them better Thread devices though...

surreal pivot
#

Most of the issues I have with Nanoleaf are with their Matter implementation, not thread. A few particular things to note - transitions, especially at low brightness, happen in visible steps rather than smoothly (this is a bug in their matter implementation; native control via the app results in smooth transitions), the on/off transition configuration is kinda buggy (and you can't separately configure on vs. off), and the On level configuration is completely broken (violates spec) - annoyingly, in a way that generates extra thread traffic.

#

I've been happy with the Aqara T2 bulbs, but I haven't used any GU10 stuff.

#

kinda weird that the hue devices apparently support simultaneous zigbee and thread connections; the aqara devices technically support both, but use a completely separate firmware for zigbee mode vs. thread mode (switching via the aqara app flashes the bulb to a different firmware)

burnt light
#

I'm using the bulbs outdoors for motion-activated lighting in my driveway. I've replaced one Nanoleaf with a T2 so far. To me the lighting seems fine, but I don't think I can judge that very well based on my usage (vs. using it indoors to light a room).
The Nanoleafs became unavailable (and then available after some time again) quite frequently; this might be due to being outdoors and the thread network not ideal in that location, but other thread devices are working fine there (and should provide routers near the nanoleafs as well). For a week or so that I now have the T2 installed, it subjectively seemed much more stable.
The Nanoleafs also seem to not properly support the "on power on, turn the light off" configuration that I want to use in my setting (they always turn on at power on). Especially when they were unavailable, so my system couldn't turn them off, and I had a temporary power failure or turned off power for something, this led to situations where they were on for a few hours in full daylight before they managed to rejoin the network. The T2 supports this perfectly.
A final issue (or annoyance) I've had with the Nanoleafs was that sometimes they would only turn back on to like 10% brightness on light.turn_on, not sure why. And explicitly setting the brightness to 100% in my automation wasn't possible either, as one of the bulbs would just not turn on if I did that for some reason.

#

I hope the T2s will fix all that, but I'll see. So far, the one I have definitely seems to be more stable in general, and it will also avoid the "light is on after power cut" issue for sure. The others, I will see

river belfry
#

more people have issues with nanoleaf, so stability will improve (unless using tuyaah)

humble merlin
# surreal pivot kinda weird that the hue devices apparently support simultaneous zigbee and thre...

Still they do, and do so by default; I don't know if two connections to different RF protocols work in parallel though. Just unboxed them, and commissioned them with the companion app directly to HA. No problem. A bit annoying was the fact that I had to enter the Matter codes manually, because there was only one Matter code on the box, which turned out to be a new multi-device commissioning code (which probably no-one supports apart from Philips, and which breaks my personal filing system for Matter codes). That's why I know they are on >= 1.4.1. Also, no current firmware in DCL, no recent CSA certificate shown. I could identify the PID though, and the certified firmware (from a year ago, but the product was only released recently) is quite a bit older than what I have on my bulbs. Will be interesting to see how updates work. But I guess the same goes for the T2. Aqara doesn't have the best history when it comes to Matter firmware updates of their products.
