#Threshold at which Thread/Matter network falls apart with dozens of devices sending updates at once

median glen
#

Looking for some guidance to help decide how best to set up automations with a large Thread network.

Context: I have 70 REED devices at the moment, all Inovelli White series dimmer switches. SkyConnect dongle on a beefy VM in my closet that hosts HA/Matter add-on. Latest versions of HA and Matter add-on. All devices commissioned okay initially, more or less working as expected, with only (what I believe to be) a transceiver signal power issue on the devices causing some frequent drops of certain devices, and sometimes commands/updates taking a little longer than I'd expect to propagate. But again, mostly working.

I discovered that attempting to send commands to dozens of these devices at once (say, 30-40 of them) caused what appeared to be a full Thread/Matter crash and reset. Thinking further about that, I would surmise that the commands being sent to the devices were probably not the real cause of the dump, but rather the significant number of intermediate state updates being sent from the devices back to the hub -- I'd guess 5-10 updates for each device over the course of 5-10 seconds (dim level now 5%, dim level now 13%, dim level now...).
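For a rough sense of scale, the guess above can be turned into arithmetic. This is only a back-of-the-envelope sketch: the device counts, update counts, and time windows are the poster's own estimates, not measurements, and `report_rate` is a name made up for illustration.

```python
def report_rate(devices: int, updates_per_device: int, window_s: float) -> float:
    """Average number of state updates per second arriving at the controller."""
    return devices * updates_per_device / window_s


# Lower bound of the estimate: 30 devices, 5 updates each, spread over 10 s.
low = report_rate(30, 5, 10.0)
# Upper bound of the estimate: 40 devices, 10 updates each, within 5 s.
high = report_rate(40, 10, 5.0)

print(f"{low:.0f} to {high:.0f} updates per second")
```

This counts only the application-level updates; any acknowledgements and retransmissions on the mesh would come on top of that.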

I went searching for some guidance on how many messages a (semi-unreliable, perhaps weak-signal-strength) Thread network can support, what the built-in retry mechanisms are, how the Matter server may be influencing this (it's the thing deciding whether a device is still "up" or not, right?), and any anecdotal guidance from people in the field trying to run this many devices. I have come up fairly empty. 🙂

Does anybody have such guidance? In particular, what is the Matter server's role in deciding the stability of the connections to end devices, and when it determines a node has been lost and must thus be re-found? Does Thread itself have any limitations I should know about that might come into play here? Any documentation pointers or anecdotes are appreciated. 🙂

random path
#

Thread is a low-bandwidth network, so do not send commands to a large number of devices at once: no more than 5 at a time to be on the safe side. You can stretch this number a bit if you have enough routing devices and, most importantly, multiple border routers. But no, sending commands to all 70 of your devices at once will definitely kill your network for several minutes.

#

The answer to this is groups, btw, which are not yet implemented.

median glen
#

Thread has a native "group" concept? Did not realize if so.

Good feedback on the number of devices to attempt simultaneous commands with, thank you! Is that number (5ish) based on your own experience, other users' anecdotes you've heard, or anything documented?

fleet kelp
#

I have 23 MoT Nanoleaf bulbs, and because of this I have some light groups with a max of 3 bulbs in each. Sometimes one of the bulbs reacts a bit later than the other two. But it always works and never crashes my Thread network.

median glen
rose swan
#

Matter over thread does have native groups, yeah - thread provides ipv6 networking, then matter groups are built on top of ipv6 multicast.

#

one of the issues with matter right now is that while matter devices are running transitions, the devices are constantly and frequently sending updates back to the matter controller (i.e. HA)

#

so even once groups are a thing, it's good to avoid running transitions on large numbers of devices at once.

fleet kelp
random path
random path
median glen
#

@random path do you have a PR or commit link for that change?

random path
#

It's now in the Matter server beta

torn bane
#

Yeah, for the command side Matter's Group messaging should improve things. However, as you noted, the problem really is the subscription responses here. And there is no "Group Subscription" or something like that 😒 Also, depending on the light configuration/type etc. the outcome of commands can be different, so I guess logically there can't really be a group update 🤔

I'd guess that the lowered subscription ceiling should improve things a lot. Depending on the light you get multiple updates per second otherwise. Maybe we should even make this configurable, maybe per device in the future, to further improve the behavior for such large setups 🤔

median glen
#

Thank you! So it feels like I could attempt a test that sends, say, 5-10 "turn on" commands at once, and with the much reduced traffic back from the switches, see if that results in any congestion that causes network instability. I could then keep bumping that simultaneous number up until I see approximately where failures begin. I could pick a safe threshold where I simply wait between sending commands to groups of N size each -- e.g., send commands to 10 devices, wait 2 seconds, send commands to the next 10 devices, etc.
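The batch-and-wait test described above could be sketched roughly as follows. This is a minimal illustration, not anything from Home Assistant or the Matter server: `send_in_batches` and the `send` callable are hypothetical stand-ins for whatever actually delivers one command to one device, and the defaults (10 devices, 2 seconds) are just the numbers from the message.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def send_in_batches(
    targets: Sequence[str],
    send: Callable[[str], Awaitable[None]],
    batch_size: int = 10,
    pause_s: float = 2.0,
) -> None:
    """Send to batch_size targets concurrently, then pause before the next batch."""
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        # All commands within one batch are issued together.
        await asyncio.gather(*(send(target) for target in batch))
        # Skip the pause after the final batch.
        if start + batch_size < len(targets):
            await asyncio.sleep(pause_s)
```

Raising `batch_size` step by step while watching for dropped devices would be one way to find where failures begin on a particular network.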

#

Perhaps a future feature add: allow for messaging of devices and groups of devices where each command is added to a queue and the messages are sent out with a configurable jitter or delay automatically -- e.g., "send this command to each device/group with no more than 5 commands simultaneously and delay of 250ms between each batch thereafter." This would be a cleaner way for users to message without resorting to the manual waits that I'm describing above.

random path
#

Well, the result will be the same; you need to wait until the action is performed. Maybe you have a super optimal network where you can even send commands to 40 devices at once. Point is, sending commands to that many devices at once is bad in any protocol.

I also wonder why you even want to do so?
What is the use case for sending a command to all your switches at once?

median glen
# random path Well, the result will be the same; you need to wait until the action is performe...

I have a few automations that would be great to interact with at least 30 switches. We have, for example, about 20 switches for just the rear rooms of our main floor which each need to be set to a specific dim level to create a "dim" scene for evening. I'd like to actually expand to the entire floor, rather than just the rear of the house, and that would require around 40 switches. There are anywhere from 2-6 switches controlling lighting in each room/hallway/space, and there are 6-10 rooms to set at once to create full-floor lighting scenes. Further, the most intensive operation of all: setting the lights to "night" mode would involve commands to nearly every switch on the floor.

I am not sure what you mean by your first paragraph. The result may be the same, I agree with that. The effort the user (in this case, me) has to go through to implement his own batching-of-commands-and-waiting-between-batches strategy for every scene he wants to create is substantial. If we had an integration/something built into Matter implementation that would allow for this batching to be done more easily, we'd avoid all that effort across users. I have a feeling that there will be other users who want to send N commands that will exceed whatever thresholds their network will support (either because they have a lot of devices, or because they only have a couple of REEDs with so-so signal between them, etc.).

Maybe I am not thinking about it right, so please feel free to set me straight on what I am missing. But I am thinking about a "command group" that could be created in which all commands sent to the group would be automatically spaced out by a configurable amount of time, or which could be set to something like "no more than 5 commands per 2 seconds." It would iterate the commands in order, pausing as appropriate to ensure proper flow. This feels like a win to me...?
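One possible shape for the "no more than 5 commands per 2 seconds" idea is sketched below. Everything here is hypothetical: `CommandGroup`, `reserve`, and `submit` are names invented for illustration, and no such primitive exists in Home Assistant as far as this thread establishes. The sketch only spaces commands out in submission order; it knows nothing about the state of the network.

```python
import asyncio
import time
from collections import deque
from typing import Awaitable, Callable, Deque


class CommandGroup:
    """Hypothetical command group: at most max_commands per period_s seconds."""

    def __init__(self, max_commands: int = 5, period_s: float = 2.0) -> None:
        self.max_commands = max_commands
        self.period_s = period_s
        # Send times of the most recent max_commands commands, oldest first.
        self._recent: Deque[float] = deque(maxlen=max_commands)

    def reserve(self, now: float) -> float:
        """Reserve the next slot and return the time the command may be sent."""
        if len(self._recent) == self.max_commands:
            # Must be at least one period after the command max_commands ago.
            send_at = max(now, self._recent[0] + self.period_s)
        else:
            send_at = now
        self._recent.append(send_at)  # maxlen drops the oldest entry
        return send_at

    async def submit(self, command: Callable[[], Awaitable[None]]) -> None:
        """Wait for the reserved slot, then run the command."""
        send_at = self.reserve(time.monotonic())
        delay = send_at - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await command()
```

With the defaults, twelve commands submitted at the same instant would go out as five immediately, five after 2 seconds, and two after 4 seconds.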

random path
#

Well, if we throttle, we get complaints that lights do not turn on at the same time. Sure, we can do some safeguarding, but on the other hand, for your use case (controlling such a huge number of lights at the same time!), the only solution is groups. In that case a multicast command is sent to all lights at the same time.

median glen
#

Yes, I agree -- it would not be a good solution to impose on all users. I think what I am considering is simply an optional concept to use if you need it -- if you have many devices to communicate with at once, there is a built-in principle for organizing those devices and metering out the commands to them to avoid overloading a MoT network. I think most people would not use this. It would only be useful for people with many of the same device who need to control them in an aggregate way.

Unfortunately, groups as they currently exist do not work -- it isn't possible to send different commands to each entity in a group, which is the primary use case (e.g., dim light 1 to 20%, dim light 2 to 30%, etc.). The only place groups would work in my case is to turn off all lights.

Further thinking -- would this kind of principle be useful for things other than just MoT networks? It sounds like, from your experience, this would also potentially be useful to people with larger Zigbee networks, yes? I wonder if there are other use cases for metering commands to large numbers of entities at once.

#

Aside: do you know when the next Matter server release will be? I am eager to get my hands on that subscription change. 😄

torn bane
#

Further thinking -- would this kind of principle be useful to things other than just MoT networks? It sounds like from your experience, this would also potentially be useful to people with larger Zigbee networks, yes? I wonder if there are other use cases for a metering of commands to large numbers of entities at once.
Yeah, overwhelming a network with limited bandwidth with commands/traffic is generic to all smart home RF protocols, I'd say (Zigbee/Z-Wave). That said, not sure if it'd be easy to implement a generic "rate limit" at the HA level. At least for Matter, you'd only want to rate limit the Thread network (or rather, have a much higher limit for WiFi/Ethernet devices, as the bandwidth is much higher).

#

How the rate limiting works can also be quite different. E.g. for Matter/Thread (which is UDP based) I could imagine a dynamic algorithm similar to TCP's congestion control algorithms. Maybe this could be implemented in the MRP protocol 🤔
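To illustrate what "similar to TCP's congestion control" could mean, here is a minimal additive-increase/multiplicative-decrease (AIMD) sketch, the basic idea behind TCP congestion avoidance. It is purely illustrative: `AimdWindow` and its parameters are hypothetical, and nothing here reflects how MRP or the Matter server actually behaves.

```python
from dataclasses import dataclass


@dataclass
class AimdWindow:
    """Hypothetical cap on how many commands may be in flight at once."""

    size: float = 1.0      # current window, in commands
    minimum: float = 1.0   # never drop below one command in flight
    maximum: float = 40.0  # assumed upper bound, chosen arbitrarily
    increase: float = 1.0  # additive step per fully acknowledged window
    decrease: float = 0.5  # multiplicative factor applied on a timeout

    def on_ack(self) -> None:
        """Grow slowly: about `increase` per window's worth of acknowledgements."""
        self.size = min(self.maximum, self.size + self.increase / self.size)

    def on_timeout(self) -> None:
        """Back off sharply when a command goes unacknowledged."""
        self.size = max(self.minimum, self.size * self.decrease)

    @property
    def limit(self) -> int:
        """Whole number of commands currently allowed in flight."""
        return int(self.size)
```

A sender would keep at most `limit` commands outstanding, calling `on_ack` for each acknowledged command and `on_timeout` whenever one times out, so the rate adapts to what the mesh can carry at that moment.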

median glen
#

IMO it would have to either be at the base protocol level, as you're suggesting, OR at the user's level. I think "generic rate limit on HA level" is not quite capturing the idea, in that my suggestion is to allow the user to decide when they need to implement a group rate limit. The user knows (or can know) what kinds of devices they have and whether they're communicating via Thread or WiFi or something else. In my case, I would have the option to group up a bunch of commands using this primitive which I know are all part of the Thread network that could experience congestion. So we aren't saying "all devices in the Matter integration" or "all devices on Thread" or anything of that sort; merely, "all devices which are part of this command group, as specified by the user."

Even better, we could probably have the group be sort of dynamic with the use of labels on devices. I could label (and maybe even automatically label) any device which is added to Thread network X as being part of the device group for which I would like commands to be metered. This gives me the power to specify certain networks (or, with other rules, only certain classes of devices) as being ones that require such metering, and let the system then handle it appropriately.

#

Does HA have an existing concept that maps somewhat to this idea? I.e., there is an integration that's responsible for sending commands from devices, but there's a "shim" that's inserted in between the devices and the command sends. Sort of like something inserting itself into the middle of the integration's implementation to change its behavior...?

#

Perhaps such a shim isn't the right way to think about it. Perhaps it's more like... a wrapper around the entities or around the devices? E.g., "Turn on entity X." But entity X is actually a wrapper entity around the light entity, exposed as if it is a light entity itself (sort of a template?), and which handles the metering of commands as described above.

#

New integration: "Control Flow Entity Group"
Exposes any (supported) entity that's placed into the group as a new entity that meters its commands to the contained entities as a collective.

As one example: For a light entity, supports on, off, set dim level, set color, etc. If you add entity "Light 1" to group "Light Group A," the integration will create a new entity "Light 1 in A," and so on for all other entities added to "A."

When the "turn on" command is sent to "Light 1 in A," the integration checks its control flow state to decide when to send the command -- i.e., perhaps immediately, perhaps with a slight delay, etc., all dependent upon configuration and current state of the group.

Useful for all entities that communicate via a protocol that could experience congestion leading to dropped commands or stability problems in large networks, such as Thread, ZigBee, maybe others.

median glen
torn bane
#

We don't really have a release schedule but we intend to release 6.6.0 quite soon (today or tomorrow).

median glen
torn bane
#

Right, the add-on bump is still missing 🙈 Let me tackle that!

torn bane