Indeed, a delay of around 0.5 seconds to collect a pool of wakeword detections, so that the one with the highest volume intensity can be selected, seems both reasonable and necessary. Even so, I believe 0.5 seconds is negligible compared to the total activation, listening, processing, and response time. If I had to choose between a 0.5-second delay that ensures the correct satellite responds and no delay but potentially several seconds wasted canceling and repeating a command (with no guarantee that the desired satellite will activate), I would prefer the former.
It’s also true that we cannot fully control every factor that influences audio intensity. However, if we have more control over the measured intensity than over the speed of wakeword detection, for example by adjusting each satellite’s intensity level with an offset or by using physical placement to influence the result (something the current approach doesn’t allow), then that added control becomes valuable.
Furthermore, it’s obvious that if I’m standing next to a satellite, my voice is far more likely to register louder on that one than on another satellite five meters away.
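
To make the idea concrete, here is a rough sketch in Python of what I have in mind. It is not how the current implementation works; the names (`Detection`, `pick_satellite`, `offsets`) and the dB-style intensity values are assumptions made up for illustration. Detections that arrive within 0.5 seconds of the first one are pooled, each satellite's intensity is adjusted by an optional per-satellite offset, and the loudest one wins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Detection:
    """One wakeword detection reported by a satellite (hypothetical structure)."""
    satellite_id: str
    timestamp: float  # seconds, when the wakeword was detected
    intensity: float  # assumed loudness measure, e.g. RMS level in dB; higher is louder


def pick_satellite(
    detections: List[Detection],
    window: float = 0.5,
    offsets: Optional[Dict[str, float]] = None,
) -> Optional[str]:
    """Return the id of the satellite that should respond, or None if there are no detections."""
    if not detections:
        return None
    offsets = offsets or {}

    # The window opens with the earliest detection.
    first = min(d.timestamp for d in detections)

    # Pool only the detections that arrived within the window.
    pool = [d for d in detections if d.timestamp - first <= window]

    # Compare intensity after applying each satellite's calibration offset.
    best = max(pool, key=lambda d: d.intensity + offsets.get(d.satellite_id, 0.0))
    return best.satellite_id


detections = [
    Detection("kitchen", 10.00, -30.0),
    Detection("living_room", 10.12, -18.0),
    Detection("bedroom", 10.90, -5.0),  # arrives after the window has closed
]

print(pick_satellite(detections))
print(pick_satellite(detections, offsets={"kitchen": 15.0}))
```

With these made-up numbers the first call picks `living_room`, because `bedroom` arrives too late to be pooled, and the second picks `kitchen`, because its offset lifts it above `living_room`. That offset is exactly the kind of per-satellite tuning that a "fastest detection wins" approach doesn't offer.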