#colliding ok nabu wake words. How to turn the wake word volume down

jovial bloom
#

I have 2 Atom Echos and one PE.

The PE is in the living room and the Echos are in the study and bedroom.
But the Echo in the study often picks up the "okay nabu" wake word spoken in the living room.

I tried switching the wake word from "okay nabu" to "hey jarvis", but that one does not always trigger with my Dutch accent.

Is there a way to turn down the mic volume on the Atom Echo in the study?

I can't just drop the study because I have loads of room-aware scripts, to turn a fan or music on/off for example ("music on" instead of "I want music in the living room" or whatever the standard voice command is).

wooden jacinth
#

change this:

micro_wake_word:
  on_wake_word_detected:
    - voice_assistant.start:
        wake_word: !lambda return wake_word;
  vad:
  models:
    - model: ${micro_wake_word_model}

to

micro_wake_word:
  on_wake_word_detected:
    # wait briefly so a nearby satellite can reach HA first
    - delay: 50ms
    - voice_assistant.start:
        wake_word: !lambda return wake_word;
  vad:
  models:
    - model: ${micro_wake_word_model}
#

can play with the delay

#

that way it might detect it first but the VPE would report first so it would then get ignored?

#

There are some other MWW options for tuning detection which you could maybe play with to reduce hits from the other room without reducing them in the room itself, although this would be a balancing act.

wooden jacinth
#

Update: I tried adding the delay as above, and it seems this breaks something on the Atom Echo and causes errors in the I2S driver. I am going to have a play with it though.

jovial bloom
#

@wooden jacinth thanks for putting in the effort.

jovial bloom
#

Isn't there some way to set a priority? Like if two collide, pick this one.

wooden jacinth
#

It's first come, first served, from what I understand.

jovial bloom
#

No, they collide and neither one listens.

wooden jacinth
#

It's supposed to be that one is picked based on which one gets to HA first, which is what I am seeing.

#

So the driver error I was getting earlier was me doing something strange. I think the delay where I had it is the way to go. I am experimenting a bit currently.

wooden jacinth
# jovial bloom Isn't there some way to set a priority? Like if two collide, pick this one.

So (assuming stock firmware and configuration) the VPE will always be 300ms behind the AE, because the VPE plays the wake sound and only then starts VA.
A delay of 100ms seems to be enough to account for latency and always give priority to the non-delayed one.
So adding a 400ms delay to the AE solves it, but if you turn off the wake sound on the VPE then only a 100ms delay is needed.

although this may cause you to have the reverse problem to be fair
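
The delay budget above can be written out as a small calculation. This is only a sketch: the function name and parameters are made up for illustration, and the 300ms and 100ms figures are the ones observed in this thread, not values guaranteed by ESPHome.

```python
# Hypothetical helper: how long the Atom Echo should wait so that a
# VPE hearing the same wake word reaches HA first.
def atom_echo_delay_ms(vpe_wake_sound_ms: int = 300, latency_margin_ms: int = 100) -> int:
    # The VPE lags by the length of its wake sound; the margin covers
    # network and processing latency between the two devices.
    return vpe_wake_sound_ms + latency_margin_ms


print(atom_echo_delay_ms())                     # VPE wake sound on  -> 400
print(atom_echo_delay_ms(vpe_wake_sound_ms=0))  # VPE wake sound off -> 100
```

The result is the value that would go into the `delay:` step in front of `voice_assistant.start` on the AE.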

wooden jacinth
#

@jovial bloom
I created a branch on my fork of the stock AE VA firmware

procedure to use here - https://gist.github.com/MichaelMKKelly/b547586a41b925f57bba3617a4740181

I agree that a priority system would be handy for those with devices near each other. As someone with 4 VAs on their desk currently, I understand the pain. This setup can hopefully help, although as I said, you might get the reverse problem with it.

jovial bloom
#

compiling

jovial bloom
#

Is there a way to run ESPHome compilations on a faster machine? HA runs on a Pi 4.

#

Sure, I know ESPHome can run in Docker, but does that give the same integration with HA?

#

@wooden jacinth I have compiled the firmware for both my Atom Echos and now I don't have to whisper any more in my living room.

jovial bloom
#

that also works

#

Now how do we get this into mainline ESPHome?

wooden jacinth
#

I was just thinking this could be an option within the MWW component, which could then be exposed to the UI as a debug config option. But realistically I don't think it's something that would happen, given the fix is for an edge case and is fairly easy to add manually.

wooden jacinth
#

@jovial bloom I updated this setup in the fork branch. It should now compile with ESPHome 2025.5 and enable the preannounce sound and continued conversation on the AE.
The delays required for a VPE to reliably win seem to have changed though.

The default is now 500ms (up from 400ms) and the recommended value if you disable the VPE wake sound is now 300ms (up from 100ms). I am not totally sure why these changes were needed, but you can fine-tune the exact number if you want.

the gist has been updated with relevant information.

let me know if you have any issues.

jovial bloom
#

great

timber escarp
#

I've encountered the same issue with two Atom Echo devices and two VPE units.

The problem with using a "first to detect, first to respond" approach is that it doesn't necessarily reflect the user's intent. For example, a device that's farther away might detect the wake word first due to better network conditions or hardware performance—but that doesn't mean it's the one the user was speaking to.

A better way to handle wake word detection across multiple devices would be to measure the audio volume of the wake word on each device and prioritize the one with the highest level. This would more accurately represent which device the user was closest to when speaking.

It's true that different devices may record audio at different gain levels, but this can be easily corrected by applying a calibration offset to each device, allowing for consistent comparison and reducing false triggers from duplicates.
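
A minimal sketch of the calibrated-loudness idea, assuming each satellite could report a wake word level in dB. Nothing in this thread says ESPHome or HA exposes such a value today, so `Detection`, `level_db`, the device names and the offsets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    device: str
    level_db: float  # hypothetical: measured loudness of the wake word


# Hypothetical per-device calibration offsets in dB, found by speaking
# at the same distance from each device and comparing the readings.
CALIBRATION_DB = {"study_echo": -6.0, "living_room_pe": 0.0}


def pick_loudest(detections, calibration=CALIBRATION_DB):
    # Compare calibrated levels so a "hot" microphone does not always win.
    return max(detections, key=lambda d: d.level_db + calibration.get(d.device, 0.0))


detections = [Detection("study_echo", -20.0), Detection("living_room_pe", -24.0)]
winner = pick_loudest(detections)
print(winner.device)  # living_room_pe (-24.0 beats -20.0 - 6.0 = -26.0)
```

Without the offsets the study Echo would win on raw level, which is the kind of duplicate trigger the calibration is meant to correct.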

slim hedge
# timber escarp I've encountered the same issue with two Atom Echo devices and two VPE units. T...

Well, that would introduce another problem.
Currently the server registers a pipeline session for the first device that reports and discards the others.
With your approach the server would have to wait for some time every time it registers a wake word from a satellite, so that others could still come in within some period after it (how long, 0.5s?) with a louder wake word. That means the server will ALWAYS have additional delay before it even starts processing data.

Not to mention that every device has its own characteristics that can affect the loudness measurement, starting with the mic itself and the enclosure, and ending with the environment and the sound processing algorithm...
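
The trade-off can be sketched by putting the two policies side by side. This is an illustration only, not how HA is implemented: the function names, arrival times and levels are invented, and the 500ms window is the figure floated in the message above.

```python
def first_come(arrivals_ms):
    """First report wins; the decision is made as soon as it arrives."""
    device = min(arrivals_ms, key=arrivals_ms.get)
    return device, arrivals_ms[device]


def windowed_loudest(arrivals_ms, levels_db, window_ms=500):
    """Wait window_ms after the first report, then pick the loudest report."""
    deadline = min(arrivals_ms.values()) + window_ms
    candidates = [d for d, t in arrivals_ms.items() if t <= deadline]
    winner = max(candidates, key=levels_db.get)
    return winner, deadline


# Hypothetical arrival times (ms) and loudness (dB) for one wake word.
arrivals_ms = {"study_echo": 0, "living_room_pe": 300}
levels_db = {"study_echo": -30.0, "living_room_pe": -18.0}

print(first_come(arrivals_ms))                   # ('study_echo', 0)
print(windowed_loudest(arrivals_ms, levels_db))  # ('living_room_pe', 500)
```

The second value in each tuple is when the server can start the pipeline: the windowed policy picks the closer device here, but it pays the full window on every wake word, even when only one satellite reports.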

timber escarp
#

Indeed, implementing a delay of around 0.5 seconds to create a pool of wake word detections, in order to select the one with the highest volume, seems reasonable and necessary. Even so, I believe 0.5 seconds is negligible compared to the total activation, listening, processing, and response time. If I had to choose between a 0.5-second delay that ensures the correct satellite responds and no delay but potentially wasting several seconds canceling and repeating a command, with no guarantee that the desired satellite will activate, I would prefer the former.

It’s also true that we cannot fully control all aspects that influence the audio intensity. However, if we can have more control over identifying intensity than over the speed of wakeword detection, for example by adjusting each satellite’s intensity level with an offset or by leveraging physical location to influence the result—something the current approach doesn’t allow—then that added control becomes valuable.

Furthermore, it's obvious that if I’m standing next to a satellite, the chances that my voice registers with greater volume on that one than on another satellite five meters away are significantly higher.

#

Perhaps a better approach would be to implement a composite audio arbitration technique, similar to what Google may be using.

For example:

If a satellite detects the wakeword first and has a higher audio intensity than others, the confidence that it's the correct device is very high.

If a satellite detects the wakeword first, but there's a significant difference in audio intensity, it's reasonable to trust the one with the higher intensity.

If neither condition provides a clear winner, the wakeword detection confidence score can be used as a fallback.

This layered approach allows for more reliable arbitration in multi-device environments.
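
The three rules above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Detection` fields, the 6 dB threshold for a "significant" difference, and the availability of per-detection loudness and confidence values are not confirmed by anything in this thread, and this is not a description of Google's method.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    device: str
    arrival_ms: int    # when the report reached the server
    level_db: float    # hypothetical loudness of the wake word
    confidence: float  # hypothetical wake word model score, 0..1


def arbitrate(detections, significant_db=6.0):
    first = min(detections, key=lambda d: d.arrival_ms)
    loudest = max(detections, key=lambda d: d.level_db)
    # Rule 1: first to report and also the loudest, so very likely correct.
    if first is loudest:
        return first
    # Rule 2: another device is significantly louder, so trust loudness.
    if loudest.level_db - first.level_db >= significant_db:
        return loudest
    # Rule 3: no clear winner, so fall back to detection confidence.
    return max(detections, key=lambda d: d.confidence)


study = Detection("study_echo", 0, -30.0, 0.80)
living = Detection("living_room_pe", 200, -18.0, 0.90)
print(arbitrate([study, living]).device)  # living_room_pe (12 dB louder)
```

Like the simpler loudness approach, this still requires the server to collect reports for some window before deciding.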

There are some patents for this:
https://patents.google.com/patent/US10181323B2/en https://patents.google.com/patent/US20170076720A1/en

Would this be a problem?

wooden jacinth
# timber escarp Indeed, implementing a delay of around 0.5 seconds to create a pool of wakeword ...

0.5s is longer than you think. Maybe you will understand the advantage of it and interact with it correctly, but those of us who live with the "less technically initiated", and who are now 6 months into explaining every day that it's "okay nabu", not "hey nabu" or "hello nabu" or "NABU", and that screaming the wrong thing won't help... we will never be able to explain that a slight delay is needed to ensure the first word or two doesn't get cut off. The short delay for the wake sound on the PE already cuts off half a word sometimes, and that's with a noise that makes the delay obvious. The LLM manages most of the time, but it's not perfect.

Adding additional delay is never a good thing.