#M5Stack Atom Echo Media Player + Wake Word

peak pollen
#

Updated esp_adf setting that I added trying to mimic too much.

sturdy mortar
peak pollen
#

The issue I am currently running into is that I can TTS to it, but it doesn't play the response to my commands through its speaker.

#
```
[D][voice_assistant:428]: Desired state set to AWAITING_RESPONSE
[D][voice_assistant:422]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[D][voice_assistant:422]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
...
[D][voice_assistant:557]: Speech recognised as: " What's my name?"
[D][voice_assistant:529]: Event Type: 5
[D][voice_assistant:562]: Intent started
...
...
[D][voice_assistant:585]: Response: "Sorry, I couldn't understand that"
[D][voice_assistant:529]: Event Type: 8
[D][voice_assistant:605]: Response URL: "http://192.168.107.5:8123/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_e6f38331f3_tts.piper.raw"
[D][voice_assistant:422]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[D][voice_assistant:428]: Desired state set to STREAMING_RESPONSE
[D][media_player:059]: 'M5Stack Atom Echo b83488' - Setting
[D][media_player:066]:   Media URL: http://192.168.107.5:8123/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_e6f38331f3_tts.piper.raw
[D][media_player:059]: 'M5Stack Atom Echo b83488' - Setting
[D][media_player:066]:   Media URL: http://192.168.107.5:8123/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_e6f38331f3_tts.piper.raw
[D][light:036]: 'M5Stack Atom Echo b83488' Setting:
[D][light:059]:   Red: 20%, Green: 100%, Blue: 0%
[W][component:214]: Component voice_assistant took a long time for an operation (0.06 s).
[W][component:215]: Components should block for at most 20-30ms.
[D][voice_assistant:529]: Event Type: 2
[D][voice_assistant:619]: Assist Pipeline ended
[W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.54 s).
[W][component:215]: Components should block for at most 20-30ms.
[W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.47 s).
[W][component:215]: Components should block for at most 20-30ms.
```
#

Removed lines are "Event Type" logs

sturdy mortar
#

good news and bad news with that issue:

peak pollen
#

Give it to me straight doc

sturdy mortar
#

the bad news is that Piper is not compatible with the media_player with HA 2023.11 and older versions

#

the good news is that Piper is compatible with the media_player with HA 2023.12, which has launched in beta today

peak pollen
#

Oh

#

So that's why it worked earlier. I was using a different pipeline

#

That's so weird though because I can use
```
service: tts.speak
data:
  media_player_entity_id: media_player.m5stack_atom_echo_b83488_m5stack_atom_echo_b83488
  message: media
target:
  entity_id: tts.piper
```
just fine
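
For anyone wanting to trigger the same `tts.speak` call from outside the HA UI, here is a rough Python sketch that builds the equivalent request for the Home Assistant REST API. The host is the one from the log; the token is a placeholder, and the helper name `build_tts_speak_request` is made up for illustration. It only constructs the request and does not send it.

```python
import json
import urllib.request

HA_URL = "http://192.168.107.5:8123"      # host as it appears in the log
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # placeholder, not a real token


def build_tts_speak_request(base_url, token, tts_entity, player_entity, message):
    """Build (but do not send) a POST to the tts.speak service endpoint."""
    # Over REST, the target entity goes into the JSON body as "entity_id".
    payload = {
        "entity_id": tts_entity,
        "media_player_entity_id": player_entity,
        "message": message,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/services/tts/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_tts_speak_request(
    HA_URL,
    TOKEN,
    tts_entity="tts.piper",
    player_entity="media_player.m5stack_atom_echo_b83488_m5stack_atom_echo_b83488",
    message="media",
)
# To actually send it: urllib.request.urlopen(request)
```

With a valid long-lived access token in place, passing `request` to `urllib.request.urlopen` should have the same effect as the service call above.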
#

I see, it works now with HA Cloud TTS pipeline

#

I am doing this all so I can set a timer... Which is that other thread and won't really prove useful because the script is killed when the next intent is run anyways...

#

I'll figure that out though.

Thank you @sturdy mortar, get some rest haha

sturdy mortar
peak pollen
sturdy mortar
#

Do you mean the 5 second loop of restarting listening when wake word is on? That's intended behavior

peak pollen
#

Ah, that makes sense....

sturdy mortar
#

No, it doesn't, at least in my book, but it is what it is. I.e. intended 😋

#

What would make more sense than just chopping up audio at random (albeit consistently random) 5s intervals would be to have proper VAD and just send the recording buffer contents after voice activity is no longer detected
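
To make the idea concrete, here is a very rough Python sketch of that approach: buffer audio once speech starts and hand back the buffer after a run of quiet frames. This is not how ESPHome or HA actually do it. It uses a naive energy threshold as a stand-in for real VAD, and the function names, `threshold`, and `hangover_frames` values are all assumptions picked for illustration.

```python
from typing import Iterable, List, Optional


def rms(frame: List[int]) -> float:
    """Root-mean-square level of one frame of PCM samples."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def capture_utterance(
    frames: Iterable[List[int]],
    threshold: float = 500.0,   # assumed energy threshold, needs tuning
    hangover_frames: int = 3,   # quiet frames required to end the utterance
) -> Optional[List[int]]:
    """Return buffered samples from first loud frame until silence persists."""
    buffer: List[int] = []
    silent_run = 0
    speaking = False
    for frame in frames:
        loud = rms(frame) >= threshold
        if not speaking:
            if loud:
                # Voice activity started: begin buffering.
                speaking = True
                buffer.extend(frame)
            continue
        buffer.extend(frame)
        if loud:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= hangover_frames:
                # Voice activity ended: this is what would be sent on.
                return buffer
    # Stream ended: return what was captured, or None if nobody spoke.
    return buffer if speaking else None


silence = [0] * 4
speech = [1000] * 4
utterance = capture_utterance(
    [silence, speech, speech, silence, silence, silence, speech]
)
```

In the toy input, the buffer covers the two speech frames plus the three trailing quiet frames, and the later speech frame is left for the next utterance rather than being cut at a fixed 5s mark.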

peak pollen
#

I figured it would be a continuous audio stream to HA, since the satellites generally won't be powerful enough for voice detection on their own.

#

I know most smart speakers do some form of local audio processing before calling home, but those units are probably more complex than an ESP or similar.

#

A Pi Zero would probably be capable though.

I am nowhere near savvy enough in the development space, but does that mean 5 second audio streams are repeatedly sent to HA for processing?