#I really do appreciate the work that you
The reason why your Onju was 'working' in duplex mode with this config and IDF 4.4.6 was just the result of several hacks / misinterpretations. In duplex mode, the microphone and speaker audio streams use the same clock signals. Hence, the number of bits per channel needs to be the same in both directions. Even though these bits could be interpreted differently, the IDF-4.* I2S driver only allows setting one channel format, which is used for both directions. In the case of the Onju board the microphones require 32 bits per sample. The MAX98357 also accepts 32-bit samples, so everything is fine so far. But most of the audio output signals like TTS and music are 16-bit samples.
Unfortunately, the re-sampler used from the ADF SDK only supports 16-bit outputs, so it is of no help either. So, why did it work anyway before? The input channel format is set to 'only_right' in your config. Hence, only the right microphone channel is read by the i2s driver, which outputs a 1-channel 32-bit audio stream that the microphone component can handle just perfectly. As the channel format can't be set differently in the i2s driver for the other direction, it is also 'only_right' there, which makes the i2s driver in 4.4.6 expect a 1-channel 32-bit signal as its input. The re-sampler in your pipeline, on the other hand, does read the channel settings from your config, which is set to default (left_right) for the output direction. So the re-sampler duplicates its input to two channels if it receives a mono signal at its input. But as mentioned earlier it supports only 16-bit outputs, so the input to the i2s driver is a 2-channel 16-bit audio stream which gets misinterpreted as a 1-channel 32-bit audio stream and sent to both channels of the i2s stream. For mono signals this misinterpretation only increases the volume; that's why it seemed that everything was just working fine.
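The reinterpretation described above can be pictured with a few lines of Python. This is only a rough sketch, not code from the driver or the ADF pipeline: the function names are made up for illustration, and it assumes little-endian sample packing. It duplicates mono 16-bit samples to left/right and then reads the same bytes back as 32-bit words.

```python
import struct


def duplicate_to_stereo16(mono):
    """Interleave each mono sample as L,R and pack as little-endian int16."""
    stereo = []
    for sample in mono:
        stereo.extend((sample, sample))
    return struct.pack(f"<{len(stereo)}h", *stereo)


def reinterpret_as_mono32(raw):
    """Read the same bytes as little-endian int32 words (hypothetical helper)."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw[: count * 4]))


mono = [1000, -1000, 0]
words = reinterpret_as_mono32(duplicate_to_stereo16(mono))

# The upper 16 bits of each 32-bit word are the original sample again,
# so the waveform survives, just at (roughly) full 32-bit scale.
upper_halves = [w >> 16 for w in words]
```

Because both halves of every 32-bit word carry the same sample, the mix-up keeps the shape of a mono signal, which fits the observation that it only sounded louder.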
With 4.4.7 however, the i2s driver expects a two channel signal for the 'only_right' format. (Don't ask me why, probably it is not meant to be used at all for this direction). Anyway, because of that, the misinterpretation from 2-channel 16bit to 1-channel 32-bit does not work anymore...
So as you can see, with the current set of components it is not that easy to run the Onju in duplex mode. I guess the cleanest way of supporting it would be to extend the re-sampler to also support the conversion from 16bit to 32bit, but I haven't done it so far. So for now, you should either stay with 4.4.6 and live with the dirty hack, or you should try configuring the speaker and microphone in exclusive i2s mode.
The reason why duplex mode does work with e.g. the S3-Box3 and the M5Stack-Core-S3 is that you can configure their ADCs to 16bit per sample, so the i2s can be set to 16bit for both directions.
Awesome! Finally some light has been shed onto those many shadows. Thanks for the explanations.
Looks like we have some stuff to figure out and some work to do
I haven't looked at this in a month or two so my memory is rusty, but would the i2s_write_expand function help? I thought that would convert 16-bit to 32-bit automatically. Or does using the adf framework mean you can't use it?
thanks for pointing at it, I will check it
Thanks again Kevin for reminding me about this option. At least with my esp32-s3-n16r8 dev-board duplex now works with 32bit. I committed the "fix" to the dev-next branch. Maybe someone with an Onju could check whether this fix helps with that board as well.
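For anyone wondering what a 16-bit to 32-bit expansion amounts to, here is a conceptual sketch in Python. It is not the ESP-IDF implementation of i2s_write_expand and not the code committed to dev-next; the function name is hypothetical, and it assumes little-endian samples with the 16-bit value placed in the upper half of the 32-bit word and zeros below.

```python
import struct


def expand_16_to_32(pcm16: bytes) -> bytes:
    """Widen little-endian int16 PCM to int32 by shifting each sample up 16 bits."""
    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    # Shifting left keeps the sign and leaves the low 16 bits at zero.
    return struct.pack(f"<{count}i", *(s << 16 for s in samples))


pcm16 = struct.pack("<4h", 1, -1, 32767, -32768)
pcm32 = expand_16_to_32(pcm16)
expanded = list(struct.unpack("<4i", pcm32))
```

The output buffer is twice the size of the input, which is why both directions can then share one 32-bit channel format.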
I sure will!
I'm still getting no audio with the following changes:
dev-next branch
  - source:
      type: git
      url: https://github.com/gnumpi/esphome_audio
      ref: dev-next
    components: [ adf_pipeline, i2s_audio ]
    refresh: 0s
"recommended" version of esp-idf:
  framework:
    type: esp-idf
    version: recommended
and both with and without forcing channel: right for the output
Is there a dependency on a particular ESPHome version? I'm currently on 2024.5.2 because 2024.5.5 caused the microphone audio from my Onju to be completely unrecognizable
Here are the logs from boot, if they help: https://dpaste.org/NTnBR
can you try setting both to channel: left?
it works! it works!
It also seems to speak the response to commands faster now. Previously, it appeared to get stuck after on_tts_end (speaking) for several seconds (rapidly flashing green LEDs) and now it fairly quickly transitions to listening and shortly thereafter speaks the response. It's a bit of a weird sequence now, but it seems faster. My wakeword seemed to be working reliably over several iterations, too, but I'm running ESPHome 2024.5.2 right now
here's my entire config if anyone is interested: https://dpaste.org/9qZND
I don't think you needed it, but I can also confirm this works
The last thing that continues to vex me is that often with perfectly clear and loud audio input, HA decides to continue to listen for the full 15s before doing anything. It seems almost random when it detects the end of speech and when it just times out after 15s, and then executes the command as expected
Exhibit A - HA listened for a full 15s, most of which is dead silence, and then did what I asked
I see this reported frequently by other folks, and I certainly see it often
i guess you've tried tweaking the "Finished speaking detection" setting
yeah, it's all "aggressive"
I'm looking in the code and I see that HA is explicitly disabling noise suppression
class WebRtcVad(VoiceActivityDetector):
    """Voice activity detector based on webrtc."""

    def __init__(self) -> None:
        """Initialize webrtcvad."""
        # Delay import of webrtc so HA start up is not crashing
        # on older architectures (armhf).
        #
        # pylint: disable=import-outside-toplevel
        from webrtc_noise_gain import AudioProcessor

        # Just VAD: no noise suppression or auto gain
        self._audio_processor = AudioProcessor(0, 0)

    def is_speech(self, chunk: bytes) -> bool:
        """Return True if audio chunk contains speech."""
        result = self._audio_processor.Process10ms(chunk)
        return cast(bool, result.is_speech)
does that audio sample sound ambiguous in any way to you?
I was thinking that maybe the noise suppression level should be passed in there
I don't think the problem is the amount of silence (1.0s vs. 2.0s vs. 0.5s), but that it's not detecting silence very well.
noise suppression set to "4" still listens for the full 15s with this:
This is a bit quiet, but definitely not noisy, so I don't think silence detection is the issue
I'll look into any ALC functionality that the DAC may have
I'm just considering the difference between the speaking portion and the silence and it's pretty obvious to me which is which
whether it's based on level or some smarter algo, it's hard to see how that 10+s of silence could be interpreted as speech, even if it doesn't seem completely silent
in the end, it processes the stream properly and executes the command
so it manages to do accurate STT on it
Yup. Just not VAD
Which is super annoying
I bet it works better if you speak up, though. From the same distance, i mean
yes, or if I get closer
that's with my Box-3, but I see the same with my Onju. From right up close, it's fast. From across the room, it listens for 15s and then works fine
<looking deep into the bowels of webrtc-noise-gain now...>
it gets quite mathy in there
in all fairness, the STT algo that I'm using is NC cloud, so industrial strength cloud power
but I feel like a "simple" level differential check would be more effective than what we currently have for VAD
some louder stuff followed by some much less loud stuff == done
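A rough sketch of that "louder stuff, then much quieter stuff" idea, purely for illustration. This is not how HA or webrtc-noise-gain does it; the function names and every threshold (drop_ratio, min_speech_rms, quiet_chunks) are made-up assumptions, and it expects 16-bit little-endian mono audio already split into 10 ms chunks.

```python
import math
import struct


def rms(chunk: bytes) -> float:
    """Root-mean-square level of a little-endian int16 PCM chunk."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def find_end_of_speech(chunks, drop_ratio=0.25, min_speech_rms=500.0, quiet_chunks=50):
    """Return the chunk index where the command looks finished, or None.

    "Finished" here means: something loud was heard, and the level then
    stayed below drop_ratio * peak for quiet_chunks consecutive chunks.
    """
    peak = 0.0
    quiet_run = 0
    for index, chunk in enumerate(chunks):
        level = rms(chunk)
        peak = max(peak, level)
        if peak < min_speech_rms:
            continue  # nothing loud enough to count as speech yet
        if level < peak * drop_ratio:
            quiet_run += 1
            if quiet_run >= quiet_chunks:
                return index
        else:
            quiet_run = 0  # still talking, start counting again
    return None


# 10 ms at 16 kHz is 160 samples: 20 loud chunks, then 100 quiet ones.
loud = struct.pack("<160h", *([8000] * 160))
quiet = struct.pack("<160h", *([100] * 160))
end_index = find_end_of_speech([loud] * 20 + [quiet] * 100)
```

With the defaults, 50 quiet chunks is about half a second of relative silence. A real detector would also need to cope with background noise that changes over time, which this sketch ignores.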
I'm 94.8% sure that is how we (used to) do it, but apparently the threshold is too large
there are different modes for VAD that set different levels of aggressiveness
default is "0", but it goes up to "3" (very aggressive)
// Set aggressiveness mode
int WebRtcVad_set_mode_core(VadInstT* self, int mode) {
int return_value = 0;
I'm not writing another word on this topic until Kevin is finished typing
Unfortunately, webrtc_vad isn't very good overall, so tweaking settings can't make up for it. My rough benchmarks with the VAD model for microWakeWord show it is already better than baseline webrtc_vad, and I really handicapped the model to make it as fast as possible.
as I sorta mentioned earlier, microwakeword works awesome
I need to work with Mike to get a custom trained VAD going for detecting this exact thing.
very few false positives and it always responds when it's listening
it's a crapshoot what happens after it detects the wakeword, though
Any way to port that model to be used inside of HA as well?
I start working for Nabu Casa on the 17th, so I'm hoping to work on getting a better VAD model up and running within the wyoming pipeline.
Thank the flying spaghetti monster!
The biggest problem is running the preprocessor model in streaming mode. The Python libraries only work for full audio at a time, and the Python tflite micro interpreter isn't very portable (it doesn't work on Apple chips for example)
I have C++ code that generates the features, but I haven't looked into packaging that in an easy to use way.
I think the easiest would be to use a different preprocessor for this purpose instead, probably just MFCCs. This requires a different model to be trained, but it would be a lot more portable.
I hope I can make great things! It will be nice to work on this stuff full-time
Since i understand only about a third of what you just said, i will conclude that the best news here is that somebody is looking into this
It's on the list of things I want to do, that's for sure! We haven't discussed the priorities of the things to work on, so I don't know how soon it will be.
for now, I can deal with walking up to my Assist devices or waiting for them to timeout as long as they eventually do the thing I asked
what takes them out of the running is when they simply stop listening at all or stop speaking at all
I had gotten into a routine of walking up to my Onju, telling it to do something, having it ignore me, unplugging it and plugging it back in, and finally just talking to the Alexa device next to it
I also might have cursed at it
Well if wake word detection is still okay, add a second wake word with my PR. In the action yaml for ESPHome, you can detect the other wake word and use it to reset the device instead of starting the pipeline.
ha, that's a thought
I should just make a curse word model for this purpose...
Depending on what I'm doing, that may cause a lot of extra unnecessary resets...
Okay, Nabu! FU!
Restarting...