#I really do appreciate the work that you


willow basin
#

The reason why your Onju was 'working' in duplex mode with this config and IDF 4.4.6 was just the result of several hacks / misinterpretations. In duplex mode, the microphone and speaker audio streams use the same clock signals. Hence, the number of bits per channel needs to be the same in both directions. Even though these bits could be interpreted differently, the IDF-4.* I2S driver only allows one channel format to be set, which is then used for both directions. In the case of the Onju board, the microphones require 32 bits per sample. The MAX98357 also accepts 32-bit samples, so everything is fine so far. But most of the audio output signals, like TTS and music, are 16-bit samples.
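To make the shared-clock constraint concrete, here is a small sketch of the usual I2S bit-clock arithmetic (sample rate × bits per slot × number of slots). The 16 kHz sample rate is an assumption for illustration only; the thread does not state the rate the Onju config uses.

```python
def i2s_bit_clock_hz(sample_rate_hz: int, bits_per_slot: int, slots: int = 2) -> int:
    """Bit clock (BCLK) of a standard I2S frame: one slot per channel per sample period."""
    return sample_rate_hz * bits_per_slot * slots


SAMPLE_RATE_HZ = 16_000  # assumed value, not taken from the thread

bclk_32 = i2s_bit_clock_hz(SAMPLE_RATE_HZ, 32)  # what 32-bit microphones need
bclk_16 = i2s_bit_clock_hz(SAMPLE_RATE_HZ, 16)  # what 16-bit TTS/music would need
```

Since both directions are clocked by the same BCLK in duplex mode, only one of these two values can be in effect at a time, which is why the slot width has to match in both directions.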

#

Unfortunately, the re-sampler from the ADF SDK only supports 16-bit output, so it is no help either. So why did it work before? The input channel format is set to 'only_right' in your config. Hence, only the right microphone channel is read by the i2s driver, which outputs a 1-channel 32-bit audio stream that the microphone component handles just fine. As the channel format can't be set differently in the i2s driver for the other direction, it is also 'only_right' there, which makes the i2s driver in 4.4.6 expect a 1-channel 32-bit signal as its input. The re-sampler in your pipeline, on the other hand, reads the channel settings from your config, which is set to the default (left_right) for the output direction. So the re-sampler duplicates its input to two channels if it receives a mono signal. But as mentioned earlier, it supports only 16-bit output, so the input to the i2s driver is a 2-channel 16-bit audio stream, which gets misinterpreted as a 1-channel 32-bit audio stream and sent to both channels of the i2s stream. For mono signals this misinterpretation only increases the volume, which is why it seemed that everything was working fine.
With 4.4.7 however, the i2s driver expects a two channel signal for the 'only_right' format. (Don't ask me why, probably it is not meant to be used at all for this direction). Anyway, because of that, the misinterpretation from 2-channel 16bit to 1-channel 32-bit does not work anymore...
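The reinterpretation described above can be reproduced on a desktop with a few lines of Python. This is only a sketch of the byte-level effect, assuming little-endian sample layout; the helper names are made up for illustration and are not part of ESP-IDF or ADF.

```python
import struct


def mono16_to_stereo16(samples):
    """Duplicate each 16-bit mono sample into L and R (what the re-sampler is described as doing)."""
    out = bytearray()
    for s in samples:
        out += struct.pack("<hh", s, s)
    return bytes(out)


def read_as_mono32(raw):
    """Reinterpret the same bytes as 32-bit samples (little-endian assumed)."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw[: count * 4]))


mono = [1000, -2, 0, 32767]
as32 = read_as_mono32(mono16_to_stereo16(mono))

# The upper 16 bits of every 32-bit word are the original sample,
# the lower 16 bits are just a second copy of it.
high_words = [v >> 16 for v in as32]
```

Because the upper half of each 32-bit word is still the original sample, a duplicated mono signal keeps its waveform when it is read this way. A real stereo signal with different left and right content would not.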

#

So as you can see, with the current set of components it is not that easy to run the Onju in duplex mode. I guess the cleanest way of supporting it would be to extend the re-sampler to also support the conversion from 16bit to 32bit, but I haven't done it so far. So for now, you should either stay with 4.4.6 and live with the dirty hack, or you should try configuring the speaker and microphone in exclusive i2s mode.

The reason why duplex mode does work with e.g. the S3-Box3 and the M5Stack-Core-S3 is that you can configure their ADCs to 16bit per sample, so the i2s can be set to 16bit for both directions.

spiral wren
#

Awesome! Finally some light has been shed onto those many shadows. Thanks for the explanations.

wild root
#

Looks like we have some stuff to figure out and some work to do

vague agate
#

I haven't looked at this in a month or two so my memory is rusty, but would the i2s_write_expand function help? I thought that would convert 16-bit to 32-bit automatically. Or does using the adf framework mean you can't use it?
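For reference, the kind of widening being asked about can be sketched in plain Python: each 16-bit sample is moved into the upper half of a 32-bit word and the lower bits are zero. This only illustrates the idea; it is not the driver's code, and the assumption that the padding goes into the low-order bits should be checked against the ESP-IDF documentation for `i2s_write_expand`.

```python
import struct


def expand_16_to_32(pcm16: bytes) -> bytes:
    """Widen little-endian 16-bit PCM to 32-bit PCM by zero-padding the low 16 bits."""
    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    return struct.pack(f"<{count}i", *(s << 16 for s in samples))


pcm16 = struct.pack("<3h", 1000, -2, 0)
pcm32 = expand_16_to_32(pcm16)
```

The output is twice as long as the input and keeps the same relative loudness, which is what a 32-bit I2S slot expects from 16-bit TTS or music data.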

willow basin
#

thanks for pointing at it, I will check it

willow basin
#

Thanks again Kevin for reminding me about this option. At least with my esp32-s3-n16r8 dev-board, duplex now works with 32bit. I committed the "fix" to the dev-next branch. Maybe someone with an Onju could check whether this fix helps with that board as well.

mortal rain
#

I sure will!

mortal rain
#

I'm still getting no audio with the following changes:
dev-next branch

  - source:
      type: git
      url: https://github.com/gnumpi/esphome_audio
      ref: dev-next
    components: [ adf_pipeline, i2s_audio ]
    refresh: 0s

"recommended" version of esp-idf:

  framework:
    type: esp-idf
    version: recommended

and both with and without forcing channel: right for the output
Is there a dependency on a particular ESPHome version? I'm currently on 2024.5.2 because 2024.5.5 caused the microphone audio from my Onju to be completely unrecognizable

willow basin
#

can you try setting both to channel: left ?

mortal rain
#

it works! it works!

#

It also seems to speak the response to commands faster now. Previously, it appeared to get stuck after on_tts_end (speaking) for several seconds (rapidly flashing green LEDs) and now it fairly quickly transitions to listening and shortly thereafter speaks the response. It's a bit of a weird sequence now, but it seems faster. My wakeword seemed to be working reliably over several iterations, too, but I'm running ESPHome 2024.5.2 right now

wild root
#

I don't think you needed it, but I can also confirm this works

mortal rain
#

The last thing that continues to vex me is that often with perfectly clear and loud audio input, HA decides to continue to listen for the full 15s before doing anything. It seems almost random when it detects the end of speech and when it just times out after 15s, and then executes the command as expected

#

I see this reported frequently by other folks, and I certainly see it often

wild root
#

i guess you've tried tweaking the "Finished speaking detection" setting

mortal rain
#

yeah, it's all "aggressive"

#

I'm looking in the code and I see that HA is explicitly disabling noise suppression

#
class WebRtcVad(VoiceActivityDetector):
    """Voice activity detector based on webrtc."""

    def __init__(self) -> None:
        """Initialize webrtcvad."""
        # Delay import of webrtc so HA start up is not crashing
        # on older architectures (armhf).
        #
        # pylint: disable=import-outside-toplevel
        from webrtc_noise_gain import AudioProcessor

        # Just VAD: no noise suppression or auto gain
        self._audio_processor = AudioProcessor(0, 0)

    def is_speech(self, chunk: bytes) -> bool:
        """Return True if audio chunk contains speech."""
        result = self._audio_processor.Process10ms(chunk)
        return cast(bool, result.is_speech)
#

does that audio sample sound ambiguous in any way to you?

#

I was thinking that maybe the noise suppression level should be passed in there

#

I don't think the problem is the amount of silence (1.0s vs. 2.0s vs. 0.5s), but that it's not detecting silence very well.
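A rough sketch of why the silence setting matters less than the quality of the per-chunk VAD decisions: the silence counter resets on every chunk flagged as speech, so a few false positives during real silence are enough to run into the timeout. This is not Home Assistant's implementation; the function and parameter names are hypothetical, and 10 ms chunks are assumed.

```python
def find_command_end(speech_flags, chunk_ms=10, silence_ms=1000, timeout_ms=15000):
    """Return (elapsed_ms, reason) for a stream of per-chunk VAD decisions."""
    heard_speech = False
    silence = 0
    elapsed = 0
    for is_speech in speech_flags:
        elapsed += chunk_ms
        if is_speech:
            heard_speech = True
            silence = 0  # any "speech" chunk restarts the silence countdown
        elif heard_speech:
            silence += chunk_ms
            if silence >= silence_ms:
                return elapsed, "silence"
        if elapsed >= timeout_ms:
            return elapsed, "timeout"
    return elapsed, "stream ended"


# 2 s of speech followed by 20 s of real silence, classified correctly.
clean = [True] * 200 + [False] * 2000
# Same audio, but the VAD wrongly flags one chunk every 0.5 s as speech.
noisy = [True] * 200 + [(i % 50 == 0) for i in range(2000)]

clean_result = find_command_end(clean)
noisy_result = find_command_end(noisy)
```

With the clean flags the command ends one second after the speech stops. With the noisy flags the silence counter never reaches one second, so only the 15 s timeout ends the command, regardless of whether the silence setting is 0.5 s, 1.0 s or 2.0 s, as long as the false positives arrive more often than that.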

wild root
#

I'll look into any ALC functionality that the DAC may have

mortal rain
#

I'm just considering the difference between the speaking portion and the silence and it's pretty obvious to me which is which

#

whether it's based on level or some smarter algo, it's hard to see how that 10+s of silence could be interpreted as speech, even if it doesn't seem completely silent

wild root
#

The 15s is just the timeout

#

In case no voice is detected

mortal rain
#

in the end, it processes the stream properly and executes the command

#

so it manages to do accurate STT on it

wild root
#

Yup. Just not VAD

#

Which is super annoying

#

I bet it works better if you speak up, though. From the same distance, i mean

mortal rain
#

yes, or if I get closer

#

that's with my Box-3, but I see the same with my Onju. From right up close, it's fast. From across the room, it listens for 15s and then works fine

#

<looking deep into the bowels of webrtc-noise-gain now...>

#

it gets quite mathy in there

#

in all fairness, the STT algo that I'm using is NC cloud, so industrial strength cloud power

#

but I feel like a "simple" level differential check would be more effective than what we currently have for VAD

#

some louder stuff followed by some much less loud stuff == done
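The "louder stuff followed by much less loud stuff" idea could look roughly like the sketch below: track the peak RMS level seen so far and call the command finished once the level stays well below that peak for a while. All thresholds and names are made-up illustration values, 16-bit little-endian mono audio in 10 ms chunks is assumed, and this says nothing about how well such a check would hold up against real room noise.

```python
import math
import struct


def rms(chunk: bytes) -> float:
    """Root-mean-square level of a chunk of 16-bit little-endian PCM."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def level_drop_end(chunks, drop_ratio=0.25, quiet_chunks=50, min_speech_rms=500.0):
    """Return the index of the chunk where the command is considered done, or None."""
    peak = 0.0
    quiet = 0
    for i, chunk in enumerate(chunks):
        level = rms(chunk)
        peak = max(peak, level)
        if peak < min_speech_rms:
            continue  # nothing loud enough to count as speech yet
        if level < peak * drop_ratio:
            quiet += 1
            if quiet >= quiet_chunks:
                return i
        else:
            quiet = 0
    return None


def make_chunk(amplitude: int, n: int = 160) -> bytes:
    """Synthetic 10 ms chunk at 16 kHz: a square wave of the given amplitude."""
    return struct.pack(f"<{n}h", *[amplitude if k % 2 == 0 else -amplitude for k in range(n)])


# 1 s of loud "speech" followed by 1 s of low-level room noise.
chunks = [make_chunk(8000)] * 100 + [make_chunk(300)] * 100
end_index = level_drop_end(chunks)
```

In this synthetic case the detector fires half a second (50 chunks) after the level drops, even though the "silence" is not digitally silent.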

mortal rain
#

there are different modes for VAD that set different levels of aggressiveness

#

default is "0", but there it goes up to "3" (very aggressive)

#
// Set aggressiveness mode
int WebRtcVad_set_mode_core(VadInstT* self, int mode) {
  int return_value = 0;
wild root
#

I'm not writing another word on this topic until Kevin is finished typing 😅

vague agate
#

Unfortunately, webrtc_vad isn't very good overall, so tweaking settings can't make up for it. My rough benchmarks with the VAD model for microWakeWord show it is already better than baseline webrtc_vad, and I really handicapped the model to make it as fast as possible.

mortal rain
#

as I sorta mentioned earlier, microwakeword works awesome

vague agate
#

I need to work with Mike to get a custom trained VAD going for detecting this exact thing.

mortal rain
#

very few false positives and it always responds when it's listening

#

it's a crapshoot what happens after it detects the wakeword, though 🙂

vague agate
#

I start working for Nabu Casa on the 17th, so I'm hoping to work on getting a better VAD model up and running within the Wyoming pipeline.

mortal rain
#

dude!

#

I expect great things 🙂

wild root
#

Thank the flying spaghetti monster!

vague agate
#

The biggest problem is running the preprocessor model in streaming mode. The Python libraries only work for full audio at a time, and the Python tflite micro interpreter isn't very portable (it doesn't work on Apple chips for example)

#

I have C++ code that generates the features, but I haven't looked into packaging that in an easy to use way.

#

I think the easiest would be to use a different preprocessor for this purpose instead, probably just MFCCs. This requires a different model to be trained, but it would be a lot more portable.
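As a small illustration of what "streaming mode" means for a feature preprocessor: audio arrives in arbitrary pieces, and overlapping analysis windows have to be cut out as soon as enough samples are buffered, instead of needing the full recording up front. The class below is a generic sketch with hypothetical names; the default window and step sizes (30 ms and 10 ms at 16 kHz) are assumptions, not values taken from microWakeWord, and the actual feature computation (MFCCs or otherwise) is left out.

```python
class StreamingFramer:
    """Cut overlapping analysis windows out of audio that arrives piece by piece."""

    def __init__(self, window_samples: int = 480, step_samples: int = 160):
        self.window = window_samples
        self.step = step_samples
        self._buffer = []

    def push(self, samples):
        """Add new samples and return every window that is now complete."""
        self._buffer.extend(samples)
        frames = []
        while len(self._buffer) >= self.window:
            frames.append(self._buffer[: self.window])
            del self._buffer[: self.step]  # slide forward, keep the overlap
        return frames


# Feeding the audio in two pieces gives the same windows as feeding it at once.
streaming = StreamingFramer(window_samples=4, step_samples=2)
streamed_frames = streaming.push([0, 1, 2]) + streaming.push([3, 4, 5])

batch = StreamingFramer(window_samples=4, step_samples=2)
batch_frames = batch.push([0, 1, 2, 3, 4, 5])
```

Each returned window would then be handed to whatever feature extraction the model was trained with.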

#

I hope I can make great things! It will be nice to work on this stuff full-time

wild root
#

Since i understand only about a third of what you just said, i will conclude that the best news here is that somebody is looking into this

vague agate
#

It's on the list of things I want to do, that's for sure! We haven't discussed the priorities of the things to work on, so I don't know how soon it will be.

mortal rain
#

for now, I can deal with walking up to my Assist devices or waiting for them to timeout as long as they eventually do the thing I asked

#

what takes them out of the running is when they simply stop listening at all or stop speaking at all

#

I had gotten into a routine of walking up to my Onju, telling it to do something, having it ignore me, unplugging it and plugging it back in, and finally just talking to the Alexa device next to it 🙂

#

I also might have cursed at it

vague agate
#

Well if wake word detection is still okay, add a second wake word with my PR. In the action yaml for ESPHome, you can detect the other wake word and use it to reset the device instead of starting the pipeline.

mortal rain
#

ha, that's a thought 🙂

vague agate
#

I should just make a curse word model for this purpose...

mortal rain
#

exactly

#

no need to change your behavior at all

#

It just works

vague agate
#

Depending on what I'm doing, that may cause a lot of extra unnecessary resets...

wild root