Satellites | Home Assistant | Page 1

mint glade Feb 17, 2023, 8:25 PM

#

Starting a thread to discuss satellite hardware.

#

@mortal forge Precise works OK, but it's a recurrent net (GRU) and not quantized (8-bit weights) so not yet working on an NPU.
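
For context, here is a minimal sketch of what "quantized (8-bit weights)" means, assuming simple symmetric per-tensor scaling. The function names are illustrative only; this is not how Precise or any Espressif toolchain actually quantizes a model.

```python
def quantize_int8(weights):
    """Map float weights to signed 8-bit integers with one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # each value is within scale / 2 of the original
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.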

#

Google has a number of quantized CNN models for training here: https://github.com/google-research/google-research/tree/master/kws_streaming

#

They use TensorFlow, so not sure how easily they could be ported to the ESP32-S3 NPU.

mortal forge Feb 17, 2023, 8:33 PM

#

All neural networks that I'm aware of use tensors; it all depends on whether they have a lib to do HW offloading, like with GPU/TPU/CUDA, etc. The ESP32 has built-in NNI support and Espressif provides the libs. We need to see if anyone has done the integration of torchaudio with the ESP32-S3 NNI hardware.

mint glade Feb 17, 2023, 8:36 PM

#

Tensorflow Lite Micro is supported on the ESP32, which I believe includes the mel spectrogram functions we'd need.
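
As a rough illustration of the "mel" part of a mel spectrogram, here is a sketch that spaces filter center frequencies evenly on the mel scale, using the common 2595 * log10(1 + f / 700) formula. The function names and the 40-band, 0-8000 Hz parameters are assumptions for the example, not TensorFlow Lite Micro's API or settings.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (common log10-based formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Return n_mels center frequencies (Hz) evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Assumed example values: 40 bands covering 0 Hz to 8 kHz (Nyquist for 16 kHz audio).
centers = mel_center_frequencies(40, 0.0, 8000.0)
```

The centers bunch together at low frequencies and spread out at high ones, which is why mel features suit speech.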

#

https://www.tensorflow.org/lite/microcontrollers

mortal forge Feb 17, 2023, 8:38 PM

#

Do you know what backend libraries torchaudio uses? I.e., like PyTorch uses OpenCV

#

I don't want to write this all in python and find out it's not performant enough. Would rather look at doing it in C++ from the start

#

Looks like the decision was already made for us. "TensorFlow Lite for Microcontrollers is written in C++"

mint glade Feb 17, 2023, 8:41 PM

#

Yeah, it's a shame it's not Rust but you gotta take what's there 😄

candid meadow Feb 17, 2023, 8:41 PM

#

it's all going to be compiled languages for microcontrollers

#

if you want to do anything serious that is 🙂

mortal forge Feb 17, 2023, 8:42 PM

#

agreed.. 100%

candid meadow Feb 17, 2023, 8:42 PM

#

For the S3 -> the biggest problem is that it looks from their docs that Espressif needs to train your models

mortal forge Feb 17, 2023, 8:42 PM

#

We will need to put the effort in up front, but the rewards will be worth it. cheap hw that is very performant

candid meadow Feb 17, 2023, 8:42 PM

#

right, that's the goal

mortal forge Feb 17, 2023, 8:43 PM

#

You can't train the models on box. You have to train it off box

#

Model is a model. doesn't matter where you train it. That's the point.

mint glade Feb 17, 2023, 8:43 PM

#

Espressif's audio framework (ADF) is integrated with their custom model toolchain, though.

#

If we can bypass that, it would be awesome.

candid meadow Feb 17, 2023, 8:43 PM

#

Sure, but if you want to use Espressif WakeNet it's all custom https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/ESP_Wake_Words_Customization.html

mortal forge Feb 17, 2023, 8:44 PM

#

Seriously? That must be one of their closed blobs

mint glade Feb 17, 2023, 8:44 PM

#

If we can get a Tensorflow Micro model trained, exported, and integrated into the ADF framework we'll be sitting good. And then Espressif will deprecate the S3 😄

candid meadow Feb 17, 2023, 8:45 PM

#

I guess one can run other wake engines on ESP32

#

it's not clear if WakeNet is using some secret instruction set in their chips

mortal forge Feb 17, 2023, 8:45 PM

#

Yeah, I was just thinking that. That's the whole point, we would end up creating our own engine and using their HW acceleration

#

While it would be nice to leverage what they have already done, I was more interested in the HW capabilities of the platform than the libs they provide. I'm going to have a hard time convincing my wife to move from "Alexa" to "Nihao Xiaoxin"

#

Though they do have some ok ones in there already.. May be worth looking at

#

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/README.html#esp-open-wake-word

mint glade Feb 17, 2023, 8:49 PM

#

"Hi, ESP" works on the newer boards (S3 based).

mortal forge Feb 17, 2023, 8:49 PM

#

So does "Alexa" surprisingly

mint glade Feb 17, 2023, 8:49 PM

#

As Stuart said in the Github thread, it would be great if we could convince one audio engineer/programmer to implement some open source algorithms for us.

mortal forge Feb 17, 2023, 8:53 PM

#

I'm going to put the Product Manager hat on for a second and think about actually delivering functionality. I'm half tempted to suggest we leverage what Espressif has done already, as it works and is already done. The wake words aren't as horrible as I poked fun at and could be used out of the box.

That said we could go back at any time and replace that functionality with our own module, when we have the luxury of time.

Thoughts?

#

I would almost rather want to put effort into a cheap HW platform and getting that working.

From a HW perspective, I don't think I would even worry about adding a speaker / amplifier to it initially.

#

It would give us the ability to start capturing audio from a standard platform / source

mint glade Feb 17, 2023, 8:56 PM

#

The Korvo-2 board is a good starting place: https://www.digikey.com/en/products/detail/espressif-systems/ESP32-S3-KORVO-2/15822448

#

I've successfully tested the wake word and audio playback. It's only a few more steps to add the UDP audio streaming.

mortal forge Feb 17, 2023, 8:57 PM

#

That's awesome! Is it in python or C?

mint glade Feb 17, 2023, 8:58 PM

#

It's in C using their framework: https://docs.espressif.com/projects/esp-adf/en/latest/design-guide/dev-boards/user-guide-esp32-s3-korvo-2.html

#

(my Korvo doesn't have a screen, but same idea)

#

You basically install their toolchain (based on CMake), set some options, compile, and flash the board.

mortal forge Feb 17, 2023, 9:01 PM

#

Just making sure. If they closed their blob off, it could be compiled in C and still be used by a Python lib they offer. C works just as well

#

Looking at the hw specs of the Korvo

mint glade Feb 17, 2023, 9:07 PM

#

I don't worry about them closing their blob(s) off, but they are pretty aggressive about deprecation. The older LyraT board already doesn't work with the latest framework.

mortal forge Feb 17, 2023, 9:09 PM

#

It's understandable though. A lot of the GPUs etc aren't backward compatible for some of the tensor acceleration. The technology is advancing too quickly to remain backward compatible for long periods of time

#

Unfortunately HW is like toasters: every couple of years you'll need to get new ones. I'm in this cycle myself now, thus my interest in this project.. 😉

#

k, I've got two Korvos on order. Should be here in a couple days

mint glade Feb 17, 2023, 9:15 PM

#

Awesome, thanks for looking into this. I'd suggest reading up on ADF audio streams: https://docs.espressif.com/projects/esp-adf/en/latest/api-reference/streams/index.html#

#

They have an HTTP stream example that could be adapted pretty easily.

#

We'll do silence detection on the base station side, so we just need a small protocol to say "stop sending audio".
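
A minimal sketch of that idea on the base station side, assuming 16-bit little-endian mono PCM frames and a simple RMS energy threshold. The `STOP` message, the threshold, and the frame count are all made-up placeholders, not an ADF or Home Assistant protocol.

```python
import math
import struct

STOP_MESSAGE = b"STOP"  # hypothetical control message sent back to the satellite


def frame_rms(frame_bytes):
    """RMS amplitude of a frame of little-endian signed 16-bit mono samples."""
    count = len(frame_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class SilenceDetector:
    """Signals a stop after enough consecutive quiet frames (assumed thresholds)."""

    def __init__(self, threshold=500.0, max_silent_frames=25):
        self.threshold = threshold
        self.max_silent_frames = max_silent_frames
        self.silent_frames = 0

    def feed(self, frame_bytes):
        """Return STOP_MESSAGE once the silence limit is reached, else None."""
        if frame_rms(frame_bytes) < self.threshold:
            self.silent_frames += 1
        else:
            self.silent_frames = 0  # speech resets the counter
        if self.silent_frames >= self.max_silent_frames:
            return STOP_MESSAGE
        return None
```

The base station would call `feed()` on each received frame and send the returned message to the satellite when it is not `None`.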

mortal forge Feb 17, 2023, 9:17 PM

#

Have you evaluated the network performance of each of these protocols? Why leverage a TCP-based protocol? Are they long-lived sessions where you don't have to establish a TCP handshake?

#

I don't see a network protocol they support that is UDP based

mint glade Feb 17, 2023, 9:20 PM

#

I'd prefer UDP or RTP (on top of UDP). It is possible to use UDP, I just don't think they have a ready-made stream for it.
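
For reference, a sketch of what "RTP on top of UDP" adds: a 12-byte fixed header (version 2, payload type, sequence number, timestamp, SSRC) in front of each audio chunk. The helper names and the dynamic payload type 96 are assumptions for illustration; nothing here is an ADF stream.

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=False):
    """Prefix payload with a 12-byte RTP fixed header (no padding, extension, or CSRCs)."""
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (0x80 if marker else 0) | (payload_type & 0x7F)
    header = struct.pack(
        "!BBHII", byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF
    )
    return header + payload


def parse_rtp_header(packet):
    """Decode the fixed header fields and return them with the payload."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 & 0x80),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }
```

The sequence number lets the receiver notice lost or reordered UDP packets, and the timestamp lets it place each chunk correctly in time.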

mortal forge Feb 17, 2023, 9:20 PM

#

Guess they want to ensure the data reliably gets there. If you use UDP, you might not get the full sample, now that I think of it

#

They are using the network to handle retransmissions of lost packets.

mint glade Feb 17, 2023, 9:21 PM

#

Since we're doing on-device wake word recognition, I'm not as worried about streaming performance. Our commands are 5-10 seconds or less.
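
A quick back-of-the-envelope sketch of the data volume involved, assuming uncompressed 16 kHz, 16-bit mono PCM sent in 20 ms packets. Those audio parameters are assumptions, not something settled in this thread.

```python
def pcm_bytes(duration_ms, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return duration_ms * sample_rate * bytes_per_sample * channels // 1000


def packets_needed(duration_ms, frame_ms=20):
    """Number of fixed-duration packets needed to cover the audio (rounded up)."""
    return -(-duration_ms // frame_ms)


command_ms = 10_000  # a 10-second command, the upper end mentioned above
total = pcm_bytes(command_ms)          # 320000 bytes for the whole command
per_packet = pcm_bytes(20)             # 640 bytes of audio per 20 ms packet
packets = packets_needed(command_ms)   # 500 packets
```

Under these assumptions a full command is about 320 KB at 32 KB/s, which is a small load for Wi-Fi.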

mortal forge Feb 17, 2023, 9:22 PM

#

They also support RAW, so nothing is preventing us from sending it any way we like. Even over MQTT, if you wanted

mint glade Feb 17, 2023, 9:22 PM

#

This would be a fun way to save a few bits: https://github.com/phoboslab/qoa

mortal forge Feb 17, 2023, 9:23 PM

#

I like it

mint glade Feb 17, 2023, 9:24 PM

#

Gotta head to a meeting; let's chat more later 👍

mortal forge Feb 17, 2023, 9:27 PM

#

kk.. btw, terse look shows

AUDIO_STREAM_WRITER, the connection is [raw] ->[codec-mp3]->[i2s]. So no reason we couldn't use [raw] ->[codec-mp3]->[QOA]

lapis ore Feb 18, 2023, 8:41 PM

#

This may sound stupid, but can we cut the voice responses and replace them with an RGB light? An ESP music stream player has been a long-time wish of users, and no one has achieved it. Let's keep it simple like you did with intents: first wave only hassio.turn.on, and responses come in the second month. There are many environments that don't need a voice response: workshop, bathroom, etc. A cheap trigger device will be needed. One RGB LED can give a lot of information back, or a simple 7-segment display… if I2C is limited by the microphones and a display is needed. Let's focus on catching the keyword and sending it to HA.
I read the comments on this topic and the other one. I think every animal in the world has an ear, and the ear is an important part of a sound sensor. It will fix many problems from echoes, etc. All tests fail without any "ear". Owls have one of the best and simplest ears; if my memory is good, they just have holes in the skull with a spiral shape leading to the ear sensor, all covered behind the fluffiest feathers. The ears of owls are not symmetric and sit at different heights from eye level. But an owl has more than 360 degrees of head movement and the ability to hear a mouse's heartbeat 20 meters away and under 20 cm of snow (don't take my numbers and example literally), and can hit the spot where the mouse is.
Another thing: can we narrow the input to one frequency range, a lower one, that every human uses in a normal voice in any language, to kill high-frequency echoes?
The microphone and program do not need the sound to be perfectly clear to understand the sentence. We will read only the high and low waves.

lapis ore Feb 18, 2023, 9:03 PM

#

According to Google, the frequency of a male voice is 120 Hz and a female voice is 300 Hz, which means we don't want any sound outside the 80-350 Hz range; we don't need voice recognition.

mortal forge Feb 19, 2023, 2:13 PM

#

My plan is to use NNI to detect the wake word locally on the satellite. Keep the satellite as simple as possible.

Agree with the user feedback method: keep it simple and lightweight. I personally do not intend to use the satellite as a speaker. RGB notification and a sound prompt / ding is good enough for me. If we want to include audio feedback, it would be easy enough to store the sounds locally on flash and just play them.

Btw, an ESP music stream player should be attainable. You just have to use an ESP with enough power, i.e. an ESP32 instead of an ESP8266. Not all ESPs are equal. The ESP32-S3 has HW NNI built in, which is what we plan to use to detect the wake word locally.

As far as the acoustics of microphones and enclosures go, that is outside my current skillset. We would need access to someone with an anechoic chamber.

vernal garden Feb 22, 2023, 4:46 AM

#

hello, not very experienced but this project is pretty intriguing and I would love to learn more. Just wanted to put this vid here that I watched a few years back https://youtu.be/re-dSV_a0tM

lapis ore Feb 26, 2023, 1:00 PM

#

Nice example of how things can work on the ESP32 with room left for improvement. I think we must start by collecting data for wake words.

https://github.com/pschatzmann/arduino-audio-tools
That is an Arduino library for audio streaming, etc.
We only need to make it work with two microphones and read from whichever gives the best audio source.

vernal garden Mar 19, 2023, 11:18 PM

#

https://www.hackster.io/news/espressif-s-latest-esp-sr-speech-recognition-library-boosts-accuracy-and-optimizes-memory-usage-90f6c24a2ecd

mint glade Mar 20, 2023, 1:51 AM

#

Nice 🙂

tulip reef Mar 25, 2023, 10:21 AM

#

Have any of you looked into microphone arrays? Or are they being left out to get a first simple and cheap solution? Either way, there are some available dev boards I'd like to share:
https://www.seeedstudio.com/ReSpeaker-Mic-Array-v2-0.html

https://www.matrix.one/products/voice
