#Satellites
@mortal forge Precise works OK, but it's a recurrent net (GRU) and not quantized (8-bit weights) so not yet working on an NPU.
Google has a number of quantized CNN models for training here: https://github.com/google-research/google-research/tree/master/kws_streaming
They use TensorFlow, so not sure how easily they could be ported to the ESP32-S3 NPU.
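For anyone unfamiliar with what "quantized (8-bit weights)" means in the messages above, here is a minimal sketch of symmetric int8 quantization. It is not how Precise, kws_streaming, or Espressif's tools do it; the function names and the per-tensor scaling scheme are assumptions chosen for illustration.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.26, 0.0]          # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)    # each value is within scale/2 of the original
```

The point is that each weight is stored in one byte plus a shared scale, which is the kind of representation integer accelerators generally expect.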
All neural networks that I'm aware of use tensors; it all depends on whether they have a lib to do HW offloading, like with GPU/TPU/CUDA, etc. The ESP32 has built-in NNI support and Espressif provides the libs. We need to see if anyone has done the integration of torchaudio with the ESP32-S3 NNI hardware.
Tensorflow Lite Micro is supported on the ESP32, which I believe includes the mel spectrogram functions we'd need.
Do you know what backend libraries torchaudio uses? I.e., like how PyTorch uses OpenCV.
I don't want to write this all in python and find out it's not performant enough. Would rather look at doing it in C++ from the start
Looks like the decision was already made for us. "TensorFlow Lite for Microcontrollers is written in C++"
Yeah, it's a shame it's not Rust but you gotta take what's there 😄
it's all going to be compiled languages for microcontrollers
if you want to do anything serious that is 🙂
agreed.. 100%
For the S3 -> the biggest problem is that it looks from their docs that Espressif needs to train your models
We will need to put the effort in up front, but the rewards will be worth it. cheap hw that is very performant
right, that's the goal
You can't train the models on box. You have to train it off box
Model is a model. doesn't matter where you train it. That's the point.
Espressif's audio framework (ADF) is integrated with their custom model toolchain, though.
If we can bypass that, it would be awesome.
Sure, but if you want to use Espressif WakeNet it's all custom https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/ESP_Wake_Words_Customization.html
Seriously? That must be one of their closed blobs
If we can get a Tensorflow Micro model trained, exported, and integrated into the ADF framework we'll be sitting good. And then Espressif will deprecate the S3 😄
I guess one can run other wake engines on ESP32
it's not clear if WakeNet is using some secret instruction set in their chips
Yeah, I was just thinking that. That's the whole point, we would end up creating our own engine and using their HW acceleration
While it would be nice to leverage what they have already done, I was more interested in the HW capabilities of the platform than the libs they provide. I'm going to have a hard time convincing my wife to move from "Alexa" to "Nihao Xiaoxin"
Though they do have some ok ones in there already.. May be worth looking at
"Hi, ESP" works on the newer boards (S3 based).
So does "Alexa" surprisingly
As Stuart said in the Github thread, it would be great if we could convince one audio engineer/programmer to implement some open source algorithms for us.
I'm going to put the Product Manager hat on for a second and think about actually delivering functionality. I'm half tempted to suggest we leverage what Espressif has done already, as it works and is already done. The wake words aren't as horrible as I poked fun at and could be used out of the box.
That said we could go back at any time and replace that functionality with our own module, when we have the luxury of time.
Thoughts?
I would almost rather want to put effort into a cheap HW platform and getting that working.
From a HW perspective, I don't think I would even worry about adding a speaker / amplifier to it initially.
It would give us the ability to start capturing audio from a standard platform / source
The Korvo-2 board is a good starting place: https://www.digikey.com/en/products/detail/espressif-systems/ESP32-S3-KORVO-2/15822448
I've successfully tested the wake word and audio playback. It's only a few more steps to add the UDP audio streaming.
That's awesome! Is it in python or C?
It's in C using their framework: https://docs.espressif.com/projects/esp-adf/en/latest/design-guide/dev-boards/user-guide-esp32-s3-korvo-2.html
(my Korvo doesn't have a screen, but same idea)
You basically install their toolchain (based on CMake), set some options, compile, and flash the board.
Just making sure. If they closed their blob off, it could be compiled in C and still be used by a Python lib they offer. C works just as well
Looking at the hw specs of the Korvo
I don't worry about them closing their blob(s) off, but they are pretty aggressive about deprecation. The older LyraT board already doesn't work with the latest framework.
It's understandable though. A lot of the GPUs etc aren't backward compatible for some of the tensor acceleration. The technology is advancing too quickly to remain backward compatible for long periods of time
Unfortunately HW is like toasters, every couple of years you'll need to get new ones. I'm in this cycle myself now, thus my interest in this project.. 😉
k, I've got two Korvos on order. Should be here in a couple days
Awesome, thanks for looking into this. I'd suggest reading up on ADF audio streams: https://docs.espressif.com/projects/esp-adf/en/latest/api-reference/streams/index.html#
They have an HTTP stream example that could be adapted pretty easily.
We'll do silence detection on the base station side, so we just need a small protocol to say "stop sending audio".
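A rough sketch of what base-station silence detection could look like, assuming 16-bit little-endian mono PCM frames and a simple energy threshold. The class name, the threshold of 500, and the three-frame rule are all made-up values for illustration, not anything from the project.

```python
import math
import struct


def frame_rms(pcm_bytes):
    """RMS amplitude of one frame of 16-bit little-endian mono PCM."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class SilenceDetector:
    """Reports end of speech after several consecutive quiet frames."""

    def __init__(self, threshold=500.0, quiet_frames_needed=3):
        self.threshold = threshold                  # assumed value, needs tuning
        self.quiet_frames_needed = quiet_frames_needed
        self.quiet_run = 0

    def feed(self, pcm_bytes):
        """Return True once enough quiet frames in a row have been seen."""
        if frame_rms(pcm_bytes) < self.threshold:
            self.quiet_run += 1
        else:
            self.quiet_run = 0                      # any loud frame resets the count
        return self.quiet_run >= self.quiet_frames_needed
```

When `feed` returns True, the base station would send its "stop sending audio" message back to the satellite, whatever form that message ends up taking.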
Have you evaluated the network performance of each of these protocols? Why leverage a TCP-based protocol? Are they long-lived sessions where you don't have to re-establish a TCP handshake?
I don't see a network protocol they support that is UDP based
I'd prefer UDP or RTP (on top of UDP). It is possible to use UDP, I just don't think they have a ready-made stream for it.
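To make the UDP idea concrete, here is a minimal sketch of framing audio chunks with a sequence number so the receiver can at least notice lost datagrams. The 5-byte header layout, the `FLAG_LAST` marker, and all function names are hypothetical; this is not RTP and not an ADF stream type.

```python
import socket
import struct

HEADER = struct.Struct("!IB")   # hypothetical header: 32-bit sequence number, 8-bit flags
FLAG_LAST = 0x01                # hypothetical "end of utterance" marker


def pack_chunk(seq, pcm_bytes, last=False):
    """Prefix one audio chunk with the header."""
    return HEADER.pack(seq, FLAG_LAST if last else 0) + pcm_bytes


def unpack_chunk(datagram):
    """Split a datagram into (sequence number, is_last, audio payload)."""
    seq, flags = HEADER.unpack_from(datagram)
    return seq, bool(flags & FLAG_LAST), datagram[HEADER.size:]


def missing_sequences(received):
    """Sequence numbers absent between the lowest and highest ones seen."""
    if not received:
        return []
    seen = set(received)
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


def send_audio(chunks, host, port):
    """Send a list of audio chunks as individual UDP datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for seq, chunk in enumerate(chunks):
            is_last = seq == len(chunks) - 1
            sock.sendto(pack_chunk(seq, chunk, last=is_last), (host, port))
    finally:
        sock.close()
```

UDP gives no retransmission, so a receiver using this framing could only detect gaps (via `missing_sequences`) and decide whether the audio is still usable.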
Guess they want to ensure the data reliably gets there. If you use UDP, you might not get the full sample, now that I think of it
They are using the network to handle retransmissions of lost packets.
Since we're doing on-device wake word recognition, I'm not as worried about streaming performance. Our commands are 5-10 seconds or less.
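For a sense of scale, a quick back-of-the-envelope calculation of how much raw audio a 5-10 second command is. The 16 kHz, 16-bit, mono format is an assumption (it is common for speech, but nothing above fixes it), and `pcm_bytes` is just an illustrative helper.

```python
def pcm_bytes(seconds, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Size in bytes of uncompressed PCM audio of the given length."""
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


for seconds in (5, 10):
    total = pcm_bytes(seconds)
    kbit_per_s = total * 8 // seconds // 1000
    print(seconds, "s ->", total, "bytes at", kbit_per_s, "kbit/s")
# 5 s  -> 160000 bytes at 256 kbit/s
# 10 s -> 320000 bytes at 256 kbit/s
```

Under those assumptions a whole command is a few hundred kilobytes of uncompressed audio.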
They also support RAW, so nothing is preventing us from sending it any way we like. Even over MQTT, if you wanted
This would be a fun way to save a few bits: https://github.com/phoboslab/qoa
I like it
Gotta head to a meeting; let's chat more later 👍
kk.. btw, terse look shows
AUDIO_STREAM_WRITER, the connection is [raw] ->[codec-mp3]->[i2s]. So no reason we couldn't use [raw] ->[codec-mp3]->[QOA]
This may sound stupid, but can we cut the voice responses and replace them with an RGB light? An ESP music stream player has been a wish of users for a long time, and no one has achieved it. Let's keep it simple like you did with intents: first wave only hassio.turn.on, and responses come in the second month. There are many environments that don't need a voice response: workshop, bathroom, etc. A cheap trigger device will be needed. And one RGB LED can give a lot of information back, or a simple 7-segment display… if I2C is limited by the microphones and a display is needed. Let's focus on catching the key word and sending it to HA.
I read the comments on this topic and the other one. I think that every animal in the world has an ear, and the ear is an important part of a sound sensor. It will fix many problems with echoes, etc. All tests fail without any "ear". Owls have one of the best and simplest ears; if my memory is good, they have just holes in the skull with some spiral shapes leading to the ear sensor, and everything is covered behind the fluffiest feathers. Owls' ears are not symmetric and sit at different heights relative to eye level. But an owl has more than 360 degrees of head movement and the ability to hear a mouse's heartbeat 20 meters away and under 20 cm of snow (don't take my numbers and example as exact) and can hit the spot where the mouse is.
Another thing: can we limit the input to one frequency band, a lower one that every human uses in a normal voice in any language? That would kill high-frequency echoes.
The microphone and program do not need the sound to be perfectly clear to understand the sentence. We will read only the high and low waves.
From Google, the frequency of a male voice is 120 Hz and a female voice is 300 Hz, meaning we don't want any sound outside the 80-350 Hz range; we don't need voice recognition.
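As a sketch of what restricting the input to 80-350 Hz would mean in practice, here is a single second-order band-pass filter using the widely published Audio EQ Cookbook (RBJ) coefficient formulas. The 16 kHz sample rate and the function names are assumptions, and one biquad only rolls off gently, so this is an illustration rather than a tuned design.

```python
import math


def bandpass_coefficients(low_hz, high_hz, sample_rate_hz):
    """Biquad band-pass coefficients (0 dB peak gain), normalised so a0 == 1."""
    center = math.sqrt(low_hz * high_hz)        # geometric centre of the band
    q = center / (high_hz - low_hz)             # wider band means lower Q
    w0 = 2.0 * math.pi * center / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def apply_biquad(samples, coeffs):
    """Run the filter over a sequence of float samples (direct form I)."""
    (b0, b1, b2), (a1, a2) = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


coeffs = bandpass_coefficients(80.0, 350.0, 16000)   # assumed 16 kHz sample rate
```

A tone near the middle of the band passes almost unchanged, while a 4 kHz tone comes out much quieter.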
My plan is use NNI to detect wake word locally on the satellite. Keep the satellite as simple as possible.
Agree with user feedback method, keep it simple and light weight. I personally do not intend to use the satellite as a speaker. RGB notification and sounds prompt / ding is good enough for me. If we want to include audio feedback, it would be easy enough to store the sounds locally on flash and just play them.
Btw, an ESP music stream player should be attainable. You just have to use an ESP with enough power, i.e., an ESP32 instead of an ESP8266. Not all ESPs are equal. The ESP32-S3 has HW NNI built in, which is what we plan to use to detect the wake word locally.
As far as acoustics of microphones and enclosure, that is outside my current skillset. We would need access to someone with an anechoic chamber.
hello, not very experienced but this project is pretty intriguing and will love to learn more. Just wanted to put this vid here that I watched a few years back https://youtu.be/re-dSV_a0tM
Nice example of how things can work on the ESP32 while leaving room for improvement. I think we must start by collecting data for wake words.
https://github.com/pschatzmann/arduino-audio-tools
That is the Arduino library for audio streaming, etc.
We only need to make it work with two microphones, reading from whichever one gives the best audio source.
Nice 🙂
Have any of you looked into microphone arrays? Or are they being left out to get a first simple and cheap solution? Nevertheless, there are some available dev boards I'd like to share:
https://www.seeedstudio.com/ReSpeaker-Mic-Array-v2-0.html