#How can I modify the firmware hosted in the home-assistant-voice-pe repo to run on my Voice PE?

hot loom
#

I trained my own microWakeWord model to use with my Voice PE, and I found that the ESPHome YAML files live in https://github.com/esphome/home-assistant-voice-pe (the models are defined here: https://github.com/esphome/home-assistant-voice-pe/blob/f8a0cfc0d17e9f7f544558bf585ce05b6e01571f/home-assistant-voice.yaml#L1533 ), but there are three YAML files. Which one should I feed into the ESPHome flasher? Is it sufficient to just run `esphome run home-assistant-voice.yaml`, and if so, what are the other two YAML files for? Also, I'm guessing I will break OTA updates by doing so. Is it possible to OTA update with that YAML without having to wipe the device every time?

GitHub

Home Assistant Voice PE. Contribute to esphome/home-assistant-voice-pe development by creating an account on GitHub.

zealous pier
hot loom
#

awesome, thank you! i trained it using the Jupyter notebook given in the microwakeword repository, but made some major changes to it:

  • tuned the hyperparameters significantly
  • added MeloTTS as another way of creating samples
  • modified the code to accept multiple "wake words", which I simply used to train on different pronunciations (my wake word is "hey glados")
#

oh and i made it run for 3 million steps, basically overfitting like crazy lol

#

i run on a local GPU using the "local runtime" functionality of Colab, but I suspect you can just use the GPU in Colab, you'll just have to be... very patient...

#

i plan on committing the notebook once i know the model works well, or at least as good as i can get with synthetic data, but for now you are welcome to observe the beautiful mess it is lol

zealous pier
hot loom
#

thanks! the yaml file worked just fine, simply replaced the mycroft model with my json file and it all looks great so far! it's very nice to have a voice pe with my own model and my own wake word lol

#

the false positives were infuriating last time around. I certainly overfit it like crazy, threw around 18 hours of GPU time on it (even more than last time). this version looks a little bit better

zealous pier
hot loom
#

once you have the tflite file at the end, you'll have to create a JSON file for the model. you can see hints on what probability cutoff to use printed at the end of the training step. the tensor arena size is honestly a complete mystery to me, and I don't recommend changing any other number in there:

{
    "type": "micro",
    "wake_word": "hey glados",
    "author": "John Karabudak",
    "website": "https://www.johnthenerd.com/",
    "model": "hey_glados.tflite",
    "trained_languages": ["en"],
    "version": 2,
    "micro": {
      "probability_cutoff": 0.7,
      "feature_step_size": 10,
      "sliding_window_size": 5,
      "tensor_arena_size": 30000,
      "minimum_esphome_version": "2024.7.0"
    }
}
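
A minimal sketch for generating this manifest from a script instead of editing it by hand, which can help when iterating over several model versions. The field names and values are copied from the JSON above; the helper name `build_manifest` is only illustrative and not part of microWakeWord or ESPHome.

```python
import json


def build_manifest(wake_word, model_file, author, website, probability_cutoff=0.7):
    """Return a manifest dict with the same fields as the JSON above."""
    if not 0.0 < probability_cutoff < 1.0:
        raise ValueError("probability_cutoff must be between 0 and 1")
    return {
        "type": "micro",
        "wake_word": wake_word,
        "author": author,
        "website": website,
        "model": model_file,
        "trained_languages": ["en"],
        "version": 2,
        "micro": {
            "probability_cutoff": probability_cutoff,
            # The remaining values are copied unchanged from the manifest above.
            "feature_step_size": 10,
            "sliding_window_size": 5,
            "tensor_arena_size": 30000,
            "minimum_esphome_version": "2024.7.0",
        },
    }


manifest = build_manifest(
    wake_word="hey glados",
    model_file="hey_glados.tflite",
    author="John Karabudak",
    website="https://www.johnthenerd.com/",
)
manifest_json = json.dumps(manifest, indent=4)
print(manifest_json)
```

Only `probability_cutoff` is exposed as a parameter here, in line with the advice not to touch the other numbers.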
#

and at the end, i just took the yaml file you mentioned at the top, replaced one of the three wake words (you can maybe add a new one, i didn't bother) with mine, and flashed it using esphome

#

oh, and as a final note, the settings I configured take forever to run. i'm talking 18+ hours on my 4060Ti. if you're not patient, you can try doubling the learning rate and significantly decreasing the number of steps. or just play around with all the parameters, see what works best. good luck! please do tell us if you manage to make a good model!

hot loom
#

also, it's kind of annoying to have to connect all the voice pe's to my computer and update everything one by one every time I want to test a new iteration of my model or overall receive any sort of update. is there a better way?

hollow turtle
#

Adopt them in esphome

#

Then you can change their yaml in there and wirelessly flash them

weak orbit
#

@hot loom I have just done exactly the same thing, including the same wake word, and even recorded my own voice samples to train against and I am still seeing a lot of false activations. Seems like GlaDOS might be a very popular wake word so hopefully with everyone's efforts combined we can get a micro wake word that doesn't suck.

#

And what Nick said is correct. Flash the standard firmware from the provided repo and then adopt it in ESPHome. Then, in the ESPHome YAML, just add the changes you want to make and they will add to or override the base values when you compile the firmware locally. That way you can run updates more easily without having to copy and paste the base YAML and reapply your own changes.

hot loom
#

great, thanks a lot! for now my model looks like it doesn't have as many false positives, but it's too early to tell. I'll keep experimenting and let you guys know!

hot loom
#

now I seem to have the opposite problem: it often takes at least two attempts to wake the device up, but false positives don't seem to be happening nearly as often. I modified the hyperparameters to heavily prefer false negatives over false positives, so maybe some slight tweaking in the opposite direction might achieve something decent

weak orbit
#

@hot loom I just got finished training a "Hey GLaDOS" wake word.

```
INFO:absl:Cutoff 0.99: frr=0.0500; faph=0.000
INFO:absl:Cutoff 0.96: frr=0.0400; faph=0.187
INFO:absl:Cutoff 0.94: frr=0.0300; faph=0.375
INFO:absl:Cutoff 0.86: frr=0.0200; faph=0.562
INFO:absl:Cutoff 0.84: frr=0.0200; faph=0.750
INFO:absl:Cutoff 0.79: frr=0.0200; faph=0.937
INFO:absl:Cutoff 0.78: frr=0.0200; faph=1.125
INFO:absl:Cutoff 0.67: frr=0.0200; faph=1.312
INFO:absl:Cutoff 0.63: frr=0.0200; faph=1.500
INFO:absl:Cutoff 0.54: frr=0.0100; faph=1.687
INFO:absl:Cutoff 0.51: frr=0.0100; faph=1.875
INFO:absl:Cutoff 0.51: frr=0.0100; faph=2.000
```
#

frr is the chance of a failure to activate (0.05 means 5%), so the worst is projected to be 5%.
If I want to reduce this to 3% but I am okay with 0.375 false activations per hour then it's not bad to limit it to 0.94
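
The same trade-off can be picked out of the training log mechanically. The sketch below parses a few of the `Cutoff` lines from the log above and returns the row with the fewest false activations per hour that still meets a false-rejection limit. The function names `parse_cutoffs` and `pick_cutoff` are made up for this example and are not part of microWakeWord.

```python
import re

# A few lines copied from the training log above.
LOG = """\
INFO:absl:Cutoff 0.99: frr=0.0500; faph=0.000
INFO:absl:Cutoff 0.96: frr=0.0400; faph=0.187
INFO:absl:Cutoff 0.94: frr=0.0300; faph=0.375
INFO:absl:Cutoff 0.86: frr=0.0200; faph=0.562
INFO:absl:Cutoff 0.84: frr=0.0200; faph=0.750
"""

LINE = re.compile(r"Cutoff (\d*\.?\d+): frr=(\d*\.?\d+); faph=(\d*\.?\d+)")


def parse_cutoffs(log_text):
    """Return a list of (cutoff, frr, faph) tuples found in the log text."""
    return [tuple(float(g) for g in m.groups()) for m in LINE.finditer(log_text)]


def pick_cutoff(rows, max_frr, max_faph):
    """Return the row with the lowest faph that meets both limits, or None."""
    ok = [r for r in rows if r[1] <= max_frr and r[2] <= max_faph]
    return min(ok, key=lambda r: r[2]) if ok else None


rows = parse_cutoffs(LOG)
choice = pick_cutoff(rows, max_frr=0.03, max_faph=0.375)
print(choice)  # (0.94, 0.03, 0.375)
```

With a 3% rejection limit and at most 0.375 false activations per hour, only the 0.94 row qualifies, which matches the reasoning above.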

hot loom
#

how well does it work? I'm still experimenting, having some issues with balancing the false positives and negatives

#

I'm currently training another model with 3m steps, where the melotts is only used for testing. I'm hoping for more accurate faph and frr results

#

my current understanding is that the frr and faph given are not necessarily accurate since we use mostly the same dataset for training and testing

#

but I'm not 100% sure since there is in fact one dataset that's only used in testing

bright kite
weak orbit
#

My mission today is to have my partner and me both sit down with the studio mic and just bash out a load of samples. I've already written a converter to resample them down to 16000 Hz, split each recording on the silence between utterances, and then trim the silence from each clip. So hopefully by tonight I will have a personalised model trained. I will keep the AI-generated samples but cut them back a little and add more variety in the ways to say GLaDOS.
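
The converter itself was not posted, so the following is only a rough sketch of the split-and-trim step, assuming the audio is already loaded as mono floats in the range [-1, 1]. The function name, the amplitude threshold, and the minimum silence length are all illustrative guesses, and resampling to 16000 Hz is not covered here.

```python
def split_on_silence(samples, rate, threshold=0.02, min_silence_s=0.3):
    """Split a mono signal into clips wherever it stays quiet long enough.

    samples: sequence of floats in [-1, 1]; rate: samples per second.
    Leading and trailing silence is dropped from every clip; pauses shorter
    than min_silence_s stay inside the clip.
    """
    min_gap = round(min_silence_s * rate)
    clips, start, last_loud = [], None, None
    for i, value in enumerate(samples):
        if abs(value) >= threshold:
            if start is None:
                start = i  # first loud sample opens a clip
            elif i - last_loud > min_gap:
                # The quiet stretch was long enough: close the previous clip.
                clips.append(list(samples[start:last_loud + 1]))
                start = i
            last_loud = i
    if start is not None:
        clips.append(list(samples[start:last_loud + 1]))
    return clips


# Toy signal at 10 samples/s: two bursts separated by 0.5 s of silence.
signal = [0.0] * 5 + [0.5] * 3 + [0.0] * 5 + [0.4] * 2 + [0.0] * 4
clips = split_on_silence(signal, rate=10)
print([len(c) for c in clips])  # [3, 2]
```

A fixed amplitude threshold is the simplest possible choice; real recordings with room noise would probably need a threshold tuned to the noise floor.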

hot loom
#

that's probably the most effective way. you may find it helpful to augment the data in similar ways as done in the script, and could possibly modify the speed/pitch to generate more samples from the audio data
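
As a rough illustration of the speed/pitch idea, the sketch below resamples a clip by plain linear interpolation, which changes speed and pitch together when the result is played at the original sample rate. This is a deliberately crude stand-in and not what the training script does; a real pipeline would more likely use a proper resampler or an audio augmentation library. The name `change_speed` is made up for this example.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation so the clip plays `factor` times faster.

    Played back at the original sample rate, factor > 1 shortens the clip and
    raises its pitch; factor < 1 lengthens it and lowers the pitch.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    out_len = max(2, round(len(samples) / factor))
    step = (len(samples) - 1) / (out_len - 1)
    out = []
    for n in range(out_len):
        pos = n * step
        i = min(int(pos), len(samples) - 2)  # keep i + 1 inside the clip
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


clip = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
# Three variants of one recording: slightly slower, unchanged, slightly faster.
variants = {f: change_speed(clip, f) for f in (0.9, 1.0, 1.1)}
print({f: len(v) for f, v in variants.items()})
```

Small factors close to 1 are probably the safer choice, since large changes may no longer sound like a plausible speaker.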

#

meanwhile I keep training with the AI dataset, trying different hyperparameters. my current biggest problem is balancing faph and frr

zealous pier
dusty grail
#

I have noticed that Wyoming voice, HA native and ESP voice all use a wake word file with the .tflite extension in their setup. Really going out on a limb here, but is that file common to the three systems or uniquely generated for each application? I have a version of a GLaDOS wake word running locally on Wyoming Satellite that I’ve been pleased with. If it is unique, could I reuse the training data and just retrain for ESP voice?

zealous pier
dull cloak
#

What’s helpful for building sample sets? Presumably since it’s for your own usage, using your voice alone is good, but if there’s benefit to more voice samples, I’d happily contribute some time if it results in a generally useful model.

Are you also artificially layering in background noise into samples? Or does organic background noise help?

hot loom
#

I layer in background noise the same way the example microwakeword training notebook does, but maybe I should find some more datasets to augment with. I'm guessing more real samples would help - right now I'm stuck between "hard to activate" and "many false positives", I keep trying to tune hyperparameters but it doesn't help all that much
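
For readers unfamiliar with the idea, layering in background noise usually means adding a noise clip to the speech at a chosen signal-to-noise ratio. The sketch below shows that general technique in plain Python. It is not the notebook's actual augmentation code, and the names `rms` and `mix_at_snr` are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so speech is `snr_db` dB above the noise.

    The noise is looped if it is shorter than the speech. No clipping
    protection is applied.
    """
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        return list(speech)
    # From SNR(dB) = 20 * log10(speech_rms / scaled_noise_rms).
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]


# Synthetic stand-ins for a speech clip and a background noise clip.
speech = [0.5, -0.5] * 50
noise = [0.1, -0.1, 0.1, -0.1]
mixed = mix_at_snr(speech, noise, snr_db=20)
```

Mixing each positive sample at several SNR values and with several noise sources is one way to get more variety out of a small set of recordings.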

jovial pecan
#

thanks for sharing the colab code and glados model - ran it on my pe and it seems to work decently well at 0.7 despite the false negatives, honestly impressed that this is even achievable

#

i wonder if the HA dev team or kevin (props for even making this remotely possible) has any existing or planned initiatives/websites (besides the one for okay nabu) to submit crowd source voice samples, which would allow us to train a community repository of functional micro wake words? i believe most of us here would be more than happy to contribute 🙂

inland edge
rustic trail
# hot loom I layer in background noise the same way the example microwakeword training note...

Welcome to the last year of my life 🤣 It's hard to get these things figured out, and what works for one wake word doesn't work for others. A couple tips:

  1. See https://github.com/kahrendt/microWakeWord/blob/main/documentation/data_sources.md for more datasets to augment with. The ones downloaded in the notebook are minimal, I wanted to keep it small for a basic training example
  2. Using only Piper TTS samples for the validation and test sets leads to models that benchmark well but perform poorly, and vice versa. I suggest using a different source for your validation and test sets. Ideally they are real samples, but I had good luck using the HA Cloud voices for the validation and test sets.
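
One way to apply the second tip is to tag every clip with where it came from and keep whole sources out of the training set. The sketch below assumes clips are tracked as `(source, path)` pairs; the source labels and file names are hypothetical, and `split_by_source` is not a microWakeWord function.

```python
def split_by_source(samples, held_out_sources):
    """Partition (source, path) pairs so held-out sources never reach training."""
    train, held_out = [], []
    for source, path in samples:
        (held_out if source in held_out_sources else train).append(path)
    # Alternate the held-out clips between validation and test.
    return {
        "train": train,
        "validation": held_out[0::2],
        "test": held_out[1::2],
    }


# Hypothetical clip list: Piper samples for training, another voice source held out.
samples = [
    ("piper", "piper_0001.wav"),
    ("piper", "piper_0002.wav"),
    ("ha_cloud", "cloud_0001.wav"),
    ("ha_cloud", "cloud_0002.wav"),
    ("real", "mic_0001.wav"),
]
splits = split_by_source(samples, held_out_sources={"ha_cloud", "real"})
print(splits)
```

Splitting by source rather than at random is the point here: a random split would let the same TTS voices appear on both sides.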
hot loom
#

interesting - I did try using another TTS engine for the validation set but didn't get great results. a database for real voices could absolutely work

#

I happen to have the compute for running a server for such, but I don't know the first thing about the legal side of doing such a thing (considering that I now keep voices of internet strangers)

rustic trail
#

You have to be careful. The wake word collective has some terms that people accept before submitting their samples

hot loom
#

that's exactly what I'm scared of. don't mind hosting stuff, but not if it can bring people knocking at my door

jaunty onyx
hot loom
#

I can't help but wonder if quantization is really necessary if we are willing to give up the ability to run multiple wake words. it could perhaps improve the quality, but not sure if the ESP32 has the processing power to shuffle the fp16 weights around

hot loom
rustic trail
hot loom
#

interesting, in that case it's not worth messing with

rustic trail
hollow turtle
#

I wonder if we could come up with some mechanism where we could capture "near misses" for wake words from the devices (like something that fell just outside the threshold within a certain margin) that a user could then listen to and determine if they should have worked, and then upload those samples as potential training data? Bet that would help a lot with training 😄
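
The "near miss" rule could be as simple as a second, lower threshold. The sketch below only illustrates that decision logic. It assumes the device could report a per-detection probability, which nothing in this thread confirms, and the 0.1 margin is an arbitrary example value.

```python
def classify_detection(probability, cutoff=0.7, margin=0.1):
    """Label a wake word score as 'accepted', 'near_miss', or 'ignored'."""
    if probability >= cutoff:
        return "accepted"
    if probability >= cutoff - margin:
        # Just below the cutoff: a candidate for the user to review later.
        return "near_miss"
    return "ignored"


for score in (0.91, 0.65, 0.30):
    print(score, classify_detection(score))
```

Only the `near_miss` clips would need to be kept for review, which limits how much audio is stored.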

hot loom
#

perhaps another potential solution is to write an esphome plugin that records all positive wake words spoken to it, stores it locally on some server

#

yes exactly lol

hollow turtle
#

or even false positive wakes I guess, for a negative training set

hot loom
hot loom
hollow turtle
#

Yeah I mean I think we shouldn't try to scrape anything automatically. Rather, let them have those samples, and if they so choose, upload them, using a similar mechanism to what we already have that allows people to submit voice samples.

#

could even be the same site that exists now, just with an option to upload an audio file

#

and maybe something to denote it is a positive or negative sample

hot loom
#

the issue with uploading is that there are potential legal issues associated with it - since we have to store user voices

#

i was just thinking of uploading everything to the user's laptop, where they can go train themselves lol

hollow turtle
#

Well no, the site has that disclaimer you have to agree to

#

so you are consciously saying it is ok for Nabu to use the samples you are uploading, it's your choice 🙂

hot loom
#

that is fair enough. i certainly do not understand the legal side of it - of course that's a much better solution

hollow turtle
hot loom
#

yep, I used it to submit a few okay nabu samples myself lol

hollow turtle
#

You have to check the box that says "I agree to the terms" so that pretty much solves that issue

#

If you click that and then upload samples, I wouldn't really see a problem 😅

rustic trail
hot loom
#

oh, in that case yes, it could be problematic. if I end up doing so, I'll just use it with the validation/test set and leave a big scary disclaimer in my colab notebook

rustic trail
#

The training script downloads AudioSet because it's small to get started. I don't use that for any of the official wake words

hot loom
#

I see

rustic trail
#

AudioSet is kind of weird since they pulled it from YouTube videos, so there are potential problems

#

Collecting samples on the device is something we would like to do eventually. We have to be careful about it though, just because the admin decides to collect them doesn't necessarily mean all people in the house consent

#

So I don't know if automatically saving them is the best idea. Perhaps just a mode to record samples in a session, much like how our wake word collective works

#

As we were developing the VPE, we did have some hacked together sample collection thing, mainly for gathering false negatives. We only used it internally at NC (really only me and Mike ran it if I remember correctly...)

hot loom
#

that is true. I mean, we could just periodically announce that we are doing so, with voice commands to enable/disable data collection

rustic trail
#

Uploading the sample also took a fair amount of time and delayed the start of the voice pipeline. This could probably be worked around though

hot loom
#

everyone in the household will find out very quickly that their data is being collected, and they could just intuitively ask the pe to stop doing so

#

I suspect that will be very dependent on the unique circumstances of whoever is running it

#

we could even do things like deleting all data since the last announcement any time someone asks for data collection to stop

hollow turtle
#

Yeah it's a bit of a gray area, because one of the main things people like about these devices is the fact no one is collecting your data 😉

rustic trail
#

Exactly... even if it's opt-in it still doesn't feel great

hollow turtle
#

So have to make sure the user is very well informed of what is going on and assured it is all their decision and completely under their control, and probably disabled by default I guess?

hot loom
#

yes, completely agreed. definitely should be disabled by default

hollow turtle
#

Probably a setting more for the power user that wants to help out and is fully OK with the samples being used to improve wake

#

Probably even something to enable on a per-device basis, so they can decide exactly where samples come from

hot loom
#

I feel like it simply not leaving the user's control is even better, but that's difficult

rustic trail
#

I think just enabling a collection mode for the next minute would be useful, so it's just a temporary session

hollow turtle
#

like maybe you are cool with your living room device providing samples, but not cool with them coming from your bedroom 🤣

#

Yeah, I am just thinking of something I had happen today, where I said the wake word a few times while music was playing and it didn't hear me, would be neat to log those and see if they can be useful samples

#

but yeah I could do what you said Kevin and just flip that mode on and try it a few times

hot loom
#

that's true. I still like the idea of giving power users the choice of keeping the data entirely local, but that's the kind of thing that would become a custom esphome plugin anyway lol

#

I also think false positives would be very hard to diagnose that way - my colleagues are very familiar with the sound of glados now 😆