Hello everybody. I am pretty happy with the voice PE, but the microphone has trouble picking up my wake-word. I was wondering if there is any way to improve the microphones… also, it’s really not good at filtering a single voice source. If anybody else in the room is talking it will mix into the command. I also think the time from wake-word until it actually listens to your voice could be way shorter.
#Upgrading Voice PE microphones?
it's not perfect, but there's not really anything you can do about it at this stage.
The firmware is open source, isn’t it? Couldn’t it be installed on an ESP32 with better hardware connected?
I wonder though if it’s a hardware problem or a software problem
It's not a hardware problem. For your cases to work, real beamforming and voice prints would have to be implemented to extract the actual request from the other voices. There's no firmware for that yet, at least no open-source solution for any hardware you can buy.
you could potentially build something with a better microphone array. but at that point you're not really talking about a VPE anymore.
you could definitely look at the hardware design + the firmware of the esp32 and XMOS chip as reference to build something.
as for filtering 1 voice out of a crowd: realistically this is probably not going to happen in hardware. you could instead set it to stream the microphone to an external service where you perform your own audio processing. but that is probably going to add quite a bit of delay when using it, and there are no recommendations for such a thing currently.
Does anybody have some experience with the M5Stack Atom Echo in regards to microphone capabilities? Is it at least somewhat reliable?
I wish the voice preview edition would feature airplay… that way it could synch in a music group within music assistant :/
doesn't work with the generic sync group?
the atom echo is a nice little device for basic dev/debugging on some voice stuff at a desk but realistically its not a "production" device
It does, but it doesn’t sync at all. You’ll have like multiple seconds offset between devices.
Thanks. I got one in transit anyway. But yeah, I am looking into microphone arrays and an ESP32 with audio out so I can hook it up to a good speaker… I really want to move away from all the Alexa devices…
respeaker stuff is worth looking at. but realistically custom local voice currently won't necessarily compare directly to devices with companies throwing billions at them
the AE is cheap enough that it's nice to have on the desk to mess with for fun and initial testing of some things
I do have a couple of Echo 4, which have audio-in. Right now I have a couple of audio-casts connected to them so I can stream my local videogame soundtrack through them. I can still use Alexa commands while also being able to listen to my music. I would like to at least replace the audio-casts with voice assistants… the audio-casts have AirPlay… which is cool for syncing… but the ESP32 assistant is lacking that… maybe I should just have 3 devices: a good speaker with audio-in, an audio-cast (which has all the capabilities regarding good music playback) and a custom ESP32 with good microphones to use for voice commands…
Also, my voice PE just now completely stopped listening to wake up words and I have to press the button
things will get better with time. it's in super active development. so it's currently "in the worst state that it will ever be in" and it's pretty good already
that is unusual, the same with all the stock wakewords?
one thing I feel like is important to mention here
I assume you use the VA in german?
naturally english is a way better supported language
especially for the STT part
this is a good point
the wake words also suffer from not having enough german dialect "okay nabu" samples
considering we are up against companies like amazon/google/apple I am surprised how good this works tbh
I am always so excited when a new HA release kicks in and I can update my yaml
yeah, home assistant has a few thousand volunteers helping out with samples. amazon just slurps up a few million samples from their users without asking
I run both a VA PE & respeaker lite and I don't really see a difference between them when it comes to wake word or STT performance
that being said I haven't done a fair comparison either
yeah, for most practical purposes they are going to be the same
they run pretty similar hardware after all
one other project which might be worth mentioning is this:
https://github.com/rhasspy/wyoming-satellite
I haven't used that at all, so I can't speak for its performance
but it uses a raspberry pi instead of esp32
so you have a lot more choice over the hardware
It works again now… it didn’t work for bout 10 minutes without me having to press the button
This is correct. I do use it in German, but I am also trying to pronounce “Okay nabu” very cleanly…
i am english and have some trouble with it too to be fair
I’ll have to take a look at respeaker lite. I just added a few automations to use the voice PE to add sport activities to a to-do list, then another one which creates a prompt with all my activities, sends it to Mistral AI, and responds with a summary of my activities and recommendations. Loving it
so do I
especially my gf
the training set for female voices is even smaller I assume
i find "okay nar boo" to work a bit better. or do a silly voice
Heh, gotta try that 😄
the WW models will improve
funny enough she has the same issue with alexa
even that doesn't have enough female data
I love how highly customizable all this is. Just sad the voice PE can’t start a conversation… I have my pipeline set up to use Nabu Casa cloud
the voice pe can start a conversation and even continue one
pretty sure that was added recently
I thought so too, but I did try, and I remember I got an error about it not being supported. Will have to check again. Would make things easier
maybe your software on the VA isn't updated?
April 2, 2025
Shouldn’t I have gotten an update prompt?
did you adopt the device?
Oh, it’s not showing up in ESPHome as installed device… but as discovered
I should pair it, right?
no
if you adopt it you need to update manually
Okay, good
dont adopt it
Alright. Let’s see the firmware version
check on the device page if the firmware entity is enabled
you want 2025.4.0
Turned on the “beta firmware” entity
with 25.3.4 and home assistant 2025.4 you should be good to use start conversation
It seems to be up to date
you shouldn't need beta but it's also unlikely to break anything
Will try again later then.
Alright
do you have an LLM attached?
I must have messed up earlier then.
Only nabu casa…
So, that’s the problem then
I should start paying for an LLM or run my own… which is too power intensive
it only ramps up power during queries
Had planned to get a Jetson Orin nano super earlier, but people say it’s too weak
the base model doesn't have a lot of memory
i got a new gpu for my server the other day and it works great
Which one?
it has power spikes but they are short so actual energy usage is not that much
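To put rough numbers on the "short spikes" point, here is a minimal sketch of daily energy use for a GPU box that idles most of the day and only ramps up during queries. All wattages and query counts are made-up assumptions for illustration, not measurements from anyone's setup in this thread.

```python
def daily_energy_kwh(idle_watts, peak_watts, queries_per_day, seconds_per_query):
    """Energy used in one day by a machine that idles except while answering queries."""
    seconds_per_day = 24 * 60 * 60
    busy_seconds = queries_per_day * seconds_per_query
    idle_seconds = seconds_per_day - busy_seconds
    watt_seconds = peak_watts * busy_seconds + idle_watts * idle_seconds
    return watt_seconds / 3600 / 1000  # watt-seconds -> watt-hours -> kWh

# Hypothetical values: 15 W idle, 180 W during a query, 50 queries of 5 s each.
with_spikes = daily_energy_kwh(15, 180, 50, 5)
idle_only = daily_energy_kwh(15, 180, 0, 0)
print(f"with queries: {with_spikes:.3f} kWh/day, idle only: {idle_only:.3f} kWh/day")
```

Under these assumptions the idle draw dominates the daily total and the query spikes add comparatively little, so the idle wattage of the card and server is probably the number to watch.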
i got a 5060ti 16gb
Not cheap…
got at MSRP for 400 GBP which isnt awful
previous gen stuff was selling for more than that on ebay
3060 12gb here used for like 240€
But I'd recommend more than 12gb vram to run 14b models tbh
thats why i went with 16gb
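A back-of-envelope sketch of why 12 GB gets tight for 14B models. It assumes VRAM need is roughly the weight storage plus a flat allowance for KV cache and runtime buffers; the 2 GB allowance is a guess, and real usage depends on the quantization format and context length.

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead_gb=2.0):
    """Rough VRAM estimate: weights plus a flat, assumed allowance for cache and buffers."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weights_gb + overhead_gb

# Compare a 14B model at common quantization widths.
for bits in (4, 8, 16):
    print(f"14B @ {bits}-bit: ~{estimate_vram_gb(14, bits):.1f} GB")
```

With these assumptions a 4-bit 14B model already lands near the limit of a 12 GB card once a longer context is added, which fits the advice to go for 16 GB.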
you could go with multiple 12gb too depending on your server setup
Other pcie slot is used by my hba sadly haha
Gotta saw that slot open!
lol, don't tempt me.
How much did the whole setup cost? Looks pricey. Can’t one hook up a public AI agent like mistral or open ai or similar?… at least until I have enough money to invest on a dedicated home-run AI server
You sure can use third-party LLM
Would probably be cheaper for now. But token costs must be watched for sure
i don't know for sure tbh, i built that server a while back. just upgraded the gpu recently. it had an old 1650 super for transcoding before the upgrade
Is that server exclusively for HA?
no, in fact HA itself runs on another system, this is just a bigger system that does the processing
the mini pc on the shelf with the external drive runs HA and frigate
That's an x4 slot no?
I wonder how much worse ai will be on it
they also have the benefit of running much better models!
can't exactly run big models locally lol
pcie speed doesnt make much of a difference if its all in vram
the set still needs to be loaded into vram from system ram no?
yeah but once the model is loaded it's there. unless you're changing models a lot it's not a huge thing
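A rough sketch of the one-time load cost being discussed: model size divided by link bandwidth. The per-lane figures are approximate theoretical PCIe rates, and the 9 GB model size is a hypothetical example. Real loads are often limited by disk speed instead, so treat this as a lower bound.

```python
# Approximate theoretical bandwidth per lane in GB/s, keyed by PCIe generation.
PCIE_GB_PER_S_PER_LANE = {3: 0.985, 4: 1.969}

def transfer_seconds(model_gb, pcie_gen, lanes):
    """Best-case time to copy a model from system RAM into VRAM over PCIe."""
    return model_gb / (PCIE_GB_PER_S_PER_LANE[pcie_gen] * lanes)

# Hypothetical 9 GB model on a PCIe 4.0 link at different widths.
for lanes in (4, 8, 16):
    print(f"x{lanes}: ~{transfer_seconds(9, 4, lanes):.2f} s")
```

Even on an x4 slot the copy is a one-off of around a second in the best case, which is why slot width matters little once the model stays resident in VRAM.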
I think I could split my x16 up into 2x8 👁️
that system is a supermicro server board so it does all sorts of stuff. it's all pretty configurable
On what kind of device is your HA running? At first I was running mine on a raspberry pi 4. moved to a Thin Client (HP T530). However, if I read it correctly if I would like to have good tts / stt capabilities and the ability to connect to a LLM (even if not hosted on own hardware) I should once again move to a new machine. The documentation claims Intel NUC would be good for fast tts/stt
ha runs on a n150 mini pc, TTS runs on it but i offload STT to the bigger system which can use a bigger model and gpu acceleration
Okay. So, I probably won’t be willing to spend huge money on a local LLM in the future, but I’d still like my local voice devices to act as conversation starters… which can’t be done with the local voice pipeline… and thus I’d have to at least set up the TTS/STT pipeline and then connect to an external LLM… which means I’d have to make sure the machine HA is running on is capable enough. Gosh, if I were to do this I’d have to bind all Zigbee devices again… well, trying to figure out which Intel NUC (or similar) would be a good choice now
zigbee devices should "in theory" move with your adapter. just setting up the adapter on a new system should allow you to pick up the network where you left it
although i am not really that experienced with that tbh
my views on hardware including my setup can be found here
So, TTS is good on the listed Beelink devices?
The Beelink S12 is currently available for around 180€ (new)
Beelink S13 for 190€ (new).
Though the S13 is capped at 16GB ram? Not sure, product description is not quite clear in that regard
Could that be solved by sending all voice to Home Assistant? Or does it require more computing power than that?
These NUCs „only“ support up to 16GB RAM, right?
I would like to have HA running reliably on it with many Zigbee devices and Music Assistant
And of course local TTS/STT
Would STT still run reliably on the N150 NUC with 16GB RAM?
assuming it has a gpu that faster-whisper can use, yes
https://docs.linuxserver.io/images/docker-faster-whisper/
the container I use has nvidia support only
Faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models. This container provides a Wyoming protocol server for faster-whisper.
there seem to be alternative containers for intel igpu
(I cant vouch for how good it works tho)
Hm, I hadn’t planned to add any more hardware to the NUC. Especially not an external GPU. Don’t want to spend that much money on it right now
the nuc should have an igpu no?
Oh, I think it does. Hold on. I thought I’d be talking bout additional hardware
Intel Graphics 1000MHz (24EUs)
That’s the gpu it seems
Hardware requirements
Intel UHD Graphics for 11th generation Intel processors or newer; Intel Iris Xe graphics; Intel Arc graphics; Intel Server GPU; Intel Data Center GPU Flex Series; Intel Data Center GPU Max Series
this container has those hardware requirements
your igpu should have a bit more info
which cpu does the nuc have?
Intel Twin Lake N150
yeah I am clueless if this igpu is supported lol
they are mostly limited to 16gb ram yes, i just use standard piper on the CPU for TTS as it's not particularly an issue.
for STT i am not really sure about running it on the box itself, although i have heard that the tiny model does work pretty quickly. not sure if that makes use of the igpu or not, but i don't think so.
i have always offloaded whisper to a bigger machine because i have been able to
Hm, maybe I should hold off then until I got a dedicated machine with its own gpu to handle all the inference and heavy loading
You could always use the cloud if you have problems getting it to work properly
Until you have better hardware
this is also true
I thought Whisper could run on an Intel N200 CPU.
i imagine it can, but the size of model you can use in any reasonable way will be limited
it will run pretty quick if you're running tiny i imagine. but accuracy may be hit or miss
What would it take for an accurate model?
with any level of responsiveness? realistically... a gpu
Now I see why so much ML stuff is cloud-based.
You need a lot of compute power with an extremely low duty cycle. That means that client-side compute is massively underutilized. Server-side compute has much better utilization.
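The duty-cycle argument can be sketched with simple arithmetic. The query counts, durations and the 50% target utilization are hypothetical assumptions, and the calculation ignores peak-hour clustering, which would lower the real number of households a shared GPU could serve.

```python
def utilization(queries_per_day, seconds_per_query):
    """Fraction of the day a dedicated GPU is actually busy for one household."""
    return queries_per_day * seconds_per_query / (24 * 60 * 60)

def households_per_gpu(queries_per_day, seconds_per_query, target_utilization=0.5):
    """How many such households one shared GPU could serve at the target utilization."""
    return int(target_utilization / utilization(queries_per_day, seconds_per_query))

# Hypothetical household: 50 voice queries a day, 5 s of GPU time each.
print(f"home GPU busy: {utilization(50, 5):.2%} of the day")
print(f"households per shared GPU: {households_per_gpu(50, 5)}")
```

A home GPU in this scenario sits idle for well over 99% of the day, while a shared server can spread the same hardware across many users.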
Do you need a discrete GPU or is an integrated GPU sufficient?