#The Coral is not compatible with
@normal hearth said that the coral edgetpu doesn't have nearly enough memory for even the smaller STT whisper models. I guess that is true, but haven't confirmed.
Many of us use a Coral for the Frigate NVR which processes image frames, but I have no idea if the entire object recognition model needs to fit in the memory of the Coral or not.
I don't suppose anyone is working on it.
If the Coral Edge TPUs are a no-go, are there any hardware additions I can buy that would help with STT performance and accuracy?
I currently have a minipc with an i5-4210Y CPU, and whisper seems a bit slower than my previous hermes/rhasspy setup (although it had far fewer sentences to train on, and couldn't do wildcards at all).
A GPU can speed things up. I don't have one but am considering something.
https://github.com/toverainc/willow-inference-server?tab=readme-ov-file#benchmarks
here's how a GPU fares #voice-assistants-archived message
that's an nvidia tesla P4 with a small model and a beam size of 5
This is correct. To start, whisper-tiny (as one example), even when heavily quantized (int8 via ctranslate2, which only supports CPU and CUDA), is 75MB. Even if distil-whisper or other memory-optimized implementations are used (tiny isn't even available ATM), whisper-tiny is still roughly 5x the size of available Coral memory (6-8MB). To make matters worse, the Coral only supports TensorFlow Lite, which doesn't implement a lot of the fundamental operations that are used by Whisper. I don't like saying things are impossible, but this is almost certainly one of those cases. This comes up a lot, and on the surface it seems like it would make sense, but the object classification/recognition vision models used for something like Frigate only need to identify/classify something like 200 different objects. That is a much easier and more memory-conservative task than understanding the full spectrum of human voice, language, grammar, etc. The last time I talked to @magic knot, he talked about potentially taking some alternative speech recognition approaches that would likely use more limited grammars, structure, etc. I think he's also mentioned this on things like the Year of Voice chapter X streams and in various GH issues and conversations here.
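As a rough check on the sizes quoted above, here is a minimal sketch of the arithmetic. The 75MB and 6-8MB figures are taken from the message and are not measured here. For the int8 ctranslate2 build alone the gap works out to roughly 9x to 12x; the "5x" figure appears to refer to a hypothetical, more memory-optimized build.

```python
# Figures quoted in the message above; treat them as rough numbers, not measurements.
WHISPER_TINY_INT8_MB = 75   # int8 whisper-tiny via ctranslate2
CORAL_MEMORY_MB = (6, 8)    # approximate range of available Coral memory


def size_ratio(model_mb: float, device_mb: float) -> float:
    """Return how many times larger the model is than the device memory."""
    return model_mb / device_mb


for device_mb in CORAL_MEMORY_MB:
    ratio = size_ratio(WHISPER_TINY_INT8_MB, device_mb)
    print(f"{WHISPER_TINY_INT8_MB} MB model vs {device_mb} MB of Coral memory: {ratio:.1f}x too large")
```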
We have more up to date and inclusive benchmarks here: https://heywillow.io/components/willow-inference-server/#benchmarks The TLDR is an eight year old Nvidia card (GTX 1070/Tesla P4) that can be purchased used for $100 is 10x faster than my $1k-ish ThreadRipper at a fraction of the cost and power. Challenge here is of course needing the host form-factor to take a dual slot PCIe card. The result is sub-500ms processing for voice command durations (< 3sec) even with the larger models. With Willow we consider small/beam size 2 to be the usable minimum for the kinds of speech, grammar, and conditions used for things like voice assistants in far-field use cases. GPUs are so fast many of our users just go all out and use something like large with beam 5.
One alternative right now is Vosk: https://github.com/rhasspy/wyoming-vosk
It's very fast, but you need to tell it up front what sentences you will speak.
Thank you, that was the definitive explanation I was looking for.
Vosk looks interesting. If I am using English with --allow-unknown, does it basically use the "wildcard" sentence config that is currently working well with whisper?
Or are you saying that the performance is only better in "Limited" mode, which does not work with --allow-unknown at all?
NICE! Definitely a great option!!! Of course with HA voice functionality exploding and growing by leaps and bounds it could get tricky (impossible?) for users that want to do arbitrary commands like "set thermostat to x degrees" or "Spotify play x" I imagine?
Performance is just fine with --allow-unknown, but it's still biased towards recognizing the sentences that you provide. Depending on your situation, this could be a feature or a bug.
In my case, I'm using Vosk for a pipeline that my kids use in the living room. They can turn the TV on and off by saying anything remotely like "turn on TV", but saying things like "draw me a picture of a unicorn" will silently fail with "unknown".
Yeah, it definitely isn't meant for those kinds of situations. With a proper fallback system, I could see Vosk being a fast "first line of defense" for the easy/common phrases and Whisper being a fallback.
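The "first line of defense" idea could look something like the sketch below. This is only an illustration of the control flow: `fake_vosk` and `fake_whisper` are hypothetical stand-ins, not real Vosk or Whisper calls, and nothing here reflects how Home Assistant pipelines are actually wired.

```python
from typing import Callable, Optional

# A recognizer takes raw audio bytes and returns text, or None when it has no match.
Recognizer = Callable[[bytes], Optional[str]]


def transcribe_with_fallback(audio: bytes, fast: Recognizer, slow: Recognizer) -> Optional[str]:
    """Try the fast, limited recognizer first; use the slow one only if it fails."""
    text = fast(audio)
    if text:
        return text
    return slow(audio)


# Hypothetical stand-ins so the sketch runs without any speech models installed.
def fake_vosk(audio: bytes) -> Optional[str]:
    # Pretend the limited recognizer only knows one command.
    return "turn on tv" if audio == b"tv-command" else None


def fake_whisper(audio: bytes) -> Optional[str]:
    # Pretend the open-ended recognizer always produces some transcript.
    return "draw me a picture of a unicorn"


print(transcribe_with_fallback(b"tv-command", fake_vosk, fake_whisper))
print(transcribe_with_fallback(b"something-else", fake_vosk, fake_whisper))
```

The point of the design is that common phrases never pay the cost of the slower model; only the "unknown" cases do.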
Beautiful!
YouTube:
  data:
    - sentences:
        - "(play|watch|listen to) {video} on Youtube"
lists:
  video:
    wildcard: true
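To make the wildcard idea concrete, here is a minimal sketch that turns a sentence template like the one above into a regular expression and pulls out the `{video}` slot. This is a simplification for illustration only; it is not how Home Assistant actually parses sentence templates, and `template_to_regex` is a hypothetical helper.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Convert "(a|b) {slot} tail" into a regex with one named group per {slot}."""
    parts = re.split(r"(\{\w+\})", template)
    pieces = []
    for part in parts:
        if re.fullmatch(r"\{\w+\}", part):
            # A wildcard slot matches any non-empty run of text.
            pieces.append(f"(?P<{part[1:-1]}>.+)")
        else:
            # Alternatives such as (play|watch) are already valid regex syntax.
            pieces.append(part)
    return re.compile("^" + "".join(pieces) + "$", re.IGNORECASE)


pattern = template_to_regex("(play|watch|listen to) {video} on Youtube")

match = pattern.match("play eighties rock on youtube")
if match:
    print(match.group("video"))  # eighties rock
```

Note that a transcript such as "on you tube" (two words) would not match this pattern, which is the kind of mismatch a sentence alternative would have to cover.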
This works very well for me right now. I want to see if Vosk can do it faster.
A quick test seems to work for me. I just left Vosk on the open-ended mode (no limited, no unknown) and it got all of the "play X on YouTube" sentences. It's probably not going to catch certain movie or band names, but not too bad in general.
Will Vosk be available as an HA add-on soon?
OMG, that seems so much faster than faster-whisper
I can see vosk being default at some point
Lol, yeah it is quite a bit faster (with trade-offs, for sure)! For a lot of use cases, I think the "old" tech works just fine. And there are better Vosk models available too; plus, I'd like to get Coqui STT going at some point.
noticing some tradeoffs, like saying numbers results in spelled out words "play eighties rock on you tube"
what's been annoying is that it often mistakes the word "light" with "like" or "late".
I tried following the instructions here: https://alphacephei.com/vosk/adaptation, but they don't seem to change the model that it downloads/caches. Not sure if I'm missing a step, but can't notice any difference (perceived or file hash).
Anyway, I've bandaged this issue with an expansion rule edit.
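The workaround amounts to accepting the common mishearings as alternatives for "light". A minimal sketch of that idea follows; the exact rule used above isn't shown in the thread, so the sentence shape and the names `LIGHT_WORDS` and `parse_light_command` are assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed alternatives: "like" and "late" are the mishearings mentioned above.
LIGHT_WORDS = r"(?:light|like|late)"

# Hypothetical command shape, e.g. "turn on the kitchen light".
PATTERN = re.compile(
    rf"^turn (?P<state>on|off) the (?P<name>.+) {LIGHT_WORDS}$",
    re.IGNORECASE,
)


def parse_light_command(text: str) -> Optional[Tuple[str, str]]:
    """Return (name, state) if the text looks like a light command, else None."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group("name"), match.group("state")


print(parse_light_command("turn on the kitchen like"))
print(parse_light_command("turn off the bedroom light"))
```

The trade-off is that a sentence that really does end in "like" or "late" would now be treated as a light command, so this only makes sense for a narrow set of templates.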
For the add-on, you will need to put your model in /share/vosk/models/<language>/<model_name>. So something like /share/vosk/models/en/my-model. It's supposed to automatically pick it up, but I haven't tested it extensively.
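A quick way to sanity-check the layout described above is to build the expected path and see whether anything is there. This is only a sketch: `model_dir` and `describe` are hypothetical helpers, and it checks the directory's existence, not whether the add-on actually loads the model.

```python
from pathlib import Path


def model_dir(language: str, model_name: str, root: str = "/share/vosk/models") -> Path:
    """Build <root>/<language>/<model_name>, the layout described in the message above."""
    return Path(root) / language / model_name


def describe(path: Path) -> str:
    """Report whether the directory exists and what it contains."""
    if not path.is_dir():
        return f"missing: {path}"
    entries = sorted(p.name for p in path.iterdir())
    return f"found {path} with {len(entries)} entries: {entries}"


# "en" and "my-model" are the example values from the message, not required names.
print(describe(model_dir("en", "my-model")))
```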
yeah, that's what I did. I didn't see any changed files from the built in model from /data
the instructions just say to use test_words.py... but not how to actually change or create the model. It seems like it is supposed to alter the recognizer.cc in realtime. But if it doesn't change anything in the file system, I'm not sure how to see if it changed anything, or how to make it survive reboot
Went back to Whisper, since Vosk has a lot of accuracy issues. I will deal with the slower performance of whisper until I get a better cpu. My mini pc doesn't have any expansion for a GPU and the coral tpu doesn't help with STT.
Is there a recommendation for CPU (cores/threads/clockspeed/boostclock/etc.)?