#Sentence correction for any Wyoming STT using a proxy

strong latch
#

Hi,

I’m working on creating a Wyoming service that acts as a proxy between Home Assistant and an STT service. My idea is that this proxy will apply sentence correction logic similar to what Wyoming Vosk provides, so that this feature can be available for any STT service compatible with the Wyoming protocol.

In simple terms, the proxy works as follows:

It exposes a Wyoming endpoint (e.g., 10301) and connects to an STT service on another port (e.g., 10300).

Home Assistant sends audio events to the proxy at 10301; the proxy transparently forwards them to 10300.

The STT responds with a transcription on 10300.

The proxy receives the transcription and applies sentence correction logic.

Then it returns the corrected sentence to Home Assistant.

Additionally, the proxy also handles the Describe event so it can function as a fully transparent proxy.
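
The flow above boils down to a single routing decision: every event passes through untouched except the transcript, whose text is corrected on the way back. Below is a simplified illustration of that decision, not the actual proxy code. Events are shown as plain dictionaries, the Wyoming wire framing and socket handling are left out, and `route_event` and `fix_sentence` are hypothetical names chosen for the example.

```python
from typing import Callable

# Wyoming events carry a "type" and an optional "data" dictionary.
# The STT result arrives as a "transcript" event whose data holds the "text".
TRANSCRIPT_TYPE = "transcript"


def route_event(event: dict, correct: Callable[[str], str]) -> dict:
    """Return the event to forward; only transcript text is rewritten."""
    if event.get("type") != TRANSCRIPT_TYPE:
        # Audio events, describe/info, etc. are forwarded unchanged.
        return event

    data = dict(event.get("data") or {})
    data["text"] = correct(data.get("text", ""))
    return {**event, "data": data}


def fix_sentence(text: str) -> str:
    """Hypothetical stand-in for the real correction logic."""
    return text.replace("kitchen lite", "kitchen light")


corrected = route_event(
    {"type": "transcript", "data": {"text": "turn on the kitchen lite"}},
    fix_sentence,
)
untouched = route_event({"type": "audio-chunk", "data": {"rate": 16000}}, fix_sentence)
print(corrected["data"]["text"])
```

Keeping the correction behind a plain `str -> str` callable means the correction method can be swapped without touching the forwarding logic.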

Regarding this, I have two questions:

Is my approach of adding sentence correction as a proxy acceptable or reasonable? This way, I can add sentence correction to any Wyoming-compatible STT service (it already works with wyoming-faster-whisper). However, this wouldn’t work for STT services integrated directly into Home Assistant, such as Google Cloud or Nabu Casa’s subscription-based STT. Ideally, sentence correction would be part of the Assist pipeline in Home Assistant, with correction services implemented via Wyoming, but as far as I know, this doesn’t exist.
Or is there a better way?

In my proof of concept, I’m using the sentence correction code from Wyoming Vosk (I literally copied the sentences.py file).
@clear python, would it be okay if I use your code for my proxy?

By the way, I’m not a Python developer, so my code isn’t pretty or high quality—just functional for now. If I have Mike’s permission to use the Vosk sentence corrector, I could release an initial version in the next few days.

EDIT:

A first, more or less working, version is here:

https://github.com/Cheerpipe/wyoming_rapidfuzz_proxy

Thanks in advance!

clear python
#

This is an interesting idea! You're welcome to use the code for anything you'd like 🙂
Some improvements to the process could definitely be made. First off, pulling the user's exposed device/area/floor names directly from HA would be great (like Speech-to-Phrase does: https://github.com/OHF-Voice/speech-to-phrase/blob/main/speech_to_phrase/hass_api.py). This would also pull in sentence triggers.
As other folks have mentioned elsewhere, correcting sentences at the text level works well for some languages but not for others. Operating at the phoneme level would be best, but is much harder.
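
As a rough illustration of what text-level correction means here: build a list of candidate sentences from the user's names, then snap the transcript to the closest candidate if it is similar enough. This sketch is not the Wyoming Vosk code. It uses the standard library's `difflib` as a stand-in for a fuzzy matcher such as rapidfuzz, and the device names, templates, and cutoff are made up for the example.

```python
import difflib

# Hypothetical names; a real setup would fetch the exposed ones from Home Assistant.
DEVICE_NAMES = ["kitchen light", "bedroom fan", "garage door"]
TEMPLATES = ["turn on the {name}", "turn off the {name}"]

# Every template expanded with every name gives the list of known sentences.
CANDIDATES = [t.format(name=n) for t in TEMPLATES for n in DEVICE_NAMES]


def correct_sentence(text: str, cutoff: float = 0.8) -> str:
    """Return the closest known sentence, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(text.lower(), CANDIDATES, n=1, cutoff=cutoff)
    return matches[0] if matches else text


print(correct_sentence("turn on the kitchen lite"))
print(correct_sentence("what is the weather tomorrow"))
```

The cutoff decides how aggressively transcripts are rewritten: too low and unrelated sentences get forced into a command, too high and small recognition errors slip through.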

#

For Whisper specifically, I've been working on biased decoding: https://github.com/OHF-Voice/whisper-bidec
This modifies the probabilities of tokens coming out of the Whisper decoder to align with a set of sentences. The plan is to use the sentences from Speech-to-Phrase with the user's device/area/floor names, so that Whisper will more likely recognize HA commands with that user's things. The nice thing is that sentences outside of that command list can also still be recognized.
The downside to the biased decoding is that I can't use faster-whisper anymore; I have to use HuggingFace's transformers library.
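
To make the idea of biasing token probabilities concrete, here is a toy sketch. It is not how whisper-bidec is implemented: real Whisper decoding works on subword tokens and decoder logits, while this example uses whole words, made-up scores, and a hypothetical `bonus` value. It only shows the principle that tokens continuing a known sentence get a boost, while other tokens can still win when their score is high enough.

```python
# Hypothetical command list, already split into word-level tokens.
SENTENCES = [
    ["turn", "on", "the", "kitchen", "light"],
    ["turn", "off", "the", "kitchen", "light"],
]


def allowed_next(prefix: list[str]) -> set[str]:
    """Tokens that would keep `prefix` on track toward a known sentence."""
    n = len(prefix)
    return {s[n] for s in SENTENCES if len(s) > n and s[:n] == prefix}


def biased_greedy(step_logits: list[dict[str, float]], bonus: float = 2.0) -> list[str]:
    """Greedy decoding where tokens continuing a known sentence get `bonus` added."""
    output: list[str] = []
    for logits in step_logits:
        favoured = allowed_next(output)
        scores = {
            tok: score + (bonus if tok in favoured else 0.0)
            for tok, score in logits.items()
        }
        output.append(max(scores, key=scores.get))
    return output


# Made-up scores in which the acoustically "best" token is often the wrong one.
command_steps = [
    {"turn": 3.0, "burn": 2.5},
    {"on": 2.0, "in": 2.4},
    {"the": 4.0, "a": 1.0},
    {"kitchen": 3.0, "chicken": 3.5},
    {"light": 2.0, "lite": 2.8},
]
# A sentence outside the command list with a confident first token.
other_steps = [{"what": 5.0, "turn": 1.0}, {"time": 3.0, "on": 0.5}]

print(biased_greedy(command_steps, bonus=0.0))
print(biased_greedy(command_steps))
print(biased_greedy(other_steps))
```

With `bonus=0.0` the decoder follows the raw scores; with the bonus it lands on the known command, and the out-of-list sentence is still decoded as is.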

strong latch
#

Hi Mike, it's exciting to receive such a positive response from the mind behind the voice infrastructure.
I'm learning Python in the process, so I believe this will be an adventure in which, hopefully, my idea will grow and improve over time.
As you said, I’d like to eventually fetch entities, areas, and floors directly from HASS. I think I’ll tackle that once I have this containerized and tested, because things will surely break along the way.
For now, I'm using my proxy with Azure Speech Services with excellent results. I'm subscribed to Nabu Casa, so I have access to Nabu Casa's STT, but since it isn't a Wyoming service, I can't use my proxy with it (though I would love to).

The truth is, my project doesn’t add anything new or additional to what you’ve already built for Assist. However, I really like my approach of separating phrase correction from the STT itself, allowing different kinds of correction methods to be used with whichever speech-to-text service each person prefers.

In audio processing pipelines, it’s common to see effects applied in series — maybe that could be a future direction for Assist, where different text processing components could be added in a chain.
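
Something like the following is what I have in mind for chaining, just as a sketch. Each stage takes text and returns text, so stages can be added or reordered freely. The `chain` helper and the two example stages are hypothetical and not part of Assist or my proxy.

```python
from functools import reduce
from typing import Callable

TextProcessor = Callable[[str], str]


def chain(*processors: TextProcessor) -> TextProcessor:
    """Combine processors so each one receives the previous one's output."""
    return lambda text: reduce(lambda acc, proc: proc(acc), processors, text)


# Hypothetical stages; each could be a separate service in a real pipeline.
def normalize(text: str) -> str:
    """Lowercase and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def strip_filler(text: str) -> str:
    """Drop a couple of filler words."""
    return " ".join(w for w in text.split() if w not in {"um", "uh"})


pipeline = chain(normalize, strip_filler)
print(pipeline("  Um turn ON the   kitchen light "))
```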

Now that I have your approval, I’ll start working on a first version to publish.

Excited to see what the future holds for Assist. Thanks for all the amazing work.