#STP a less capable "drop in replacement" for Whisper?


plucky cape
#

Morning everyone, this might be an obvious one, but I wasn't able to fully confirm it from the documentation I found: does the speech-to-phrase service work exactly the same way as e.g. Whisper when it comes to input/output, just with a limited detection range? Audio in, transcribed text back?
And related to that: when both STP and Whisper are installed in HA directly (not as a standalone container on a separate machine), do they then have different endpoint paths (?) or ports so that they can be addressed separately?

noble pond
#

First off, they can both indeed exist on the same machine and be added as different Wyoming integrations.

As to the differences between how they work:

Whisper will transcribe any and all speech into text (even hallucinating speech from silence). You will get a full copy of the text sent to your conversation agent, even if it doesn't match an intent phrase.

STP will try to match speech to a given set of phrases. It will not provide arbitrary text that does not match one of your phrases.

In practice, the main difference is that STP won't be useful with an LLM: you cannot make unstructured queries, because it only returns text for speech that matches one of the known phrases.
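
To make the contrast concrete, here is a toy sketch in Python. It is not how either service is implemented: both functions take already-recognised text instead of audio, the phrase list is made up, and `difflib` fuzzy matching only stands in for whatever constrained recognition STP really does. Returning an empty string for a non-match is also just an assumption for the sketch.

```python
import difflib

# Hypothetical phrase list for illustration only; a real STP setup
# gets its phrases from Home Assistant, not from a hard-coded list.
KNOWN_PHRASES = [
    "turn on the kitchen light",
    "turn off the kitchen light",
    "what is the temperature in the living room",
]


def open_transcribe(heard: str) -> str:
    """Whisper-like behaviour: pass through whatever was heard."""
    return heard


def phrase_transcribe(heard: str, cutoff: float = 0.8) -> str:
    """STP-like behaviour: return the closest known phrase, or nothing.

    The empty-string fallback is an assumption made for this sketch.
    """
    matches = difflib.get_close_matches(
        heard.lower(), KNOWN_PHRASES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else ""


if __name__ == "__main__":
    for heard in ("turn on the kitchen lights", "tell me a joke about penguins"):
        print(repr(open_transcribe(heard)), "vs", repr(phrase_transcribe(heard)))
```

The first input is close enough to a known phrase to be snapped onto it, while the second (the kind of free-form request you would hand to an LLM) yields no text at all from the phrase-constrained function.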

plucky cape
#

@noble pond, sorry for the late reaction, somehow I had missed the notification for your answer. Thanks a lot so far! 🙂 But what I'm still unsure about: is speech-to-phrase a service category/type of its own from a Wyoming perspective, similar to how e.g. TTS is also a separate service type? Or does Wyoming internally "see" and treat STP as "just another" STT option?

#

but the configuration for the whisper host is basically just the ip and port of the wyoming server (which in my case would be my HA server).

#

so I was kind of hoping that when I only run STP on my HA server as part of wyoming, that I'd be able to use STP as a less powerful STT service

noble pond
#

Ah. Yes. It’s the same thing as far as HASS is concerned. Accepts audio, returns text. It’s just a difference in how it transcribes.

#

That said, I’m not sure the Rhasspy STP will be useful for anything other than Home Assistant as the sentences it’s trained on are for Home Assistant intents.

Unless you provide a set of phrases to train against that match what your Pebble is expecting, it won’t work out.