#Speech Recognition and TTS


slim walrus
#

TLDR: Vedal, what does the setup for Neuro-sama's speech recognition and TTS look like?
-# MSEdge? OpenAIWhisper? AlltalkV2? Kokoro?

-# I wanted to ask privately, but this seems to be the more appropriate approach.
I am an AI enthusiast, and, admittedly, I have only recently found out about Neuro-sama. Prior to this, I've been pulling my hair out trying to figure out how to run speech recognition and TTS models that work in tandem with the language models I run.
Options I've explored for TTS are mentioned above (excluding Whisper), but the models and engines available either lack speed or quality.

That's why I'd like to ask if you could point me in the direction you took for Neuro-sama. What handles its STT and TTS (assuming those two are handled by third parties and not original engines/models)?

I'd imagine you're incredibly busy, so thank you in advance for taking a moment to read this.

lavish hearth
#

Whisper is not TTS, it's STT. Vedal uses the Azure TTS voice "Ashley" (I think), pitched up by ~3 semitones, for Neuro. I'm not sure about Evil. The STT might be Whisper.
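
For anyone wanting to try the "pitched up by ~3 semitones" part, here is a rough sketch of what that could look like. It is not Vedal's actual setup, which hasn't been confirmed. It computes the frequency ratio for a semitone shift and builds an SSML string using the `prosody` element's `pitch` attribute with a relative semitone value. The voice name `en-US-AshleyNeural` and the helper names are assumptions for illustration; the sketch only builds the SSML text and does not call any speech service.

```python
from xml.sax.saxutils import escape, quoteattr


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def build_ssml(text: str, voice: str, semitones: float) -> str:
    """Build an SSML document that raises the pitch by a relative semitone amount.

    `voice` is whatever voice name your TTS provider expects (assumed here).
    """
    pitch = f"{semitones:+g}st"  # e.g. 3 -> "+3st", -2 -> "-2st"
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f'<prosody pitch="{pitch}">{escape(text)}</prosody>'
        "</voice></speak>"
    )


if __name__ == "__main__":
    # +3 semitones is roughly a 19% increase in fundamental frequency.
    print(round(semitones_to_ratio(3), 3))
    # Hypothetical voice name, used only as a placeholder.
    print(build_ssml("Hello chat!", "en-US-AshleyNeural", 3))
```

The resulting string would then be passed to whatever TTS client you use. Shifting pitch in SSML is one option; pitching the synthesized audio afterwards is another, and it's not known which is used for Neuro.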

#

Also, just from personal experience, I can tell you that running these models is pretty expensive and a bit of a pain.

#

I also doubt Vedal would respond. He is pretty secretive when it comes to this stuff

#

It seems like the STT model picks up on more sounds than Whisper would, though. Whisper is more for transcription, and Neuro can pick up on the "uh, hmmm, huh", etc., which most STT models ignore.

#

so it could be something else

#

If you have a really big GPU you could train a model yourself. It's extremely difficult (and sometimes expensive) but possible.

slim walrus
#

Ah yeah sorry I didn't mean to imply Whisper was TTS. And Azure TTS huh? I guess that means TTS isn't being handled locally?
I'm just now hearing about Azure TTS and it seems like a Microsoft version of ElevenLabs.