What are you building, if I may ask | ElevenLabs

errant nimbus Jan 30, 2023, 2:15 PM

#

I'm targeting Video Calls, so similar use case.

twin oar Jan 30, 2023, 2:18 PM

#

We've got a platform for automating various healthcare processes, like ordering prescription refills or booking visits, and we're using a mix of different STTs

#

A good, human-sounding STT voice is a gamechanger

errant nimbus Jan 30, 2023, 8:44 PM

#

You mean TTS? 😉

#

Btw, what are you using for Speech-to-Text? We're using Azure. I've tried Whisper, but inference is rather slow, even on a machine with a GPU. I mean - it's not super slow, but it's not faster than Google or Azure. What about you?

twin oar Jan 30, 2023, 9:45 PM

#

yeah, of course I meant TTS, for STT we use a custom solution

#

and yeah we've found that Whisper is too slow for realtime, though it has some interesting features like automatic translation

errant nimbus Jan 30, 2023, 9:59 PM

#

What latency do you get on your STT? And do you wait for the end of a sentence, or do you transcribe and use the text mid-sentence?

#

I'm thinking a lot about this problem, because waiting for the end of the sentence means a longer wait -> higher latency overall.

twin oar Jan 30, 2023, 10:57 PM

#

We use interim results if that's what you mean. We need it for the interpretation to happen as fast as possible, which includes operating on incomplete text as it's being uttered.

#

And then refining it as more text becomes available

errant nimbus Jan 31, 2023, 12:14 AM

#

Yeah, I'm trying to do the same. Once I detect that the sentence is finished, I push the text through our pipeline. That's my way of handling interim results, but I'm not sure it's the best one. How do you know that an interim result is OK and will not change? I'm using Azure STT at the moment.

twin oar Jan 31, 2023, 12:16 AM

#

haven't tried azure, but we simply return a flag that indicates if transcription for a particular fragment of audio is final or not, and when it's final, move on to the next fragment

#

that's similar to what google STT does
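
A minimal sketch of the pattern described above, assuming the STT returns a stream of (text, is_final) results per audio fragment. The `Result` type and the `consume` function are hypothetical names for illustration, not any vendor's API. Interim text is passed downstream right away so interpretation can start early, and a fragment is only committed once it is flagged final.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Result:
    """One transcription result for the current audio fragment (hypothetical shape)."""
    text: str
    is_final: bool


def consume(results: Iterable[Result]) -> Iterator[Tuple[str, str]]:
    """Yield ("interim" | "final", transcript so far) for every incoming result."""
    committed: List[str] = []  # fragments already flagged final; these never change
    for r in results:
        if r.is_final:
            # The fragment is done: commit it and move on to the next fragment.
            committed.append(r.text)
            yield ("final", " ".join(committed))
        else:
            # Interim hypothesis: usable immediately, but may still be revised.
            yield ("interim", " ".join(committed + [r.text]))


stream = [
    Result("book a", False),
    Result("book a visit", True),
    Result("for", False),
    Result("for monday", True),
]
events = list(consume(stream))
```

Downstream code can act on "interim" events speculatively and treat "final" events as the point where the text for that fragment is settled.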

errant nimbus Jan 31, 2023, 1:15 AM

#

Hmm... but do you wait until the user stops speaking? Because that's what Google does with the is_final flag. Or do you just detect the end of a sentence, or do you see that some part of the text has stopped changing and consider it final?
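
The last option in the question above (treating text that has stopped changing as final) could be sketched roughly as follows. This is only one possible heuristic, not something either participant confirms using: `stable_prefix` and its `window` parameter are hypothetical, and the idea is to take the words that are identical across the last few interim hypotheses.

```python
from typing import List, Sequence


def stable_prefix(hypotheses: Sequence[str], window: int = 3) -> List[str]:
    """Return the leading words shared by the last `window` interim hypotheses.

    Words in this prefix have not changed recently, so they can be treated
    as provisionally final without waiting for the speaker to stop.
    """
    if len(hypotheses) < window:
        return []  # not enough history yet to call anything stable
    recent = [h.split() for h in hypotheses[-window:]]
    prefix: List[str] = []
    # zip stops at the shortest hypothesis, so we never index past its end.
    for words in zip(*recent):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break  # first disagreement ends the stable region
    return prefix


history = ["I want", "I want to", "I want to book", "I want two book a"]
stable = stable_prefix(history, window=3)
```

A larger `window` makes the prefix more trustworthy but adds delay, which is the same latency trade-off discussed earlier in the thread.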
