#What are you building, if I may ask
We've got a platform for automating various healthcare processes, like ordering prescription refills or booking visits, and we're using a mix of different STTs
A good, human-sounding STT voice is a gamechanger
You mean TTS? 😉
Btw, what are you using for Speech-to-Text? We're using Azure. I've tried Whisper, but inference is rather slow, even on a machine with a GPU. I mean, it's not super slow, but it's not faster than Google or Azure. What about you?
yeah, of course I meant TTS, for STT we use a custom solution
and yeah we've found that Whisper is too slow for realtime, though it has some interesting features like automatic translation
What latency do you get on your STT? And do you wait for the end of a sentence, or do you transcribe and use the text mid-sentence?
I'm thinking a lot about this problem, because waiting for the end of the sentence delays everything downstream -> bigger latency overall.
We use interim results if that's what you mean. We need it for the interpretation to happen as fast as possible, which includes operating on incomplete text as it's being uttered.
And then refining the interpretation as more text becomes available
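To make the "interpret incomplete text, then refine" idea concrete, here is a minimal sketch in Python. It is not the custom solution described above: the keyword table, `guess_intent`, and `track_intent` are hypothetical names invented for illustration, and a real system would likely use a proper NLU model instead of keyword matching.

```python
# Hypothetical intents and keywords, loosely based on the use cases
# mentioned earlier in the thread (prescription refills, booking visits).
INTENT_KEYWORDS = {
    "refill_prescription": {"refill", "prescription"},
    "book_visit": {"book", "appointment", "visit"},
}


def guess_intent(partial_text):
    """Return the intent with the most keyword hits, or None if nothing matches.

    Naive whitespace tokenization: punctuation attached to a word
    (e.g. "visit.") will prevent a match.
    """
    words = set(partial_text.lower().split())
    best, best_hits = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best


def track_intent(interim_results):
    """Re-run the guess on every interim result; yield only when it changes."""
    current = None
    for text in interim_results:
        guess = guess_intent(text)
        if guess != current:
            current = guess
            yield text, guess


if __name__ == "__main__":
    stream = [
        "I would",
        "I would like to",
        "I would like to book",
        "I would like to book a visit",
    ]
    for text, intent in track_intent(stream):
        print(f"{intent!r} after hearing: {text!r}")
```

The point of the sketch is that downstream logic can react as soon as the guess changes, well before the speaker finishes, and simply overwrite the guess if a later interim result points elsewhere.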
Yeah, I'm trying to do the same. Once I detect that the sentence is finished, I push the text through our pipeline. That's my way of handling interim results, but I'm not sure it's the best one. How do you know that an interim result is OK and won't change? I'm using Azure STT at the moment.
haven't tried azure, but we simply return a flag that indicates if transcription for a particular fragment of audio is final or not, and when it's final, move on to the next fragment
that's similar to what google STT does
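A rough sketch of how a consumer might handle that final/interim flag, assuming each STT update arrives as a `(text, is_final)` pair for the current audio fragment. The `Transcript` class and its field names are made up for this example and are not taken from any particular STT SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Keeps finalized fragments separate from the still-changing one."""

    committed: list = field(default_factory=list)  # finalized fragments, in order
    pending: str = ""                              # latest interim text, may change

    def update(self, text: str, is_final: bool) -> None:
        if is_final:
            # The fragment will not change anymore: commit it and move on
            # to the next fragment.
            self.committed.append(text)
            self.pending = ""
        else:
            # Interim result: replace the previous guess for this fragment.
            self.pending = text

    def text(self) -> str:
        """Best current transcript: committed fragments plus the interim tail."""
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)


if __name__ == "__main__":
    t = Transcript()
    t.update("I need", is_final=False)
    t.update("I need a refill", is_final=False)
    t.update("I need a refill.", is_final=True)
    t.update("for my", is_final=False)
    print(t.text())
```

Keeping `pending` separate means anything built on `committed` never has to be undone, while anything built on `pending` has to tolerate being revised.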
Hmm... but do you wait until the user stops speaking? Because that's what Google does with the is_final flag. Or do you just detect the end of a sentence, or notice that some part of the text has stopped changing and treat that part as final?
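The last option, treating text that has stopped changing as final, could be sketched like this. This is only an illustration of the idea under stated assumptions, not how Google, Azure, or the custom STT above decide finality: `StablePrefix` and `required_repeats` are hypothetical names, and the rule used here is simply "a word prefix shared by the last N interim results counts as stable".

```python
def common_prefix_len(a, b):
    """Number of leading items that two word lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class StablePrefix:
    """Treats words as stable once they survive N consecutive interim results."""

    def __init__(self, required_repeats=3):
        self.required = required_repeats
        self.history = []  # the most recent hypotheses, as word lists
        self.stable = []   # words considered final so far

    def update(self, interim):
        words = interim.split()
        self.history.append(words)
        self.history = self.history[-self.required:]
        if len(self.history) < self.required:
            return self.stable
        # Longest prefix shared by all hypotheses in the window.
        n = len(self.history[0])
        for other in self.history[1:]:
            n = min(n, common_prefix_len(self.history[0], other))
        # The stable part only grows. If the recognizer later rewrites an
        # already-stable word, this sketch keeps the old one (a trade-off).
        if n > len(self.stable):
            self.stable = self.history[0][:n]
        return self.stable


if __name__ == "__main__":
    tracker = StablePrefix(required_repeats=2)
    for hyp in ["I", "I need", "I need a", "I need the", "I need the refill"]:
        print(hyp, "->", tracker.update(hyp))
```

A larger `required_repeats` makes the stable part more trustworthy but adds latency, which is the same trade-off as waiting for the end of the sentence, just at a finer granularity.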