While eleven_turbo_v2 is basically real time, I've found it doesn't support custom voices with the style attribute. So we use eleven_multilingual_v2 for our custom voices (created in VoiceLab).
Are there performance tips for speeding up custom voice generation?
What we've tried:
- Streaming API (/v1/text-to-speech/{voice_id}/stream) - this didn't work as expected. It generates most of the audio before starting the stream.
- Websocket API (/v1/text-to-speech/{voice_id}/stream-input) - this starts sending audio much faster, but you can't expect it to be "real time", as the audio isn't guaranteed to be sent fast enough, so the user could get audio lag/skipping, which is not a great UX.
- Reducing API latency (helpdesk article) - setting optimize_streaming_latency helps some, but doesn't quite get us to "real time" yet.
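For reference, here is roughly how we set the latency option on the streaming endpoint. This is a sketch, not our exact code. It assumes the base URL is https://api.elevenlabs.io, that optimize_streaming_latency is passed as a query parameter taking an integer from 0 to 4, and that the model is chosen with model_id in the JSON body. The helper name build_stream_request is made up for illustration. It only builds the request and does not send it.

```python
from urllib.parse import urlencode

API_BASE = "https://api.elevenlabs.io"  # assumed base URL


def build_stream_request(voice_id: str, text: str, api_key: str,
                         latency_level: int = 3) -> dict:
    """Describe (but do not send) a streaming text-to-speech request.

    Hypothetical helper: the 0-4 range for the latency level and the
    request layout are assumptions, so check them against the API docs.
    """
    if not 0 <= latency_level <= 4:
        raise ValueError("latency_level is assumed to range from 0 to 4")
    # The latency option goes in the query string, not the JSON body.
    query = urlencode({"optimize_streaming_latency": latency_level})
    return {
        "method": "POST",
        "url": f"{API_BASE}/v1/text-to-speech/{voice_id}/stream?{query}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {"text": text, "model_id": "eleven_multilingual_v2"},
    }
```

The returned dict can be handed to whatever HTTP client you use, with the response read in chunks so playback can start before the full audio arrives.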
What we're planning to try:
- Divide up our text into sentences and parallelize the calls to see if we can get the full audio back faster, and stitch it together after the fact.
- Open the websocket connection as early as possible and start streaming text to it before we have our full transcript ready.
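A rough sketch of what we have in mind for the sentence-splitting idea. The synthesize argument is a placeholder for whatever function actually calls the text-to-speech API and returns audio bytes for one sentence. Here it is stubbed out with fake_synthesize so the sketch runs without network access. The regex splitter is deliberately naive (it will mis-split abbreviations like "Dr."). Joining the raw bytes is also an assumption: depending on the output format, the chunks may need to be decoded and stitched properly.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_parallel(text: str,
                        synthesize: Callable[[str], bytes],
                        max_workers: int = 4) -> bytes:
    """Synthesize each sentence concurrently and join results in order."""
    sentences = split_sentences(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even if later
        # sentences finish first, so the audio stays in sequence.
        chunks = list(pool.map(synthesize, sentences))
    # Assumption: simple byte concatenation is acceptable for the format.
    return b"".join(chunks)


def fake_synthesize(sentence: str) -> bytes:
    """Stand-in for a real TTS call (hypothetical); returns tagged bytes."""
    return f"[{sentence}]".encode("utf-8")
```

One thing we expect to watch for: each sentence is generated without the context of its neighbours, so intonation may not carry across sentence boundaries, and the number of parallel calls is probably bounded by the account's concurrency limit.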
Anything else we could try to get closer to "real time" for custom voices? Willing to try literally anything 🙂
Loving the product, thanks!