Performance tips for Text to speech (custom voices) | ElevenLabs

hidden bluff Jan 6, 2024, 4:40 PM

While eleven_turbo_v2 is basically real time, I've found it doesn't support custom voices with the style attribute. So we use eleven_multilingual_v2 for our custom voices (created in VoiceLab).

Are there performance tips for speeding up custom voice generation?

What we've tried:

Streaming API (/v1/text-to-speech/{voice_id}/stream) - this didn't work as expected. It generates most of the audio before starting the stream.
Websocket API (/v1/text-to-speech/{voice_id}/stream-input) - this starts sending audio much faster, but you can't expect it to be "real time", as the audio isn't guaranteed to be sent fast enough, so the user could get audio lag/skipping, which is not a great UX.
Reducing API latency (helpdesk article) - setting optimize_streaming_latency helps some, but doesn't quite get us to "real time" yet.
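
For reference, a minimal sketch of how the streaming request with optimize_streaming_latency might be assembled. The endpoint path, the optimize_streaming_latency parameter, and the model name come from this post; the base URL, the xi-api-key header, the body field names, and the default values are assumptions here and should be checked against the current API reference. The function only builds the request and sends nothing.

```python
from urllib.parse import urlencode

# Assumed base URL; confirm against the official API reference.
API_BASE = "https://api.elevenlabs.io"


def build_stream_request(voice_id, text, api_key, latency_level=3, style=0.0):
    """Return (url, headers, body) for a streaming text-to-speech call."""
    # optimize_streaming_latency is passed as a query parameter.
    query = urlencode({"optimize_streaming_latency": latency_level})
    url = f"{API_BASE}/v1/text-to-speech/{voice_id}/stream?{query}"
    # Header name is an assumption in this sketch.
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        # Illustrative settings; non-zero style reportedly adds latency.
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
            "style": style,
        },
    }
    return url, headers, body
```

The returned values can be passed to any HTTP client that supports streamed responses, reading the audio in chunks as it arrives.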

What we're planning to try:

Divide up our text into sentences and parallelize the calls to see if we can get the full audio back faster, and stitch it together after the fact.
Open the websocket connection as early as possible and start streaming text to it before we have our full transcript ready.
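
The sentence-splitting idea could be sketched roughly as below. The synthesize argument is a hypothetical callable (text in, audio bytes out) standing in for whatever API call is used, and the regex splitter is deliberately naive. Whether plain byte concatenation gives clean audio depends on the output format, so treat the stitching step as an assumption to verify.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_parallel(text, synthesize, max_workers=4):
    """Synthesize each sentence concurrently and stitch the audio in order.

    `synthesize` is a hypothetical callable: str -> bytes.
    """
    sentences = split_sentences(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so the stitched audio
        # keeps the original sentence order even if calls finish out of order.
        chunks = list(pool.map(synthesize, sentences))
    return b"".join(chunks)
```

One trade-off to keep in mind: each sentence is generated without the context of its neighbours, so intonation may not carry over between sentences.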

Anything else we could try to get closer to "real time" for custom voices? Willing to try literally anything 🙂

Loving the product, thanks!

patent yarrow Jan 6, 2024, 6:59 PM

you've pretty much hit on all the possible latency optimizations - I tend to vary between the "split into sentences and parallelize" and "open websocket connection early" approaches depending on the situation

but no, there isn't much to be done beyond that - multilingual v2 is just very slow

especially if you enable style

if all you need is english, you could try out eleven_english_v2 if you have the alpha access

that one is fast enough to be realtime even with style on
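
For the "open websocket connection early" approach mentioned above, a rough sketch of the message framing, so text can be forwarded as soon as each piece of the transcript is ready. The message shape is an assumption based on the stream-input protocol as I understand it (an initial message with a single space, text chunks ending in a space, and an empty string to close); the xi_api_key field name is likewise assumed. This only produces the JSON strings; sending them over a websocket client is left out.

```python
import json


def stream_input_messages(text_chunks, api_key):
    """Yield JSON messages for a stream-input websocket session.

    `text_chunks` can be any iterable, including a generator that
    produces text while the transcript is still being written.
    """
    # Opening message (assumed shape): a single space starts the session.
    yield json.dumps({"text": " ", "xi_api_key": api_key})
    for chunk in text_chunks:
        if not chunk:
            # Skip empty chunks: an empty string is the end-of-input signal.
            continue
        # Chunks end with a space so words are not glued together.
        text = chunk if chunk.endswith(" ") else chunk + " "
        yield json.dumps({"text": text})
    # Empty string tells the server no more text is coming.
    yield json.dumps({"text": ""})
```

Because the function is a generator, the connection can be opened and the first message sent before any transcript text exists.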

polar pier Jan 6, 2024, 10:48 PM

I'm also trying to make this as performant as possible. Does changing the output format on a TTS request reduce latency? E.g. mp3_44100_128 -> mp3_44100_64?

patent yarrow Jan 6, 2024, 11:01 PM

I've tested this - no

I did like, 25 runs, and the difference between formats was less than the run-to-run variance
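
A comparison like this could be sketched as follows. This is not the actual test that was run, just one way to check whether a difference between two settings is smaller than the run-to-run noise. The function names and the "difference below the larger standard deviation" rule are assumptions of this sketch, and fn is a hypothetical callable that performs one request.

```python
import time
from statistics import mean, stdev


def measure(fn, runs=25):
    """Time `fn` (a hypothetical zero-argument request call) `runs` times."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def within_noise(latencies_a, latencies_b):
    """True if the gap between the means is smaller than the run-to-run spread."""
    diff = abs(mean(latencies_a) - mean(latencies_b))
    # Use the larger sample standard deviation as the noise estimate.
    noise = max(stdev(latencies_a), stdev(latencies_b))
    return diff < noise
```

If within_noise returns True for two output formats, the format change is not measurably affecting latency at that sample size.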

hidden bluff Jan 7, 2024, 1:27 PM

this is great info @patent yarrow thanks!

will try out eleven_english_v2 -- is there any info on getting alpha access? hadn't heard of it yet afaik 🙏

patent yarrow Jan 7, 2024, 1:40 PM

I believe this is the URL: https://elevenlabs.io/request-alpha-access
but I don't know if it's still open for new signups
