#Yes, but it is still confused when i am

pulsar sapphire
#

In that scenario, if the first message has for example 51 characters, you immediately get an audio response since it has more characters than the first value in the list (50).

If the first message has 49 characters, then an audio response won't be generated since it's below the threshold of 50. The next message to be sent will then likely generate audio as that'll put it over 50 characters.

Then if you send another message that has 110 characters, it'll once again not generate audio because it's waiting for you to hit the threshold of 120 characters, as defined by that array.

If I send a message with 50 characters, audio is generated; then if I send 70 more characters (50 + 70 = 120), is it generated again? And then do I have to send 40 more to get a new generation?

In this example, the first message (50 characters) will generate audio. The 2nd message (70 characters) won't because the counter resets and it needs to hit 120 characters before audio is generated.

All this is overridden if you send flush=true, which triggers audio generation immediately even if you haven't hit the threshold of characters.
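To make the counting rule concrete, here is a toy Python model of the behaviour described in this message. It is only a sketch, not ElevenLabs' implementation: `simulate_schedule` is a hypothetical helper, the `[50, 120]` schedule is taken from the examples above, and the idea that a flush empties the buffer without moving to the next threshold is an assumption that this thread does not confirm.

```python
from typing import List, Tuple


def simulate_schedule(schedule: List[int], messages: List[Tuple[int, bool]]) -> List[bool]:
    """Toy model: for each (num_chars, flush) message, report whether audio is generated.

    Assumptions (not confirmed by ElevenLabs):
    - characters accumulate in a buffer until the current threshold is reached;
    - reaching the threshold empties the buffer and moves to the next threshold;
    - the last threshold is reused once the schedule is exhausted;
    - flush forces generation and empties the buffer, but keeps the current threshold.
    """
    buffered = 0
    step = 0
    generated = []
    for num_chars, flush in messages:
        buffered += num_chars
        threshold = schedule[min(step, len(schedule) - 1)]
        if buffered >= threshold:
            generated.append(True)
            buffered = 0
            step += 1
        elif flush:
            generated.append(True)
            buffered = 0
        else:
            generated.append(False)
    return generated


# 49 chars (below 50), then 10 more (59 >= 50), then 110 (below 120)
print(simulate_schedule([50, 120], [(49, False), (10, False), (110, False)]))
# [False, True, False]

# 50 chars generates audio; the next 70 do not, because the counter restarted
print(simulate_schedule([50, 120], [(50, False), (70, False)]))
# [True, False]
```

In this model, sending `(10, True)` as the first message returns `[True]`, which matches the statement that flush triggers generation even below the threshold.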

The point of all this is to get faster speech generation at the beginning of a conversation, then ease off as it goes on.

naive night
#

Thanks @pulsar sapphire. But a conversation is a sequence of multiple turns, and what I want is to be faster at the beginning of each turn! Now, a "turn" is a business-logic concept. How would ElevenLabs understand what a turn is? Taking a concrete example: the first turn could be a list [ChunkA_1, ChunkA_2, ChunkA_3] and the second turn another list [ChunkB_1, ChunkB_2, ChunkB_3]. I understand that ElevenLabs sees it as [ChunkA_1, ChunkA_2, ChunkA_3, ChunkB_1, ChunkB_2, ChunkB_3], right?

#

Maybe adding flush=True after ChunkA_3 would reset the counter, and 11labs would start counting characters from zero again when it sees ChunkB_1. Is that correct?
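The per-turn flush idea can be sketched in Python as follows. This only builds the JSON payloads and does not open a connection. It assumes the websocket accepts messages with a `text` field and an optional `flush` field, as the flush discussion in this thread suggests; `turn_messages` is a hypothetical helper, and whether the flush really restarts the character counter for the next turn is exactly the open question here.

```python
import json
from typing import List


def turn_messages(chunks: List[str]) -> List[str]:
    """Build one JSON payload per chunk, setting flush on the last chunk of the turn."""
    payloads = []
    last_index = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        message = {"text": chunk}
        if index == last_index:
            # Assumption: flush forces generation of whatever is still buffered.
            message["flush"] = True
        payloads.append(json.dumps(message))
    return payloads


turn_a = turn_messages(["ChunkA_1 ", "ChunkA_2 ", "ChunkA_3 "])
turn_b = turn_messages(["ChunkB_1 ", "ChunkB_2 ", "ChunkB_3 "])

# Each payload would be sent over the same long-lived websocket, in order.
for payload in turn_a + turn_b:
    print(payload)
```

Only the last payload of each turn carries the flush flag, so the server side still sees one flat sequence of chunks, with a forced generation at each turn boundary.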

naive night
#

I have re-read your answer, and you definitely answered my next messages.

naive night
#

As a side note, I now understand what I need and what 11labs does not provide: I would need a flag like isFlushed=True, set on the last audio chunk generated in response to flush=True. This would let me keep a long-lived websocket session and still know when I have received the last audio output for each turn (provided that I send flush=True at the end of each turn). I try to track alignment characters, but sometimes the generated output is not exactly the expected output, so I have to add some heuristics, and when there are heuristics, there is always a possible problem of mismatch and error.
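A minimal Python sketch of that kind of heuristic, to show why it is fragile. Everything here is hypothetical: `TurnTracker`, its method names, and the 0.9 tolerance are illustrative only, and the sketch assumes each audio response comes with a list of aligned characters whose count can be compared with the characters sent.

```python
from typing import List


class TurnTracker:
    """Guess when a turn's audio is complete by comparing character counts."""

    def __init__(self, tolerance: float = 0.9) -> None:
        # Assumption: the aligned output may differ slightly from the input text,
        # so the turn is treated as complete once most characters have come back.
        self.tolerance = tolerance
        self.sent = 0
        self.received = 0

    def on_text_sent(self, text: str) -> None:
        self.sent += len(text)

    def on_alignment(self, chars: List[str]) -> None:
        self.received += len(chars)

    def turn_complete(self) -> bool:
        return self.sent > 0 and self.received >= self.tolerance * self.sent

    def reset(self) -> None:
        self.sent = 0
        self.received = 0


tracker = TurnTracker()
tracker.on_text_sent("Hello world")
tracker.on_alignment(list("Hello"))
print(tracker.turn_complete())  # False
tracker.on_alignment(list(" worl"))
print(tracker.turn_complete())  # True
```

The second check already returns True although one character is still missing, which is the mismatch risk described above: a tolerance that is too loose ends the turn early, and one that is too strict may never end it.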

naive night
#

To better understand the whole strategy, I have done some testing. For each of the 5 tests, I print the schedule strategy used, the commands sent to the websocket, and the results obtained with timing:

#

First, I notice that even though I set 50 as the first value of the schedule strategy (first test), the first audio buffer contains only 9 characters. I would expect the first audio chunks to add up to 50 characters, but no! @pulsar sapphire, would you have an explanation for that? Thanks.