#I don’t think I understand what you are
Let's see if I can explain it better. Each conversation is a series of turns, and each turn basically consists of (1) getting the transcript, (2) calling the LLM, and (3) doing two things in parallel: (3-a) streaming the LLM output to the 11labs websocket (producer) and (3-b) listening on the websocket for the synthesized audio frames (consumer). How do I know for certain that I have received all the synthesized audio frames for a turn, so that I can close the consumer for that turn? If I understand correctly, I cannot send the EOS message from the producer at the end of each turn, because it would close the websocket and prevent me from using it for the rest of the session. Nor can I check whether the final synthesized speech has the same characters as the text sent by the producer, because I have observed that they sometimes differ (some punctuation or spaces are added in the 11labs result).
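The turn structure described above (step 3: a producer streaming LLM chunks and a consumer reading audio frames, run in parallel) can be sketched with `asyncio.gather`. This is a minimal offline sketch: `FakeTTSSocket` is a hypothetical in-memory stand-in for the real websocket, its "audio frames" are plain strings rather than base64 audio, and it answers an empty-string EOS with `isFinal`, mirroring the behaviour described in this thread. It makes the dependency visible: the consumer only has a reason to stop because the producer sent EOS.

```python
import asyncio
import json


class FakeTTSSocket:
    """Hypothetical stand-in for the 11labs TTS websocket (offline, in-memory).

    Returns one fake audio frame per text chunk and answers the empty-string
    EOS message with {"isFinal": true}, as described in the thread.
    """

    def __init__(self):
        self._out = asyncio.Queue()

    async def send(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg["text"] == "":
            await self._out.put(json.dumps({"audio": None, "isFinal": True}))
        else:
            frame = f"<{msg['text'].strip()}>"
            await self._out.put(json.dumps({"audio": frame, "isFinal": False}))

    async def recv(self) -> str:
        return await self._out.get()


async def producer(ws, llm_chunks):
    # Producer: stream each LLM chunk to the websocket as it becomes available.
    for chunk in llm_chunks:
        await ws.send(json.dumps({"text": chunk + " "}))
    # Empty-string EOS; on the real service this also ends the connection.
    await ws.send(json.dumps({"text": ""}))


async def consumer(ws):
    # Consumer: collect audio frames until the server reports isFinal.
    frames = []
    while True:
        msg = json.loads(await ws.recv())
        if msg.get("audio"):
            frames.append(msg["audio"])
        if msg.get("isFinal"):
            return frames


async def run_turn(llm_chunks):
    ws = FakeTTSSocket()
    _, frames = await asyncio.gather(producer(ws, llm_chunks), consumer(ws))
    return frames


if __name__ == "__main__":
    print(asyncio.run(run_turn(["hi", "how", "are you"])))
    # ['<hi>', '<how>', '<are you>']
```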
Let me clarify: you are talking about your custom agent, which utilizes 11labs TTS streaming?
I connect to the 11labs websocket to do TTS in streaming mode, and I push chunks of the LLM output to it. What I would like is this: if my LLM returns 3 chunks ["hi", "how", "are you"] in one turn, I want to know when I have received from the 11labs websocket all the audio frames that correspond to "hi how are you".
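One best-effort way to match received audio against sent text, despite the added punctuation and spaces mentioned earlier, is to compare both sides after stripping whitespace and punctuation. The sketch below assumes each audio message carries per-character alignment data (the ElevenLabs docs describe an `alignment` object with a `chars` list, but verify the field names against the current API reference). `TurnTracker` and `normalize` are hypothetical helpers. This is a heuristic only: it will not match if the service rewrites words, for example by spelling out numbers.

```python
import string

# Characters ignored when comparing sent text with synthesized characters.
_IGNORED = set(string.whitespace) | set(string.punctuation)


def normalize(text: str) -> str:
    """Lower-case and drop whitespace/punctuation so cosmetic differences
    (an added comma, an extra space) do not break the comparison."""
    return "".join(ch for ch in text.lower() if ch not in _IGNORED)


class TurnTracker:
    """Hypothetical helper: text sent for one turn vs. characters reported
    back in the (assumed) alignment data of the audio messages."""

    def __init__(self) -> None:
        self.sent = ""
        self.received = ""

    def on_sent(self, chunk: str) -> None:
        self.sent += chunk

    def on_alignment(self, chars: list) -> None:
        # Assumed shape: a list of single characters, one per synthesized char.
        self.received += "".join(chars)

    def is_complete(self) -> bool:
        target = normalize(self.sent)
        return bool(target) and normalize(self.received) == target
```

For the example above, sending "hi", " how", " are you" and receiving alignment chars for "Hi, how " and then "are you?" makes `is_complete()` flip to `True` only after the second batch.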
You can tell that the audio generation has ended by checking the isFinal field in the data that you receive from the TTS websocket.
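Checking `isFinal` on each incoming message can be as small as the function below. It is a sketch that assumes each message is JSON with a base64-encoded `audio` field (which may be null, for example on the final message) and a boolean `isFinal` field; missing fields are treated as "no audio" and "not final". `parse_tts_message` is a hypothetical name.

```python
import base64
import json
from typing import Optional, Tuple


def parse_tts_message(raw: str) -> Tuple[Optional[bytes], bool]:
    """Return (decoded audio bytes or None, is_final) for one websocket message.

    Assumes a base64 "audio" field (possibly null) and a boolean "isFinal".
    """
    msg = json.loads(raw)
    audio_b64 = msg.get("audio")
    audio = base64.b64decode(audio_b64) if audio_b64 else None
    return audio, bool(msg.get("isFinal"))
```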
So that is the point I do not understand. To get isFinal in one of the generated audio messages, I need to send an EOS text over the websocket. And that will also close the websocket, no?
If by EOS text you mean something that ends with a dot (or equivalent), then no, it will not close the websocket. I personally haven't used it, but based on the documentation and common sense this should not happen.
By EOS I mean an empty string "". Otherwise, in what cases do you get isFinal=True?
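The two kinds of text message discussed here, a regular chunk and the empty-string EOS, can be built as below. One practical consequence of `""` meaning EOS: an empty chunk coming out of the LLM must not be forwarded as-is, or it would be taken as end of stream. The sketch skips such chunks. Appending a trailing space to each chunk is an assumption meant to keep consecutive chunks from being glued together; `build_messages` is a hypothetical helper.

```python
import json
from typing import Iterable, List

# Empty text = end of stream; per this thread it also ends the connection.
EOS = json.dumps({"text": ""})


def build_messages(llm_chunks: Iterable[str], end_turn: bool) -> List[str]:
    """Turn LLM chunks into websocket text messages.

    Empty or whitespace-only chunks are skipped so they are never mistaken
    for EOS. A trailing space is appended to each chunk (an assumption).
    """
    messages = []
    for chunk in llm_chunks:
        if chunk.strip():
            messages.append(json.dumps({"text": chunk.strip() + " "}))
    if end_turn:
        messages.append(EOS)
    return messages
```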
Ok, an empty string will end the connection. Why is that a problem?
If I have to send an empty string at each turn, then I expect the websocket to be closed at each turn. That takes me back to the solution of opening and closing a websocket at each turn, which is not what you recommended (keeping a websocket open during the whole conversation).
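If the websocket is kept open for the whole conversation and EOS is never sent, one fallback is an idle timeout: once the producer has sent all of the turn's text, the consumer treats the turn as finished when no audio frame arrives for a short period. This is a heuristic sketch, not an API feature. It assumes a single reader task routes incoming frames into an `asyncio.Queue`, and `collect_turn_audio` and `demo_turn` are hypothetical names. It can cut audio off if synthesis pauses longer than the timeout, and it keeps waiting indefinitely if the producer never signals completion. It may be worth checking whether the current API documentation offers a way to flush or mark the end of a turn without closing the connection, which would avoid the timeout entirely.

```python
import asyncio
from typing import List


async def collect_turn_audio(frames: "asyncio.Queue[bytes]",
                             producer_done: asyncio.Event,
                             idle_timeout: float = 0.5) -> List[bytes]:
    """Collect one turn's audio from a shared, long-lived connection.

    Heuristic: after the producer has sent the whole turn, the turn counts as
    finished once no frame arrives for `idle_timeout` seconds.
    """
    collected = []
    while True:
        try:
            frame = await asyncio.wait_for(frames.get(), timeout=idle_timeout)
            collected.append(frame)
        except asyncio.TimeoutError:
            if producer_done.is_set():
                return collected
            # Producer is still streaming text for this turn: keep waiting.


async def demo_turn() -> List[bytes]:
    # Simulated turn: all text already sent, audio frames trickle in, then stop.
    frames: "asyncio.Queue[bytes]" = asyncio.Queue()
    producer_done = asyncio.Event()
    producer_done.set()

    async def fake_server():
        for frame in (b"hi", b"how", b"are you"):
            await asyncio.sleep(0.01)
            frames.put_nowait(frame)

    server = asyncio.create_task(fake_server())
    audio = await collect_turn_audio(frames, producer_done, idle_timeout=0.1)
    await server
    return audio
```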