#is the speech-to-speech streaming api referring to streaming the uploaded source audio or streaming
The Speech-to-Speech streaming API refers to streaming the generated audio. When you use the streaming API, it streams back the audio as it is being generated, resulting in a much lower response time from request to the first byte of audio received. This is particularly recommended for applications that require low latency.
For instance, in the case of the Text-to-Speech streaming API, the regular endpoint renders the audio file before returning it in the response. However, the streaming endpoint streams back the audio as it is being generated.
Similarly, if your application can stream its text prompts to the text-to-speech endpoint, prompts can be fed to the endpoint while the speech is still being generated. When using the websocket, you can also configure the streaming chunk size; smaller chunks generally render faster.
In the context of Speech-to-Speech, the process involves providing the path to an audio file that you would like to convert from one voice to another when calling the API. The response from the API request is then read in chunks and written to the output file, effectively streaming the generated audio.
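A minimal sketch of that flow, in case it helps. It is not the official client: the `url` argument, the bearer-token header, and the `audio` form field are placeholders I am assuming for illustration, so check the API reference for the real endpoint, auth header, and field name. The only third-party piece is the `requests` library, whose `stream=True` and `iter_content` calls are used to read the response in chunks.

```python
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], out_path: str) -> int:
    """Write audio chunks to out_path as they arrive; return total bytes written."""
    total = 0
    with open(out_path, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                total += len(chunk)
    return total


def convert_voice(url: str, api_key: str, source_path: str, out_path: str) -> int:
    """Upload the source audio and stream the converted audio to out_path.

    ASSUMPTIONS (hypothetical, not taken from the API docs): `url` is the
    streaming speech-to-speech endpoint, auth is a bearer token, and the
    source file is sent as a multipart field named "audio".
    """
    import requests  # third-party; imported here so write_chunks works without it

    with open(source_path, "rb") as src, requests.post(
        url,
        headers={"Authorization": f"Bearer {api_key}"},  # placeholder auth header
        files={"audio": src},  # placeholder field name
        stream=True,  # do not buffer the whole response body in memory
        timeout=60,
    ) as response:
        response.raise_for_status()
        # Each chunk is written as soon as it is received.
        return write_chunks(response.iter_content(chunk_size=1024), out_path)
```

Calling `convert_voice(url, api_key, "input.mp3", "output.mp3")` would upload `input.mp3` and write the converted audio to `output.mp3` chunk by chunk. Note that the source audio is uploaded in full first; only the generated audio coming back is streamed.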
How can I expect the streamed audio to be returned from the API? Is there an audio length threshold below which it will just return the entire generated file?