#How to detect the end of TTS when using the ElevenLabs WebSocket TTS API
Short answer:
- Listen for a JSON message with isFinal: true from the TTS WebSocket. That’s the server’s signal that no more audio chunks will follow. (elevenlabs.io)
- After sending your last text, send {"text": ""}. This tells the server the sequence is complete; you’ll receive the final message and then the socket will close. You can also treat the WebSocket close as end-of-utterance. (elevenlabs.io)
Minimal pattern (language-agnostic):
- Receive loop:
  - If message.audio exists, append/decode and play.
  - If message.isFinal is true, stop reading and finalize playback.
- When done sending text:
  - Optionally send {"flush": true} with the last chunk to force any buffered text to generate promptly.
  - Then send {"text": ""} to finalize and close. (elevenlabs.io)
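The receive side of this pattern can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the official client: it assumes every server message is a JSON text frame, that the audio field is a base64-encoded string, and that the final message carries isFinal: true with no audio. The names handle_message and drain are hypothetical helpers, and the iterable of strings stands in for a real WebSocket connection.

```python
import base64
import json
from typing import Iterable


def handle_message(raw: str, audio_out: bytearray) -> bool:
    """Process one server message; return True if it is the final one."""
    message = json.loads(raw)
    audio_b64 = message.get("audio")
    if audio_b64:  # assumption: audio is base64 text, absent or null on the final message
        audio_out.extend(base64.b64decode(audio_b64))
    return message.get("isFinal") is True


def drain(messages: Iterable[str]) -> bytes:
    """Collect audio until isFinal is seen or the message source ends."""
    audio = bytearray()
    for raw in messages:
        if handle_message(raw, audio):
            break  # stop reading; the caller can now finalize playback
    return bytes(audio)
```

With a real connection, the same handle_message function would be called once per received frame, and the loop would also end when the socket closes.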
Operational notes:
- The connection auto-closes after inactivity (default 20 seconds); you can keep it alive by periodically sending " " (space), or adjust inactivity_timeout (up to 180 seconds) via query param. (elevenlabs.io)
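As an illustration of the timeout and keep-alive settings, here is a small Python sketch. It assumes the host wss://api.elevenlabs.io (not stated in this thread) and the single-stream path /v1/text-to-speech/{voice_id}/stream-input; build_stream_url is a hypothetical helper, and capping the value at 180 on the client side is a design choice of this sketch rather than documented behavior.

```python
import json
from urllib.parse import urlencode

DEFAULT_INACTIVITY_TIMEOUT = 20  # seconds, default per the note above
MAX_INACTIVITY_TIMEOUT = 180     # seconds, maximum per the note above


def build_stream_url(voice_id: str,
                     inactivity_timeout: int = DEFAULT_INACTIVITY_TIMEOUT) -> str:
    """Build the WebSocket URL with an inactivity_timeout query parameter."""
    if inactivity_timeout <= 0:
        raise ValueError("inactivity_timeout must be positive")
    timeout = min(inactivity_timeout, MAX_INACTIVITY_TIMEOUT)
    query = urlencode({"inactivity_timeout": timeout})
    # Assumption: the API host is api.elevenlabs.io.
    return f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?{query}"


# Keep-alive payload: a single space, never the empty string.
KEEP_ALIVE_MESSAGE = json.dumps({"text": " "})
```

A client would send KEEP_ALIVE_MESSAGE on a timer shorter than the configured timeout while no text is pending.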
Docs:
- How-to guide with example showing isFinal handling and end-of-sequence close. (elevenlabs.io)
- API reference for the text-to-speech WebSocket, including inactivity_timeout and message types (Audio Output, Final Output). (elevenlabs.io)
Sources:
Generate audio in real-time | ElevenLabs Documentation
WebSocket | ElevenLabs Documentation
In my design, the WebSocket will be multiplexed/reused, so I do not intend to disconnect after each request. Instead of the connection-closing semantics, I need to identify the end-of-audio marker for a specific text segment to know when the generation for that particular utterance is finished.
Short answer
- The server tells you an utterance is finished via a final JSON message with isFinal: true. On the multi-context endpoint, that final message also includes contextId so you can correlate which utterance finished. (elevenlabs.io)
What to use, depending on your socket mode
- Single-stream websocket (/v1/text-to-speech/{voice_id}/stream-input):
  - Receive loop: collect audio chunks from messages that contain audio; stop for the current utterance when you receive a message where isFinal is true. You can keep the connection open and start the next utterance without disconnecting. (elevenlabs.io)
- Multi-context websocket (/v1/text-to-speech/{voice_id}/multi-stream-input):
  - Send text with a context_id per utterance.
  - On the receive side, route each message by contextId. When you get { isFinal: true, contextId: "<id>" }, that specific utterance/context is done. You can close just that context with { context_id, close_context: true } without closing the socket. (elevenlabs.io)
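Routing by contextId on a shared socket could look roughly like the Python sketch below. It is illustrative only: ContextRouter and ContextState are hypothetical names, the audio field is assumed to be base64 text, and the final message is assumed to use the keys isFinal and contextId exactly as described above.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContextState:
    audio: bytearray = field(default_factory=bytearray)
    done: bool = False


class ContextRouter:
    """Demultiplex server messages from one socket into per-context buffers."""

    def __init__(self) -> None:
        self.contexts: Dict[str, ContextState] = {}

    def feed(self, raw: str) -> Optional[str]:
        """Handle one message; return the context id that just finished, if any."""
        message = json.loads(raw)
        context_id = message.get("contextId")
        if context_id is None:
            return None  # not attributable to a context; ignored in this sketch
        state = self.contexts.setdefault(context_id, ContextState())
        if message.get("audio"):
            state.audio.extend(base64.b64decode(message["audio"]))
        if message.get("isFinal") is True:
            state.done = True
            return context_id
        return None
```

When feed returns a context id, the caller can hand that context's audio to playback and start the next utterance under a new context_id on the same connection.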
Practical tips for reused/multiplexed sockets
- Force completion of buffered text: send { flush: true } (or with context_id on multi-context) to make the model emit what it has, after which you will receive the isFinal message for that utterance/context. (elevenlabs.io)
- Keep the socket alive between utterances:
  - Single-stream: send " " periodically; an empty string "" sends EOS and closes the socket. You can also raise inactivity_timeout (max 180s) via a query param. (help.elevenlabs.io)
  - Multi-context: to prevent a specific context from timing out, you can send an empty text on that context to reset its timer; closing a context is done with close_context, not by sending "". (elevenlabs.io)
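The client-side messages mentioned in these tips can be assembled with a few small helpers. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, the field names (text, flush, context_id, close_context) are the ones quoted in this thread, and the guard against sending "" without a context is a design choice of the sketch, added because "" ends the sequence on the single-stream socket.

```python
import json
from typing import Any, Dict, Optional


def text_message(text: str, context_id: Optional[str] = None,
                 flush: bool = False) -> str:
    """Text chunk, optionally bound to a context and optionally flushed."""
    if text == "" and context_id is None:
        # On the single-stream socket "" means end of sequence and closes it.
        raise ValueError('use " " to keep a single-stream socket alive, not ""')
    payload: Dict[str, Any] = {"text": text}
    if context_id is not None:
        payload["context_id"] = context_id
    if flush:
        payload["flush"] = True
    return json.dumps(payload)


def flush_message(context_id: Optional[str] = None) -> str:
    """Flush-only message that forces buffered text to be generated."""
    payload: Dict[str, Any] = {"flush": True}
    if context_id is not None:
        payload["context_id"] = context_id
    return json.dumps(payload)


def close_context_message(context_id: str) -> str:
    """Close one context without closing the socket."""
    return json.dumps({"context_id": context_id, "close_context": True})


def keep_alive_message() -> str:
    """Single-space text used to keep a single-stream socket open."""
    return json.dumps({"text": " "})
```

For example, text_message("Hello ", context_id="ctx1", flush=True) produces the last chunk of an utterance on the multi-context endpoint, and close_context_message("ctx1") releases that context afterwards.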
References
- Real-time TTS over WebSocket (guide): isFinal in receive loop, flush behavior, keep-alive guidance. (elevenlabs.io)
- WebSocket API reference (single-stream): message schema, Final Output object, query params (e.g., inactivity_timeout, sync_alignment). (elevenlabs.io)
- Multi-Context WebSocket API reference: contextId in responses, isFinal per context, close_context/flush messages. (elevenlabs.io)
- Help Center: keeping the WebSocket open; "" closes the single-stream socket.
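Can isFlush be set on its own?
No. The protocol has no field named isFlush; use the boolean flush flag instead.
- Single-stream endpoint (/stream-input):
  - Include "flush": true when sending the last piece of text to empty the buffer and produce audio immediately. Example from the official guide: await websocket.send({"text": "...", "flush": true}). When generation ends, the server sends a final message (Final Output), which you can use to decide that a piece of audio is complete. Do not send the empty string "" (this triggers EOS and closes the connection); to keep the connection open, send a single space " ". (elevenlabs.io)
- For reused/multiplexed concurrent use, switching to the multi-context endpoint (/multi-stream-input) is recommended:
  - You can send a message containing only "flush": true (text may be omitted) to end a given context, for example: {"context_id":"ctx1","flush":true}. When generation for that context ends, the server returns a final message containing isFinal: true together with contextId, which serves as the authoritative marker that this piece of audio is complete. (elevenlabs.io)
- Additional note: some example code reads the final flag under the key is_final (snake_case), but the API reference treats isFinal (camelCase) as the authoritative name. (elevenlabs.io)
Summary: there is no isFlush. Use "flush": true in client messages (sent together with text, or on its own on the multi-context endpoint) and listen for the isFinal flag in the server's final message to determine when a single TTS segment has ended. (elevenlabs.io)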
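Because the thread mentions both isFinal and is_final as key names for the final flag, a defensive reader can accept either one. The Python sketch below is only an illustration of that idea; is_final_message is a hypothetical helper, and accepting both spellings is a precaution of this sketch, not something the API reference requires.

```python
import json


def is_final_message(raw: str) -> bool:
    """Return True if the server message marks the end of generation."""
    message = json.loads(raw)
    # Accept the camelCase key from the API reference and the snake_case
    # key seen in some example code.
    return message.get("isFinal") is True or message.get("is_final") is True
```

Messages that only carry audio, or that set the flag to false or null, are treated as non-final.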
Sources:
Generate audio in real-time | ElevenLabs Documentation
Multi-Context WebSocket | ElevenLabs Documentation
What does isFinal actually mean? After sending flush, am I guaranteed to receive isFinal?