#How to detect the end of TTS audio when using the ElevenLabs WebSocket TTS API

hearty pebbleBOT
AI-generated Answer for hurrytospring0798

Short answer:

  • Listen for a JSON message with isFinal: true from the TTS WebSocket. That’s the server’s signal that no more audio chunks will follow. (elevenlabs.io)
  • After sending your last text, send {"text": ""}. This tells the server the sequence is complete; you’ll receive the final message and then the socket will close. You can also treat the WebSocket close as end-of-utterance. (elevenlabs.io)

Minimal pattern (language-agnostic):

  • Receive loop:
    • If message.audio exists, append/decode and play.
    • If message.isFinal is true, stop reading and finalize playback.
  • When done sending text:
    • Optionally send {"flush": true} with the last chunk to force any buffered text to generate promptly.
    • Then send {"text": ""} to finalize and close. (elevenlabs.io)
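The receive loop above can be sketched in Python roughly as follows. This is a minimal sketch, not the official client: it assumes each server frame is a JSON string, that the audio field carries base64-encoded bytes, and that the final message has isFinal set to true. The function name collect_utterance and the simulated frames are hypothetical and stand in for frames read from a real socket.

```python
import base64
import json
from typing import Iterable


def collect_utterance(raw_messages: Iterable[str]) -> bytes:
    """Concatenate decoded audio chunks until a message with isFinal: true arrives."""
    audio = bytearray()
    for raw in raw_messages:
        message = json.loads(raw)
        chunk = message.get("audio")
        if chunk:  # audio may be missing or null, e.g. on the final message
            audio.extend(base64.b64decode(chunk))  # assumption: base64 payload
        if message.get("isFinal"):
            break  # the server signalled that no more chunks will follow
    return bytes(audio)


# Simulated server frames (hypothetical data, no network involved).
frames = [
    json.dumps({"audio": base64.b64encode(b"abc").decode()}),
    json.dumps({"audio": base64.b64encode(b"def").decode()}),
    json.dumps({"audio": None, "isFinal": True}),
    json.dumps({"audio": base64.b64encode(b"ignored").decode()}),
]
print(collect_utterance(frames))  # b'abcdef'
```

With a real connection, the same function could be fed the frames read from the socket instead of the simulated list.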

Operational notes:

  • The connection auto-closes after inactivity (default 20 seconds); you can keep it alive by periodically sending " " (space), or adjust inactivity_timeout (up to 180 seconds) via query param. (elevenlabs.io)
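As an illustration of the timeout settings, here is a small sketch that builds the connection URL with inactivity_timeout and prepares a keep-alive payload. The host wss://api.elevenlabs.io and the helper name build_stream_url are assumptions for illustration; the 20-second default and the 180-second cap are the values quoted above.

```python
import json
from urllib.parse import urlencode

DEFAULT_INACTIVITY_TIMEOUT = 20   # seconds, default quoted in the answer
MAX_INACTIVITY_TIMEOUT = 180      # seconds, maximum quoted in the answer
BASE_URL = "wss://api.elevenlabs.io"  # assumption: host is not stated in the thread


def build_stream_url(voice_id: str,
                     inactivity_timeout: int = DEFAULT_INACTIVITY_TIMEOUT) -> str:
    """Return the stream-input URL with inactivity_timeout as a query parameter."""
    if not 0 < inactivity_timeout <= MAX_INACTIVITY_TIMEOUT:
        raise ValueError(
            f"inactivity_timeout must be between 1 and {MAX_INACTIVITY_TIMEOUT}"
        )
    query = urlencode({"inactivity_timeout": inactivity_timeout})
    return f"{BASE_URL}/v1/text-to-speech/{voice_id}/stream-input?{query}"


# A single space keeps the connection alive; an empty string would end the sequence.
KEEP_ALIVE = json.dumps({"text": " "})

print(build_stream_url("VOICE_ID", 180))
```

Validating the value on the client side is a design choice here; it simply avoids a round trip for a value the server is expected to reject.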

Docs:

  • How-to guide with example showing isFinal handling and end-of-sequence close. (elevenlabs.io)
  • API reference for the text-to-speech WebSocket, including inactivity_timeout and message types (Audio Output, Final Output). (elevenlabs.io)

Sources:
Generate audio in real-time | ElevenLabs Documentation
WebSocket | ElevenLabs Documentation

civic tinsel

In my design, the WebSocket will be multiplexed/reused, so I do not intend to disconnect after each request. Instead of the connection-closing semantics, I need to identify the end-of-audio marker for a specific text segment to know when the generation for that particular utterance is finished.

hearty pebbleBOT
AI-generated Answer for hurrytospring0798

Short answer

  • The server tells you an utterance is finished via a final JSON message with isFinal: true. On the multi-context endpoint, that final message also includes contextId so you can correlate which utterance finished. (elevenlabs.io)

What to use, depending on your socket mode

  • Single-stream websocket (/v1/text-to-speech/{voice_id}/stream-input):
    • Receive loop: collect audio chunks from messages that contain audio; stop for the current utterance when you receive a message where isFinal is true. You can keep the connection open and start the next utterance without disconnecting. (elevenlabs.io)
  • Multi-context websocket (/v1/text-to-speech/{voice_id}/multi-stream-input):
    • Send text with a context_id per utterance.
    • On the receive side, route each message by contextId. When you get { isFinal: true, contextId: "<id>" }, that specific utterance/context is done. You can close just that context with { context_id, close_context: true } without closing the socket. (elevenlabs.io)
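The per-context routing described above could look something like this. It is a sketch under the assumptions stated in the answer: responses carry contextId, audio is base64-encoded, and a context is finished when isFinal is true. ContextRouter, fake_frame, and close_context_message are hypothetical names, and the frames are simulated rather than read from a socket.

```python
import base64
import json
from collections import defaultdict
from typing import Dict, List


class ContextRouter:
    """Route multi-context responses by contextId and track finished contexts."""

    def __init__(self) -> None:
        self.audio: Dict[str, bytearray] = defaultdict(bytearray)
        self.finished: List[str] = []

    def handle(self, raw: str) -> None:
        message = json.loads(raw)
        context_id = message.get("contextId")
        if context_id is None:
            return  # not tied to a context; ignored in this sketch
        chunk = message.get("audio")
        if chunk:
            self.audio[context_id].extend(base64.b64decode(chunk))
        if message.get("isFinal"):
            self.finished.append(context_id)  # this utterance is done


def close_context_message(context_id: str) -> str:
    """Client message that closes one context without closing the socket."""
    return json.dumps({"context_id": context_id, "close_context": True})


def fake_frame(context_id: str, data: bytes) -> str:
    """Build a simulated audio frame (test data only)."""
    return json.dumps({"contextId": context_id,
                       "audio": base64.b64encode(data).decode()})


router = ContextRouter()
for frame in [
    fake_frame("ctx1", b"hel"),
    fake_frame("ctx2", b"wor"),   # interleaved with ctx1
    fake_frame("ctx1", b"lo"),
    json.dumps({"contextId": "ctx1", "isFinal": True}),
]:
    router.handle(frame)

print(router.finished)              # ['ctx1']
print(bytes(router.audio["ctx1"]))  # b'hello'
```

Keeping one buffer per contextId means interleaved chunks from concurrent utterances never mix, and the finished list gives the caller a per-utterance completion signal.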

Practical tips for reused/multiplexed sockets

  • Force completion of buffered text: send { flush: true } (adding context_id on the multi-context endpoint) so the model generates audio for the text it has buffered, after which you will receive the isFinal message for that utterance/context. (elevenlabs.io)
  • Keep the socket alive between utterances:
    • Single-stream: send " " periodically; an empty string "" sends EOS and closes the socket. You can also raise inactivity_timeout (max 180s) via a query param. (help.elevenlabs.io)
    • Multi-context: to prevent a specific context from timing out, you can send an empty text on that context to reset its timer; closing a context is done with close_context, not by sending "". (elevenlabs.io)
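The client messages mentioned in these tips can be sketched as small builder functions. This assumes the field names quoted in the answer (text, flush, context_id); the helper names text_message and keep_alive_message are hypothetical. The guard against an empty string on the single-stream socket is a design choice in this sketch, since the answer says "" sends EOS and closes that socket.

```python
import json
from typing import Optional


def text_message(text: str,
                 context_id: Optional[str] = None,
                 flush: bool = False) -> str:
    """Build a client text message, optionally scoped to a context and flushed."""
    if context_id is None and text == "":
        # On the single-stream socket an empty string ends the sequence,
        # so refuse to build it by accident.
        raise ValueError('"" would send EOS on the single-stream socket')
    payload = {"text": text}
    if context_id is not None:
        payload["context_id"] = context_id
    if flush:
        payload["flush"] = True
    return json.dumps(payload)


def keep_alive_message(context_id: Optional[str] = None) -> str:
    """Single-stream: a single space. Multi-context: empty text on the context."""
    if context_id is None:
        return text_message(" ")
    return text_message("", context_id=context_id)


print(text_message("Hello there.", flush=True))
print(keep_alive_message())
print(keep_alive_message("ctx1"))
```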

References

  • Real-time TTS over WebSocket (guide): isFinal in receive loop, flush behavior, keep-alive guidance. (elevenlabs.io)
  • WebSocket API reference (single-stream): message schema, Final Output object, query params (e.g., inactivity_timeout, sync_alignment). (elevenlabs.io)
  • Multi-Context WebSocket API reference: contextId in responses, isFinal per context, close_context/flush messages. (elevenlabs.io)
  • Help Center: keeping the WebSocket open; "" closes the single-stream socket.

civic tinsel

Can isFlush be flagged on its own?

hearty pebbleBOT
AI-generated Answer for hurrytospring0798

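No. The protocol has no field named isFlush; use the boolean flush flag instead.

  • Single-stream endpoint (/stream-input):
    • Include "flush": true when sending the last piece of text to empty the buffer and produce audio immediately. Example from the official guide: await websocket.send({"text": "...", "flush": true}). When generation ends, the server sends a final message (Final Output), which you can use to determine that a segment of audio is complete. Do not send an empty string "" (this triggers EOS and closes the connection); to keep the connection open, send a single space " ". (elevenlabs.io)
  • For reused/multiplexed concurrent scenarios, consider switching to the multi-context endpoint (/multi-stream-input):
    • You can send a standalone message containing only "flush": true to end a given context (text may be omitted), for example: {"context_id":"ctx1","flush":true}. When generation for that context ends, the server returns a final message containing isFinal: true together with contextId, which serves as the authoritative marker that the audio segment is complete. (elevenlabs.io)
  • Additional note: some sample code reads the final marker with the key is_final (underscore), but the API reference treats isFinal (camelCase) as authoritative. (elevenlabs.io)

Summary: there is no isFlush; use "flush": true in client messages (sent together with text, or on its own on the multi-context endpoint) and listen for the isFinal flag in the server's final message to determine when a single TTS segment has ended. (elevenlabs.io)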
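Because the answer notes that some sample code reads is_final while the API reference uses isFinal, a tolerant check is one way to guard against the mismatch. The sketch below is an illustration only: is_final_message and flush_only_message are hypothetical helper names, and accepting both spellings is a defensive choice, not documented behaviour.

```python
import json


def is_final_message(message: dict) -> bool:
    """Return True if the server message is marked final under either key spelling."""
    # isFinal (camelCase) is the spelling given in the API reference;
    # is_final is accepted here only as a defensive fallback.
    return bool(message.get("isFinal") or message.get("is_final"))


def flush_only_message(context_id: str) -> str:
    """Standalone flush for one context on the multi-context endpoint (no text)."""
    return json.dumps({"context_id": context_id, "flush": True})


print(is_final_message({"isFinal": True, "contextId": "ctx1"}))  # True
print(is_final_message({"audio": "AAAA"}))                        # False
print(flush_only_message("ctx1"))
```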

Sources:
Generate audio in real-time | ElevenLabs Documentation
Multi-Context WebSocket | ElevenLabs Documentation

civic tinsel

What does isFinal actually mean? If a flush is received, is an isFinal message guaranteed to follow?