#i want to know how to use multi-context websocket
Here’s how to use ElevenLabs’ Multi-Context WebSocket for TTS, based solely on the official docs:
What it is
- A single WebSocket where you can run up to 5 independent TTS “contexts” (audio generations) concurrently. Designed for conversational agents to handle barge‑ins, interleaving, and smooth turn-taking; not for generating unrelated streams. Not available on the eleven_v3 model. (elevenlabs.io)
Endpoint and auth
- Connect to: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/multi-stream-input
- Send header: xi-api-key: YOUR_API_KEY
- Useful query params: model_id, output_format, inactivity_timeout (default 20s, max 180s), sync_alignment (adds timing per chunk). (elevenlabs.io)
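As a rough sketch of the endpoint and auth details above, the URL and header can be assembled like this. The helper names (`build_multi_context_url`, `auth_headers`) are made up for illustration, and the only validation is the 180s ceiling mentioned above.

```python
from urllib.parse import urlencode

BASE = "wss://api.elevenlabs.io/v1/text-to-speech"


def build_multi_context_url(voice_id, model_id, output_format=None,
                            inactivity_timeout=None, sync_alignment=None):
    """Build the multi-stream-input URL, adding only the params that were set."""
    params = {"model_id": model_id}
    if output_format is not None:
        params["output_format"] = output_format
    if inactivity_timeout is not None:
        # The thread states a maximum of 180 seconds for this parameter.
        if inactivity_timeout > 180:
            raise ValueError("inactivity_timeout must be at most 180 seconds")
        params["inactivity_timeout"] = inactivity_timeout
    if sync_alignment is not None:
        params["sync_alignment"] = "true" if sync_alignment else "false"
    return f"{BASE}/{voice_id}/multi-stream-input?{urlencode(params)}"


def auth_headers(api_key):
    """Header dict carrying the API key (hypothetical helper)."""
    return {"xi-api-key": api_key}
```

How the header dict is handed to the connection depends on the WebSocket client library you use, so that part is left out here.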
Message types you send
- Start or continue a context: { "text": "...", "context_id": "ctxA", ["voice_settings": {...}] }
- Include voice_settings on the first message that opens a context. (elevenlabs.io)
- Flush buffered text to generate audio now: { "context_id": "ctxA", "flush": true } (elevenlabs.io)
- Close a specific context (e.g., on interruption): { "context_id": "ctxA", "close_context": true } (elevenlabs.io)
- Keep a context alive without adding speech (resets timeout): { "context_id": "ctxA", "text": "" } or set inactivity_timeout in the URL. (elevenlabs.io)
- Close the socket (ends all contexts): { "close_socket": true } (elevenlabs.io)
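The message shapes above can be wrapped in small builder functions so the JSON stays consistent. This is only a sketch: the function names are invented here, and the `voice_settings` contents are whatever you would normally pass.

```python
import json


def text_message(context_id, text, voice_settings=None, flush=False):
    """Start or continue a context; voice_settings belongs on the first message."""
    msg = {"text": text, "context_id": context_id}
    if voice_settings is not None:
        msg["voice_settings"] = voice_settings
    if flush:
        msg["flush"] = True
    return json.dumps(msg)


def flush_message(context_id):
    """Force generation of whatever text is buffered for this context."""
    return json.dumps({"context_id": context_id, "flush": True})


def close_context_message(context_id):
    """Close one context, e.g. when the user interrupts."""
    return json.dumps({"context_id": context_id, "close_context": True})


def keepalive_message(context_id):
    """Empty text resets the context's inactivity timeout without adding speech."""
    return json.dumps({"context_id": context_id, "text": ""})


def close_socket_message():
    """Close the socket, which ends all contexts."""
    return json.dumps({"close_socket": True})
```

Each function returns a JSON string ready to send as one WebSocket text frame.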
What you receive
- Streaming audio chunks as base64 in messages like:
- { "audio": "...", "is_final": false, "contextId": "ctxA" }
- A final marker when the context completes: { "is_final": true, "contextId": "ctxA" }
- If sync_alignment=true, alignment/timing is included per chunk. (elevenlabs.io)
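A minimal sketch of reading one incoming message, assuming the shapes listed above. `parse_server_message` is a made-up name, and it checks both `is_final` and `isFinal` because the payload pasted later in this thread uses the camelCase spelling.

```python
import base64
import json


def parse_server_message(raw):
    """Return (context_id, audio_bytes, is_final) for one server message."""
    data = json.loads(raw)
    audio_b64 = data.get("audio")
    # Final markers carry no audio, so fall back to empty bytes.
    audio = base64.b64decode(audio_b64) if audio_b64 else b""
    # Accept either spelling of the final flag; null/missing counts as False.
    is_final = bool(data.get("is_final") or data.get("isFinal"))
    return data.get("contextId"), audio, is_final
```

The caller can then route `audio` to a per-context buffer or file keyed by the returned context id.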
Operational guidance and limits
- Use one WebSocket per end‑user session; manage multiple contexts inside it.
- Stream text in small chunks and set flush: true at sentence ends to improve responsiveness and quality.
- Close unused contexts promptly; the server allows up to 5 concurrent contexts per connection.
- Default context inactivity timeout is 20s; can be raised to 180s via query param; send empty text to keep alive. (elevenlabs.io)
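To stay under the 5-context limit on the client side, one option is a small bookkeeping class like the one below. This is an illustrative sketch, not something from the docs; `ContextPool` and its methods are hypothetical names.

```python
MAX_CONTEXTS = 5  # per-connection limit stated above


class ContextPool:
    """Track which context ids are open on one WebSocket connection."""

    def __init__(self, limit=MAX_CONTEXTS):
        self.limit = limit
        self.open = set()

    def acquire(self, context_id):
        """Register a context before sending its first message."""
        if context_id in self.open:
            return  # already open, nothing to do
        if len(self.open) >= self.limit:
            raise RuntimeError(
                f"{self.limit} contexts already open; close one first"
            )
        self.open.add(context_id)

    def release(self, context_id):
        """Forget a context after sending close_context or seeing its final marker."""
        self.open.discard(context_id)
```

Calling `acquire` before the first message of a context and `release` after closing it means a sixth context fails fast locally instead of on the server.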
Getting started examples and setup
- For foundational WS setup (connecting, sending text, saving audio), follow the general WebSocket TTS guide and then switch the URL to the multi‑context endpoint above. (elevenlabs.io)
Note on models
- The docs page notes v1 TTS models are deprecated and scheduled for removal on 2025‑12‑15; use current recommended models such as eleven_flash_v2_5 with this endpoint. Multi‑context is not supported on eleven_v3. (elevenlabs.io)
If you share your language/runtime and target
but it was not working as expected
Got it. To help troubleshoot the multi‑context WebSocket, here are the exact checks and message patterns the docs require. If any of these differ in your code, that’s likely the cause.
Connection and model
- Use this endpoint: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/multi-stream-input, and send your API key in the xi-api-key header. Include model_id in the query string. (elevenlabs.io)
- WebSockets (including multi‑context) are not available for eleven_v3. Use a supported TTS model (for low latency the docs recommend eleven_flash_v2_5). (elevenlabs.io)
Required message flow per context
- First message for a new context must include context_id and can include voice_settings; the example initializes with "text": " " (a single space). (elevenlabs.io)
- Continue sending text with the same context_id. Use flush: true at the end of a sentence/turn to force generation; otherwise audio may buffer until thresholds are met. (elevenlabs.io)
- Close just one context with {"context_id":"...", "close_context": true}. Close the entire socket with {"close_socket": true}. (elevenlabs.io)
Concurrency and limits
- Max 5 concurrent contexts per WebSocket connection. Exceeding this will cause issues. (elevenlabs.io)
- Default inactivity timeout is 20 seconds; you can raise it (up to 180) with inactivity_timeout in the WebSocket URL. (elevenlabs.io)
Keeping things alive (important distinction)
- Multi‑context: to prevent a specific context from timing out while idle, send {"context_id":"...", "text": ""}. (elevenlabs.io)
- Standard TTS WebSocket: a fully empty string "" closes the connection; to keep a connection open, send a single space " ", or configure inactivity_timeout. (elevenlabs.io)
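For the multi-context case, a keepalive pass could look roughly like this. The function name, the `last_activity` bookkeeping, and the 5-second safety margin are all assumptions for the sake of the example; only the empty-text frame and the 20s default come from the points above.

```python
import json


def keepalive_frames(last_activity, now, timeout_s=20.0, margin_s=5.0):
    """Return empty-text frames for contexts that are close to timing out.

    last_activity maps context_id -> timestamp (seconds) of the last message
    sent on that context; now is the current timestamp on the same clock.
    """
    threshold = timeout_s - margin_s
    return [
        json.dumps({"context_id": context_id, "text": ""})
        for context_id, last_seen in sorted(last_activity.items())
        if now - last_seen >= threshold
    ]
```

You would call this on a timer, send each returned frame, and update the timestamps for the contexts you refreshed.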
Common “it’s not working” causes mapped to fixes
- No audio or delayed audio: you didn’t flush at sentence end or you’re below chunk thresholds. Add flush: true at turn boundaries or adjust generation_config.chunk_length_schedule. (elevenlabs.io)
- Context unexpectedly ends: you hit the 20s idle timeout; either periodically keep the context alive as above or raise inactivity_timeout (max 180s). (elevenlabs.io)
- Connection closed immediately: missing xi-api-key header, using the wrong endpoint, or sending "" on the standard stream. Verify headers/URL and keepalive behavior. (elevenlabs.io)
- Interruption handling stutters: close the old context, start a new one, and flush it. (elevenlabs.io)
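The interruption fix in the last bullet (close the old context, start a new one, flush it) can be sketched as an ordered list of frames. `barge_in_frames` is a hypothetical helper, not part of any SDK.

```python
import json


def barge_in_frames(old_context_id, new_context_id, new_text, voice_settings=None):
    """Frames to send, in order, when the user interrupts the current speech."""
    close_old = {"context_id": old_context_id, "close_context": True}
    start_new = {"text": new_text, "context_id": new_context_id}
    if voice_settings is not None:
        # voice_settings goes on the message that opens the new context.
        start_new["voice_settings"] = voice_settings
    flush_new = {"context_id": new_context_id, "flush": True}
    return [json.dumps(m) for m in (close_old, start_new, flush_new)]
```

Sending the three frames back to back on the same socket keeps the old and new speech in separate contexts.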
If you share:
- the exact endpoint URL you’re using,
- model_id,
- the first 2–3 messages you send (including context_id/flush fields),
- any close codes or error text,
I’ll pinpoint the issue against the
"Here is the list of our doctors and their specializations: ",
"Doctor John Smith specializes in General Medicine, ",
"Doctor Sarah Johnson in Cardiology, ",
"Doctor Michael Brown in Pediatrics, ",
"Doctor Emily Davis in Dermatology, ",
wss://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/multi-stream-input?model_id=eleven_turbo_v2
I'm sending flush after every sentence
Short answer: yes—flushing after each complete sentence is the recommended pattern for the Multi‑Context Text‑to‑Speech WebSocket. (elevenlabs.io)
How to structure your messages
- Connect to: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/multi-stream-input?model_id=eleven_turbo_v2, with your xi-api-key header. The server returns base64 audio chunks in the "audio" field and a final message with "is_final": true. (elevenlabs.io)
- Use one context_id (e.g., "doctors") and send each sentence as a text chunk; end the sentence with punctuation, then send flush: true for that context. This yields natural prosody and low latency. (elevenlabs.io)
Example frames for your script
- {"context_id":"doctors","text":"Here is the list of our doctors and their specializations. ","flush":true} (elevenlabs.io)
- {"context_id":"doctors","text":"Doctor John Smith specializes in General Medicine. ","flush":true} (elevenlabs.io)
- {"context_id":"doctors","text":"Doctor Sarah Johnson specializes in Cardiology. ","flush":true} (elevenlabs.io)
- {"context_id":"doctors","text":"Doctor Michael Brown specializes in Pediatrics. ","flush":true} (elevenlabs.io)
- {"context_id":"doctors","text":"Doctor Emily Davis specializes in Dermatology. ","flush":true} (elevenlabs.io)
- Optionally finish with {"context_id":"doctors","close_context":true} and/or {"close_socket":true}. (elevenlabs.io)
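If your text arrives as fragments that end in commas rather than full sentences, one way to follow the "flush at sentence end" pattern is to flush only when a fragment ends in sentence punctuation, plus one trailing flush. This is a sketch under that assumption; the function name and the punctuation set are illustrative.

```python
import json

SENTENCE_END = (".", "!", "?")  # assumed set of sentence-ending characters


def frames_with_sentence_flush(context_id, fragments):
    """Build text frames, setting flush only on fragments that end a sentence."""
    frames = []
    flushed = True  # nothing is buffered before the first fragment
    for fragment in fragments:
        msg = {"context_id": context_id, "text": fragment}
        flushed = fragment.rstrip().endswith(SENTENCE_END)
        if flushed:
            msg["flush"] = True
        frames.append(json.dumps(msg))
    if not flushed:
        # The last fragment did not end a sentence, so flush what is buffered.
        frames.append(json.dumps({"context_id": context_id, "flush": True}))
    return frames
```

With comma-terminated fragments this sends the text unflushed and finishes with a single standalone flush frame.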
Practical tips
- If you stream full sentences (as above), consider adding auto_mode=true in the query string to reduce buffering latency. (elevenlabs.io)
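- Keep the socket alive: it auto‑closes after 20s of inactivity. Either set inactivity_timeout (max 180s) or send a periodic keepalive. On the multi‑context endpoint the documented keepalive is {"context_id":"doctors","text":""}; the Help Center advice to send a single space " " (because an empty string "" closes the socket) applies to the standard TTS WebSocket. (help.elevenlabs.io)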
- WebSockets aren’t available for eleven_v3; your chosen model_id=eleven_turbo_v2 is supported. (elevenlabs.io)
References
- Multi‑Context WebSocket API reference (message shapes, flush, contexts). (elevenlabs.io)
- Multi‑context WebSocket guide (best practice: flush at sentence boundaries). (elevenlabs.io)
- WebSockets guide (model availability). (elevenlabs.io)
- Help Center: keeping the WebSocket open.
It didn't work. I'm sending flush: True in every continued context, but I'm only getting
Started writing audio for context 'tomorrow_weather' -> tomorrow_weather.wav
{'chars': ['H', 'e', 'r', 'e', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'l', 'i', 's', 't', ' ', 'o', 'f', ' ', 'o', 'u', 'r', ' ', 'd', 'o', 'c', 't', 'o', 'r', 's', ' ', 'a', 'n', 'd', ' ', 't', 'h', 'e', 'i', 'r', ' ', 's', 'p', 'e', 'c', 'i', 'a', 'l', 'i', 'z', 'a', 't', 'i', 'o', 'n', 's', ':', ' ', 'D', 'o', 'c', 't', 'o', 'r', ' ', 'J', 'o', 'h', 'n', ' ', 'S', 'm', 'i', 't', 'h', ' ', 's', 'p', 'e', 'c', 'i', 'a', 'l', 'i', 'z', 'e', 's', ' ', 'i', 'n', ' ', 'G', 'e', 'n', 'e', 'r', 'a', 'l', ' ', 'M', 'e', 'd', 'i', 'c', 'i', 'n', 'e', ',', ' '], 'charStartTimesMs': [0, 104, 221, 290, 337, 395, 441, 476, 522, 546, 569, 592, 639, 673, 731, 789, 836, 882, 906, 929, 975, 1010, 1045, 1091, 1138, 1184, 1265, 1335, 1393, 1440, 1509, 1579, 1625, 1660, 1695, 1741, 1800, 1823, 1846, 1881, 1927, 1962, 2009, 2055, 2113, 2171, 2229, 2276, 2322, 2380, 2450, 2531, 2682, 2752, 2798, 2844, 2926, 3100, 3274, 3715, 3796, 3866, 3913, 3959, 3994, 4029, 4098, 4145, 4203, 4249, 4307, 4377, 4447, 4516, 4609, 4667, 4748, 4865, 4981, 5062, 5120, 5178, 5224, 5271, 5341, 5433, 5503, 5631, 5666, 5712, 5747, 5782, 5851, 5898, 5956, 6002, 6037, 6084, 6130, 6165, 6211, 6258, 6316, 6374, 6432, 6478, 6548, 6594, 6769, 6792], 'charDurationsMs': [104, 117, 69, 47, 58, 46, 35, 46, 24, 23, 23, 47, 34, 58, 58, 47, 46, 24, 23, 46, 35, 35, 46, 47, 46, 81, 70, 58, 47, 69, 70, 46, 35, 35, 46, 59, 23, 23, 35, 46, 35, 47, 46, 58, 58, 58, 47, 46, 58, 70, 81, 151, 70, 46, 46, 82, 174, 174, 441, 81, 70, 47, 46, 35, 35, 69, 47, 58, 46, 58, 70, 70, 69, 93, 58, 81, 117, 116, 81, 58, 58, 46, 47, 70, 92, 70, 128, 35, 46, 35, 35, 69, 47, 58, 46, 35, 47, 46, 35, 46, 47, 58, 58, 58, 46, 70, 46, 175, 23, 174]}
Short answer: your WebSocket connection is working, but you’re only handling the alignment metadata and not the audio frames. Flush works, but audio is returned in messages with an "audio" field that you must base64‑decode and append to a file.
What to verify and change:
- Read and write audio frames: on each incoming WS message, if data.audio exists, base64‑decode it and append to your output. Don’t rely on the alignment object; that’s just timing metadata. Example logic is shown in the WebSockets guide (check for data["audio"] and write bytes).
- Multi‑context message fields: the audio chunks arrive as {"audio": "...", "is_final": false, "alignment": ..., "normalizedAlignment": ..., "contextId": "..."}; when you receive a final marker for that context, stop writing and close the file. The docs examples spell the final flag is_final, but the payload pasted at the end of this thread carries isFinal, so check for both keys rather than relying on one spelling.
- Flushing: flush forces generation of buffered text, but you still need to read the resulting audio frames. You can send flush either together with text ("flush": true) or as a separate message with just {"context_id": "...", "flush": true}.
- Buffering thresholds: if your sentences are short, they may not trigger generation without flush because of chunk_length_schedule. Using flush at end of each sentence is correct; alternatively tune chunk_length_schedule.
- Explicit output format and file extension: set output_format in the WS URL (e.g., output_format=mp3_44100_128) and save with a matching extension (e.g., .mp3). The docs show saving MP3 when streaming. Writing MP3 bytes to a .wav filename will not produce a valid WAV file.
- Model/latency tips: for low latency use eleven_flash_v2_5; you can also try auto_mode=true on the multi‑context endpoint to reduce buffering for sentence‑sized inputs.
Known‑good minimal sequence for a single context (tomorrow_weather):
- Connect: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/multi-stream-input?model_id=eleven_flash_v2_5&output_format=mp3_44100_128 with header xi-api-key.
- Send: {"text":" ","context_id":"tomorrow_weather","voice_settings":{...}} (first message initializes the context).
- Send your sentence(s): {"text":"Tomorrow’s forecast is ... ","context_id":"tomorrow_weather"}; then flush: {"context_id":"tomorrow_weather","flush":true}.
- On receive: when message contains data.audio, base64‑decode and append to tomorrow_weather.mp3; when data.is_final is true for that context, close the file.
- Optionally close the context: {"context_id":"tomorrow_weather","close_context":true}.
If you implement the above read path (check data.audio and write bytes) and set a matching output_format/extension, you should get playable audio rather than only alignment arrays.
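Here is a minimal sketch of that read path: decode `audio` when present, append it to one file per `contextId`, and close the file on the final marker. `ContextAudioWriter` and its `directory`/`extension` parameters are names invented for this example. The `.mp3` default assumes you requested an MP3 `output_format`.

```python
import base64
import json
import os


class ContextAudioWriter:
    """Append decoded audio to one file per contextId; close it on the final marker."""

    def __init__(self, directory=".", extension=".mp3"):
        self.directory = directory
        self.extension = extension  # should match the requested output_format
        self.files = {}

    def handle(self, raw):
        """Process one server message; return the number of audio bytes written."""
        data = json.loads(raw)
        context_id = data.get("contextId")
        if context_id is None:
            return 0
        written = 0
        audio_b64 = data.get("audio")
        if audio_b64:
            if context_id not in self.files:
                path = os.path.join(self.directory, context_id + self.extension)
                self.files[context_id] = open(path, "wb")
            written = self.files[context_id].write(base64.b64decode(audio_b64))
        # Alignment fields are ignored here; they are timing metadata only.
        if data.get("is_final") or data.get("isFinal"):
            handle = self.files.pop(context_id, None)
            if handle is not None:
                handle.close()
        return written
```

In your receive loop you would call `writer.handle(message)` for every text frame that comes back. If it keeps returning 0, the messages you are reading contain no `audio` field.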
[
{
"audio":"+49TtpEmEAWQmfZOyoyia0isw3umC3r5RstwOqCuhVWSj2UdU8aOCBelWqlZdE5Dq6iw8MHFSP3LrH4eVzDpw3GrfYbio57p5Y8W5Cs95YUnFn2ZP2YOv/+5LE7IJYefMCzBkhiy88IAGDMDlEse+xq4sRHlKLfjO7nrB7nscuRJGEqgwQzu4nxrt88U2W66iRGEZLIZdK+H1Ih0uLj+Mydi14ePth1QyGp0e7l8yMDgfVB4wV0p2J7/rhJHOi6BQemKlGZlI7PBIgOYnEJQsbKuvOK1ka6J3l9/3e/6TMzXch4lFzKKHNR+0D06cOaS9N19FEC1gXDpdjKGr6hEeRHra0wd28Li11KZSbvwcZXV0cXHyrF56sZW3eXLGMYgYOVjrmIvqzCdrPULHV39R2axTRyJ1QoZXRS0t5f9WFzCt99tSjS5zTDdmVzGEo0suyi5GJTZOsmx5eOqtUep+uI5olZMSeVSmcGJqaHYUcsSMnTggsMQtmxOhKJMaJRgJRcEsRlRmJjIjHC1KdvCHUgHZ22vWRvU9I1Av+7WfOMTH7GzXvef5iAizpNY4TixUi3SRalZqM00okBweHXpQEqsNXt2Ikzb6EicSKnE7TiRVe75+dQpWoWlepmMxjpRuLVrkT3Lz+NE2eRl28MCyrdqnnxqnqLcXL//uSxOsD2O3fAAwZgwsUOyABgzA5sehWRJ2KJT+N+1DxMp1HrurT1MuaKx8hfAXIlHLztIV4o8PCqWiUcHDw4EPi0X4xCbPD0jmDlTAoj4bHb7xcWt0HZw8aQmAbnxaOyMKjES8OiuMBHKDtDBaIwnUHpFiYwHxdG/WM+o2+z8V8rTrwNa6GvDT5M8Nynb6ooH908Ga6KBK7akSwJzkiLYeQNnG3o400o1do0nko9jUGRlzTjKQhIyJx4SObgdgYaVkfSiqlhKQEBOUmklOSiZVhPUrnKrjGU4Uigi5d3Im5HjusILk2zahRElaRwwukhGH+Fljqk2I6XPoUa5ua4puho6QB8SE6EgWRoBWVJz6EHCccskE94BlhcohWDIHHiq7KgNkBMSuwuEgqfQmkO4SEIMIUiizDbdKPWR7Ba5a+Kbr8v8rPCN5Xy0w7GAuZzaqgCLeTcFfpFVhXInljFmEYTe9Mo70r9jJpXdtYQ1792S07w9kIuYbnaM3hPs1mqmA+Q5S15iKBiy5Xd6KKLmqr8crt8osXrqXon4snPIDbqA==",
"isFinal": null,
"normalizedAlignment": {
"chars": [
],
"charStartTimesMs": [
6838,
6861
],
"charDurationsMs": [
174,
23,
198
]
},
"alignment": {
"chars": [
"n",
"e",
",",
" "
],
"charStartTimesMs": [
6861
],
"charDurationsMs": [
58,
46,
174,
23,
198
]
},
"contextId": "tomorrow_weather"
}
]