#How do I best handle pronunciations with the websocket text to speech API?
Short answer:
- Use a pronunciation dictionary for recurring terms, and SSML phoneme tags for one‑off fixes. In WebSocket TTS, attach dictionaries when you initialize the connection, and enable SSML parsing if you want to send <phoneme> or <break> tags.
Details and references:
Attach pronunciation dictionaries at connect time
- WebSocket TTS requires dictionaries to be specified in the initial “Initialize Connection” message (not per text message). This is the recommended way to keep brand names, acronyms, and technical terms consistent. (elevenlabs.io)
- Dictionaries are standard PLS lexicons; you can define rules using IPA or CMU phonemes and alias substitutions. Create/manage them via the Pronunciation Dictionaries APIs or Studio before using them in TTS. (elevenlabs.io)
- For the REST TTS endpoints the field is pronunciation_dictionary_locators (id + version), applied in order; the WebSocket initialize step follows the same “attach at start” model. (elevenlabs.io)
Use SSML for per‑utterance control
- Enable SSML parsing on the WebSocket by adding enable_ssml_parsing=true to the connection query parameters. Then you can send SSML in your text messages. (elevenlabs.io)
- Phoneme tags: Supported on Eleven English v1, Eleven Turbo v2, and Eleven Flash v2 models; phonemes are currently English-only. Use <phoneme alphabet="ipa" ph="…">word</phoneme> for IPA, or alphabet="cmu-arpabet" for CMU Arpabet. (help.elevenlabs.io)
- Pauses: Use <break time="…s" /> (up to 3s). All models except Eleven V3 support SSML break tags; Eleven V3 uses [pause] style tags (note: WebSocket TTS is not available for Eleven V3). (help.elevenlabs.io)
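As a rough illustration of the two tags described above, here is a minimal Python sketch that builds them as strings. The helper names `phoneme` and `pause` are hypothetical (they are not part of any ElevenLabs SDK), and the 3-second clamp simply mirrors the limit quoted above.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: 3 seconds is the ceiling quoted above for <break>.
MAX_BREAK_SECONDS = 3.0


def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap one word in an SSML <phoneme> tag, escaping XML-special characters."""
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )


def pause(seconds: float) -> str:
    """Build a <break> tag, clamped to the range 0..MAX_BREAK_SECONDS."""
    clamped = max(0.0, min(seconds, MAX_BREAK_SECONDS))
    return f'<break time="{clamped:g}s" />'


# Text like this only makes sense when enable_ssml_parsing=true is set on the socket.
text = f"Hi! My name is {phoneme('Saoirse', 'ˈsɪərʃə')}.{pause(0.5)} Nice to meet you."
```

Escaping the word and attribute values keeps the streamed text well-formed if a term happens to contain `&`, `<`, or quotes.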
Model/language and normalization tips that affect pronunciation
- If the prompt is short/ambiguous (e.g., numbers, acronyms), set language_code on the WebSocket URL to steer normalization/pronunciation toward the right language. (elevenlabs.io)
- If normalization (e.g., reading numbers) hurts your use case, tune apply_text_normalization (auto/on/off) in the WebSocket query. (elevenlabs.io)
Practical workflow
- Build and test a PLS dictionary (IPA/CMU and/or alias) for your domain terms; upload via API or Studio. (elevenlabs.io)
- Open the WebSocket with enable_ssml_parsing=true and include your pronunciation dictionary references in the initialize message. (elevenlabs.io)
- For occasional fixes, embed SSML <phoneme> in the streamed text; for recurring terms, rely on the dictionary. If needed, use phonetic re‑spelling as a fallback. (help.elevenlabs.io)
If you want, I can point you to the exact endpoints to create/list dictionaries and show an example initialize message payload next. (elevenlabs.io)
Yes, can I have the API for initialising the websocket with dictionaries?
Yes. Use the Text‑to‑Speech WebSocket endpoint and include your pronunciation dictionaries in the very first “initialize connection” message.
Endpoint and handshake
- URL: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
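- Authentication: send the xi-api-key: YOUR_API_KEY header, or include xi_api_key in the initialize message if your client cannot set headers (as in the payloads later in this thread).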
- Optional query params include model_id, output_format, etc. (elevenlabs.io)
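A small sketch of how the connection URL could be assembled from the query parameters mentioned in this thread (model_id, output_format, enable_ssml_parsing, language_code, apply_text_normalization). The function name `build_ws_url` is hypothetical, and which parameters you actually need is an assumption to check against the WebSocket reference.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE = "wss://api.elevenlabs.io/v1/text-to-speech"


def build_ws_url(
    voice_id: str,
    model_id: Optional[str] = None,
    output_format: Optional[str] = None,
    enable_ssml_parsing: bool = False,
    language_code: Optional[str] = None,
    apply_text_normalization: Optional[str] = None,  # "auto", "on" or "off"
) -> str:
    """Build the stream-input URL, adding only the parameters that were set."""
    params = {}
    if enable_ssml_parsing:
        params["enable_ssml_parsing"] = "true"
    if model_id:
        params["model_id"] = model_id
    if output_format:
        params["output_format"] = output_format
    if language_code:
        params["language_code"] = language_code
    if apply_text_normalization:
        params["apply_text_normalization"] = apply_text_normalization

    url = f"{BASE}/{quote(voice_id, safe='')}/stream-input"
    query = urlencode(params)
    return f"{url}?{query}" if query else url
```

For example, `build_ws_url("VOICE_ID", model_id="eleven_turbo_v2", enable_ssml_parsing=True)` selects a phoneme-capable model and turns on SSML parsing in one place.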
Where to pass dictionaries
- Pronunciation dictionaries must be set in the initial message (they are not changeable later in the same connection). (elevenlabs.io)
Field to use
- Use pronunciation_dictionary_locators, an array of objects, each containing pronunciation_dictionary_id and version_id. This field uses the same locator shape as the HTTP Text‑to‑Speech API. (elevenlabs.io)
Example initialize message (JSON sent right after opening the socket):
{
  "text": " ",
  "pronunciation_dictionary_locators": [
    { "pronunciation_dictionary_id": "DICT_ID_1", "version_id": "VERSION_ID_1" }
  ]
}
You may also include voice_settings and generation_config in the same initialize message if needed. (elevenlabs.io)
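If you build this payload in code, something like the following Python sketch may help keep the shape consistent. `build_init_message` is a hypothetical helper, not an SDK function; the limit of 3 locators is taken from a later answer in this thread and should be treated as an assumption.

```python
import json

# Assumption: a later answer in this thread mentions "up to 3 locators".
MAX_LOCATORS = 3


def build_init_message(locators, voice_settings=None, api_key=None) -> str:
    """Serialize the first WebSocket message.

    `locators` is a list of (pronunciation_dictionary_id, version_id) pairs,
    kept in the order in which the dictionaries should be applied.
    """
    if len(locators) > MAX_LOCATORS:
        raise ValueError(f"at most {MAX_LOCATORS} dictionaries can be attached")

    message = {"text": " "}
    if voice_settings is not None:
        message["voice_settings"] = voice_settings
    if api_key is not None:
        # Only needed when the key is not sent as a header.
        message["xi_api_key"] = api_key
    message["pronunciation_dictionary_locators"] = [
        {"pronunciation_dictionary_id": dict_id, "version_id": version_id}
        for dict_id, version_id in locators
    ]
    return json.dumps(message)


init = build_init_message([("DICT_ID_1", "VERSION_ID_1")])
```

Send the returned string as the first frame after the socket opens; later text messages then carry only the text to synthesize.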
Notes
- WebSockets are not available for the eleven_v3 model. (elevenlabs.io)
References
- WebSocket API reference (handshake, message flow): WebSocket | ElevenLabs Documentation. (elevenlabs.io)
- WebSocket guide (dictionaries must be in Initialize Connection): Generate audio in real-time. (elevenlabs.io)
- HTTP Text‑to‑Speech body (locator object shape): Create speech. (elevenlabs.io)
- Cookbook example using pronunciation_dictionary_locators (locator fields shown): Using pronunciation dictionaries. (elevenlabs.io)
Sources:
WebSocket | ElevenLabs Documentation
Generate audio in real-time | ElevenLabs Documentation
Create speech | ElevenLabs Documentation
Using pronunciation dictionaries | ElevenLabs Documentation
I have tried creating a dictionary and passing it in as part of pronunciation_dictionary_locators. But when the voice gets to Saoirse it just pauses and skips the word. The dictionary looks like this:
<?xml version='1.0' encoding='utf-8'?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd" alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>C#</grapheme>
    <alias>C-Sharp</alias>
  </lexeme>
  <lexeme>
    <grapheme>Saoirse</grapheme>
    <phoneme alphabet="ipa">ˈsɪərʃə</phoneme>
  </lexeme>
</lexicon>
Thanks for the snippet — the XML itself is fine, but the behavior you’re seeing matches a known limitation: phoneme entries are skipped by models that don’t support them. ElevenLabs documents that phoneme tags only work with these TTS models: Eleven Flash v2, Eleven Turbo v2 and Eleven English v1. Other models (including the v2.5 models commonly used for WebSocket streaming) will silently skip phoneme entries, which presents as a pause or missing word. (help.elevenlabs.io)
You have two supported fixes:
- Keep WebSocket streaming on a v2.5 model and switch your Saoirse rule to an alias (word substitution) instead of a phoneme. Alias rules work with all models. (elevenlabs.io)
- If you specifically need phonemes, set the WebSocket model to a phoneme‑compatible one (eleven_flash_v2 or eleven_turbo_v2) via the model_id query parameter on the WebSocket URL. The WebSocket API supports model_id in the handshake. (elevenlabs.io)
Including dictionaries in the WebSocket init
- Pronunciation dictionaries must be provided in the initial “Initialize Connection” message (not per text chunk). (elevenlabs.io)
- Use pronunciation_dictionary_locators with up to 3 locators, applied in order. Each locator contains the dictionary id and version_id. (The same locators structure is documented on the streaming TTS endpoints and applies here.) (elevenlabs.io)
Example initialize message (shape shown here; include your API key and your dictionary/version IDs; the model is chosen with the model_id query parameter on the WebSocket URL, not in this message):
{
  "text": " ",
  "xi_api_key": "<YOUR_API_KEY>",
  "pronunciation_dictionary_locators": [
    {"pronunciation_dictionary_id": "<DICT_ID>", "version_id": "<VERSION_ID>"}
  ]
}
Docs tips for your PLS file
- ElevenLabs supports IPA and CMU; PLS files are case‑sensitive (add entries for both Saoirse and saoirse if needed). (elevenlabs.io)
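Because rules are case-sensitive, one option is to generate the PLS file and emit a lexeme for each casing you expect. The sketch below does that for alias rules only, using the standard library. `build_pls` is a hypothetical helper, and the respelling "Seersha" is just an illustrative guess to tune by ear, not a documented value.

```python
import xml.etree.ElementTree as ET

PLS_URI = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize the PLS namespace as the default xmlns instead of an ns0: prefix.
ET.register_namespace("", PLS_URI)


def build_pls(aliases: dict, lang: str = "en-US") -> str:
    """Return a PLS lexicon with alias rules for several casings of each term."""
    lexicon = ET.Element(
        f"{{{PLS_URI}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", XML_LANG: lang},
    )
    seen = set()
    for grapheme, alias in aliases.items():
        # Original, lower-case and upper-case variants; duplicates are skipped.
        for variant in (grapheme, grapheme.lower(), grapheme.upper()):
            if variant in seen:
                continue
            seen.add(variant)
            lexeme = ET.SubElement(lexicon, f"{{{PLS_URI}}}lexeme")
            ET.SubElement(lexeme, f"{{{PLS_URI}}}grapheme").text = variant
            ET.SubElement(lexeme, f"{{{PLS_URI}}}alias").text = alias
    return ET.tostring(lexicon, encoding="unicode")


pls_xml = build_pls({"C#": "C-Sharp", "Saoirse": "Seersha"})
```

Upload the resulting file as a new dictionary version, then pass that version_id in the initialize message.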
Relevant docs
- WebSocket guide: pronunciation dictionaries must be in Initialize Connection. (elevenlabs.io)
- WebSocket API reference (model_id in handshake). (elevenlabs.io)
- Stream TTS reference (locator structure, up to 3, applied in order). (elevenlabs.io)
- Pronunciations Editor note on phoneme‑supported models. (help.elevenlabs.io)
- Cookbook: phoneme tags only work with eleven_flash_v2, eleven_turbo_v2, eleven_monolingual_v1; others skip the word. (elevenlabs.io)
- Models overview (to pick a compatible model id). (elevenlabs.io)
If you switch to an alias for Saoirse (keeping your C# alias as
This is what I am using:
wss://api.elevenlabs.io/v1/text-to-speech/uZ0CK4nTskrg7ESCCAQr/stream-input?enable_ssml_parsing=true&model_id=eleven_turbo_v2&output_format=pcm_24000
{
  text: ' ',
  voice_settings: { stability: 0.5, similarity_boost: 0.8, speed: 1 },
  xi_api_key: '<redacted>',
  pronunciation_dictionary_locators: [
    {
      pronunciation_dictionary_id: 'xYlMUZJfo5gAsM4ChMP1',
      version_id: 'KNV2rydOuAHR6GZ9dAqj'
    }
  ]
}
Short answer: your WebSocket init payload shape looks correct, and pronunciation dictionaries must be sent only in the initial “Initialize Connection” message. If “Saoirse” is being skipped, it’s usually a model/phoneme support or dictionary-version issue.
What to verify and adjust:
- Send dictionaries in the very first message only. Pronunciation dictionaries are applied at initialization (not per-message). (elevenlabs.io)
- Use a model that supports phoneme rules. Phoneme tags/dictionaries are supported on Eleven Turbo v2 and Flash v2 (and English v1). Turbo v2.5 does not support phoneme tags; on unsupported models the word may be skipped. You are on eleven_turbo_v2, which is supported. (help.elevenlabs.io)
- Confirm you are passing the latest dictionary version_id that actually contains the Saoirse rule. You can download the exact PLS applied via: GET /v1/pronunciation-dictionaries/:dictionary_id/:version_id/download and check the contents. (elevenlabs.io)
- PLS structure: put the alphabet on the lexicon element; phoneme content should follow the examples in docs (IPA or CMU). Also remember rules are case‑sensitive and the first matching rule wins—add entries for all casings you need. (elevenlabs.io)
- If IPA keeps failing, try CMU Arpabet for this word; CMU is noted as more predictable with current models. (help.elevenlabs.io)
- To isolate whether the issue is with the dictionary or the model, test the same word once using SSML <phoneme> in the text (with enable_ssml_parsing=true) on a supported model; if SSML works but the dictionary doesn’t, re-check the version_id and PLS formatting. (help.elevenlabs.io)
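To check which rules a given version_id actually contains, you could download the PLS and list its graphemes. The sketch below only builds the request and parses a PLS document; it assumes the REST API lives on the same api.elevenlabs.io host as the WebSocket endpoint, and both helper names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

API_BASE = "https://api.elevenlabs.io"  # assumption: same host as the WebSocket endpoint
PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"


def build_download_request(dictionary_id: str, version_id: str, api_key: str):
    """Prepare (but do not send) the GET request for one dictionary version."""
    url = (
        f"{API_BASE}/v1/pronunciation-dictionaries/"
        f"{dictionary_id}/{version_id}/download"
    )
    return urllib.request.Request(url, headers={"xi-api-key": api_key}, method="GET")


def pls_graphemes(pls_bytes: bytes) -> set:
    """Return every <grapheme> text found in a PLS document."""
    root = ET.fromstring(pls_bytes)
    return {g.text for g in root.iter(f"{PLS_NS}grapheme")}


# Usage (performs a network call, so it is left commented out):
# request = build_download_request("<DICT_ID>", "<VERSION_ID>", "<YOUR_API_KEY>")
# with urllib.request.urlopen(request) as response:
#     print("Saoirse" in pls_graphemes(response.read()))
```

If the word is missing from the downloaded file, the locator is pointing at an older version of the dictionary.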
For reference:
- WebSocket guide (real-time TTS, and that dictionaries must be in the init message). (elevenlabs.io)
- WebSocket API reference (endpoint/params). (elevenlabs.io)
- Cookbook showing pronunciation_dictionary_locators usage and PLS examples (same object structure you’re sending). (elevenlabs.io)
- Best practices for pronunciation
Ok, I have tried it with SSML in the string.
Hi! My name is <phoneme alphabet="ipa" ph="ˈsɪərʃə">Saoirse</phoneme>. I like to write in <sub alias="C-Sharp">C#</sub> and I live in Worcestershire
The pause instead of Saoirse is gone but the sub is not respected.
Short answer: ElevenLabs does not currently support the SSML <sub> tag. The only SSML tags documented for Text-to‑Speech are:
- <break> (pauses, up to 3s)
- <phoneme> (CMU/IPA; supported on English v1, Turbo v2, and Flash v2)
Because <sub> isn’t parsed, “C#” won’t be replaced by “C‑Sharp,” which is why your example ignores it. Use one of these supported options instead:
- Wrap the term in a phoneme tag: <phoneme alphabet="cmu-arpabet" ph="...">C#</phoneme> (or IPA), and keep enable_ssml_parsing=true. (help.elevenlabs.io)
- Apply an alias via a pronunciation dictionary (map C# → C-Sharp). Alias rules work on all models, and the dictionary you already pass in pronunciation_dictionary_locators in the WebSocket initialize message is the place for this rule, so confirm that the version_id you send contains the C# alias. As a fallback, pre-replace “C#” with “C-Sharp” in the text you stream. (elevenlabs.io)
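The pre-replacement fallback can be done client-side before each chunk is streamed. Below is a minimal sketch; `apply_aliases` is a hypothetical helper, and it is case-sensitive on purpose so that it behaves like a PLS alias rule.

```python
import re


def apply_aliases(text: str, aliases: dict) -> str:
    """Replace whole-term occurrences of each key with its alias (case-sensitive)."""
    if not aliases:
        return text
    # Longest terms first so a longer term wins over a shorter one it starts with.
    terms = sorted(aliases, key=len, reverse=True)
    # \b does not work next to '#', so use "not adjacent to a word character".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(term) for term in terms) + r")(?!\w)"
    )
    return pattern.sub(lambda match: aliases[match.group(0)], text)


spoken = apply_aliases(
    "I like to write in C# and I live in Worcestershire",
    {"C#": "C-Sharp"},
)
```

One caveat: if your text arrives in small chunks, a term split across two chunks will not match, so apply the replacement before chunking.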
References:
- SSML break/phoneme support and model limits. (help.elevenlabs.io)
- Prompting/best‑practices (phoneme tags, alias via dictionaries; dictionaries usable with Speech Synthesis via API). (elevenlabs.io)
- WebSocket API (enable_ssml_parsing). (elevenlabs.io)
- HTTP TTS endpoints supporting pronunciation_dictionary_locators. (elevenlabs.io)
Sources:
Do pauses and SSML phoneme tags work with the API? – ElevenLabs
Create speech | ElevenLabs Documentation
Best practices | ElevenLabs Documentation
WebSocket | ElevenLabs Documentation
Does eleven_flash_v2_5 or eleven_turbo_v2_5 support dictionaries at all?
Yes. Both eleven_flash_v2_5 and eleven_turbo_v2_5 can use pronunciation dictionaries, but only alias entries are applied; phoneme entries are ignored. Phoneme-based rules are supported on older models (e.g., Eleven Turbo v2 and Eleven Flash v2), not on v2.5. (elevenlabs.io)
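One way to catch this before connecting is to scan the PLS for phoneme rules and compare against the model you plan to use. The sketch below is an assumption-laden check, not an official validator: the model list is copied from the answers in this thread, and `phoneme_rules_at_risk` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

# Assumption: model ids named earlier in this thread as supporting phoneme rules.
PHONEME_MODELS = {"eleven_flash_v2", "eleven_turbo_v2", "eleven_monolingual_v1"}


def phoneme_rules_at_risk(pls_xml: str, model_id: str) -> list:
    """List graphemes whose rule is a <phoneme> when the model may ignore phonemes."""
    if model_id in PHONEME_MODELS:
        return []
    # Encode to bytes so an XML declaration with an encoding attribute is accepted.
    root = ET.fromstring(pls_xml.encode("utf-8"))
    at_risk = []
    for lexeme in root.iter(f"{PLS_NS}lexeme"):
        if lexeme.find(f"{PLS_NS}phoneme") is not None:
            at_risk.extend(g.text for g in lexeme.findall(f"{PLS_NS}grapheme"))
    return at_risk


SAMPLE_PLS = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme><grapheme>C#</grapheme><alias>C-Sharp</alias></lexeme>
  <lexeme><grapheme>Saoirse</grapheme><phoneme>ˈsɪərʃə</phoneme></lexeme>
</lexicon>"""

print(phoneme_rules_at_risk(SAMPLE_PLS, "eleven_turbo_v2_5"))
```

Any grapheme it reports is a candidate for conversion to an alias rule if you want to stay on a v2.5 model.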
Sources:
Pronunciation dictionaries | ElevenLabs Documentation
Is there any benefit to using dictionaries over inline SSML?