I'm generating Japanese speech from Japanese text using the ELEVEN LABS API, but the quality is oft | ElevenLabs

serene raptor BOT Sep 21, 2025, 4:24 PM

AI-generated Answer for pingtinker

Here are the most reliable ways to improve Japanese TTS accuracy with the ElevenLabs API, based only on our docs:

Model and language control

Prefer Flash v2.5 for low-latency multilingual generation, or Multilingual v2 for the most life‑like prosody. Switch models via model_id. (help.elevenlabs.io)
When using Turbo v2.5 or Flash v2.5, enforce Japanese by setting language_code to "ja" in your Create speech request. This is an API-only, model‑specific feature. (elevenlabs.io)
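The model choice above could be sketched as follows. This is a minimal illustration, not an official client: the helper name buildSpeechBody is hypothetical, and the model IDs are assumed to be eleven_multilingual_v2, eleven_turbo_v2_5, and eleven_flash_v2_5.

```javascript
// Models that accept language_code, per the answer above (assumed model IDs).
const LANGUAGE_CODE_MODELS = new Set(["eleven_turbo_v2_5", "eleven_flash_v2_5"]);

// Hypothetical helper: builds the JSON body for a Create speech request.
function buildSpeechBody(text, modelId, languageCode = "ja") {
  const body = { text, model_id: modelId };
  // Only send language_code to models that support enforcing a language.
  if (LANGUAGE_CODE_MODELS.has(modelId)) {
    body.language_code = languageCode;
  }
  return body;
}

console.log(buildSpeechBody("日本語のサンプル文です。", "eleven_flash_v2_5"));
console.log(buildSpeechBody("日本語のサンプル文です。", "eleven_multilingual_v2"));
```

Keeping the language_code decision in one place makes it harder to send the field to a model that does not use it.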

Text normalization (key for numbers, dates, acronyms)

Use apply_text_normalization to have the service spell out numbers/dates. For v2.5 models, you can use auto or off (on is not allowed); on Multilingual v2 you can set on if you prefer strict normalization. (elevenlabs.io)
Additionally, enable apply_language_text_normalization=true to improve pronunciation in Japanese; note this increases latency. (elevenlabs.io)
Regardless of settings, you can also pre-write numbers/dates/acronyms in words to remove ambiguity. (help.elevenlabs.io)
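As a rough sketch of both approaches, the snippet below pre-writes numbers in kana and adds the two normalization flags to a request body. The reading table and the helper names preExpand and withNormalization are hypothetical; only the parameter names come from the answer above.

```javascript
// Hypothetical reading table: written form -> kana reading you want spoken.
const READINGS = [
  ["2025年", "にせんにじゅうごねん"],
  ["3匹", "さんびき"],
];

// Replace every occurrence of each written form with its reading.
function preExpand(text, readings = READINGS) {
  return readings.reduce(
    (out, [written, spoken]) => out.split(written).join(spoken),
    text
  );
}

// Hypothetical helper: adds the normalization flags named above to a body.
function withNormalization(body, { mode = "auto", languageNormalization = true } = {}) {
  return {
    ...body,
    apply_text_normalization: mode, // "auto", "on", or "off"
    apply_language_text_normalization: languageNormalization, // adds latency
  };
}

const normalizedBody = withNormalization({
  text: preExpand("2025年に魚を3匹つかまえました。"),
  model_id: "eleven_multilingual_v2",
});
console.log(normalizedBody);
```

Pre-expanding in your own code works the same way on every model, so it is a reasonable default when the built-in normalization is unavailable or too slow.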

Voice choice and accent

Use a native Japanese voice for the cleanest accent: pick one from the Voice Library with the Japanese filter, or clone a Japanese speaker with Professional Voice Cloning (PVC), which supports Japanese. Default and Voice‑Design voices are English and can carry an English accent when speaking other languages. (help.elevenlabs.io)

Pronunciations for tricky words/names

Apply pronunciation dictionaries in API calls via pronunciation_dictionary_locators (up to 3 per request). In Studio, you can build these with the Pronunciations Editor. Note: phoneme tags are supported with Flash v2, Turbo v2, and English v1. (elevenlabs.io)
If a word is still mispronounced and your model does not support phoneme SSML (only English v1, Turbo v2, and Flash v2 do), write the word more phonetically as a fallback. (help.elevenlabs.io)
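A minimal sketch of both options, assuming each locator has the shape { pronunciation_dictionary_id, version_id }. The helper names and the placeholder IDs are hypothetical, and the place name 東雲 (read しののめ) is only an illustration.

```javascript
const MAX_LOCATORS = 3; // per-request limit stated above

// Hypothetical helper: attaches dictionary locators to a request body.
function withDictionaries(body, locators) {
  if (locators.length > MAX_LOCATORS) {
    throw new RangeError(`At most ${MAX_LOCATORS} dictionaries per request`);
  }
  return { ...body, pronunciation_dictionary_locators: locators };
}

// Fallback: spell a misread word phonetically (here, in hiragana).
function spellPhonetically(text, word, reading) {
  return text.split(word).join(reading);
}

const dictRequest = withDictionaries(
  {
    text: spellPhonetically("東雲駅で待ち合わせです。", "東雲", "しののめ"),
    model_id: "eleven_multilingual_v2",
  },
  // Placeholder IDs; substitute the values for your own dictionary.
  [{ pronunciation_dictionary_id: "YOUR_DICTIONARY_ID", version_id: "YOUR_VERSION_ID" }]
);
console.log(dictRequest.text);
```

Checking the limit before sending gives a clearer error than waiting for the API to reject the request.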

Voice settings (tune per voice)

Adjust stability, similarity_boost, use_speaker_boost, style, and speed per request via voice_settings; typical starting points are stability ≈0.5 and similarity ≈0.75, then tweak minimally. Avoid high style if you see instability. Slightly reducing speed can improve intelligibility. (elevenlabs.io)
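One way to keep these settings consistent across requests is a small builder that starts from the suggested values and applies per-voice overrides. This is a sketch only: buildVoiceSettings is a hypothetical helper, and the 0 to 1 range used for clamping is an assumption.

```javascript
// Clamp a number into [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: suggested starting points plus caller overrides.
// Assumes stability, similarity_boost, and style are fractions in 0..1.
function buildVoiceSettings(overrides = {}) {
  const settings = {
    stability: 0.5,
    similarity_boost: 0.75,
    style: 0,
    use_speaker_boost: true,
    speed: 1.0,
    ...overrides,
  };
  for (const key of ["stability", "similarity_boost", "style"]) {
    settings[key] = clamp(settings[key], 0, 1);
  }
  return settings;
}

// Slightly slower speech, everything else at the starting points.
console.log(buildVoiceSettings({ speed: 0.95 }));
```

The result can be passed as the voice_settings field of the request body.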

Stitching long texts

For multi‑chunk synthesis, pass previous_text/next_text or previous_request_ids/next_request_ids to maintain continuity across segments, reducing odd prosody shifts. (elevenlabs.io)
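The previous_text/next_text variant could look roughly like the sketch below. It splits on the Japanese full stop and gives each chunk its immediate neighbours as context; the helper names are hypothetical, and passing more surrounding text than one sentence is equally possible.

```javascript
// Split Japanese text into sentence chunks, keeping the closing "。".
function splitSentences(text) {
  return text.split(/(?<=。)/).filter((s) => s.trim().length > 0);
}

// Hypothetical helper: one request body per chunk, with neighbouring
// sentences supplied as previous_text / next_text for continuity.
function buildStitchedBodies(text, modelId = "eleven_multilingual_v2") {
  const chunks = splitSentences(text);
  return chunks.map((chunk, i) => {
    const body = { text: chunk, model_id: modelId };
    if (i > 0) body.previous_text = chunks[i - 1];
    if (i < chunks.length - 1) body.next_text = chunks[i + 1];
    return body;
  });
}

const stitched = buildStitchedBodies("川へ行きました。魚をつかまえました。家に帰りました。");
console.log(stitched);
```

The first chunk has no previous_text and the last has no next_text, so those fields are simply left out rather than sent empty.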

little dust Sep 21, 2025, 4:42 PM

When I generate speech from the text box on the ElevenLabs homepage, the Japanese output is highly accurate.
Presumably that also uses a model suited to Japanese speech. How should I set the parameters to get the same accuracy?
For example, which model is used, and what values do the various parameters take?

serene raptor BOT Sep 21, 2025, 4:42 PM

AI-generated Answer for pingtinker

Short answer: ElevenLabs hasn’t published which exact model/parameters the website’s Text to Speech UI uses. However, the API’s default model for high‑quality multilingual synthesis (including Japanese) is eleven_multilingual_v2, and that’s the closest match to what you hear in the web generator. For low‑latency with explicit language forcing, use Turbo v2.5 or Flash v2.5 with language_code="ja". (elevenlabs.io)

What to set in the API for accurate Japanese

Model
- Highest fidelity: model_id = "eleven_multilingual_v2" (auto language detection; do not send language_code to this model). (elevenlabs.io)
- Low latency with language enforcement: model_id = "eleven_turbo_v2_5" or "eleven_flash_v2_5" and language_code = "ja". (elevenlabs.io)
Voice choice
- Pick a voice whose accent matches your target region; all ElevenLabs voices are multilingual, but accent fidelity improves when the base voice matches the language/region. (elevenlabs.io)
Voice settings (per request or saved on the voice)
- stability: controls randomness/emotion; higher = steadier/less expressive.
- similarity_boost: how strictly to match the target voice.
- style: style exaggeration; recommended to keep at 0 for stability unless you need a stronger persona.
- use_speaker_boost: slightly improves similarity at a small latency cost.
- speed: keep near 1.0; extreme values may degrade quality.
- Common starting point from the product guide: stability around 50 and similarity around 75 on the UI sliders (0.5 and 0.75 in the API); keep style at 0. Tune minimally from there. (elevenlabs.io)
Text normalization and numbers/dates
- Flash v2.5 prioritizes latency over normalization; for phone numbers, dates, currencies, either pre‑normalize your Japanese text or (Enterprise only) enable apply_text_normalization with v2.5 models. For the best built‑in normalization, prefer Multilingual v2. (elevenlabs.io)
Pronunciation control (names/loanwords/brand terms)
- Attach a pronunciation dictionary (alias rules work with all models) to force specific readings. Use phoneme rules only on models that support them; otherwise use alias rules. (elevenlabs.io)

Minimal examples

High‑fidelity (recommended)
- model_id: "eleven_multilingual_v2"
- body: { text: "日本語のサンプル文です。", model_id: "eleven_multilingual_v2", voice_settings: { stability: 0.5, similarity_boost: 0.75, style: 0, use_speaker_boost: true, speed: 1.0 } }
- Note: Do not include language_code with Multilingual v2. (elevenlabs.io)
Low‑latency with enforced Japanese
- model_id: "eleven_turbo_v2_5" (or "eleven_flash_v2_5")
- body: { text: "日本語のサンプル文です。", model_id: "eleven_turbo_v2_5", language_code: "ja", voice_settings: { stability: 0.5, similarity_boost: 0.75, style: 0 } }
- If numbers/dates sound odd, pre‑expand them in the text; Enterprise users can enable apply_text_normalization for v2.5 models. (elevenlabs.io)
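
To turn one of the bodies above into an actual call, something like the following sketch could be used. It assumes the POST /v1/text-to-speech/{voice_id} endpoint with an xi-api-key header and an output_format query parameter; the helper names and the voice ID and key arguments are placeholders.

```javascript
const API_BASE = "https://api.elevenlabs.io/v1/text-to-speech";

// Hypothetical helper: returns the URL and fetch options without sending anything.
function buildSpeechRequest(voiceId, apiKey, body, outputFormat = "mp3_44100_128") {
  const url =
    `${API_BASE}/${encodeURIComponent(voiceId)}` +
    `?output_format=${encodeURIComponent(outputFormat)}`;
  return {
    url,
    options: {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Sends the request and returns the audio bytes; throws on a non-2xx response.
async function synthesize(voiceId, apiKey, body) {
  const { url, options } = buildSpeechRequest(voiceId, apiKey, body);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Text-to-speech request failed: ${response.status}`);
  }
  return response.arrayBuffer();
}
```

Separating request construction from sending makes it easy to log or inspect the exact body, for example to confirm that language_code is absent when the model is eleven_multilingual_v2.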

little dust Sep 22, 2025, 2:13 AM

For example, when I generate speech for “川で魚をつかまえました。”, the output should be pronounced “Kawa de sakana o tsukamaemashita”, but the actual pronunciation is wrong.
Correct: “Kawa de sakana o tsukamaemashita”
Current incorrect output: “Kawa de hama o tsukamaemashita”

Why does this happen?

This is my code:

const response = await fetch("https://api.elevenlabs.io/v1/text-to-speech/GxxMAMfQkDlnqjpzjLHH?output_format=mp3_44100_128", {
  method: "POST",
  headers: {
    "xi-api-key": "API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "text": "川で魚をつかまえました。",
    "model_id": "eleven_multilingual_v2",
    "language_code": "ja"
  }),
});
