#Why does the SDK not provide the same endpoint as the elevenlabs homepage? It seems the homepage us
Short answer: they're different endpoints by design, and the SDK exposes both. You're likely calling the raw-audio stream, while the homepage uses the JSON-with-timestamps stream.
- The homepage's behavior matches POST /v1/text-to-speech/{voice_id}/stream/with-timestamps, which streams JSON objects that include base64-encoded audio plus character-level alignment. In the SDK this is client.text_to_speech.stream_with_timestamps(...), which yields a stream of JSON chunks. See "Stream speech with timing." (elevenlabs.io)
- The SDK method client.text_to_speech.stream(...) calls POST /v1/text-to-speech/{voice_id}/stream, which streams raw audio bytes over HTTP chunked transfer (not JSON/SSE). This is intended for lowest-latency playback without timing data. See "Stream speech" and the Streaming overview. (elevenlabs.io)
Why audio may differ between the two:
- If you're not passing identical parameters (model_id, voice_settings, output_format, etc.) to both endpoints, the synthesis can vary. Both endpoints share the same request fields. (elevenlabs.io)
- Generation isn't fully deterministic unless you provide a seed; even then it's "best effort," not guaranteed. To minimize variation, set the same seed on both requests. (elevenlabs.io)
What to do:
- If you want the homepage behavior (JSON chunks with timestamps), call stream_with_timestamps in the SDK and parse audio_base64/alignment from each chunk. (elevenlabs.io)
- If you only need audio for immediate playback, use stream (raw bytes). (elevenlabs.io)
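A minimal sketch of parsing the timestamped stream, assuming each chunk has already been turned into a plain dict. The audio_base64 and alignment keys are the ones named above; the three alignment sub-fields used here are an assumption about the chunk schema, and the SDK may hand back objects rather than dicts, so adapt the access pattern accordingly.

```python
import base64
from typing import Iterable, List, Tuple

Timing = Tuple[str, float, float]  # (character, start_seconds, end_seconds)


def collect_timestamped_stream(chunks: Iterable[dict]) -> Tuple[bytes, List[Timing]]:
    """Merge JSON chunks into one audio buffer and a flat per-character timing list."""
    audio = bytearray()
    timings: List[Timing] = []
    for chunk in chunks:
        # Each chunk carries audio as a base64 string; decode it back to raw bytes.
        if chunk.get("audio_base64"):
            audio.extend(base64.b64decode(chunk["audio_base64"]))
        # Alignment may be absent or null on some chunks, so guard before reading it.
        alignment = chunk.get("alignment")
        if alignment:
            # ASSUMPTION: alignment holds three parallel lists with these names.
            timings.extend(zip(
                alignment["characters"],
                alignment["character_start_times_seconds"],
                alignment["character_end_times_seconds"],
            ))
    return bytes(audio), timings
```

The returned bytes can be written to a file or passed to a player, and the timing list can drive captions or highlighting.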
For completeness: there's also a non-streaming "with-timestamps" endpoint that returns a single JSON with base64 audio and alignment if you don't need live streaming. (elevenlabs.io)
Sources:
Stream speech with timing | ElevenLabs Documentation
Stream speech | ElevenLabs Documentation
Create speech with timing | ElevenLabs Documentation
The audio is completely different though, it says a different word.
I.e. Japanese ひ is pronounced "Hai" instead of "Hee".
I'm using the exact same payload and parameters, the only difference is the response type.
Short answer: the response type alone should not change what is spoken. Both endpoints synthesize with the same request fields (voice_id, model_id, voice_settings, etc.); they only differ in how the result is delivered (raw audio stream vs JSON chunks with base64 audio and alignment). Please compare against the API docs for Stream and Stream-with-timestamps. (elevenlabs.io)
Why you're hearing different words (and how to fix it)
- Nondeterminism: TTS generations are stochastic. Use the seed parameter to get best-effort repeatability across runs/endpoints. Determinism is not guaranteed. (elevenlabs.io)
- Model mismatch: If one call is defaulting to a different model than the other (or than the website), output can change. Explicitly set model_id on both requests (e.g., eleven_multilingual_v2, eleven_flash_v2_5, etc.). The Help Center lists this as the most common cause of "sounds different" between surfaces. (help.elevenlabs.io)
- Japanese pronunciation: For kana like ひ, enable apply_language_text_normalization=true (Japanese-only) to improve readings; note it can increase latency. If you use Turbo v2.5 or Flash v2.5, also set language_code="ja" to enforce Japanese. (elevenlabs.io)
- Voice settings variance: Differences in stability/similarity/style can noticeably change pronunciation. Ensure identical voice_settings on both calls. (help.elevenlabs.io)
- Latency "optimizations": If you're passing optimize_streaming_latency, be aware it's deprecated and higher levels can affect quality and even normalization. Leave it unset/None. (elevenlabs.io, help.elevenlabs.io)
Checklist to make both endpoints speak identically
- Set the same: voice_id, model_id, output_format, voice_settings. (elevenlabs.io)
- Add seed with a fixed integer on both requests. (elevenlabs.io)
- For Japanese, set apply_language_text_normalization=true; if on Turbo/Flash v2.5, also set language_code="ja". (elevenlabs.io)
- Ensure optimize_streaming_latency is not used. (elevenlabs.io)
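One way to follow the checklist is to build the request fields once and reuse the same dict for both endpoints, so nothing can drift between the two calls. This is only a sketch: the helper names are hypothetical, and the seed, voice_settings values, and output_format shown are placeholders rather than recommended settings. Where each field goes in the real request (body, query string, or SDK keyword argument) should be checked against the API reference.

```python
from typing import Optional


def build_shared_request(text: str, model_id: str, seed: int,
                         voice_settings: dict, output_format: str,
                         language_code: Optional[str] = None,
                         normalize_japanese: bool = False) -> dict:
    """Return one set of request fields to reuse for both streaming endpoints."""
    request = {
        "text": text,
        "model_id": model_id,
        "seed": seed,
        "voice_settings": dict(voice_settings),  # copy so callers cannot mutate it later
        "output_format": output_format,
    }
    if language_code is not None:
        request["language_code"] = language_code
    if normalize_japanese:
        request["apply_language_text_normalization"] = True
    # optimize_streaming_latency is deliberately never added.
    return request


def endpoint_paths(voice_id: str) -> dict:
    """The two streaming paths discussed in this thread."""
    return {
        "raw": f"/v1/text-to-speech/{voice_id}/stream",
        "timed": f"/v1/text-to-speech/{voice_id}/stream/with-timestamps",
    }


# Placeholder values for illustration only.
shared = build_shared_request(
    text="ひ",
    model_id="eleven_flash_v2_5",
    seed=12345,
    voice_settings={"stability": 0.5, "similarity_boost": 0.75},
    output_format="mp3_44100_128",
    language_code="ja",
    normalize_japanese=True,
)
paths = endpoint_paths("YOUR_VOICE_ID")
```

Sending the same shared fields to paths["raw"] and paths["timed"] leaves the response format as the only intended difference between the two calls.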
If a specific token still mispronounces, you can force a reading with a pronunciation dictionary (alias or phoneme, model-dependent) and attach it via pronunciation_dictionary_locators.
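A rough sketch of attaching a dictionary to a request. The helper name is hypothetical, and the locator shape (a list of objects with pronunciation_dictionary_id and version_id) is an assumption to verify against the API reference; the IDs in the usage below are placeholders.

```python
def with_pronunciation_dictionary(request: dict, dictionary_id: str, version_id: str) -> dict:
    """Return a copy of the request fields with one dictionary locator appended."""
    updated = dict(request)
    locators = list(updated.get("pronunciation_dictionary_locators", []))
    # ASSUMPTION: each locator names a dictionary and a specific version of it.
    locators.append({
        "pronunciation_dictionary_id": dictionary_id,
        "version_id": version_id,
    })
    updated["pronunciation_dictionary_locators"] = locators
    return updated


# Placeholder IDs for illustration only.
base_request = {"text": "ひ", "model_id": "eleven_flash_v2_5"}
request_with_dictionary = with_pronunciation_dictionary(
    base_request, "YOUR_DICTIONARY_ID", "YOUR_VERSION_ID"
)
```

Because the helper copies the request, the same base fields can still be sent without the dictionary to compare the two readings.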
Does the API cache responses and/or is there any different caching behaviour between plan types?