#I want the conversational ai agent to read a fixed script without skipping any text or doing any ch
Short answer: don’t have the agent’s LLM “generate” your script. For word‑for‑word delivery, stream your script directly through ElevenLabs Text‑to‑Speech (TTS), and only use the agent shell for transport/UI. Then, if you wrap this in a Conversational AI agent, disable interruptions so nothing cuts the read short.
Implementation outline (all from official docs)
Use TTS WebSockets to stream your script
- Endpoint: text‑to‑speech WebSocket. Stream text chunks and send flush=true at sentence boundaries so all buffered text is generated; closing the socket also forces any remaining buffered text to render. The guide also documents alignment for word‑level timestamps if you want to verify nothing was skipped.
- The API reference shows enable_ssml_parsing (set true if you include SSML like <break/> or phonemes). It also documents inactivity timeout and multi‑context behavior.
Respect model character limits; choose a model accordingly
- Flash v2.5 and Turbo v2.5: up to 40,000 characters (~40 minutes) per request. Multilingual/English v1/v2: up to 10,000 characters.
- For scripts longer than a single request, split into chunks and use Request Stitching to keep prosody consistent across chunks.
- If you prefer a fully managed long‑form workflow, Studio (formerly Projects) is recommended for “extremely long‑form content such as books.”
Control pauses and pronunciations explicitly (so the read matches your script)
- Pauses: use SSML <break time="1.5s" /> (up to 3s).
- Phonemes: SSML phoneme tags are supported on specific models (e.g., Eleven Flash v2, Eleven Turbo v2) when using the API.
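A minimal sketch of inserting explicit pauses into a script before sending it to TTS. The helper names (`break_tag`, `join_with_pauses`) are made up for illustration, and the 3-second ceiling is simply the limit quoted above; SSML parsing must be enabled on the request for the tags to be honoured.

```python
MAX_BREAK_SECONDS = 3.0  # upper bound quoted in the bullet above


def break_tag(seconds: float) -> str:
    """Return an SSML break tag, rejecting pauses outside the quoted limit."""
    if not 0 < seconds <= MAX_BREAK_SECONDS:
        raise ValueError(f"pause must be in (0, {MAX_BREAK_SECONDS}] seconds")
    return f'<break time="{seconds:g}s" />'


def join_with_pauses(paragraphs: list[str], seconds: float = 1.5) -> str:
    """Join script paragraphs with an explicit pause between each pair."""
    return f" {break_tag(seconds)} ".join(p.strip() for p in paragraphs)
```

Raising on an out-of-range pause, rather than clamping it, keeps the helper from silently changing the timing you wrote into the script.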
If you embed this inside a Conversational AI agent, prevent interruptions
- Disable interruptions so the user cannot barge in while the agent is reading; the docs recommend this “when the complete delivery of information is crucial.” Configure this in the agent’s Advanced settings under Client Events.
Operational tips for reliable full reads
- The WebSocket buffers text until a threshold is reached; send flush=true at the end of each chunk or sentence to force generation of whatever is buffered. Keep the socket alive between messages (the guide documents a 20-second inactivity timeout).
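As a rough illustration of the flush pattern, here is a sketch that only builds the JSON messages you would send over the socket; it does not open a connection. The message shapes (a single-space opener, `flush` on each sentence, an empty string to close) and the `stream-input` URL reflect my reading of the WebSocket guide, so treat them as assumptions and check them against the current API reference. The function names are hypothetical.

```python
import json
import re
from urllib.parse import urlencode


def stream_input_url(voice_id: str, model_id: str, enable_ssml_parsing: bool = False) -> str:
    """Build the TTS WebSocket URL (assumed path: .../stream-input)."""
    query = urlencode({
        "model_id": model_id,
        "enable_ssml_parsing": str(enable_ssml_parsing).lower(),
    })
    return f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?{query}"


def script_messages(script: str):
    """Yield the messages for one full read: opener, sentences, closer."""
    yield json.dumps({"text": " "})  # opener that starts the stream
    for sentence in re.split(r"(?<=[.!?])\s+", script.strip()):
        if sentence:
            # flush=True forces generation of everything buffered so far
            yield json.dumps({"text": sentence + " ", "flush": True})
    yield json.dumps({"text": ""})  # empty text ends the stream
```

Each yielded string would be passed to your WebSocket client's send call in order; authentication and voice settings are left out of this sketch.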
Notes/constraints from docs
- There is no documented “verbatim mode” for agent LLM outputs. To guarantee no paraphrasing, send your exact script to TTS (WebSocket or REST) instead of asking the agent to rephrase or summarize. Then, if needed, let the agent handle only call control/UX with interruptions disabled.
If you share the approximate length (characters) and whether you need SSML, I can point you to a minimal chunking/stitching template from the docs that matches your case.
I don't want to integrate WebSockets
No problem—you can avoid WebSockets entirely by using the Text-to-Speech REST API and, for long scripts, splitting the text within model limits.
What to use
- REST endpoint: POST /v1/text-to-speech/{voice_id} to convert text into audio. Supports output_format, model_id, apply_text_normalization, and continuity fields. (elevenlabs.io)
Make it read the script verbatim (no changes, no skipped text)
- Disable normalization so numbers/dates aren’t rewritten: set apply_text_normalization="off". (elevenlabs.io)
- If you need strict pronunciations for names/acronyms, attach a pronunciation dictionary via pronunciation_dictionary_locators, or use phoneme tags on supported models (Flash v2, Turbo v2, English v1). (elevenlabs.io)
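A minimal sketch of such a request using only the Python standard library. The `xi-api-key` header, the `api.elevenlabs.io` host, and passing `output_format` as a query parameter are how I understand the API reference; the function names and the default model are placeholders, so verify the details before relying on them.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io"


def build_tts_request(api_key: str, voice_id: str, text: str,
                      model_id: str = "eleven_flash_v2_5",
                      output_format: str = "mp3_44100_128") -> urllib.request.Request:
    """Build a POST request that sends the script text unmodified."""
    body = {
        "text": text,
        "model_id": model_id,
        "apply_text_normalization": "off",  # keep numbers and dates as written
    }
    url = f"{API_BASE}/v1/text-to-speech/{voice_id}?output_format={output_format}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Typical use would be `synthesize_to_file(build_tts_request(key, voice_id, script), "script.mp3")`, with `key` and `voice_id` taken from your own account.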
Handle long scripts without WebSockets
- Send the whole script in one request if it fits the model’s character limit (e.g., Flash v2.5 and Turbo v2.5 up to 40,000 characters; Eleven v3 up to 10,000). Otherwise, split into multiple requests. (help.elevenlabs.io)
- Preserve natural continuity across chunks using previous_text/next_text or previous_request_ids/next_request_ids when you stitch the audio together. (elevenlabs.io)
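One possible way to split and stitch, sketched under a few assumptions: sentences end in `.`, `!` or `?` followed by whitespace, no single sentence exceeds the limit, and a few hundred characters of neighbouring text is enough context. The helper names and the `context_chars` default are illustrative, not taken from the docs.

```python
import re


def split_script(script: str, limit: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters.

    Joining the chunks with "" reproduces the script exactly.
    """
    sentences = re.findall(r".+?[.!?]+(?:\s+|\Z)|.+?\Z", script, flags=re.S)
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > limit:
            raise ValueError("a single sentence exceeds the character limit")
        if current and len(current) + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


def stitched_payloads(chunks: list[str], model_id: str, context_chars: int = 500):
    """Yield one request body per chunk with neighbouring text as context."""
    for i, chunk in enumerate(chunks):
        payload = {"text": chunk, "model_id": model_id,
                   "apply_text_normalization": "off"}
        if i > 0:
            payload["previous_text"] = chunks[i - 1][-context_chars:]
        if i + 1 < len(chunks):
            payload["next_text"] = chunks[i + 1][:context_chars]
        yield payload
```

Because the chunks concatenate back to the original script, you can assert `"".join(chunks) == script` before sending anything, which is a cheap check that the split itself dropped no text.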
Audio format and quality
- Choose output_format (e.g., mp3_44100_128, wav/pcm, μ-law, Opus) per your playback needs. (elevenlabs.io)
Optional language control
- You can enforce a language with language_code on Turbo v2.5 and Flash v2.5 (only those models support it). (elevenlabs.io)
Docs to implement
- Create speech (REST): fields, formats, continuity, normalization. (elevenlabs.io)
- Model limits (how many characters per request). (help.elevenlabs.io)
- Supported audio formats. (elevenlabs.io)
- Pronunciation dictionaries (API/SDK usage). (elevenlabs.io)
- Phoneme tags availability. (help.elevenlabs.io)
If you want, tell me your target model and approximate script length, and I’ll outline exact request parameters and chunk counts based on the limits above.
So how do I implement it in the Conversational AI agent?
Here’s how to make your agent read a fixed, long script verbatim without WebSockets, using only the ElevenLabs HTTP APIs:
Use the Text-to-Speech “convert” REST endpoint. POST your full script to /v1/text-to-speech/{voice_id}. Set apply_text_normalization to off so numbers, dates, etc. are not altered. Keep the same voice_id and model_id across all requests.
If the script fits in one request, pick a model with a high character limit (e.g., eleven_flash_v2_5 or eleven_turbo_v2_5 allow up to 40,000 characters per request). Otherwise, split the script into chunks under your chosen model’s limit.
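For a quick estimate of how many requests a script needs, something like the following works. The limits in the table are just the figures quoted in this thread, not an authoritative list, and the result is a lower bound because splitting on sentence boundaries can add a chunk or two.

```python
import math

# Per-request character limits as quoted in this thread (verify before use).
CHAR_LIMITS = {
    "eleven_flash_v2_5": 40_000,
    "eleven_turbo_v2_5": 40_000,
    "eleven_multilingual_v2": 10_000,
}


def min_requests(script_chars: int, model_id: str) -> int:
    """Lower bound on the number of TTS requests for a script of this length."""
    return max(1, math.ceil(script_chars / CHAR_LIMITS[model_id]))
```

For example, a 95,000-character script needs at least 3 requests on eleven_flash_v2_5 and at least 10 on eleven_multilingual_v2.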
For long scripts that you split, preserve continuity across chunks:
- Include previous_text and/or next_text with each chunk so prosody flows naturally when you concatenate audio. You can also thread previous_request_ids/next_request_ids if you need to regenerate a section later without audible seams.
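If you go the request-ID route, the bookkeeping might look like this sketch. The HTTP call is injected as a `send` function (hypothetical) that posts one payload and returns the audio bytes plus that request's ID; I am assuming the ID comes back in a response header and that only the last few IDs (three here) should be passed along, so confirm both against the API reference.

```python
from collections import deque
from typing import Callable

# send(payload) -> (audio_bytes, request_id); supplied by the caller.
Send = Callable[[dict], tuple[bytes, str]]


def synthesize_in_order(chunks: list[str], send: Send,
                        model_id: str = "eleven_multilingual_v2",
                        max_ids: int = 3) -> bytes:
    """Synthesize chunks sequentially, threading recent request IDs forward."""
    recent_ids: deque = deque(maxlen=max_ids)  # keeps only the newest IDs
    audio_parts = []
    for chunk in chunks:
        payload = {"text": chunk, "model_id": model_id,
                   "apply_text_normalization": "off"}
        if recent_ids:
            payload["previous_request_ids"] = list(recent_ids)
        audio, request_id = send(payload)
        recent_ids.append(request_id)
        audio_parts.append(audio)
    return b"".join(audio_parts)
```

Injecting `send` keeps the ordering logic testable without network access; simple byte concatenation of the parts is itself an assumption that tends to hold for MP3 output but may not for every format.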
If you want immediate playback but still avoid WebSockets, use the HTTP streaming endpoint /v1/text-to-speech/{voice_id}/stream. It returns audio progressively over HTTP chunked transfer so you can start playing as it generates.
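A sketch of consuming that stream with the standard library. `build_stream_request` and `copy_audio` are illustrative names, and the header and body fields are my assumptions about the API; `copy_audio` works with any object that has a `read(n)` method, which is what `urllib.request.urlopen` returns.

```python
import json
import urllib.request


def build_stream_request(api_key: str, voice_id: str, text: str,
                         model_id: str = "eleven_flash_v2_5") -> urllib.request.Request:
    """Build a POST request against the HTTP streaming endpoint."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"
    body = {"text": text, "model_id": model_id, "apply_text_normalization": "off"}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def copy_audio(response, sink, chunk_size: int = 4096) -> int:
    """Forward audio to `sink` as it arrives; return the number of bytes copied."""
    total = 0
    while True:
        block = response.read(chunk_size)
        if not block:  # empty read means the server finished the stream
            break
        sink.write(block)
        total += len(block)
    return total
```

In use, you would open the request with `urllib.request.urlopen(...)` and pass the response and a file (or a player's input pipe) to `copy_audio`, so playback can begin before generation finishes.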
To check that nothing was skipped, request timestamps with:
- Convert with timestamps: /v1/text-to-speech/{voice_id}/with-timestamps (returns audio plus per-character timing). Or,
- Stream with timestamps: /v1/text-to-speech/{voice_id}/stream/with-timestamps (streamed JSON chunks with base64 audio and alignment). You can also post-verify with the Forced Alignment API by aligning the final audio against your exact script.
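A sketch of the timestamp check, assuming the JSON response carries `audio_base64` and an `alignment` object with a `characters` list, which is how I read the with-timestamps reference. A match shows that every character of the script was received and timed; it is not an acoustic proof, which is where Forced Alignment on the final audio is the stronger check. Function names are illustrative.

```python
import base64
from typing import Optional


def first_mismatch(script: str, spoken: str) -> Optional[int]:
    """Return the index where the two strings first differ, or None if equal."""
    for i, (expected, actual) in enumerate(zip(script, spoken)):
        if expected != actual:
            return i
    if len(script) != len(spoken):
        return min(len(script), len(spoken))  # one string is a prefix of the other
    return None


def check_response(script: str, response_json: dict) -> tuple[bytes, Optional[int]]:
    """Decode the audio and compare the aligned characters with the script."""
    spoken = "".join(response_json["alignment"]["characters"])
    audio = base64.b64decode(response_json["audio_base64"])
    return audio, first_mismatch(script, spoken)
```

With the streaming variant, you would concatenate the `characters` lists from all chunks before comparing; if normalization is left on, differences around numbers and dates are expected.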
If your script has multiple speakers but is still fixed, you can submit a single request to the Text-to-Dialogue endpoints with a list of {voice_id, text} pairs; HTTP streaming is also available.
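For the multi-speaker case, the request body might be assembled like this. The `inputs` field name, the `/v1/text-to-dialogue` path, and the `eleven_v3` default are my assumptions about the endpoint, and the voice IDs in the usage note are placeholders.

```python
def dialogue_payload(lines: list[tuple[str, str]], model_id: str = "eleven_v3") -> dict:
    """Build a Text-to-Dialogue body from ordered (voice_id, text) pairs."""
    return {
        "model_id": model_id,
        "inputs": [{"voice_id": voice_id, "text": text} for voice_id, text in lines],
    }
```

For instance, `dialogue_payload([("VOICE_A", "Welcome."), ("VOICE_B", "Thanks for calling.")])` keeps the lines in script order, and the resulting dict would be sent as JSON in a POST to the dialogue endpoint.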
Optional controls to keep pronunciation consistent:
- language_code to enforce the intended language (supported on specific models), and pronunciation_dictionary_locators to apply your dictionary for names/terms.
Notes:
- The Conversational AI Agents platform is real-time and exposed via WebSockets for live conversations; since you don’t want WebSockets, prefer the Text-to-Speech REST endpoints above for scripted playback.
If you share roughly how long the script is (characters) and whether you need timestamps during playback, I can suggest the exact endpoint and chunking plan.
Can't I just use the knowledge base?