# Our agent is speaking the audio tags for expressiveness out loud. How can we disable this?
Our agent is speaking the audio tags for expressiveness out loud. How can we disable this?
Short answer: your TTS model is speaking the tags literally. Fix it by either having the model interpret the tags, or by preventing tags from reaching TTS.
Do one of the following:
- Use a model that interprets audio tags instead of reading them: switch the agent’s TTS to Eleven v3 Conversational. Expressive Mode (with tag interpretation like [laughs], [whispers], [sighs]) is enabled by default when you select Eleven v3 Conversational for the agent. This makes tags performative rather than spoken. [Docs: Expressive Mode] https://elevenlabs.io/docs/agents-platform/customization/voice/expressive-mode [Help: What is Eleven v3?] https://help.elevenlabs.io/hc/en-us/articles/35869054119057-What-is-Eleven-v3 [Help: How do audio tags work with Eleven v3?] https://help.elevenlabs.io/hc/en-us/articles/35869142561297-How-do-audio-tags-work-with-Eleven-v3-Alpha
- If you don’t want tags at all, stop the LLM from emitting them and/or strip them before TTS:
  - Update the agent’s system prompt to avoid explicit tags (the docs note that expressive behavior can be guided via the prompt and that explicit tags aren’t required). [Docs: Expressive Mode] https://elevenlabs.io/docs/agents-platform/customization/voice/expressive-mode
  - Programmatically remove square‑bracketed audio tags from the text before sending it to TTS (the prompting/normalization guide recommends regex-based text normalization when prompting via code). [Docs: Best practices – Prompting/Normalization] https://elevenlabs.io/docs/best-practices/prompting/normalization
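As a rough illustration of the regex-based stripping mentioned above, here is a minimal Python sketch. The function name `strip_audio_tags` and the tag pattern are assumptions made for this example, not part of any ElevenLabs SDK; the pattern assumes tags are short, letter-only phrases in square brackets (such as [laughs] or [whispers]), and where you hook this into your pipeline depends on how your agent passes LLM output to TTS.

```python
import re

# Assumption: an audio tag is a short bracketed phrase made of letters,
# spaces, hyphens, or apostrophes, e.g. [laughs] or [whispers].
# Bracketed numbers such as [1] are deliberately left untouched.
AUDIO_TAG = re.compile(r"\[[A-Za-z][A-Za-z '\-]{0,39}\]")


def strip_audio_tags(text: str) -> str:
    """Remove bracketed audio tags and tidy the whitespace they leave behind."""
    without_tags = AUDIO_TAG.sub("", text)
    # Collapse runs of spaces/tabs created by removing a tag mid-sentence.
    collapsed = re.sub(r"[ \t]{2,}", " ", without_tags)
    # Drop any space left dangling in front of punctuation.
    collapsed = re.sub(r"\s+([,.!?;:])", r"\1", collapsed)
    return collapsed.strip()


if __name__ == "__main__":
    print(strip_audio_tags("[laughs] That's great! [whispers] Don't tell anyone."))
```

If your prompts legitimately produce other bracketed phrases (for example placeholders like [name]), this pattern would remove those too, so an explicit allow-list of tag names may be the safer choice.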
Optional (transcripts only): if you just want to hide tags in your web widget transcript, enable widget.strip_audio_tags (defaults to true as of Mar 2, 2026). This affects transcript display, not spoken audio. [Changelog] https://elevenlabs.io/docs/changelog/2026/3/2
Note: If you keep using non‑v3 models (e.g., Turbo/Flash) and you leave tags in the text, they may be spoken verbatim; the pacing/emotion guide cautions that explicit dialogue cues in text can be read aloud. [Docs: Controls – Pacing and Emotion] https://elevenlabs.io/docs/product/prompting/pacing-and-emotion
TTS v3 Conversational is already enabled, and we're using v3 models. How come the tags are still being spoken?