Hello Deepgram Support & Community,
For context, our team uses LiveKit to build voice agents, and we rely on the Aura-2 model for text-to-speech (via the Python SDK). One of our business use cases is navigating IVR systems, where our agent must speak alphanumeric characters accurately in order to proceed in the conversation.
To keep the pronunciation of alphanumeric characters consistent, we transform each alphanumeric ID using the following rules:
- Convert each digit to its word form (e.g. 1 -> one)
- Insert semicolons between groups so that each group is followed by a natural pause.
During one of our production calls, our agent had to speak the alphanumeric ID "AWP186W14507", which we converted into the format "A; W; P; one eight six; W; one four five zero seven".
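For reference, the two rules can be sketched roughly as follows. This is an illustrative sketch only: the names `format_id_for_tts` and `DIGIT_WORDS` are made up for this post, and the optional `group_size` parameter is an assumption for splitting long digit runs into smaller groups.

```python
import re
from typing import Optional

# Word form for each ASCII digit.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def format_id_for_tts(raw_id: str, group_size: Optional[int] = None) -> str:
    """Spell out an alphanumeric ID with semicolon-separated groups.

    Each letter becomes its own group; each run of digits becomes one
    group of digit words, or several groups if group_size is given.
    Characters other than ASCII letters and digits are dropped.
    """
    groups = []
    # Tokens are either a single letter or a full run of digits.
    for token in re.findall(r"[A-Za-z]|[0-9]+", raw_id):
        if token.isdigit():
            size = group_size or len(token)
            for start in range(0, len(token), size):
                chunk = token[start:start + size]
                groups.append(" ".join(DIGIT_WORDS[d] for d in chunk))
        else:
            groups.append(token.upper())
    return "; ".join(groups)


print(format_id_for_tts("AWP186W14507"))
# A; W; P; one eight six; W; one four five zero seven
```

Calling `format_id_for_tts("AWP186W14507", group_size=3)` would instead split the trailing digits into "one four five; zero seven".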
With this converted string, the generated speech consistently mispronounces or skips some of the characters. In this case, the "A" is sometimes skipped entirely, and some generations contain clear hallucinations. NOTE: This is very easy to reproduce in the playground, where every TTS generation we tried produced inconsistent speech. Here is one such request ID made from the playground: 1d1fe27d-fc7f-47c6-9564-84aa87f88816
Our questions for the community and for Deepgram support are as follows:
- Is there a recommended way to format alphanumeric characters so that they are pronounced with 100% accuracy while the speech still has clear, easy-to-follow pauses? (For example, our team wants to speak digits in groups of three and speak each letter separately, with a pause after each.) I have experimented with other formats such as "Alpha; Whiskey.." or "A for apple, W for washington.." with limited success.
- The format our team uses in production works fairly reliably for many alphanumeric strings; this particular ID is the only case where we have seen consistent problems. Is there a known reason why this specific string would fail?
Thanks,
Parashara