Hey! We're building a voice assistant over the phone. We're adding a user interruption feature: when the user interrupts the agent mid-response, the agent should stop speaking. For this we're using Deepgram's TTS Clear method to clear the audio buffer. But we're unsure where in our STT handling the TTS audio buffer should be cleared. Should we do it in speech_started, in on_message, or somewhere else? Please help us.
#Where to clear TTS audio buffer
Hi @waxen dove. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
I understand you're building a voice assistant over phone and want to implement user interruption functionality using Deepgram's TTS clear method. This is a common requirement for natural conversational experiences.
Based on the knowledge sources, you should clear the TTS audio buffer when you detect that the user has started speaking while the agent is still talking. Deepgram provides a specific message for this purpose called UserStartedSpeaking.
The best place to implement this is when you receive the UserStartedSpeaking message from Deepgram's Voice Agent API. This message is specifically designed to notify you when a user begins speaking, allowing you to handle interruptions properly.
Here's how you can implement it, based on the Twilio and Deepgram Voice Agent example:
# Requires `import json`; this loop runs inside an async function.
async for message in sts_ws:
    if isinstance(message, str):
        # Text frames carry JSON events; handle barge-in here
        decoded = json.loads(message)
        if decoded['type'] == 'UserStartedSpeaking':
            # Clear the TTS audio buffer
            clear_message = {
                "type": "Clear"
            }
            await sts_ws.send(json.dumps(clear_message))
            # If using Twilio, you may also need to clear audio on the call
            # as shown in the example
            # await twilio_ws.send(json.dumps({"event": "clear", "streamSid": streamsid}))
When you send the Clear message to the TTS websocket, it will:
- Clear the internal text and audio buffer
- Stop sending new audio chunks as soon as possible
- Return a confirmation message: {"type": "Cleared", "sequence_id": 0}
This approach ensures that when a user interrupts the agent, the system stops generating and sending unneeded audio as soon as possible, which makes the conversation feel more natural and responsive (source: Understanding End-of-Speech Detection).
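If you run separate STT and TTS connections rather than the Voice Agent API, the same idea can be applied in the STT message handler. The sketch below is an illustration under stated assumptions, not a confirmed recipe: it assumes the STT stream emits a JSON text message with "type": "SpeechStarted" (which, to my understanding, requires voice activity events to be enabled on the STT connection), and that `tts_ws` is any object with an async `send` method. `BargeInHandler` and `agent_speaking` are hypothetical names.

```python
import json


class BargeInHandler:
    """Hypothetical helper: clears TTS when the STT stream reports user speech."""

    def __init__(self, tts_ws):
        self.tts_ws = tts_ws         # assumed: any object with `async def send(str)`
        self.agent_speaking = False  # set to True while TTS audio is being played

    async def on_stt_message(self, raw):
        """Handle one STT text frame; return True if a Clear was sent to TTS."""
        decoded = json.loads(raw)
        if decoded.get("type") != "SpeechStarted":
            return False  # transcripts and other events are handled elsewhere
        if not self.agent_speaking:
            return False  # nothing is playing, so there is nothing to interrupt
        await self.tts_ws.send(json.dumps({"type": "Clear"}))
        self.agent_speaking = False
        return True
```

Checking `agent_speaking` first avoids sending a Clear every time the user starts a normal turn. On a phone call, audio already queued at the telephony provider still needs its own clear, as the commented Twilio line in the example notes.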