#Interim Result Accuracy, VAD Event


sonic wind
#

I observed that transcription accuracy drops when interim results are enabled (interim_results=True). Also, the SpeechStarted VAD event is unusable because background noise triggers it. I want to use interim results and the SpeechStarted VAD event for interrupt handling. How can I resolve this?


woeful fjordBOT
#

Hi @sonic wind. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:


You're correct that interim results are less accurate than final results. This is expected: interim results are preliminary guesses that get refined as more audio is processed. Final results are only sent when Deepgram is confident in the transcript for a segment, which is why they are more accurate but arrive with more delay. This trade-off is fundamental to how Deepgram's streaming transcription works: interim results are fast but less stable, while final results are slower but more accurate (source: Using Interim Results).
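One way to live with that trade-off is to show interim text immediately but only commit text that arrived with is_final=true. The sketch below is illustrative, not an official pattern: it assumes each Results message has already been parsed into a dict with the `type`, `is_final`, and `channel.alternatives[0].transcript` fields, and `TranscriptBuffer` is a hypothetical helper name.

```python
class TranscriptBuffer:
    """Keeps finalized text separate from the latest (unstable) interim guess."""

    def __init__(self) -> None:
        self.final_parts = []  # transcripts that arrived with is_final=true
        self.interim = ""      # most recent interim transcript, may still change

    def on_results(self, message: dict) -> None:
        if message.get("type") != "Results":
            return
        alternatives = message.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "") if alternatives else ""
        if message.get("is_final"):
            # The final result replaces whatever interim guess covered this segment.
            if text:
                self.final_parts.append(text)
            self.interim = ""
        else:
            self.interim = text

    def committed_text(self) -> str:
        """Text that is safe to act on (final results only)."""
        return " ".join(self.final_parts)

    def display_text(self) -> str:
        """Committed text plus the current interim guess, for live display."""
        parts = self.final_parts + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Downstream logic (for example, sending text to an LLM) would read `committed_text()`, while a live caption would read `display_text()`.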

Regarding the VAD (Voice Activity Detection) event SpeechStarted: it can indeed be triggered by background noise, especially in noisy environments. This is a known limitation, as VAD is based on detecting audio energy, not necessarily speech content (source: Limitations of Endpointing).
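If SpeechStarted alone is too noisy, one possible workaround is to treat it only as a candidate and confirm the interruption once an interim result actually contains words. This is a heuristic sketch of my own, not something taken from the Deepgram docs: `BargeInDetector` and `min_words` are hypothetical names, and messages are assumed to be parsed dicts with a `type` field (`SpeechStarted` carrying a `timestamp`, `Results` carrying `channel.alternatives`).

```python
class BargeInDetector:
    """Confirms an interruption only when words follow a SpeechStarted event."""

    def __init__(self, min_words: int = 1) -> None:
        self.min_words = min_words  # assumed threshold; tune for your audio
        self.pending = False        # True after SpeechStarted, until confirmed
        self.started_at = None      # timestamp reported by SpeechStarted

    def on_message(self, message: dict) -> bool:
        """Return True when an interruption should be treated as confirmed."""
        kind = message.get("type")
        if kind == "SpeechStarted":
            # Candidate only: this may be background noise.
            self.pending = True
            self.started_at = message.get("timestamp")
            return False
        if kind == "Results" and self.pending:
            alternatives = message.get("channel", {}).get("alternatives", [])
            text = alternatives[0].get("transcript", "") if alternatives else ""
            if len(text.split()) >= self.min_words:
                # Recognized words arrived, so this was real speech.
                self.pending = False
                return True
        return False
```

The cost of this gate is latency: the interruption fires when the first non-empty interim result arrives rather than at the SpeechStarted event itself.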

For interrupt handling with interim results, consider these approaches:

  1. Combine Endpointing and UtteranceEnd:
    • Use endpointing (speech_final=true) for detecting natural pauses in speech, but be aware it may not always fire in noisy environments.
    • Enable the UtteranceEnd feature (utterance_end_ms) alongside interim results. UtteranceEnd analyzes word timings in the transcript, not just audio energy, making it more robust to background noise. It sends a message when a gap between words is detected, which can be used as a reliable interrupt or end-of-speech signal (sources: Utterance End, Quality End-of-Speech Detection).
  2. Recommended Settings:

    • Set interim_results=true to get fast updates.
    • Set utterance_end_ms=1000 (or higher) to trigger an UtteranceEnd event after a gap of at least 1 second between words.
    • Optionally, use endpointing with a lower value (e.g., 300–800 ms) for more responsive endpoint detection, but rely on UtteranceEnd for robustness in noisy environments.
  3. Event Handling Logic:

    • Use speech_final=true as your primary interrupt if it fires.
    • If speech_final does not fire (e.g., due to noise), use the next UtteranceEnd event as your interrupt. This "either/or" logic is recommended for robust end-of-speech detection (source: Using UtteranceEnd and Endpointing).
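The "either/or" logic above can be sketched as a small state machine. This is a minimal illustration under stated assumptions: messages are already parsed into dicts whose `type` is `Results` or `UtteranceEnd`, and `EndOfSpeechDetector` is a hypothetical name, not part of the Deepgram SDK.

```python
class EndOfSpeechDetector:
    """Fires once per utterance on speech_final or UtteranceEnd, whichever is first."""

    def __init__(self) -> None:
        # True once speech_final=true has already ended the current utterance.
        self.speech_final_seen = False

    def on_message(self, message: dict) -> bool:
        """Return True when the message marks the end of an utterance."""
        kind = message.get("type")
        if kind == "Results":
            if message.get("speech_final"):
                self.speech_final_seen = True
                return True
            alternatives = message.get("channel", {}).get("alternatives", [])
            text = alternatives[0].get("transcript", "") if alternatives else ""
            if text.strip():
                # New words mean a new utterance has started.
                self.speech_final_seen = False
            return False
        if kind == "UtteranceEnd":
            if self.speech_final_seen:
                # Already handled via speech_final; ignore the duplicate signal.
                self.speech_final_seen = False
                return False
            # speech_final never fired (for example, due to noise): use the fallback.
            return True
        return False
```

Feeding every incoming message through `on_message` gives one end-of-speech trigger per utterance, regardless of which of the two signals arrives.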

Summary Table:

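| Feature         | Pros                                 | Cons                                      |
|-----------------|--------------------------------------|-------------------------------------------|
| Interim Results | Fast, low latency                    | Lower accuracy, unstable                  |
| speech_final    | Accurate end-of-speech in clean audio | May miss in noisy environments            |
| UtteranceEnd    | Robust to noise, transcript-based    | Slightly higher latency (≥1 s gap required) |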

Example LiveOptions (Python):

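# Assumes the deepgram-sdk Python package (v3-style API), which exports LiveOptions.
from deepgram import LiveOptions

options = LiveOptions(
    model="nova-3",
    language="en-US",
    interim_results=True,     # preliminary transcripts; UtteranceEnd depends on them
    utterance_end_ms="1000",  # gap between words (ms) that triggers UtteranceEnd
    vad_events=True,          # emit SpeechStarted events
    endpointing=800,          # silence (ms) before speech_final=true is sent
)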

Source: Utterance End
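If you connect to the streaming endpoint directly instead of going through the SDK, the same settings can be passed as query parameters. The sketch below only builds the URL with the standard library; `build_listen_url` is a hypothetical helper, and raw audio streams would typically need extra parameters such as encoding and sample rate, which are omitted here.

```python
from urllib.parse import urlencode


def build_listen_url(**params) -> str:
    """Build a Deepgram streaming URL, rendering booleans as true/false."""
    normalized = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(normalized)


url = build_listen_url(
    model="nova-3",
    language="en-US",
    interim_results=True,
    utterance_end_ms=1000,
    vad_events=True,
    endpointing=800,
)
print(url)
```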

Key Takeaway:
For robust interrupt handling with interim results, use both endpointing and UtteranceEnd events, and handle whichever comes first. This approach mitigates the limitations of VAD in noisy environments and the lower accuracy of interim results.


If you need more specific code examples or integration tips, let me know!
