#Delay in transcription during a live conversation on a phone call.

glossy dove
#

I'm currently testing Deepgram's speech-to-text over WebSocket, but sometimes it is slow to respond: I have to say 'hello' several times before it picks up the audio and sends a transcript.

Here is my sample code. Is there something wrong there?

this.connection = deepgram.listen.live({
  language: "en",
  encoding: "mulaw",
  model: "nova-2",
  sample_rate: 8000,
  punctuate: true,
  smart_format: true,
  interim_results: true,
  utterance_end_ms: 3000,
  endpointing: 500,
});

somber isleBOT
#

Hi @glossy dove. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

Your code and parameter choices are generally correct, but the delay you’re experiencing is likely related to how Deepgram’s endpointing and utterance detection features work, especially in noisy environments like phone calls.

Key Points from the Knowledge Sources

  • Endpointing: The endpointing parameter (e.g., endpointing: 500) controls how long Deepgram waits for silence before considering speech finished. On phone calls, however, true silence is rare because of background noise, so endpointing may not trigger as expected. Shorter endpointing times (e.g., 500 ms) can help, but with constant noise it may still not fire reliably. Longer endpointing times can make things worse in noisy environments, because the silence threshold is rarely met (source: Official answer on endpointing).

  • utterance_end_ms: This feature triggers based on gaps between words in the transcript, not on audio silence. It’s more robust in noisy environments, but it requires interim_results: true to function. Setting utterance_end_ms: 3000 means Deepgram will wait for a 3-second gap between recognized words before sending an UtteranceEnd message. If you want faster detection, try lowering this value (e.g., 1000 ms) (source: Limitations of Endpointing).

  • Interim Results: With interim_results: true, you’ll receive partial transcripts every ~1 second, which can help with responsiveness, but final results (is_final: true) may still be delayed if neither endpointing nor utterance end is triggered (source: Limitations of Endpointing).

  • Phone Call Audio: Phone call audio is typically noisy, and silence detection is less reliable. The official guidance is to use shorter endpointing times and rely more on utterance_end_ms for detecting when the user has stopped speaking (source: Official answer on endpointing).
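
To make the word-gap behaviour described above more concrete, here is a minimal sketch. It is not Deepgram's implementation: the Word shape and the findUtteranceEnds helper are hypothetical names used only for illustration, and the timings are made up. It only shows why a gap between recognized words can be measured even when background noise means the audio is never truly silent.

```typescript
// Hypothetical word timing, in milliseconds from the start of the stream.
interface Word {
  text: string;
  startMs: number;
  endMs: number;
}

// Returns the end time of each word that is followed by a gap of at least
// `utteranceEndMs` before the next recognized word starts.
// The last word is never reported: without a following word (or the current
// stream time) the size of the gap after it is unknown.
function findUtteranceEnds(words: Word[], utteranceEndMs: number): number[] {
  const ends: number[] = [];
  for (let i = 0; i + 1 < words.length; i++) {
    const gap = words[i + 1].startMs - words[i].endMs;
    if (gap >= utteranceEndMs) {
      ends.push(words[i].endMs);
    }
  }
  return ends;
}

// Illustrative timings: the caller pauses for 1700 ms after "there".
const words: Word[] = [
  { text: "hello", startMs: 0, endMs: 400 },
  { text: "there", startMs: 500, endMs: 900 },
  { text: "hello", startMs: 2600, endMs: 3000 },
];

const endsAt1000 = findUtteranceEnds(words, 1000); // the 1700 ms pause counts
const endsAt3000 = findUtteranceEnds(words, 3000); // the same pause is too short
```

With a 3000 ms threshold, a pause of under three seconds is not treated as the end of an utterance, which is consistent with having to wait (or repeat yourself) before anything is emitted.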

Recommendations

  • Try lowering utterance_end_ms to 1000 or 1500 for faster detection of speech pauses.
  • Keep endpointing at 500ms or lower, but be aware it may not always trigger in noisy environments.
  • Make sure you are handling both is_final: true and UtteranceEnd events in your code to process transcripts as soon as possible.
  • If you are not receiving any transcripts at all, double-check that your audio is being sent in the correct format (mulaw, 8000 Hz, mono) and that you are not sending base64-encoded audio; it should be raw audio bytes (source: No response received from deepgram streaming websockets).
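
If your telephony provider delivers audio as base64 text inside JSON messages (an assumption, since the question does not show how the audio reaches your code), you would decode it to raw bytes before forwarding it. The MediaMessage shape and the decodeAudioPayload helper below are hypothetical; the only part taken from the guidance above is that Deepgram should receive raw bytes rather than a base64 string.

```typescript
// Hypothetical shape of an incoming telephony media message.
interface MediaMessage {
  media: { payload: string }; // base64-encoded mulaw audio
}

// Decode a base64 payload into raw bytes suitable for a binary WebSocket frame.
function decodeAudioPayload(message: MediaMessage): Uint8Array {
  const binary = atob(message.media.payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// "/39/fw==" is the base64 encoding of the four bytes FF 7F 7F 7F.
const decoded = decodeAudioPayload({ media: { payload: "/39/fw==" } });
```

You would then forward decoded, not the original base64 string, to the live connection.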

Example Adjustment

this.connection = deepgram.listen.live({
  language: 'en',
  encoding: "mulaw",
  model: "nova-2",
  sample_rate: 8000,
  punctuate: true,
  smart_format: true,
  interim_results: true, 
  utterance_end_ms: 1000, // Try lowering this
  endpointing: 500,       // Keep this low
});

If you continue to experience delays, consider logging the raw audio you send and the events you receive to help diagnose whether the issue is with audio transmission or with Deepgram’s endpointing logic.
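
One way to do that logging is a small in-memory timeline that records when audio was sent and when messages came back. This is only a sketch: the StreamLog class and its method names are hypothetical, and the clock is injectable so the example can run without real time passing.

```typescript
type LogEntry = {
  atMs: number;
  kind: "audio_sent" | "message_received";
  detail: string;
};

class StreamLog {
  private entries: LogEntry[] = [];
  private now: () => number;

  // `now` defaults to the wall clock; tests can pass a fake clock instead.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  audioSent(byteLength: number): void {
    this.entries.push({ atMs: this.now(), kind: "audio_sent", detail: `${byteLength} bytes` });
  }

  messageReceived(type: string, isFinal: boolean, speechFinal: boolean): void {
    this.entries.push({
      atMs: this.now(),
      kind: "message_received",
      detail: `${type} is_final=${isFinal} speech_final=${speechFinal}`,
    });
  }

  // Milliseconds between the first audio chunk and the first message received
  // at or after it, or null if either is missing.
  firstResponseDelayMs(): number | null {
    const firstAudio = this.entries.find((e) => e.kind === "audio_sent");
    if (!firstAudio) return null;
    const firstMessage = this.entries.find(
      (e) => e.kind === "message_received" && e.atMs >= firstAudio.atMs,
    );
    return firstMessage ? firstMessage.atMs - firstAudio.atMs : null;
  }
}

// Example with a fake clock: two audio chunks, then the first message at 1250 ms.
let fakeTime = 0;
const log = new StreamLog(() => fakeTime);
log.audioSent(160);
fakeTime = 20;
log.audioSent(160);
fakeTime = 1250;
log.messageReceived("Results", false, false);
const delay = log.firstResponseDelayMs();
const emptyDelay = new StreamLog(() => 0).firstResponseDelayMs();
```

If the delay is large even though audio chunks are being logged steadily, the problem is more likely on the recognition side (endpointing or utterance settings) than in audio transmission.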

#

@glossy dove Kapa.ai is still learning and improving, please let me know how I did by reacting below.

glossy dove
#

@somber isle does this look correct?

this.connection.on(LiveTranscriptionEvents.Transcript, (data: any) => {
  const alternatives = data.channel?.alternatives;
  let text = alternatives ? alternatives[0]?.transcript : "";
  console.log(data.type, data.is_final, data.speech_final);
  if (data.type === "UtteranceEnd") {
    if (!this.speechFinal) {
      console.log(`UtteranceEnd received before speechFinal, emit collected text: ${this.finalResult}`.yellow);
      events?.onTranscription?.(this.finalResult.trim()); // Trim for safety
      this.finalResult = ""; // Reset after emitting
    } else {
      console.log("STT -> Speech was already final when UtteranceEnd received".yellow);
    }
    // return; // Prevent further execution
  }

  // Handle final speech results
  if (data.is_final === true && text.trim().length > 0) {
    this.finalResult += (this.finalResult.length ? " " : "") + text; // Append with spacing

    if (data.speech_final === true) { 
      this.speechFinal = true;
      events?.onTranscription?.(this.finalResult.trim());
      this.finalResult = ""; // Reset after emitting
    }
  } else {
    // Handle ongoing utterances (not final)
    events?.onUtterance?.(text);
  }
});
somber isleBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

Your event handler logic is generally correct and aligns with recommended practices for handling Deepgram’s live transcription events, especially when using both endpointing and utterance_end_ms. Here’s how your logic matches the guidance from the knowledge sources:

  • UtteranceEnd Handling:
    You correctly check for the "UtteranceEnd" event and emit the collected transcript only if a speech_final event has not already occurred. This matches the recommended approach:

    "trigger when a transcript with speech_final=true is received (which may be followed by an UtteranceEnd message which can be ignored), trigger if you receive an UtteranceEnd message with no preceding speech_final=true message and send the last-received transcript for further processing."
    Limitations of Endpointing

  • Final Results Handling:
    You concatenate finalized transcripts (is_final: true) and emit the result when speech_final: true is received, then reset the buffer. This is the correct way to gather the full utterance:

    "concatenate all responses marked is_final: true until you reach a response marked speech_final: true."
    Getting Final Transcripts

  • Ongoing Utterances:
    You emit partial results for ongoing utterances, which is standard for providing real-time feedback.

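Suggestions:

  • Make sure to reset this.speechFinal to false after emitting a final result, so the logic works for subsequent utterances.
  • The line // return; // Prevent further execution is commented out. If you want to stop processing in the handler after handling UtteranceEnd, uncomment it.
  • In the Deepgram JavaScript SDK, UtteranceEnd messages are emitted as their own event (LiveTranscriptionEvents.UtteranceEnd) rather than through the Transcript handler, so the data.type === "UtteranceEnd" branch is unlikely to run where it is. Consider moving it into a separate listener for that event.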
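
The buffering rules quoted above can also be sketched as a small state machine that is independent of the SDK, which makes the logic easy to test on its own. The UtteranceAssembler class and the ResultsMessage shape are hypothetical names for illustration, and the messages fed in are simulated, not real Deepgram output.

```typescript
// Minimal subset of the fields discussed above; real messages carry more.
interface ResultsMessage {
  transcript: string;
  isFinal: boolean;
  speechFinal: boolean;
}

class UtteranceAssembler {
  private buffer: string[] = [];
  private emittedOnSpeechFinal = false;

  // Call for every transcript message. Returns the full utterance when
  // speech_final arrives, otherwise null.
  onResults(msg: ResultsMessage): string | null {
    const text = msg.transcript.trim();
    if (msg.isFinal && text.length > 0) {
      this.buffer.push(text);
      this.emittedOnSpeechFinal = false; // new speech since the last emit
    }
    if (msg.speechFinal && this.buffer.length > 0) {
      this.emittedOnSpeechFinal = true;
      return this.flush();
    }
    return null;
  }

  // Call when an UtteranceEnd message arrives. Returns buffered text only
  // if speech_final has not already emitted it.
  onUtteranceEnd(): string | null {
    const pending = !this.emittedOnSpeechFinal && this.buffer.length > 0;
    this.emittedOnSpeechFinal = false; // reset for the next utterance
    return pending ? this.flush() : null;
  }

  private flush(): string {
    const utterance = this.buffer.join(" ");
    this.buffer = [];
    return utterance;
  }
}

// Simulated message sequence.
const assembler = new UtteranceAssembler();
const emitted: string[] = [];
const collect = (s: string | null): void => {
  if (s !== null) emitted.push(s);
};

collect(assembler.onResults({ transcript: "hel", isFinal: false, speechFinal: false })); // interim, ignored
collect(assembler.onResults({ transcript: "hello there", isFinal: true, speechFinal: false })); // buffered
collect(assembler.onResults({ transcript: "how are you", isFinal: true, speechFinal: true })); // emits
collect(assembler.onUtteranceEnd()); // already emitted, ignored
collect(assembler.onResults({ transcript: "still there", isFinal: true, speechFinal: false })); // buffered
collect(assembler.onUtteranceEnd()); // no speech_final arrived, emits the buffer
```

In your handler you would call onResults from the transcript listener and onUtteranceEnd from wherever UtteranceEnd messages arrive, passing any non-null result to onTranscription.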

Summary:
Your code structure is correct and follows Deepgram’s best practices for handling both endpointing and utterance end events in live transcription scenarios. If you continue to see issues, focus on tuning your endpointing and utterance_end_ms parameters, and ensure your audio input is clean and in the expected format.

For more details, see the official documentation on UtteranceEnd and endpointing.

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:

glossy dove
#

@somber isle is there a way to remove background noise?

somber isleBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

Deepgram does not provide built-in background noise removal or noise suppression as part of its API. The official guidance is to use external noise suppression tools or libraries before sending audio to Deepgram. Some open-source options mentioned include noise-suppression-for-voice, Xiph's C library (with Python and Node bindings), or commercial solutions like krisp.ai. However, be aware that aggressive noise cancellation can sometimes degrade transcription quality, as Deepgram models are trained on real-world, noisy audio and may perform worse with heavily processed input (sources: Noise suppression? When use is not wearing headphones; Noise cancellation).

In summary:
You should pre-process your audio with a noise suppression tool before sending it to Deepgram, but test carefully to ensure it does not negatively impact transcription accuracy.
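
As a starting point for that pre-processing: many noise suppression libraries work on linear PCM rather than mu-law (an assumption you should check against whichever tool you choose), so phone audio may need decoding first. Below is a minimal sketch of the standard G.711 mu-law to 16-bit PCM conversion; the function names are hypothetical and no particular noise suppression library is implied.

```typescript
// Decode one G.711 mu-law byte into a signed 16-bit linear PCM sample.
function muLawToLinear(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign !== 0 ? -magnitude : magnitude;
}

// Decode a whole chunk of mu-law audio into PCM samples.
function decodeMuLaw(input: Uint8Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    output[i] = muLawToLinear(input[i]);
  }
  return output;
}

// 0xFF is mu-law silence; 0x80 and 0x00 are the positive and negative extremes.
const pcm = decodeMuLaw(new Uint8Array([0xff, 0x80, 0x00]));
```

If you then send the cleaned PCM to Deepgram instead of re-encoding it as mu-law, the encoding option in your connection settings would need to change to match (for 16-bit PCM, linear16).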

glossy dove
#

@somber isle I only want a transcript when the user finishes talking.

Help me update the code to fit:

language: events?.language?.toLocaleLowerCase() ?? this.language,
encoding: "mulaw",
model: "nova-2",
sample_rate: 8000,
no_delay: false,
punctuate: false,
smart_format: true,
interim_results: true,
utterance_end_ms: 1000,
utterances: true,
endpointing: 700,

somber isleBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

To emit a transcript only when the user has finished talking, you should trigger your transcript event only when you receive either:

  • a transcript with speech_final: true (from endpointing), or
  • an event of type "UtteranceEnd" (from the utterance_end_ms feature), if you have not already emitted a transcript for that utterance.

You should not emit transcripts on interim or partial results.

Here’s how you can update your event handler logic to fit this requirement:

this.connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const text = data.channel?.alternatives?.[0]?.transcript ?? "";

  // Buffer finalized segments only; interim results are ignored
  if (data.is_final === true && text.trim().length > 0) {
    this.finalResult = (this.finalResult?.length ? this.finalResult + " " : "") + text;
    this.speechFinal = false; // new speech since the last emit
  }

  // Endpointing fired (speech_final): emit the buffered utterance
  if (data.speech_final === true) {
    events?.onTranscription?.((this.finalResult ?? "").trim());
    this.finalResult = "";
    this.speechFinal = true;
  }
});

// UtteranceEnd is emitted as its own event in the JS SDK, not through the Transcript handler
this.connection.on(LiveTranscriptionEvents.UtteranceEnd, () => {
  // If UtteranceEnd arrives without a preceding speech_final, emit what you have
  if (!this.speechFinal && this.finalResult?.trim().length) {
    events?.onTranscription?.(this.finalResult.trim());
    this.finalResult = "";
  }
  this.speechFinal = false; // Reset for next utterance
});

Key points:

  • You only emit the transcript when the user has finished talking, as indicated by speech_final: true or "UtteranceEnd".
  • You concatenate all is_final: true transcripts until the end of the utterance.
  • You do not emit on interim results or partials.

This approach matches Deepgram’s recommended best practices for utterance segmentation and ensures you only get a transcript when the user has finished speaking, not during their speech or after every partial result (see official guidance).

Let me know if you need this in another language or with more context!

glossy dove
#

is smart_format necessary on this?

wide tulipBOT
#

Hey there, it looks like you've been talking with Kapa for a bit. If you need additional help, you should consider moving to #1115960287183511643 where our team can answer your query.

glossy dove
#

@somber isle is smart_format necessary on this?

somber isleBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

No, smart_format is not necessary for detecting when the user has finished talking or for emitting a transcript only at the end of an utterance. The smart_format parameter controls how Deepgram formats the transcript output (e.g., formatting numbers, dates, addresses, etc.), but it does not affect endpointing, utterance detection, or the logic for when transcripts are emitted.

However, be aware that when smart_format is enabled, Deepgram may sometimes wait for additional audio to properly format certain entities (like phone numbers), which can introduce a slight delay in finalizing the transcript in some cases. This is a known limitation described in the documentation:

In rare situations, such as when speaking a phone number, Deepgram may purposefully wait for additional audio from the speaker so it can properly format the transcript (this only occurs when using smart_format=true)
Limitations of Endpointing

If you do not need special formatting, you can set smart_format: false to potentially reduce this delay. Otherwise, it is optional and based on your output formatting needs.

glossy dove
#

Can you send a configuration that always sends the transcript after an endpoint, regardless of whether the transcript is OK or not?

#

@somber isle Can you send a configuration that always sends the transcript after an endpoint, regardless of whether the transcript is OK or not?

I mean a live configuration for connecting.

somber isleBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

Based on the knowledge sources, Deepgram’s endpointing feature is designed to send a finalized transcript (speech_final: true) whenever a sufficiently long pause in speech is detected. This happens regardless of whether the transcript is a complete sentence or just a partial utterance. The purpose of endpointing is to indicate a likely endpoint in speech, not necessarily a complete or “good” transcript:

Endpointing is not intended to return complete utterances or to be a reliable indicator of a complete thought or sentence. It is intended to hint to low-latency NLP engines that intermediate processing may be useful. For utterance segmentation, you’ll need to implement a simple timing heuristic appropriate to your domain.
Using Endpointing (speech_final)

Configuration Example:

To always receive a transcript after each detected endpoint (pause), you can use the following configuration:

const connection = deepgram.listen.live({
  language: 'en',
  encoding: 'mulaw',
  model: 'nova-2',
  sample_rate: 8000,
  interim_results: false, // Only get final results
  endpointing: 700,       // Adjust as needed for pause length
  // smart_format and punctuate are optional
});

  • Set interim_results: false to only receive finalized transcripts.
  • Set endpointing to your desired pause duration in milliseconds (e.g., 700 ms).
  • Deepgram will send a transcript with speech_final: true after each detected endpoint, regardless of transcript completeness.

Note:
If you want to receive a transcript after every endpoint, you do not need to enable any special parameter beyond endpointing. Deepgram will send the transcript at each endpoint by default, even if the transcript is incomplete or mid-sentence.

If you need more control (e.g., always emitting even if the transcript is empty), you may need to handle that logic in your event handler by checking for speech_final: true and emitting whatever transcript is present.

For more details, see the Endpointing documentation and Endpointing & Interim Results With Live Streaming.
