#(Streaming STT API) Some interim transcripts never get final transcripts


jolly cliff Jan 10, 2025, 8:23 AM

This seems to break the assumption that an interim result will always eventually get finalized. Here's a log from my application illustrating this.
In the log, you can see that:
Line 1: we get a final transcript containing "And you could it was also, like, an event creator, so you could create " with a start time of 2550.03
Line 2: we get an interim transcript "like, " with a start time of 2554.43
Line 3: interim transcript "like, spontaneous events with" also with a start time of 2554.43. Because the start times are the same, my app knows to replace the interim transcript from Line 2 with this newer one.
The following lines are more interim results with the same start time of 2554.43, culminating in a final result on Line 6, also with a start time of 2554.43. My app sees this and knows that we can replace all the interim transcripts with start time of 2554.43 with this final one.
The interesting part is on lines 7-9.
Line 7: interim transcript "There's no" with a start time of 2558.42
Line 8: a new interim transcript "The last app I" with a start time of 2559.98 (a different start time!)
Line 9: final transcript "Yeah. I do. The last app I launched" with start time of 2559.98

As you can see, we never get a final result for the transcript from Line 7 with a start time of 2558.42!
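For reference, the replacement logic described above can be sketched roughly as follows. This is only an illustration, not the actual application code: the `TranscriptStore` class and its `handle` method are hypothetical names, and the three calls replay Lines 7-9 of the log using the start times and transcripts quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TranscriptStore:
    """Hypothetical store that keys transcripts by segment start time."""
    interim: Dict[float, str] = field(default_factory=dict)
    final: Dict[float, str] = field(default_factory=dict)

    def handle(self, start: float, text: str, is_final: bool) -> None:
        if is_final:
            # A final result supersedes any interim with the same start time.
            self.interim.pop(start, None)
            self.final[start] = text
        else:
            # A newer interim replaces the older one with the same start time.
            self.interim[start] = text


store = TranscriptStore()
store.handle(2558.42, "There's no", is_final=False)                           # Line 7
store.handle(2559.98, "The last app I", is_final=False)                       # Line 8
store.handle(2559.98, "Yeah. I do. The last app I launched", is_final=True)   # Line 9

# The interim from Line 7 is never replaced or finalized.
print(store.interim)  # {2558.42: "There's no"}
```

With matching purely on start time, the interim at 2558.42 stays in the pending set indefinitely, which is the orphaned transcript in question.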

Is this expected? Why does this happen? I would expect that one of the following should happen instead:

BEFORE the transcript on Line 8 is received, a final transcript is sent for the transcript from Line 7, containing the start time of 2558.42
OR
the transcripts on Line 8 and Line 9 use a start time of 2558.42; that way, my app understands to replace the interim ones with newer interim transcripts accordingly, and eventually replace the interim ones with the final one.

📎 message.txt

loud escarpBOT Jan 10, 2025, 8:23 AM

Hi @jolly cliff. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

Thank you for your detailed question about Deepgram's streaming Speech-to-Text API and the behavior of interim and final transcripts. This is indeed an interesting observation, and I'll try to explain what might be happening based on the information provided in the knowledge sources.

The behavior you're describing, where some interim transcripts never receive corresponding final transcripts, is not explicitly documented as expected behavior. However, it's important to note that the relationship between interim and final results can be complex, especially when dealing with real-time streaming audio.

According to the Interim Results documentation, Deepgram provides interim transcripts as preliminary results during real-time streaming, and then identifies a point at which its transcript has reached maximum accuracy to send a definitive, or final, transcript.

The behavior you're seeing might be related to how Deepgram processes streaming audio in chunks. As mentioned in the Using Interim Results documentation:

"When working with real-time streaming audio, streams flow from your capture source (for example, microphone, browser, or telephony system) to Deepgram's servers in irregular pieces. In some cases the collected audio can end abruptly—perhaps even mid-word—which means that Deepgram's predictions, particularly for words near the tip of the audio stream, are more likely to be wrong."

This suggests that there might be cases where Deepgram decides to discard or merge certain interim results without explicitly finalizing them, especially if it determines that a newer chunk of audio provides a better context for transcription.

It's also worth noting that there's an ongoing issue related to timestamps in streaming transcription, as mentioned in a GitHub discussion. The Deepgram team is aware of timestamp inconsistencies and is working on a fix. This could potentially be related to the behavior you're observing.

Given the complexity of real-time audio processing and the known timestamp issue, the behavior you're seeing, while not ideal, appears to be a possible outcome of streaming transcription with interim results.

To handle this in your application, you might need to implement a strategy to deal with "orphaned" interim results. For example, you could set a timeout after which you either discard or tentatively accept an interim result if no corresponding final result is received.
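A minimal sketch of that timeout approach, under a few assumptions: `InterimTracker`, its method names, and the 2-second timeout are all hypothetical and not part of any Deepgram SDK, and the `now` values are made-up wall-clock seconds passed in explicitly so the example is deterministic. Whether an expired interim should be discarded or tentatively accepted is left to the caller.

```python
from typing import Dict, List, Tuple


class InterimTracker:
    """Hypothetical helper that flags interim results never followed by a final."""

    def __init__(self, timeout_s: float = 2.0) -> None:
        self.timeout_s = timeout_s
        # start time -> (latest interim text, wall-clock time it was last updated)
        self._pending: Dict[float, Tuple[str, float]] = {}

    def on_interim(self, start: float, text: str, now: float) -> None:
        # Newer interims with the same start time overwrite older ones.
        self._pending[start] = (text, now)

    def on_final(self, start: float) -> None:
        # A final result clears the pending interim with the same start time.
        self._pending.pop(start, None)

    def expire(self, now: float) -> List[Tuple[float, str]]:
        """Remove and return interims that have waited at least timeout_s."""
        expired = sorted(
            (start, text)
            for start, (text, seen) in self._pending.items()
            if now - seen >= self.timeout_s
        )
        for start, _ in expired:
            del self._pending[start]
        return expired


tracker = InterimTracker(timeout_s=2.0)
tracker.on_interim(2558.42, "There's no", now=0.0)
tracker.on_interim(2559.98, "The last app I", now=1.5)
tracker.on_final(2559.98)

print(tracker.expire(now=1.9))  # []
print(tracker.expire(now=2.0))  # [(2558.42, "There's no")]
```

In a real application you would call `expire` periodically (for example, on each incoming message or from a timer) with the current time, then decide what to do with each returned transcript.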

For the most up-to-date and accurate information on this behavior, I recommend reaching out to Deepgram's support team or posting your question in their GitHub discussions. They may be able to provide more specific insights or confirm if this is expected behavior or a known issue they're working to address.
