#Utterance segmentation is still too long even with a very small utt_split


mental marlin
#

We’re seeing that the returned utterances/sentences are still too long, even after setting utt_split to a very small value. (utt_split is 0.1)

The file above is our original audio file. In the transcription result, the sentence below is segmented as one very long utterance.

Could you help us understand how to make the utterance segmentation shorter or more granular? If you need any additional sample

doc: https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded

Transcribe audio and video using Deepgram's speech-to-text REST API

ashen sunBOT
#

Hey there! This channel allows you to ask Kapa.ai for assistance. But it looks like you haven't connected your Discord profile to your Deepgram account yet. You can do so at https://community.deepgram.com - by doing so, you'll be able to ask questions in #1115960287183511643 if you need human assistance!

karmic galleonBOT
#

Hi @mental marlin. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

Deepgram’s utt_split is based on detected pauses in speech, not on a fixed maximum utterance length. That means:

  • utterances=true must be set, or utt_split has no effect at all.[Improvements audit; API params]
  • utt_split is “seconds to wait before detecting a pause between words in submitted audio” — if the speaker doesn’t actually pause by at least that amount, the utterance will not be split, no matter how long the sentence is.[API params]

From the GitHub discussion you’re essentially running into the same behavior others have seen:

  • Decreasing utt_split below the default (0.8s) often has little or no effect on the number of utterances. In a 96‑second test file, values 0.01, 0.1, 0.2, and 0.5 all produced the same number of utterances; only increasing it (1, 1.5, 2, 3) noticeably changed segmentation.[GitHub utt_split discussion]
  • The feature is “best suited for especially fast or slow speakers,” and you’ll “notice more variation of fewer longer utterances when increasing utt_split, but less so of more shorter utterances when decreasing it.”[GitHub utt_split discussion]

So if your speaker has long stretches of continuous speech with very short or no pauses, even utt_split=0.1 won’t force shorter segments; there simply aren’t pauses long enough to trigger splits.
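One way to check this on your own file is to measure the silence between consecutive words in the response. The following is only a minimal sketch: it assumes each word is a dict with `start` and `end` in seconds (mirroring the word objects in the transcription response), `pause_gaps` is a hypothetical helper rather than part of any SDK, and the sample data is made up.

```python
def pause_gaps(words):
    """Return the silence (in seconds) between each pair of consecutive words."""
    return [
        round(nxt["start"] - cur["end"], 3)
        for cur, nxt in zip(words, words[1:])
    ]


# Made-up sample in the assumed shape; replace with the words from your response.
words = [
    {"word": "hello", "start": 0.00, "end": 0.40},
    {"word": "there", "start": 0.45, "end": 0.80},
    {"word": "friend", "start": 1.50, "end": 1.90},
]

gaps = pause_gaps(words)
# Gaps at least as long as the utt_split value you sent (0.1 here).
long_pauses = [g for g in gaps if g >= 0.1]
```

If very few gaps reach your utt_split value, there is little for pause-based splitting to work with. Server-side pause detection may not match these word-timestamp gaps exactly, so treat this as a rough diagnostic only.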

What you can do

Within the limits of the current feature set (based on the docs and discussions):

  1. Confirm your request options

    Make sure you’re actually sending both:

    • utterances=true
    • utt_split=0.1 (as a number, not a string, if you’re using an SDK)

    Example with the Python SDK:[Python utterances recipe; Python client params]

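    from deepgram import DeepgramClient

    # Assumes the API key is available to the client,
    # e.g. via the DEEPGRAM_API_KEY environment variable.
    client = DeepgramClient()
    resp = client.listen.v1.media.transcribe_url(
        url="https://dpgr.am/spacewalk.wav",
        model="nova-3",
        smart_format=True,
        utterances=True,  # required, otherwise utt_split is ignored
        utt_split=0.1,    # pause length in seconds, passed as a number
    )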
    
  2. If you need shorter units than pauses allow, post-process the transcript

    The knowledge sources show that for captioning, Deepgram’s SDK utilities further split utterances by a max word count (line_length) when generating SRT/WebVTT, even though the underlying utterances are longer.[SRT/WebVTT discussion]

    That pattern is what you’d need to replicate for your own use case:

    • Use utterances=true to get time‑aligned chunks.
    • Then, in your own code, split each utterance into smaller segments (e.g., by word count or character count), taking each new segment’s start/end times from its first and last words’ timestamps.

    The docs don’t provide a ready‑made “max duration per utterance” parameter, so this kind of client‑side splitting is currently the documented way to get more granular segments.

  3. If you need finer granularity, you can ignore utterances entirely

    The official answer in the same GitHub thread notes that you can “go word by word within each utterance with a defined line length and use first and last words’ timestamps,” and if utterances aren’t helpful for your use case, you can “chunk by a line length across the entire transcript, not using utterances.”[GitHub utt_split discussion]
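The client-side splitting described above could be sketched roughly as follows. This is an illustration under assumptions, not Deepgram's own implementation: `chunk_words` and `_close` are hypothetical helpers, each word is assumed to be a dict with `word`, `start`, and `end` (plus an optional `punctuated_word`), and the 10-word / 3-second caps are arbitrary defaults.

```python
def chunk_words(words, max_words=10, max_seconds=3.0):
    """Group word dicts into segments capped by word count and duration."""
    segments, current = [], []
    for w in words:
        # Start a new segment if adding this word would exceed either cap.
        if current and (
            len(current) >= max_words
            or w["end"] - current[0]["start"] > max_seconds
        ):
            segments.append(_close(current))
            current = []
        current.append(w)
    if current:
        segments.append(_close(current))
    return segments


def _close(chunk):
    # Segment timing comes from the first and last words' timestamps.
    return {
        "start": chunk[0]["start"],
        "end": chunk[-1]["end"],
        # Fall back to the raw word if no punctuated form is present.
        "text": " ".join(w.get("punctuated_word", w["word"]) for w in chunk),
    }


# Made-up sample data: five 0.4 s words starting every 0.5 s.
words = [{"word": f"w{i}", "start": i * 0.5, "end": i * 0.5 + 0.4} for i in range(5)]
segments = chunk_words(words, max_words=2)
```

You can pass either the words of a single utterance or the words of the whole transcript. A single word longer than max_seconds still becomes its own segment, since a word is never split.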


If you can share (or describe) the approximate speaking style (very fast, lots of pauses, mostly continuous, etc.), I can outline a concrete post‑processing strategy (e.g., “max 3 seconds or 10 words per segment”) based on the word timestamps.

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon: