# Whisper API segment length / word-level punctuation

hushed wave

Hi,

I’m using the OpenAI API and trying to get short segments (~4 words) with timings and punctuation.
What I went through:
- The API doesn’t allow setting the number of words per segment.
- I thought I could build segments from the word-level transcription, but there is no punctuation there, and characters like - and ' are handled oddly.
- I thought I could merge text or segments with words (punctuation from text, timing from words).

Then I noticed a few differences between text/segments and words:
- The text may differ: words can literally contain a word that does not exist at all in text/segments.
- Timestamps mismatch significantly between words and segments.
- There is no punctuation in words.
- Words containing ' or -, like « it’s », are treated as one word in some languages and as two words in others.

This makes merging segments and words difficult, since the number of words differs between the two and the rules for specific characters depend on the language.
Has anyone managed to get a word-based transcription with punctuation and word-level timestamps, or short segments, from the API?

quick rock BOT

Hi @hushed wave. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

Based on the knowledge sources provided, the Whisper API does not appear to directly support word-level timestamps with punctuation. The API is designed to provide phrase-level timestamps; word-level timestamps are produced by an inference-time trick, which may not give perfectly accurate timing, especially around pauses (source: https://github.com/openai/whisper/discussions/1855).

The Whisper API can return word-level timestamps, but the word entries do not include punctuation (source: https://platform.openai.com/docs/guides/speech-to-text). The API also does not let you set the number of words per segment.
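
Since the number of words per segment cannot be set on the API side, one option is to group the word entries yourself after the response comes back. The following is only a minimal sketch: it assumes each word entry is a dict with `word`, `start` and `end` keys, and `chunk_words` is a hypothetical helper name, not part of any OpenAI library. The resulting text carries no punctuation, because the word entries have none.

```python
from typing import Dict, List


def chunk_words(words: List[Dict], max_words: int = 4) -> List[Dict]:
    """Group timed words into segments of at most `max_words` words.

    Assumption: every entry looks like {"word": str, "start": float, "end": float}.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    segments = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        segments.append({
            # Join the raw word strings; no punctuation is added here.
            "text": " ".join(w["word"].strip() for w in group),
            # A segment spans from its first word's start to its last word's end.
            "start": group[0]["start"],
            "end": group[-1]["end"],
        })
    return segments
```

Calling `chunk_words(words, max_words=4)` on the word list returns short segments whose timing is taken directly from the word timestamps, so any inaccuracy in those timestamps carries over.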

As for the discrepancies between the text/segments and words, this seems to be a known issue. One user reported that the text key often shows more text than the segments[0]['words'] key (source: https://github.com/openai/whisper/discussions/1104). This could be due to the way the Whisper model handles certain characters such as - and '.
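
Because the two outputs may not contain the same words, a merge probably has to tolerate gaps instead of pairing tokens by position. Here is a rough sketch of one possible approach, using Python's standard `difflib.SequenceMatcher` to align the timed words with the whitespace-split tokens of the punctuated text. The `word`/`start`/`end` dict shape and the helper names `normalize` and `attach_punctuation` are assumptions made for illustration, and the normalization rule (drop everything except letters and digits) is a guess that will not suit every language.

```python
import difflib
import re
from typing import Dict, List


def normalize(token: str) -> str:
    """Lowercase a token and drop everything except letters and digits."""
    return re.sub(r"[\W_]+", "", token.lower())


def attach_punctuation(text: str, words: List[Dict]) -> List[Dict]:
    """Copy punctuated tokens from `text` onto timed `words` where they align.

    Assumption: each word entry looks like {"word": str, "start": float, "end": float}.
    Words with no counterpart in `text` keep their original string and get
    matched=False so the caller can decide what to do with them.
    """
    text_tokens = text.split()
    text_keys = [normalize(t) for t in text_tokens]
    word_keys = [normalize(w["word"]) for w in words]
    matcher = difflib.SequenceMatcher(a=word_keys, b=text_keys, autojunk=False)
    result = [dict(w, matched=False) for w in words]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            if not word_keys[block.a + k]:
                continue  # skip tokens that normalize to an empty string
            entry = result[block.a + k]
            entry["word"] = text_tokens[block.b + k]  # punctuated form
            entry["matched"] = True
    return result
```

This does not resolve the language-specific splitting of ' and -: a word that is one token in the text but two entries in words simply stays unmatched, which at least makes the problem spots visible.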

To get shorter segments, one possible solution is to tune the threshold that controls how likely the model is to sample timestamp tokens, which determines the length of each phrase. This can be done by modifying the decoding strategy for timestamps in the Whisper code (source: https://github.com/openai/whisper/discussions/223).

However, please note that these solutions may not fully address your needs, as they do not provide a way to get word-level timestamps with punctuation. It might be worth reaching out to OpenAI directly for further assistance.
