#How to get responses immediately at the word level


true fiber
#

In Google STT, I can get responses immediately at the word level. Can the OpenAI Realtime API do something similar? Even if I commit periodically, it doesn’t work well.

sick hedgeBOT
#

Hi @true fiber. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

The OpenAI Realtime API provides low-latency streaming transcription with partial and final results, but it does not natively emit a result for each word as it is spoken, the way Google STT does. Instead, it uses voice activity detection (VAD) to split the audio into speech segments (turns) and emits transcription events at the end of each detected segment, not at every word boundary. This means you may see a slight delay (typically 300–800 ms) before receiving a chunk of transcribed text, rather than immediate word-by-word updates (sources: "Comparing Speech-to-Text Methods with the OpenAI API"; "Realtime transcription").

If you disable VAD and manually commit audio buffers more frequently (e.g., after every small chunk), you can increase the update frequency, but the API is still optimized for turn-based (utterance-level) detection and may not guarantee true word-level immediacy. The documentation does not describe a mode where you receive a streaming event for every word as it is spoken.
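As a rough illustration of the manual-commit approach, the sketch below builds the JSON events a client could send over the Realtime WebSocket: one event that turns VAD off, then an `input_audio_buffer.append` followed by an `input_audio_buffer.commit` for each fixed-size chunk. The 24 kHz mono pcm16 format, the 500 ms chunk size, the helper names, and the exact `session.update` envelope are assumptions for illustration; check them against the API version you are using.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # assumption: 24 kHz mono pcm16 input
BYTES_PER_SAMPLE = 2     # 16-bit samples


def disable_vad_event() -> dict:
    """Event that sets turn_detection to null (hypothetical helper).

    The envelope shape is an assumption and may differ between API versions.
    """
    return {"type": "session.update", "session": {"turn_detection": None}}


def chunk_events(pcm: bytes, chunk_ms: int = 500) -> list[dict]:
    """Split raw PCM audio into append + commit event pairs, one pair per chunk."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    events = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        events.append({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        # Committing closes the current buffer so it can be transcribed.
        events.append({"type": "input_audio_buffer.commit"})
    return events


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    for event in [disable_vad_event(), *chunk_events(one_second_of_silence)]:
        print(json.dumps(event)[:80])
```

Note that a fixed-interval commit can split a word across two chunks, which may be one reason periodic commits give uneven results.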

For word-level detail, the Transcription API (for pre-recorded audio) supports word-level timestamps, but these are not available in real-time streaming mode (source: "Speech to text – Timestamps"). The Realtime API is designed for low-latency, conversational use cases, and while it provides partial results, it does not match Google STT’s per-word immediacy.
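For pre-recorded audio, word timings come back when the transcription request asks for `response_format="verbose_json"` together with `timestamp_granularities=["word"]`. The sketch below only shows how such a response could be read; the sample payload is made up for illustration, and `word_timings` is a hypothetical helper, not part of any SDK.

```python
import json

# Made-up sample in the shape of a verbose_json response with word timestamps.
SAMPLE_RESPONSE = json.loads("""
{
  "text": "hello world",
  "words": [
    {"word": "hello", "start": 0.0, "end": 0.42},
    {"word": "world", "start": 0.5, "end": 0.9}
  ]
}
""")


def word_timings(response: dict) -> list[tuple[str, float, float]]:
    """Return (word, start_seconds, end_seconds) for each word in the response."""
    return [
        (w["word"], float(w["start"]), float(w["end"]))
        for w in response.get("words", [])
    ]


if __name__ == "__main__":
    for word, start, end in word_timings(SAMPLE_RESPONSE):
        print(f"{start:5.2f}-{end:5.2f}  {word}")
```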

#

Summary:

  • The OpenAI Realtime API streams partial and final transcriptions at the end of detected speech segments, not at every word.
  • Disabling VAD and committing audio more frequently may help, but true word-level immediacy is not guaranteed.
  • Word-level timestamps are available for pre-recorded audio, not for real-time streaming.

If you need true word-level streaming as in Google STT, the current OpenAI Realtime API does not explicitly support this feature based on the available documentation.


true fiber
#

You expect a 300–800 ms delay when using VAD, but in reality it’s noticeably slower. I don’t get any partial results while speaking; nothing comes back until the entire sentence is finished. @sick hedge

sick hedgeBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

The OpenAI Realtime API is designed to provide ultra-low-latency transcription, with typical delays of 300–800 ms, and supports partial and final transcripts for a more dynamic user experience (source: cookbook.openai.com/examples/speech_transcription_methods#3--realtime-transcription-api). However, in practice, you may observe that no partial results are returned while speaking, and the API only emits a result after the entire sentence or utterance is finished.
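To check whether partial results arrive at all, and when, it can help to log the transcription events per item. The following is a minimal sketch that accumulates `conversation.item.input_audio_transcription.delta` events and records the final text from the matching `.completed` event. The `TranscriptLog` class is hypothetical, and it assumes each delta carries only the new text; some models may instead send the whole turn in a single delta.

```python
from collections import defaultdict

DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class TranscriptLog:
    """Collects partial and final transcripts keyed by item_id (illustrative)."""

    def __init__(self) -> None:
        self.partial = defaultdict(str)      # text accumulated from deltas
        self.delta_counts = defaultdict(int)  # number of deltas seen per item
        self.final = {}                       # transcript from completed events

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == DELTA:
            item_id = event["item_id"]
            self.partial[item_id] += event.get("delta", "")
            self.delta_counts[item_id] += 1
        elif kind == COMPLETED:
            self.final[event["item_id"]] = event.get("transcript", "")
        # All other server events are ignored in this sketch.
```

If `delta_counts` stays at one per item, or deltas only show up after the turn has ended, the transcript is effectively arriving per utterance rather than per word.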

This behavior is due to how voice activity detection (VAD) works in the Realtime API. By default, VAD segments audio based on detected speech turns, either by silence (server_vad) or semantic cues (semantic_vad). The API emits transcription events (such as partial or final results) only at the end of these detected turns, not continuously as you speak (source: platform.openai.com/docs/guides/realtime-vad). This means you may not receive any output until the model determines you have finished speaking, which can feel slower than expected, especially if your speech contains few pauses or the VAD settings are not aggressive enough.
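As a sketch of what more aggressive VAD settings could look like, the helpers below build a `turn_detection` object for each mode: a shorter `silence_duration_ms` for `server_vad`, or a higher `eagerness` for `semantic_vad`. The specific values are illustrative rather than recommendations, the helper names are hypothetical, and the `session.update` envelope is an assumption that may differ between API versions.

```python
import json


def server_vad(silence_duration_ms: int = 200,
               threshold: float = 0.5,
               prefix_padding_ms: int = 300) -> dict:
    """Silence-based turn detection; a shorter silence window ends turns sooner."""
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


def semantic_vad(eagerness: str = "high") -> dict:
    """Semantic turn detection; higher eagerness closes turns more readily."""
    if eagerness not in {"low", "medium", "high", "auto"}:
        raise ValueError(f"unsupported eagerness: {eagerness!r}")
    return {"type": "semantic_vad", "eagerness": eagerness}


def session_update(turn_detection: dict) -> str:
    """Serialize a session update carrying the VAD settings (envelope assumed)."""
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    })


if __name__ == "__main__":
    print(session_update(server_vad(silence_duration_ms=150)))
    print(session_update(semantic_vad("high")))
```

Shorter silence windows trade latency for a higher chance of cutting a turn in the middle of a sentence, so the value usually needs tuning against real speech.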

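If you want more frequent updates, you can try adjusting the VAD settings so that turns end sooner, or disabling VAD and committing the audio buffer manually at shorter intervals.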

#

The knowledge sources do not mention a way to receive true word-level partial results as you speak, and the observed delay is a result of the API’s design around turn-based detection rather than continuous, per-word streaming.

In summary:

  • The delay and the lack of partials while speaking are expected with the current VAD-based design.
  • You can experiment with VAD settings or manual commits, but true word-level streaming is not currently supported by the OpenAI Realtime API.
