#how to achieve instant subtitle-style output?


weary temple
#

When using OpenAI’s Realtime API, I’d like to display live transcription results instantly, similar to YouTube subtitles. However, when I enable turn_detection, the transcription appears with a significant delay and doesn’t feel real-time. Even if I send input_audio_buffer.commit every 200 ms, the output still lags.

Is the only practical solution to turn off turn_detection and handle the segmentation logic manually?

meager plinth BOT
#

Hi @weary temple. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

To achieve instant, subtitle-style output (i.e., live, low-latency transcription updates as the user speaks), you need to consider how the Realtime API’s turn detection (VAD) works:

  • With turn_detection enabled:
    The API waits for a speech turn to end (e.g., a pause or silence) before committing the audio buffer and emitting a transcription event. This introduces a delay, as you’ve observed, because the model only transcribes and emits results after detecting the end of a speech segment. This is by design for natural conversation flow, but it’s not ideal for instant subtitles. Sources: Realtime transcription – Voice activity detection; Voice activity detection (VAD).

  • Sending input_audio_buffer.commit frequently:
    Even if you manually commit the buffer every 200ms, partial transcriptions may still not arrive as quickly as you want. Transcription of a slice starts only once that slice is committed, and the underlying model and event system are optimized for turn-based (not token-by-token) output.

  • For true instant subtitles:
    The knowledge sources indicate that to get the most immediate, streaming transcription (like YouTube subtitles), you should disable VAD by setting turn_detection to null and handle segmentation and buffer commits yourself. This gives you full control to commit audio as often as you like and display partial results as soon as they are available. Sources: Realtime conversations – Voice activity detection; Realtime transcription – Voice activity detection.

#

"VAD can be disabled by setting turn_detection to null... When VAD is disabled, the client will have to manually emit some additional client events to trigger audio responses: Manually send input_audio_buffer.commit..." (Realtime conversations – Voice activity detection)

Summary:
Yes, the only practical solution for instant, subtitle-style output is to disable turn detection (turn_detection: null) and handle segmentation and buffer commits manually. This approach allows you to push audio and receive transcription deltas as quickly as possible, minimizing latency and making the experience feel truly real-time.
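
To make "display partial results as soon as they are available" a bit more concrete, here is a minimal sketch of a subtitle line that grows as delta events arrive and is replaced by the final transcript when the item completes. It assumes the delta events carry item_id and delta fields and the completed events carry item_id and transcript; SubtitleLine and feed are hypothetical names used only for illustration.

```python
import json

DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class SubtitleLine:
    """Builds display text from partial and final transcription events."""

    def __init__(self):
        self._text = {}   # item_id -> text known so far for that item
        self._order = []  # item_ids in the order they were first seen

    def feed(self, raw_message: str) -> str:
        """Apply one server event (a JSON string) and return the text to show."""
        event = json.loads(raw_message)
        kind = event.get("type")
        item_id = event.get("item_id")
        if kind in (DELTA, COMPLETED) and item_id not in self._text:
            self._text[item_id] = ""
            self._order.append(item_id)
        if kind == DELTA:
            # Partial result: append the new fragment to what we already have.
            self._text[item_id] += event.get("delta", "")
        elif kind == COMPLETED:
            # Final result: replace the partial text for this item.
            self._text[item_id] = event.get("transcript", "")
        return self.text()

    def text(self) -> str:
        """Join the non-empty item texts in arrival order."""
        parts = (self._text[i].strip() for i in self._order)
        return " ".join(p for p in parts if p)
```

You would call feed() on every message read from the socket and redraw the subtitle with the returned string. Events of any other type leave the text unchanged.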

If you need further details on how to implement this, let me know!

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:

#

@weary temple Kapa.ai is still learning and improving, please let me know how I did by reacting below.

#

Let me check the knowledge base for you again. One moment! :books:

#

Certainly! Here’s a sample code snippet for achieving instant, subtitle-style output using OpenAI’s Realtime API with turn_detection disabled. This example uses a WebSocket connection, sends audio buffers, and commits them manually for low-latency transcription. The key is to set turn_detection to null and frequently send input_audio_buffer.commit events.

Sample WebSocket Payloads:

  1. Start the session with turn detection disabled:
{
  "type": "transcription_session.update",
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "prompt": "",
    "language": "en"
  },
  "turn_detection": null,
  "input_audio_noise_reduction": {
    "type": "near_field"
  }
}

Source: Streaming the transcription of an ongoing audio recording

  2. Append audio data and commit frequently (e.g., every 100–200ms):
{
  "type": "input_audio_buffer.append",
  "audio": "Base64EncodedAudioData"
}
{
  "type": "input_audio_buffer.commit"
}

Source: Realtime conversations – Voice activity detection
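
For the append step, the raw audio has to be cut into slices of the chosen duration before it is base64-encoded. A minimal sketch of that slicing follows, under the assumption that pcm16 input is 24 kHz, mono, 16-bit audio (please verify this against the audio format section of the docs). chunk_size_bytes and append_events are hypothetical helper names.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # assumption: pcm16 input is 24 kHz mono
BYTES_PER_SAMPLE = 2     # 16-bit samples


def chunk_size_bytes(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of bytes that hold chunk_ms milliseconds of mono PCM16 audio."""
    return sample_rate_hz * BYTES_PER_SAMPLE * chunk_ms // 1000


def append_events(pcm: bytes, chunk_ms: int = 200):
    """Yield one input_audio_buffer.append event (as JSON text) per slice."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]  # the last slice may be shorter
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string can be sent over the WebSocket as is, followed by an input_audio_buffer.commit event whenever you decide a segment is complete.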

Python-style Pseudocode:

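import asyncio
import base64
import json
import os

import websockets

API_KEY = os.environ["OPENAI_API_KEY"]  # Assumes the key is exported in your shell


async def stream_audio(ws, audio_chunks):
    for chunk in audio_chunks:
        # Send audio buffer
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("utf-8")
        }))
        # Commit buffer so this slice can be transcribed
        await ws.send(json.dumps({
            "type": "input_audio_buffer.commit"
        }))
        await asyncio.sleep(0.2)  # Pace commits; assumes ~200 ms of audio per chunk


async def print_transcripts(ws):
    # Listen for transcription events
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "conversation.item.input_audio_transcription.delta":
            print("Partial transcript:", event["delta"])
        elif event["type"] == "conversation.item.input_audio_transcription.completed":
            print("Final transcript:", event["transcript"])


async def main():
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?intent=transcription",
        # websockets >= 14 calls this additional_headers; older releases use extra_headers.
        # The beta Realtime docs also sent "OpenAI-Beta": "realtime=v1"; add it if required.
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Start session with turn_detection disabled.
        # Payload shape follows the guide; if the server rejects it, check whether the
        # API reference expects these fields nested under a "session" key.
        await ws.send(json.dumps({
            "type": "transcription_session.update",
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-transcribe",
                "prompt": "",
                "language": "en"
            },
            "turn_detection": None
        }))
        # Read events while audio is still being sent; sending everything first
        # would delay all transcripts until the stream ends.
        receiver = asyncio.create_task(print_transcripts(ws))
        await stream_audio(ws, your_audio_chunks_generator())
        await asyncio.sleep(2)  # Arbitrary grace period for trailing events
        receiver.cancel()


asyncio.run(main())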


Note:
  • Replace your_audio_chunks_generator() with your own function that yields small PCM16 audio chunks.
  • This code is a template; you’ll need to handle authentication and audio input as appropriate for your environment.

References:
  • Realtime transcription – Voice activity detection: https://platform.openai.com/docs/guides/realtime-transcription#voice-activity-detection
  • Streaming the transcription of an ongoing audio recording: https://platform.openai.com/docs/guides/speech-to-text#streaming-the-transcription-of-an-ongoing-audio-recording
  • Handling transcriptions: https://platform.openai.com/docs/guides/realtime-transcription#handling-transcriptions
#

If you need a more specific example (e.g., with microphone input), let me know!


weary temple
#

How can I determine where to break the speech into pronunciation segments? @meager plinth

meager plinth BOT
#


The knowledge sources do not provide explicit guidance or algorithms for determining where to break speech into pronunciation segments when using the OpenAI Realtime API with turn detection disabled. The segmentation logic—such as splitting on silence, pauses, or linguistic boundaries—must be implemented on the client side when you manage commits manually.

The documentation does describe how the built-in VAD (voice activity detection) works:

  • Server VAD breaks speech based on periods of silence, using parameters like silence_duration_ms and threshold to tune sensitivity.
  • Semantic VAD uses a classifier to estimate when a user has finished speaking, based on the content of the utterance, but this is only available when VAD is enabled and not when you are handling segmentation manually. Source: Voice activity detection (VAD).
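
If you want to try tuning the built-in server VAD before turning it off, the turn_detection object could be built along these lines. This is only a sketch: server_vad is a hypothetical helper, the default values are illustrative rather than the API's own defaults, and the 0 to 1 range check on threshold is an assumption to verify against the docs.

```python
def server_vad(threshold: float = 0.5,
               silence_duration_ms: int = 300,
               prefix_padding_ms: int = 300) -> dict:
    """Build a turn_detection object for silence-based (server) VAD."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is assumed to lie between 0 and 1")
    if silence_duration_ms < 0 or prefix_padding_ms < 0:
        raise ValueError("durations must be non-negative")
    return {
        "type": "server_vad",
        "threshold": threshold,                      # higher = needs louder speech
        "prefix_padding_ms": prefix_padding_ms,      # audio kept before detected speech
        "silence_duration_ms": silence_duration_ms,  # silence that ends a turn
    }
```

A shorter silence_duration_ms should make turns end sooner, at the cost of cutting in during natural pauses. The returned dict would go in the turn_detection field of the session update in place of null.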

If you disable VAD, you are responsible for deciding when to commit audio buffers. The documentation does not specify recommended strategies for this, so you may need to experiment with:

  • Fixed time intervals (e.g., every 100–200ms)
  • Detecting silence or low energy in the audio stream
  • Using external speech segmentation libraries
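
As one illustration of the silence-detection option, here is a minimal energy-based sketch. It is not taken from the documentation: it assumes little-endian PCM16 frames of fixed duration, and the RMS threshold and silence length are arbitrary values you would need to tune. rms and commit_points are hypothetical names.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def commit_points(frames, frame_ms=20, silence_rms=500.0, min_silence_ms=300):
    """Return indices of the frames after which a commit would be sent.

    A commit point is the frame that completes min_silence_ms of continuous
    silence following at least one frame of speech.
    """
    points = []
    heard_speech = False
    silent_ms = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= silence_rms:
            heard_speech = True
            silent_ms = 0
        else:
            silent_ms += frame_ms
            if heard_speech and silent_ms >= min_silence_ms:
                points.append(index)
                # Start looking for the next utterance.
                heard_speech = False
                silent_ms = 0
    return points
```

In a live pipeline you would run the same check frame by frame and send input_audio_buffer.commit when a commit point is reached. A fixed energy threshold is fragile in noisy rooms, which is where a dedicated segmentation library may do better.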

In summary, the knowledge sources do not contain a built-in or recommended method for breaking speech into pronunciation segments when VAD is off; this logic is left to the implementer. Source: Realtime conversations – Voice activity detection.
