#API Endpoints Comparisons

1 messages · Page 1 of 1 (latest)

feral flame
#

If I wanted to make a conversational voice agent with manual control over context. What are the practical differences between the Realtime API, the Responses API with an Audio-Audio capable model, and the Completions API with an Audio-Audio model.

The voice agents post talks about using either the Realtime API or a chained approach. What is the practical difference between the two of these? Considering the Realtime and Completions APIs can also stream responses.

boreal patrolBOT
#

Hi @feral flame. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

You’re asking two related things:

  1. How Realtime vs Responses vs Chat Completions differ for a conversational voice agent (audio↔audio) where you manage context.
  2. How that compares to the “chained” voice pipeline described in the Voice agents guide.

Below is what’s explicitly covered in the docs; anything not stated there I’ll flag as such.


1. Realtime API vs Responses vs Chat Completions (for voice agents)

Realtime API

What it is

  • A stateful session with a Realtime model (e.g. gpt-realtime) that natively supports audio+text in, audio+text out, optimized for low latency and continuous conversations.[Realtime overview]
  • You connect via WebRTC (browser) or WebSocket/SIP and exchange events (e.g. session.update, input_audio_buffer.append, response.create).[Realtime conversations]

Context & control

#
  • The session maintains a Conversation object (items for user audio, model responses, tools, etc.).[Realtime conversations]
  • You can:
    • Configure truncation strategies (e.g. truncation: { type: "retention_ratio", retention_ratio: 0.7 }) to control how much history is kept vs dropped, which is important for cache behavior and context size.[Prompt caching 201; Context summarization]
    • Implement your own context summarization loop: track all turns, periodically summarize older ones into a shorter message, and keep recent turns verbatim.[Context summarization]
    • Use out‑of‑band responses (conversation: "none") to run extra tasks (e.g. transcription, summaries) that don’t modify the main conversation state.[OOB transcription code; Why OOB]

Audio behavior

#

When it’s the best fit

  • You want:
    • Live, low‑latency audio in/out.
    • A long‑lived session where the server tracks conversation state.
    • Fine‑grained control over turns, interruptions, and context trimming.

Responses API with an audio‑capable model

The docs you provided only compare Responses vs Chat Completions at a high level (no audio‑specific examples), but they do clarify the shape of the API:

  • Responses uses Items instead of messages; a message is just one item type among others (e.g. function_call, function_call_output).[Migrate to Responses]
  • You send input and get back a response with an output array of items; there is no n parameter and only one generation per call.[Migrate to Responses]

The knowledge sources you shared do not explicitly document an “audio‑audio” Responses example, so I can’t describe its exact audio payload format or streaming semantics from these docs alone.

Context & control (from what is documented)

#
  • Responses is request/response: each call is independent; there is no built‑in long‑lived session like Realtime.
  • To have manual control over context, you would:
    • Maintain your own conversation history on the client/server.
    • Decide what subset (or summary) to send as input on each call.
  • This gives you maximal explicit control over what the model sees, but you must implement all context management yourself.

Because the provided docs don’t show audio examples for Responses, I can’t reliably compare its audio streaming behavior to Realtime beyond saying that Responses is not the “live session” API; Realtime is.[Realtime overview; Audio guide]


Chat Completions API with an audio‑capable model

From the audio guide and the forum post:

Context & control

  • Chat Completions is also request/response:
    • You send a messages array.
    • The model doesn’t maintain a server‑side session; you manage history client‑side.
  • For manual context control, you:
    • Decide which prior messages to include.
    • Optionally summarize or trim older turns before each call.

When it’s the best fit (per docs)

#
  • When realtime latency is not critical, but you want:
    • Audio in/out,
    • Features like function calling, and
    • A more traditional HTTP/streaming interface.[Choosing the right API]

2. Realtime speech‑to‑speech vs “chained” voice pipeline

The Voice agents guide explicitly contrasts these two architectures:[Voice agents]

A. Speech‑to‑speech with live audio sessions (Realtime)

Trade‑offs (from docs)

#
  • Pros:
  • Cons:
    • You have less explicit control over intermediate text (unless you add extra out‑of‑band calls).
    • Evals and debugging are more complex because you’re grading both content and audio quality and a multi‑stage streaming pipeline.[Realtime eval guide]

B. Chained voice pipeline (STT → LLM → TTS)

Per the Voice agents guide, this is the “chained voice pipeline”:[Voice agents]

  • You explicitly chain:
    1. Speech‑to‑text via audio/transcriptions (e.g. gpt-4o-transcribe, whisper-1).[Audio guide; Speech transcription methods]
    2. A text LLM (e.g. via Responses or Chat Completions).
    3. Text‑to‑speech via audio/speech (e.g. gpt-4o-mini-tts, tts-1).[Audio guide]

When it’s recommended

#
  • “Predictable workflows or extending an existing text agent” and cases where you want stronger control over intermediate text and deterministic logic between stages.[Voice agents; Choosing the right API]
  • Often better for:
    • Support flows,
    • Approval‑heavy flows,
    • Durable transcripts and explicit business logic.[Voice agents]

Trade‑offs (from docs)


3. Putting it together for “manual context control”

Given the docs:

#
  • If you want maximum explicit control over context and don’t strictly need ultra‑low latency:
    • Use a chained pipeline with Chat Completions or Responses:
  • If you want low‑latency, natural speech‑to‑speech but still strong context control:

The knowledge sources don’t provide a detailed, side‑by‑side latency or feature matrix for Realtime vs Responses vs Chat Completions with audio‑audio, but they do clearly position:

#
  • Realtime → live, low‑latency, stateful speech‑to‑speech sessions.
  • Chat Completions / Responses → general LLM APIs (with audio support in Chat Completions) where you manage context per request.
  • Chained pipeline → explicit STT → LLM → TTS for maximum control and predictability.

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon: