#Contradiction in the OpenAI Realtime API Cookbook
You’re right to notice the mismatch, and your interpretation is correct.
From the Realtime “Beta to GA migration” guide, the event names were explicitly changed when the Realtime API went GA:
- `response.text.delta` → `response.output_text.delta`
- `response.audio.delta` → `response.output_audio.delta`
- `response.audio_transcript.delta` → `response.output_audio_transcript.delta`
[Beta→GA migration]
The current GA docs for audio over WebSockets also consistently use response.output_audio.delta and response.output_audio.done as the correct server events. [Handling audio with WebSockets; Audio output from WebSocket]
However, the context‑summarization cookbook still shows response.audio.delta in its event table and example handler, which matches the old (beta) naming. [Context summarization example]
So:
- The API reference is up to date for GA.
- The cookbook example is partially outdated in its event names and should be updated to use `response.output_audio.delta` (and similarly for other `output_*` events).
If you’re building against the GA Realtime API (without the OpenAI-Beta: realtime=v1 header), you should follow the GA event names from the API reference, not the older ones shown in that cookbook.
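If you are porting a handler written against the beta names, the renames above can be kept in a small lookup table. This is only a minimal sketch: the helper name `to_ga_event_name` is hypothetical, and it covers only the three renames quoted from the migration guide.

```python
# Beta -> GA renames quoted from the migration guide above.
BETA_TO_GA = {
    "response.text.delta": "response.output_text.delta",
    "response.audio.delta": "response.output_audio.delta",
    "response.audio_transcript.delta": "response.output_audio_transcript.delta",
}


def to_ga_event_name(event_type: str) -> str:
    """Return the GA name for a beta event type; other names pass through unchanged."""
    return BETA_TO_GA.get(event_type, event_type)
```

A handler can then normalize `event["type"]` once and branch only on the GA names.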
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@oblique mango Kapa.ai is still learning and improving, please let me know how I did by reacting below.
is there a way to retrieve the session's record of a conversation history to check if it matches my internal records? @unique dagger
Let me check the knowledge base for you again. One moment! :books:
Yes. The Session protocol in openai-agents-python defines a get_items(limit: int | None = None) method that returns the conversation history for that session as a list of input items. [Session protocol]
Concrete session implementations (e.g., SQLiteSession, RedisSession, SQLAlchemySession, DaprSession) all provide this same async method to retrieve history:
```python
items = await session.get_items()  # full history
items = await session.get_items(limit=50)  # latest 50 items
```
- `SQLiteSession.get_items` reads stored `message_data` rows for the `session_id` and returns them in chronological order. [SQLiteSession.get_items]
- `RedisSession.get_items` pulls the list from Redis and returns the messages, again in chronological order, with an optional `limit`. [RedisSession.get_items]
- `SQLAlchemySession.get_items` and `DaprSession.get_items` behave similarly, querying their respective backends and returning a list of deserialized items. [SQLAlchemySession.get_items; DaprSession.get_items]
You can compare the returned list of items to your internal records to verify they match.
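One way to do that comparison is a position-by-position diff. The sketch below is an assumption-laden illustration: `InMemorySession` is a hypothetical stand-in that only mimics the async `get_items(limit=None)` shape described above, and `find_mismatches` is not part of the SDK.

```python
import asyncio
from typing import Any, List, Optional, Tuple


class InMemorySession:
    """Hypothetical stand-in exposing the same async get_items(limit=None) shape."""

    def __init__(self, items: List[Any]) -> None:
        self._items = list(items)

    async def get_items(self, limit: Optional[int] = None) -> List[Any]:
        if limit is None:
            return list(self._items)
        return self._items[-limit:] if limit > 0 else []


async def find_mismatches(session: Any, internal_records: List[Any]) -> List[Tuple[int, Any, Any]]:
    """Return (index, ours, theirs) for every position where the two histories differ."""
    stored = await session.get_items()
    mismatches = []
    for index in range(max(len(stored), len(internal_records))):
        ours = internal_records[index] if index < len(internal_records) else None
        theirs = stored[index] if index < len(stored) else None
        if ours != theirs:
            mismatches.append((index, ours, theirs))
    return mismatches
```

An empty result means both histories agree item for item; a real session object could be passed in place of the stand-in, assuming its items compare equal to the shape you log internally.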
is there a way to retrieve the session's record of a conversation history to check if it matches my internal records? I am using the realtime api with a websocket. in a voice conversation. @unique dagger
Let me check the knowledge base for you again. One moment! :books:
There is some support for this, but the information in the knowledge sources is limited.
From the openai-agents-python side, a session is explicitly defined as something that “stores conversation history for a specific session,” and you can retrieve that history via the Session.get_items(limit: int | None = None) method, which returns a list of items representing the conversation history. [Session protocol] This is how you’d programmatically read back what the agent believes the session history is and compare it to your own records.
For Realtime specifically, there is also a RealtimeHistoryUpdated event whose history field is documented as:
`history: list[RealtimeItem]` – The full history of the session. [Realtime history]
So, in a Realtime WebSocket voice conversation, the documented options are:
- Listen for `RealtimeHistoryUpdated` events and inspect the `history` list to see the full session history as the Realtime agent tracks it, then compare that to your internal log. [Realtime history]
- If you are using an `openai-agents-python` Session backend, call `await session.get_items()` (optionally with a `limit`) to retrieve the stored conversation history and compare it to your own records. [Session protocol]
The knowledge sources do not show any other Realtime-API-specific “fetch history” command beyond these patterns, so if you need more direct or lower-level history retrieval, it isn’t documented in the provided material.
The docs say that using a websocket to interact with the realtime api allows for more low level control than creating a RealtimeSession. what does the websocket give you control over that you simply cannot do with the RealtimeSession @unique dagger
Let me check the knowledge base for you again. One moment! :books:
The knowledge sources only partially answer this, and they don’t list a complete “cannot do with RealtimeSession” matrix. I’ll stick to what’s explicitly documented.
From the docs, WebSocket access is the lowest‑level interface to the Realtime API: you send and receive raw JSON events yourself. [Sending & receiving events] In contrast, RealtimeSession in openai-agents-python is a higher‑level wrapper that:
- Hides the WebSocket connection (`OpenAIRealtimeWebSocketModel`) behind `session.model`
- Provides convenience methods like `send_message()` and `send_audio()` that automatically manage some Realtime events for you [Inputs and outputs]
The docs explicitly call out what you use the underlying transport (WebSocket model) for that goes beyond the high‑level session:
You can access the underlying transport object through `session.model`.
Use this when you need:
- custom listeners via `session.model.add_listener(...)`
- raw client events such as `response.create` or `session.update`
- custom `url`, `headers`, or `api_key` handling through `model_config`
- `call_id` attach to an existing realtime call [Low-level access]
Concretely, the extra control you get at the WebSocket / model level (vs just using the high‑level RealtimeSession helpers) includes:
- Sending arbitrary raw client events
  - You can send your own `response.create`, `session.update`, `input_audio_buffer.commit`, etc., by constructing and sending raw messages over the socket (or via `session.model.send_event(RealtimeModelSendRawMessage(...))`). [Inputs and outputs; Low-level access]
  - This is required for fully manual turn control (e.g., disabling `turn_detection` and deciding exactly when to trigger `response.create`). [Inputs and outputs]
- Custom connection parameters and endpoints
  - You can specify a custom `url`, `headers`, and `api_key` in `model_config` when creating the session, which are passed through to the underlying WebSocket connection. This is how you, for example, connect to Azure OpenAI Realtime endpoints or use token‑based auth. [Low-level access; OpenAIRealtimeWebSocketModel.connect]
- Attaching to an existing realtime call via `call_id`
  - The model transport supports `call_id` in its config to attach to an existing Realtime call (e.g., SIP example). [Low-level access; OpenAIRealtimeWebSocketModel.connect]
- Custom event listeners at the transport level
  - You can register listeners directly on the model (`session.model.add_listener(...)`) to receive low‑level events such as audio deltas, tool calls, and internal error events. [Low-level access; OpenAIRealtimeWebSocketModel.add_listener]
- Direct handling of raw WebSocket messages
  - At the pure WebSocket level (without the Python agents wrapper), you are responsible for sending JSON events and parsing all server events yourself, including Base64‑encoded audio chunks (`response.output_audio.delta`). [Sending & receiving events; Audio output over WebSocket]
What’s not clearly documented is a list of things that are impossible with RealtimeSession but possible with raw WebSockets; instead, the docs show that RealtimeSession exposes the underlying WebSocket model when you need this lower‑level control. So in practice:
- Use `RealtimeSession` for convenience (`send_message`, `send_audio`, automatic response handling).
- Drop down to the WebSocket / model layer when you need raw events, custom connection behavior, or to attach to existing calls.
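To make the "parse all server events yourself" point concrete, here is a minimal sketch of handling one raw message at the pure WebSocket level. It assumes the Base64 audio payload sits in the event's `delta` field; the helper name `extract_audio_chunk` is hypothetical.

```python
import base64
import json
from typing import Optional


def extract_audio_chunk(raw_message: str) -> Optional[bytes]:
    """Decode the audio bytes from a response.output_audio.delta event.

    Returns None for any other event type. Assumes the Base64 payload
    is carried in the event's "delta" field.
    """
    event = json.loads(raw_message)
    if event.get("type") != "response.output_audio.delta":
        return None
    return base64.b64decode(event["delta"])
```

With `RealtimeSession`, this decoding step is handled for you; on a raw socket you would call something like this for every incoming message and append the returned bytes to your playback buffer.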
when would you want to use a RealtimeRunner vs a RealtimeSession @unique dagger
Let me check the knowledge base for you again. One moment! :books:
RealtimeRunner and RealtimeSession are used together, but at different stages:
- `RealtimeRunner` is a factory for realtime sessions.
  - You create it once with a starting `RealtimeAgent` and optional run‑level config (guardrails, async tool calls, etc.). [RealtimeRunner; Realtime guide]
  - It wires that agent to a transport (by default `OpenAIRealtimeWebSocketModel`) and prepares the overall run shape. [RealtimeRunner; Realtime transport]
- `RealtimeSession` is the live connection you actually talk to.
  - You get it by calling `await runner.run(...)`. [RealtimeRunner.run; Session lifecycle]
  - You then:
    - enter it (`async with await runner.run() as session:`),
    - send messages/audio (`send_message`, `send_audio`),
    - and iterate its events until the conversation ends. [RealtimeSession; Session lifecycle]
So in practice:
- Use `RealtimeRunner` when you’re setting up a realtime agent run: choosing the starting agent, transport, and run‑wide config.
- Use `RealtimeSession` once the run has started, to conduct the actual conversation: streaming inputs/outputs, handling events, tools, and history.
how do you access the chat history of a RealtimeSession vs a RealtimeRunner @unique dagger
Let me check the knowledge base for you again. One moment! :books:
For RealtimeSession, the history is exposed directly via events; for RealtimeRunner, there’s no separate history API in the provided docs.
RealtimeSession
When a RealtimeSession starts (__aenter__), it immediately emits a RealtimeHistoryUpdated event whose history field is:
`history: list[RealtimeItem]` – The full history of the session. [RealtimeSession ref; HistoryUpdated.history]
The Realtime agents guide also calls out:
High‑value session events include `history_added`, `history_updated`… They expose the session’s local history as `RealtimeItem` objects. [Events, history, interruptions]
So to access chat history with a RealtimeSession, you:
- Iterate session events and watch for:
  - `history_updated` → full history (`list[RealtimeItem]`)
  - `history_added` → newly added item(s) [Events, history, interruptions]
RealtimeRunner
The knowledge sources only describe RealtimeRunner as a session factory that you call to obtain a RealtimeSession; they do not document any direct history access on the runner itself. [Realtime agents guide]
So, based on the available docs:
- You do not access chat history from `RealtimeRunner` directly.
- You do access it from the `RealtimeSession` created by the runner, via `history_updated` / `history_added` events and their `history` / `item` fields.
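A minimal sketch of that event-driven pattern follows. `FakeEvent` and `fake_events` are hypothetical stand-ins so the example runs on its own; the `type` strings and the `history` / `item` fields are taken from the event names above, and with a real session you would iterate the session itself instead.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, List


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a session event with type/history/item fields."""
    type: str
    history: List[Any] = field(default_factory=list)
    item: Any = None


async def track_history(events: AsyncIterator[Any]) -> List[Any]:
    """Keep a local copy of the history as events arrive and return it at the end."""
    history: List[Any] = []
    async for event in events:
        if event.type == "history_updated":
            history = list(event.history)  # full replacement
        elif event.type == "history_added":
            history.append(event.item)  # single new item
    return history


async def fake_events() -> AsyncIterator[FakeEvent]:
    yield FakeEvent("history_updated", history=["first item"])
    yield FakeEvent("history_added", item="second item")
    yield FakeEvent("audio")  # unrelated events are ignored
```

Running `asyncio.run(track_history(fake_events()))` yields the two tracked items in order, which is the list you would compare against your internal records.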
can you add a message to the RealtimeSession history without it appearing as the most recent message (so that the agent does not directly reply to it as though the user just said that thing) and also without triggering a response? If i wanted to insert some information into the chat history for the model to use in its next responses, but i didnt want the model to reply as though the user just said that, how would i do that? @unique dagger
Let me check the knowledge base for you again. One moment! :books:
The knowledge sources don’t describe a way to back‑insert items into a Realtime session’s history (e.g., at an earlier position) or to mark a history item as “not the latest turn.” They also don’t show a dedicated “silent history write” event for Realtime sessions.
What is documented are two related patterns:
- Control what context is used for a response, without changing the default conversation
  With the low‑level Realtime API you can create an out‑of-band response that:
  - does not get added to the default conversation (`conversation: "none"`), and
  - uses a custom `input` array that can include arbitrary items and/or references to existing items. [Create custom context; Out-of-band responses]

  Example (Python-style from the docs):
```python
event = {
    "type": "response.create",
    "response": {
        "conversation": "none",
        "metadata": {"topic": "pizza"},
        "output_modalities": ["text"],
        "input": [
            # reference existing items
            {"type": "item_reference", "id": "some_conversation_item_id"},
            # add new context that is *not* treated as a fresh user turn
            {
                "type": "message",
                "role": "system",
                "content": [
                    {"type": "input_text", "text": "Extra background info..."}
                ],
            },
        ],
    },
}
```
  This lets you inject information into the context used for the next response without:
  - modifying the default conversation history, and
  - making it look like the user just said that. [Create custom context]
- In the agents SDK, control how history + new input are merged
  In `openai-agents-python`, when you use a `Session` (not specifically Realtime), the runner normally sends:
  - session history (`session.get_items(...)`)
  - new turn input

  You can customize this merge with `RunConfig.session_input_callback(history, new_input)`, which returns the final list of items sent to the model. [Control history merge]
  That callback:
  - does not change what’s stored in the session; it only changes what the model sees for that call.
  - lets you add extra items into the context (e.g., prepend a synthetic system message) without those items being saved as “the latest user turn.”

  Example from the docs (simplified):
```python
def keep_recent_history(history, new_input):
    # You could also insert extra context here
    return history[-10:] + new_input
```
  Only the new turn’s items are persisted; your injected items can be used for context without becoming the latest user message. [Control history merge]
Given the docs we have:
- There is no documented way to insert a message into a RealtimeSession’s stored history at an earlier position or as a “hidden” item that doesn’t look like the latest turn.
- The documented pattern to “add information the model can use next, without it being treated as a fresh user utterance and without auto‑triggering a response” is:
  - either use `response.create` with `conversation: "none"` and a custom `input` array that includes your extra info as a system message alongside references to existing items; [Create custom context]
  - or, in the agents SDK, use `session_input_callback` to inject extra context into the model input for that turn, without persisting it as the latest user message. [Control history merge]
If you specifically need “write into the stored RealtimeSession history but not as the most recent item,” the knowledge sources do not describe a supported mechanism for that.
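For the `session_input_callback` route, a callback that prepends extra context could look roughly like this. It is a sketch under assumptions: the factory name `make_context_injector` is hypothetical, and the `{"role": "system", "content": ...}` item shape is assumed to be an acceptable input item for your model.

```python
from typing import Any, Callable, List


def make_context_injector(extra_context: str) -> Callable[[List[Any], List[Any]], List[Any]]:
    """Build a (history, new_input) callback that prepends one synthetic system item."""

    def inject(history: List[Any], new_input: List[Any]) -> List[Any]:
        # Assumed item shape; adjust to whatever input items your model accepts.
        context_item = {"role": "system", "content": extra_context}
        # Neither input list is mutated; a new merged list is returned.
        return [context_item] + list(history) + list(new_input)

    return inject
```

The returned function has the same `(history, new_input)` signature as the `keep_recent_history` example, so it would be passed wherever that callback goes.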
@unique dagger what is the more low level alternative to RealtimeSession.send_message() What if I want to add a message to the chat history but I dont want to prompt a response or interrupt the model talking.
Let me check the knowledge base for you again. One moment! :books:
The more low-level alternative is to send a raw client event through the underlying model transport instead of using RealtimeSession.send_message().
In the Realtime agents guide, this is shown using RealtimeModelSendRawMessage via session.model.send_event(...) to send a response.create manually. [Inputs and outputs]
```python
from agents.realtime.model_inputs import RealtimeModelSendRawMessage

await session.model.send_event(
    RealtimeModelSendRawMessage(
        message={
            "type": "response.create",
        }
    )
)
```
You can use the same pattern to send any raw Realtime client event, not just response.create. That’s the low-level alternative to send_message().
However, the knowledge sources do not document a specific raw event that:
- Adds a message to the session’s stored conversation history,
- Does not become the “latest user turn” for response triggering, and
- Does not trigger or interrupt a response.
What is documented instead is how to add context without changing the default conversation or prompting a normal turn:
- Use a `response.create` with a custom `input` array and `conversation: "none"` to build a one-off context (including your extra message as a `system` or `assistant` message) that the model uses for that response only, without adding it to the default conversation. [Create custom context]
So, based on the docs:
- Low-level alternative to `send_message()`: send raw client events via `session.model.send_event(RealtimeModelSendRawMessage(...))`. [Inputs and outputs]
- Adding info for the next response without treating it as a fresh user utterance: use `response.create` with a custom `input` array (and optionally `conversation: "none"`) rather than trying to silently write into the stored session history. [Create custom context]
The sources do not show a way to “silently append to stored history” in the exact way you describe.
@unique dagger
```python
event = {
    "type": "response.create",
    "response": {
        "conversation": "none",
        "metadata": { "topic": "pizza" },
        "output_modalities": [ "text" ],
        # Create a custom input array for this request with whatever
        # context is appropriate
        "input": [
            # potentially include existing conversation items:
            {
                "type": "item_reference",
                "id": "some_conversation_item_id"
            },
            # include new content as well
            {
                "type": "message",
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "Is it okay to put pineapple on pizza?",
                    }
                ],
            }
        ],
    },
}
ws.send(json.dumps(event))
```
Is it possible to run some function like this with custom text / information in the context, using `session.model.send_event(RealtimeModelSendRawMessage(...))` rather than sending a dict to the websocket specifically?
Let me check the knowledge base for you again. One moment! :books:
Yes. The Realtime agents guide explicitly shows using session.model.send_event(RealtimeModelSendRawMessage(...)) to send raw client events like response.create, instead of writing JSON to the WebSocket yourself. [Inputs and outputs]
The guide’s example (for manual response control) is:
```python
from agents.realtime.model_inputs import RealtimeModelSendRawMessage

await session.model.send_event(
    RealtimeModelSendRawMessage(
        message={
            "type": "response.create",
        }
    )
)
```
You can adapt this pattern to send your custom response.create with a custom input array (like in your pizza example). Conceptually it would look like:
```python
from agents.realtime.model_inputs import RealtimeModelSendRawMessage

await session.model.send_event(
    RealtimeModelSendRawMessage(
        message={
            "type": "response.create",
            "response": {
                "conversation": "none",
                "metadata": {"topic": "pizza"},
                "output_modalities": ["text"],
                "input": [
                    {"type": "item_reference", "id": "some_conversation_item_id"},
                    {
                        "type": "message",
                        "role": "user",
                        "content": [
                            {
                                "type": "input_text",
                                "text": "Is it okay to put pineapple on pizza?",
                            }
                        ],
                    },
                ],
            },
        },
    )
)
```
The docs don’t show this exact payload, but they do state that RealtimeModelSendRawMessage is for sending “raw client events” and demonstrate it with response.create, so using it with the richer response body from the Realtime API examples is consistent with the documented pattern. [Inputs and outputs; Create custom context]
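If you send this kind of event often, the payload can be built by a small helper and then passed as `message=` to `RealtimeModelSendRawMessage`. This is a sketch only: `build_out_of_band_response` is a hypothetical name, and the payload shape simply mirrors the pizza example above.

```python
from typing import Any, Dict, Optional, Sequence


def build_out_of_band_response(
    text: str,
    referenced_item_ids: Sequence[str] = (),
    role: str = "user",
    metadata: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a response.create payload that stays out of the default conversation."""
    # Existing conversation items are pulled in by reference.
    input_items = [
        {"type": "item_reference", "id": item_id} for item_id in referenced_item_ids
    ]
    # The custom text is added as one new message item.
    input_items.append(
        {
            "type": "message",
            "role": role,
            "content": [{"type": "input_text", "text": text}],
        }
    )
    response = {
        "conversation": "none",
        "output_modalities": ["text"],
        "input": input_items,
    }
    if metadata:
        response["metadata"] = metadata
    return {"type": "response.create", "response": response}
```

For example, `build_out_of_band_response("Is it okay to put pineapple on pizza?", ["some_conversation_item_id"], metadata={"topic": "pizza"})` reproduces the payload from the pizza example.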
@unique dagger How would you add context permanently to a conversation without triggering a response
Let me check the knowledge base for you again. One moment! :books:
The knowledge sources don’t describe a way to permanently write extra context into a Realtime conversation’s stored history without it being treated as a new turn and/or potentially triggering a response. They instead document patterns for adding ephemeral context to a single response.
Here’s what is documented:
- Custom context for a single response (not permanent)
  You can send a `response.create` with a custom `input` array (including new messages and references to existing items) and even set `conversation: "none"` so it doesn’t touch the default conversation. This lets you inject extra text the model can use for that response only, without changing the stored conversation. [Create custom context]
- Custom merge of history + new input (agents SDK, not Realtime history writes)
  With the higher‑level agents `Session`, you can use `RunConfig.session_input_callback(history, new_input)` to modify what the model sees each turn (e.g., prepend a synthetic system message) without changing what’s actually stored in the session. [Control history merge]
Neither of these is a “permanent, silent insert into RealtimeSession history.” They both adjust what context is sent to the model, not the underlying stored history.
So, based on the provided docs:
- You can add context for a response (via `response.create` with custom `input`, or via `session_input_callback` in the agents SDK). [Create custom context; Control history merge]
- There is no documented mechanism to permanently add context into a Realtime conversation’s stored history without it being treated as a normal turn that could influence response triggering.
@unique dagger how do you send a sessionupdate event to a realtimeSession so you can update it's prompt / information
Let me check the knowledge base for you again. One moment! :books:
You update a RealtimeSession’s prompt/instructions by sending a session.update client event through its underlying model transport.
There are two closely related cases:
1. Raw Realtime API (WebSocket)
Send a session.update event over the WebSocket with the new prompt / instructions:
```python
event = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "model": "gpt-realtime",
        "prompt": {
            "id": "pmpt_123",
            "version": "89",
            "variables": {
                "city": "Paris",
            },
        },
        # Direct fields override prompt fields if they overlap
        "instructions": "Speak clearly and briefly. Confirm understanding before taking actions.",
    },
}
ws.send(json.dumps(event))
```
This is exactly the pattern shown in the Realtime docs for updating a session’s prompt and instructions mid-call. [Session lifecycle; Prompt update]
You can also send a smaller update, e.g. just changing variables or instructions:
```javascript
dc.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    prompt: {
      id: "pmpt_123",
      version: "89",
      variables: { city: "Berlin" },
    },
  },
}));
```
2. Using RealtimeSession from openai-agents-python
RealtimeSession wraps a transport (by default OpenAIRealtimeWebSocketModel). To send a session.update from Python, you use the model’s send_event with a RealtimeModelSendSessionUpdate event, which internally sends a session.update client event. This is how RealtimeSession.update_agent works:
```python
async def update_agent(self, agent: RealtimeAgent) -> None:
    """Update the active agent for this session and apply its settings to the model."""
    self._current_agent = agent
    updated_settings = await self._get_updated_model_settings_from_agent(
        starting_settings=None,
        agent=self._current_agent,
    )
    await self._model.send_event(
        RealtimeModelSendSessionUpdate(session_settings=updated_settings)
    )
```
[RealtimeSession.update_agent]
So in your own code you:
- Build new session settings (including updated `instructions`, `prompt`, etc.).
- Call `await session._model.send_event(RealtimeModelSendSessionUpdate(...))` or, more typically, use a helper like `session.update_agent(...)` that does this for you. [RealtimeSession.update_agent]
Summary
- Over a raw WebSocket: send a JSON event with `type: "session.update"` and your new `session` fields. [Session lifecycle]
- With `RealtimeSession` (agents SDK): call `update_agent(...)` or directly send a `RealtimeModelSendSessionUpdate` via `session._model.send_event(...)`, which emits the `session.update` event under the hood. [RealtimeSession.update_agent]
@unique dagger how do you send a sessionupdate event to a realtimeSession so you can update it's prompt / information. Assuming i only have a reference to the RealtimeSession variable that i'm using. i dont have a raw websocket variable currently
Let me check the knowledge base for you again. One moment! :books:
You can do this directly through the RealtimeSession you already have; you don’t need a raw WebSocket handle.
In openai-agents-python, RealtimeSession updates the model’s session by sending a RealtimeModelSendSessionUpdate event via its internal model (self._model). The built‑in update_agent method shows the pattern:
```python
async def update_agent(self, agent: RealtimeAgent) -> None:
    """Update the active agent for this session and apply its settings to the model."""
    self._current_agent = agent
    updated_settings = await self._get_updated_model_settings_from_agent(
        starting_settings=None,
        agent=self._current_agent,
    )
    await self._model.send_event(
        RealtimeModelSendSessionUpdate(session_settings=updated_settings)
    )
```
[RealtimeSession.update_agent]
So, given only a session: RealtimeSession:
- Build the new session settings dict (e.g., updated `instructions`, `prompt`, etc.).
- Send a session update event via the session’s model:
```python
from agents.realtime.model_inputs import RealtimeModelSendSessionUpdate

updated_settings = {
    # whatever fields you want to change, e.g.:
    "instructions": "New instructions here",
    # "prompt": {...}, "output_modalities": [...], etc.
}
await session._model.send_event(
    RealtimeModelSendSessionUpdate(session_settings=updated_settings)
)
```
This sends the equivalent of a session.update client event to the Realtime API, updating the prompt / information for that live RealtimeSession.
@unique dagger under what circumstances is the agents.realtime.events.RealtimeAgentEndEvent sent
Let me check the knowledge base for you again. One moment! :books:
agents.realtime.events.RealtimeAgentEndEvent is emitted by RealtimeSession when the underlying model signals that a turn has finished.
In the RealtimeSession event loop, when it receives a model event with event.type == "turn_ended":
- It clears some per‑turn guardrail state.
- Then it emits a
RealtimeAgentEndEventwith the current agent and common event info. [RealtimeSession]
elif event.type == "turn_ended":
# Clear guardrail state for next turn
self._item_transcripts.clear()
self._item_guardrail_run_counts.clear()
await self._put_event(
RealtimeAgentEndEvent(
agent=self._current_agent,
info=self._event_info,
)
)
The event class itself is documented as:
“An agent has ended.”
agent: The agent that ended.
info: Common info for all events, such as the context. [Realtime events]
So, whenever the model finishes generating for a turn (turn_ended), the session emits RealtimeAgentEndEvent for that agent.
@unique dagger when I run this code:
```python
from agents.realtime.model_inputs import RealtimeModelSendSessionUpdate

updated_settings = {
    # whatever fields you want to change, e.g.:
    "instructions": "New instructions here",
    # "prompt": {...}, "output_modalities": [...], etc.
}
await session._model.send_event(
    RealtimeModelSendSessionUpdate(session_settings=updated_settings)
)
```
I get this error: Cannot update a conversation's voice if assistant audio is present. How do i make it clear that i'm only trying to update the instructions and not the voice.
Let me check the knowledge base for you again. One moment! :books:
The error is coming from the Realtime API itself: once a session has produced assistant audio, you cannot change the voice for that conversation anymore. [Session lifecycle]
What’s happening in your case is:
- `RealtimeModelSendSessionUpdate(session_settings=updated_settings)` is being built from a full `RealtimeSessionModelSettings` object.
- That object likely still contains the original `voice` (or nested `audio.output.voice`) from when the session was created.
- When you send the update, the server sees a `voice` field in the payload and rejects it with: “Cannot update a conversation's voice if assistant audio is present.”
To “only update instructions,” you must ensure the session update payload does not include any voice fields.
In the agents SDK, the recommended way to configure a realtime session is via model_settings (including audio.output.voice) when you create the runner, and then use session.update_agent(...) / session._model.send_event(...) with a minimal settings dict that only has the fields you want to change. [Agent and session configuration; Model settings]
Concretely, instead of reusing a full settings object, build a narrow one:
```python
from agents.realtime.model_inputs import RealtimeModelSendSessionUpdate

# Only include fields you actually want to change
updated_settings = {
    "instructions": "New instructions here",
    # do NOT include "voice" or "audio.output.voice" here
}
await session._model.send_event(
    RealtimeModelSendSessionUpdate(session_settings=updated_settings)
)
```
The docs don’t show a full example of stripping voice, but they do state:
- Most session properties can be updated at any time, except for the `voice` the model uses for audio output, after the model has responded with audio once during the session. [Session lifecycle]
So the fix is to:
- Make sure `updated_settings` only contains updatable fields (e.g., `instructions`, `prompt`, `tools`, etc.).
- Avoid passing any `voice` / `audio.output.voice` in that dict once audio has been produced.
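If your settings come from a larger dict that may carry a voice, one option is to strip the voice fields before sending. This is a sketch assuming only the two locations named above (top-level `voice` and nested `audio.output.voice`); the helper name `without_voice` is hypothetical.

```python
import copy
from typing import Any, Dict


def without_voice(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a deep copy of the settings with any voice fields removed."""
    cleaned = copy.deepcopy(settings)  # leave the caller's dict untouched
    cleaned.pop("voice", None)
    audio = cleaned.get("audio")
    if isinstance(audio, dict):
        output = audio.get("output")
        if isinstance(output, dict):
            output.pop("voice", None)
    return cleaned
```

The cleaned dict can then be passed as `session_settings` in place of the original.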
@unique dagger If i want to update the instructions used by a RealtimeSession's agent. should i copy it's current agent, edit the instructions, then use update_agent with the newly updated agent?
Let me check the knowledge base for you again. One moment! :books:
Yes, that’s the pattern the SDK is designed for.
- Clone the current agent with new instructions
  Both `Agent` and `RealtimeAgent` have a `clone(**kwargs)` helper that makes a shallow copy with selected fields changed, e.g.:
```python
new_agent = session.current_agent.clone(instructions="New instructions")
```
  This is exactly how the docs show updating instructions: `new_agent = agent.clone(instructions="New instructions")`. [Agent.clone; RealtimeAgent.clone]
- Apply it to the live session with `update_agent`
  `RealtimeSession.update_agent(agent)` sets the active agent and sends a `session.update` to apply its settings to the model:
```python
await session.update_agent(new_agent)
```
  Internally, this computes updated model settings from the agent and sends a `RealtimeModelSendSessionUpdate` event. [RealtimeSession.update_agent]
So: yes—clone the existing RealtimeAgent with new instructions, then call session.update_agent(new_agent) to update the instructions for that RealtimeSession.
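Since clone is described as a shallow copy, it is worth seeing what that means for mutable fields such as tools. The sketch below uses DemoAgent, a hypothetical stand-in dataclass rather than the SDK class, and assumes clone behaves like dataclasses.replace.

```python
from dataclasses import dataclass, field, replace


@dataclass
class DemoAgent:  # hypothetical stand-in, not the SDK's RealtimeAgent
    name: str
    instructions: str
    tools: list = field(default_factory=list)

    def clone(self, **kwargs):
        # Assumption: a shallow copy with only the given fields replaced
        return replace(self, **kwargs)


agent = DemoAgent(name="assistant", instructions="Old instructions", tools=["get_random_fact"])
new_agent = agent.clone(instructions="New instructions")

shares_tools = new_agent.tools is agent.tools  # shallow copy: same list object
```

Under that assumption the original agent keeps its old instructions, but both agents point at the same tools list, so pass a new list to clone if you plan to change tools on only one of them.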
@unique dagger what's the difference between an agent's instructions and its prompt?
Let me check the knowledge base for you again. One moment! :books:
In the openai-agents-python SDK, instructions and prompt are two different ways to control an agent, with different roles and mechanisms.
instructions
- A field directly on Agent / RealtimeAgent:
  instructions: (str | Callable[[RunContextWrapper[TContext], Agent[TContext]], MaybeAwaitable[str]] | None) = None
  It is:
  The instructions for the agent. Will be used as the "system prompt" when this agent is invoked. Describes what the agent should do, and how it responds. [Agent instructions; RealtimeAgent instructions]
- Can be:
  - a string (static system prompt), or
  - a function that dynamically returns a string based on the run context and the agent. [Agent instructions]
- For RealtimeAgent, get_system_prompt resolves instructions into the actual system prompt string sent to the model. [RealtimeAgent]
So: instructions = the agent’s system prompt text, defined in code.
prompt
- A separate field on Agent / RealtimeAgent:
  prompt: Prompt | DynamicPromptFunction | None = None
  For Agent:
  A prompt object (or a function that returns a Prompt). Prompts allow you to dynamically configure the instructions, tools and other config for an agent outside of your code. Only usable with OpenAI models, using the Responses API. [Agent prompt]
  For RealtimeAgent:
  A prompt object. Prompts allow you to dynamically configure the instructions, tools and other config for an agent outside of your code. Only usable with OpenAI models. [RealtimeAgent prompt]
- Prompt objects are a higher-level configuration mechanism:
  - live outside your code (e.g., managed prompt templates),
  - can set instructions, tools, and other config together.
So: prompt = a structured prompt configuration object (or provider) that can define instructions and more, often managed outside code.
How they relate
- instructions is the direct system prompt string (or a function that returns one) on the agent.
- prompt is a Prompt object that can supply instructions (and other settings) dynamically.
- When both are used, the SDK can combine or override based on the Prompt definition; the docs don't spell out precedence in detail, only that prompts “dynamically configure the instructions, tools and other config.” [Agent prompt]
In short:
- Use instructions for straightforward, code-defined system behavior.
- Use prompt when you want a reusable, externally managed prompt configuration that can also control tools and other settings, not just the raw instruction text.
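To make the callable form of instructions concrete, here is a minimal sketch of how a string or a function could be resolved into a system prompt. DemoAgent, resolve_instructions, and the dict used as context are hypothetical stand-ins; in the SDK this resolution happens internally (e.g., in get_system_prompt).

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Any


@dataclass
class DemoAgent:  # hypothetical stand-in for an SDK agent
    name: str
    instructions: Any = None  # str, callable, or None


async def resolve_instructions(agent, context):
    """Turn static or dynamic instructions into a plain string (sketch)."""
    value = agent.instructions
    if value is None or isinstance(value, str):
        return value
    result = value(context, agent)  # dynamic: called with (context, agent)
    if inspect.isawaitable(result):  # supports both sync and async functions
        result = await result
    return result


def language_instructions(context, agent):
    return f"You are {agent.name}. Always respond in {context['language']}."


static_agent = DemoAgent("helper", "Be concise.")
dynamic_agent = DemoAgent("helper", language_instructions)

static_prompt = asyncio.run(resolve_instructions(static_agent, {}))
dynamic_prompt = asyncio.run(resolve_instructions(dynamic_agent, {"language": "French"}))
```

The dynamic form is handy when the system prompt depends on per-run state, such as the user's preferred language.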
@unique dagger how do you specify the language that a realtime voice agent should speak? Also, how do you specify what language should be assumed for the user audio, for transcription purposes?
Let me check the knowledge base for you again. One moment! :books:
There are two separate settings for this, and they work differently:
1. Language the agent speaks (output voice language)
The knowledge sources do not document a dedicated "output language" setting on RealtimeSessionModelSettings or RealtimeAgent. The voice itself (e.g., "ash") is set via audio.output.voice, but there is no explicit language selector for the output audio shown in the provided docs. [Agent and session configuration]
The typical way to control what language the agent responds in is through the agent's instructions — e.g., "Always respond in French." — rather than a dedicated language field.
2. Language assumed for user audio transcription (input)
This is explicitly configurable. The RealtimeInputAudioTranscriptionConfig TypedDict has a language field:
class RealtimeInputAudioTranscriptionConfig(TypedDict):
    language: NotRequired[str]  # The language code for transcription
    model: NotRequired[Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "whisper-1"] | str]
    prompt: NotRequired[str]  # An optional prompt to guide transcription
You set it in the runner config under audio.input.transcription:
runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-realtime-1.5",
            "audio": {
                "input": {
                    "format": "pcm16",
                    "transcription": {
                        "model": "gpt-4o-mini-transcribe",
                        "language": "fr",  # e.g. ISO-639-1 language code
                    },
                },
                "output": {
                    "format": "pcm16",
                    "voice": "ash",
                },
            },
        }
    },
)
The language value should be a language code (e.g., "fr" for French, "es" for Spanish). The SDK docs don't specify the exact format, but the OpenAI API reference for input audio transcription describes this field as an ISO-639-1 code (e.g., "en") and notes that supplying it improves accuracy and latency.
@unique dagger how do you tune the turn-taking settings? I've found that in my conversations I'm receiving many transcriptions from user audio that are single words, usually when the user is speaking slowly. How do I specify the duration of silence before the user's audio is transcribed as a distinct message in the chat history?
Let me check the knowledge base for you again. One moment! :books:
The issue you're describing — audio being chunked too eagerly into short fragments — is controlled by the turn detection (VAD) settings, specifically silence_duration_ms and the choice of VAD mode.
Option 1: Increase silence_duration_ms (Server VAD)
With the default server_vad mode, the key parameter is:
silence_duration_ms: Duration of silence (in ms) to detect speech stop. Defaults to 500ms. With shorter values the model responds more quickly but may jump in on short pauses. [Server VAD]
Increasing this means the model waits longer after the user stops speaking before treating it as a completed turn. You can set it via session.update:
{
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 1000,  # increase from default 500ms
            "create_response": True,
            "interrupt_response": True,
        }
    }
}
Or via RealtimeModelSendSessionUpdate in the agents SDK:
from agents.realtime.model_inputs import RealtimeModelSendSessionUpdate

await session._model.send_event(
    RealtimeModelSendSessionUpdate(session_settings={
        "turn_detection": {
            "type": "server_vad",
            "silence_duration_ms": 1000,
        }
    })
)
Option 2: Switch to Semantic VAD
Semantic VAD is specifically designed for this problem — it uses a model to estimate whether the user has actually finished speaking, rather than just detecting silence. For example, if the user trails off with "uhhm…", it scores a low probability of turn end and waits longer. [Semantic VAD]
You can control its eagerness with the eagerness parameter:
- low: waits longer (max timeout 8s), better for slow speakers
- medium / auto: default (max timeout 4s)
- high: responds quickly (max timeout 2s)
{
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",
            "eagerness": "low",  # let the user take their time
            "create_response": True,
            "interrupt_response": True,
        }
    }
}
Summary: For slow speakers getting fragmented into single-word transcriptions, the most targeted fix is either:
- Increase silence_duration_ms in server_vad, or
- Switch to semantic_vad with eagerness: "low", which is purpose-built for this scenario.
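If you switch between the two modes while tuning, a small builder keeps the settings dict valid for whichever mode is active. This is a sketch only: build_turn_detection is a hypothetical helper, and the field names are the ones quoted above.

```python
from typing import Optional

EAGERNESS_LEVELS = {"low", "medium", "high", "auto"}


def build_turn_detection(
    mode: str = "server_vad",
    *,
    silence_duration_ms: Optional[int] = None,
    eagerness: Optional[str] = None,
) -> dict:
    """Build a turn_detection dict with only the fields valid for the mode (sketch)."""
    if mode == "server_vad":
        config = {"type": "server_vad"}
        if silence_duration_ms is not None:
            if silence_duration_ms <= 0:
                raise ValueError("silence_duration_ms must be positive")
            config["silence_duration_ms"] = silence_duration_ms
        return config
    if mode == "semantic_vad":
        config = {"type": "semantic_vad"}
        if eagerness is not None:
            if eagerness not in EAGERNESS_LEVELS:
                raise ValueError(f"eagerness must be one of {sorted(EAGERNESS_LEVELS)}")
            config["eagerness"] = eagerness
        return config
    raise ValueError(f"unknown turn detection mode: {mode}")


patient_server_vad = build_turn_detection("server_vad", silence_duration_ms=1000)
patient_semantic_vad = build_turn_detection("semantic_vad", eagerness="low")
```

Either result can be used as the "turn_detection" value in the settings shown above.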
@unique dagger
"tools": [
{
"name": "get_random_fact",
"description": "Get a random fact to tell the user. \nUse this function whenever the user asks or if the conversation just needs a little spicing up.\n\nArgs: The current topic of conversation",
"params_json_schema": {
"properties": {
"current_conversation_topic": {
"title": "Current Conversation Topic",
"type": "string"
}
},
"required": [
"current_conversation_topic"
],
"title": "get_random_fact_args",
"type": "object",
"additionalProperties": false
},
"on_invoke_tool": {},
"strict_json_schema": true,
"is_enabled": true,
"tool_input_guardrails": null,
"tool_output_guardrails": null,
"needs_approval": false,
"timeout_seconds": null,
"timeout_behavior": "error_as_result",
"timeout_error_function": null,
"defer_loading": false,
"_failure_error_function": null,
"_use_default_failure_error_function": true,
"_is_agent_tool": false,
"_is_codex_tool": false,
"_agent_instance": null,
"_tool_namespace": null,
"_tool_namespace_description": null,
"_mcp_title": null,
"_tool_origin": null,
"_emit_tool_origin": true
}
how do i tweak these tool parameters to ensure that the model waits for the fun fact function to finish before it continues the conversation and tells the fun fact to the user.
This is how the function is defined
Let me check the knowledge base for you again. One moment! :books:
The tool parameters you've shown are already set up correctly for synchronous execution — the key thing here isn't the tool's JSON schema parameters, but rather how the tool function itself is defined and whether it's async.
From the function_tool source code, the SDK handles sync vs async functions differently:
- If your function is async (async def), it is awaited directly.
- If your function is sync (def), it is run in a thread via asyncio.to_thread(...).
In both cases, the SDK waits for the function to complete before returning the result to the model. [function_tool ref]
So the model will not continue until the tool returns — that behaviour is built in.
If the model is continuing before the tool finishes
One thing to rule out is a timeout. Your tool config shows:
"timeout_seconds": null,
"timeout_behavior": "error_as_result"
timeout_seconds: null means no timeout is enforced, so the SDK will wait indefinitely. That's fine.
However, if you want to add a safety timeout (e.g., 10 seconds), you can set it when defining the tool:
from agents import function_tool

@function_tool(timeout=10, timeout_behavior="error_as_result")
async def get_random_fact(current_conversation_topic: str) -> str:
    ...
With timeout_behavior="error_as_result", if the function times out, a model-visible error message is returned instead of raising an exception, so the model can still respond gracefully. [function_tool ref]
Summary
- The SDK already waits for the tool to finish before passing the result back to the model — no parameter changes are needed for that.
- If the model seems to be skipping the result, check whether your function is raising an exception silently (controlled by failure_error_function) or timing out.
- The timeout_seconds and timeout_behavior fields in your tool config are the relevant knobs if you want to control what happens when the function takes too long. [FunctionTool ref]
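The "wait for the tool, but return an error message on timeout" behaviour can be illustrated with plain asyncio. This is a sketch of the idea only, not the SDK's implementation: call_with_timeout and slow_fact are hypothetical names, and the error text is made up.

```python
import asyncio


async def call_with_timeout(tool, *args, timeout_seconds=None):
    """Await a tool coroutine; on timeout, return an error string instead of raising."""
    try:
        # timeout=None means wait indefinitely, like timeout_seconds: null
        return await asyncio.wait_for(tool(*args), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return f"Tool timed out after {timeout_seconds} seconds."


async def slow_fact(topic: str) -> str:
    await asyncio.sleep(0.2)  # simulate a slow lookup
    return f"A fact about {topic}"


async def main():
    waited = await call_with_timeout(slow_fact, "owls")
    timed_out = await call_with_timeout(slow_fact, "owls", timeout_seconds=0.05)
    return waited, timed_out


waited, timed_out = asyncio.run(main())
```

In the first call the caller simply blocks until the fact is ready; in the second it gets a readable error it can still respond to.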
@unique dagger during a realtime voice agent session with tool use. what are the events that I should use to maintain a record of what was said, in sequence, as well as what tools were used, and when the tools completed, and with what return values.
Let me check the knowledge base for you again. One moment! :books:
Based on the knowledge sources, here are the key events to listen for on a RealtimeSession to maintain a full sequential record:
Conversation Content
- history_added: emitted when a new RealtimeItem is appended to the session's local history (e.g., a new user message after transcription completes, or a new assistant message). [Events, history, interruptions]
- history_updated: emitted when an existing item in the history is updated (e.g., a message item gets its content filled in). [Events, history, interruptions]
These two events expose the session's local history as RealtimeItem objects, including user messages and assistant messages.
For user speech specifically, the underlying model event input_audio_transcription_completed (type RealtimeModelInputAudioTranscriptionCompletedEvent) carries the item_id and transcript — the session uses this to update history and then emits history_updated or history_added. [Model Events]
Tool Use
- tool_start (RealtimeToolStart): emitted when the agent begins a tool call. Contains:
  - agent: the agent making the call
  - tool: the tool being called
  - arguments: the arguments as a JSON string [Tool Events]
- tool_end (RealtimeToolEnd): emitted after the tool function completes and its output has been sent back to the model. Contains:
  - tool: the tool that was called
  - output: the return value of the tool
  - agent: the agent that called it
  - arguments: the arguments that were passed [RealtimeSession]
Turn Boundaries
- agent_start / agent_end: mark the beginning and end of each agent turn, useful for sequencing. [Events, history, interruptions]
Interruptions
- audio_interrupted: emitted when the user interrupts the assistant mid-speech. The session also updates history at this point to reflect what was actually heard. [Events, history, interruptions]
Recommended minimal event set for your record
| What you want to track | Event(s) to listen for |
|---|---|
| User messages (speech → text) | history_added, history_updated |
| Assistant messages | history_added, history_updated |
| Tool called (name, args, timing) | tool_start |
| Tool completed (return value) | tool_end |
| Turn boundaries | agent_start, agent_end |
| Interrupted audio | audio_interrupted |
The knowledge sources don't document a single "replay log" API, but iterating the session's event stream and recording these events in order will give you a complete sequential record of the conversation and all tool activity.
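A minimal sketch of such a recorder is shown here. record_event is a hypothetical helper, the events are SimpleNamespace stand-ins rather than real SDK objects, and the sketch assumes events expose the type, tool, arguments, and output attributes listed above. Payloads of history events are left out because their field names are not covered here.

```python
import time
from types import SimpleNamespace

RECORDED_TYPES = {
    "history_added", "history_updated", "tool_start", "tool_end",
    "agent_start", "agent_end", "audio_interrupted",
}


def record_event(log: list, event, clock=time.monotonic) -> None:
    """Append one ordered, timestamped entry per relevant event (sketch)."""
    if event.type not in RECORDED_TYPES:
        return
    entry = {"seq": len(log), "at": clock(), "type": event.type}
    if event.type in ("tool_start", "tool_end"):
        entry["tool"] = getattr(event.tool, "name", None)
        entry["arguments"] = getattr(event, "arguments", None)
    if event.type == "tool_end":
        entry["output"] = event.output
    log.append(entry)


# Stand-in events for illustration only
tool = SimpleNamespace(name="get_random_fact")
events = [
    SimpleNamespace(type="agent_start"),
    SimpleNamespace(type="tool_start", tool=tool, arguments='{"current_conversation_topic": "owls"}'),
    SimpleNamespace(type="audio"),  # ignored: not part of the record
    SimpleNamespace(type="tool_end", tool=tool, arguments='{"current_conversation_topic": "owls"}', output="Owls cannot move their eyes."),
    SimpleNamespace(type="agent_end"),
]

log = []
for event in events:
    record_event(log, event)
```

In a real session the loop would be the async iteration over the session's event stream, with the same function called for each event.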
@unique dagger is there a tool to turn OpenAI realtime agents events in the Python SDK into dictionaries?
Let me check the knowledge base for you again. One moment! :books:
The knowledge sources don't document a dedicated utility function specifically for converting Realtime session events into dictionaries.
What the sources do show is that the Realtime events (e.g., RealtimeAgentStartEvent, RealtimeAudio, RealtimeAudioEnd, etc.) are Python @dataclass objects [Realtime events], and the underlying model layer uses Pydantic models (e.g., model_dump_json(exclude_unset=True) is called when sending raw messages to the WebSocket). [OpenAIRealtimeWebSocketModel]
So your options with standard Python tooling are:
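- For the session-level events (dataclasses like RealtimeAgentStartEvent, RealtimeAudio, etc.), use Python's built-in dataclasses.asdict(event):
  import dataclasses
  event_dict = dataclasses.asdict(event)
  Note that asdict recurses into nested dataclasses and deep-copies other field values, so it can be slow or fail on events that carry an agent or context object.
- For Pydantic models (e.g., the raw OpenAI event objects that the model layer serializes with model_dump_json), use .model_dump():
  event_dict = event.model_dump()
  The SDK's own RealtimeModel* events (e.g., RealtimeModelAudioEvent) may themselves be dataclasses rather than Pydantic models, so check with dataclasses.is_dataclass(event) or hasattr(event, "model_dump") before choosing.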
The knowledge sources don't mention a dedicated SDK helper beyond these standard approaches, so I can't confirm one exists in the SDK itself.
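If you don't want to care which kind of object you have, a small dispatcher can cover both cases. This is a sketch: event_to_dict is a hypothetical helper, not part of the SDK, and DemoToolEnd is a stand-in dataclass. It deliberately converts dataclasses shallowly (one level, no deep copy) so that fields holding agents or contexts are passed through as-is.

```python
import dataclasses
from dataclasses import dataclass


def event_to_dict(event) -> dict:
    """Convert an event object to a dict without deep-copying its fields (sketch)."""
    if dataclasses.is_dataclass(event) and not isinstance(event, type):
        # Shallow alternative to dataclasses.asdict: top-level fields only
        return {f.name: getattr(event, f.name) for f in dataclasses.fields(event)}
    if hasattr(event, "model_dump"):
        return event.model_dump()  # Pydantic v2 models
    return dict(vars(event))  # fallback for plain objects


@dataclass
class DemoToolEnd:  # stand-in for a session-level event
    tool_name: str
    output: str
    type: str = "tool_end"


event_dict = event_to_dict(DemoToolEnd(tool_name="get_random_fact", output="Owls cannot move their eyes."))
```

The resulting dict may still contain non-serializable values, so pass a default handler to json.dumps if you intend to write it to a log file.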
@unique dagger when are RealtimeAgentStartEvent and RealtimeAgentEndEvent emitted
Let me check the knowledge base for you again. One moment! :books:
Based on the knowledge sources:
- RealtimeAgentStartEvent is emitted by RealtimeSession when it receives a model event with event.type == "turn_started", i.e., when the model starts generating a response for a turn. [RealtimeSession]
- RealtimeAgentEndEvent is emitted when the session receives a model event with event.type == "turn_ended", i.e., when the model finishes generating for a turn. At this point the session also clears per-turn guardrail state. [RealtimeSession]
The underlying turn_started model event is itself triggered when the Realtime API sends a response.created server event, and turn_ended corresponds to response.done. [OpenAIRealtimeWebSocketModel]
Both events carry:
- agent: the current RealtimeAgent at the time of the event
- info: common event info (e.g., context) [Realtime events]
@unique dagger when I run RealtimeSession.send_audio() I don't get any reply, how can I check what's happening?
Let me check the knowledge base for you again. One moment! :books:
A few things to check, based on what the knowledge sources document:
1. Check for error events
Iterate the session's event stream and watch for error events:
async for event in session:
    if event.type == "error":
        print(f"Error: {event.error}")
2. Check your turn detection configuration
send_audio() does not automatically trigger a response in every configuration. The docs explicitly state:
Raw audio buffering does not automatically do the same in every configuration.
If server-side turn detection (server_vad or semantic_vad) is disabled, the model won't know when the user has finished speaking, so it won't generate a reply. [Inputs and outputs]
Fix option A: Enable turn detection in your runner config:
"audio": {
    "input": {
        "turn_detection": {
            "type": "semantic_vad",
            "interrupt_response": True,
        }
    }
}
Fix option B: If you're managing turns manually, you need to commit the audio buffer and trigger a response yourself:
from agents.realtime.model_inputs import RealtimeModelSendRawMessage

# Commit the audio buffer (marks end of user turn)
await session.send_audio(audio_bytes, commit=True)

# Then manually trigger a response
await session.model.send_event(
    RealtimeModelSendRawMessage(
        message={"type": "response.create"}
    )
)
3. Listen for raw_model_event to see what the server is actually sending
You can listen for raw_model_event events to see the low-level server events coming back, which may reveal whether the server is receiving audio and processing it:
async for event in session:
    if event.type == "raw_model_event":
        print(event.data)
[Events, history, interruptions]
Most likely cause: turn detection is not configured, so the model never receives a signal to respond. Adding turn_detection to your config or manually committing + calling response.create should resolve it.
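A quick way to see whether anything is coming back at all is to tally event types for a short while. The sketch below assumes only that events have a type attribute, as in the snippets above. summarize_events is a hypothetical helper, and fake_session is a stand-in async generator used in place of a real session.

```python
import asyncio
from collections import Counter
from types import SimpleNamespace


async def summarize_events(events, limit: int = 50) -> Counter:
    """Count event types from an async event stream, stopping after `limit` events."""
    counts = Counter()
    seen = 0
    async for event in events:
        counts[event.type] += 1
        seen += 1
        if seen >= limit:
            break
    return counts


async def fake_session():
    # Stand-in stream: in practice, pass the real session object instead
    for event_type in ["raw_model_event", "raw_model_event", "error"]:
        yield SimpleNamespace(type=event_type)


counts = asyncio.run(summarize_events(fake_session()))
```

A tally with only raw_model_event entries and no audio or history events would suggest the server is connected but never starting a response, which points back at turn detection.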
@unique dagger what events should i listen for from the realtime agents sdk during a voice conversation, to maintain a live chat transcript that updates live as the agent and user speaks
Let me check the knowledge base for you again. One moment! :books:
For a live-updating transcript, you need to handle both the user's speech and the assistant's speech separately, as they come through different events.
User Speech (Input Audio Transcription)
At the model layer, listen for:
- RealtimeModelInputAudioTranscriptionCompletedEvent: fires when the user's full turn transcript is ready. Contains item_id and transcript. [Model Events]
For incremental/live user transcription (streaming as the user speaks), the raw Realtime API emits conversation.item.input_audio_transcription.delta events with partial text, and conversation.item.input_audio_transcription.completed for the final transcript. [Handling transcriptions]
Note: for gpt-4o-transcribe and gpt-4o-mini-transcribe, the delta event contains incremental transcripts streamed in real time. For whisper-1, the delta contains the full transcript (same as completed). [Handling transcriptions]
Assistant Speech (Output Transcript)
At the model layer, listen for:
- RealtimeModelTranscriptDeltaEvent: partial transcript update as the assistant speaks. Contains item_id, delta (the new text chunk), and response_id. This is what you use to update the transcript live as the assistant generates audio. [Model Events]
The RealtimeSession internally accumulates these deltas per item_id into self._item_transcripts[item_id] and emits history_updated each time. [RealtimeSession]
Higher-Level: History Events
If you prefer to work at the session level rather than the model event level, use:
- history_added: a new item (user message, assistant message) was appended to the session history.
- history_updated: an existing item was updated (e.g., transcript filled in, or assistant transcript delta accumulated).
Both expose RealtimeItem objects. [Events, history, interruptions]
Interruptions
- audio_interrupted: the user interrupted the assistant. The session updates history to reflect what was actually heard, so your transcript should also truncate accordingly. [Events, history, interruptions]
Summary Table
| What to show | Event to listen for | Level |
|---|---|---|
| User transcript (live, streaming) | conversation.item.input_audio_transcription.delta (raw) | Raw model event |
| User transcript (final) | RealtimeModelInputAudioTranscriptionCompletedEvent | Model layer |
| Assistant transcript (live, streaming) | RealtimeModelTranscriptDeltaEvent | Model layer |
| Any history change (simpler approach) | history_added, history_updated | Session layer |
| Interrupted assistant speech | audio_interrupted | Session layer |
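Whichever layer you listen at, the transcript state itself can be kept in a small accumulator keyed by item_id. This is a sketch under assumptions: LiveTranscript is a hypothetical class, and you are expected to call add_delta from your delta handler and set_final from your completed-transcript handler.

```python
class LiveTranscript:
    """Accumulates streaming transcript text per item_id, in first-seen order (sketch)."""

    def __init__(self):
        self._order = []    # item_ids in the order they first appeared
        self._entries = {}  # item_id -> {"role", "text", "final"}

    def _entry(self, item_id, role):
        if item_id not in self._entries:
            self._entries[item_id] = {"role": role, "text": "", "final": False}
            self._order.append(item_id)
        return self._entries[item_id]

    def add_delta(self, item_id, role, delta):
        entry = self._entry(item_id, role)
        if not entry["final"]:  # ignore late deltas after the final text arrived
            entry["text"] += delta

    def set_final(self, item_id, role, transcript):
        entry = self._entry(item_id, role)
        entry["text"] = transcript  # the final transcript replaces accumulated deltas
        entry["final"] = True

    def lines(self):
        return [f'{self._entries[i]["role"]}: {self._entries[i]["text"]}' for i in self._order]


transcript = LiveTranscript()
transcript.add_delta("item_1", "user", "Tell me ")
transcript.add_delta("item_1", "user", "a fact")
transcript.add_delta("item_2", "assistant", "Owls cannot ")
transcript.set_final("item_1", "user", "Tell me a fact.")
transcript.add_delta("item_2", "assistant", "move their eyes.")
rendered = transcript.lines()
```

Replacing the text on set_final also covers the whisper-1 case, where the delta already carries the full transcript. On audio_interrupted you would additionally truncate the assistant entry to what was actually heard.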