#How to get a transcription of user audio during realtime audio conversation
Hi @tame trench. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
To achieve real-time transcription of user audio during a conversation using the Realtime API, you can follow these steps:
1. **Session Configuration**: Ensure that your session is configured to handle both audio and text modalities. Transcription depends on this.
2. **WebRTC Setup**: If you are using WebRTC for audio communication, the Realtime API acts as a peer connection to your client. You can set up a peer connection to handle audio input and output. Here's a basic example of configuring both local and remote audio:

```javascript
// Create a peer connection
const pc = new RTCPeerConnection();

// Set up to play remote audio from the model
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = e => audioEl.srcObject = e.streams[0];

// Add local audio track for microphone input in the browser
const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(ms.getTracks()[0]);
```

3. **Session Update**: After the WebRTC data channel opens, send a `session.update` message to enable the transcription model for the session. Without it, input audio is not transcribed.

```json
{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "input_audio_transcription": {
      "model": "whisper-1"
    }
  }
}
```

4. **Listening for Transcription Events**: During the session, you will receive server-sent events. With `input_audio_transcription` enabled, the transcript of the user's audio arrives in `conversation.item.input_audio_transcription.completed` events, while `response.audio_transcript.delta` events carry the in-progress transcript of the model's audio output.
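As a rough sketch of the session update step, the snippet below builds the `session.update` payload and defers sending until the channel is open. The helper names `buildTranscriptionSessionUpdate` and `sendWhenOpen` are hypothetical, and the payload simply mirrors the example message above.

```javascript
// Builds a session.update client event that enables input audio transcription.
// The default model name mirrors the example above; treat it as an assumption.
function buildTranscriptionSessionUpdate(model = "whisper-1") {
  return {
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      input_audio_transcription: { model },
    },
  };
}

// Sends the event immediately if the channel is open, otherwise once it opens.
// `channel` is any object with `readyState`, `send`, and `addEventListener`,
// for example an RTCDataChannel.
function sendWhenOpen(channel, event) {
  const payload = JSON.stringify(event);
  if (channel.readyState === "open") {
    channel.send(payload);
    return;
  }
  channel.addEventListener("open", () => channel.send(payload), { once: true });
}
```

With WebRTC this might be called as `sendWhenOpen(dc, buildTranscriptionSessionUpdate())`, where `dc` is the data channel created on the peer connection.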
By following these steps, you should be able to achieve near-real-time transcription of user audio during a conversation using the Realtime API. For more detailed guidance, you can refer to the Realtime model capabilities documentation.
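One possible way to consume these events is a small collector that keeps the user's transcript apart from the model's. This is a sketch, not the API's own client: `createTranscriptCollector` is a hypothetical helper, and the event names and fields (`conversation.item.input_audio_transcription.completed` with `transcript`, `response.audio_transcript.delta` with `delta`) are assumptions to verify against the Realtime API event reference.

```javascript
// Routes parsed Realtime server events into two transcripts: what the user
// said and what the model said. Event and field names are assumptions.
function createTranscriptCollector() {
  const userTurns = [];            // completed transcripts of user audio
  const modelPartial = new Map();  // item_id -> in-progress model transcript

  function handle(event) {
    switch (event.type) {
      case "conversation.item.input_audio_transcription.completed":
        // Transcript of one committed user audio item.
        userTurns.push({ itemId: event.item_id, text: event.transcript });
        break;
      case "response.audio_transcript.delta": {
        // Incremental transcript of the model's spoken reply.
        const soFar = modelPartial.get(event.item_id) ?? "";
        modelPartial.set(event.item_id, soFar + event.delta);
        break;
      }
      default:
        break; // ignore all other event types
    }
  }

  return {
    handle,
    userTurns: () => userTurns.slice(),
    modelText: (itemId) => modelPartial.get(itemId) ?? "",
  };
}
```

On a data channel, the collector could be wired up with `dc.addEventListener("message", (e) => collector.handle(JSON.parse(e.data)))`.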
@sick mirage is there a way to get the raw audio of the user that the model is responding to?
Let me check the knowledge base for you again. One moment! :books:
To obtain the raw audio of the user that the model is responding to, you can use the Realtime API's capabilities to handle audio inputs. Here’s how you can achieve this:
1. **Streaming Audio Input**: You can stream audio input to the server using the `input_audio_buffer.append` client event. This involves sending chunks of Base64-encoded audio bytes to the Realtime API over a WebSocket. Each chunk should not exceed 15 MB in size.

Here is an example of how you can append audio input bytes to the conversation:

```javascript
import fs from 'fs';
import decodeAudio from 'audio-decode';

// Assumes `ws` is an already-open WebSocket connection to the Realtime API.

// Converts Float32Array of audio data to PCM16 ArrayBuffer
function floatTo16BitPCM(float32Array) {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  let offset = 0;
  for (let i = 0; i < float32Array.length; i++, offset += 2) {
    let s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Converts a Float32Array to base64-encoded PCM16 data
function base64EncodeAudio(float32Array) {
  const arrayBuffer = floatTo16BitPCM(float32Array);
  let binary = '';
  let bytes = new Uint8Array(arrayBuffer);
  const chunkSize = 0x8000; // 32KB chunk size
  for (let i = 0; i < bytes.length; i += chunkSize) {
    let chunk = bytes.subarray(i, i + chunkSize);
    binary += String.fromCharCode.apply(null, chunk);
  }
  return btoa(binary);
}

// Fills the audio buffer with the contents of three files,
// then asks the model to generate a response.
const files = [
  './path/to/sample1.wav',
  './path/to/sample2.wav',
  './path/to/sample3.wav'
];

for (const filename of files) {
  const audioFile = fs.readFileSync(filename);
  const audioBuffer = await decodeAudio(audioFile);
  const channelData = audioBuffer.getChannelData(0);
  const base64Chunk = base64EncodeAudio(channelData);
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: base64Chunk
  }));
}

ws.send(JSON.stringify({type: 'input_audio_buffer.commit'}));
ws.send(JSON.stringify({type: 'response.create'}));
```

[Source](<https://platform.openai.com/docs/guides/realtime-model-capabilities#streaming-audio-input-to-the-server>)
2. **Handling Audio with WebRTC**: If you are using WebRTC, you can capture the user's audio input using the `getUserMedia` API and add it as a track to the peer connection. This allows you to access the raw audio stream directly from the user's microphone.
```javascript
const ms = await navigator.mediaDevices.getUserMedia({
  audio: true
});
pc.addTrack(ms.getTracks()[0]);