#How to get a transcription of user audio during realtime audio conversation


tame trench
#

I'm using the realtime audio API and would like to get text transcripts of the user's speech during the conversation.

Ideally, these transcriptions of the user's audio would happen in close to real time, similar to how the 'assistant' transcriptions stream in via response.text.delta events.

Is there a way to achieve this?

sick mirageBOT
#

Hi @tame trench. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

#

To achieve real-time transcription of user audio during a conversation using the Realtime API, you can follow these steps:

  1. Session Configuration: Ensure that your session is configured to handle both audio and text modalities. This is crucial for enabling transcription capabilities.

  2. WebRTC Setup: If you are using WebRTC for audio communication, the Realtime API acts as a peer connection to your client. You can set up a peer connection to handle audio input and output. Here's a basic example of configuring both local and remote audio:

    // Create a peer connection
    const pc = new RTCPeerConnection();
    
    // Set up to play remote audio from the model
    const audioEl = document.createElement("audio");
    audioEl.autoplay = true;
    pc.ontrack = e => audioEl.srcObject = e.streams[0];
    
    // Add local audio track for microphone input in the browser
    const ms = await navigator.mediaDevices.getUserMedia({
      audio: true
    });
    pc.addTrack(ms.getTracks()[0]);
    

    Source

  3. Session Update: After the WebRTC data channel opens, send a session.update message that sets input_audio_transcription. Input transcription is off by default, so this step is what activates it.

    {
      "event_id": "event_123",
      "type": "session.update",
      "session": {
        "modalities": ["text", "audio"],
        "input_audio_transcription": {
          "model": "whisper-1"
        }
      }
    }
    

    Source

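  4. Listening for Transcription Events: During the session, the server sends a conversation.item.input_audio_transcription.completed event containing the transcript of each user audio item. Transcription runs asynchronously on a separate model, so this event may arrive before or after the model's response events. Note that response.audio_transcript.delta events carry the in-progress transcript of the model's own audio output, not the user's speech.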

    Source
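The session update in step 3 can be sketched as below. This is a minimal sketch, not a definitive implementation: it assumes `channel` is the RTCDataChannel created on your peer connection (or any object exposing `readyState`, `send`, and `addEventListener`), and the helper names `buildTranscriptionUpdate` and `enableInputTranscription` are hypothetical. The payload mirrors the JSON shown in step 3, minus the optional `event_id`.

```javascript
// Hypothetical helper: builds the session.update payload from step 3.
function buildTranscriptionUpdate(model = "whisper-1") {
  return {
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      input_audio_transcription: { model },
    },
  };
}

// Hypothetical helper: sends the update as soon as the channel is open.
// `channel` is assumed to behave like an RTCDataChannel.
function enableInputTranscription(channel) {
  const send = () => channel.send(JSON.stringify(buildTranscriptionUpdate()));
  if (channel.readyState === "open") {
    send();
  } else {
    channel.addEventListener("open", send, { once: true });
  }
}
```

Checking `readyState` first avoids missing the `open` event when the helper is called after the channel has already opened.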

By following these steps, you should be able to achieve near-real-time transcription of user audio during a conversation using the Realtime API. For more detailed guidance, you can refer to the Realtime model capabilities documentation.
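To separate the user's transcript from the assistant's, you can route incoming server events by type. The sketch below assumes user transcripts arrive as `conversation.item.input_audio_transcription.completed` events with `item_id` and `transcript` fields, and that `response.audio_transcript.delta` events carry a `delta` string for the model's speech. The `createTranscriptLog` helper is hypothetical, and it is worth checking the event shapes against the API reference for the model version you use.

```javascript
// Hypothetical helper: collects user and assistant transcripts from server events.
function createTranscriptLog() {
  const log = { user: [], assistant: "" };

  // Accepts either a raw JSON string (as delivered on a data channel or
  // WebSocket message) or an already-parsed event object.
  function handle(raw) {
    const event = typeof raw === "string" ? JSON.parse(raw) : raw;
    switch (event.type) {
      case "conversation.item.input_audio_transcription.completed":
        // Transcript of one user audio item.
        log.user.push({ itemId: event.item_id, text: event.transcript });
        break;
      case "response.audio_transcript.delta":
        // Incremental transcript of the model's spoken reply.
        log.assistant += event.delta;
        break;
      default:
        // Ignore events that are not transcripts.
        break;
    }
    return log;
  }

  return { log, handle };
}
```

With WebRTC you would likely wire this up with something like `dc.addEventListener("message", (e) => handle(e.data))`, where `dc` is your data channel.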

Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:

#


tame trench
#

@sick mirage is there a way to get the raw audio of the user that the model is responding to?

sick mirageBOT
#

Let me check the knowledge base for you again. One moment! :books:

#

To obtain the raw audio of the user that the model is responding to, you can use the Realtime API's capabilities to handle audio inputs. Here’s how you can achieve this:

  1. Streaming Audio Input: You can stream audio input to the server using the input_audio_buffer.append client event. This involves sending chunks of Base64-encoded audio bytes to the Realtime API over a WebSocket. Each chunk should not exceed 15 MB in size.

    Here is an example of how you can append audio input bytes to the conversation:

    import fs from 'fs';
    import decodeAudio from 'audio-decode';
    
    // Converts Float32Array of audio data to PCM16 ArrayBuffer
    function floatTo16BitPCM(float32Array) {
      const buffer = new ArrayBuffer(float32Array.length * 2);
      const view = new DataView(buffer);
      let offset = 0;
      for (let i = 0; i < float32Array.length; i++, offset += 2) {
        let s = Math.max(-1, Math.min(1, float32Array[i]));
        view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
      }
      return buffer;
    }
    
    // Converts a Float32Array to base64-encoded PCM16 data
    function base64EncodeAudio(float32Array) {
      const arrayBuffer = floatTo16BitPCM(float32Array);
      let binary = '';
      let bytes = new Uint8Array(arrayBuffer);
      const chunkSize = 0x8000; // 32KB chunk size
      for (let i = 0; i < bytes.length; i += chunkSize) {
        let chunk = bytes.subarray(i, i + chunkSize);
        binary += String.fromCharCode.apply(null, chunk);
      }
      return btoa(binary);
    }
    
    // Fills the audio buffer with the contents of three files,
    // then asks the model to generate a response.
    const files = [
      './path/to/sample1.wav',
      './path/to/sample2.wav',
      './path/to/sample3.wav'
    ];
    
    // Assumes `ws` is an already-open WebSocket connected to the Realtime API.
    for (const filename of files) {
      const audioFile = fs.readFileSync(filename);
      const audioBuffer = await decodeAudio(audioFile);
      const channelData = audioBuffer.getChannelData(0);
      const base64Chunk = base64EncodeAudio(channelData);
      ws.send(JSON.stringify({
        type: 'input_audio_buffer.append',
        audio: base64Chunk
      }));
    }
    
    // Commit the buffered audio as a user message, then request a response.
    ws.send(JSON.stringify({type: 'input_audio_buffer.commit'}));
    ws.send(JSON.stringify({type: 'response.create'}));


[Source](<https://platform.openai.com/docs/guides/realtime-model-capabilities#streaming-audio-input-to-the-server>)
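Because your client is the one sending these chunks, one way to get at the user's raw audio is to keep a local copy of everything you append. The sketch below is illustrative only: it assumes the session uses the default pcm16 input format (24 kHz, mono, little-endian), and the names `base64ToPCM16` and `createInputRecorder` are hypothetical. If server-side voice activity detection is enabled, the millisecond offsets reported in `input_audio_buffer.speech_started` (`audio_start_ms`) and `input_audio_buffer.speech_stopped` (`audio_end_ms`) could plausibly be passed to `slice` to recover the segment the model responded to, though that mapping is an assumption to verify.

```javascript
// Assumption: default pcm16 input format, 24 kHz mono, little-endian.
const SAMPLE_RATE = 24000;

// Decodes a base64 PCM16 chunk (as sent in input_audio_buffer.append)
// back into 16-bit samples.
function base64ToPCM16(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const samples = new Int16Array(bytes.length >> 1);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true); // little-endian
  }
  return samples;
}

// Hypothetical recorder: keeps a local copy of every chunk you append.
function createInputRecorder() {
  const chunks = [];
  return {
    // Call this with the same base64 string you send to the server.
    record(base64Chunk) {
      chunks.push(base64ToPCM16(base64Chunk));
    },
    // Returns the samples between two millisecond offsets into the stream.
    slice(startMs, endMs) {
      const total = chunks.reduce((n, c) => n + c.length, 0);
      const all = new Int16Array(total);
      let offset = 0;
      for (const c of chunks) {
        all.set(c, offset);
        offset += c.length;
      }
      const start = Math.floor((startMs * SAMPLE_RATE) / 1000);
      const end = Math.ceil((endMs * SAMPLE_RATE) / 1000);
      return all.slice(Math.max(0, start), Math.min(total, end));
    },
  };
}
```

In the loop above, you would call `recorder.record(base64Chunk)` next to each `ws.send` so the local copy stays in step with what the server received.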

  2. Handling Audio with WebRTC: If you are using WebRTC, you can capture the user's audio input using the getUserMedia API and add it as a track to the peer connection. This allows you to access the raw audio stream directly from the user's microphone.

    const ms = await navigator.mediaDevices.getUserMedia({
      audio: true
    });
    pc.addTrack(ms.getTracks()[0]);

    Source

By using these methods, you can access the raw audio data of the user that the model is responding to. For more detailed information, you can refer to the Realtime model capabilities documentation.
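For the WebRTC case, one option is to record the same MediaStream locally while it is being sent to the model. The sketch below uses the browser's MediaRecorder API; the `recordStream` helper and its injectable `RecorderCtor` parameter are hypothetical and exist only so the function can be exercised outside a browser. Note that MediaRecorder produces encoded audio in a container format chosen by the browser, not raw PCM, so this is an approximation of "raw audio" rather than a sample-exact copy.

```javascript
// Hypothetical helper: records a MediaStream with MediaRecorder.
// RecorderCtor defaults to the browser's MediaRecorder and can be
// replaced with a stand-in when running outside a browser.
function recordStream(stream, RecorderCtor = globalThis.MediaRecorder) {
  const chunks = [];
  const recorder = new RecorderCtor(stream);

  // Collect each non-empty chunk of encoded audio.
  recorder.ondataavailable = (e) => {
    if (e.data && e.data.size > 0) chunks.push(e.data);
  };

  // Resolves with all collected chunks once recording has stopped.
  const stopped = new Promise((resolve) => {
    recorder.onstop = () => resolve(chunks);
  });

  recorder.start(1000); // request a chunk roughly every second

  return {
    stop() {
      recorder.stop();
      return stopped;
    },
  };
}
```

In the browser this might be used as `const rec = recordStream(ms)` right after `getUserMedia`, followed later by `const blobs = await rec.stop()`. If you need sample-level PCM instead, an AudioWorklet attached to the same stream is the usual alternative.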
