What is the correct implementation to get accurate word-by-word timestamps on Node.js? | ElevenLabs

AI-generated Answer for fernandops26

To get accurate word-by-word timestamps in Node.js with the ElevenLabs text-to-speech API, you can use the alignment data returned alongside the audio. The alignment describes when each character of the text is spoken, and word-level timestamps can be derived from it, which is useful for aligning audio with text in applications like video synchronization.

Here's a general approach to implement this:

1. Set Up Your Environment: Ensure you have Node.js and the required packages (ws and dotenv) installed. Keep your API key in an environment variable and load it with the dotenv package.
2. Initiate the WebSocket Connection: Connect to the ElevenLabs text-to-speech API over WebSocket. You'll need to select a voice from the voice library and a text-to-speech model.
3. Send the Input Text: Once the connection is open, configure the voice settings, send the text message to the API, and then signal that the input is complete.
4. Receive and Process the Response: Listen for response messages from the API. Each message may include base64-encoded audio data and alignment information.
5. Use the Alignment Data: Extract the alignment data from each response and group its characters into words to get a start and end time for each word.
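
The last step can be sketched as a small helper that groups character timings into words. This is a minimal sketch, not an official implementation: it assumes each alignment object has the shape { chars, charStartTimesMs, charDurationsMs } with one entry per character, and the function name alignmentToWords is hypothetical. Check the field names against the responses you actually receive.

```javascript
// Group character-level alignment into word-level timestamps.
// Assumed input shape (verify against real responses):
//   { chars: string[], charStartTimesMs: number[], charDurationsMs: number[] }
function alignmentToWords(alignment) {
  const { chars, charStartTimesMs, charDurationsMs } = alignment;
  const words = [];
  let current = null; // word being built, or null between words

  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    const startMs = charStartTimesMs[i];
    const endMs = startMs + charDurationsMs[i];

    if (/\s/.test(ch)) {
      // Whitespace closes the current word, if there is one
      if (current) {
        words.push(current);
        current = null;
      }
      continue;
    }

    if (!current) {
      current = { word: "", startMs, endMs };
    }
    current.word += ch;
    current.endMs = endMs; // the word ends when its last character ends
  }

  if (current) words.push(current); // flush a trailing word
  return words;
}
```

Punctuation stays attached to the word it touches (for example "world!"), which is usually what you want for captions. Strip it afterwards if you need bare words.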

Here's a simplified example of how you might set up the WebSocket connection and send text:

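import * as fs from "node:fs";
import * as dotenv from "dotenv";
import WebSocket from "ws"; // the client class is the default export

dotenv.config();

const ELEVENLABS_API_KEY = process.env.ELEVENLABS_API_KEY;
const voice_id = 'your_voice_id';
const model_id = 'your_model_id';

const uri = `wss://api.elevenlabs.io/v1/text-to-speech/${voice_id}/stream-input?model_id=${model_id}`;

const ws = new WebSocket(uri);
const audioChunks = []; // decoded audio buffers, in arrival order

ws.on('open', function open() {
  const message = {
    text: "Your text here ", // end the text with a space so the last word is complete
    voice_settings: { stability: 0.5, similarity_boost: 0.8, use_speaker_boost: false },
    generation_config: { chunk_length_schedule: [120, 160, 250, 290] },
    xi_api_key: ELEVENLABS_API_KEY,
    alignment: true // Request alignment data (flag as given in this answer; confirm the option name in the API reference)
  };
  ws.send(JSON.stringify(message));
  // An empty string signals end of input so buffered text is generated and the stream can finish
  ws.send(JSON.stringify({ text: "" }));
});

ws.on('message', function incoming(data) {
  // ws delivers a Buffer; convert it to a string before parsing
  const response = JSON.parse(data.toString());
  if (response.alignment) {
    // Timing data for the text in this chunk; group it into words for word-level timestamps
    console.log("Alignment:", response.alignment);
  }
  if (response.audio) {
    // Audio arrives base64-encoded
    audioChunks.push(Buffer.from(response.audio, 'base64'));
  }
});

ws.on('close', function close() {
  // Assumes the default MP3 output format; adjust the extension if you request another
  fs.writeFileSync('output.mp3', Buffer.concat(audioChunks));
});

ws.on('error', console.error);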

Make sure to replace 'your_voice_id' and 'your_model_id' with the appropriate IDs from your setup. This example logs the alignment data returned with the audio, from which word-level timestamps can be derived.
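
Longer texts arrive as several messages, so the alignment from each one has to be stitched into a single timeline. The sketch below does this under two assumptions that you should verify against real responses: the alignment object has the shape { chars, charStartTimesMs, charDurationsMs }, and the start times in each message are relative to that message's audio chunk rather than to the whole audio. The name createAlignmentAccumulator is hypothetical.

```javascript
// Merge per-chunk alignment objects into one timeline.
// Assumption: start times are relative to each chunk, so every chunk is
// shifted by the total duration of the chunks that came before it.
function createAlignmentAccumulator() {
  const merged = { chars: [], charStartTimesMs: [], charDurationsMs: [] };
  let offsetMs = 0; // total duration of all chunks added so far

  return {
    add(alignment) {
      if (!alignment || !alignment.chars || alignment.chars.length === 0) return;
      const { chars, charStartTimesMs, charDurationsMs } = alignment;

      for (let i = 0; i < chars.length; i++) {
        merged.chars.push(chars[i]);
        merged.charStartTimesMs.push(offsetMs + charStartTimesMs[i]);
        merged.charDurationsMs.push(charDurationsMs[i]);
      }

      // The chunk ends when its last character ends
      const last = chars.length - 1;
      offsetMs += charStartTimesMs[last] + charDurationsMs[last];
    },
    result() {
      return merged;
    },
  };
}
```

In the message handler you would call add(response.alignment) for every message and read result() once the socket closes. If your responses turn out to carry times that are already absolute, drop the offset and concatenate the arrays directly.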
