#what is the correct implementation to get accurate word-by-word timestamps on Node.js?
To get accurate word-by-word timestamps using Node.js with the ElevenLabs text-to-speech API, you can use the alignment data returned alongside the audio. It carries timing information from which word-level timestamps can be derived, which is useful for aligning audio with text in applications like video synchronization.
Here's a general approach to implement this:
- Set Up Your Environment: Ensure you have Node.js and the necessary packages installed. You can manage your environment variables using the dotenv package.
- Initiate the WebSocket Connection: Connect to the ElevenLabs text-to-speech API using WebSocket. You'll need to select a voice from the voice library and a text-to-speech model.
- Send the Input Text: Once the connection is open, configure the voice settings and send the text message to the API.
- Receive and Process the Response: Listen for the response from the API, which will include the audio data and alignment information for word-level timestamps.
- Use the Alignment Data: Extract the alignment data from the response to get the timestamps for each word.
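The last step can be sketched as a small helper. This is only a sketch: it assumes the alignment object is character-level, with parallel arrays named `chars`, `charStartTimesMs` and `charDurationsMs`, so check those names against the messages you actually receive. The function name `alignmentToWords` and the sample values are made up for illustration.

```javascript
// Group character-level timing into word-level timestamps.
// Assumed input shape (verify against real responses):
//   { chars: string[], charStartTimesMs: number[], charDurationsMs: number[] }
function alignmentToWords(alignment) {
  const { chars, charStartTimesMs, charDurationsMs } = alignment;
  const words = [];
  let current = null; // word being built, or null between words

  for (let i = 0; i < chars.length; i++) {
    const start = charStartTimesMs[i];
    const end = start + charDurationsMs[i];

    if (/\s/.test(chars[i])) {
      // Whitespace closes the current word, if there is one.
      if (current) {
        words.push(current);
        current = null;
      }
      continue;
    }

    if (!current) {
      current = { word: "", startMs: start, endMs: end };
    }
    current.word += chars[i];
    current.endMs = end;
  }

  if (current) words.push(current); // flush the last word
  return words;
}

// Made-up sample data, not real API output.
const sample = {
  chars: ["H", "i", " ", "y", "o", "u"],
  charStartTimesMs: [0, 80, 150, 200, 260, 330],
  charDurationsMs: [80, 70, 50, 60, 70, 90],
};
console.log(alignmentToWords(sample));
```

Punctuation stays attached to the word it touches, which is usually what you want for subtitles or highlighting. If you need it stripped, filter each `word` afterwards.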
Here's a simplified example of how you might set up the WebSocket connection and send text:
import * as dotenv from "dotenv";
import WebSocket from "ws"; // the default export is the client class

dotenv.config();

const ELEVENLABS_API_KEY = process.env.ELEVENLABS_API_KEY;
const voice_id = 'your_voice_id';
const model_id = 'your_model_id';
const uri = `wss://api.elevenlabs.io/v1/text-to-speech/${voice_id}/stream-input?model_id=${model_id}`;

const ws = new WebSocket(uri);

ws.on('open', function open() {
  const message = {
    text: "Your text here ",
    voice_settings: { stability: 0.5, similarity_boost: 0.8, use_speaker_boost: false },
    generation_config: { chunk_length_schedule: [120, 160, 250, 290] },
    xi_api_key: ELEVENLABS_API_KEY,
    alignment: true // Request alignment data (confirm this flag against the API reference)
  };
  ws.send(JSON.stringify(message));
  // An empty text message marks the end of input so the buffered text gets generated.
  ws.send(JSON.stringify({ text: "" }));
});

ws.on('message', function incoming(data) {
  // ws delivers a Buffer; convert it to a string before parsing.
  const response = JSON.parse(data.toString());
  if (response.alignment) {
    console.log("Alignment data:", response.alignment);
  }
  if (response.audio) {
    // Process audio data (a base64-encoded chunk)
  }
});

ws.on('error', (err) => console.error("WebSocket error:", err));
Make sure to replace 'your_voice_id' and 'your_model_id' with the appropriate IDs from your setup. This example demonstrates how to request alignment data for word-level timestamps.
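For the "Process audio data" part, here is a minimal sketch of one way to collect the chunks. It assumes each message carries its audio as a base64 string in `audio` and that messages without audio should simply be skipped. `createAudioCollector` is a hypothetical helper name, and the messages below are stand-ins rather than real API output.

```javascript
// Collect base64-encoded audio chunks and join them into one Buffer.
function createAudioCollector() {
  const parts = [];
  return {
    // Call with each parsed message; messages without audio are ignored.
    add(response) {
      if (typeof response.audio === "string" && response.audio.length > 0) {
        parts.push(Buffer.from(response.audio, "base64"));
      }
    },
    // Returns everything received so far as a single Buffer.
    toBuffer() {
      return Buffer.concat(parts);
    },
  };
}

// Usage with stand-in messages (not real API output):
const collector = createAudioCollector();
collector.add({ audio: Buffer.from("abc").toString("base64") });
collector.add({ audio: null });
collector.add({ audio: Buffer.from("def").toString("base64") });
console.log(collector.toBuffer().toString()); // prints "abcdef"
```

In the real handler you would call `collector.add(response)` for every message and write `collector.toBuffer()` to disk once the socket closes. The right file extension depends on the output format you requested.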
What is the difference between the Text To Speech With Timestamps response, the normalizedAlignment of the WebSocket, and the alignment of the WebSocket?
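A hedged sketch of how the shapes might relate, to be verified against the API reference. As far as I understand it, the Text To Speech With Timestamps (HTTP) response reports per-character start and end times in seconds under snake_case names, while the WebSocket messages report per-character start times and durations in milliseconds under camelCase names. In both cases the plain alignment refers to the characters of the text you sent, and the normalized variant refers to the text after normalization (for example, numbers spelled out). Every field name below is an assumption, and `httpAlignmentToWsShape` is a hypothetical helper.

```javascript
// Convert an HTTP-style alignment (seconds, start/end) into the
// WebSocket-style shape (milliseconds, start/duration).
// All field names here are assumptions; verify them against real responses.
function httpAlignmentToWsShape(a) {
  const toMs = (seconds) => Math.round(seconds * 1000);
  return {
    chars: a.characters,
    charStartTimesMs: a.character_start_times_seconds.map(toMs),
    charDurationsMs: a.character_start_times_seconds.map(
      (start, i) => toMs(a.character_end_times_seconds[i] - start)
    ),
  };
}

// Made-up sample data, not real API output.
const httpSample = {
  characters: ["H", "i"],
  character_start_times_seconds: [0, 0.08],
  character_end_times_seconds: [0.08, 0.15],
};
console.log(httpAlignmentToWsShape(httpSample));
```

Converting both sources to one shape lets the same word-grouping logic run on either. For highlighting the text exactly as the user typed it, the non-normalized alignment is probably the one to use.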