#Hi. I'm trying to use the elevenlabs streaming endpoint to stream text to speech audio in typescrip
Sure, I can guide you through the process of using the ElevenLabs streaming endpoint to stream text-to-speech audio in TypeScript. Here's a step-by-step guide:
Step 1: Create accounts with Twilio and ngrok
You'll need accounts with both Twilio and ngrok for this guide; you can sign up on each service's website.
Step 2: Get the code
You can get the entire code for this guide on GitHub.
Step 3: Create the server with Express
You'll need to initialize your project, install dependencies, install dev dependencies, and create your files. Here's how you can do it:
Initialize your project:
mkdir elevenlabs-twilio
cd elevenlabs-twilio
npm init -y
Install dependencies:
npm install elevenlabs express express-ws twilio
Install dev dependencies:
npm install --save-dev @types/node @types/express @types/express-ws @types/ws dotenv tsx typescript
Create your files:
// src/app.ts
import 'dotenv/config';
import express, { Response } from 'express';
import ExpressWs from 'express-ws';
import VoiceResponse from 'twilio/lib/twiml/VoiceResponse';
import { ElevenLabsClient } from 'elevenlabs';
import { type WebSocket } from 'ws';
import { type Readable } from 'stream';
const app = ExpressWs(express()).app;
const PORT: number = parseInt(process.env.PORT || '5000');
const elevenlabs = new ElevenLabsClient();
const voiceId = '21m00Tcm4TlvDq8ikWAM';
const outputFormat = 'ulaw_8000';
const text = 'This is a test. You can now hang up. Thank you.';
function startApp() {
  // Twilio calls this webhook when a call comes in. The TwiML reply tells
  // Twilio to open a media stream to the WebSocket endpoint below.
  app.post('/call/incoming', (_, res: Response) => {
    const twiml = new VoiceResponse();
    twiml.connect().stream({
      url: `wss://${process.env.SERVER_DOMAIN}/call/connection`,
    });
    res.writeHead(200, { 'Content-Type': 'text/xml' });
    res.end(twiml.toString());
  });

  app.ws('/call/connection', (ws: WebSocket) => {
    ws.on('message', async (data: string) => {
      const message: {
        event: string;
        start?: { streamSid: string; callSid: string };
      } = JSON.parse(data);

      // The 'start' event carries the streamSid needed to address the call.
      if (message.event === 'start' && message.start) {
        const streamSid = message.start.streamSid;
        const response = await elevenlabs.textToSpeech.convert(voiceId, {
          model_id: 'eleven_turbo_v2_5',
          output_format: outputFormat,
          text,
        });

        // Collect the whole audio response, then send it as one media message.
        const audioArrayBuffer = await streamToArrayBuffer(response);
        ws.send(
          JSON.stringify({
            streamSid,
            event: 'media',
            media: {
              payload: Buffer.from(audioArrayBuffer as any).toString('base64'),
            },
          }),
        );
      }
    });

    ws.on('error', console.error);
  });

  app.listen(PORT, () => {
    console.log(`Local: http://localhost:${PORT}`);
    console.log(`Remote: https://${process.env.SERVER_DOMAIN}`);
  });
}
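function streamToArrayBuffer(readableStream: Readable): Promise<ArrayBuffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    readableStream.on('data', (chunk) => {
      chunks.push(chunk);
    });
    readableStream.on('end', () => {
      // A Buffer can be a view into a larger shared ArrayBuffer, so copy out
      // only the bytes that belong to the audio instead of returning .buffer.
      const audio = Buffer.concat(chunks);
      resolve(
        audio.buffer.slice(
          audio.byteOffset,
          audio.byteOffset + audio.byteLength,
        ) as ArrayBuffer,
      );
    });
    readableStream.on('error', reject);
  });
}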
startApp();
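The handler above waits for the complete audio response before sending a single media message. Since the question is about streaming, one alternative is to forward each chunk as soon as it arrives. The following is only a minimal sketch under the assumption that the SDK response can be consumed as an async iterable of byte chunks (Node Readable streams can be). The names toMediaMessage and forwardAudio are hypothetical helpers, not part of the Twilio or ElevenLabs SDKs.

```typescript
// Hypothetical helpers; the names are illustrative and not from either SDK.

// Wrap one audio chunk in the same "media" message shape used in app.ts.
function toMediaMessage(streamSid: string, chunk: Uint8Array): string {
  return JSON.stringify({
    streamSid,
    event: 'media',
    media: { payload: Buffer.from(chunk).toString('base64') },
  });
}

// Send each chunk as soon as it is read instead of waiting for the full stream.
// Returns the number of messages that were sent.
async function forwardAudio(
  chunks: AsyncIterable<Uint8Array>,
  streamSid: string,
  send: (message: string) => void,
): Promise<number> {
  let sent = 0;
  for await (const chunk of chunks) {
    send(toMediaMessage(streamSid, chunk));
    sent += 1;
  }
  return sent;
}
```

Inside the 'start' branch this could replace the streamToArrayBuffer call, for example: await forwardAudio(response, streamSid, (m) => ws.send(m)). Whether this lowers perceived latency depends on how quickly the API delivers the first chunk.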
Step 4: Update your environment variables
Update the .env file with your ngrok domain and ElevenLabs API key.
# .env
SERVER_DOMAIN=*******.ngrok.app
ELEVENLABS_API_KEY=*************************
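Because app.ts reads SERVER_DOMAIN directly and constructs the ElevenLabs client without arguments, a missing variable tends to surface late, as a malformed URL or an authentication failure. A small guard at startup can make that failure explicit. This is a sketch only; requireEnv is a hypothetical helper and not part of dotenv or the ElevenLabs SDK.

```typescript
// Hypothetical helper: fail fast when a required variable is missing or empty.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage in app.ts could look like this (after `import 'dotenv/config'`):
//   const serverDomain = requireEnv('SERVER_DOMAIN', process.env);
//   const apiKey = requireEnv('ELEVENLABS_API_KEY', process.env);
```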
Step 5: Start the application
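Run the following command to start the app. It assumes package.json defines a dev script, for example "dev": "tsx watch src/app.ts"; the script itself is not shown in this guide: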
npm run dev
Step 6: Set up Twilio
Follow Twilio’s guides to create a new number. Once you’ve created your number, navigate to the “Configure” tab in Phone Numbers -> Manage -> Active numbers. In the “A call comes in” section, enter the full URL to your application (make sure to add the /call/incoming path).
Step 7: Make a phone call
Make a call to your
I've already made one like this:
const voiceID = process.env.REACT_APP_ELEVENLABS_VOICE_ID;
const apiKey = process.env.REACT_APP_ELEVENLABS_API_KEY;
if (!voiceID || !apiKey) {
  console.error('Voice ID or API key is missing.');
  return;
}
const options = {
  method: 'POST',
  headers: {
    'xi-api-key': apiKey,
    'Content-Type': 'application/json',
  },
  // fetch expects the body as a string, so serialize the payload.
  body: JSON.stringify({
    model_id: 'eleven_multilingual_v2',
    text: result,
    voice_settings: { stability: 0.95, similarity_boost: 1, style: 0.06 },
  }),
};
const response = await fetch(
  `https://api.elevenlabs.io/v1/text-to-speech/${voiceID}/stream`,
  options,
);
But the audio would stutter and peak. What's the problem here?
The stuttering issue you're experiencing might be related to the specific model you're using. According to a discussion on the ElevenLabs GitHub page, users have reported stuttering issues when using the "eleven_multilingual_v2" model with the streaming method.
One user, JonathanChavezTamales, pointed out that the stuttering might occur because the text is tokenized and each word triggers its own API call, so the audio data does not arrive as fast as the clips are played. He suggested adding a buffer to the playback as a workaround, but this could introduce a significant delay at the beginning.
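The playback buffer mentioned above can be sketched as a small pre-buffer that holds chunks back until a minimum number of bytes has arrived. This is an illustration of the general idea, not the code from that discussion; the Prebuffer class and the minBytes threshold are assumptions.

```typescript
// Hypothetical pre-buffer: hold chunks back until enough bytes have arrived,
// then release everything queued so far and pass later chunks straight through.
class Prebuffer {
  private queued: Uint8Array[] = [];
  private queuedBytes = 0;
  private started = false;
  private readonly minBytes: number;

  constructor(minBytes: number) {
    this.minBytes = minBytes;
  }

  // Returns the chunks that are ready to be played after this push.
  push(chunk: Uint8Array): Uint8Array[] {
    if (this.started) return [chunk];
    this.queued.push(chunk);
    this.queuedBytes += chunk.byteLength;
    if (this.queuedBytes < this.minBytes) return [];
    this.started = true;
    return this.flush();
  }

  // Call when the stream ends so a short clip is not held back forever.
  flush(): Uint8Array[] {
    const ready = this.queued;
    this.queued = [];
    this.queuedBytes = 0;
    return ready;
  }
}
```

A larger minBytes gives the network more headroom but delays the first sound, which is the trade-off noted above.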
Another user, RicardoEscobar, suggested mitigating stuttering by using the stream() function and not trying to play back or save the chunks individually. He also mentioned a separate issue: the audio volume degrading over long-lasting audio stream responses from the API.
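One way to read the advice about not handling chunks individually: network chunk boundaries are arbitrary and need not line up with audio frame boundaries, so decoding each chunk on its own can produce gaps. A minimal sketch of the opposite approach joins the chunks into one contiguous byte array before handing them to a player. concatChunks is a hypothetical helper name.

```typescript
// Join all received chunks into one contiguous byte array, in arrival order.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, c) => sum + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

Joining everything first gives up the low latency of streaming, so it is mainly useful for confirming whether per-chunk playback is the cause of the stutter.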
Finally, a user named abiel-lozano suggested switching to the eleven_multilingual_v1 model as a workaround to generate a stable audio stream in other languages.
Please note that these are user-generated solutions and might not work in all cases. It's recommended to check with the official ElevenLabs documentation or support for the most accurate information.
The problem with eleven_multilingual_v1 is that it doesn't pronounce the input properly when used for streaming, and I want the model to speak Indonesian.
The AI language is specified by the text you input, and the accent is specified by the voice. If you're writing in a language that might share a lot of common words or is fairly closely related to another language, the AI might have a hard time understanding how to pronounce certain words or switch between accents.
If you are using a voice that is not native to the language - for example, one of the pre-made or generated voices since they are English - the AI might have a slight English accent when speaking other languages.
To get the optimal results, it's recommended to clone a voice that speaks the original language with the correct accent. This is especially important when dealing with languages that are very similar and share a lot of common words. This ensures that the AI has the most information to understand which pronunciation and language it should choose.
The Turbo v2.5 model has an experimental feature, available via API only, which allows you to specify the language code for your generation. As this is an experimental feature, results may vary. This feature is not available for other models, and providing a language code in API calls using other models will return an error.
Please see the API Reference for full details.
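For the Indonesian case, the language code option described above could be tried with the same streaming endpoint used earlier in this thread. The sketch below only builds the request. The field name language_code and the value 'id' (the ISO 639-1 code for Indonesian) are assumptions to verify against the API Reference, and buildTtsRequest is a hypothetical helper.

```typescript
// Shape of the arguments that would be passed to fetch(url, init).
interface TtsRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: build a streaming request that includes a language hint.
// Assumption: the experimental option is sent as `language_code` in the body.
function buildTtsRequest(
  voiceId: string,
  apiKey: string,
  text: string,
  languageCode: string,
): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
    init: {
      method: 'POST',
      headers: { 'xi-api-key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model_id: 'eleven_turbo_v2_5', // the only model said above to accept a language code
        text,
        language_code: languageCode,
      }),
    },
  };
}

// Usage (not executed here):
//   const { url, init } = buildTtsRequest(voiceID, apiKey, result, 'id');
//   const response = await fetch(url, init);
```

Since the feature is described as experimental, results may vary, and sending the field with any other model is expected to return an error.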