Does the python sdk support Text To Speech With Timestamps | ElevenLabs

blazing wrenBOT Sep 5, 2024, 10:54 AM


AI-generated Answer for tractorjuice

Yes, the Python SDK supports Text To Speech with Timestamps. You can use the stream_with_timestamps method to convert text into audio along with timestamps indicating when each character was spoken. Here is an example of how to use it:

from elevenlabs.client import ElevenLabs

client = ElevenLabs(
    api_key="YOUR_API_KEY",
)

client.text_to_speech.stream_with_timestamps(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    text="Born and raised in the charming south, I can add a touch of sweet southern hospitality to your audiobooks and podcasts",
)

Make sure to replace "YOUR_API_KEY" with your actual API key. This will generate audio with timestamps for each character spoken.

dawn geode Sep 5, 2024, 10:55 AM


Can I use this with client.generate from the python sdk

blazing wrenBOT Sep 5, 2024, 10:56 AM


AI-generated Answer for tractorjuice

Yes, you can use the Python SDK to generate text-to-speech with timestamps, but through client.text_to_speech.convert_with_timestamps rather than client.generate. Here is an example of how to do it:

from elevenlabs.client import ElevenLabs

client = ElevenLabs(
    api_key="YOUR_API_KEY",
)

response = client.text_to_speech.convert_with_timestamps(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    text="Born and raised in the charming south, I can add a touch of sweet southern hospitality to your audiobooks and podcasts",
    model_id="eleven_multilingual_v2",
    voice_settings={
        "stability": 0.5,
        "similarity_boost": 0.75
    }
)

# The response will contain the audio and the alignment information
audio_base64 = response["audio_base64"]
alignment = response["alignment"]

# Decode the audio and save it to a file
import base64
audio_bytes = base64.b64decode(audio_base64)
with open('output.mp3', 'wb') as f:
    f.write(audio_bytes)

# Print the alignment information
print(alignment)

This script will generate the audio along with timestamps for each character in the text. The alignment dictionary will contain the characters and their start and end times in seconds.
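
If word-level timings are needed, the character-level alignment can be grouped on whitespace. A minimal sketch, assuming the alignment dictionary carries the keys characters, character_start_times_seconds, and character_end_times_seconds; the helper name words_from_alignment is made up for this illustration and is not part of the SDK:

```python
def words_from_alignment(alignment):
    """Group character-level timestamps into (word, start, end) tuples."""
    # Assumed key names; check them against the alignment you actually receive.
    chars = alignment["characters"]
    starts = alignment["character_start_times_seconds"]
    ends = alignment["character_end_times_seconds"]

    words = []
    current, word_start, word_end = "", None, None
    for ch, start, end in zip(chars, starts, ends):
        if ch.isspace():
            # Whitespace closes the word collected so far, if any.
            if current:
                words.append((current, word_start, word_end))
            current, word_start, word_end = "", None, None
        else:
            if not current:
                word_start = start  # first character of a new word
            current += ch
            word_end = end  # keep extending to the latest character
    if current:
        words.append((current, word_start, word_end))
    return words


# Hand-written sample data, not real API output.
sample = {
    "characters": list("Hi you"),
    "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
}
print(words_from_alignment(sample))
```

Punctuation stays attached to the word it touches, which is usually what subtitle-style highlighting wants; strip it separately if not.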

dawn geode Sep 5, 2024, 10:57 AM


I am currently using client.generate, what do i need to change in my code

blazing wrenBOT Sep 5, 2024, 10:57 AM


AI-generated Answer for tractorjuice

To use Text to Speech with Timestamps in the Python SDK, you need to switch from client.generate to client.text_to_speech.convert_with_timestamps. Here is an example of how you can modify your code:

Original Code

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")
response = client.generate(
    voice="21m00Tcm4TlvDq8ikWAM",  # client.generate takes voice, not voice_id
    text="Your text here"
)

Updated Code

import base64

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")
response = client.text_to_speech.convert_with_timestamps(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    text="Your text here"
)

# Process the response to get audio and timestamps
audio_bytes = base64.b64decode(response["audio_base64"])
with open('output.mp3', 'wb') as f:
    f.write(audio_bytes)

print(response['alignment'])

Key Changes:

Method Change: Replace client.generate with client.text_to_speech.convert_with_timestamps.
Response Handling: Decode the audio_base64 from the response and save it as an audio file. Print the alignment to get the timestamps.

Make sure to replace "YOUR_API_KEY" and "21m00Tcm4TlvDq8ikWAM" with your actual API key and voice ID.
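
The response handling above can be wrapped in a small helper so the audio and the timestamps are both kept on disk. This is only a sketch: it assumes the response is a dictionary with audio_base64 and alignment keys, as in the examples above, and the function name save_tts_response is hypothetical rather than part of the SDK:

```python
import base64
import json
from pathlib import Path


def save_tts_response(response, audio_path, alignment_path):
    """Write the decoded audio and the alignment JSON to disk.

    Assumes a dict-shaped response with "audio_base64" and "alignment" keys.
    Returns the number of audio bytes written.
    """
    audio_bytes = base64.b64decode(response["audio_base64"])
    Path(audio_path).write_bytes(audio_bytes)
    Path(alignment_path).write_text(json.dumps(response["alignment"], indent=2))
    return len(audio_bytes)
```

With the updated code, a call would look like save_tts_response(response, "output.mp3", "output_alignment.json").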

dawn geode Sep 5, 2024, 11:09 AM


Does this automatically provide the previous and next text?

blazing wrenBOT Sep 5, 2024, 11:09 AM


AI-generated Answer for tractorjuice

To get timestamps for Text-to-Speech without going through the SDK, you can call the with-timestamps HTTP endpoint directly. Here's a basic example:

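import requests

VOICE_ID = "21m00Tcm4TlvDq8ikWAM"

# The voice ID is part of the path; the API key goes in the xi-api-key header.
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/with-timestamps"
headers = {
    "Content-Type": "application/json",
    "xi-api-key": "YOUR_API_KEY",
}
data = {
    "text": "Your text here",
    "model_id": "eleven_multilingual_v2",
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # stop early on an HTTP error
timestamps = response.json()
print(timestamps["alignment"])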

This will provide you with the timestamps for each character in the text.

Regarding the context of previous and next text, the model assesses the wider situation surrounding each utterance to generate appropriate emotions. However, you need to provide that context yourself, for example through the previous_text and next_text request parameters. The API does not fill in previous and next text automatically.

dawn geode Sep 5, 2024, 11:10 AM


Did the original client.generate provide the previous and next texts?

blazing wrenBOT Sep 5, 2024, 11:10 AM


AI-generated Answer for tractorjuice

The original client.generate method does not automatically provide the previous and next texts. You need to manually include the previous_text, next_text, and previous_request_ids parameters in your request to achieve conditioning on both text and past generations.

Here's an example of how you can modify your code to include these parameters:

import os
import requests
from pydub import AudioSegment
import io

YOUR_XI_API_KEY = "<insert your xi-api-key here>"
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # Rachel

PARAGRAPHS = [
    "The advent of technology has transformed countless sectors, with education standing out as one of the most significantly impacted fields.",
    "In recent years, educational technology, or EdTech, has revolutionized the way teachers deliver instruction and students absorb information.",
    "From interactive whiteboards to individual tablets loaded with educational software, technology has opened up new avenues for learning that were previously unimaginable.",
    "One of the primary benefits of technology in education is the accessibility it provides.",
]

segments = []
previous_request_ids = []

for i, paragraph in enumerate(PARAGRAPHS):
    is_first_paragraph = i == 0
    is_last_paragraph = i == len(PARAGRAPHS) - 1
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
        json={
            "text": paragraph,
            "model_id": "eleven_multilingual_v2",
            "previous_request_ids": previous_request_ids[-3:],
            "previous_text": None if is_first_paragraph else " ".join(PARAGRAPHS[:i]),
            "next_text": None if is_last_paragraph else " ".join(PARAGRAPHS[i + 1:])
        },
        headers={"xi-api-key": YOUR_XI_API_KEY},
    )

    if response.status_code != 200:
        print(f"Error encountered, status: {response.status_code}, content: {response.text}")
        quit()

    print(f"Successfully converted paragraph {i + 1}/{len(PARAGRAPHS)}")
    previous_request_ids.append(response.headers["request-id"])
    segments.append(AudioSegment.from_mp3(io.BytesIO(response.content)))

segment = segments[0]
for new_segment in segments[1:]:
    segment = segment + new_segment

audio_out_path = os.path.join(os.getcwd(), "with_full_conditioning.wav")
segment.export(audio_out_path, format="wav")
print(f"Success! Wrote audio to {audio_out_path}")

This code snippet demonstrates how to condition on both text and past generations by including the necessary parameters in your request.
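
The previous_text and next_text logic in the loop above can be pulled out into a small function, which makes it easy to check what context each paragraph will be sent with before spending any characters. A minimal sketch; build_context is a hypothetical helper name, not part of the SDK or API:

```python
def build_context(paragraphs, index):
    """Return (previous_text, next_text) for paragraphs[index].

    Each value is None when there is nothing before or after the paragraph,
    matching the None handling in the request loop above.
    """
    previous_text = " ".join(paragraphs[:index]) or None
    next_text = " ".join(paragraphs[index + 1:]) or None
    return previous_text, next_text


paragraphs = ["First.", "Second.", "Third."]
for i in range(len(paragraphs)):
    print(i, build_context(paragraphs, i))
```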

dawn geode Sep 5, 2024, 11:27 AM


Does providing past generations increase costs of the api usage?

blazing wrenBOT Sep 5, 2024, 11:27 AM


AI-generated Answer for tractorjuice

The API is charged based on the number of characters generated. Providing past generations does not inherently increase costs unless it results in generating more characters.

dawn geode Sep 5, 2024, 11:49 AM


Can you provide me with an example using 'with timestamps'
