#What is the optimal 'amount and variation of voice' to use as input for the model to clone your voi
For Professional Voice Cloning, at least 30 minutes of audio is required, and results are best with closer to 3 hours. Here are some key points to consider:
- Consistency in Style: It's best to use samples that are consistent in style. If you want the clone to be used for online teaching modules, the training data should reflect the speaking style you intend to use in those modules. Mixing different styles in the training data can lead to a less accurate clone.
- Quality of Audio: Ensure the audio is high-quality, free of background noise, and features only one speaker. The AI will replicate everything it hears, including any artifacts or unwanted sounds.
- Variation in Speech: While some variation in speech styles can be beneficial, it should be within the same general context. For example, different tones of teaching (e.g., explaining concepts, asking questions, summarizing) can be included, but avoid mixing completely different styles like casual conversation or dramatic readings.
- Segmenting Audio: If you plan to upload multiple hours of audio, it's better to split it into multiple ~30-minute samples. This makes it easier to upload and manage.
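If your recording is one long WAV file, the segmenting step can be done locally before uploading. The following is a minimal sketch using only Python's standard `wave` module. The function names `segment_bounds` and `split_wav` and the output file naming are hypothetical choices for illustration, not part of any voice-cloning tool, and the 30-minute default simply mirrors the guidance above.

```python
import wave
from pathlib import Path


def segment_bounds(total_frames: int, frame_rate: int, segment_seconds: int = 30 * 60):
    """Return (start_frame, end_frame) pairs that cover the recording in order."""
    if total_frames < 0 or frame_rate <= 0 or segment_seconds <= 0:
        raise ValueError("frame counts, frame rate and segment length must be positive")
    frames_per_segment = frame_rate * segment_seconds
    return [
        (start, min(start + frames_per_segment, total_frames))
        for start in range(0, total_frames, frames_per_segment)
    ]


def split_wav(source: str, out_dir: str, segment_seconds: int = 30 * 60) -> list[Path]:
    """Split an uncompressed WAV file into consecutive segments (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(source, "rb") as src:
        params = src.getparams()
        bounds = segment_bounds(src.getnframes(), src.getframerate(), segment_seconds)
        for index, (start, end) in enumerate(bounds, start=1):
            src.setpos(start)
            # Reads one whole segment into memory; fine for a sketch, large for long stereo files.
            frames = src.readframes(end - start)
            target = out / f"{Path(source).stem}_part{index:02d}.wav"
            with wave.open(str(target), "wb") as dst:
                dst.setparams(params)  # same channels, sample width and rate as the source
                dst.writeframes(frames)  # header frame count is corrected on close
            written.append(target)
    return written
```

Calling `split_wav("lecture.wav", "segments")` would write `lecture_part01.wav`, `lecture_part02.wav`, and so on. The cuts fall at fixed time offsets, so a segment may start mid-sentence, and the final segment can be much shorter than the others; check the boundaries by ear before uploading.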
In summary, for the best results, provide a substantial amount of high-quality, consistent audio that reflects the intended use case. If you need different styles, consider creating separate clones for each style to maintain clarity and accuracy.
Would it be okay to add some more 'enthusiastic' speech to the mix, alongside slightly calmer speech?