# hey man. 1 hour and 15 minutes is way too much. normally 5 mins are enough for an instant voice cl
While it's great to hear that shorter samples have worked well for you with Instant Voice Cloning, the information provided here is specifically about Professional Voice Cloning (PVC). Unlike Instant Voice Cloning, PVC trains a dedicated model on a large set of voice data, which produces a hyper-realistic clone that is close to indistinguishable from the original voice.
For PVC, the recommended bare minimum is 30 minutes of audio, but for the most accurate clone it's best to provide closer to 3 hours. This is because the AI creates a near-perfect clone of what it hears: it picks up all the intricacies and characteristics of the voice, and also any artifacts and unwanted audio present in the samples.
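If you want a quick way to see where your own recordings land against those numbers, here is a minimal sketch. It assumes your samples are WAV files in a single folder and uses only Python's standard library; the function names and the wording of the labels are made up for illustration and are not part of any official tool.

```python
import wave
from pathlib import Path

# Thresholds taken from the guidance above (in minutes).
MIN_MINUTES = 30        # recommended bare minimum for PVC
OPTIMAL_MINUTES = 180   # "closer to 3 hours"


def wav_minutes(path):
    """Return the duration of one WAV file in minutes."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate() / 60


def assess_total(total_minutes):
    """Label a total duration against the two thresholds (labels are illustrative)."""
    if total_minutes < MIN_MINUTES:
        return "below minimum"
    if total_minutes < OPTIMAL_MINUTES:
        return "usable, more audio recommended"
    return "in the recommended range"


def assess_folder(folder):
    """Sum the duration of every .wav file in a folder and label the total."""
    total = sum(wav_minutes(p) for p in sorted(Path(folder).glob("*.wav")))
    return total, assess_total(total)
```

With this sketch, `assess_total(75)` (the 1 hour and 15 minutes mentioned in the question) comes back as "usable, more audio recommended": above the 30-minute minimum, but short of the roughly 3 hours suggested for the best result. It only measures length, so it says nothing about noise or filler sounds in the recordings.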
You're absolutely correct about the importance of clean, natural speech with no background noises, and the removal of filler sounds like "uh" and "um". The AI will try to replicate all of these elements in the clone as well.
Also, the speaking style in the samples you provide will be replicated in the output. So, if you're looking for a natural speaking style, the training data should correspond to that style.
Lastly, it's best to use samples where you are speaking the language that the PVC will mainly be used for. If the voice itself is not native to the language you want the AI to speak, it might have an accent from the original language and might mispronounce words and inflections.
I hope this clarifies the difference between Instant Voice Cloning and Professional Voice Cloning!
hope that helps