Hello,
I am looking into making my own voice model, but I am unsure what constitutes a good dataset for it.
How long should the dataset be?
Should the dataset be ONLY the voice of the person (no background music/noise)?
Is it okay for the dataset to include coughs, laughs and screams, or is it even better to include them?
Would a dataset in a foreign language (for example, German or Polish) work with the pretrains, and could the resulting model then be used for English with w-okada?
Thanks for the answers!