I trained a model myself on a dataset recorded in a professional recording studio (48 kHz, 24-bit).
The dataset was about 12 minutes long (4 songs), but the vocalist never sang above D3, and at inference the voice cracks at F#3.
However, I found that some models on AI Hub, like the Ariana Grande model, can hit B3 or higher without cracking, although they sound less like Ariana up there.
I'm sure the dataset for the Ariana Grande model didn't contain that many high notes.
What would cause this difference?
It doesn't matter if the high notes don't sound exactly like the voice I trained on, but I want my model to be able to hit those high notes sometimes.
I've also tried setting the index rate to zero, but no luck.
Any suggestions?
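
For reference, here is a rough sketch of the pitch gaps involved. It assumes scientific pitch notation (A4 = 440 Hz = MIDI note 69) and 12-tone equal temperament; the helper names are made up for illustration and are not part of any training tool.

```python
# Semitone offsets of each note name within an octave (C = 0).
NOTE_OFFSETS = {
    "C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11,
}

def note_to_midi(note: str) -> int:
    """Convert a note like 'F#3' to a MIDI number (assumes C4 = 60)."""
    name, octave = note[:-1], int(note[-1])
    return 12 * (octave + 1) + NOTE_OFFSETS[name]

def midi_to_hz(midi: int, a4: float = 440.0) -> float:
    """Equal-temperament frequency for a MIDI note number."""
    return a4 * 2 ** ((midi - 69) / 12)

def semitone_gap(low: str, high: str) -> int:
    """Number of semitones from `low` up to `high`."""
    return note_to_midi(high) - note_to_midi(low)

if __name__ == "__main__":
    ceiling = "D3"  # highest note in my dataset
    for target in ("F#3", "B3"):
        gap = semitone_gap(ceiling, target)
        hz = midi_to_hz(note_to_midi(target))
        print(f"{target}: {gap} semitones above {ceiling}, about {hz:.2f} Hz")
```

Under those assumptions, the cracking starts 4 semitones above the highest note in my dataset, while the other model holds up at least 9 semitones above that same D3.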
