#How to create model that keeps the unique "nuances" of the original voice?

empty thistle
#

Hello,
I am trying to create a model of a character from the game Mordhau. I want the AI model to keep the intonation and accent that make the character's voice unique.

I have attached three audio files to demonstrate:
1: A sample from the original audio that I trained on, which totaled about 5 min.
2: A sample from using so-vits-fork to generate a line using auto pitch Dio.
3: A sample from using RVC with a 700-epoch Dio V2 model. I've also trained a 250-epoch Harvest V2 model. They sound very similar and share the same issue of sounding fairly lifeless.

I recently switched to RVC instead of so-vits-fork, hoping that it would give better results. But I was disappointed to find it sounding lifeless.

I've tried most settings under "Model inference" such as the different pitch extraction algorithms, except the feature index stuff which I don't understand. I just chose the corresponding one in the dropdown.

Does anyone have any tips? I must be missing something in RVC?

empty sparrow
#

1 - The sample audio you used to train your model is too short.

#

2 - You could have obtained better results by using OV2 pretrain with RMVPE algorithm

#

3 - Dio is pretty outdated, and you probably didn't train the .index file.

haughty seal
#

Index is exactly what gives the model an accent.

Also, you could use better feature extraction methods for training.
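
For context on what the index does at inference time, here is a minimal sketch in Python with NumPy. It assumes the commonly described behaviour: each content feature frame from the input audio is matched to its nearest neighbours among the features saved from the training set, and the result is blended with the original frame according to an index rate. The function name, the brute-force search (RVC itself uses a faiss index), and the toy shapes are illustrative assumptions, not RVC's actual code.

```python
import numpy as np

def blend_with_index(feats, train_feats, index_rate=0.75, k=8):
    """Pull each input frame toward its k nearest training frames.

    feats:       (T, D) content features of the input audio
    train_feats: (N, D) features stored from the training set
    index_rate:  0.0 keeps the input features, 1.0 uses only retrieved ones
    """
    k = min(k, len(train_feats))
    # Squared Euclidean distance from every input frame to every training frame.
    diff = feats[:, None, :] - train_feats[None, :, :]
    dist = np.sum(diff ** 2, axis=2)                        # (T, N)
    nearest = np.argsort(dist, axis=1)[:, :k]               # (T, k)
    nearest_dist = np.take_along_axis(dist, nearest, axis=1)
    # Closer neighbours get larger weights; epsilon avoids division by zero.
    weights = 1.0 / (nearest_dist + 1e-8) ** 2
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted average of the retrieved training frames, shape (T, D).
    retrieved = np.sum(train_feats[nearest] * weights[:, :, None], axis=1)
    return index_rate * retrieved + (1.0 - index_rate) * feats
```

With `index_rate` at 0 the input features pass through unchanged, so the delivery of the source audio dominates. Higher values lean more on frames seen during training, which is one way to read the claim that the index carries the accent.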

#

And, as far as I can tell, the RVC one feels pitched down

#

Maybe if you pitched up that specific sample higher...

empty thistle
# empty sparrow 2 - You could have obtained better results by using OV2 pretrain with RMVPE algo...
  1. Unfortunately there are only 5 min of voice lines
  2. Clearly I am just a noob. I just followed some YouTube guide, so I assume I am using original/mainline RVC. Is that why I can't see the options for training with RMVPE etc.? Is it possible to add that without installing a fork? And where do I get the OV2 pretrain?
  3. I do have an index file from using the one-click training (which is apparently bad?). The one I am using is "added_IVF370_Flat_nprobe_1_FoppishDioV2_v2.index".

celest onyx
#

-colab

topaz roverBOT
# celest onyx -colab
☁️ Google Colabs
  • AICoverGen-WebUI, modded by Hina Google Colab
  • AICoverGen-NoWebUI [English], by Ardha, fixed by Eddy, Hina and Gdr Google Colab
  • AICoverGen-NoWebUI [Spanish], credits to Eddy, Hina and Gdr for translating and fixing Google Colab
  • RVC Disconnected, by Kit Lemonfoot Google Colab
  • easyGUI, by rejects Google Colab
🤗 Hugging Face Spaces
celest onyx
#

I am not a model maker, but as far as I know easyGUI is not for training. You should use one of the other Colabs that has a TensorBoard.

empty thistle
#

It did have a TensorBoard. I followed instructions on how to avoid overtraining, but it made no difference.
Here are the latest ones. I just can't get it to sound better... no matter what I do, it turns out the same.

What annoys me to no end is that I know it has potential to be better, because it was "better" / different on my old so-vits-fork for some unknowable reason...

uneven marsh
# empty thistle Hello, I am trying to create a model of a character from the game Mordhau. I wan...

Just to clarify about accent, though: RVC can't guarantee a 1:1 match between your dataset and the inference result, since the audio files get averaged out. To make the output sound as natural as your dataset, use input audio with the same accent as your dataset. For example, if your dataset is spoken in Japanese, the model will obviously sound bad or get the accent wrong when speaking English, so you should also use Japanese voices for inference. That's the whole idea behind RVC.

empty thistle
# uneven marsh Just to clarify tho about accent, RVC can't be guaranteed to have a 1:1 ratio re...

Thanks for the reply. I understand this limitation; in this case I used a British TTS voice from ElevenLabs. What I don't understand is why the one I generated on so-vits-fork sounds so different and more lively than the ones I generated in RVC, inferred on the exact same audio sample. I thought RVC would be better, but even after using the recommended OV2 pretrain with RMVPE and properly selecting the index file, it barely made a difference. Harvest, Dio, RMVPE, OV2 or not, nothing seems to matter.

#

Maybe someone could try generating it for me just to see if it sounds better, or at least different, from what I have accomplished? Just so I can have the peace of mind that it isn't me doing something wrong

haughty seal
#

like 2 or 3
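
Assuming "2 or 3" here means semitones in the transpose setting, the shift corresponds to a pitch ratio of 2^(n/12) in equal temperament. A quick Python sketch (the function name and the 120 Hz example are just for illustration):

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` (12 per octave)."""
    return 2.0 ** (semitones / 12.0)

for n in (2, 3):
    ratio = semitones_to_ratio(n)
    # Example: what a 120 Hz fundamental becomes after the shift.
    print(f"+{n} semitones: x{ratio:.3f}, 120 Hz -> {120 * ratio:.1f} Hz")
```

So 2 or 3 semitones up raises the pitch by roughly 12% or 19%, which is a small but audible change.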
