#RVC - Help with Unique Phonemes - Strong Rolling "R"

barren glade
#

Hello, First post here.
I am trying to create a new RVC voice model to use in a language with a few unique phonemes. The models I have made seem to handle most sounds but I am not able to get the models to represent a strong rolling "R" sound.

The conversion kind of just skips that sound. So far I have tried a few pretrains: KLM49, Rigel, and OV2Super. I have trained on about 30 minutes to 1 hour of speaker data from multiple speakers of related languages. (Note that I want the model to sound unique and not like any particular voice from the dataset I used for training.)

I have a few questions:

  1. Is there an existing pretrain or voice model that can reproduce this sound?
  2. Does training with special sounds help the model represent those phonemes, or does that kind of data need to be in a pretrain? (Essentially, is the training capturing phonetic variations or mostly just the timbre of the voice?)
  3. Any suggestions on what I should try?
  4. I have A LOT of data that spans hundreds of languages and I would love to create a single pretrain that can handle phonemes from any language. Is this within the realm of possibility or do models just fall back towards a generalization that can't capture unique qualities of language outliers?
barren glade
#

Another question - Is it ok for the same audio file to be in your training data and in your pretraining data? What happens if you do this?

warm cobalt
#

the language of the pretrain doesn't matter, it's the embedder that handles phonemes

#

cvec and spin were trained only using english, so they only know english phonemes

#

the pretrain merely serves as a base for frequency reconstruction, people claiming certain pretrains made their model's pronunciation better is pure placebo

versed hedge
#

sorry mb, I'd agree that it's mostly the embedder

warm cobalt
#

finetuning contentvec with, for example, spanish would heavily improve spanish pronunciation
but doing that is an impossible task since no one has ever managed to finetune cvec

#

finetuning spin is possible but requires a big dataset and very strong gpus (5090)

versed hedge
#

I have tried it with a japanese model with the usual cvec, and it still can't produce strong "R"s

warm cobalt
#

its pure luck really, some models can pronounce words well while others cannot

#

but thats due to the embedder rng

#

not the pretrain

warm cobalt
#

the only thing a better pretrain can do for your finetune/small model is to allow better and faster convergence, meaning you will have to train for fewer epochs

#

but og already converges very fast

#

the model is mostly done around 40-80e with the og pretrain

warm cobalt
# barren glade Hello, First post here. I am trying to create a new RVC voice model to use in a ...

tldr of my comments;
the reason why you're having this problem is that the embedders of rvc were trained only on english, so they're not perfect when it comes to extracting the phonemes of non english languages

the pretrain's previously learned phonemes are forgotten by the model in the first epoch, they're replaced with your dataset's phonemes

you would need to finetune the embedder, contentvec is impossible (even the original rvc devs tried this but failed), spin is possible but i dont know how to do it, ive heard it requires a dataset of over 100 hours and a lot of vram tho

barren glade
#

Thanks for all of the information! Ok, so it sounds like, from what you are saying, the dataset is irrelevant for phonemes that the embedder hasn't encountered.

I would be interested in learning how hard it is to finetune the spin embedder. I have over 1 hour of data for around 2000 languages.

Does the number of times a phoneme appears in the dataset for the embedder finetuning affect how likely it is to reproduce it? Strong rolling R's are used infrequently across languages, so the phoneme will be in the dataset but it won't occur very often.
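
One way to see how rare the sound actually is in a dataset is to count it in phonemic transcripts before any finetuning. This is only a minimal sketch: it assumes IPA-style transcriptions of each clip exist (the thread does not say they do), that each phoneme is a single character, and that the trill is written `r` while the tap is written `ɾ`. The function name and sample words are hypothetical.

```python
from collections import Counter

def phoneme_share(transcripts, target="r"):
    """Return (count, share) of `target` among all non-space symbols.

    Assumes one character per phoneme, which is a simplification:
    real IPA also uses diacritics and multi-character symbols.
    """
    counts = Counter()
    for line in transcripts:
        counts.update(ch for ch in line if not ch.isspace())
    total = sum(counts.values())
    hits = counts[target]
    return hits, (hits / total if total else 0.0)

# Hypothetical IPA-like transcripts, not taken from any real dataset.
# "r" is the trill, "ɾ" is the tap; only the first word contains the trill.
sample = ["pero", "peɾo", "kasa"]
hits, share = phoneme_share(sample)
print(hits, round(share, 3))
```

A count like this does not say how many examples an embedder needs, but it shows whether the sound is present at all and how it compares with common phonemes.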

How hard would it be to include the xeus embedder and what is the primary difference? If it gives me the result I'm looking for then I could maybe offload to using some cloud compute for a more powerful GPU.

#

If there is anyone or any resources you could point me to that would be much appreciated!

warm cobalt
#

its the vonovox dev, dr87

#

i dont think it would need too much data for the model to learn how to reproduce that sound, for example when you train from 0 in rvc, you only need a couple of hours of singing for the model to be able to sing

barren glade
#

I'm not sure how many hours it is exactly, and I'll have some preprocessing work to do first to get it organized and in the right format, but I would guess it's around 4000 hours.

#

Thanks for that contact, I'll reach out!

#

I mean in an ideal world I would love to be able to have one embedder/pretrained model that I could use for any language.

warm cobalt
#

spin was trained on 1000 hours of the librispeech dataset iirc

#

ye dr knows way better than me in this topic lol im going to let you know what he says

barren glade
#

Thanks Lyere, excited to see what we can do

verbal shale
#

i already have the mhubert version i can base spin on, but training multi language (in spin) is not a good idea due to the way the codebook is created with spin. its going to try to group sounds from different languages together and slur, and it will never be more accurate than trying, say, just spanish. but since mhubert's pretraining data comes from a ton of languages, it will work much better in general even without all the multi language data
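
The grouping effect described here can be illustrated with plain nearest-codeword vector quantization. This is a toy sketch and not SPIN's actual training procedure: the 2-D feature vectors, the two-entry codebook, and the sound labels are all invented for illustration.

```python
import math

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to `vec` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))

# Hypothetical 2-D "features"; real embedder features have hundreds of dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0)]
sounds = {
    "english_r": (0.9, 1.1),
    "trilled_r": (1.3, 0.8),
    "vowel_a": (0.1, -0.1),
}

# Map every sound to the index of its nearest codebook entry.
codes = {name: nearest_code(vec, codebook) for name, vec in sounds.items()}
print(codes)
```

In this made-up example both r-like sounds land on the same code, so anything that only sees the codes cannot tell them apart. That is the kind of merging a shared codebook across many languages could cause.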

#

if you want, send me your cleanest 30-40 hours of spanish speech (the sample rate can be as low as 16000 Hz since it gets converted to 16k anyway), as long as its cleaned
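
Before sending clips, it can help to confirm each file's sample rate and duration. A minimal sketch using Python's standard `wave` module, assuming the clips are uncompressed PCM WAV files; the helper names and the `MIN_SAMPLE_RATE` constant are illustrative and not part of any RVC tooling.

```python
import io
import wave

MIN_SAMPLE_RATE = 16000  # the thread says audio gets converted to 16k anyway

def wav_info(source):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate

def make_silence(rate, seconds):
    """Build an in-memory mono 16-bit WAV of silence, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buf.seek(0)
    return buf

rate, seconds = wav_info(make_silence(22050, 2))
print(rate, seconds, rate >= MIN_SAMPLE_RATE)
```

Summing the durations over a folder gives the total hours, which is the number the 30-40 hour request is about.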

#

and it needs to be trained on an H200 for a quality embedder

#

spin does not need a large dataset, it inherits everything from the pretrain base and only tunes the top layers. it also creates thousands of fake speakers through its disentanglement process. Only about 40 hours, but they need to be really clean (white noise / hum is fine, but no music and stuff)

barren glade
#

Thanks for the info - I'm new to learning how the embedders are used with RVC.

The data I currently have is around 1 - 5 hours of clean speech (no music) per language in about 2000 languages.

My goal is to produce a voice model that will work in a language that isn't included in my dataset. My initial thought was to combine the data I have from languages similar to my target language. I can reference a tool like this to determine what languages are related - https://huggingface.co/spaces/Flux9665/NearestNeighborLanguages
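
The pooling idea can be sketched as a greedy selection: take the languages closest to the target first and stop once enough hours are collected. The language codes, distances, and hour counts below are invented; in practice the ranking would come from a tool like the one linked above, and `target_hours` is an arbitrary placeholder.

```python
def pool_related(distances, hours, target_hours):
    """Greedily pick the nearest languages until `target_hours` is reached.

    distances: language -> distance to the target language (smaller is closer)
    hours:     language -> hours of clean speech available
    Returns (chosen_languages, total_hours).
    """
    chosen, total = [], 0.0
    for lang in sorted(distances, key=distances.get):
        if total >= target_hours:
            break
        chosen.append(lang)
        total += hours.get(lang, 0.0)
    return chosen, total

# Hypothetical values for illustration only.
distances = {"aaa": 0.10, "bbb": 0.25, "ccc": 0.40, "ddd": 0.90}
hours = {"aaa": 3.0, "bbb": 5.0, "ccc": 2.0, "ddd": 4.0}

chosen, total = pool_related(distances, hours, target_hours=8.0)
print(chosen, total)
```

This only decides which data to combine. Whether an embedder or voice model benefits from the combined set is a separate question.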

It sounds like that may not work for spin embedder since it will probably merge sounds from the different languages together in a strange way.

Do you think MHubert is a better option for my use case? It looks like I can upload a custom embedder and use MHubert. Should I train it further with my data?

You mentioned that the spin embedder is able to create fake speakers. Does this mean that training with a single speaker reading 30 - 40 hours will work?

Any other thoughts or ideas are appreciated.

verbal shale
#

and yes 40 hours is enough

#

per language
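
Put against the dataset figures quoted earlier in the thread (about 2000 languages with roughly 1 to 5 hours each), the 40-hours-per-language figure works out as follows. This is only back-of-the-envelope arithmetic on numbers from the thread, not a measurement.

```python
LANGUAGES = 2000             # "about 2000 languages"
HOURS_PER_LANGUAGE = (1, 5)  # "around 1 - 5 hours" each
TARGET_HOURS = 40            # "40 hours is enough", "per language"

# Total size of the collection at the low and high end of the range.
total_low = LANGUAGES * HOURS_PER_LANGUAGE[0]
total_high = LANGUAGES * HOURS_PER_LANGUAGE[1]

# Hours still missing for any single language to reach the target.
shortfall_best = TARGET_HOURS - HOURS_PER_LANGUAGE[1]
shortfall_worst = TARGET_HOURS - HOURS_PER_LANGUAGE[0]

print(f"total data: {total_low} to {total_high} hours")
print(f"shortfall per language: {shortfall_best} to {shortfall_worst} hours")
```

So the collection is large in total, but each individual language falls well short of the per-language figure.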