#Does anyone use replay to train voice models, and is it worth it?


trail onyx
#

I’ve been trying to train models using replay, with data sets of about an hour and a half. Should I be using auto epoch or should I be setting it myself? I’ve already been cleaning and isolating my voice files; is there any further step I should take when training? Also, should I worry about batch sizes or does the application do the hard work for me? My goal overall is to create PTH files that are accurate and realistic for real-time voice conversion. Any tips or help would be greatly appreciated ^.^

ebon sparrow
#

Learn how to train a high-quality custom voice model with RVC v2 AI in this step-by-step tutorial. Perfect for beginners, this guide covers everything you need to create amazing voice models quickly and easily!

It's recommended to watch the first video in this playlist before continuing with this tutorial.🎥 Playlist: https://www.youtube.com...

#

Simple and no complication, good luck

peak crater
#

Applio fork is your best bet

ebon sparrow
ebon sparrow
willow rivet
ebon sparrow
#

the original v2 pretrain lasted this long and still is robust

#

I wonder who made it

willow rivet
ebon sparrow
#

haha, he just left us with a goldmine and dipped

willow rivet
#

thats what makes the og pretrain superior to every custom made pretrain here

#

it's the only thing trained in the "right way"

ebon sparrow
willow rivet
#

everyone uses applio nowadays because it's faster than mainline, but the results are exactly the same

willow rivet
#

so a theory some have is that a pretrain from scratch has to be trained in a similar fashion to how vits trained their model

#

rvc-boss kinda did that

#

but there are no resources for that, too expensive

willow rivet
#

anyway, mainline and applio are the same thing

#

spin is cool, it has potential but needs a proper pretrain from scratch

ebon sparrow
willow rivet
#

he also didn't have over 1k dollars for the proper vits training

#

for reference, vits trained vctk with a batch size of over 200 and for like 10k epochs

ebon sparrow
willow rivet
#

rvc-boss, due to lack of resources, used batch size 16 and 500-ish epochs

ebon sparrow
#

TONY STARK DID IT IN A CAVE! WITH A BOX OF SCRAPS!

willow rivet
#

lmao yeah

#

somehow it's the best pretrain we have despite being trained like that

ebon sparrow
#

we're not tony stark 🥺

willow rivet
#

for finetuning you don't need tons of epochs tho

#

thats only for pretrains

#

actually fun fact (probably you know this) but everyone is training using the config file of the og pretrain

ebon sparrow
willow rivet
#

you can't use the same lr used in pretrains while finetuning

#

but there's a debate in that at the moment

#

since the og pretrain is already trained in a lower lr than intended

#

it was meant to be trained with 2e-4 but as i explained before, rvc-boss didn't have 10k dollars, so he decreased the lr to 1e-4 because he also decreased the batch size
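
For readers following the lr and batch size discussion, here is a minimal sketch of two common rules of thumb for adjusting a learning rate when the batch size changes. The function names are hypothetical, and 200 stands in for the "over 200" batch size quoted above. Nothing here is taken from the RVC or VITS code.

```python
import math


def scale_lr_linear(base_lr, base_batch, new_batch):
    """Linear heuristic: lr changes in proportion to the batch size."""
    return base_lr * new_batch / base_batch


def scale_lr_sqrt(base_lr, base_batch, new_batch):
    """Square-root heuristic: lr changes with the square root of the batch ratio."""
    return base_lr * math.sqrt(new_batch / base_batch)


# Numbers quoted in the thread; 200 is an assumption standing in for "over 200".
BASE_LR, BASE_BATCH, NEW_BATCH = 2e-4, 200, 16

if __name__ == "__main__":
    print(f"linear: {scale_lr_linear(BASE_LR, BASE_BATCH, NEW_BATCH):.2e}")
    print(f"sqrt:   {scale_lr_sqrt(BASE_LR, BASE_BATCH, NEW_BATCH):.2e}")
```

Both heuristics land below 1e-4 for these numbers, so neither reproduces the value mentioned above. They only illustrate why lr and batch size are usually adjusted together.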

ebon sparrow
#

That seems like a marginal difference if corrected

willow rivet
#

exactly, no one knows a good lr value for finetuning

#

also technically the lr decay needs to be changed too

#

u can't just use a pretrain's parameters for finetuning
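
To make the lr decay point concrete, here is a small sketch of per-epoch exponential decay, where the learning rate is multiplied by a fixed factor after every epoch. The decay factor below is an illustrative assumption, not a value read from any particular config, so check your own config file before relying on it.

```python
def lr_at_epoch(base_lr, gamma, epoch):
    """Learning rate after `epoch` epochs of per-epoch exponential decay."""
    return base_lr * gamma ** epoch


# ASSUMPTION: illustrative decay factor, not taken from a specific config.
GAMMA = 0.999875

if __name__ == "__main__":
    for epochs in (100, 500, 10_000):
        print(f"{epochs:>6} epochs: lr = {lr_at_epoch(1e-4, GAMMA, epochs):.3e}")
```

With a decay this slow, the lr barely moves over a few hundred epochs and only drops substantially over thousands. That is one way to see why a schedule designed for a long pretrain may not suit a short finetune.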

#

rvc-boss didn't change them because he wanted this to be easy for everyone

#

he noticed the config of the og pretrain worked with smaller datasets so he left it like that

#

but its not optimal

ebon sparrow
#

looks promising

willow rivet
#

oh yeah he gave up on rvc after his failed rvc3 tests

#

hes now working on tts

#

gpt-sovits and a new one i cant remember the name

willow rivet
#

so he gave up and claims rvc is perfect and nothing more can be changed

#

we wont ever get a rvc3, so really the only thing we can do right now is to find the "right way" to finetune and train from scratch

#

or wait for a better voice cloning ai that can do realtime voice changing and singing, like rvc

ebon sparrow
willow rivet
#

but the latter is also taking a break from rvc

#

so really noobies is the only one trying things

#

from time to time i experiment with different finetune settings too

ebon sparrow
#

the second one sounds less robotic

willow rivet
ebon sparrow
#

theres a little more audio 'noise' tho in the second one.

willow rivet
#

dataset is noisy

#

but also the inference audio is noisy

#

speaking of noise, yeah spin also handles noise better

#

that's why one is less noisy than the other

#

spin also got more natural breaths

ebon sparrow
willow rivet
ebon sparrow
willow rivet
#

so far ive heard spin works great in realtime

#

i suspect if i retrain the spin model with my new settings, its going to sound better than the contentvec version

ebon sparrow
willow rivet
#
  • the embedder is pretty robust towards noise
#

applio's spin branch ^

ebon sparrow
willow rivet
#

they use a llm for that tho

#

its tts+llm

#

rvc is a bit old (2022 tech)

#

i've heard it's possible to improve rvc's laughing by finetuning the embedder but i don't know if that claim is true

ebon sparrow
ebon sparrow
willow rivet
ebon sparrow
willow rivet
willow rivet
ebon sparrow
willow rivet
#

applio added an experimental mel loss named multi-scale, this also increases your model's singing range and also makes them cleaner

#

(klm was trained with that)

#

but multiscale was never meant to be used in rvc

#

so it has a couple of issues, the most notable being adding ringing to the models

#

ringing is like a weird static sound in the higher frequencies

#

it's easily audible

#

after that was found, they brought back mainline's mel (single scale)

#

multiscale sounds great but yeah, the ringing may be annoying for some

#

i've tried comparing a multiscale mel pretrain vs a single scale mel pretrain and the single scale mel results were better

#

(og pretrain was trained with single scale)

#

singing is harder to train yea

#

there are "hacks" to make a speech model sing but it always comes with issues

#

for the best results both the pretrain and the finetune should be trained only using singing

ebon sparrow
willow rivet
#

yup, but well, there is not much more that can be done rn for rvc, the original pretrain and contentvec will remain the best options until a pretrain can be trained from scratch with spin

#

the current "pretrains" of spin are finetunes using the dataset of the og pretrain

#

32k sample rate only ^

#

highly experimental

#

despite being more robust, it's still advised to clean the dataset

#

in theory, breaths should be better and the model should have a more natural pronunciation
in my test, spin sounded worse because i used a very high batch size (experimenting), but when trained properly, the results are great

ebon sparrow
#

haha we flooded the op's thread

trail onyx
blissful kelp
#

or till I've tried it on applio kaggle

willow rivet
final marlin
blissful kelp
#

then it's kind of misinformation

final marlin
peak crater
#

I think people would understand if they ran that f0_view thingy

gleaming crater
#

@willow rivet hey lyrey can you dm me, i have a question regarding the mel thing. Is that in a fork of applio? Is it added before you convert vocals or only for training? Someone i was speaking to said they had an experimental rvc fork that had an additional step to increase vocal likeness but they were gatekeeping it and saying it was part of private research (I highly doubt that lmfao)

#

Just asking since you seem highly knowledgeable, thank you for any support

#

I basically finetuned a juice wrld model with really clean 48k audio, I even cleaned the dataset, and it's sounding worse and less real than my freaking sovit model from 2023, just looking for new methods

#

I used og pretrain and contentvec

willow rivet
#

it doesn't increase voice likeness

gleaming crater
#

Oh ok that does make sense, it would only make it sound more high bitrate

willow rivet
#

you're talking about the spin embedder, i've heard the wavlm version of spin does increase the likeness more but i havent tried

gleaming crater
#

But i am wondering if SPIN and doing a klm train would increase likeness

#

Yeah

gleaming crater
#

Can you tell me what someone means by custom pretrain

#

The most recent juice wrld model says recent pretrain and the dude told me he used klm

#

Yeah that's a moderate improvement you sent

#

I'm just confused about wth mrf hifigan etc. are, it's so confusing

willow rivet
#

when you're doing your regular rvc model you're doing something named finetuning

gleaming crater
#

Yes I know what they are and why they are used

#

Because training an entire model yourself would be extremely hard and long

willow rivet
gleaming crater
#

I see

#

So is klm a pretrain and is it including hifigan

willow rivet
#

so mrf hifigan, refinegan, they're vocoders

gleaming crater
#

And how does spin factor into the equation

willow rivet
#

vocoders are the thing that clone sounds

#

spin is an embedder, it's the thing that extracts every phoneme found in the dataset

gleaming crater
#

Ok is applio using contentvec

willow rivet
#

"oh your dataset is in english, okay, lemme extract every phoneme for u"

gleaming crater
#

I see

#

That makes sense

#

Spin wouldnt make a large difference then huh

willow rivet
#

during inference, the model uses the phonemes extracted by the embedder

willow rivet
#

you see, cvec is quite old

#

and is not robust towards noise at all

#

actually it despises it

#

spin on the other hand, is newer and is robust towards noise

#

this translates into your model having more natural breaths

#

less occurrences of robotic breathing

gleaming crater
#

That makes sense

#

So KLM is what then?

#

Just implementation of the spin

willow rivet
#

klm is a pretrained model, 800 hours worth of info

gleaming crater
#

I see so klm is a better pretrain for singing?

#

As it includes more singing data

willow rivet
#

yes it's better for singing, og pretrain can't sing

gleaming crater
#

And spin can improve vocal breaths and tics and reduce pops

#

Ok so my model might be shit since im using og pretrain for singing and complex melodies?

gleaming crater
willow rivet
#

these are the voices of each pretrain

#

when you're training a model, you're trying to change those voices

gleaming crater
#

So would you just use klm 6.3 to train a model for singing/rapping now and use spin or not since its only 32k

#

Bc my dataset is super high quality 48khz
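
As background for the 32k versus 48k question: a sample rate can only represent frequencies up to half its value (the Nyquist limit). A minimal sketch, with a hypothetical helper name:

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


if __name__ == "__main__":
    for sr in (32_000, 40_000, 48_000):
        print(f"{sr} Hz sample rate -> content up to {nyquist_hz(sr):.0f} Hz")
```

So a 32k model tops out at 16 kHz regardless of how clean a 48 kHz dataset is, since the audio is resampled down to the model's rate before training.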

willow rivet
gleaming crater
gleaming crater
#

I like to maximize bitrate if possible to maximize data grab

willow rivet
#

i'd recommend klm 4.9 and cvec, with multiscale on

#

for singing

gleaming crater
#

I see so you can train with multiscale or is it only used in the conversion?

willow rivet
#

you can easily hide the issues caused by multiscale during vocal mixing

gleaming crater
#

Any reason for 4.9 and not 6.3 etc

gleaming crater
willow rivet
willow rivet
#

thats why multiscale sounds cleaner

#

the model actually does something above 12khz

gleaming crater
#

I used applio 3.2.8 bug fix to train

willow rivet
gleaming crater
#

Maybe my standards went up but can you look up the most recent juice wrld model in the discord? See how it says custom pretrain? I am wondering what that means, also it sounds really good in comparison to my model, it had 20 mins of data, my model.

willow rivet
gleaming crater
#

Thanks btw @willow rivet for the help

willow rivet
#

Gans are very unstable when training with small datasets

gleaming crater
#

Fuck man is there any easy way to clean data set bc I had to manually clip shit, it seems spin would save my ass from having to go through it manually for 5 hours

willow rivet
gleaming crater
#

Holy shit

#

Rip my life

willow rivet
gleaming crater
#

I lowkey could just download others but tbh I am nervous about someone trojaning a pth model on here

#

Especially since they are running locally

#

Is that data set you sent pushing boundaries it still sounds crunchy on multiscale

willow rivet
#

rvc-boss (the author of rvc) recommends 10 mins because that's when the results start to be more consistent

#

below 10 mins every result is so random and unstable

#

but realism in models starts to hit at the 2 hour mark

#

the pretrain is around 50 hours long (cant remember exactly, it's the vctk dataset)
so maybe for finetuning the max amount you can train is 50 hours (i may be wrong)

#

the more data u add, the better the model knows how to reproduce stuff
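
A quick way to see where a dataset sits relative to the 10 minute and 2 hour marks mentioned above is to total the length of its WAV files. This is a minimal sketch using only the Python standard library. The function names and the wording of the labels are hypothetical, and the `wave` module only reads PCM WAV files, so other formats would need converting first.

```python
import sys
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def dataset_minutes(folder):
    """Total duration, in minutes, of every .wav file under `folder`."""
    files = sorted(Path(folder).rglob("*.wav"))
    return sum(wav_seconds(p) for p in files) / 60


def describe(minutes):
    """Map a duration onto the rough thresholds quoted in the thread."""
    if minutes < 10:
        return "under 10 min: results may be random and unstable"
    if minutes < 120:
        return "10 min to 2 h: results should be more consistent"
    return "2 h or more: the range where realism is said to kick in"


if __name__ == "__main__" and len(sys.argv) > 1:
    total = dataset_minutes(sys.argv[1])
    print(f"{total:.1f} min -> {describe(total)}")
```

Run it with the dataset folder as the only argument. The thresholds are the rules of thumb from this conversation, not hard limits.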

willow rivet
#

using a "hack" that forces him to sing

#

it's normal for models to sound weird if you inference stuff too different from what they learned (which is also why you want to have a big and diverse dataset, to cover different situations better)

#

it's not a multiscale problem but a dataset problem

gleaming crater
willow rivet
#

but if both quality and voice are consistent, everything should be fine

gleaming crater
#

Like keeping dataset length the same, one is from a slow song and another has different songs combined, wouldn't the more diverse set output better results and less artifacting

willow rivet
#

yea rvc kinda knows how to do that

willow rivet
#

instead of using the low notes from the start*

#

if you really want the best voice likeness, train a big dataset where the voice is super consistent and the person is using their whole vocal range

#

smaller datasets tend to sound a bit different from the og voice due to rvc using the knowledge of the og pretrain more

#

and as you know, the voice of the og pretrain is a deep voiced woman

#

so like sometimes when the model doesn't know how to do X stuff, it'll use the pretrain knowledge instead, the woman voice you've heard before

#

rvc boss added the index file to try to fight that

#

forces the model to use the dataset's phonemes
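
The idea of an index pulling the output toward the dataset can be sketched as a nearest-neighbour lookup followed by a blend. This is a toy illustration in plain Python, not the actual implementation. Real tools use a vector index library over embedder features, and the names `bank`, `retrieve`, `blend`, and `index_rate` here are assumptions made for the example.

```python
import math


def retrieve(query, bank, k=2):
    """Average of the k bank vectors closest to `query`, weighted by 1/distance^2."""
    neighbours = sorted(bank, key=lambda v: math.dist(query, v))[:k]
    # Small epsilon avoids division by zero when a neighbour matches exactly.
    weights = [1.0 / (math.dist(query, v) ** 2 + 1e-9) for v in neighbours]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, neighbours)) / total
        for i in range(len(query))
    ]


def blend(query, bank, index_rate, k=2):
    """Mix the input feature with its retrieved dataset feature.

    index_rate = 0 keeps the input untouched; index_rate = 1 uses only
    what was found in the dataset bank.
    """
    pulled = retrieve(query, bank, k)
    return [index_rate * p + (1 - index_rate) * q for p, q in zip(pulled, query)]


if __name__ == "__main__":
    dataset_features = [[0.0, 0.0], [10.0, 10.0]]  # stand-in for training features
    incoming = [1.0, 1.0]                           # stand-in for an input feature
    for rate in (0.0, 0.5, 1.0):
        print(rate, blend(incoming, dataset_features, rate, k=1))
```

A higher rate leans harder on features seen in the dataset, which matches the description above of forcing the model toward the dataset's own sounds rather than the pretrain's.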

gleaming crater
willow rivet
#

thats a very hard task

#

no one gonna do that for free lol

willow rivet
#

with bigger sets you rely less on the pretrain's voice/knowledge, so everything feels more natural and accurate to the dataset

gleaming crater
willow rivet
#

yup vits2 is already a reality

#

there is a github repo with the code

#

but you can't just slap vits2 onto rvc KEKW

#

almost everything has to be redone

#

that would be the real rvc3

#

the embedder spin may be the last major update to be released yt_nails

blissful kelp
limber prism
# willow rivet yea, first one is spin (the new embedder), somehow it made the voice way deeper ...

i find that for whatever reason, the version of the spin embedder in applio (if it was used for that build) has timbre bleed where the original speaker's voice bleeds through more and causes that issue, especially if you're doing male <-> female.

using the wavlm one doesn't have that problem iirc, but lack of good pretrains is an issue; and for whatever reason i can't get the wavlm one working well in realtime. the embedder's codebook might be too large, not sure

willow rivet
#

but it's weird because ive seen other spin comparisons without timbre bleeding

limber prism
#

yeah when i was testing a lot of them with doc it was weird where it was very inconsistent. one dataset would bleed, the next wouldn't. never found a pattern to it

#

even tried throwing a huge dataset at it of 7 hours and that one bled, where one 10 minute one didn't, yet another 10 minute one did. so didn't seem based on data length, or gender, or any type of processing. super odd

willow rivet
#

my set was 5 hours yet still bled too

limber prism
#

i think the wavlm one never bleeds, but that one always sounds in realtime like the speaker took a blow to the head before speaking shibakek so spin so far feels like you can either have bleed, or slurring speech

willow rivet
#

its safer to choose contentvec at the moment, for good consistent results
spin is way too experimental tbh

limber prism
#

yeah for sure. i can get consistently really good results in contentvec, where i don't think i've managed to ever get a spin model that sounds as good yet, despite the theoretical upside. it'll sound really good in some ways, but really compromised in others

#

interesting tech tho!

trail onyx
#

i'm really at a loss on where pretrain models come into play, i usually record audio for larger data sets but how do i utilize a pretrain?

willow rivet
#

original rvc too

trail onyx