I’ve been trying to train models using Replay, with datasets of about an hour and a half. Should I be using auto epoch, or should I be setting it myself? I’ve already been cleaning and isolating my voice files; is there any further step I should take when training? Also, should I worry about batch sizes, or does the application do the hard work for me? My overall goal is to create PTH files that are accurate and realistic for real-time voice conversion. Any tips or help would be greatly appreciated ^.^
#Does anyone use replay to train voice models? Is it worth it?
Follow this -> https://www.youtube.com/watch?v=xACQMNA-iB4&lc=Ugxj9f6No_fRq-MOdAB4AaABAg.ADwAF-Wj34-AEAMFlw7cs3
Learn how to train a high-quality custom voice model with RVC v2 AI in this step-by-step tutorial. Perfect for beginners, this guide covers everything you need to create amazing voice models quickly and easily!
It's recommended to watch the first video in this playlist before continuing with this tutorial.🎥 Playlist: https://www.youtube.com...
Simple and no complication, good luck
I wouldn't recommend mainline in 2025
Applio fork is your best bet
Both work fine honestly, I tried both, I personally like that github one for simplicity's sake
speaking of which, have any successors to that v2 pretrain model come out yet? I just got back to this after like 10-11 months
everything is v2 at the moment, we got a new embedder named spin which is technically better than contentvec/hubert, but it's highly experimental and more tests are needed
but the results are very close to each other it's hard to differentiate a spin model vs a cvec one
That is so interesting
the original v2 pretrain lasted this long and still is robust
I wonder who made it
rvc-boss, the author of rvc
haha, he just left us with a goldmine and dipped
because the author never shared his training technique so no one really knows how to train a proper pretrain from scratch
thats what makes the og pretrain superior to every custom made pretrain here
it's the only thing trained in the "right way"
hahaha theres always that guy that drops something amazing and leaves everyone hanging. How would you even reverse engineer how he did it in the first place?
everyone uses applio nowadays because it's faster than mainline, but the results are exactly the same
well we know rvc is for the most part, VITS
so a theory some have is that a pretrain from scratch has to be trained in a similar fashion to how vits trained their model
rvc-boss kinda did that
but there are no resources for that, too expensive
but he also didn't have the resources back then, so he reduced the lr and batch size of the original vctk vits model and trained the og pretrain like that
anyway, mainline and applio are the same thing
spin is cool, it has potential but needs a proper pretrain from scratch
I wonder what kind of model you would get if you had like 10 thousand dollars and rented a fat stack of servers for a few days to train a model like how rvc-boss did it
oh no he actually used 4x T4 GPU, they're pretty cheap
he also didnt have over 1k dollars for the proper vits training
for reference, vits trained vctk with a batch size of over 200 and for like 10k epochs
Yeah, imagine if you had the resources to train a model in the same fashion rvc-boss did. That's probably what the fine-tunes on here are like anyways
rvc-boss due to lack of resources, used batch size 16 and 500 ish epochs
LOL
TONY STARK DID IT IN A CAVE! WITH A BOX OF SCRAPS!
we're not tony stark 🥺
for finetuning you don't need tons of epochs tho
thats only for pretrains
actually fun fact (probably you know this) but everyone is training using the config file of the og pretrain
yeah, definitely, 50-150 is enough.
is that a setback?
yes because finetuning requires a lower lr
you can't use the same lr used in pretrains while finetuning
but there's a debate in that at the moment
since the og pretrain is already trained in a lower lr than intended
it was meant to be trained with 2e-4 but as i explained before, rvc-boss didnt have 10k dollars, so he decreased the lr to 1e-4 because he also decreased the batch size
That seems like a marginal difference if corrected
exactly, no one knows a good lr value for finetuning
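For a rough feel of what "decrease the lr along with the batch size" can mean, here is a small sketch of two common rule-of-thumb scalings (linear and square root). The 2e-4 lr and batch size 16 come from the thread; the reference batch size of 256 is only an assumption standing in for "over 200", and the function names are made up for illustration. This is not rvc-boss's actual procedure.

```python
import math


def linear_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear rule of thumb: lr shrinks in proportion to the batch size."""
    return base_lr * new_batch / base_batch


def sqrt_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Square-root rule of thumb: lr shrinks with the square root of the ratio."""
    return base_lr * math.sqrt(new_batch / base_batch)


VITS_LR = 2e-4            # lr mentioned in the thread for the original vits run
ASSUMED_VITS_BATCH = 256  # ASSUMPTION: stand-in for "over 200"
RVC_BATCH = 16            # batch size mentioned in the thread for the og pretrain

print("linear:", linear_scaled_lr(VITS_LR, ASSUMED_VITS_BATCH, RVC_BATCH))
print("sqrt:  ", sqrt_scaled_lr(VITS_LR, ASSUMED_VITS_BATCH, RVC_BATCH))
```

Under these assumed numbers the linear rule gives 1.25e-5 and the square-root rule gives 5e-5, so neither reproduces the 1e-4 mentioned above. Treat them as starting points to test, not as known-good values.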
also technically the lr decay needs to be changed too
u cant just use a pretrain's parameters for a finetune
rvc-boss didn't change them because he wanted this to be easy for everyone
he noticed the config of the og pretrain worked with smaller datasets so he left it like that
but its not optimal
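To see why the decay matters differently for a pretrain and a finetune, here is a minimal sketch. It assumes the lr is multiplied by a fixed factor once per epoch (an exponential schedule) and assumes a decay factor of 0.999875; both are assumptions to check against your own config file, and the function name is hypothetical.

```python
def lr_after_epochs(initial_lr: float, decay: float, epochs: int) -> float:
    """lr after `epochs` epochs when it is multiplied by `decay` once per epoch."""
    return initial_lr * decay ** epochs


ASSUMED_DECAY = 0.999875  # ASSUMPTION: check the value in your own config
START_LR = 1e-4           # lr mentioned in the thread for the og pretrain

for epochs in (50, 150, 500, 10_000):
    lr = lr_after_epochs(START_LR, ASSUMED_DECAY, epochs)
    print(f"{epochs:>6} epochs -> lr = {lr:.3e}")
```

With a factor that close to 1, the lr is only about 2% lower after 150 epochs, but it drops to under a third of its starting value over 10k epochs. A decay picked for a very long run therefore does almost nothing during a short finetune.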
I just came across his newer voice clone webui posted on github
looks promising
oh yeah he gave up on rvc after his failed rvc3 tests
hes now working on tts
gpt-sovits and a new one i cant remember the name
he tried to finetune hubert but failed
so he gave up and claims rvc is perfect and nothing more can be changed
we wont ever get a rvc3, so really the only thing we can do right now is to find the "right way" to finetune and train from scratch
or wait for a better voice cloning ai that can do realtime voice changing and singing, like rvc
naah someone will overcome his eventually, but yeah its near perfect haha
ok, that makes sense
sadly most of the people who were trying to improve rvc lost interest, i can say the only people who care at the moment are noobies and dr87
but the latter is also taking a break from rvc
so really noobies is the only one trying things
from time to time i experiment with different finetune settings too
the second one sounds less robotic
yea, first one is spin (the new embedder), somehow it made the voice way deeper than it should
theres a little more audio 'noise' tho in the second one.
dataset is noisy
but also the inference audio is noisy
speaking of noise, yeah spin also handles noise better
thats why one is less noisy than the other
spin also got more natural breaths
yeah, that is interesting. I wonder how it takes out the noise, that makes it more versatile.
no idea, there is also another spin version named wavlm which is even better at handling noise
thats definitely important, constant yapping without the tiny breath pauses gets super annoying to listen to
so far ive heard spin works great in realtime
i suspect if i retrain the spin model with my new settings, its going to sound better than the contentvec version
really? I might give that a go, I definitely could utilize that.
laughs are slightly better (but not perfect) and breaths don't sound robotic often like with cvec
- the embedder is pretty robust towards noise
applio's spin branch ^
are you doing tts? Voice to Voice helps getting the laughs even more accurate.
i personally dont use tts but i noticed chatterbox does very good laughs
they use a llm for that tho
its tts+llm
rvc is a bit old (2022 tech)
ive heard it's possible to improve rvc's laughing by finetuning the embedder but i dont know if that claim is true
still gold 🥇
For laughing, try recording the laughing as separate audio, and laugh in multiple takes, that works for me
yeah nothing is better than rvc for realtime voice changing, for singing there are better ai tools tho
Better singing models like the KLM Pretrain?
it has already been proven that laughing in the dataset doesnt improve the results much, rvc is able to do certain laughs but not everything
the klm pretrain is either overtrained or diverged, it does give your model more singing range at the cost of having robotic speech
right, the robotic speech in that one is atrocious unfortunately.
applio added an experimental mel loss named multi-scale, this also increases your model's singing range and also makes them cleaner
(klm was trained with that)
but multiscale was never meant to be used in rvc
so it has a couple of issues, the most notable being adding ringing to the models
ringing is like a weird static sound in the higher frequencies
it's easily audible
after that was found, they brought back mainline's mel (single scale)
multiscale sounds great but yeah, the ringing may be annoying for some
i've tried comparing a multiscale mel pretrain vs a single scale mel pretrain and the single scale mel results were better
(og pretrain was trained with single scale)
singing is harder to train yea
there are "hacks" to make a speech model sing but it always comes with issues
for the best results both the pretrain and the finetune should be trained only using singing
yeah, that and ultra clean audio as well as the wide variety of ranges.
yup, but well, there is not much more that can be done rn for rvc, the original pretrain and contentvec will remain the best options until a pretrain can be trained from scratch with spin
the current "pretrains" of spin are finetunes using the dataset of the og pretrain
they're here in case you wanna try spin someday: https://huggingface.co/Aznamir/spin/blob/main/f0D32k_spin7-12_single.pth https://huggingface.co/Aznamir/spin/blob/main/f0G32k_spin7-12_single.pth
32k sample rate only ^
highly experimental
despite it being more robust, it's still advised to clean the dataset
in theory, breaths should be better and the model should have a more natural pronunciation
in my test, spin sounded worse because i used a very high batch size (experimenting), but when trained properly, the results are great
oh thanks! I was just about to ask for that. take care lyery
haha we flooded the op's thread
lol all good i still learned alot
I'd look forward to the next applio release before I believe spin as sota
or till I've tried it on applio kaggle
yeah spin still needs more testing
That video uses mainline and suggested bad things like harvest
What's your PC GPU
isnt it mainline RVC 1006?
then it's kind of misinformation
I saw this yesterday too 😭
I think people would understand if they ran that f0_view thingy
@willow rivet hey lyrey can you dm me, i have a question regarding the mel thing. Is that in a fork of applio? Is it added before you convert vocals or only for training? Someone i was speaking to said they had an experimental rvc fork that had an additional step to increase vocal likeness but they were gatekeeping it and saying it was part of a private research (I highly doubt that lmfao)
Just asking since you seem highly knowledgeable, thank you for any support
I basically finetuned a juice wrld model with really clean 48k audio, I even cleaned the dataset, and it's sounding worse and less real than my freaking sovit model from 2023. Just looking for new methods
I used og pretrain and contentvc
back in applio's 3.2.8 update noobies added an experimental mel loss named "multiscale". this new loss was added to reduce an issue named mirroring frequencies but also to allow models to generate frequencies above 12khz
multiscale brings these benefits:
- cleaner result
- higher vocal range
but also comes with the following issues:
- ringing
- on some rare occasions, unnatural speech inference results
it doesn't increase voice likeness
Oh ok that does make sense, it would only make it sound more high bitrate
you're talking about the spin embedder, i've heard the wavlm version of spin does increase the likeness more but i havent tried
Can you tell me what someone means by custom pretrain
The most recent juice wrld model says recent pretrain and the dude told me he used klm
Yeah that's a moderate improvement you sent
Im just confused about wth mrf hifigan etc are, its so confusing
have you seen this file named "f0D32k" in your rvc/models/pretraineds folder?
this is a pretrained model, these types of models are special because they're trained from scratch, which means the AI starts out knowing nothing and has to learn everything about sounds
when you're doing your regular rvc model you're doing something named finetuning
Yes I know what they are and why they are used
Because training an entire model yourself would be extremely hard and long
you're trying to replace the voice of the pretrain with your dataset's voice without overwriting the pretrain knowledge
so mrf hifigan, refinegan, they're vocoders
And how does spin factor into the equation
vocoders are the thing that clone sounds
spin is an embedder, it's the thing that extracts every phoneme found in the dataset
Ok, is applio using contentvec?
"oh your dataset is in english, okay, lemme extract every phoneme for u"
during inference, the model uses the phonemes extracted by the embedder
not really
you see, cvec is quite old
and is not robust towards noise at all
actually it despises it
spin on the other hand, is newer and is robust towards noise
this translates into your model having more natural breaths
fewer occurrences of robotic breathing
klm is a pretrained model, 800 hours worth of info
yes it's better for singing, og pretrain can't sing
And spin can improve vocal breaths and tics and reduce pops
Ok so my model might be shit since im using og pretrain for singing and complex melodies?
Thank you for this
these are the voices of each pretrain
when you're training a model, you're trying to change those voices
So would you just use klm 6.3 to train a model for singing/rapping now and use spin or not since its only 32k
Bc my dataset is super high quality 48khz
most people can't tell the difference between 48khz and 32k
Extreme difference
Most haha, me I can lmfao
I like to maximize bitrate if possible to maximize data grab
I see, so you can train with multiscale or is it only used in the conversion?
you can easily hide the issues caused by multiscale during vocal mixing
Any reason for 4.9 and not 6.3 etc
So without multiscale model cant sing above 12 kHz? That's terrible usually mics go to 20 khz right
if you're using the main branch of applio then you're already training with multiscale on
exactly
thats why multiscale sounds cleaner
the model actually does something above 12khz
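Quick arithmetic on the sample rate talk: a file can only hold frequencies up to half its sample rate (the Nyquist limit), so a 32k model tops out at 16 kHz and a 48k model at 24 kHz. The 12 kHz figure below is the one from the thread; the helper names are just for illustration.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal at this sample rate can represent."""
    return sample_rate_hz / 2


def headroom_above(cutoff_hz: float, sample_rate_hz: int) -> float:
    """How many Hz of representable band lie above `cutoff_hz`."""
    return max(0.0, nyquist_hz(sample_rate_hz) - cutoff_hz)


CUTOFF_HZ = 12_000  # the 12 kHz limit mentioned in the thread

for sample_rate in (32_000, 48_000):
    print(
        f"{sample_rate} Hz: band up to {nyquist_hz(sample_rate):.0f} Hz, "
        f"{headroom_above(CUTOFF_HZ, sample_rate):.0f} Hz of it above 12 kHz"
    )
```

So if a model produces nothing above 12 kHz, a 32k model leaves 4 kHz of its band unused and a 48k model leaves 12 kHz unused.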
I used applio 3.2.8 bug fix to train
idk lol
Shit man I wonder why my model is so shit lmfao
Maybe my standards went up but can you look up the most recent juice wrld model in the discord? See how it says custom pretrain? I am wondering what that means. Also it sounds really good in comparison to my model. My model had 20 mins of data.
small dataset
Thanks btw @willow rivet for the help
Gans are very unstable when training with small datasets
20 min is too small of raw studio vocals?
Fuck man is there any easy way to clean data set bc I had to manually clip shit, it seems spin would save my ass from having to go through it manually for 5 hours
yes
aim for 2 hours minimum
this is a 5 hour dataset
I lowkey could just download others but tbh I am nervous about someone trojaning a pth model on here
Especially since they are running locally
Is that data set you sent pushing boundaries it still sounds crunchy on multiscale
rvc-boss (the author of rvc) recommends 10 mins because thats when the results start to be more consistent
below 10 mins every result is so random and unstable
but realism in models starts to hit at the 2 hour mark
the pretrain is around 50 hours long (cant remember exactly, it's the vctk dataset)
so maybe for finetuning the max amount you can train is 50 hours (i may be wrong)
the more data u add, the model knows better how to reproduce stuff
very monotone dataset trying to sing
using a "hack" that forces him to sing
speech is ok
it's normal for models to sound weird if you inference stuff too different from what they learned (which is also why you want to have a big and diverse dataset, to cover different situations better)
it's not a multiscale problem but a dataset problem
I see, thank you for your help. So what kinda set would you aim for? Like should I just take 3 minutes of acapella each from 20 different songs instead of taking 20 minutes of acapella from 2 song sessions? Like could the fact that I only referenced two songs destroy the dataset since each song is in 1 key/cadence?
the only thing that can destroy a dataset is quality inconsistency, for example using recordings from two different places and microphones
but if both quality and voice are consistent, everything should be fine
Hmmm I see but would output be more like the input if the training data had more diverse keys and tunes etc
Like keeping dataset length the same, if one is from a slow song and another has different songs combined, wouldnt the more diverse set output better results and less artifacting
the model will simply use the low notes when needed, and the high notes when needed
yea rvc kinda knows how to do that
in this audio you can hear how the model used the correct high notes in 0:15
instead of using the low notes from the start*
if you really want the best voice likeness, train a big dataset where the voice is super consistent and the person is using their whole vocal range
smaller datasets tend to sound a bit different from the og voice due to rvc using the knowledge of the og pretrain more
and as you know, the voice of the og pretrain is a deep voiced woman
so like sometimes when the model doesn't know how to do X stuff, it'll use the pretrain knowledge instead, the woman voice you've heard before
rvc boss added the index file to try to fight that
forces the model to use the dataset's phonemes
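A toy sketch of that idea, not RVC's actual code: for each feature the embedder produces at inference time, look up the closest features stored from the dataset and mix them in. The function name, the inverse-square distance weighting and the tiny 2-D vectors are all illustrative assumptions; `index_rate` plays the role of the index ratio setting.

```python
import math


def blend_with_index(feature, index_vectors, index_rate=0.75, k=2):
    """Mix an inference-time feature with its k nearest dataset features."""
    # Rank the stored dataset features by Euclidean distance to the input.
    nearest = sorted(
        ((math.dist(feature, vec), vec) for vec in index_vectors),
        key=lambda pair: pair[0],
    )[:k]
    # Closer neighbours get more weight (inverse squared distance, an assumption).
    weights = [1.0 / (dist * dist + 1e-9) for dist, _ in nearest]
    total = sum(weights)
    retrieved = [
        sum(w * vec[i] for w, (_, vec) in zip(weights, nearest)) / total
        for i in range(len(feature))
    ]
    # index_rate = 0 keeps the input untouched, 1 uses only dataset features.
    return [index_rate * r + (1.0 - index_rate) * f for r, f in zip(retrieved, feature)]


dataset_features = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]  # made-up 2-D "features"
print(blend_with_index([0.9, 0.2], dataset_features, index_rate=0.75))
```

The higher the rate, the more the output is pulled towards what was actually in the dataset and away from whatever the input (and the pretrain) would have produced on its own.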
Oh so larger dataset = more likeness? Got it! And index file isnt as important under higher datasets nice. Wish we could get a rvc 3 lol. When do you think quality will jump again, if ever?
probably never because we need to upgrade VITS to VITS2
thats a very hard task
no one gonna do that for free lol
since the pretrain voice is not the dataset's voice, it causes the model voice to sound a bit different
with bigger sets you rely less on the pretrain's voice/knowledge, so everything feels more natural and accurate to the dataset
Just looked it up there was a vits2 proposal in 2023
yup vits2 is already a reality
there is a github repo with the code
but you can't just slap vits2 onto rvc 
almost everything has to be redone
that would be the real rvc3
the embedder spin may be the last major update to be released 
here's my findings to sound realistic enough & stable (a rather subjective matter, take it with a grain of salt):
- "I don't care about perfect quality / I just want to try it briefly, despite it being prone to glitches due to the insufficient amount of data": 2-5 mins
- bare minimum: 10 mins
- good enough: 30m - 1h
- more realistic: longer than said above
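If you want to check where your own dataset falls on that list, here is a minimal sketch using only the Python standard library. It assumes a folder of uncompressed PCM .wav files; the function names are hypothetical, and the cut-offs between tiers (for example treating everything under 10 minutes as a quick test) are my reading of the list above, not exact rules. It also reports the sample rates it finds, since mixed recording quality was called out earlier as the thing that can hurt a dataset.

```python
import wave
from pathlib import Path


def dataset_stats(folder: str) -> tuple[float, set[int]]:
    """Return (total seconds, set of sample rates) for PCM .wav files in `folder`."""
    total_seconds = 0.0
    sample_rates: set[int] = set()
    for path in sorted(Path(folder).glob("*.wav")):
        # The wave module only reads uncompressed PCM .wav files.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            total_seconds += wav.getnframes() / rate
            sample_rates.add(rate)
    return total_seconds, sample_rates


def duration_tier(total_seconds: float) -> str:
    """Map a dataset length onto the rough tiers from the list above."""
    minutes = total_seconds / 60
    if minutes < 10:
        return "quick test only, expect glitches"
    if minutes < 30:
        return "bare minimum"
    if minutes <= 60:
        return "good enough"
    return "more realistic"


print(duration_tier(90 * 60))  # the OP's hour and a half of audio
```

Calling `dataset_stats("my_dataset")` on your own folder (the folder name is a placeholder) gives the total length to feed into `duration_tier`; more than one entry in the returned sample-rate set is a hint that the recordings come from different sources.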
i find that for whatever reason, the version of the spin embedder in applio (if it was used for that build) has timbre bleed where the original speaker's voice bleeds through more and causes that issue, especially if you're doing male <-> female.
using the wavlm one doesn't have that problem iirc, but lack of good pretrains is an issue; and for whatever reason i can't get the wavlm one working well in realtime. the embedder's codebook might be too large, not sure
yeah i feel spin 7_12 still has bleed issues
but it's weird because ive seen other spin comparisons without timbre bleeding
yeah when i was testing a lot of them with doc it was weird where it was very inconsistent. one dataset would bleed, the next wouldn't. never found a pattern to it
even tried throwing a huge dataset at it of 7 hours and that one bled, where one 10 minute one didn't, yet another 10 minute one did. so didn't seem based on data length, or gender, or any type of processing. super odd
my set was 5 hours yet still bled too
i think the wavlm one never bleeds, but that one always sounds in realtime like the speaker took a blow to the head before speaking
so spin so far feels like you can either have bleed, or slurring speech

its safer to choose contentvec at the moment, for good consistent results
spin is way too experimental tbh
yeah for sure. i can get consistently really good results in contentvec, where i don't think i've managed to ever get a spin model that sounds as good yet, despite the theoretical upside. it'll sound really good in some ways, but really compromised in others
interesting tech tho!
im really at a loss on where pretrain models come into play, i usually record audio for larger data sets but how do i utilize a pretrain?
nothing, applio already uses a pretrain by default
original rvc too
yeah i actually toyed around with applio a little bit more and totally figured it out, i cant wait to try other peoples pretrains and see how these turn out