#i have been trying to find a workaround for this issue for around 2 years
The model sounds like it learned your dataset very well
Unsure how exactly you want to squeeze more out of it. To start, the dataset sample you shared is definitely processed
You mention the tone isn't all the way there. RVC is great but it doesn't work miracles. You might have better luck re-recording what you're inferring with different tones of your own voice
A little tip I learned from a master trainer: use the RVC v1 OG settings. 40k, dataset cut into 12-second samples, clean vocals. OG pretrain, and use harvest. Manually cut your dataset. Vocals should be a single layer. Don't overtrain: start with 100 epochs and go up from there. Usually 150 to 200 should do it. 1000 epochs doesn't always mean better. You're welcome.
Mostly yes but please don't train a harvest model in 2025. RMVPE is much more robust. You're free to use harvest during inference if you want to try, it makes different results, but the model accuracy is really hurt if you train using harvest
Refer to the TensorBoard to see training stats, it's very helpful
For overtraining
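One way to read the TensorBoard curves for overtraining, sketched in Python. This assumes you have exported one loss value per saved checkpoint (the numbers below are made up) and that the point where the smoothed loss stops falling and starts climbing is a reasonable hint for where overtraining begins. The function names and the smoothing weight are illustrative, not part of RVC or Applio.

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average, similar in spirit to TensorBoard's smoothing slider."""
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed


def lowest_point(values, weight=0.9):
    """Return the index of the lowest smoothed value (a rough 'best checkpoint' hint)."""
    smoothed = ema_smooth(values, weight)
    return min(range(len(smoothed)), key=smoothed.__getitem__)


# Hypothetical loss values, one per saved checkpoint: falling, then creeping back up.
losses = [30.0, 27.0, 25.0, 24.0, 23.5, 23.8, 24.5, 25.5, 26.5]
best = lowest_point(losses, weight=0.5)
print(f"lowest smoothed loss at checkpoint index {best}")
```

Smoothing shifts the minimum a little later than the raw curve, so treat the result as a neighbourhood of checkpoints to listen to, not a single answer.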
Search bar in this discord server is better than Google 🔥
Nah, not completely true, it depends on the model and the voice you're making. My boy makes impeccable models. He has compared them, and harvest sounds more realistic.
Yes makes total sense
I think some folks are going for a more unnatural sound in music these days with auto-tune, so RMVPE might work in that respect, or if you have a smaller dataset, but I think harvest is for folks that have over 40 minutes of pure clean audio. 🤔
RVC isn't meant to handle processed vocals to begin with
The pretrain is raw vocal, vctk
Think it just depends. RMVPE did make one of my smaller datasets sound pretty decent. But I'm convinced the older stuff has its benefits.
Also. Set stopping epoch to 10,000. Just watch the tensorboard for when it starts to overtrain
And an epoch depends on batch size + dataset size, so you really can't say "just train it to X epochs", it's kind of a weird measure of training length
why are you sticking with v1 instead of v2?
Applio has already phased it out, though it keeps backward compatibility with v1 models for inference
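A quick arithmetic sketch of why the same epoch count means very different amounts of training. The assumptions here: the dataset is cut into roughly 3-second slices, one optimizer step processes one batch, and a partial last batch still counts as a step. The dataset lengths and batch size are only examples, and the real trainer may batch differently.

```python
import math


def slices_for(minutes, slice_seconds=3.0):
    """Approximate number of training slices in a dataset of the given length."""
    return int(minutes * 60 / slice_seconds)


def total_steps(num_slices, batch_size, epochs):
    """Optimizer steps taken after `epochs` full passes over the dataset."""
    steps_per_epoch = math.ceil(num_slices / batch_size)
    return steps_per_epoch * epochs


# Same 150 epochs and batch size, two hypothetical dataset lengths.
for minutes in (9, 82):
    n = slices_for(minutes)
    print(minutes, "min:", n, "slices,", total_steps(n, batch_size=12, epochs=150), "steps")
```

Under these assumptions the longer set gets roughly nine times as many weight updates for the same epoch number, which is why a fixed "train to X epochs" rule does not transfer between datasets.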
Training a v1 harvest model in 2025 is absolutely DIABOLICAL
Give me reasons why. Why should I listen to what you say as opposed to someone who has been training the same model with the same 1-hour dataset? It's not about what is better?
Harvest has better bass than rmvpe
but anyways, this post was not made to discuss v1 vs v2
answering the OP's question:
increasing the batch size makes the model sound more like the original voice, however by doing that the model loses generalization and that can affect the results (the voice may have frequent glitching)
maybe try batch size 12
This post was to help someone that obviously wants different results. 🙄 Whatever you wanna believe. Go for it
v1 has less complexity than v2 (256 channels vs 768)
Well ill try the latest mainline
Mainline is pretty outdated compared to the Applio fork. Applio is the standard these days https://github.com/IAHispano/Applio/releases
If you don't believe me just use the search bar here
Can you explain why it is outdated? 🤔 Does it deal well with large datasets?
Long story, but basically the OG maintainer lost interest in the project and isn't accepting new contributions to mainline
So we use Applio
They're the same RVC under the hood mostly. You won't notice much difference in how it handles stuff
There are just small improvements here and there, for the user and the training pipeline. Dig into the Applio changelog for more info
Idk, but I'm hearing really good results from RVC mainline with harvest.
Sounds so realistic compared to what I have heard out there regarding the voice model of the particular artist
huh so I don't think he was so wrong, at least for me, v1 gave me better results than v2
breaths are bad in both because the dataset didn't have any
v1 sounds muddy because it is undertrained (v2 is also undertrained in the audio)
it would not be a bad idea to compare v1 and v2 with a better dataset (i don't have any)
i trained them in mainline
48k v1 training didn't work for me, but 40k worked
rmvpe as f0 estimator
v1 sounding good was not something i expected lol, but anyways, my examples are trash, and it would be 1000% better if someone used a better dataset and let rvc fully train on it, then compared the results
Glad you're doing this, providing results and researching it. Sometimes people have to do their own tests
yeah we need more tests before coming to a conclusion
Dont be surprised when they crucify you for this. Lol
well my examples are still trash because both models are undertrained and the dataset comes from league of legends lol
someone needs to try this with a better set + fully training the model
Jesus tried telling the jews but they weren't trying to hear him
@modest garnet would you like to try training your dataset using mainline's v1?
is it against mainline v2 or applio's?
both were trained in mainline, same epoch and bs
i dont have a 40k set 😭
My boy makaveli did and I know for sure his models are great. Experiment with your own ideas; only listening to the crowd will hurt you.
40k is key
ok ill try mainline v1
buh bye 3s slices, auto sync graphs and speed
Also, cut your dataset manually into 12-second clips.
truncate the silence to 0.25
and let mainline slice the dataset
it'll generate 3s slices
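A rough sketch of what "truncate the silence to 0.25" could look like on raw samples, assuming the 0.25 is in seconds and that "silence" means samples whose absolute amplitude is under a small threshold. Real editors use a dB threshold and a minimum duration, so the function name, the threshold, and the sample-by-sample approach here are all simplifications for illustration.

```python
def truncate_silence(samples, sample_rate, max_silence=0.25, threshold=0.01):
    """Shorten every run of near-silent samples to at most max_silence seconds."""
    max_run = int(max_silence * sample_rate)
    out = []
    run = 0  # length of the current silent run, in samples
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run <= max_run:
                out.append(s)  # keep only the first max_run silent samples
        else:
            run = 0
            out.append(s)
    return out


# Toy signal at 8 samples per second: five silent samples between two loud ones.
signal = [0.5, 0, 0, 0, 0, 0, 0.5]
print(truncate_silence(signal, sample_rate=8))
```

A threshold that is too high would also clip quiet breaths, so it is worth listening to the result before training on it.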
yuh
didnt it like suck at doing that
idk
10 sec has been tested already, 12 sec is the sweetspot
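For anyone who wants a starting grid before cutting by hand, here is a small sketch that lays out 12-second clip boundaries over a recording. It assumes fixed-length clips and folds a very short leftover tail into the previous clip; the `min_tail` value and the function name are made up for illustration, and manual cuts would normally be nudged to land on pauses instead of exact times.

```python
def chunk_bounds(total_seconds, chunk_seconds=12.0, min_tail=3.0):
    """Return (start, end) times in seconds for fixed-length clips."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    # Fold a very short final clip into the one before it.
    if len(bounds) > 1 and bounds[-1][1] - bounds[-1][0] < min_tail:
        _, last_end = bounds.pop()
        prev_start, _ = bounds.pop()
        bounds.append((prev_start, last_end))
    return bounds


for start, end in chunk_bounds(25.0):
    print(f"{start:.1f}s to {end:.1f}s")
```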
but at least every slice i got was 3s
i just remembered the graphs are going to be so useless 😭
Harvest is outdated but will give better results
yeah lol
150 epochs to start off with i say. Dataset should be between 10 and 40 min maybe more
that depends on the set
like a 9 minute set will require less than that
since there is less info
u train less
Gotcha
i got an hour and 22 minute set
is that good enough? 
also, do you care what set i use?
for deep male voices I suppose?
wild
nah use anything
ight
Clean set only. Cut out the silence but keep the breaths, in my opinion
We don't speak without breathing, so
too bad there's no working mainline in kaggle
hina's seems to be abandoned
Maybe u never know
oh i could get claude to add in avg graphs
bam, done
nice🔥
i like v2 results more
you inferenced this in mainline?
no
i don't use indexes
time to install mainline wooo

im going to have to use cpu inference 

for some reason, it also doesn't work in fumiama's mainline
only works in rvc-boss mainline lol
yes
ok i did too
you used rmvpe?
yea
why did my model get cursed 😭
this happened when i tried 48k lol
what the 💀
first i thought it was some weird fumiama mainline bug, but i got the same results in rvc-boss mainline
40k worked just fine
idk whats going on
v1 voice is less powerful than v2, but v1 has a less audible metallic sound (at least in my results tho)
v2 here v
99% of the time they are useless
Also include the index please 🙏
It's the features tho bro!
Like u don't want a hybrid voice
Just saying u can have the voice but it wont sound like them
he mostly uses his models for realtime, and in those cases using the index doesn't really matter
it can make things worse
for local usage it's fine, you can force the accent of the model so the voice ends up sounding very similar to the original one
in cases where you don't care about voice similarity (like realtime) it's fine not to use it
Yeah bro, it's Grok, but I knew the index was needed before GPT
in case you don't know what the index does:
it's the accent of the model
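A toy sketch of the general idea behind the index, as it is usually described: each feature vector from the input audio is matched against feature vectors saved from the training set, and the two are blended by an index rate. Everything here is a simplification under stated assumptions. The real pipeline uses a Faiss index over high-dimensional features, while this version uses plain Python lists, a brute-force search, and made-up names and numbers.

```python
def retrieve_and_blend(feature, bank, index_rate=0.75, k=2):
    """Blend one feature vector with its k nearest neighbours from the bank."""
    # Squared Euclidean distance from the query to every stored vector.
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(feature, vec)), vec) for vec in bank
    )[:k]
    # Closer neighbours get larger weights; the epsilon avoids division by zero.
    weights = [1.0 / (dist + 1e-8) for dist, _ in scored]
    total = sum(weights)
    retrieved = [
        sum(w * vec[i] for w, (_, vec) in zip(weights, scored)) / total
        for i in range(len(feature))
    ]
    # index_rate = 0 keeps the input features, 1 uses only the retrieved ones.
    return [index_rate * r + (1 - index_rate) * f for r, f in zip(retrieved, feature)]


# Hypothetical 2-dimensional features; real ones have hundreds of dimensions.
bank = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(retrieve_and_blend([0.9, 0.9], bank, index_rate=0.5, k=1))
```

With the rate at 0 the index has no effect, which matches the point that inference runs fine without one; higher rates pull the features toward what the training voice actually sounded like.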
Especially if people prefer to use it for music
Yeah i know
That's probably why rvc1 wins for some people. Not everyone is doing real-time.
This could be the reason why makaveli is saying rvc1 is better for him. The index might work better because rvc1 was designed with it in mind
People were not really doing realtime back then. Also, they trained with a lot more data, like makaveli does. Maybe
the index doesn't have a big impact in the quality
parameters, and the dataset quality do impact the final result
So you're telling me an index isn't needed? I don't use realtime.
nope, it's not mandatory, you can inference without an index just fine
it's merely a "oh i want the model to speak like the og voice" thing
Yeah u can also smoke crack just fine as well. Does that make it ok? 😄
Thats what im aiming for tho
I want the voice to speak like the og. I dont want it to have another accent
i get what you're saying
you're trying to say that the index works better in v1
in like, accent similarity
i actually agree with u in that
trained 2 v1 models and seems like there's a better similarity between the set and the model
Yeah, cuz I like to make AI songs that sound as much like the real thing as possible
Maybe that's why some folks are divided when it comes to this matter 🤔. I think rvc2 works great for shorter datasets, while rvc1 might be for folks wanting to create an actual song that sounds like their favorite artist, as opposed to speaking in their voice in real-time.
v2 seems to be more expressive than v1 i noticed
It's interesting nonetheless
but v2 also tends to sound more metallic/robotic
Yeah i think it just depends on what people are aiming for. 🤔
Also interesting are the training experiments I'm discovering people are doing. One is that someone figured out how to create a double-layer vocal model. I was always told that RVC can't replicate voices that are mixed in with another voice. The idea was to isolate just the sections with that vocal-double effect, as long as they were clean, from, say, a master reel multitrack
I heard the results and they are amazing
i think this conversation should move to #🧬│ai-chat
the post wasn't made for this lol
Lol 😆 i thought we had a moment 🤣
yes
tldr;
v1 gives less metallic/robotic results at the cost of sounding a bit muddy
but only v1 40k training works, 32k and 48k don't work
Need more examples
Are these two demos separate models
I forget if there's backwards compat
I hear the difference in them but it's so small I'd chalk it up to randomness. Unless they're the same pth
yuh the difference isnt massive
Nooooooo 🥀
there's no 32k in v1
its hidden
idk honestly
there is a 32k pretrain and config, but in mainline's github rvc boss said he didn't like the 32k v1 results
