#More complex tips+tools for vocal separation and dataset preparation to add to workflow
Hi i’m looking for some more complex tips and tricks for vocal separation and dataset preparation to add to my workflow (that hasn’t been mentioned yet). Would be great if anyone could share their typical workflow, tools/software when preparing a dataset as well as any tips, techniques or guides (docs or vids) since looking around old messages/conversations isn't too effective.😅
For context i’m trying to get better at making models while documenting the process as a guide but i kind of realized i don’t really know that many options or methods for vocal separation. I’ve always used the same mvsep lineup for every type of encountered audio but i noticed that there’s still some bleeding left and that i don’t know what to do next from there. i’ve also read most of the guides available (related to making better datasets) but after finding lusbert’s izotope rx10 guide, i realized firstly there’s probably more i could do in my current default workflow and that i don’t know most of them
Timeline on what happened:
• Read lusbert’s izotope rx10 guide
• Tried rx10 and didn’t know how to clean actual background audio (not noise) rather than cleaning up imperfections
• Realized it’s not that good at vocal inst separation (at least for background type music)
• Used standard mvsep line up but properly noticed bleeding for first time
• Struggled with best uvr5 models+settings
• Struggled with debleeding methods
• Tried karafan and worked(?) after long processing time
• No idea how and when to use vst noise gate plugins or properly reading spectrogram regions
• Yea i should probably ask for some proper advice to then record/document in guide format as opposed to figuring it all by self
Basic format:
Step 1: get the audio somehow
Step 2: polish dataset
Step 3: manual editing
Step 4: dataset preparation
Step 5: RVC
My current workflow is:
Step 1: get the audio somehow
• Youtube video or twitch: yt-dlp
Step 2: polish dataset (for every variant of audio to fix)
• Mvsep lineup: (from Perfecting Audio Isolation on Low-End Rigs: A Practical Guide)
• occasionally MDXB karaoke model to get rid of backing vocals i think
• MDX23C 8K FFT, full band (or instvoc in uvr5 i believe)
• MDXB Inst_HQ3 (noticed its good at random background audio removal)
• FoxJoy Reverb Removal (recently noticed that it doesn't do much)
• UVR-DeEcho Normal 0.3 aggression
• UVR-Denoise 0.5 aggression (noticed that this dampens the audio a very small bit)
Step 3: manual editing
• Put filtered track into audacity
• Noise gate everything first
• Listen very closely and label all the slight imperfections
• If it has ONE short instance of bad (noise, quiet music/drums/etc)
• Sounds distorted (loud noise distortion)
• Split at all labels, delete all labeled audio, join audio with silence and truncate silence
• Review entire thing 1 last time for slight imperfection and delete
• Yes i’ve made 400 labels on a 10 min stream clip before
• Normalize
Step 4: dataset preparation
• Export flac quality 8 mono WHOLE file
• Chuck audio file in dataset folder and chuck dataset folder into google drive
Step 5: RVC
• Usually rmvpe, rarely mangio-crepe as most datasets i work with aren’t near peak quality (or have intentional noise)
• Often batch 2 or 4 since been working with tiny datasets for while, batch 4 for larger (15 min+) datasets?
• Honestly need more information on ideal step to download model from (the graph drops alot, maybe just take the latest until it obviously visually overtrains?)
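A minimal sketch of the label-based cutting from Step 3, assuming the labels were exported from Audacity as a text file (one `start<TAB>end<TAB>name` row per label, times in seconds). The function names `parse_labels` and `keep_regions` are made up for illustration; the sketch only computes which spans survive, it does not touch the audio itself.

```python
def parse_labels(text):
    """Parse exported label text: start<TAB>end<TAB>name per line."""
    regions = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        try:
            start, end = float(parts[0]), float(parts[1])
        except ValueError:
            # Rows that do not start with a number (for example the extra
            # frequency rows written for spectral selections) are skipped.
            continue
        regions.append((start, end))
    return sorted(regions)


def keep_regions(labels, total_seconds):
    """Return the (start, end) spans left after deleting every labeled span."""
    kept, cursor = [], 0.0
    for start, end in labels:
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping labels
    if cursor < total_seconds:
        kept.append((cursor, total_seconds))
    return kept


if __name__ == "__main__":
    exported = "1.0\t2.5\tnoise\n4.0\t4.2\tpop\n"
    for span in keep_regions(parse_labels(exported), total_seconds=6.0):
        print(span)
```

The kept spans can then be compared against the original length to see how much of a clip is actually being thrown away before deciding whether repairing is worth it instead of deleting.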
Problem with current workflow:
• Model lineup isn’t the best since bleeds, looking for even better model lineup
• I always use the same method for all types of audio like full music (music videos), background music (background music in a stream), background noise (not music, random audio pots banging etc, overall imperfections)
• Idk how to fix bleeding, looking for method to fix bleeding
• No precise editing for step 3 but rather just be flat out deleting clips that could probably be fixed, probably izotope rx10 modules?
• Too little information with step 4 and 5, need to ask more questions
Sorry if what i’m asking for is a bit messy. Basically my main goal is to first fix my workflow (as in what steps can i add to my workflow to make it better) while recording everything in an organized ideal workflow guide (so that all the information is in one organized place) so would be great if i could fill out all the missing gaps of information. Information on training (step 5) would also help although i might not be able to focus on it yet as opposed to step 1-4 information. Simpler questions/answers i’m looking for are:
• What most of the uvr5+mvsep models are best for (like use this model in this situation vs other)
• Kim
• Bandit
• All the mdx+vr models
• Vocal vs instrument model
• Ensemble mode?
• What to look for in a dataset like laughs, people saying SSS or pop noise when saying P in microphone, etc (that’s actually summarized as all the information is literally all over the place)
• Good uvr5+mvsep+karafan model lineups (and if there’s specific lineups best for specific audios)
• If vocal separation colabs are good
• If model lineups always result in a perfect or near perfect result (like if i need to pay attention to bleeding usually or not that much)
• How to deal with instrument bleeding in vocals
• If separation models perform differently in loud music videos vs background casual music in streams
• Are there models that are good for separating background audio/noise as opposed to music
• Is phase cancellation (like press invert in audacity and mix) important and if yes, any tips related to it
• Even more tools/softwares that people use, when and how it’s used
• Other people’s overall workflows so i can know when x tool is used and so that i can check+ask questions later when busy (i normally use x model lineup, i use this software, i use these plugin/tool settings at x step to get rid of x)
tl;dr: my workflow sucks and i need more information, how do i fix it?
oh yea this is the ideal workflow chart i'm trying to fill out (it's filled with holes and gaps that don't have enough info on)
https://docs.google.com/document/d/1dqIM2ioJtxJTP0mADyk7F2MNxYKwLOAyCnvLRBIevdw/edit
Holy shit this is amazing. @topaz portal might be able to help you
yay
mostly just looking for other people's workflows first so i can review and then ask+confirm questions after (currently offline, maybe busy tmr but have precise idea on wut to ask for next)
@rigid egret
@pastel island I think you've done a great job of setting out the issues. I'm going through a similar process myself and have only just produced my first high quality models (which have actually blown my mind). I've had good results with including laughing, s's, p's (as long as it's clean, i.e. as long as it’s a feature and not infected with noise etc.). Depending on my use case, I might want to include pops etc. because they are a feature of the voice in the particular medium (e.g. if I want to emulate raw-style vocals, a phone call etc., versus removing all “p’s” and leaving the model unable to produce a p consonant). So I work out my features not just on voice, but on medium as determined by use case. The key for me is being able to identify what is a feature, an artifact and an infected feature (interspersed with artifacts/noise). A lot of this depends on how I extract samples from the source. For example, if it is a movie and I want a particular character's voice, my dataset workflow looks like this (there are probably much more efficient workflows, this is just what I’ve worked out myself):
- convert the whole video file to .flac in Audacity, make sure it is mono (much faster processing in Audacity and I've had better results with mono datasets in training).
- listen all the way through and manually pull out the cleanest samples of the voice I can find (background music is ok, but if they are laced with artifacts/noise at the same volume=infected, i.e. UVR5 models can't currently disentangle). I'll paste cleanest samples into a mono track in a separate Audacity project. This will then become my raw dataset.
- Process raw dataset in UVR5. Models will depend on the particular problems with the raw dataset (e.g. if lots of echo, then I'll focus on that [knowing the difference between echo and reverb became very important for me!]). As I'm working it through UVR5, I'll quickly check audio after each processing step to ensure I'm not degrading the dataset (usually never a problem, but I’m careful about denoisers).
- Once I've got the raw dataset as good as I can get it, I'll manually audit it in audacity to cut any artifacts/noise UVR5 has not been able to remove or that I’ve missed earlier (they are always there.)
- In Audacity: apply noise-gate, normalize volume/loudness, and truncate silence, and there's my final dataset.
With high quality datasets of 5+mins, I usually train at batch size 4 at e300 and have been getting good results, but all depends on unique features of the dataset. Again, I’m sure a lot of what I'm doing could be done much faster (esp. manually auditing datasets, extracting samples etc.), so I'd be interested to hear about what others do.
@pastel island just had a look at the google doc, this is a great guide, including for processing raw dataset: https://www.youtube.com/watch?v=Hx2IHzt5tAc. In the guide, its done on mvsep.com, not UVR5, but shows you a good stack you can translate to UVR5 which has most of the same models or equivalent.
Below is my UVR5 line-up based loosely on the above. I find it works really well for almost every raw dataset, with few remaining artifacts/noise (which I'll manually remove in Audacity after). I'll usually add echo removers, and sometimes extra denoising depending on problems with the raw dataset.
(In order)
- MDX23C-InstVoc HQ
- UVR-MDX-NET-Voc_FT
- UVR-MDX-NET-Inst_HQ_3
- UVR-MDX-NET- REVERB HQ (FoxJoy Reverb Removal)
- UVR-De-Echo [either Aggressive/Normal/+reverb depending on raw data problems]
- UVR-DeNoise
Truth be told, this is a Little outdated, I need to update the guides n such
Ah, the one and only! Still a great guide, and I'd be nowhere without it!
I'm really not that but thanks

So there's a lotta wrong here with ur stuff
First
U r using some outdated uvr models
The Kim models aren't that good compared to the newer Mdx23c ones
And all the karaoke ones included
Also
You should never set batch size to 2
Especially with heavy uvr isolated audio
Ur the first newbie that also mentions labels
Good
But
Here are some more tips
U should only isolate stuff with uvr
Then
Do all the cleaning in rx
Click on the 4tg guide
- audio
- Perfecting Audio Isolation on Low-End Rigs: A Practical Guide, by Litsa The Dancer and Faze Masta (Google Docs)
- Gathering and Isolating Audio, by SCRFilms (Google Docs)
- How to make a good model All-In-One guide, by LUSBERT (Rentry)
- Creating Datasets for RVC using iZotope RX, by LUSBERT (Rentry)
- Vocal Mixing Tutorial, by Roomie (YouTube)
Youtube Downloaders
Audio Separation/Isolation
I have a few guides on gathering audio in the audio guides. Getting the highest quality audio in the first place is the most important step, before doing a lot of post-processing. If you are referring to vocals in music, try to look up soundtracks that are in ||torrent|| for example, like an actual rip from streaming services like Tidal, Deezer, Spotify (maybe). Or you can just use a spotify converter instead of youtube. Youtube compresses audio to like 128kbps (max usually) in the opus codec so I don't recommend using it. If you happen to have high quality 24bit FLACs of the full mix and the instrumental, the phase cancellation method can actually help you a lot. Speaking of that, I've always done it in premiere pro, not in audacity. In premiere pro you can manually add volume keyframes to match both mixes together.
There are some cases where the tracks are impossible to separate using phase cancellation because both tracks have different effects, like limiters, compressors, etc. Also add to that the problem of clipped audio with phase cancellation. It creates pops/cracks/weird bleeds in some cases.
for sample purposes, it sounded like this.
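As a rough illustration of the phase cancellation idea discussed above, here is a minimal sketch on plain lists of float samples. It assumes the full mix and the instrumental are already perfectly time-aligned and differ only by volume; `phase_cancel` and the sample values are invented for the example. It also shows why a small level mismatch leaves bleed behind, which is what matching the volumes by hand is meant to fix.

```python
def phase_cancel(full_mix, instrumental, gain=1.0):
    """Invert the instrumental (scaled by gain) and sum it with the full mix."""
    return [m + (-gain * i) for m, i in zip(full_mix, instrumental)]


# Toy signals: the "mix" is simply vocal + instrumental, sample by sample.
vocal = [0.10, -0.20, 0.30, 0.00]
inst = [0.50, 0.25, -0.40, 0.10]
mix = [v + i for v, i in zip(vocal, inst)]

# Levels match: the instrumental cancels and only the vocal is left.
recovered = phase_cancel(mix, inst)

# The instrumental file is 10% quieter than it is inside the mix:
# 10% of the instrumental survives as bleed.
quiet_inst = [0.9 * i for i in inst]
bleed = phase_cancel(mix, quiet_inst)

# Compensating the level difference restores the cancellation.
fixed = phase_cancel(mix, quiet_inst, gain=1 / 0.9)
```

If the two files went through different limiters or compressors, no single gain value makes them match, which is why the method fails in those cases.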
Another tip for using truncate silence in audacity. Know the signal-to-noise ratio (SNR) first before trying any weird settings; the threshold setting depends on your audio's SNR. Know the noise floor too so you can gate that noise properly.
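One way to put numbers on the noise floor and SNR before touching the threshold, as a minimal sketch: it assumes you can select a stretch that contains only background noise and a stretch that contains speech, both as float samples in the range -1.0 to 1.0. The helper names and the 6 dB margin are illustrative assumptions, not recommended settings.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in the range -1.0..1.0."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    if mean_sq == 0:
        return float("-inf")
    return 20 * math.log10(math.sqrt(mean_sq))


def suggest_threshold(noise_only, speech, margin_db=6.0):
    """Return (noise floor, SNR, threshold) with the threshold just above the floor."""
    noise_floor = rms_dbfs(noise_only)
    snr = rms_dbfs(speech) - noise_floor
    return noise_floor, snr, noise_floor + margin_db


if __name__ == "__main__":
    noise = [0.001, -0.001] * 100   # stand-in for a noise-only selection
    speech = [0.1, -0.1] * 100      # stand-in for a speech selection
    print(suggest_threshold(noise, speech))
```

If the SNR comes out smaller than the margin, the suggested threshold would sit at or above the speech level, which is a sign the clip needs denoising first (or should be discarded) rather than gating.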
speaking of bleeding problems of models, try to experiment by adjusting the pitch before feeding to mvsep or uvr, try -6 pitch. I heard it may or may not help to remove the bleeds but it wouldn't hurt to try. Or you can do masking: use a model that is clean in the lower frequencies and another model that is clean in the higher frequencies, then do a basic frequency crossover at around 1 kHz or 2 kHz, it depends.
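The masking idea can be sketched as a basic crossover: take the lows from one model's output and the highs from another's. This is only a toy version under stated assumptions: both outputs are equal-length lists of float samples at the same sample rate, and the filter is a gentle one-pole low-pass (6 dB per octave), whereas a DAW crossover would normally be much steeper. The function names are hypothetical.

```python
import math


def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """First-order low-pass filter (about 6 dB per octave above the cutoff)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for s in samples:
        state += alpha * (s - state)
        out.append(state)
    return out


def crossover_merge(low_source, high_source, cutoff_hz, sample_rate):
    """Lows from low_source plus highs from high_source.

    The high band is built as (signal - lowpass(signal)), so feeding the
    same signal into both inputs gives that signal back unchanged.
    """
    lows = one_pole_lowpass(low_source, cutoff_hz, sample_rate)
    high_lp = one_pole_lowpass(high_source, cutoff_hz, sample_rate)
    return [lo + (h - hl) for lo, h, hl in zip(lows, high_source, high_lp)]
```

With `cutoff_hz` set somewhere around 1000 to 2000, `crossover_merge(model_a_out, model_b_out, cutoff_hz, 44100)` would keep model A's low end and model B's top end; where exactly to cross over still has to be judged by ear or on a spectrogram.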
its better to use labels
and not just truncate
Regarding polishing your dataset, you also need to know what to maintain and what to remove. Certain models have advantages and disadvantages, and the more models you use, the more you are destroying the audio's details. Mdx23c models I can say are the best for full mix separation, no static or humming noise. If your audio doesn't have reverb then why would anyone use foxjoy reverb removal or de-echo dereverb. Same goes for noise: if your audio doesn't have noise, then don't use denoise models. These models only work on white noise itself, no idea why people are using them as a standard workflow. People expect that they will remove foley sounds or bleeds I guess, but they have nothing to do with that. Bandit Plus or MVSep Demucs4HT DNR is your go-to model for that.
For removing reverb, pitch down your audio to -6 and process it to UVR de-echo dereverb model then pitch it back up by +6 afterwards. That way you are maintaining the true high frequencies instead of mirroring them.
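The arithmetic behind the pitch-down trick, as a small sketch. It assumes equal-tempered semitones (each semitone scales frequency by 2^(1/12)) and uses the 17.5 kHz de-echo/dereverb cutoff mentioned elsewhere in this thread as the example number; the function names are made up, and any artifacts from the pitch shifter itself are not modeled.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def effective_cutoff(model_cutoff_hz, semitones_down):
    """Highest original frequency that still fits under the model's cutoff
    once the audio has been shifted down by semitones_down."""
    return model_cutoff_hz / semitone_ratio(-semitones_down)


if __name__ == "__main__":
    print(semitone_ratio(-6))           # every frequency is scaled by about 0.707
    print(effective_cutoff(17500, 6))   # original content up to roughly 24.7 kHz fits
```

Shifting by -6 before the model and +6 after multiplies frequencies by about 0.707 and then by about 1.414, so the two steps cancel out and content that sat above the model's cutoff has a chance of surviving the round trip.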
o yea ty for the extra info! some other extra main questions i have are:
• What are some more specific aspects/situations that the other models would be best at/for like when it's best to use bandit plus and Demucs4HT DNR compared to inst hq3 and etc? (cause still not too sure on each model and also want to make a list to be better at choosing specific models for specific situations)
• Is there a main difference in workflow/choosing models (x model good for calmer vs louder music) for music video type audio vs background music in a stream? (srry if very vague question, guessing it's just picking diff models, maybe diff software while still manually picking best clips to keep but wondering if more)
also i organized more of the info into the doc if curious about it or want to add stuff but slightly busy currently
well mel band roformer can do a better job per se than hq3. also hq3 produces noise
and yes
every model on there does a different job
its mostly experimenting
but
an idea i can give is
mdx23c
great for any isolation
especially songs
mel band
great for clean up
and vit large 23
also good
so u can kinda follow all this for just songs n stuff
bandit does well with podcasts, shows and all that
even streams
and gaming
but it can only do speech
wait which one is the melband model?
i remember mdx23c on both uvr and mvsep and vitlarge+bandit is on mvsep
o
so the mdx23c models (d1581, instvoc) are generally good for everything but best at songs
yes
also
melband can sometimes clear out strong amounts of noise
but u shouldnt rely on it for all types of noise
ur best friend for dataseting really is RX10
what does it mean by good at clean up for vitlarge and melband like is it cleaning up artifacts and stuff after using another model or good at cleaning up random noises and sound effects?
well, they both are algorithms meant to remove music n stuff
so
they usually
assist mdx23c
in cleaning up any bleeding that may occur
they just clean up a little extra
without compromising quality
so it's like mdx23c gets rid of most music and then vitlarge/melband cleans up what's left from the first separation like mostly bleeding but maybe other stuff like buzzing or noise if still there?
yeah
and then the clean up models are better for cleaning up than vocal inst separation?
ohh
and then also for bandit is it like it's good for only speech as in like it can clean random background noises, sound effects, etc but not music specifically
so it's good for not music type audios/scenarios
and also for rx10 i'm thinking but is rx10 better for fixing imperfections in the audio (de-esss or de-pop) or can it actually be used for separation/cleaning stuff like separating music or random background noises as opposed to using models for that
or is it like specific modules would be better for things that the model could get rid of as well like humming maybe or spectral denoise
it can get rid of music too
but its only meant for speech
not singing
o
and yes rx can also isolate vocals
just not with the same quality per se
but
it can also de reverb
but usually the uvr de reverb is better
however
it produces a lot of noise
its a very noisy model
so it's sort of like rx10 is another option to get rid of specific problems alongside models but better to use rx10 for those specific problems (usually fixing pure speech related) than just using models to solve problems that rx10 can solve better or using purely rx10 to do things that models are also better at?
cause wanna make sure it's not like using only purely rx10 or something since confusing
well, i see ur a little confused
so
RX10 is an audio repair tool
it helps u enhance the quality of ur audio
and uvr
is meant to clean out noise, music and all that good stuff
so
what u usually do
is
u simply use uvr for whatever u need to do
eg
isolate vocals
and then
take it to rx
and clean it up and enhance it
thats basically the whole idea
bc
for example
rx10's spectral de noise does not cut off
and offers really high quality outputs
so a workflow for example could be like first u have a music video:
first isolate vocals (models are better at doing it)
once cleaner vocals, repair it with rx10 (rx10 better at repairing it)
so you could use rx10 to isolate the vocals but it'd be ok quality
and u also could use models to repair the audio but it'd be meh quality
well not really, the uvr models cannot repair any quality damages
they just clean up the audio for u
yea so then if u try to repair audio with models it wouldn't rlly work so then you'd use rx10 since it's better at that
wait what overlapping features are there with models and rx10 and which would be better for each
like spectral denoise vs uvr denoise
or de reverb
or maybe like de wind or something for maybe "cleaning"/isolating background noises from the vocals (don't know if there's modules or models for that)
spectral de noise is an algorithm, it just looks at values n stuff and removes anything that it was shown to be as noise.
uvr de noise on the other hand is moreso a network
its AI
so
AI
can be random
and it tries to guess each time
what it should clean out and it shouldnt
so mostly just test both overlapping things first and then see which better for current audio?
unless 1 of them is mostly better
u want to stick with rx
as it doesnt cut off
cut off is like 48k cut to 44.1k sample rate right?
also is there main quality related importance/problem related to cutoff or just that it's not the original audio sample rate so u can't resample it back
also to confirm more model stuff again before doing some other works
more recent/better:
mdx23c models: good for everything, best for songs [heavy vs calmer music? speaking vs singing?]
melband+vitlarge: good for cleaning up after mdx23c/music separators (bleeding, random noise that got through)
• melband sometimes clears out lots/strong amounts of noise
bandit: good for cleaning up SPEECH (random noises, background music), NOT singing
uvr echo+reverb: ai (more random), makes noise, has cutoff
rx10 echo+reverb: algorithm (more specific), no cutoff
outdated/usually less good:
inst hq3: less good melband
kim vocal: i think outdated/outclassed in some way?
extra:
models GOOD for isolating vocals from stuff (music)
rx10 generally better for audio repair
are those the main models to use or any other good models that missed (also wuts mdxb voc_ft and need bit of clarification for mdx23c model best for songs)
yeah good u got the gist
🎉
voc ft is basically kim voc 2 but with demucs ft stuff to clean audio a bit better
but its SDR is 9.63
and thats rather low
plus it cuts off too
o rip
does cutoff negatively impact quality in some way or just that it's downsampled so u can't resample it back (since don't know if very big quality problem or rip sample rate type problem)
speaking of cutoff in general audio, it doesn't affect the quality as such, it's just that you will lose details at higher frequencies. Humans can generally hear up to 17 kHz on average and above that you can't hear any details, which means 32000 or 34000 is already enough for high quality listening. But with rvc, you have to maintain the correct cutoff (half the sample rate) to avoid artifacts: 16 kHz for 32K, 20 kHz for 40K, and 24 kHz for 48K. In some tests before, the correct sampling rate with the proper cutoff lessened the artifacts.
o so in 1 way it's that u lose more sharper details of speech but mostly that using 44k at 40k/48k setting usually worse than using 40k for 40k or 48k for 48k?
sometimes those "details" u see might also be clipping
especially when dealing with mid quality audio
so cutoff often results in clipping aka artifacts rip
wait can u do cutoff and then clean the clipping somehow or more like u could but better to not since u lose details when cleaning it
also wut does it mean for the mdx23c models better for songs like better for singing or like heavy music vs quiet music in background
gtg tho ty for the help
Kinda what I noticed, 44.1 for 48K, not good. 44.1 for 40K, a bit okay. But it's better to target the correct sampling rate to avoid rvc resampling your audio, so 40 for 40K, 48 for 48K. Rvc might have problems with resampling the audio, bad resampling can lead to artifacts, aliasing, and distortions and such.
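A quick way to check a file against the training target before it goes into the dataset folder, as a minimal sketch. It uses Python's standard `wave` module, which only reads uncompressed PCM WAV, so it assumes a WAV export rather than FLAC; `check_wav` and `make_silent_wav` are hypothetical helpers, and the cutoff it reports is simply half the sample rate.

```python
import io
import wave


def nyquist_hz(sample_rate):
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate / 2


def check_wav(file_obj, target_rate):
    """Read the sample rate from a WAV header and compare it to the target."""
    with wave.open(file_obj, "rb") as wav:
        rate = wav.getframerate()
    return {
        "rate": rate,
        "cutoff_hz": nyquist_hz(rate),
        "matches_target": rate == target_rate,
    }


def make_silent_wav(sample_rate, seconds=0.01):
    """Build a tiny mono 16-bit WAV in memory, only used for the demo below."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # A 44.1k file aimed at a 48k training setting would get resampled.
    print(check_wav(make_silent_wav(44100), target_rate=48000))
    # A 40k file aimed at a 40k training setting is left alone.
    print(check_wav(make_silent_wav(40000), target_rate=40000))
```

For a real dataset, `check_wav` can be given a file path instead of an in-memory buffer, since `wave.open` accepts both.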
U can use rx to resample to 40k
Yeah any daw or audio production software can save the audio with a custom sample rate, also my question is why do you keep mentioning rx tho? There are things that rx can't do, I am just mentioning general tips with regards to audio. Kinda odd you keep mentioning the dataset MUST be done in uvr and rx only.
Rx is just viable and I'd say pretty noob friendly
small problem but the uvr mdx23c, mvsep bandit, vitlarge and melband models all have the cutoff at 44.1k so don't really know how to continue the workflow while preserving the original 48k sample rate. i could still probably use the models for a 44.1k workflow but any other tips/advice/tools for a good 48k workflow?
also what are some other daw and audio production software that could help/be used in a workflow cause idk much other than uvr, karafan(not sure how it fits in workflow yet), rx10, vst plugins(?) and spek(?)
(also am i able to keep the help-forum as a way to track all the question or is the forum more for single questions since been using the forum for a while)
That's not a problem, that's a feat these models have
Bc all the past models used to be a lot less
are there any other good isolation methods that also keep the sample rate (48k) or most of the best ones have minimum cutoff of 44k since don't know what the next isolation step in workflow while keeping same sample rate would be. i think rx10 would work for repairing any audio but last time there was alot of bleeding when i tried isolating vocals from music
very simple, if you want to maintain the 48K output, use adobe audition and interpret the sampling rate to 34000, that causes the audio to lower the pitch, then save. Then you can do what ever model you want without any cutoff, even using the uvr de-echo dereverb with 17.5Khz cut off, you don't need to worry with that.
Because your audio is lowered to just 17Khz, it fits all the details within it. After you do all processes in uvr or mvsep assuming that these are now at 44.1K, use adobe audition to convert the sample rate to 34000, then interpret the sample rate back again to 48Khz, then you are good to go.
experiment with it and you will get it right.
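To make the "interpret the sample rate" step concrete: reinterpreting changes only the rate written in the file header and leaves every sample untouched, which is what stretches the audio and lowers its pitch. Below is a minimal sketch with Python's standard `wave` module, assuming uncompressed PCM WAV data; `reinterpret_rate` and `scaled_frequency` are names invented here. The resampling steps in the described procedure (the models working at 44.1K, and the conversion back to 34000 afterwards) need a real resampler and are not covered.

```python
import io
import wave


def reinterpret_rate(wav_bytes, new_rate):
    """Rewrite only the header's sample rate; the sample data is copied as-is."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(new_rate)
        dst.writeframes(frames)
    return out.getvalue()


def scaled_frequency(freq_hz, old_rate, new_rate):
    """Where a frequency lands after the file is reinterpreted at new_rate."""
    return freq_hz * new_rate / old_rate


if __name__ == "__main__":
    # The top of a 48k file (24 kHz) lands at 17 kHz when read as 34000,
    # which is under the 17.5 kHz cutoff mentioned above.
    print(scaled_frequency(24000, 48000, 34000))
```

Reinterpreting the processed file back from 34000 to 48000 at the end undoes the stretch, so the original pitch and length come back.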
another one is bending the pitch to -6 semitones, process the audio you want with the models then pitch bend it up back again by +6 semitones.
afaik, the goyo plugin is the only ai isolation for noises but it is paid now. They support 48Khz and 96Khz.
ok so i was playing around with different tools but now i'm kinda curious on what should actually be kept in a dataset? i created this mini list on potential problems and things seen in audio samples:
Ideal: complete, clean sentences with different articulations and features
Not ideal: cutoff, short speaking with little differences in pitch and features, anything listed below in potential problems?
Potential problems:
• Noise (white, pink, brown)
• Other speakers/people talking
• Reverb/echo (very slight reverb vs obviously reverb)
• High frequency scratching noises
• Background audio (clicking, wind, plates, etc)
• Instruments/bleeding
• Complete cutoff speaking (cutoff in mid vs end of word)
• Distortion (buzzing, low quality mic)
• Mic problems:
• Clipping
• Plosive
• Harsh/high frequency sibilance (from “S,” “F,” “X,” “SH,” and a soft “C.”)
• Talking far vs close from microphone (difference in volume)
• Types of speech:
• Laughing (laughing while talking vs pure laughter)
• Coughing
• Breathing (sentence with breathing vs pure breathing, hmph)
• Speaking vs singing (similar sounding vs vastly different)
• Single consonant/vowel grunts (aaaa, ohhhhh)
• Long complete sentences vs very brief sentences
• Different accents/voices (different speaking voices)
my main conclusion i currently think is this (this really really reminds me of it: https://www.youtube.com/watch?v=0QczhVg5HaI&t=537s)
The goal for a dataset is to get as many legitimate pitches and features as possible. (like full sentences)
When presented with an audio sample that has problems, determine if you’re better off keeping (better to have this as extra data for the model) or not keeping it (i already have a bunch of sources and this audio sucks, probably better to discard).
This was tested (for a private model) with Dataset 1 as a 2-ish min dataset, and Dataset 2 which is Dataset 1 but heavily cleaned out and reduced to 1-ish min (this was with my old workflow btw). Dataset 1 with rmvpe performed better or had more variety than Dataset 2 with mangio-crepe, which sounded duller with less varying pitches (even when very very cleaned out). This could’ve been because rmvpe at least had more varying pitches/audio and maybe a couple of (maybe infected) features as opposed to mangio with very very little to go off of (although with a very good pitch).
Was for a private model but might've not been the best experimental environment (maybe mangio-crepe insanely specific rather than mostly specific) but it made me notice that there might be specific scenarios where leaving in some artifacts would be fine or have varying impacts (like in giant dataset maybe?).
these are my current questions that i'm stuck on (srry for alot but i have 0 idea on them)
Questions:
• How do messy/not clean samples influence the result in smaller vs gigantic datasets and what are some of their impacts? (will it completely mess up inferences with that same referenced pitch/word?)
• Are there special things that rmvpe or mangio-crepe do to datasets like being able to find patterns in audio with artifacts
Types of samples that i have no clue on its impacts or if it's fine to include:
• Laughing (laughing like giggling while talking vs pure laughter)
• Coughing (probably not include since it’s not speech)
• Breathing (sentence with breathing vs pure breathing, hmph sounding, maybe sentences that end with breathing are good for natural sounding/index?)
• Speaking vs singing (if the singing voice is similar sounding to speaking vs vastly different but combined in same dataset)
• What happens if you cut off the speaking mid sentence (cutting off in the middle of a word vs right after a word ends in a sentence)
• Impact of isolated vocals from instruments (like if the model would be completely fine or if there’s some unnoticable bleeding that’d ruin result in the model making it have scratching or weird noises or something)
• Microphone proximity while talking (if the person moves far from their microphone while talking so they’re barely audible or if they’re so close that the audio sample becomes super dense sounding)[better to just keep dataset focused on same proximity samples?]
• Long complete sentences vs very brief sentences (not enough features?) vs words cut from a whole sentence (similar to cut off mid sentence type)
• Single consonant/vowel grunts (aaaa, ohhhhh)
• Different accents/voices (different speaking voices)[is it good since it has more features or very bad since it sounds like a different speaker?]
• Very very slight reverb vs heavily de-reverbed so it’s just very dry signal vocals that end words with bit of distortion
this is the only thing I can add
• How do messy/not clean samples influence the result in smaller vs gigantic datasets and what are some of their impacts? (will it completely mess up inferences with that same referenced pitch/word?)
- afaik with rvc, it will sound like bleeds since rvc is averaging out the audio; in inference some parts will be clean and some parts will indeed have humming or buzzing sound problems.
• Are there special things that rmvpe or mangio-crepe do to datasets like being able to find patterns in audio with artifacts
- has been a debate for months, but generally rmvpe has the best sounding pitch algo. Differences between the 2? Rmvpe may or may not be free from artifacts but higher chance that your model will not sound the same as your dataset. Mangio may or may not be cleaned but mangio is more dynamic than rmvpe, which means it will have more pitch and accent variations and it will sound almost the same as your dataset.
For others:
Normal breaths, like breaths in between sentences/phrases, are fine to add. What is not good is coughing and laughing, since most often the models are only used for singing.
It is better to separate singing model and a speaking model because they have different characteristics to it, like singing is more expressive and when speaking it is more lifeless. Remember that rvc is averaging out the audios.
Cutting off speaking mid-sentence? Well it has a big impact, as a sudden cut results in a distorted spike or popping sound in the spectrogram. What happens is your model will sound too choppy at some points, or it may create random distortion.
Impact of isolated vocals from instruments? I mean like it doesn't affect as long as there's no bleeding in your audio, if there are bleeds, discard it.
And speaking of microphone distance, yes it does affect the quality of the model, especially if there's slight room reverb in it. Better to just keep them at exact distance.
Single consonant/vowel grunts, yes they are fine.
Different accents/voices you mean like speaking english then suddenly speaks other languages? I mean they could help add more variations to your dataset, but not usually recommended. But not sure what you're referring to.
And lastly reverb: this is the most important part to keep in mind, keep your vocals as dry as possible. Reverb can create humming issues, reverb bleed, or buzzing sounds.
tyty
something i noticed with samples (speaking, probably singing too but haven't worked with those in a while) that have a very very slight amount of reverb, since i'm playing around with rx10: when the vocal is dry, it sounds a bit weird or distorted at the end of a sentence. so is it better to keep it very dry even if it sounds a bit weird and then maybe add a tiny bit of reverb if u wanna make it sound natural again, or is it better to get rid of as much reverb as possible until there's no major reverb left but it still sounds natural enough (since very dry vocals sometimes sound weird, unless rvc works way way better with dry vocals)?
also for the diff accents/voices it's more of like voice actors switching to a different type of voice whether it's professional voice actor level or just a slightly different way of speaking
I can't really imagine why you have distortion after sentences when dereverbing it, normally it will sound a bit muffled but not distorted. As for rvc, yeah it is important to have dry audio tho. Tested before with reverb and they sounded idk I can't really explain, it's off and tons of artifacts
if the accents are too far from their original voice then I suggest not including them, same with speaking in other languages, not recommended usually.
oks ty
Oh my god ur dropping all this I'm very overwhelmed
yeah, the audio should be dry (dereverbed), since reverb and echo are basically additive effects (even natural room reverb/echo), and the voice audio coming from the person's mouth is actually dry and mono (like any natural sound source). MDX dereverb+UVR deecho generally works well unless the reverbed audio is mono/fake stereo. though some cases like "ahh" and "ohh" may still leave a little reverb.
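Since the dereverb chain above is said to struggle with mono/fake stereo input, it can help to check for that before processing. A minimal sketch, assuming the two channels are already loaded as equal-length lists of floats. The name `is_fake_stereo` and the tolerance value are hypothetical, and lossy-encoded files may need a looser tolerance.

```python
def is_fake_stereo(left, right, tolerance=1e-4):
    """True when both channels carry (almost) the same signal.

    `tolerance` is an assumed threshold on the largest per-sample
    difference between the two channels.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    if not left:
        return True  # nothing to compare, treat as not true stereo
    peak_diff = max(abs(l - r) for l, r in zip(left, right))
    return peak_diff <= tolerance


if __name__ == "__main__":
    same = [0.1, -0.2, 0.3]
    print(is_fake_stereo(same, list(same)))          # identical channels
    print(is_fake_stereo(same, [0.1, 0.2, -0.3]))    # channels differ
```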
a bit of extra questions:
• Are there any useful settings or uses for uvr5 ensemble mode since i’ve never used it before and i’m not sure if it’d work as an extra option to have alongside other models by themselves
• Are there any differences on how the audio files should be formatted when processed in rvc like if it’s better to put a giant clip or to put many individual clips
• One huge audio file
• Multiple individual clips of person speaking start to end sentence
• Multiple individual clips cut at fixed times (all 5 seconds clips)
• Related to above, do clips shorter than 1-2 seconds negatively impact the training result? When i process individual clips that are 1-2 seconds long (with the speaker sample still being fine, like their entire sentence is 2 words), the rvc colab puts up a warning, when i could just format all the individual files as one huge file and process it like that
• Is it normally really hard to address robotic sibilance, breathing and static/noise embedded (when it speaks) in a model, and are there any tips/examples on what types of samples to aim for to fix these problems? (especially robotic sibilance and breathing related problems)
• Is making high quality models very difficult with isolated+cleaned vocals as opposed to original very high quality vocals?
• How does batch size work and how do you get the best result of a model when selecting it
• Do you start with 16 and go down?
• Is it based off of dataset size or quality?
• Is it based on other factors like the pretrain or the f0 algorithm (rmvpe, mangio, hop length?)
• Do certain batch sizes have a different aspects best for specific model results? (like generalized result or something)
• when using mangio-crepe, does the dataset have to be insanely high quality (like original studio vocal quality) with 0 artifacts or can there be very few artifacts as idk if mangio-crepe dies when there's few artifacts in extensively cleaned dataset
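On the short-clip question above, one in-between option (instead of one huge file or lots of 1-2 second files) is to merge neighbouring short clips until each merged file passes some minimum length. A small sketch of just the grouping step, working on clip durations. The name `group_short_clips` and the 3 second default are assumptions for illustration, since the thread doesn't say what length triggers the colab warning.

```python
def group_short_clips(durations, min_seconds=3.0):
    """Group consecutive clips so every group lasts at least `min_seconds`.

    `durations` is a list of clip lengths in seconds, in dataset order.
    Returns a list of groups, each a list of clip indices to concatenate.
    """
    groups = []
    current, current_len = [], 0.0
    for index, seconds in enumerate(durations):
        current.append(index)
        current_len += seconds
        if current_len >= min_seconds:
            groups.append(current)
            current, current_len = [], 0.0
    if current:
        if groups:
            groups[-1].extend(current)   # fold short leftovers into the last group
        else:
            groups.append(current)       # whole dataset is shorter than the minimum
    return groups


if __name__ == "__main__":
    clip_lengths = [1.0, 2.5, 4.0, 1.5, 2.0, 0.5]
    for group in group_short_clips(clip_lengths):
        print(group)
```

Each group can then be concatenated into one file, ideally with a short gap of silence between clips so words from different sentences don't run into each other.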
also does anyone have any good vocal samples used for inferencing when testing a model, as i don't have a lot of them
hey litsa can u check ur dms rq ! @topaz portal would need help with some of this
would appreciate it bro
nvm got some clarification
this thread has been left out haha, but I can answer just a few things, the others are just subjective. Addressing robotic sibilance and breaths is somewhat difficult. Even if the model was trained properly, there's a small chance there will still be robotic parts. So far the fix I've found for this is to have more variation in your dataset and a longer dataset. And for making high quality models with isolated vocals, it's also kinda difficult to make it sound right, especially maintaining most of the details of the vocals. It's pretty easy to make high quality models with proper high quality samples, as you don't need that much post processing. As for the mangio crepe dataset, either way, clean or not, it still works.