# How do I rescale the audio?
I think @woven sonnet or @kind nimbus might know
What exactly do you need it for?
you mean you want to upsample your audio?
if you mean that, i wrote a small guide on it once lol
This tool is not designed for use with RVC datasets; it works best as a standalone tool. While Audacity can do something similar, I find this method more effective.
Step 1: Visit this Colab. If you have concerns about the security of your account, feel free to use an alt account. But thi...
you don't really need it for rvc tho
if that's what you're looking for, this was probably the github you saw
true, that's why i'm saying this doesn't really make a difference for rvc
it's useful outside of that tho
Are you referring to resampling the sample rate or upscaling it by recovering lost high frequencies?
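To make the distinction in that question concrete: resampling only changes how many samples per second are stored, while the highest frequency a file can hold is half its sample rate (the Nyquist frequency). The sketch below uses a naive linear-interpolation resampler written purely for illustration; `nyquist_hz` and `resample_linear` are hypothetical names, not part of any tool mentioned in this thread. Real resamplers such as SciPy's `resample_poly` use band-limited filtering, and this one has no anti-aliasing filter, so it should not be used to downsample real audio.

```python
def nyquist_hz(sample_rate: int) -> float:
    """Highest frequency a signal at this sample rate can represent."""
    return sample_rate / 2


def resample_linear(samples: list[float], src_sr: int, dst_sr: int) -> list[float]:
    """Naive linear-interpolation resampler (illustration only).

    It changes the sample rate but cannot recover lost detail: anything it
    produces above the source Nyquist frequency is interpolation artifacts,
    not real high-frequency content.
    """
    if not samples:
        return []
    n_out = round(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        # Position of output sample i on the source time axis.
        pos = i * src_sr / dst_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


clip_16k = [0.0, 1.0, 0.0, -1.0]
clip_48k = resample_linear(clip_16k, 16_000, 48_000)
print(len(clip_48k), nyquist_hz(16_000), nyquist_hz(48_000))
```

Upsampling a 16 kHz clip to 48 kHz this way triples the sample count, but the real content still stops at 8 kHz. Filling in the 8 to 24 kHz band is the second kind of "upscaling", which needs a model that synthesizes the missing frequencies.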
that's the part I don't understand about rvc: if it limits everything to 16 kHz, then why do we have 40k and 48k pretrains?
for F0 iirc, but a proper 48k dataset will still output full 48k during inference
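One way to see how a 16 kHz analysis stage and a 48k model can coexist: if pitch and features are computed on a fixed time grid, the same number of frames covers a clip at any sample rate, and only the number of output samples the generator must produce per frame changes. The 10 ms frame (a 160-sample hop at 16 kHz) is an assumption about RVC's F0 settings, not something stated in the thread, and `frames_for` is a hypothetical helper.

```python
def frames_for(duration_s: float, sample_rate: int, hop_ms: float = 10.0) -> tuple[int, int]:
    """Return (hop in samples, number of whole frames) for a clip.

    Assumes a fixed hop measured in milliseconds, so the frame grid does not
    depend on the sample rate; only the hop length in samples does.
    """
    hop = round(sample_rate * hop_ms / 1000)
    n_frames = round(duration_s * sample_rate) // hop
    return hop, n_frames


for sr in (16_000, 32_000, 40_000, 48_000):
    hop, n = frames_for(3.0, sr)
    print(f"{sr} Hz: hop={hop} samples, frames={n}")
```

Under that assumption a 3-second clip gives the same 300 frames at every rate, so a vocoder-style generator conditioned on 16 kHz analysis only has to turn each frame into 480 samples instead of 160 to output full-band 48k audio.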
older model, ignore the HF meme around 22 kHz
source audio
i artificially band-limited it with an hpf and lpf, though, and trained it as a 32k model anyway
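The hpf/lpf step can be sketched with first-order (one-pole RC) filters. This is an assumed technique, not the tool that was actually used: one-pole filters roll off gently (about 6 dB per octave), and audio editors usually offer steeper ones. The function names are hypothetical.

```python
import math


def one_pole_lowpass(x: list[float], cutoff_hz: float, sr: int) -> list[float]:
    """First-order RC low-pass: attenuates content above cutoff_hz."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sr
    alpha = dt / (rc + dt)
    y, prev = [], 0.0
    for s in x:
        prev += alpha * (s - prev)  # move a fraction of the way toward the input
        y.append(prev)
    return y


def one_pole_highpass(x: list[float], cutoff_hz: float, sr: int) -> list[float]:
    """First-order RC high-pass: attenuates content below cutoff_hz (blocks DC)."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sr
    alpha = rc / (rc + dt)
    y, prev_y = [], 0.0
    prev_x = x[0] if x else 0.0
    for s in x:
        prev_y = alpha * (prev_y + s - prev_x)  # passes changes, decays steady offsets
        prev_x = s
        y.append(prev_y)
    return y


def band_limit(x: list[float], low_hz: float, high_hz: float, sr: int) -> list[float]:
    """High-pass then low-pass, keeping roughly low_hz..high_hz."""
    return one_pole_lowpass(one_pole_highpass(x, low_hz, sr), high_hz, sr)
```

For example, `band_limit(samples, 50.0, 15_000.0, 32_000)` (hypothetical cutoffs) keeps the low cutoff above rumble and the high cutoff just under 16 kHz, the Nyquist limit of a 32k model, since anything above that is discarded anyway.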
It does that so HuBERT can extract features of speech, to sort of understand what the audio is saying; it basically marks every word and sound, a bit like STT but a little different. The GAN uses the 32/40/48k cuts to actually train the voice model, so no, it doesn't downscale the audio to 16k and then build the model from that; it only uses the 16k copy to mark words and sounds with the HuBERT model. Basically, RVC has different modules. First is the slicer, where RVC cuts the audio into small pieces. It creates two dataset folders out of your dataset: the f0 one for pitch and sound generation, which is resampled to whatever config you choose, and the one for hubert_base, which only works with 16k audio and extracts features. Then the GAN module reads all that pitch and feature information produced by the f0 model and hubert_base and creates the Generator and the Discriminator: the Generator tries to fool the Discriminator, and the Discriminator tries to tell the Generator's output apart from your real audio, until the model sounds like the dataset you gave it. What's funny is that you can skip the HuBERT part, or break it so it fails, and then the model sounds like it's having a stroke, mispronouncing every word; it sounds funny af.
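The slicer and the two-folder layout described there can be sketched as follows, under stated assumptions: it cuts fixed-length, overlapping chunks (3.7 s with 0.3 s overlap is an assumption about RVC's defaults, and real preprocessing also splits on silence), and the folder name `1_16k_wavs` is assumed, while `0_gt_wavs` comes from the thread. It only computes slice boundaries and per-copy lengths; it does not model which copy the F0 extraction reads, which the thread does not settle.

```python
from dataclasses import dataclass

HUBERT_SR = 16_000  # HuBERT only takes 16 kHz input, per the message above


@dataclass
class Slice:
    start: int  # first sample (inclusive) in the source file
    end: int    # last sample (exclusive) in the source file


def slice_positions(n_samples: int, sr: int, seg_s: float = 3.7, overlap_s: float = 0.3) -> list[Slice]:
    """Cut a file into fixed-length chunks that overlap by overlap_s seconds."""
    seg = round(seg_s * sr)
    step = round((seg_s - overlap_s) * sr)
    if step <= 0:
        raise ValueError("overlap must be shorter than the segment")
    slices, start = [], 0
    while start < n_samples:
        end = min(start + seg, n_samples)
        slices.append(Slice(start, end))
        if end == n_samples:
            break  # the last chunk reached the end of the file
        start += step
    return slices


def copy_lengths(s: Slice, src_sr: int, target_sr: int) -> dict[str, int]:
    """Sample counts of the two copies written for one slice (1_16k_wavs name assumed)."""
    duration = (s.end - s.start) / src_sr
    return {
        "0_gt_wavs": round(duration * target_sr),   # ground-truth audio for the GAN
        "1_16k_wavs": round(duration * HUBERT_SR),  # input for HuBERT features
    }


slices = slice_positions(10 * 16_000, 16_000)
print(len(slices), slices[0], copy_lengths(slices[0], 16_000, 48_000))
```

The point of the two copies is visible in `copy_lengths`: the same 3.7 s slice becomes 177600 samples for a 48k model but only 59200 samples for HuBERT, so the model is trained on the full-rate copy even though feature extraction runs at 16k.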
If it used audio downscaled to 16k to build the models, they would all sound like 16k and low quality.
I've looked at extract_feature_print.py and extract_f0_print.py, and there's no code referencing the 0_gt_wavs dir.
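To check that across the whole checkout rather than two files, a small search script can list every Python file and line that mentions the folder name. `find_references` is a hypothetical helper, and the path in the `__main__` block is a placeholder for wherever your RVC copy lives.

```python
from pathlib import Path


def find_references(root: str, needle: str, pattern: str = "*.py") -> dict[str, list[int]]:
    """Map each file under root (matching pattern) to the line numbers containing needle."""
    base = Path(root)
    hits: dict[str, list[int]] = {}
    for path in sorted(base.rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [n for n, line in enumerate(text.splitlines(), start=1) if needle in line]
        if lines:
            hits[path.relative_to(base).as_posix()] = lines
    return hits


if __name__ == "__main__":
    # Placeholder path: point this at your own RVC checkout.
    for file, lines in find_references("path/to/rvc", "0_gt_wavs").items():
        print(file, lines)
```

If the folder only shows up outside the extraction scripts (for example in preprocessing or training code), that would be consistent with the 16k copy being the one used for extraction, but the search output is what actually settles it.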