#KLM 5 RefineGan (3rd Update - 24, Feb.)


hushed crest
#
  • **Please read before using KLM 5**

KLM 5 Uses a Pure Human Voice Dataset and Contains a Large Amount of Vocal Data

KLM 5 uses a dataset composed of purely human voices and includes a large amount of vocal data. If your dataset generates voices mixed with electronic sounds or special effects, we cannot guarantee its quality.

KLM5 + RefineGAN can learn ultra-high-frequency sounds that were difficult to reproduce with conventional HiFi-GAN. This means it can learn very sharp, high-pitched sounds such as coughing, sneezing, and female screams, and can train and reproduce high notes up to A6.
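
For a sense of scale, here is a minimal sketch that converts a note name such as A6 into a fundamental frequency. It assumes 12-tone equal temperament and an A4 = 440 Hz reference, neither of which is stated in the post; `note_to_hz` is a hypothetical helper, not part of KLM or Applio.

```python
import re

A4_HZ = 440.0  # assumed concert-pitch reference (not stated in the post)
# Semitone distance of each natural note from A within the same octave.
SEMITONE_OFFSETS = {"C": -9, "D": -7, "E": -5, "F": -4, "G": -2, "A": 0, "B": 2}


def note_to_hz(note: str) -> float:
    """Convert a note name like 'A6' or 'C#5' to Hz (equal temperament)."""
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", note)
    if match is None:
        raise ValueError(f"unrecognised note name: {note!r}")
    letter, accidental, octave = match.groups()
    semitones = SEMITONE_OFFSETS[letter] + (int(octave) - 4) * 12
    semitones += {"#": 1, "b": -1, "": 0}[accidental]
    return A4_HZ * 2 ** (semitones / 12)


print(round(note_to_hz("A6"), 1))  # 1760.0
```

Under that assumption the A6 fundamental sits at 1760 Hz; the "ultra-high-frequency" content is mostly the harmonics stacked above it.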

Even if your dataset does not contain high notes, the model itself includes data from the high-frequency range, enabling it to infer high-frequency sounds even from general speech recordings.

However, such high-pitch inference must be physically possible.

In general, high-pitched sounds cannot be inferred from ASMR or whisper-like voices that contain a lot of breath. Even people with very breathy voices switch to modal voice (chest voice) when screaming or singing high notes. If a dataset consists only of breath-heavy voices, high-pitched sound inference becomes impossible.

Additionally, high-pitched sounds cannot be inferred for tough characters that use growled voices. Growled voices consist of quickly disconnected waveforms, and when the pitch is raised, the waveforms become too short, making the sound unnatural or inaudible. This is why growling techniques cannot be used to sing songs spanning three to four octaves.

In summary, most high-pitched inferences are possible as long as the sounds do not violate physical principles.

Guidelines for Generating Cover Songs Using KLM5

When generating cover songs with KLM5, please consider the following points:

The dataset must be very clean.
Do not use datasets containing reverb, delay, or harmonic residues. Since KLM5 generates purely human voices intended for final production through mixing and post-processing, we recommend using raw, clean recordings that require minimal cleanup.

There may be frequency limitations depending on the model (32kHz vs. 40kHz).
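KLM5 enhances high frequencies by boosting weak ultra-high-frequency waveforms. However, if the boosted waveform goes beyond the limit set by the model's sample rate (half the sample rate: 16 kHz for the 32kHz model, 20 kHz for the 40kHz model), the sound may become quieter or disappear.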

Additionally, aliasing may occur in ultra-high-frequency ranges, which is a natural phenomenon and does not require concern. (The inverted waveform generated in this process is not a mirrored signal.)
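
As a rough illustration of the 32kHz vs. 40kHz limit, the sketch below computes the highest representable frequency (half the sample rate) for each model and counts how many harmonics of a high note fit under it. The 1760 Hz fundamental is an assumed value for A6 (A4 = 440 Hz), and the helper names are hypothetical.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal at this sample rate can represent."""
    return sample_rate_hz / 2


def harmonics_below_nyquist(fundamental_hz: float, sample_rate_hz: int) -> int:
    """Count harmonics (fundamental included) at or below the Nyquist limit."""
    return int(nyquist_hz(sample_rate_hz) // fundamental_hz)


A6_HZ = 1760.0  # assumed fundamental for A6, taking A4 = 440 Hz

for sample_rate in (32_000, 40_000):
    print(
        sample_rate,
        nyquist_hz(sample_rate),
        harmonics_below_nyquist(A6_HZ, sample_rate),
    )
# 32000 16000.0 9
# 40000 20000.0 11
```

The two extra harmonics available at 40kHz are one way to read the advice to prefer the 40kHz model for very high-pitched material.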

**Dataset & Training**
800+ Hours
Applio RefineGAN
FP32
BatchSize per GPU 32
Emb. Model RMVPE
Total Loops : 650
Total Steps : 4.1 M
Total Speaker : 101
Gender Ratio : M 42 F 58
Vocal Ratio : 5~8%
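
A back-of-the-envelope reading of the figures above, as a sketch only. It assumes a "loop" is one full pass over the training data and that the step count is per GPU; the post states neither, and the number of GPUs is unknown.

```python
TOTAL_STEPS = 4_100_000   # "Total Steps : 4.1 M"
TOTAL_LOOPS = 650         # "Total Loops : 650"
BATCH_PER_GPU = 32        # "BatchSize per GPU 32"

# Assumption: one loop == one pass over the dataset.
steps_per_loop = TOTAL_STEPS / TOTAL_LOOPS
# Training segments seen per loop on a single GPU under that assumption.
segments_per_loop_per_gpu = steps_per_loop * BATCH_PER_GPU

print(round(steps_per_loop))             # 6308
print(round(segments_per_loop_per_gpu))  # 201846
```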

The song included with the accompanying MR was specifically created to test KLM, and I hold all copyrights to the song.

Vocal Model - NELL xe6 (RefineGAN)

32kHz
G -
https://huggingface.co/SeoulStreamingStation/KLM5/resolve/main/G_KLM50_RFG_32Khz.pth?download=true
D -
https://huggingface.co/SeoulStreamingStation/KLM5/resolve/main/D_KLM50_RFG_32Khz.pth?download=true

40kHz
G -
https://huggingface.co/SeoulStreamingStation/KLM5/resolve/main/G_KLM50_RFG_40Khz.pth?download=true
D -
https://huggingface.co/SeoulStreamingStation/KLM5/resolve/main/D_KLM50_RFG_40Khz.pth?download=true

44.1kHz
G-
D-

48kHz
G-
D-

KLM5 is updated with the same name each time training progresses, without a separate version notation.

polar pelican
#

is there an english speaker in the pretrain?

foggy heath
#

i'm cheering!

foggy heath
#

just started training a model with it and goddamn it trains slower than on hifigan from what i recall

#

like

#

it's training at 1 minute per epoch

#

in refinegan

#

but in hifigan it trains around 35-40 seconds per epoch

worldly goblet
#

because when i used to train models in hifigan with fp32, it took like a minute per epoch

foggy heath
#

i didn't change the fp settings

#

i'm pretty sure

#

either that or applio has chosen the fp32 training option by default

#

i can't really see which fp was used to train a model

worldly goblet
#

yeah it could be that applio chooses fp32 by default

#

I'm using codename fork, but idk if the applio main branch has that option

foggy heath
gloomy root
foggy heath
#

is there a way to change the fp in the latest applio to fp16?

worldly goblet
gloomy root
foggy heath
#

idk man i still use colab to train models

polar pelican
gloomy root
#

anyway otherwise you might want to add these lines in Applio/rvc/train/train.py (perhaps for the colab/kaggle)

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
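
To try those two lines outside Applio first, here is a small standalone sketch that sets both flags and reads them back. `enable_tf32` is a hypothetical helper; the `try/except` is only there so the snippet still runs on a machine without PyTorch. Note that TF32 only takes effect on NVIDIA Ampere or newer GPUs, and it is not the same thing as switching training to fp16.

```python
try:
    import torch  # optional here so the sketch still runs without PyTorch
except ImportError:
    torch = None


def enable_tf32():
    """Turn on TF32 for matmul and cuDNN, then return the resulting flags."""
    if torch is None:
        return None  # PyTorch not installed; nothing to configure
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return {
        "matmul": torch.backends.cuda.matmul.allow_tf32,
        "cudnn": torch.backends.cudnn.allow_tf32,
    }


print(enable_tf32())
```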
foggy heath
#

6 gb vram

polar pelican
#

jesus

gloomy root
#

ah mb

polar pelican
worldly goblet
#

I can only do batch size 6 with fp32 on my rtx 4060, tbh I'm too worried to increase the batch size to 8

polar pelican
#

why?

gloomy root
worldly goblet
worldly goblet
gloomy root
foggy heath
hushed crest
#

If you plan to train high-pitched sounds like screams, be sure to use 40kHz.

midnight coral
#

Yoo let's go sss

jade sedge
jade sedge
gloomy root
jade sedge
#

if it is not, then it is not

#

just run python, import torch, then print the value

#

i'm going with "surely pytorch developers know better whether the operation benefits from tensor cores and use them when needed"

slender lance
#

This was trained using singing and speech?

wooden wagon
stiff python
#

can this be used in applio no ui?

chilly bronze
chilly bronze
gloomy root
hushed crest
# wooden wagon so this is better than hifigan with og pretrain?

I've said this multiple times, but evaluating a pre-trained model simply as good or bad is like asking whether a Ferrari or an excavator is better when they are priced the same. The reason we create new pre-trained models is not to make something "better" than the OG model but to cover areas that the OG model cannot handle. In return, this may introduce drawbacks that the OG model doesn't have.

Sometimes, I want to ask you back—if you like the OG model so much and are satisfied with it, why are you looking for something "better"? Also, where do your criteria for judging what is "good" or "bad" come from?

Some people consider a model "good" if it produces clean audio even with poor-quality datasets. Others need a model that fills in gaps where certain pronunciations are missing. Some need a tool that can infer high notes even when trained only on low-range datasets. Others want a voice model that, when used in a voice changer, ensures environmental noise and human speech are inferred as actual sound rather than artifacts. For each of these people, the definition of "good" is different. It would be great if we could solve all of these issues at once, but that's not easy.

Not only pre-trained models but even fork developers are constantly testing, developing, failing, going back, and trying again to create more diverse and higher-quality models. You have the right and the freedom to judge and discuss these things, but I hope you at least understand the purpose behind all of this. And when someone encounters these issues, I hope they can see these newly developed models as a possible solution rather than just a comparison point.

hushed crest
#

oh and one more thing..
sometimes, random people send me DMs, but please, stop asking overly obvious questions.
If you're expecting a fork where you put in shit and get gold, then you're in the wrong place. AI HUB isn't where you should be; you should be looking for Alchemy HUB instead.

hushed crest
#

but rather than calling it singing, it might be more accurate to describe it as a high-pitched voice.

versed anvil
#

cant wait for rappers , will post later good job

#

testing now, will update later. at 70 epochs on a 7 minute rapper dataset, batch 4, 2060 super

wooden wagon
hushed crest
# wooden wagon i guess i should've chose better wording I was trying to imply if refine at this...

don't get me wrong, that "gold" comment wasn't directed at you. it was about the people spamming my DMs. and it's not about "being a snob"; to be precise, it's just frustrating. I wasn't saying it to call you out, but if it upset you, I apologize.

you want a simple answer? "yes, RefineGAN is great!". however, like I said before, since I don't know exactly what you're looking for, the person who knows best whether RefineGAN is necessary or not is you. you need to test it yourself and figure it out.

Besides, you're a model maker. You know better than anyone what kind of model you want to create and what purpose that model will serve. Ultimately, you're the one who can make the best call on what’s needed.

jade sedge
#

you can always try my base pretrain

slender lance
gloomy root
rain beacon
#

hey, can i ask a question? my friend was training a boy's voice using the KLM 5.0 pretrain, but when he got the result, it was a girl's voice... can u help with that?

hushed crest
#

or Is the voice of the target being inferred a female voice?

rain beacon
hushed crest
#

can I see your model?

rain beacon
#

it's my friend's model, I will add him to this forum

#

@fathom briar

hushed crest
#

korean?

rain beacon
#

He wants to show but it's his voice and he wants to give it on dm.

hushed crest
#

oh ok

rain beacon
#

oh sorry he wanna tell smthing in this forum but cannot send message 💀

gloomy root
hushed crest
chilly bronze
#

cus that pretrain does have a very high pitched female voice

polar pelican
#

someone update applio and put that to +34 trolley trolley

#

just for fun

gloomy root
odd nexus
#

I'm trying to use this pretrain and it's saying this: "The parameters of the pretrain model such as the sample rate or architecture do not match the selected model." I have it on the RefineGAN setting, using Applio Exp

hushed crest
odd nexus
#

Alright

odd nexus
#

How do you select pretrains in this new one? or like the type of GAN

#

I get the same error with the new version

#

Also this is with the 32k ones

hushed crest
odd nexus
#

It still doesnt work

hushed crest
#

hmm, that's weird.

odd nexus
#

You're using V3.2.8?

hushed crest
#

did you download the one from the latest release tab?

odd nexus
#

Yeah

hushed crest
#

oh

odd nexus
#

Which one are you using

hushed crest
#

that's not new one.

odd nexus
#

So download the newest code

hushed crest
jade sedge
#

both 32k and 40k models from 3 days ago load fine with the latest main code

odd nexus
#

This pretrain is amazing

high swift
jade sedge
#

nope

foggy heath
#

since mrf and refine gans are both experimental and can only be used on applio's main branch, no it isn't.

high swift
jade sedge
#

noUI colab needs to pass vocoder parameters

jade sedge
#

I'm going to do a small update to Applio main for 44k models. If you made any with 44100 sampling rate, that's gonna break.
That's for both MRF and RefineGAN

high swift
odd nexus
#

Is this still in training?

hushed crest
odd nexus
#

Will another one be released soon

hushed crest
#

which one?

odd nexus
#

Another RefineGAN pretrain trained on more epochs

hushed crest
#

it will take like.. 5~6 days I guess.

odd nexus
#

Oh alright cant wait to see the final one

#

This one is already pretty good

hushed crest
odd nexus
#

The vocal range is amazing

jade sedge
#

it is pretty clear that OG pretrain was in dire need of some singing included

odd nexus
#

Vocal range wasnt the GAN's fault it was just the dataset right

jade sedge
#

and of course training with a bunch of singing makes it even better

#

well, some of this and some of that

#

hifigan had issues with building harmonics above 6k

hushed crest
#

I'm experimenting with training sonar sounds.

slender lance
jade sedge
#

RFG models seem to have no issues with reaching 15k+

#

and actually building something there

slender lance
#

realtime inference when

hushed crest
jade sedge
#

need to ask @empty palm

slender lance
#

@empty palm realtime refinegan inference when

slender lance
gloomy root
jade sedge
#

there are no other pretrains

empty palm
odd nexus
#

How is this pretrain smaller in size compared to HiFiGAN

hushed crest
odd nexus
#

So the model gets bigger with more data?

hushed crest
odd nexus
#

Yeah that gets bigger with a bigger dataset

#

This pretrain once fully trained is gonna be amazing

hushed crest
#

I hope so

odd nexus
#

I think it's looking good so far. Right now you just need a long dataset for any good results

hushed crest
odd nexus
#

Cant wait until RefineGAN is standardized

hushed crest
#

yeah, me too..

odd nexus
#

Crazy how you're making this for free, devoting the compute and time. 800 hours of high quality data is crazy. What's it consist of, singing and talking?

hushed crest
#

Oh, and while you're waiting, try testing the final version of KLM4. It should be pretty solid, even compared to the OG model. In particular, it should be able to infer high pitched singing and fast paced speech even without dedicated data.

odd nexus
#

So the KLM 4 HiFiGAN should be pretty good?

hushed crest
#

Yeah, it includes a massive amount of both singing and conversation

odd nexus
#

Oh wow nice

hushed crest
#

I think so. you can compare with OG.

#

it took 1 year. so it must be good..

odd nexus
#

I bet it would be much better and accurate

hushed crest
odd nexus
hushed crest
#

since klm 7s to ver 2 3 4..

odd nexus
#

Ohh

hushed crest
#

KLM4 HFG is the latest version for RVC V2.

odd nexus
#

Definitely going to try

hushed crest
#

thanks for always participating in the testing

odd nexus
#

Np

#

Does the dataset have any English

#

For this pretrain

hushed crest
#

It includes English phonemes, but compared to Korean, the amount is much smaller: not insignificant, but not enough to make a big difference.

odd nexus
#

So it definitely should be able to generalize to English

hushed crest
#

I hope so

odd nexus
#

It probably will. I had a model with mostly singing and 2 minutes of normal talking and it can talk and sing pretty well

hushed crest
#

KLM 5 RefineGan (2nd Update - 15th Feb.)

odd nexus
#

Sounds really good

blissful dawn
#

Does the singing data have vocal fry?

versed anvil
odd nexus
#

Any updates on this model?

hushed crest
#

KLM 5 RefineGan (3rd Update - 24, Feb.)

hushed crest
odd nexus
#

Awesome cant wait to try it

odd nexus
#

Try the 24 pitch thing you did

hushed crest
odd nexus
#

The original definitely cant do that

hushed crest
#

My ears hurt.

odd nexus
#

Extreme vocal range which is good

hushed crest
#

and...it hurt your ears. haha

odd nexus
#

A little

#

Pretty much unlistenable but if you want to use it for that you can

odd nexus
#

Are there links to the new models?

#

Nvm found them

hushed crest
polar pelican
hushed crest
grave flame
#

Testing the latest one, hmm, the graph starts from 170-ish and looks like it's going flat around 70-ish
is that normal?

I did try KLM 4.4x2 MRF HiFiGAN, the graph looks better: it starts at 60-ish and goes flat around 30-ish
The settings are the same for both: crepe 32, dataset 10min, batch 4

hushed crest
grave flame
gloomy root
# grave flame

the dataset may lack dynamic range due to a compressor effect and/or improper normalization
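
One rough way to check this on a dataset clip is to compare peak level against RMS level (the crest factor). The sketch below is only an illustration: `level_stats` is a hypothetical helper working on float samples in [-1.0, 1.0], and no particular threshold for "too compressed" is implied; a heavily compressed clip will simply show a smaller peak-to-RMS gap than an untouched recording of the same voice.

```python
import math


def level_stats(samples):
    """Return (peak_dbfs, rms_dbfs, crest_factor_db) for float samples."""
    if not samples:
        raise ValueError("no samples given")
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        raise ValueError("signal is silent")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak_db = 20 * math.log10(peak)
    rms_db = 20 * math.log10(rms)
    return peak_db, rms_db, peak_db - rms_db


# Sanity check on a full-scale 440 Hz sine, one second at 32 kHz:
# a pure sine has a crest factor of about 3 dB.
sine = [math.sin(2 * math.pi * 440 * n / 32_000) for n in range(32_000)]
peak_db, rms_db, crest_db = level_stats(sine)
print(round(crest_db, 2))
```

Loading real audio into a list of floats is left out on purpose, since the loader would depend on whichever audio library is already installed.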

grave flame
#

also the yellow one is KLM 4.4 x2, it sounds fine

#

i will try again, maybe i did something wrong

hushed crest
#

hmm, since KLM itself contains a large-scale pitch dataset, there shouldn’t be a major issue with generating high-pitched sounds unless the index is set too high. The audio above has artifacts similar to those found in undertrained models, but I’m not sure if this is due to an embedding model issue.

gloomy root
slender lance
#

try another dataset and check if that works well

odd nexus
#

When finetuning this is it normal for the first like 100 epochs to be not really clear speech then it starts to clear up?

#

I trained this with 2 minutes 6 seconds of pretty clean audio of Dan TDM speaking, at batch size 8. This is 300 epochs with this pretrain

#

This is HiFiGAN with the same settings at 60 epochs and it sounds more clear than RefineGAN at 60 epochs

grave flame
hushed crest
odd nexus
#

So do you need to train it for a lot of epochs

hushed crest
odd nexus
#

I thought RefineGAN needed less training to achieve what HiFiGAN can achieve, at least that's what I think I heard once

odd nexus
odd nexus
# hushed crest

Or a static-like sound, whatever it is, it's just not clear like this

hushed crest
# odd nexus Or a static-like sound, whatever it is, it's just not clear like this

That might be true. I'm not completely sure yet. More testing is needed. However, the understanding of training speed can differ depending on perspective.

On the spectrogram, RefineGAN does start generating details more quickly. However, since noise patterns are not completely removed from the audio, simply listening to the output makes HiFi-GAN sound much cleaner in comparison.

Even if the TensorBoard loss graph decreases rapidly, it does not perfectly represent the actual audio quality. So, if we compare based purely on listening, HiFi-GAN may feel like it trains faster.

odd nexus
#

So RefineGAN needs more training and will capture finer details and clear up?

hushed crest
#

The audio you heard earlier was recorded directly in a studio by professional voice actors. Since those models were trained on at least 1 hour of data, rather than just 3-4 minutes of audio, it's difficult to make a direct comparison.

odd nexus
#

Oh I thought it was just the pretrained model

hushed crest
#

umm, there is almost no difference in audio quality between a pretrained model and a fine tuned model. however, if training is done with only about 5 minutes of data, it may require more training iterations. this is something we need to continue testing to fully understand.

quoting Noobies, "If we reduce the number of generators responsible for producing noise patterns, the training speed would be much faster." however, personally, I prefer a model that produces more accurate results, even if it trains more slowly.

jade sedge
#

I'm gonna check what happens if you turn off the noise gens for finetuning

jade sedge
#

500e with default noise, and 500e without noise

#

ft - default

#

this is from my own pretrain, but it should not matter much

hushed crest
#

I see

jade sedge
#

see DMs

#

after all my other fixes it seems that adding noise is not really needed.. there are no lines with the current resblock, unlike hifigan's