Since KLM7S
Over the past year and a half, a considerable stretch of time, I have been fortunate to receive immense support from the AI HUB administrators, devs, mods, users, and many members of both the Korean and Japanese communities. Thanks to the help of various studio engineers and voice actors, I have thoroughly enjoyed immersing myself in this work.
This is the final version of the KLM series to use HiFi-GAN. From here on, all resources will be dedicated to training with RefineGAN.
KLM-HFG closely follows the structure and training methods of the OG model but also includes high-pitched singing data. Users currently utilizing the OG pre-trained model should be able to achieve satisfying results with this version as well.
KLM-HFG employs the same training methodology as VTCK and also includes VTCK data. While most KLM series models were trained on perfectly denoised data, KLM-HFG was trained with all noise left in the data.
This pre-trained model can generate vocals in nearly all languages and was trained on a dataset of the same size as KLM 5.
Important Considerations Before Use
One crucial point to note is that all high-pitch data in KLM has been sourced from real human recordings. This means that if your trained model is based on a normal human voice, you will be able to infer high notes even without dedicated singing data.
However, if the character you are training has an ASMR-style voice, a rough vocal texture, or any physically unnatural sound, high-pitch inference will not be possible.
Whispering sounds are produced by adjusting air pressure using the tongue’s position and the distance between the soft palate and the mouth, rather than engaging the vocal cords. This means that whispers cannot be turned into high-pitched vocals.
Techniques such as growling in rock vocals rely on intermittently breaking up waveforms while producing sound. Since high-pitched sounds have shorter wavelengths, they cannot be generated through this vocalization method.
In short, no matter how diverse the vocal techniques and singing styles in the dataset are, physically impossible sounds cannot be learned. If the model you are attempting to create is based on SFX-generated voices, mechanical sounds, or whispers, you should be aware that high-pitch inference will not be possible.
32 kHz
G: https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main/G_KLM_HFG_32k.pth?download=true
D: https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main/D_KLM_HFG_32k.pth?download=true
40 kHz
G: https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main/G_KLM_HFG_40k.pth?download=true
D: https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main/D_KLM_HFG_40k.pth?download=true
48 kHz
G: https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main/G_KLM_HFG_48k.pth?download=true
D: https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main/D_KLM_HFG_48k.pth?download=true
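If you would rather fetch the files from a script than click the links, the following is a minimal Python sketch. It only relies on the URL pattern visible in the links above. The function names and the "pretrained" destination folder are hypothetical and not part of any KLM tooling; where your training tool expects the G and D files depends on the tool you use.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file naming copied from the download links in this document.
BASE_URL = "https://huggingface.co/SeoulStreamingStation/KLM49_HFG/resolve/main"
SAMPLE_RATES = ("32k", "40k", "48k")


def pretrained_urls(sample_rate: str) -> dict[str, str]:
    """Return the G and D download URLs for one sample rate."""
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {SAMPLE_RATES}")
    return {
        kind: f"{BASE_URL}/{kind}_KLM_HFG_{sample_rate}.pth?download=true"
        for kind in ("G", "D")
    }


def download_pretrained(sample_rate: str, dest_dir: str = "pretrained") -> list[Path]:
    """Download both files into dest_dir (a hypothetical folder name)."""
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for kind, url in pretrained_urls(sample_rate).items():
        path = target / f"{kind}_KLM_HFG_{sample_rate}.pth"
        urlretrieve(url, path)  # blocking download; the files may be large
        saved.append(path)
    return saved


# Example (performs network access when uncommented):
# download_pretrained("40k")
```

Pick the sample rate that matches the rate you train at, then point your training tool at the two saved files.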


