Okay so, a lot of pretrains have been released recently. But what if we just merged the datasets from all those different pretrains (Italian, Russian, VCTK, Korean, etc.) and trained on the combined data? I'm not talking about model merging, but about merging the datasets and training a single model on them.
The idea is to have diverse audio: different voices, different frequency responses, etc. Not just a pretrain that's specialized for one language or the like.
Cuz I've noticed some pretrains sound thin or lack mic pops, some are raspy, some have weird static artifacts above 12 kHz, some have too much clarity, etc.
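
To make the dataset-merging idea a bit more concrete, here's a rough sketch of what I mean, not a real training pipeline. It assumes each dataset is just a list of audio file paths. The function name `merge_datasets`, the `max_per_source` cap, and the example paths are all made up for illustration. The cap is my own assumption for keeping one big dataset from drowning out the smaller ones.

```python
import random


def merge_datasets(sources, max_per_source=None, seed=0):
    """Merge several datasets into one shuffled training list.

    sources: dict mapping a dataset name (hypothetical, e.g. "italian")
             to a list of audio file paths.
    max_per_source: optional cap on clips taken from each dataset, so a
             huge dataset doesn't dominate the mix (an assumption, not a
             known best practice).
    Returns a list of (dataset_name, path) tuples.
    """
    rng = random.Random(seed)
    merged = []
    for name, paths in sources.items():
        picked = list(paths)
        rng.shuffle(picked)  # random subset rather than the first N files
        if max_per_source is not None:
            picked = picked[:max_per_source]
        merged.extend((name, p) for p in picked)
    rng.shuffle(merged)  # interleave the sources
    return merged


if __name__ == "__main__":
    # Placeholder paths, purely for illustration.
    demo = {
        "italian": ["it/001.wav", "it/002.wav", "it/003.wav"],
        "russian": ["ru/001.wav", "ru/002.wav"],
        "vctk": ["vctk/p225_001.wav"],
    }
    for name, path in merge_datasets(demo, max_per_source=2):
        print(name, path)
```

The merged list would then go into whatever preprocessing the trainer expects. You'd probably still have to resample everything to one sample rate first, since the datasets likely don't match.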
