Creating voice models is new to me, so I want to learn more about mode collapses.
How does a mode collapse affect the end result of a model? How many mode collapses should one expect? Some say a few are fine; others mention discarding their models after seeing one. How does batch size play a part in this, and how can one alter one's dataset to prevent this?
#Reducing Mode Collapse
Seems like your model flatlined way too soon, so I believe it is more of a batch size issue rather than a dataset issue.
To check whether it's a dataset issue, try training the model with a batch size of 4. If that doesn't solve it, it is most likely a dataset issue.
If it is a dataset issue, the cause could be either low-quality audio (though I'm not too sure about this, I've only heard it from one source) or badly trimmed silences.
Thanks for the help. Is it always better to train on low batch sizes? What does a badly trimmed silence look like?
Not necessarily. Every model has its "golden" batch size, which is the one that has the biggest first drop in loss/g/total.
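If you want to compare runs without eyeballing the graphs, here is a rough sketch of that idea. It assumes you have copied the first few logged loss/g/total values for each batch size into a dict by hand, and it reads "first drop" as the total decrease from the first logged value until the loss stops falling. Both the helper names (`first_drop`, `pick_batch_size`) and the numbers are made up for illustration, and real loss curves are noisy, so treat the result as a hint rather than a verdict.

```python
def first_drop(losses):
    """Total decrease from the first value until the loss stops falling."""
    drop = 0.0
    for previous, current in zip(losses, losses[1:]):
        if current >= previous:
            break  # the initial descent has ended
        drop += previous - current
    return drop


def pick_batch_size(runs):
    """Return the batch size whose run shows the largest first drop."""
    return max(runs, key=lambda batch_size: first_drop(runs[batch_size]))


# Hypothetical loss/g/total values, one short test run per batch size.
runs = {
    4: [40.0, 35.0, 36.0],
    8: [40.0, 30.0, 28.0, 29.0],
    16: [40.0, 38.0, 37.0],
}
print(pick_batch_size(runs))
```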
A badly trimmed silence is one that is left too long: anything over 50 ms can already cause issues, afaik. You should trim silences to 0.04 or 0.05 s in Audacity.
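If you'd rather script it than do it by hand, this is a minimal sketch of the same idea: any run of quiet samples longer than the limit is cut down to the limit. It is not what Audacity does internally. The function name `truncate_silence`, the amplitude threshold of 0.01, and the plain list of float samples in the range -1.0 to 1.0 are all assumptions for illustration, so adjust the threshold to your own recordings.

```python
def truncate_silence(samples, sample_rate, threshold=0.01, max_silence_s=0.05):
    """Shorten every run of quiet samples to at most max_silence_s seconds.

    samples: sequence of floats in [-1.0, 1.0] (assumed mono audio).
    threshold: absolute amplitude below which a sample counts as quiet.
    """
    max_run = int(round(max_silence_s * sample_rate))
    kept = []
    run_length = 0  # length of the current run of quiet samples
    for sample in samples:
        if abs(sample) < threshold:
            run_length += 1
            if run_length > max_run:
                continue  # drop quiet samples beyond the allowed length
        else:
            run_length = 0
        kept.append(sample)
    return kept


# Toy signal at 1000 Hz: 0.1 s of sound, 0.2 s of silence, 0.1 s of sound.
signal = [0.5] * 100 + [0.0] * 200 + [0.5] * 100
trimmed = truncate_silence(signal, sample_rate=1000)
print(len(signal), len(trimmed))
```

Note that this leaves up to 50 ms of each pause in place instead of cutting pauses to zero, which matches the 0.04 to 0.05 s suggestion above.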
That could be one of my problems. In my datasets, I always use truncate silence and set silence to zero.
Is that a bad decision?
Also, is it better to leave coughs and such in datasets? I usually leave them out.
With mode collapse, the model doesn't learn from the whole dataset and gets stuck on the silent parts, which negatively impacts its ability to accurately represent the full distribution of the dataset. To avoid it, truncate the silences in your data and noise gate it.
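For anyone unsure what noise gating means in practice, here is a very basic sketch: the audio is split into short frames, and any frame whose RMS level is below a threshold is replaced with digital silence. The name `noise_gate`, the frame size of 256 samples, and the threshold of 0.02 are assumptions chosen for the example. Real noise gates in audio editors also use attack and release smoothing, which this leaves out.

```python
import math


def noise_gate(samples, frame_size=256, threshold=0.02):
    """Zero out every frame whose RMS level is below the threshold.

    samples: list of floats in [-1.0, 1.0] (assumed mono audio).
    """
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            gated.extend(frame)  # loud enough: keep as-is
        else:
            gated.extend([0.0] * len(frame))  # background noise: mute
    return gated


# Toy signal: one frame of faint hiss followed by one frame of voice.
noisy = [0.001] * 256 + [0.5] * 256
clean = noise_gate(noisy)
print(clean[0], clean[256])
```

Gating first and truncating silences afterwards tends to make sense, since the gate turns low-level hiss into true silence that a silence truncation step can then detect.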