Hello! I’m very interested in collaborating on any of the text-to-speech work: SoundStorm, Voicebox, EnCodec-based vocoders, etc.
I’ve started on an implementation of the Voicebox paper and am currently working on phoneme alignment using the Montreal Forced Aligner.
If this sounds interesting, I’d be happy to lead a Voicebox paper review. Just know that my overall ML experience is low, though I’m learning as I go!
Voicebox is a general-purpose, transformer-based infilling model.
It infills masked audio using reference audio and a textual representation.
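As a rough sketch of how I understand the infilling setup (my own illustration, not code from the paper): a contiguous span of mel frames is hidden, and the model sees the remaining frames as context. The function name `mask_span` and the choice to zero out masked frames are assumptions for illustration.

```python
import numpy as np

def mask_span(mel, start, length):
    """Hide a contiguous span of frames in a mel-spectrogram.

    mel: array of shape (frames, n_mels).
    Returns (context, mask), where mask is True on the hidden frames and
    context is the mel-spectrogram with those frames zeroed out.
    """
    mask = np.zeros(mel.shape[0], dtype=bool)
    mask[start:start + length] = True
    # Zeroing the hidden frames is an assumption; the model is asked to
    # regenerate exactly the frames where mask is True.
    context = np.where(mask[:, None], 0.0, mel)
    return context, mask
```

The model would then take `context`, the text representation, and the mask position as conditioning and try to reconstruct the hidden frames.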
The model is trained with a Conditional Flow Matching (CFM) objective. They use the “optimal transport” path for the CFM objective.
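Here is a minimal NumPy sketch of the optimal-transport CFM training target as I understand it from the flow matching formulation: the noisy sample is a straight-line blend of noise `x0` and data `x1`, and the network regresses onto a constant velocity. The names `ot_cfm_pair` and `cfm_loss`, the default `sigma_min`, and restricting the loss to masked frames are my assumptions, not the paper's code.

```python
import numpy as np

def ot_cfm_pair(x1, x0, t, sigma_min=1e-5):
    """Build the noisy sample x_t and the regression target u_t.

    x1: data (e.g. mel frames), x0: Gaussian noise of the same shape,
    t: flow time in [0, 1] (scalar or broadcastable array).
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def cfm_loss(v_pred, u_t, mask=None):
    """Mean squared error between predicted and target velocity.

    mask: optional boolean array over frames; if given, only masked
    frames contribute (assumed here to mirror the infilling objective).
    """
    sq = (v_pred - u_t) ** 2
    if mask is None:
        return float(sq.mean())
    return float(sq[mask].mean())
```

In training, `v_pred` would come from the transformer given `x_t`, `t`, and the conditioning; here it is just an array so the arithmetic can be checked in isolation.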
The idea behind the model is to move closer to how LLMs work, using the more general objective of infilling rather than a task-specific one. The authors report quite competitive results, claiming it outperforms VALL-E on word error rate and on opinion scores.
The input to the audio transformer model is a mel-spectrogram at a 100 Hz frame rate (one frame per 10 ms), along with the phonemes. The duration of each phoneme is predicted by a separate transformer model that runs alongside it.
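To make the frame-rate arithmetic concrete, here is a small sketch of turning aligner intervals (in seconds) into per-phoneme frame counts at 100 Hz and then repeating each phoneme ID once per frame. The function names and the example intervals and IDs are hypothetical; a real pipeline would read the intervals from the aligner's output files.

```python
import numpy as np

FRAME_RATE_HZ = 100  # 100 frames per second, i.e. 10 ms per frame

def intervals_to_frame_counts(intervals):
    """intervals: list of (start_sec, end_sec), one per phoneme.

    Rounds each boundary to the nearest frame so that consecutive
    phonemes tile the utterance without gaps or overlaps.
    """
    starts = np.round(np.array([s for s, _ in intervals]) * FRAME_RATE_HZ)
    ends = np.round(np.array([e for _, e in intervals]) * FRAME_RATE_HZ)
    return (ends - starts).astype(int)

def expand_phonemes(phoneme_ids, frame_counts):
    """Repeat each phoneme ID for as many frames as it lasts."""
    return np.repeat(np.asarray(phoneme_ids), frame_counts)

# Hypothetical example: three phonemes covering 0.20 s of audio.
counts = intervals_to_frame_counts([(0.0, 0.05), (0.05, 0.12), (0.12, 0.20)])
frame_level = expand_phonemes([3, 1, 4], counts)
```

The resulting frame-level sequence has the same length as the mel-spectrogram, so the two can be fed to the model frame by frame.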