Voicebox and TTS | Learn AI Together | Page 1

quick quiver Sep 28, 2023, 1:18 AM

#

Hello! I’m very interested in collaborating on any of the Text-To-Speech works.
SoundStorm, Voicebox, EnCodec based vocoders, etc.

I’ve started with an implementation of the Voicebox paper and am currently working on the phonemes using Montreal Forced Aligner.
If this sounds interesting, I’d be happy to lead a Voicebox paper review. My overall ML experience is low, but I’m learning as I go!

Voicebox is a general-purpose, transformer-based infilling model.

It infills masked audio using reference audio and a textual representation.

Training uses Conditional Flow Matching (CFM) as the objective, with the “optimal transport” probability path.
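A minimal sketch of what the optimal-transport CFM objective computes for one training example, in pure Python so it runs anywhere. The names `ot_path_sample`, `cfm_loss`, and `SIGMA_MIN` are made up for illustration, the value of `SIGMA_MIN` is an assumption, and `model` stands in for the real transformer (which would also take the masked audio and phonemes as conditioning).

```python
SIGMA_MIN = 1e-5  # assumed small constant; check the paper for the value actually used


def ot_path_sample(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the point x_t on the OT path and the target velocity u_t.

    x0 is a noise sample, x1 is the data sample (e.g. one mel frame), t is in [0, 1].
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(model, x1, x0, t):
    """Mean squared error between the predicted and target velocity."""
    x_t, u_t = ot_path_sample(x0, x1, t)
    v = model(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(u_t)


# Usage: a dummy "model" that always predicts zero velocity.
loss = cfm_loss(lambda x_t, t: [0.0] * len(x_t), x1=[1.0, 2.0], x0=[0.0, 0.0], t=0.5)
```

In a real training loop `x0` would be drawn from a standard normal and `t` uniformly from [0, 1] for every example.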

The idea behind the model is to behave more like LLMs by training on the more general objective of infilling. The results look quite competitive: the authors claim it outperforms VALL-E on word error rate and on opinion scores.

The input to the audio transformer model is a mel-spectrogram at 100 Hz (100 frames per second), along with phonemes whose durations are predicted by a separate transformer model.
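To make the 100 Hz frame rate and the phoneme durations concrete, here is a small sketch. It assumes 16 kHz audio (an assumption, not stated above), and the helper names `hop_length_for` and `expand_to_frames` are hypothetical. The first function gives the STFT hop size that yields 100 frames per second; the second repeats each phoneme for its duration so the text lines up frame-by-frame with the mel-spectrogram.

```python
def hop_length_for(sample_rate_hz, frame_rate_hz=100):
    """Number of audio samples between consecutive mel frames."""
    hop, remainder = divmod(sample_rate_hz, frame_rate_hz)
    if remainder:
        raise ValueError("sample rate must be a multiple of the frame rate")
    return hop


def expand_to_frames(phonemes, durations):
    """Repeat each phoneme `duration` times to get one label per mel frame."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


hop = hop_length_for(16000)  # 160 samples per frame at 16 kHz
frame_labels = expand_to_frames(["HH", "AH", "L"], [2, 3, 1])
```

At training time the durations would come from a forced aligner such as Montreal Forced Aligner; at inference time they would come from the duration model.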

#

I'm missing some of the intuition for why the architectures are the way they are, but I'm training myself, and then my computer 😉

If that sounds interesting to anyone, I should have a repo up on GitHub soon with the basic code for loading the MLS dataset and representing it as mel-spectrograms at 100 Hz.

#

If anyone has experience with phonemizers, I'd love to talk!

random tangle Sep 28, 2023, 4:08 AM

#

I've been down this road (still am) for a while

#

📎 message.txt

#

This resource dump is a set of all the TTS research that has been made available to the public

#

You'll note how things change once we moved to the monotonic alignment module from the glow-tts repo

#

One thing to note about TTS research is that, like with LLMs, the cutting-edge work gets done by the researchers who can afford to scale.

random tangle Sep 28, 2023, 5:08 PM

#

Overall, the architecture for TTS looks like this:

The Mel Spectrogram generator

At a high level, this is a seq2seq model (text to mel)
Current models love transformers
Encoder is for text, decoder is for generating mel spectrogram from the text representations
Between the text encoder and the mel decoder, alignment is performed. This is where the monotonic alignment search (MAS) algorithm from glow-tts has taken center stage. Some models, like FastPitch, use their own alignment algorithm or other methods.
There is research on making the mel decoder diffusion-based for higher-quality mel spectrograms, but a pure transformer encoder/decoder architecture is still popular

The vocoder

This transforms the mel spectrogram back to waveform format
This doesn't have to be too special, as we have several implementations (WaveRNN, WaveNet, HiFi-GAN, DiffWave, UnivNet)
HiFi-GAN is the most prominent, but you can also go for diffusion-based models for higher-quality audio
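For anyone unfamiliar with the monotonic alignment search mentioned above, here is a minimal dynamic-programming sketch in pure Python. It is not the glow-tts implementation (that one is optimized and batched); the function names are hypothetical. The input is a matrix of log-likelihoods with one row per text token and one column per mel frame, and the output assigns every frame to a token so that the token index never goes backwards and no token is skipped.

```python
NEG_INF = float("-inf")


def monotonic_alignment_search(log_lik):
    """Return, for each mel frame, the index of the text token it aligns to."""
    n_text, n_mel = len(log_lik), len(log_lik[0])
    if n_text > n_mel:
        raise ValueError("need at least one mel frame per text token")

    # q[i][j]: best total score of any monotonic path that puts frame j on token i.
    q = [[NEG_INF] * n_mel for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                              # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else NEG_INF  # move on to the next token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token on the last frame.
    path = [0] * n_mel
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return path


def durations_from_path(path, n_text):
    """Count how many frames each token received."""
    counts = [0] * n_text
    for token_index in path:
        counts[token_index] += 1
    return counts


scores = [
    [0.0, -5.0, -5.0, -5.0],
    [-5.0, 0.0, 0.0, -5.0],
    [-5.0, -5.0, -5.0, 0.0],
]
alignment = monotonic_alignment_search(scores)
```

The per-token frame counts from `durations_from_path` are the kind of targets a duration predictor is trained on.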

quick quiver Sep 30, 2023, 4:55 PM

#

That's awesome. Thanks for the information dump.

Why do you think they decided to train and use HifiGAN in the Voicebox case if there are known vocoders that produce higher-quality audio?

random tangle Sep 30, 2023, 5:54 PM

#

Diffusion-based anything has the following handicap: while more stable to train than a GAN, diffusion models are slower to run at inference time. This is because they must iteratively denoise random noise into the expected waveform. This process is strictly iterative, and I'm not aware of any techniques that get around it.

#

So while a GAN needs only one pass through the generator network to transform noise to output, a diffusion model needs multiple passes (say between 20 and 50, depending on the settings).
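The cost difference can be shown with a toy that only counts forward passes. Nothing here is a real GAN or diffusion sampler: `CountingNet` is a hypothetical stand-in whose "forward pass" just scales its input, and the update rule in `iterative_generate` is a placeholder.

```python
class CountingNet:
    """Stand-in for a neural network that records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, step=None):
        self.calls += 1
        return [0.9 * value for value in x]  # placeholder for a real forward pass


def gan_generate(generator, noise):
    """GAN-style sampling: a single pass from noise to output."""
    return generator(noise)


def iterative_generate(denoiser, noise, num_steps):
    """Diffusion-style sampling: the network is applied once per step."""
    x = noise
    for step in reversed(range(num_steps)):
        x = denoiser(x, step)
    return x


gan_net, diffusion_net = CountingNet(), CountingNet()
gan_generate(gan_net, [1.0, -1.0])
iterative_generate(diffusion_net, [1.0, -1.0], num_steps=50)
```

With networks of similar size, inference time scales roughly with these call counts: 1 for the GAN-style path and `num_steps` for the iterative one.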

quick quiver Sep 30, 2023, 11:49 PM

#

Interesting. One component in Voicebox that sounds like it could be useful in this scenario is Conditional Flow Matching: it runs in steps on an ODE solver. That runs on the content before it goes to the vocoder. This might be why they achieve such high MOS scores, and why the audio sounds so good already?
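A rough sketch of what "runs in steps on an ODE solver" means, using fixed-step Euler integration in pure Python. The velocity field below is a toy that pulls every point straight toward one fixed target; in Voicebox the velocity would come from the trained transformer. `euler_solve` and `make_toy_field` are hypothetical names.

```python
def euler_solve(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = vector_field(x, t)  # one network evaluation per step in a real model
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(target):
    """Velocity of a straight-line path that reaches `target` at t=1."""
    def field(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, target)]
    return field


noise = [0.7, -1.3, 0.2]
target = [1.0, 2.0, 3.0]
sample = euler_solve(make_toy_field(target), noise, num_steps=10)
```

Because the toy path is a straight line, Euler lands on the target for any step count. A learned field is only approximately straight, so the number of steps trades speed against quality.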

random tangle Oct 1, 2023, 12:55 AM

#

Kinda, each type of model has its ups and downs. You'd have to look at the differences between flow-based models (e.g. Glow-TTS, Flowtron, Voicebox), non-autoregressive parallel models (VITS, FastSpeech, TalkNet), and diffusion models (Grad-TTS, Tortoise TTS). There seem to be some problems that are universal within a family, and that is just a consequence of the type of model used. Also note that each paper fluffs itself up, and MOS is a very subjective benchmark because it relies on querying people's opinions

#

Also, disregard results going to the vocoder. When looking at the text to mel component, the metric is your MSE between the original (ground truth) mel spectrogram and the generated one. That's what you need to be looking at for that part of the model.
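As a concrete version of that metric, here is a minimal sketch of mel-spectrogram MSE in pure Python, assuming both spectrograms are lists of frames with the same shape. The function name `mel_mse` is hypothetical; a real pipeline would use tensors and mask out padding.

```python
def mel_mse(reference, generated):
    """Mean squared error over every (frame, mel bin) entry."""
    if len(reference) != len(generated):
        raise ValueError("spectrograms must have the same number of frames")
    total, count = 0.0, 0
    for ref_frame, gen_frame in zip(reference, generated):
        if len(ref_frame) != len(gen_frame):
            raise ValueError("frames must have the same number of mel bins")
        for ref_value, gen_value in zip(ref_frame, gen_frame):
            total += (ref_value - gen_value) ** 2
            count += 1
    return total / count


ground_truth = [[0.0, 0.0], [1.0, 1.0]]
prediction = [[0.0, 0.0], [1.0, 3.0]]
error = mel_mse(ground_truth, prediction)
```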

#

(vocoder is also trained with MSE loss, so look at that first before you put it all together and query for a MOS)

frail copper Oct 29, 2023, 8:22 AM

#

random tangle Diffusion based anything has the following handicap: while more stable to train ...

This might be related: it’s about reducing the number of inference steps for LDMs, though it’s unrelated to this thread as a whole.

https://arxiv.org/abs/2310.04378

arXiv.org

Latent Consistency Models: Synthesizing High-Resolution Images with...

Latent Diffusion models (LDMs) have achieved remarkable results in
synthesizing high-resolution images. However, the iterative sampling process is
computationally intensive and leads to slow generation. Inspired by Consistency
Models (song et al.), we propose Latent Consistency Models (LCMs), enabling
swift inference with minimal steps on any pr...

gloomy vale Nov 19, 2023, 5:07 PM

#

Are you still doing this?

quick quiver Nov 24, 2023, 1:24 PM

#

Yeah in a way, I'm running through the work on SPEAR-TTS, training the HuBERT based semantic tokenizer, and replacing the Voicebox phonemizer with that.

#

I'm implementing some of the work from lucidrains on GitHub.

#

Using the MLS dataset in webdataset form: https://github.com/itsjamie/mls_webdataset

#

Generating the phoneme "hidden units" to then cluster is going to take ~20 days, and I'm about halfway through. It'll be about 47 TB of MFCC features. I could probably massively reduce that by randomly sampling only about 10% of each speaker's data rather than using the entire dataset. But I've already started down this path 🙂

Not sure what the final clustered results size will be, but it should be much smaller.
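The per-speaker subsampling idea could look roughly like this. It is a sketch under assumptions: utterances are given as `(speaker_id, utterance_id)` pairs, every speaker keeps at least one utterance, and `subsample_per_speaker` is a hypothetical helper, not part of fairseq.

```python
import random
from collections import defaultdict


def subsample_per_speaker(utterances, fraction=0.1, seed=0):
    """Keep roughly `fraction` of each speaker's utterances, chosen at random."""
    by_speaker = defaultdict(list)
    for speaker_id, utterance_id in utterances:
        by_speaker[speaker_id].append(utterance_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for speaker_id in sorted(by_speaker):
        candidates = sorted(by_speaker[speaker_id])
        keep_count = max(1, round(len(candidates) * fraction))
        for utterance_id in rng.sample(candidates, keep_count):
            kept.append((speaker_id, utterance_id))
    return kept


all_utterances = [("spk_a", f"a_{n}") for n in range(20)] + [("spk_b", f"b_{n}") for n in range(5)]
subset = subsample_per_speaker(all_utterances, fraction=0.1)
```

Sampling within each speaker, rather than across the whole dataset, keeps speakers with little data from disappearing from the k-means training set.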

#

That process is using fairseq: https://github.com/facebookresearch/fairseq/tree/main/examples/hubert/simple_kmeans

But modified so that instead of using tsv file format, it's loading from the webdataset.

#

I modified it all to run on webdataset because my storage is a set of HDDs, and I get much better performance with that.
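The reason tar-based shards suit spinning disks is that a shard is read front to back, with no seeking between many small files. The sketch below illustrates the layout convention with the standard library only; it is not the webdataset library's API, and the file extensions (`flac`, `txt`) are assumptions. Files that share a name up to the first dot of the basename are grouped into one sample.

```python
import io
import tarfile


def iter_samples(fileobj):
    """Yield one dict per sample: {"__key__": key, extension: bytes, ...}."""
    current = None
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:  # members are visited in on-disk order
            if not member.isfile():
                continue
            directory, _, basename = member.name.rpartition("/")
            stem, _, extension = basename.partition(".")
            key = f"{directory}/{stem}" if directory else stem
            data = tar.extractfile(member).read()
            if current is None or current["__key__"] != key:
                if current is not None:
                    yield current
                current = {"__key__": key}
            current[extension] = data
    if current is not None:
        yield current


def build_demo_shard(files):
    """Build a small in-memory tar shard from (name, bytes) pairs for the demo."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


shard = build_demo_shard([
    ("utt1.flac", b"audio-1"),
    ("utt1.txt", b"hello"),
    ("utt2.flac", b"audio-2"),
])
samples = list(iter_samples(shard))
```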

#

But I'm doing this all at home on a 3080, so it takes a bit 😅

gloomy vale Nov 30, 2023, 4:36 AM

#

@quick quiver if you wanna collab, you can hit me up.

noble coral Dec 8, 2023, 5:27 PM

#

@quick quiver

#

Hi

rustic arch Dec 13, 2023, 8:59 AM

#

gloomy vale @quick quiver if you wanna collab, you can hit me up.

u have premium version!

gloomy vale Dec 13, 2023, 9:30 AM

#

What?

rustic arch Dec 13, 2023, 9:46 AM

#

gloomy vale What?

Account colab premium!

gloomy vale Dec 13, 2023, 11:04 AM

#

Hahah, I mean collaboration

rustic arch Dec 14, 2023, 1:38 PM

#

gloomy vale Hahah, I mean collaboration

Aha, sorry, I didn't get it because I run models and have a problem like that 😅

violet charm Feb 5, 2024, 2:11 PM

#

Hi! I'm working on something similar. I want to collaborate too, if you guys have formed some kinda group lemme know.

quick quiver Aug 31, 2025, 2:42 AM

#

Hey! So I finally got around to actually working on this in a meaningful way.

If anyone would like to help, it'd be awesome to get a HiFi-GAN vocoder trained. I plan to open-source the weights here as they're being trained on MLS-en.

#

Instead of Voicebox, I'm training SpeechFlow, the more general-purpose model that implements the same ideas.

#

```
[pretrain] step 42100 | loss 0.316667 | lr 4.75e-05 | dt 264.32s | gpu_mem 23.08 GB
[pretrain] step 42150 | loss 0.305038 | lr 4.75e-05 | dt 264.45s | gpu_mem 23.08 GB
[pretrain] step 42200 | loss 0.332576 | lr 4.75e-05 | dt 262.69s | gpu_mem 23.08 GB
[pretrain] step 42250 | loss 0.300547 | lr 4.75e-05 | dt 260.97s | gpu_mem 23.08 GB
[pretrain] step 42300 | loss 0.308708 | lr 4.75e-05 | dt 255.11s | gpu_mem 23.08 GB
[pretrain] step 42350 | loss 0.347048 | lr 4.75e-05 | dt 258.13s | gpu_mem 23.08 GB
[pretrain] step 42400 | loss 0.322298 | lr 4.75e-05 | dt 261.63s | gpu_mem 23.08 GB
[pretrain] step 42450 | loss 0.286046 | lr 4.75e-05 | dt 255.54s | gpu_mem 23.08 GB
[pretrain] step 42500 | loss 0.376855 | lr 4.75e-05 | dt 259.16s | gpu_mem 23.08 GB | grad_norm 0.161 (clip 0.2)
```

I'm working my way through, and would love collaborators, @gloomy vale @noble coral @violet charm
