# Searching for Good Audio Tokenizer


compact jolt
#

Hello! I’m looking for an audio tokenizer. I’ve tried using Mel-spectrograms and K-means, but those methods didn’t work well. I need a tokenizer that operates at 48kHz (I don’t want any other sample rates) and can make around 50 tokens per second of audio. The token range should be either 0-2048 or 0-4096.

It needs to be trainable from scratch on macOS (please do not suggest pre-trained models) and should function as a standalone system. This means I want to train it as a single .pth file without relying on other models like HuBERT. Additionally, it should be versatile enough to handle various types of audio, including speech, music, and even my fart sounds. Thank you!
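
As a rough sanity check on the request above: at 48 kHz, 50 tokens per second means each token has to summarize 960 audio samples. Below is a minimal sketch of that arithmetic, assuming a single-codebook tokenizer that emits exactly one token per frame. The `hop_length` helper is hypothetical and not taken from any library.

```python
SAMPLE_RATE_HZ = 48_000   # sample rate from the request
TOKENS_PER_SECOND = 50    # target token rate from the request


def hop_length(sample_rate_hz: int, tokens_per_second: int) -> int:
    """Samples covered by one token, assuming one token per frame."""
    if sample_rate_hz % tokens_per_second != 0:
        raise ValueError("token rate does not divide the sample rate evenly")
    return sample_rate_hz // tokens_per_second


hop = hop_length(SAMPLE_RATE_HZ, TOKENS_PER_SECOND)
print(hop)  # 960 samples per token
```

In other words, the encoder would need a total downsampling factor of 960 under these assumptions.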

last inlet
#

Maybe check out WavTokenizer? It's decently close to what you need: 4096 vocab size, can run at 40 tokens/s, and the codebase supports training from scratch (although this, as with training any NN-based audio tokenizer, will tend to take many GPUs and multiple days).

It is 24 kHz by default, but you should be able to modify it.

compact jolt
#

Yes, I saw it but:

  1. No, you can’t just “modify it”. You need to retrain.
  2. Would be fun to train it from scratch, but no thank you, because if I can’t do it on 1xA100 in 2-3 hours then that’s bad
  3. It consists of multiple models, so inference can be slow (or am I wrong?)
  4. Haven’t seen anyone fine-tuning it
dusty ridge
#

I think you can get like 3/5 or 4/5 of those requirements, max. MuCodec might work...

dusty ridge
#

I did my math wrong; what you're asking for might exist (I used natural log instead of log2 for the bitrate calc).
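
For reference, the bitrate of a single-codebook tokenizer is the token rate times log2 of the vocabulary size. The sketch below, which assumes one codebook and equally likely tokens, shows how swapping in the natural log skews the result. The function names are illustrative only.

```python
import math


def bitrate_bps(tokens_per_second: float, vocab_size: int) -> float:
    """Bits per second for a single codebook: rate * log2(vocab size)."""
    return tokens_per_second * math.log2(vocab_size)


def bitrate_with_natural_log(tokens_per_second: float, vocab_size: int) -> float:
    """The mistaken version: the result is in nats per second, not bits."""
    return tokens_per_second * math.log(vocab_size)


correct = bitrate_bps(50, 4096)              # 600.0 bit/s
wrong = bitrate_with_natural_log(50, 4096)   # smaller by a factor of ln(2)

print(correct)
print(round(wrong / correct, 3))  # 0.693
```

Using the natural log therefore understates the bitrate by roughly 31%. With log2, the requested 50 tokens/s comes to 550 bit/s at a 2048 vocabulary and 600 bit/s at 4096.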

dusty ridge
#

MuCodec is 25 tok/sec with a 16384 vocab size in its low-bitrate config.
It also has a 100 tok/sec, 10,000 vocab size option, but that's above what you wanted.

MuCodec is designed for music, so it does a pretty good job at easier tasks like speech. It can of course be retrained with some Python knowledge.
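
To put the options mentioned so far side by side, here is a small sketch that compares them against the original request. The token rates and vocabulary sizes are the figures quoted in this thread and are not verified here. The limits are an assumption that reads "around 50 tokens per second" and "0-4096" as upper bounds, and `fits_request` is a hypothetical helper.

```python
import math

# (tokens per second, vocab size) as quoted in the thread, unverified
CANDIDATES = {
    "WavTokenizer": (40, 4096),
    "MuCodec (low bitrate)": (25, 16_384),
    "MuCodec (high bitrate)": (100, 10_000),
}

# Assumed limits, read off the original request
MAX_TOKENS_PER_SECOND = 50
MAX_VOCAB_SIZE = 4096


def bitrate_bps(tokens_per_second: float, vocab_size: int) -> float:
    """Single-codebook bitrate in bits per second."""
    return tokens_per_second * math.log2(vocab_size)


def fits_request(tokens_per_second: float, vocab_size: int) -> bool:
    """True if both the token rate and the vocab size stay within the limits."""
    return (tokens_per_second <= MAX_TOKENS_PER_SECOND
            and vocab_size <= MAX_VOCAB_SIZE)


for name, (tps, vocab) in CANDIDATES.items():
    print(f"{name:24s} {tps:4d} tok/s  vocab {vocab:6d}  "
          f"{bitrate_bps(tps, vocab):7.1f} bit/s  fits={fits_request(tps, vocab)}")
```

Under these assumptions only the WavTokenizer figures fit both limits. The low-bitrate MuCodec config exceeds the vocabulary limit, and the high-bitrate one exceeds both.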

compact jolt
#

Oh my… a 5GB model probably won’t run fast 😦

dusty ridge
#

bruh

#

You're asking for like the holy grail of codecs 😅

compact jolt
#

Probably

#

I’m planning on combining LLM+Audio Tokenizer for real-time 48kHz generation

compact jolt
#

Found this: