Searching for Good Audio Tokenizer | Unsloth AI | Page 1

compact jolt Mar 24, 2025, 11:11 PM

#

Hello! I’m looking for an audio tokenizer. I’ve tried using Mel-spectrograms and K-means, but those methods didn’t work well. I need a tokenizer that operates at 48kHz (I don’t want any other sample rates) and can make around 50 tokens per second of audio. The token range should be either 0-2048 or 0-4096.

It needs to be trainable from scratch on macOS (please do not suggest pre-trained models) and should function as a standalone system. This means I want to train it as a single .pth file without relying on other models like HuBERT. Additionally, it should be versatile enough to handle various types of audio, including speech, music, and even my fart sounds. Thank you!
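For reference, the frame arithmetic implied by these requirements can be sketched as below. It assumes a single-codebook tokenizer that emits exactly one token per frame; `hop_length` is a hypothetical helper name, not part of any library mentioned in this thread.

```python
def hop_length(sample_rate_hz: int, tokens_per_second: int) -> int:
    """Samples of audio covered by one token, assuming one token per frame."""
    if sample_rate_hz % tokens_per_second != 0:
        raise ValueError("sample rate is not an integer multiple of the token rate")
    return sample_rate_hz // tokens_per_second


SAMPLE_RATE_HZ = 48_000   # requested sample rate
TOKENS_PER_SECOND = 50    # requested token rate

hop = hop_length(SAMPLE_RATE_HZ, TOKENS_PER_SECOND)
print(f"each token must summarize {hop} samples")
```

Under these assumptions each token covers 960 samples, so the encoder's total downsampling factor has to be 960.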

last inlet Mar 25, 2025, 3:35 PM

#

Maybe check out WavTokenizer? It is decently close to what you need: 4096 vocab size, can run at 40 tokens/s, and the codebase supports training from scratch (although this, as with training any NN-based audio tokenizer, will tend to take many GPUs and multiple days).

It is 24 kHz by default, but you should be able to modify it.

compact jolt Mar 26, 2025, 8:05 AM

#

Yes, I saw it but:

No, you can’t just “modify it”. You need to retrain
Would be fun to train it from scratch, but no thank you, because if I can’t do it on 1xA100 in 2-3 hours then that’s bad
It consists of multiple models so inference can be slow (or am I wrong?)
Haven’t seen anyone fine-tuning that

dusty ridge Mar 26, 2025, 8:27 AM

#

compact jolt Hello! I’m looking for an audio tokenizer. I’ve tried using Mel-spectrograms and...

I want to train it as a single .pth file without relying on other models like HuBERT
48kHz audio
50 tokens per second
0-4096
versatile and various types of audio
fast/small model size
What you're asking for doesn't exist

#

I think you can get like 3/5 or 4/5 max. MuCodec might work...

compact jolt Mar 26, 2025, 8:34 AM

#

dusty ridge > >I want to train it as a single .pth file without relying on other models like...

Why?

#

https://github.com/tencent-ailab/MuCodec

dusty ridge Mar 26, 2025, 8:35 AM

#

compact jolt Why?

1 sec

#

I did my math wrong, what you're asking for might exist (I used natural log instead of log2 for bitrate calc)
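The slip mentioned here is easy to reproduce. A minimal sketch, assuming one token per frame from a single codebook, so that bitrate is tokens per second times bits per token; `bitrate_bps` is a hypothetical helper name:

```python
import math


def bitrate_bps(tokens_per_second: float, vocab_size: int) -> float:
    """Bits per second: each token carries log2(vocab_size) bits."""
    return tokens_per_second * math.log2(vocab_size)


# Requested spec from the first message: 50 tokens/s, 4096-entry vocabulary.
correct = bitrate_bps(50, 4096)

# The mistake: math.log is the natural log, which yields nats, not bits.
wrong = 50 * math.log(4096)

print(f"log2: {correct:.1f} bits/s, natural log: {wrong:.1f}")
```

With log2 the requested budget comes to 600 bits/s; the natural log understates it at roughly 416, which makes the target look harder to hit than it is.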

compact jolt Mar 26, 2025, 8:36 AM

#

compact jolt https://github.com/tencent-ailab/MuCodec

This one is doing 384t/s, right?

dusty ridge Mar 26, 2025, 8:38 AM

#

compact jolt This one is doing 384t/s, right?

#

MuCodec is 25 tok/sec and 16384 vocab size in its low-bitrate config.
It also has a 100 tok/sec and 10,000 vocab size option but that's above what you wanted.

MuCodec is designed for music, so it does a pretty good job at easier tasks like speech. It can of course be retrained with some Python knowledge.
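To put the options named so far side by side, here is a rough comparison. The token rates and vocabulary sizes are the figures quoted in this thread, not values checked against the repositories, and reading "around 50 tokens per second" as an upper bound is an assumption.

```python
import math

# (tokens per second, vocab size) as quoted in this thread.
CONFIGS = {
    "WavTokenizer": (40, 4096),
    "MuCodec (low bitrate)": (25, 16384),
    "MuCodec (high bitrate)": (100, 10000),
}

MAX_TOKENS_PER_SECOND = 50  # assumption: "around 50" treated as a ceiling
MAX_VOCAB_SIZE = 4096       # larger of the two requested token ranges


def summarize(configs):
    """Map each name to (bitrate in bits/s, fits the requested budget)."""
    out = {}
    for name, (tps, vocab) in configs.items():
        bitrate = tps * math.log2(vocab)
        fits = tps <= MAX_TOKENS_PER_SECOND and vocab <= MAX_VOCAB_SIZE
        out[name] = (bitrate, fits)
    return out


for name, (bitrate, fits) in summarize(CONFIGS).items():
    print(f"{name}: {bitrate:.0f} bits/s, fits budget: {fits}")
```

This only checks token rate and vocabulary size; sample rate, model size, and training cost are separate constraints that it does not cover.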

compact jolt Mar 26, 2025, 8:40 AM

#

Oh my… a 5GB model probably won’t run fast 😦

dusty ridge Mar 26, 2025, 8:41 AM

#

bruh

#

You're asking for like the holy grail of codecs 😅

compact jolt Mar 26, 2025, 8:41 AM

#

Probably

#

I’m planning on combining LLM+Audio Tokenizer for real-time 48kHz generation

compact jolt Mar 27, 2025, 2:55 PM

#

Found this:

#

https://github.com/yistLin/universal-vocoder

yistLin/universal-vocoder: a PyTorch implementation of the universal neural vocoder.
