Hello! I’m looking for an audio tokenizer. I’ve tried Mel-spectrograms and K-means, but those methods didn’t work well. I need a tokenizer that operates at 48 kHz (I don’t want any other sample rate) and produces around 50 tokens per second of audio. The token range should be either 0-2048 or 0-4096.
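To make those numbers concrete, here is a rough sketch of the arithmetic I have in mind. The function names are just mine for illustration, and I'm assuming a single codebook of 2048 or 4096 entries with one token per frame; a real design might differ.

```python
import math

SAMPLE_RATE = 48_000      # Hz, the only sample rate I care about
TOKENS_PER_SECOND = 50    # target token rate


def hop_length(sample_rate: int, tokens_per_second: int) -> int:
    """Number of audio samples covered by each token."""
    if sample_rate % tokens_per_second != 0:
        raise ValueError("token rate must divide the sample rate evenly")
    return sample_rate // tokens_per_second


def bitrate(tokens_per_second: int, codebook_size: int) -> float:
    """Bits per second, assuming one token per frame from a single codebook."""
    return tokens_per_second * math.log2(codebook_size)


if __name__ == "__main__":
    print("samples per token:", hop_length(SAMPLE_RATE, TOKENS_PER_SECOND))
    for size in (2048, 4096):  # assumed codebook sizes
        print(size, "entries:", bitrate(TOKENS_PER_SECOND, size), "bits/s")
```

So each token would have to summarize 960 samples, and the whole stream would carry only a few hundred bits per second, which is why I suspect the simple approaches I tried fell short.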
It needs to be trainable from scratch on macOS (please do not suggest pre-trained models) and should work as a standalone system. That is, the trained tokenizer should be a single .pth file that doesn’t rely on other models such as HuBERT. It should also be versatile enough to handle various types of audio, including speech, music, and even my fart sounds. Thank you!