Hello,
I was wondering if anyone knows of any notable projects for extracting music latents. I am specifically looking for something like SentenceTransformers, but for audio. I have run across quite a few that do general audio classification, but none that can accurately capture the majority of long-form, complex musical content.
I have tried training a custom model with acids-rave (RAVE, the Realtime Audio Variational autoEncoder). It generalizes pretty well, but it targets simple notes/sounds over short periods of time, and it only compresses raw audio into latent vectors by around 10x if you want a decent decoding.
I am looking to use the latents for a variety of projects, including audio recommendation (a custom Spotify based on the actual content of songs), rhythm game event generation that takes later parts of the song into account, and audio manipulation.
As a note, I am aiming for a song capacity of something like 5 minutes at 44,100 Hz (about 13 million samples).
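
For reference, here is the quick arithmetic behind that size, as a rough sketch. It assumes mono audio and takes the ~10x compression figure at face value; the function names are just for illustration and do not come from any library.

```python
# Back-of-the-envelope sizing for one song.
# Assumptions: mono audio, and a flat ~10x raw-to-latent compression ratio.
SAMPLE_RATE_HZ = 44_100
DURATION_S = 5 * 60
COMPRESSION = 10


def song_samples(duration_s: int, sample_rate_hz: int) -> int:
    """Number of raw audio samples in a mono clip."""
    return duration_s * sample_rate_hz


def latent_values(n_samples: int, compression: int) -> int:
    """Latent values left after compression (rounded up)."""
    return -(-n_samples // compression)


n = song_samples(DURATION_S, SAMPLE_RATE_HZ)
print(f"raw samples:   {n:,}")
print(f"latent values: {latent_values(n, COMPRESSION):,}")
```

So even at 10x, a single song still leaves over a million values, which is why I am after something with much stronger compression for whole-song use.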