There's been increasing interest lately in diffusion models for symbolic music modeling (i.e. MIDI instead of audio). Since symbolic music is naturally representable as a sequence of tokens, most work has gone the DDPM route à la diffusion LMs. But there's also a subset of people doing latent diffusion for music modeling, and the only pretrained VAE they have to work with is the 8-year-old MusicVAE.
VAEs are nice for creative tasks because they allow for interpretability and tricks like interpolating between samples in latent space, and of course you can train a diffusion model on the latents. MusicVAE does this well, but its main drawback is that it's monophonic: it can only represent one note at a time. That rules out the vast majority of musical data apart from select melodies. I'm trying to train a new VAE that handles polyphonic music, where multiple notes can play at the same time, for the purposes above (but mostly latent diffusion).
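To make the interpolation trick concrete, here is a minimal sketch of spherical interpolation between two latent vectors. Everything in it is illustrative: `slerp` is a hypothetical helper, latents are assumed to be plain lists of floats with nonzero norm, and plain linear interpolation would also be a reasonable choice.

```python
import math

def slerp(z0, z1, t):
    """Spherically interpolate between latents z0 and z1 at fraction t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    # Angle between the two vectors, clamped to guard against rounding error.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or anti-parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

# Hypothetical usage: five evenly spaced points between two encoded bars.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
path = [slerp(z_a, z_b, i / 4) for i in range(5)]
```

Each point on `path` would then go through the decoder to get the in-between bars.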
Right now, I get good reconstruction on sparse bars (melodies) and really bad reconstruction on dense bars (accompaniment). I think this is because the representation I'm using (REMI) encodes dense bars as longer sequences, and sequence VAEs are apparently rather bad at modeling long sequences (where 256 tokens already counts as long...) without the decoder collapsing, i.e. posterior collapse, where the decoder learns to ignore the latent. That's according to https://arxiv.org/pdf/1511.06349 and the original MusicVAE paper, whose main contribution was actually an architectural change that helps address this, but which doesn't really apply here. There doesn't seem to be much work on sequence VAEs as far as I can tell, and I haven't worked much with them before, so I don't have good priors on what to expect from various hyperparameters. I've tried a bunch of things already and none of them have really worked (code is here), so I would appreciate some advice or help.
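For anyone unfamiliar with the usual mitigations for this kind of collapse, here is a minimal sketch of KL weight annealing (the idea from the paper linked above, with a linear ramp here for simplicity) combined with one common per-dimension variant of a free-bits floor on the KL term. The function names, the schedule shape, and the default values are all illustrative assumptions rather than tuned settings.

```python
def kl_weight(step, warmup_steps=10000, max_beta=1.0):
    """Linearly ramp the KL weight from 0 to max_beta over warmup_steps."""
    if warmup_steps <= 0:
        return max_beta
    return max_beta * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_nats=0.5):
    """Sum the per-dimension KL, counting each dimension as at least free_nats.

    Dimensions whose KL is below the floor contribute a constant, so the
    optimizer gains nothing by pushing them all the way to zero.
    """
    return sum(max(kl, free_nats) for kl in kl_per_dim)

def vae_loss(recon_nll, kl_per_dim, step,
             warmup_steps=10000, max_beta=1.0, free_nats=0.5):
    """Reconstruction NLL plus the annealed, floored KL term."""
    beta = kl_weight(step, warmup_steps, max_beta)
    return recon_nll + beta * free_bits_kl(kl_per_dim, free_nats)

# Hypothetical numbers: halfway through warmup, two latent dimensions.
loss = vae_loss(recon_nll=10.0, kl_per_dim=[0.1, 1.0], step=5000)
```

In a real training loop the same logic would operate on tensors, with `kl_per_dim` averaged over the batch before the floor is applied.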