#Polyphonic MusicVAE


tall shard
#

There's been increasing interest lately in diffusion models for symbolic music modeling (i.e. MIDI instead of audio). Since symbolic music is naturally representable as a sequence of tokens, most approaches have taken the DDPM approach a la diffusion LMs. But there's also a subset of people who have been doing latent diffusion for music modeling, and the only pretrained VAE these people have to work with is the 8-year-old MusicVAE.

VAEs are nice for creative tasks because they allow for interpretability and tricks like interpolating between samples in latent space, and ofc you can train a diffusion model on the latents. MusicVAE does this well, but its main drawback is being monophonic: it can only represent one note at a time. This prevents it from working on the vast majority of musical data other than select melodies. I'm trying to train a new VAE that will handle polyphonic music, where multiple notes can play at the same time, for the above purposes (but mostly latent diffusion).

Right now, I have good reconstruction on sparse bars (melodies) and really bad reconstruction on dense bars (accompaniment). I think this is because the representation used (REMI) encodes dense bars to longer sequences, and sequence VAEs are reportedly rather bad at modeling long sequences (apparently 256 counts as long...) without posterior collapse, i.e. the decoder learning to ignore the latent, according to https://arxiv.org/pdf/1511.06349 and the original MusicVAE paper. (The premise of MusicVAE was mostly an architectural innovation that helps address this, which doesn't really apply here.) There doesn't seem to be much work on sequence VAEs as far as I can tell, and I haven't worked much with them previously, so I don't have very good priors on what to expect from various hyperparameters. I've tried a bunch of things already and none of them have really worked - code is here - so I would appreciate some advice or help.
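To make the sequence-length point concrete, here is a minimal sketch of a simplified REMI-style serialization. It is not the actual tokenizer used in this project: the `Note` tuple layout, the token names, and the example bars are all assumptions made for illustration. It only shows that each extra note costs roughly three more tokens, so dense bars get much longer than sparse ones.

```python
from typing import List, Tuple

# Hypothetical note layout: (position, pitch, velocity, duration), in arbitrary grid units.
Note = Tuple[int, int, int, int]

def remi_like_tokens(notes: List[Note]) -> List[str]:
    """Serialize one bar into a simplified REMI-style token list."""
    tokens = ["Bar"]
    last_position = None
    for position, pitch, velocity, duration in sorted(notes):
        # A Position token is emitted once per distinct onset in the bar.
        if position != last_position:
            tokens.append(f"Position_{position}")
            last_position = position
        # Every note costs three tokens regardless of how many share the onset.
        tokens += [f"Pitch_{pitch}", f"Velocity_{velocity}", f"Duration_{duration}"]
    return tokens

# Sparse bar: four melody notes. Dense bar: four 4-note chords at the same onsets.
melody = [(0, 60, 80, 4), (4, 62, 80, 4), (8, 64, 80, 4), (12, 65, 80, 4)]
chords = [(p, pitch, 80, 4) for p in (0, 4, 8, 12) for pitch in (48, 55, 60, 64)]

print(len(remi_like_tokens(melody)))  # 17
print(len(remi_like_tokens(chords)))  # 53
```

Both bars have the same rhythm, but the chordal one is about three times as long once serialized.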

#

Another part of it is that basically nobody has done VAEs on sequences in almost a decade, so I haven't looked at any of the architectural improvements (e.g. from SD) that have happened since then. I expect those can help too.

lone bough
#

Could you brute force it by adding a lot more tokens? Say there are 32 possible notes; then there are 2^32 subsets that you might choose to play at any given point in time.

tall shard
#

I don't see what that would do. Ideally the latent space encodes musical information, not just note information.

#

Unless you mean expanding the vocabulary?

lone bough
#

I might mean expanding the vocabulary. I mean that e.g. instead of trying to have the model output two tokens simultaneously at time t, instead you have the model output one token that corresponds to two notes.

#

Like you give a different name to every possible chord or combination of notes.

#

Probably a lot of combinations would go unused; you can perhaps throw those away, especially the highly complicated ones with tons of noise.
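A minimal sketch of this idea, assuming each timestep is already available as a set of MIDI pitches. The function names, the `min_count` threshold, and the reserved unknown id are hypothetical choices, not something from an existing library.

```python
from collections import Counter
from typing import Dict, FrozenSet, List

def build_chord_vocab(timesteps: List[FrozenSet[int]], min_count: int = 2) -> Dict[FrozenSet[int], int]:
    """Give one token id to every pitch combination seen at least `min_count` times."""
    counts = Counter(timesteps)
    vocab: Dict[FrozenSet[int], int] = {}
    for combo, count in counts.most_common():
        if count >= min_count:
            vocab[combo] = len(vocab) + 1  # id 0 is reserved for unknown combinations
    return vocab

def encode(timesteps: List[FrozenSet[int]], vocab: Dict[FrozenSet[int], int], unk_id: int = 0) -> List[int]:
    """Map each timestep to a single token; rare combinations fall back to `unk_id`."""
    return [vocab.get(combo, unk_id) for combo in timesteps]

steps = [
    frozenset({60, 64, 67}),  # C major triad
    frozenset({60}),          # single note
    frozenset({60, 64, 67}),
    frozenset({60}),
    frozenset({61, 62, 63}),  # rare cluster, seen once, so it is dropped
]
vocab = build_chord_vocab(steps, min_count=2)
print(len(vocab))            # 2
print(encode(steps, vocab))  # [1, 2, 1, 2, 0]
```

The cost shows up in the last position: any combination that was thrown away can no longer be represented exactly.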

tall shard
#

Sure. That just sounds like BPE with some induced structure. But I'm not sure how much I expect to get out of this apart from generally reduced sequence lengths, although that in itself may be helpful.

lone bough
#

I was just approaching this from the perspective of trying to reduce the problem of polyphonic MusicVAEs to something more like monophonic MusicVAEs since monophonic ones work.

#

Totally understandable if I'm missing something.

tall shard
#

I see, that makes more sense. Indeed the benefit of monophonicity is that you can model each timestep with a single token, because you are guaranteed to only ever have one note playing. But then for polyphonic music, either you have to have |V| = 2^N for N the number of acceptable pitches, or you have to accept that you can't process certain note combinations. I doubt that is worth the tradeoff, especially given the already short context length.

#

incidentally, N = 90-128, which is super infeasible
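For a rough sense of scale, the sketch below counts the pitch combinations directly. Capping the number of simultaneous notes (`max_polyphony`) is an assumption added here for illustration, as one way of "not processing certain note combinations"; the helper name is hypothetical.

```python
from math import comb

def num_note_sets(num_pitches: int, max_polyphony: int) -> int:
    """Count pitch subsets (including silence) with at most `max_polyphony` notes."""
    return sum(comb(num_pitches, k) for k in range(max_polyphony + 1))

# With no cap, the count is the full power set, 2^N.
print(num_note_sets(32, 32))                # 4294967296, i.e. 2**32
print(num_note_sets(128, 128) == 2 ** 128)  # True

# With a cap, the count grows polynomially in N rather than exponentially.
print(num_note_sets(128, 4))
```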

lone bough
#

I mostly agree, but if we throw away unused combinations, maybe the set of combinations that actually need a token is sparse.

#

Just throwing the idea out there. It's definitely unsatisfying and has issues.

tall shard
#

So I was able to get nearly 100% F1 just by opening up the latent bottleneck a bit to 512d (from 64/128) and having a much more lenient KL weight warmup schedule. I suppose it was trying and failing to squeeze the data through too few dimensions. This is curious to me, as this implies that the space of reasonable bars of music is significantly larger than I expected it to be.
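For readers unfamiliar with KL warmup, here is a minimal sketch of a linear schedule. The exact schedule used in these runs isn't given in the thread, so the linear shape, `free_steps`, and `warmup_steps` are assumptions; only the final weight of 0.2 matches the beta mentioned in this thread.

```python
def kl_weight(step: int, warmup_steps: int, beta_max: float, free_steps: int = 0) -> float:
    """Linear KL warmup: 0 for `free_steps`, then ramp to `beta_max` over `warmup_steps`."""
    if step < free_steps:
        return 0.0
    progress = (step - free_steps) / max(1, warmup_steps)
    return beta_max * min(1.0, progress)

def vae_loss(recon_loss: float, kl: float, step: int,
             warmup_steps: int = 10_000, beta_max: float = 0.2) -> float:
    """Beta-weighted VAE objective: reconstruction term plus the annealed KL term."""
    return recon_loss + kl_weight(step, warmup_steps, beta_max) * kl

# A more lenient schedule just means a longer ramp (or more free steps) before
# the KL term reaches full strength.
for step in (0, 5_000, 10_000, 50_000):
    print(step, kl_weight(step, warmup_steps=10_000, beta_max=0.2))
```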

#

However 512d seems rather large for individual bars when doing latent diffusion, since one training sample will be T * M latents if you take a rectangular section of the piano-roll representation of a song (where T is the number of tracks and M the number of measures per track). This blows up very quickly. I'm also using beta=0.2 to optimize reconstruction quality, and I'm not sure to what extent greater regularization of the latent space will aid the diffusion model and/or hurt reconstruction meaningfully.
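A quick back-of-the-envelope sketch of how the sample size scales. The track and measure counts below are made-up example values, not numbers from the actual dataset.

```python
def latent_floats(tracks: int, measures: int, latent_dim: int) -> int:
    """Number of scalars in one diffusion training sample of T x M bar latents."""
    return tracks * measures * latent_dim

# Hypothetical example: 4 tracks by 32 measures.
print(latent_floats(4, 32, 512))  # 65536
print(latent_floats(4, 32, 64))   # 8192
```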

sullen trench
#

For BPM-aligned MIDI data, it may be useful to introduce a polyphony rate as an additional feature for each MIDI event, alongside pitch, velocity, start, and duration. This accompaniment parameter can provide a representation of the local chordal structure.
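One possible reading of this feature, sketched below, is the number of notes sounding at each note's onset. That definition, the `(start, duration, pitch)` layout, and the function name are assumptions; the message above doesn't pin down how the rate is computed.

```python
from typing import List, Tuple

# Hypothetical note layout: (start, duration, pitch), with times in ticks or grid steps.
Note = Tuple[int, int, int]

def polyphony_at_onsets(notes: List[Note]) -> List[int]:
    """For each note, count how many notes (itself included) sound at its onset."""
    result = []
    for start, _, _ in notes:
        active = sum(1 for s, d, _ in notes if s <= start < s + d)
        result.append(active)
    return result

notes = [(0, 4, 60), (0, 4, 64), (2, 2, 67), (4, 2, 72)]
print(polyphony_at_onsets(notes))  # [2, 2, 3, 1]
```

This is quadratic in the number of notes, which is fine for a single bar but would want a sweep-line version for whole pieces.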

tall shard
#

Yes, I was going to try this after doing some more architecture ablations. This gets around one of my problems with the REMI representation, which is that you don't know the degree of the polyphony until you've parsed every event at that timestep.

tall shard
#

So far I've been running hyperparameter sweeps for 1 epoch each, where each epoch is one sweep over literally every bar from every piece of the Lakh MIDI dataset, after filtering and deduplication. The best models are now able to get almost perfect reconstruction on test pieces from POP909, and I'm trying to scale these configurations up to 2 or more epochs. However these are still being trained with latent_dim 512 and beta=0.2, and I would like to eventually decrease the former and increase the latter for reasons stated above. It seems inevitable that this will hurt reconstruction quality, so I am attempting to investigate strategies to minimize the hit.
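The thread doesn't say exactly how reconstruction F1 is computed. One common choice, sketched here as an assumption, is a note-level F1 where a note counts as correct only if its onset and pitch both match; the tuple layout and function name are hypothetical.

```python
from typing import Iterable, Tuple

NoteKey = Tuple[int, int]  # hypothetical (onset, pitch) pair

def note_f1(reference: Iterable[NoteKey], reconstruction: Iterable[NoteKey]) -> float:
    """Note-level F1 between an original bar and its reconstruction."""
    ref, rec = set(reference), set(reconstruction)
    if not ref and not rec:
        return 1.0  # two empty bars match perfectly
    true_pos = len(ref & rec)
    precision = true_pos / len(rec) if rec else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

original = [(0, 60), (4, 62), (8, 64)]
decoded = [(0, 60), (4, 62), (8, 65)]  # last pitch is wrong
print(round(note_f1(original, decoded), 3))  # 0.667
```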

sullen trench
#

Are you training on multi or single instruments per sequence?

tall shard
#

Single. Each train sample is 1 bar of 1 track, so that a T x M grid of T tracks and M measures encodes to a T x M grid of latents that can be used for diffusion.
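As a sketch of the layout only: `encode_bar` below is a placeholder standing in for the trained VAE encoder, and the toy latent size of 4 is an assumption to keep the example small.

```python
from typing import Callable, List, Sequence

Latent = List[float]

def encode_grid(bars: Sequence[Sequence[object]],
                encode_bar: Callable[[object], Latent]) -> List[List[Latent]]:
    """Encode a T x M grid of bars independently into a T x M grid of latents."""
    return [[encode_bar(bar) for bar in track] for track in bars]

def dummy_encode_bar(bar: object, latent_dim: int = 4) -> Latent:
    """Placeholder encoder: returns a zero vector of the right size."""
    return [0.0] * latent_dim

# 3 tracks by 8 measures of (empty) placeholder bars.
grid = [[None for _ in range(8)] for _ in range(3)]
latents = encode_grid(grid, dummy_encode_bar)
print(len(latents), len(latents[0]), len(latents[0][0]))  # 3 8 4
```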

sullen trench
#

Have you considered using the rests option in REMI? Including rests can provide fixed timing information for both sparse and dense training examples, even though they only represent silence. Since rests are more predictable, this may help reduce KL divergence.

lunar rune
#

Hi, VAEs for image generation take fixed-size input, but music is variable-length. How are you addressing this?

tall shard
#

just do one bar at a time and concatenate your latents into a [T, M] grid if you want to do anything with it

tall shard
#

512 right now, having a hard time compressing it any further without losing reconstruction quality

#

this is really not ideal but w/e

lunar rune
#

I think one bar is perhaps too long for a VAE? Maybe it contains complex high-level semantic structures which VAEs might not be able to capture properly.

tall shard
#

what

lunar rune
#

I was thinking that for polyphonic music, if you work at the bar level, the patterns in a bar may be complex enough that every bar in your dataset is unique, so the space is quite sparse. But for VAEs in image generation, each latent just corresponds to a patch, and patches are small enough that there are a lot of highly similar patches in the dataset, so the distribution is clustered and easier to learn.

tall shard
#

oh, well I deduplicate the dataset so the space is artificially sparse here, but it's also very sparse generally because of how many different combinations you can have. so a patch-like approach may be useful. but i feel like a bar should be the minimum generally reasonable semantic separator - even then you lose long-range dependencies - since you can have a lot of different sub-bar patterns, most of which will be broken with any choice of separator

lunar rune
#

Also, if you use bars as the separator, then it would be difficult to leverage datasets like aria-midi, which isn't bar-aligned.

tall shard
#

i think this is a reasonable tradeoff because aria is also kind of meant for a different purpose afaik

#

between LMD, GigaMIDI, and that other new one i forget, there's plenty of bar-aligned data to be had

tall shard
#

nontrivial to quantize well though. POP909 has the same problem

sullen trench
#

Would a MIDI-to-image representation be of interest to you? I made an encoder/decoder script a while back but never did anything with it. It packs 8 bars into a vertically stacked 2-bar format. The format is 768×768 with a 96-tick resolution, but you could scale it down.

tall shard
#

Yeah that could be interesting