#Neuro-Symbolic Music Models


tiny coral
#

as per Oore et al 2018, iirc

hearty flicker
#

Ah yeah that is my approach to

ruby frost
#

is there a centralized doc where y'all are keeping track of progress somewhere? scattered messages can be hard to follow sometimes 😅

hearty flicker
#

It would probably be a good idea for me to add the progress write-ups to the discussion tab on there.

#

I've nearly finished building tokenizer/training infrastructure. I'm trying to build it up properly, it's important to me that people can easily fine-tune on their own datasets if they wish.

#

As for the direction of the paper, I'm interested in doing a comprehensive review outlining how different factors impact the quality of musical transformer models.

#

For example: composition of the pre-training data, model size, format of the tokenizer.

#

A side effect would be creating a SOTA open source pre-trained general purpose musical transformer model (MuseNet+ but open and apache 2.0 licensed).

#

I would also like to include some research on ways of aligning the model to produce piano music (I have a very good fine-tuning dataset here)

#

I'll update you @ruby frost when things are moving along properly! Like I said, I'm currently doing this in my spare time. The majority of the code should be done by mid next week.

timber talon
#

this is a great overview @hearty flicker 🙂
i'd just add to that and say that I've been pretty wrapped up in the EMNLP deadline, but once it's passed, I'll be focusing on evaluation, specifically whether there are ways of analyzing the structure of the output, as in: https://arxiv.org/pdf/2210.08444.pdf

hearty flicker
#

Btw if anyone is interested in genre fine-tuning, I have about 600 hours of high quality midi paired with genre tags

#

Unfortunately they are all demos, only 1 minute long

sand nymph
#
raven kettle
#

The names for all these absolutely suck

#

MusicGen, MusicLM, AudioLM...

#

But the work itself is fantastic

#

And this from Facebook is MIT, code, models...

sand nymph
#

NC license for the models I think

raven kettle
#

Ah yeah MIT if only for the code

hearty flicker
#

Cool stuff

timber talon
#

has anyone done audio diffusion models?

raven kettle
#

In this channel, or in general

#

Because there's a lot of audio diffusion papers

timber talon
#

in general... wondering why the current trend is for autoregressive transformers

#

when I feel like musical phrases lend themselves well to diffusion setups

raven kettle
#

Current sota compression methods seem to lend themselves to discrete tokens

hearty flicker
#

Tokenizer is working on my end now

#

Just need to fiddle with the API a bit to make it easier to use. Should be very easy for people to build their own datasets (with custom hooks for pre-processing, filtering, etc.) & tokenizers.

#

Will push everything and add a HOWTO when the API is finalized

hearty flicker
timber talon
#

nice 🙂

hearty flicker
#

Interestingly people have even trained image diffusion models on spectrograms to create audio

#

According to my supervisor the results are surprisingly good.

#

One of the things I'm aiming to start working on soon is non-auto-regressive symbolic (i.e. tokenized) music models

#

Could make a better foundation for compositional tools

timber talon
#

That makes sense. I'm curious how they handle long sequences

#

Is the length of the output constrained by the length of the diffusion window?

hearty flicker
#

The logic was a little tedious, it could probably do with a refactor

#

The bright side is that it supports multi-track, pedal, and drums. It is also customisable with a config.json file (for quantization).
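
A minimal sketch of what config-driven quantization could look like. The config keys (`velocity_step`, `time_step_ms`) and the `quantize` helper are hypothetical and chosen for illustration; they are not taken from the repo's actual `config.json` schema.

```python
import json

# Hypothetical config contents; the real config.json schema may differ.
CONFIG_JSON = '{"velocity_step": 10, "time_step_ms": 10}'


def quantize(value: int, step: int, max_value: int) -> int:
    """Round value to the nearest multiple of step, then clamp to [0, max_value]."""
    rounded = step * round(value / step)
    return max(0, min(max_value, rounded))


config = json.loads(CONFIG_JSON)

# MIDI velocities lie in 0-127; onsets here are measured in milliseconds.
velocity = quantize(67, config["velocity_step"], 127)
onset_ms = quantize(1234, config["time_step_ms"], 10_000)
print(velocity, onset_ms)
```

Coarser steps shrink the vocabulary and shorten the set of distinct time values, at the cost of expressive detail.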

#

I will add a 'howto' to the README after the training logic is done.

raven kettle
#

This might have been mentioned before - it's relatively old anyway

hearty flicker
#

Btw guys here is an example of what Arabesque (Debussy) looks like with the current version of the lazy tokenizer with quantization. It has been truncated to 1000 tokens.

#

And this is what the midi sounds like after it has been converted back

#

For now I'm going to implement the model in PyTorch/Lightning for initial experiments. At some point it would be good to rewrite it using DeepSpeed.

quasi wren
#

Could you add my 60k midi dataset to the datasets

hearty flicker
#

Beethoven's entire Sonata No. 8 is only 25k tokens lol

hearty flicker
#

On Sunday I'm going to post a writeup on the GitHub repo, discussing where things currently stand. The implementation is basically completed at this point, time to start experimenting.

#

Very similar to the non-auto-regressive stuff I was working on a few months ago.

timber talon
#

sounds great man

#

wow, the samples in this Stanford demo are really impressive

#

what's the difference between "anticipation" and "masked infilling", though? is this not just "BERT for long sequences of masks"?

fallow birch
#

not bad (melody is new after 5 seconds, accompaniment is the original song)

somewhat in need of an actual interface

hearty flicker
#

Hey guys, I've been really busy. It's probably gonna be a while before I've got any spare time. I'll update everyone, perhaps we can start things up again in a few weeks : )

quasi wren
#

I would love to get this started back up

timber talon
#

no problem man!! I'm a lot freer, too, after EMNLP. maybe we can meet again to re-kick stuff off and talk about what a paper would look like? especially in light of some of these recent publications that we've chatted about in the past few weeks

hearty flicker
#

Hey guys. Things are still slowly plugging along! I'm currently writing tests in my spare time. I will make a post/roadmap on the GitHub page (I'll also link that here) when the repo is fully functional.

hearty flicker
#

Hey guys, as some of you know, I've been really busy recently (at my internship/job). Having said that I'm happy to announce that the code is pretty much done, and the project will be moving into the next phase now! I ran an initial test last night, training on ~200 bach midi files ($1 of compute cost). Here are some of the samples. I'm just happy that everything (code-wise) is working correctly.

In total I have gathered a significant amount of data (200k+ files). I believe that most of these are high quality, however to be safe I have also included support for data cleaning in the repo.

Later this week I'm going to post the roadmap I have in mind for the future of this project. One area that I haven't touched is evals, perhaps @timber talon has some ideas on this topic.

@here

sand nymph
#

@trim solstice

warm tangle
#

This is fantastic! How can I contribute to the project from here? It's been quite a busy summer for me too!

timber talon
#

Yah I have some ideas on evaluation. Does anyone in this channel have a background in graphical models?

#

Would love to chat with you if so

hearty flicker
#

As a sidenote, would you mind changing the repo name to 'aria' if you have time @sand nymph? I don't have the permission to do it myself.

sand nymph
ruby frost
timber talon
#

cool! I'm playing around with ideas on how to model musical structure. The idea was to do latent criticism on the structure of the piece, a la: https://arxiv.org/pdf/2210.08444.pdf and then evaluate musical generations by whether they have structures that look like human-generated structures. I've made graphical models in the past, but they do require planning/etc. while designing them, since they're non-trivial to implement once the design is established

#

I was hoping to be able to toss around some ideas if you wanna chat about it some more

hearty flicker
#

Currently running a test using the GiantMidi dataset - https://github.com/bytedance/GiantMIDI-Piano. The tokenized data for this one is over 2gb so it should be a lot better. Will post a follow up later if all goes well. Still only training a ~60m model with 512 max_seq for testing purposes.

tiny coral
ruby frost
timber talon
#

ok cool. maybe i'll message you privately? unless others want to brainstorm as well

raven kettle
hearty flicker
#

I think this is a really good sign that things are working correctly

#

Roadmap/HOWTO should be posted today if I can finish it before the evening

#

@here

hearty flicker
#

There is still a small bug (some edge case to do with the sustain pedal) somewhere in the decoding, however this one is still quite good

ruby frost
#

alberti bass is all you need haha, but seriously sounds awesome!

hearty flicker
#

I'm happy with how the experiment turned out, this is just a test using a subset of the data. I think the final product will be pretty cool.

timber talon
#

hey man, these are really cool!!

#

great work, super stoked man, and super impressed you're able to do this while also juggling what sounds like a super intense internship!

hearty flicker
#

If anyone wants to play around, dm me and I can give dl link to the model checkpoint. It's very easy to generate some samples from the cli

sand nymph
hearty flicker
#

Well this is just a test/experiment, far from the final product

#

I'm writing the readme/howto/roadmap atm

#

I will put it on HF eventually

timber talon
#

is there a way we can help you write a model-card? I know the training details, architecture, etc. have been discussed in this thread but i kinda lost them. Or is that already intuitive in the repo?

hearty flicker
#

The thing is, the architecture is probably gonna change quite soon.

timber talon
#

i see!

hearty flicker
#

Like I said, this was just an experiment I ran to make sure it was not just producing noise.

#

The experiment just worked out well haha

timber talon
#

ah great 🙂

hearty flicker
#

I am writing the roadmap now, that should explain my ideas for how to continue

sand nymph
#

@subtle lance wears a lot of hats, but one of them is helping people with dataset and model documentation FYI 🙂

timber talon
#

that sounds really great, really appreciate the organizational work you're doing here! maybe we can have another sync meeting?

hearty flicker
#

There are various things (such as data augmentation) which are half implemented

#

Good idea, I'll try to set that up after I've finished up the roadmap

#

The models also scale really well (in a musicality sense) with transformer context length. The experiment I ran was very limited in that respect.

#

With 2048 cl it should be able to produce 2-3min pieces

rustic dirge
hearty flicker
#

@everyone

Hey guys, just a small update here. I've added a roadmap (https://github.com/EleutherAI/aria/blob/main/ROADMAP.md) and how-to (https://github.com/EleutherAI/aria/blob/main/HOWTO.md) to the repository. This roadmap is essentially a list of things I will be working on over the next month. I don't expect anyone else to get involved, but it's nice to have a document to refer people to now! The howto is still a work in progress, as it stands it should be enough to fully grok the repo if you have typical Python/PyTorch (and some basic MIDI) knowledge.

I'm going to try to blitz through as many of the issues as I can over the next week, as I will have more free time. If you have any questions, feel free to ask me : )

sharp quiver
#
GitHub - ugtqphgirx/bpe-symbolic-music: Code of the paper "Byte Pair Encoding for Symbolic Music"

hearty flicker
#

I read this paper a few months ago. I think BPE would be nice to add way down the line (as it is essentially automatic/free) however it's not a priority of mine at all.

#

My goal for Aria is to make the best possible pre-trained symbolic music model with the currently available data and known LLM improvements. I'm currently meticulously min-maxing every aspect that I can think of. I suppose BPE fits that criterion but I don't think it will make enough of a difference to warrant including at this stage. In my experience the biggest factors are data quality and the tokenizer itself (inc the kinds of data augmentation that it enables).

#

I'm currently working on the functionality for collecting metadata and adding relevant bits as prefix tokens. MuseNet does this, it's actually been surprisingly straightforward to implement.
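
A rough sketch of how metadata could be turned into prefix tokens ahead of the note tokens. The metadata keys, the `<S>` start token, and the `build_prefix` helper are all assumptions for illustration, not Aria's actual token format.

```python
def build_prefix(metadata: dict, allowed_keys: tuple) -> list:
    """Turn selected metadata fields into (key, value) prefix tokens."""
    prefix = []
    for key in allowed_keys:
        value = metadata.get(key)
        if value:  # skip missing or empty fields
            prefix.append((key, value.strip().lower()))
    return prefix


# Hypothetical metadata and note tokens, for illustration only.
metadata = {"composer": "Bach", "genre": "Classical", "source": ""}
note_tokens = [("piano", 60, 70), ("dur", 100), ("wait", 100)]

sequence = build_prefix(metadata, ("composer", "genre", "source")) + ["<S>"] + note_tokens
print(sequence)
```

Because the prefix sits before the start token, the same fields can later be supplied at sampling time to steer generation.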

hearty flicker
#

Hey guys. I'm aiming to do a full training run (full context length, full dataset) during the first week of September if the cluster is available. Will be exciting to see how it turns out. After that, my rough idea is to start work on a paper about scaling pre-trained symbolic music models. This should align with when I'm back at school which will be nice!

timber talon
#

What can we do to help?

#

@ruby frost and i met to discuss 2 novel evaluation approaches, but they will be papers in-and-of-themselves and thus might not be ready for an arxiv release of this paper if you're already close to being done with experiments

hearty flicker
#

I've got some thoughts on the future that I need to go over with Stella first. Will update soon.

#

Should be fairly easy to get you guys involved with what I have planned. Pretty exciting stuff imo!

timber talon
#

ok sounds good man! just let me know. Starting to become a lot more flexible as the semester starts so whatever you need

hearty flicker
#

Just a quick update guys : )

#

I'm currently working on squeezing all the perf I can out of the model. I'm also rewriting the pretraining loop using hf-accelerate to implement proper checkpointing/logging/experiment tracking.

#

As for my ideas for the future direction, I really want to get a sense of how well nlp-style alignment (pretrain -> finetune) techniques can be used with musical transformer models.

#

I have two basic research directions in mind:

#
  1. How to scale symbolic transformer models. Considerations such as architectural details, allocation of the parameter budget, dataset weightings, data augmentation, and tokenizer differences, etc. There aren't any papers that delve deeply into these aspects that I have seen.
#
  2. How to align pretrained models. A good example for the sort of question I'm interested in: If I finetune Aria (pretrained on ~350k midi files) on this (https://bushgrafts.com/midi/) amazing (but small) jazz dataset, how much better is the resulting model than one that wasn't pretrained? How about other alignment techniques?
#

From my experience of fine-tuning music models, (2) makes a massive difference. In some (unpublished) work I was doing a few months ago, I observed that pretraining helped a massive amount when training a symbolic music model on J.S. Bach's Fugues (from the WTC Book 1&2).

#

However I haven't seen any papers on it.

timber talon
#

Yah I'm really into this. We can also do a kinda "linguistic blood bank" kind of analysis for which styles help the most.. lemme find the paper

#

Ex. Pretraining in classical might help when fine tuning on jazz, but it might not help for fine tuning on pop, for instance

hearty flicker
#

I think it's a cool research idea because it links very well with music stuff, nlp stuff, and transformer stuff!

timber talon
#

That linguistic blood bank paper won a naacl best paper award!

hearty flicker
#

Btw I reached out to the website Classical Archives and they agreed to donate their library of over 15k classical MIDI files to our pretraining dataset.

#

However this data is sensitive so it won't be available for uses other than pretraining.

hearty flicker
#

I have a feeling that although it's pretty far away from classical/jazz, it will still increase the quality of the aligned models.

#

I really want to do a proper training run soon, it would be nice to have a proper pretrained model to start experimenting on.

carmine musk
#

Hi @hearty flicker I would like to contribute to the project, would you please tell me more about it?

sand nymph
#

@hearty flicker So the next step, now that I've gotten you access to the HPC cluster, is to test the efficiency of the code at the 16 GPU (2 node) scale right?

hearty flicker
#

One node will be fine. I don't want to complicate things more than I need to right now!

#

I'm still working on making sure the training is as efficient as possible.

sand nymph
#

Ah I forgot if you had done the 1 GPU -> 1 node jump yet

hearty flicker
#

I've done 1x8 before for my training

#

I'm using accelerate for training so configuring for multiple nodes shouldn't be hard, however I haven't looked into it yet

sand nymph
#

The goal should be 120-140 TFLOP/s/A100, though if you're using Flash Attention you can add 30-ish to those numbers. If you're getting below 120 and your code doesn't totally suck you're probably doing something wrong in the configuration
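
For reference, a back-of-the-envelope way to check a run against that target. It uses the common approximation of 6 FLOPs per parameter per training token (which ignores attention FLOPs) and the A100's 312 TFLOP/s dense BF16 peak; the model size and token throughput below are made-up numbers.

```python
A100_PEAK_TFLOPS = 312.0  # dense BF16 peak for an A100


def tflops_per_gpu(n_params: float, tokens_per_second: float, n_gpus: int) -> float:
    """Approximate achieved TFLOP/s per GPU via the 6 * params * tokens rule of thumb."""
    flops_per_second = 6.0 * n_params * tokens_per_second
    return flops_per_second / n_gpus / 1e12


# Hypothetical run: a 500M-parameter model on one 8-GPU node.
achieved = tflops_per_gpu(n_params=500e6, tokens_per_second=350_000, n_gpus=8)
utilization = achieved / A100_PEAK_TFLOPS
print(f"{achieved:.1f} TFLOP/s per GPU, {utilization:.0%} of peak")
```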

hearty flicker
#

The training code will be pushed to the main repo soon. I'm pretty sure I've got everything configured efficiently however I will need to do some flop tests as you say. I am using flash attention btw.

sand nymph
#

Awesome

#

So the next update we'll be looking for is efficiency confirmation and an estimate for the amount of compute required to train the models you'd like to train
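
A compute estimate of that kind can be roughed out once throughput is known. This sketch assumes total training FLOPs of about 6 × parameters × tokens; the model size, token count, and sustained throughput are placeholder values, not project figures.

```python
def estimate_gpu_hours(n_params: float, n_tokens: float, sustained_tflops: float) -> float:
    """GPU-hours needed, assuming ~6 FLOPs per parameter per training token."""
    total_flops = 6.0 * n_params * n_tokens
    seconds = total_flops / (sustained_tflops * 1e12)
    return seconds / 3600.0


# Placeholder values: 500M parameters, 10B training tokens, 130 TFLOP/s per GPU.
hours = estimate_gpu_hours(n_params=500e6, n_tokens=10e9, sustained_tflops=130.0)
print(f"{hours:.1f} GPU-hours")
```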

timber talon
#

very exciting!

hearty flicker
#

Hey, @everyone! I wanted to provide a small update: I'm now officially back at university, so I'll be working on this pretty much full time. Consequently, this channel should become a lot more active! Here's a brief overview of what I'll be working on in the short-term and long-term.

The v1 version of Aria is essentially ready to be trained. However, there are a few things that need to be looked at. As @sand nymph mentioned, I need to ensure that I'm utilizing the compute resources efficiently. I will conduct some tests this week. Additionally, I need to explore modern ways of extending context windows, as this is important for music generation. I'm not very familiar with this area, but I've heard that this paper describes the current best way of doing it:

https://arxiv.org/abs/2108.12409
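
For anyone unfamiliar with the linked paper (ALiBi): instead of position embeddings, it adds a per-head linear penalty to the attention scores based on query-key distance. A minimal pure-Python sketch of the bias, assuming the number of heads is a power of two; the function names are illustrative.

```python
def alibi_slopes(n_heads: int) -> list:
    """Geometric per-head slopes; for 8 heads this gives 1/2, 1/4, ..., 1/256."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> list:
    """Causal bias matrix: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], seq_len=4)
for row in bias:
    print(row)
```

The bias is added to the pre-softmax attention scores of each head, so distant tokens are penalised without any learned position parameters.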

If anyone has any thoughts on this area, please let me know. Once Aria has finished pre-training, I have several areas that I'll be looking into in the short term:

  1. Fine-tuning Aria on some small high-quality datasets (think jazz/classical) to create high-quality (possibly SOTA) generative models for symbolic music.

  2. @timber talon, among others, is interested in evaluations for AI-generated symbolic music. I'll be looking into this by fine-tuning Aria on some well-known AI-music datasets that have additional meta information, which I will interpret as additional meta-tokens.

  3. My supervisor is currently interested in and working on a paper on 'bum' (i.e., incorrect) note detection in symbolic music. I have an idea for a different approach that would utilize Aria. The general idea would be to run Aria over real music and flag notes when Aria assigns them very little probability. Due to the tokenization scheme I'm using, this should be easy to implement.
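
The idea in point 3 can be sketched roughly as follows. This assumes the model's per-position logits are already available; the toy logits, the 0.01 threshold, and the function names are illustrative assumptions only.

```python
import math


def log_softmax(logits: list) -> list:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def flag_unlikely_tokens(logits_per_step: list, token_ids: list, min_prob: float = 0.01) -> list:
    """Return positions where the observed token received probability below min_prob."""
    flagged = []
    for position, (logits, token_id) in enumerate(zip(logits_per_step, token_ids)):
        prob = math.exp(log_softmax(logits)[token_id])
        if prob < min_prob:
            flagged.append(position)
    return flagged


# Toy logits over a 3-token vocabulary; a real model would supply these.
logits_per_step = [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0], [4.0, 4.0, -3.0]]
observed = [0, 0, 1]
flagged = flag_unlikely_tokens(logits_per_step, observed)
print(flagged)
```

A low probability only marks a note as surprising, not necessarily wrong, so the threshold would need tuning against labelled examples.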

#

In the longer term, I'm really interested in scaling up Aria once again. Recently, audio to MIDI conversion (using deep learning) has become very good. For an example of this, you can see the GiantMIDI dataset, which itself will likely be largely responsible for the quality of Aria's generative output. I want to continue in this direction, building an even larger (and more diverse) dataset of MIDI that we can add to Aria's pre-training dataset in a future version. From the experiments I've done, I've observed that generative output just keeps getting better and more convincing the larger and higher quality the dataset. Since it is theoretically possible to systematically download (using tools like spotifydl) and process a large proportion of recorded piano works, I think this is a very promising direction for the long-term continuation of this project.

If anyone has any questions or ideas, feel free to message me!

sand nymph
# hearty flicker Hey, @everyone! I wanted to provide a small update: I'm now officially back at u...

We actually collaborated on a method for context length extension that has some notable advantages over ALiBi cc: @deft hedge @robust juniper @quasi steppe https://arxiv.org/abs/2309.00071

hearty flicker
#

Do you have paper or preprint?

sand nymph
#

Added to my reply; I was off looking for it

hearty flicker
#

Thank you Stella, I will give this a read after ALiBi

quasi steppe
#

Let us know if you have questions.
I have also been curious, do you guys have the full freedom to choose transformer architecture and directly train on some huge music dataset, or have to finetune on some large models available so far?

timber talon
#

Hey loubb, welcome back to uni, and your progress here sounds really exciting!

As for context window extension, here's another technique worth looking at as it also enforces some locality bias: https://arxiv.org/abs/2209.10655

On point #3 that sounds like a fascinating direction. These kinds of analyses can be a bit difficult because of the noisiness inherent in these kinds of measures on the word/note level. See Figures 4 and 16 in this paper that many of us were involved in: https://arxiv.org/pdf/2306.17806.pdf for a sense of what word-by-word analyses can look like. As you can see, they're very noisy and also not uniform throughout (way more variation in the beginning). The context of these experiments is a little different from what your advisor is going for, because we were comparing two model variations with these metrics, but perplexity curves tend to look similar. I think maybe because some words are "correct" but still surprising... taking the sentence in a different direction or introducing a prepositional phrase or something. Not to say it's impossible, but some way of separating a true "wrong note" from just an unlikely "correct" note (maybe self-supervised perturbed data?) could be useful here.

There's another approach for #3 that I think would be really, really fascinating to compare to, which detects jargon terms:

https://blog.allenai.org/words-as-gatekeepers-measuring-discipline-specific-terms-and-meanings-in-scholarly-publications-718dc56d08a5?gi=024bad4293a6

This approach wouldn't be for detecting "wrong" notes, but maybe a robust way to identify correct notes that are unlikely but specific to a genre, say, blues notes in jazz.

Curious to chat more about your other directions and take on some tasks if you need help with them!

#

I know we had a whole discussion about tokenization schemes... this might be a dumb question but can you humor me.. the compound word tokenization approach to me seems like it comes closest to actual sheet music representation: https://arxiv.org/pdf/2101.02402v1.pdf

Is there a reason why people don't typically use this one?

hearty flicker
hearty flicker
#

My rough idea was to add the embeddings in the input (embedding) layer for the different quantities that need to be accounted for (instrument, pitch, velocity, duration)

#

Which would be fairly straightforward to implement

#

However I was unsure about how I could do the decoding (LM head) to get a probability distribution to compute the loss against. I foresaw some complications due to the fact that I would still need 'wait' tokens as well as these compounded note tokens. The obvious solution would be to use different LM-heads for the different quantities, and compute the total loss as the sum of the CEL derived from each LM-head, however it wasn't obvious to me how exactly to implement this in a way that would work well with the wait tokens.

#

The approach I took with Aria was to instead just compound as much as I can without explicitly adding representations together. Instrument, pitch and velocity are compounded into a single token. When quantizing the possible values for the velocity into, say, 10 different volume levels, the resulting vocab size is about 15,000. This feels reasonable to me.

#

I did have to make the sacrifice of separating out duration tokens though. Excluding meta and special tokens, Aria therefore has three types of tokens: combined (instrument, pitch, velocity) tokens, duration tokens (in ms), and wait tokens (in ms).

#

A typical sequence might look something like: '(piano, pitch=62, velocity=70)', '(duration, 100ms)', '(wait, 100ms)'
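
A toy sketch of a tokenizer producing that three-token pattern. It assumes `wait` measures the gap between consecutive note onsets and that chords (simultaneous onsets) emit no `wait`; the `Note` type, token spellings, and quantization step are illustrative and may not match Aria's real tokenizer.

```python
from typing import NamedTuple


class Note(NamedTuple):
    instrument: str
    pitch: int
    velocity: int
    start_ms: int
    duration_ms: int


def tokenize(notes: list, velocity_step: int = 10) -> list:
    """Emit (instrument, pitch, velocity), ("dur", ms), then ("wait", ms) to the next onset."""
    tokens = []
    ordered = sorted(notes, key=lambda n: n.start_ms)
    for i, note in enumerate(ordered):
        velocity = velocity_step * round(note.velocity / velocity_step)
        tokens.append((note.instrument, note.pitch, velocity))
        tokens.append(("dur", note.duration_ms))
        if i + 1 < len(ordered):
            gap = ordered[i + 1].start_ms - note.start_ms
            if gap > 0:  # simultaneous notes (a chord) need no wait token
                tokens.append(("wait", gap))
    return tokens


notes = [
    Note("piano", 62, 72, 0, 100),
    Note("piano", 64, 70, 100, 100),
    Note("piano", 67, 70, 100, 200),
]
tokens = tokenize(notes)
print(tokens)
```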

#

The advantage of this approach is that I can do a very straightforward decoding stage (linear layer: d_model -> vocab_size, followed by a softmax, followed by CEL loss). Having said that, it does require more tokens than it 'should' in theory.

#

I'm really interested in how they solve the problem that I mentioned, I'll give this paper a read!

hearty flicker
#

By splitting the predictions apart we could imagine the separate lm-head suggesting something like:

duration lm-head: 50% chance that duration is 50ms, 50% chance it is 500ms
pitch lm-head: 50% chance that pitch is a b, 50% chance that pitch is a low c

However we surely don't want the model to suggest the compound token (pitch=b, duration=500ms) in this context, as that would sound bad over a c-major scale.
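
The concern can be made concrete with a small calculation. It assumes the intended joint distribution puts all its mass on two pairs, (b, 50ms) and (low c, 500ms); that pairing is an assumption read into the example above. Independent heads can only represent the product of the marginals, which leaks probability onto the unwanted pair.

```python
from collections import defaultdict

# Assumed joint distribution: two plausible continuations, each with probability 0.5.
joint = {("b", 50): 0.5, ("low_c", 500): 0.5}

pitch_marginal = defaultdict(float)
duration_marginal = defaultdict(float)
for (pitch, duration_ms), prob in joint.items():
    pitch_marginal[pitch] += prob
    duration_marginal[duration_ms] += prob

# Separate LM heads that do not see each other's choice yield this product.
factorized = {
    (pitch, duration_ms): p_pitch * p_duration
    for pitch, p_pitch in pitch_marginal.items()
    for duration_ms, p_duration in duration_marginal.items()
}

print(joint.get(("b", 500), 0.0), factorized[("b", 500)])
```

Both heads match their marginals exactly, yet the unwanted pair (b, 500ms) goes from probability 0 to 0.25 when the heads are sampled independently.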

#

I'm not sure how much of an issue this would be in practice, it is just the line of thinking that made me go with a different approach. I'm looking forward to reading this paper to see how they deal with this issue.

timber talon
#

Ahh man it really makes one appreciate how linear text is! I think i pretty much follow your approach. Just to clarify, the "wait" token is the equivalent to a rest in sheet music, correct?

#

Ultimately, I think this is an empirical question and probably, with enough data, it doesn't matter. And you've experimented with this for a while, so you have more intuition here, slash the tokenization scheme has been set for a while so i'm sorry to keep questioning it

My only intuition is that the model's internal representation should be close to sheet music. There's some interesting neuropsychology research that indicates that rhythm and pitch comprehension are actually different cognitive processes (https://pubmed.ncbi.nlm.nih.gov/14681127/), so maybe the LM head approach isn't a bad one to represent that

That being said, I definitely think you're right that rhythm and pitch are correlated. However, even in the other compound word approach, it's pretty trivial to add cross attention between the multiple LM-heads to model that correlation, isn't it?

hearty flicker
hearty flicker
hearty flicker
#

Just a quick update - the training loop is finished. I still have to test that it will work correctly in a distributed setup and verify that it's utilizing hardware properly (e.g. measure the flops). I'm going to add some multiprocessing to the data processing and hopefully run a training test overnight on my home 1x4090 server.

hearty flicker
#

The dataset preprocessing utilities I've written have been incredibly enlightening. It turns out that in most music datasets (commonly used MIDI ones) there are a ton of subtly duplicated files with different names / meta information.

#

The unimportant differences let them pass a file-level hash duplication check, when they should be a duplicate in the eyes of a tokenizer. This definitely has the potential to cause duplication leakage into validation sets.
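
One way to catch these is to hash the musical content instead of the file bytes. The sketch below hashes quantized (onset, duration, pitch) tuples as a stand-in for hashing the tokenizer's output; the note representation and the 10 ms step are assumptions.

```python
import hashlib


def content_hash(notes: list, time_step_ms: int = 10) -> str:
    """Hash quantized (onset, duration, pitch) tuples, ignoring names and meta information."""
    canonical = sorted(
        (round(onset / time_step_ms), round(duration / time_step_ms), pitch)
        for onset, duration, pitch in notes
    )
    return hashlib.sha256(repr(canonical).encode("utf-8")).hexdigest()


# Two renderings of the same two notes with tiny timing differences and different order.
version_a = [(0, 100, 60), (100, 100, 62)]
version_b = [(101, 99, 62), (1, 100, 60)]
different = [(0, 100, 60), (100, 100, 64)]

print(content_hash(version_a) == content_hash(version_b))
print(content_hash(version_a) == content_hash(different))
```

Splitting train and validation sets by this hash instead of by filename keeps near-identical copies on the same side of the split.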

tiny coral
#

I had thought about the possibility of creating an embedding model to try to detect duplicates at a more semantic level. How many MIDI files are recreations of the same original piece of music, I wonder?

hearty flicker
#

That's a cool idea

hearty flicker
#

Currently running an experiment on the home gpu server in my flat - will post some samples later

hearty flicker
#

Everything is essentially ready now for a full scale training run. The data library is all properly multithreaded now too, it no longer takes a few hours to build the full dataset lol

hearty flicker
#

With the data augmentation that I've added, the experiment I'm training is still decreasing in val and train loss after 75 epochs

#

This is only training on about 5% of the total dataset, and the model has only ~30m parameters

#

Here are some samples prompted with Bach's Cmajor prelude

#

All of these are non-cherry picked btw

quasi steppe
hearty flicker
#

This particular prompt isn't (this exact MIDI file), however I'm sure a different version of the piece is in the training data

#

The prompt for each of these is identical, the only difference is the randomness inherent to the sampling process

#

If there are any pieces you would be interested in prompting with, let me know and I'll give it a spin

#

I should reiterate that this experiment is not scaled properly (model seems to be unable to overfit the train set). I'm currently preparing to run a full scale training run

quasi steppe
#

but really fascinating results!

quasi steppe
hearty flicker
#

The full-scale version will be somewhere in the range 200m-500m

#

This is a continuation of my favourite Fugue by Bach

sharp quiver
hearty flicker
#

I think this one would really require a model with longer context to give it a chance haha

#

The full-scale version will have 4x-8x the context length

#

I will give it a go anyway though

#

Ok this is a non cherry picked example and it is wild lmao

#

Doesn't get to generate for very long since it runs out of context

sharp quiver
#

❤️

quasi steppe
hearty flicker
#

Although I'm pretty sure I'm going to implement ALiBi (or YaRN) before doing the actual train run

quasi steppe
hearty flicker
#

Right now I've got RoPE implemented in the exact same way as it is for neox

#

Idk if this is ideal to be honest, the position embeddings for the dynamic sequence lengths are handled by the following code

```python
def forward(self, x, seq_dim=1, seq_len=None):
    """Returns tuple cos, sin"""
    # Comment out bfloat16() specific code for now
    if seq_len is None:
        seq_len = x.shape[seq_dim]
    if seq_len != self.seq_len_cached:
        self.seq_len_cached = seq_len
        t = torch.arange(seq_len, device=x.device).type_as(self.inv_freq)
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
        # if self.precision == torch.bfloat16:
        #     emb = emb.float()
        self.cos_cached = emb.cos()[:, None, None, :]
        self.sin_cached = emb.sin()[:, None, None, :]
        # if self.precision == torch.bfloat16:
        #     self.cos_cached = self.cos_cached.bfloat16()
        #     self.sin_cached = self.sin_cached.bfloat16()

    return self.cos_cached, self.sin_cached
```
#

Due to the nature of the data, during training the model will only see contexts of length max_seq_len, so the re-calculation is only ever used when sampling

#

I definitely want support for longer contexts, it's the next thing to cross off the list

hearty flicker
warm tangle
#

Wow! How many parameters are there for this ^^?

hearty flicker
#

30m

sand nymph
#

@hearty flicker How do you plan on evaluating the models? Are there musicians who you can have listen to samples and guess which famous person is being mimicked?

hearty flicker
#

There is a large ai-music/ai-audio group in my department that I plan to use for this

#

Me and @timber talon are also working on some symbolic-music eval stuff

hearty flicker
quasi steppe
#

In particular I meant what's in the following chart. It's the vanilla Llama-2 without any finetuning and we only modify the model code. The model only has 4096 ctx but dynamic-yarn stabilizes the ppl to ~8000.

The reason I'm really curious is that we didn't test it on a lot of downstream tasks and it would be extremely interesting to "hear" whether there is a difference for the first half and the interpolated second half in this case.

hearty flicker
#

I'm definitely up for this, I'll start looking into it tomorrow

timber talon
hearty flicker
timber talon
# hearty flicker

lol truly wild. I love that secondary melody that it just, like, randomly introduces lol

quasi steppe
#

there was that "world model" paper a couple days ago, but I'm thinking maybe music model is an even better one for that kind of experiment since every piece has a much more natural geographical/temporal label (as opposed to "ask the LM for latitude/longitude").

timber talon
timber talon
#

ohh thanks!!
@quasi steppe by "every piece has a much more natural geographical/temporal label", do you mean the geography of the composer and the time period it was written in?

quasi steppe
timber talon
#

could be interesting - I'm aware of some classical work looking at geographic differences in music/language: https://lchc.ucsd.edu/MCA/Mail/xmcamail.2009_11.dir/pdfZz8vEGN9aS.pdf

but in these cases they had to heavily restrict the analysis to pieces "of nationalistic character" because the "average" piece contains a good deal of cross-reference

sand nymph
#

The more I read the world model paper the less compelling I find it tbh

timber talon
#

i mean what you're getting at is broader than music, I think... you're getting at any kind of "style" differences that exist based on geography/time. Dialects in language, artistic movements in visual art, etc.

timber talon
sand nymph
#

Carlini told me that he liked the Pythia paper more each time he read it which is basically my goal with writing papers now

timber talon
#

wow! high praise.

The "world model" paper basically reminds me of what Jason Wei had to say recently: We've seen a transition along the "gradient" of publishing: conference papers -> arxiv/tech reports -> blog posts -> (tweet + code), and I expect the trend to continue. (https://twitter.com/_jasonwei/status/1709634375233716591). As in, it was written for the tweet

hearty flicker
#

Hey guys. My current plan is to run a scaled up test over the weekend using the HPC. Before that I need to implement and test the context length extension, along with doing some refactoring.

#

I'm going to send Stella some more detailed information about the paper that I plan to write. The general idea will be 1/2 on scaling music transformers, and 1/2 on applications of transfer learning (e.g. fine-tuning the model) for generative and MIR tasks. On the generative side, I'm especially looking forward to fine-tuning on this dataset and tying in the LIMA paper since the dataset is quite small. I can't find the 2024 ICML deadline, however I might aim to get this work submitted there. Music transformer papers have done well there in the past.

#

If anyone wants to contribute, a good idea would be to experiment with your own fine-tuning and see if you can come up with anything interesting. I'd obviously add anyone who contributes something that makes it into the paper as an author. As I said, I'm refactoring the fine-tuning code this week so it should be fairly straightforward to use.

#

Just a reminder that you can find the repo here - https://github.com/EleutherAI/aria. The HOWTO might be slightly outdated, however I can update that this week too (after the refactor).

hearty flicker
#

Hey @quasi steppe I'm curious what method you would choose for positional embeddings / context extension if you were given complete freedom over the architecture. I'm tempted to just stick with rotary embeddings and train with 2048 or 4096 context.

#

A context of 2048 corresponds to about 2-3 mins of music with one instrument.

quasi steppe
hearty flicker
#

I've been reading the yarn paper this afternoon.

#

Is there anything that I'd need to change for pretraining?

quasi steppe
#

nope. The whole point of that paper is that you can extend it afterwards in a data-efficient way

hearty flicker
#

I could just carry out the fine-tuning procedure that you describe

quasi steppe
#

so far it seems better to worry about context length in finetuning, unless you really really really go for long-range qualities and have tons of budget

hearty flicker
#

I think it will be a pretty interesting test of your ideas : )

#

Long context music generation has been a fairly well-researched area in the past too (for symbolic music).

quasi steppe
#

if you try to put in YaRN, make sure to try the dynamic YaRN for extension without finetuning. I really want to see how it is in concrete tasks

hearty flicker
#

Since aria will be in the range of 200m - 800m I think I'll be able to fit 4096 with flash attention on an A100 with ddp.

quasi steppe
#

I'm still locked out of SAI infra for GPUs, otherwise I might have time to help you integrate YaRN into your stuff

#

I only have tpu pods and I'm working on adapting models I want to test in JAX which is really taking a lot of time 😂

hearty flicker
#

Do you have any code that accompanies the paper?

#

If not, I have a mathematical background (the same as you I think - algebraic geometry) so I should be able to do it by myself ha

quasi steppe
hearty flicker
#

I ended up pivoting into ML instead for my PhD ha

quasi steppe
#

me too and I'm doing it post-PhD 😂

hearty flicker
#

Anyways I'll give the yarn stuff a proper read now and get it in my codebase

#

It's probably good to make sure it works before I do the pretraining

#

I was pretty skeptical about using ALiBi due to how much I like rotary embeddings

hearty flicker
#

I had some trouble finding it myself

hearty flicker
#

That should be very easy to implement

quasi steppe
hearty flicker
#

I can try this now for you then

#

Current mini experiment is trained with 512 so I can try to double it and see what happens

quasi steppe
#

gotcha. Basically when the ctx length is < 512 nothing should change. When it gets bigger we just gradually scale up the factor from 1.0 to 2.0
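
A minimal sketch of that dynamic factor, assuming a training context of 512 and the rule `max(1, current_len / train_len)`. The function names are made up for illustration, and the factor is applied here with plain position interpolation (dividing positions by the factor), which is simpler than what YaRN itself does to the rotary frequencies:

```python
def dynamic_scale(cur_len: int, train_len: int = 512) -> float:
    """Scale factor: 1.0 inside the training context, growing linearly beyond it."""
    return max(1.0, cur_len / train_len)


def rotary_angles(cur_len: int, dim: int = 8, base: float = 10000.0,
                  train_len: int = 512) -> list:
    """Rotary angles per position, with positions squeezed by the dynamic factor.

    This is ordinary position interpolation, used only as a stand-in for the
    per-frequency treatment described in the YaRN paper.
    """
    scale = dynamic_scale(cur_len, train_len)
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[(pos / scale) * f for f in inv_freq] for pos in range(cur_len)]


for n in (256, 512, 768, 1024):
    print(n, dynamic_scale(n))
```

With these assumptions, sampling at 1024 tokens keeps the effective positions inside the 0-512 range the model saw during training.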

quasi steppe
hearty flicker
#

Hmmn I think due to the way I'm tokenizing my sequences, there is an issue with just changing the code

#

Unless I'm making an obvious mistake

#

During pre-training the sequences normally have some padding on the end, I think around ~512 the model puts in a padding token and then fills the rest of the additional context with it too.

#

Not always immediately, however it kinda makes sense since <P> is a fairly common token at the end of a sequence

#

I can try setting the weight for the padding token to 0 and see if that helps

#

This is actually a pretty good find tbh, I should change this in my tokenization

sand nymph
#

@hearty flicker you should be masking the loss for the padding tokens, so that the model doesn't learn to generate them
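
A small sketch of what masking the loss means here, in plain Python so it runs anywhere. `PAD_ID` and the function name are placeholders, not Aria's actual values. In PyTorch the same effect comes from the `ignore_index` argument of `torch.nn.functional.cross_entropy`.

```python
import math

PAD_ID = 0  # hypothetical id of the <P> token


def masked_cross_entropy(logits, targets, pad_id=PAD_ID):
    """Mean cross-entropy over positions whose target is not the padding token."""
    total, count = 0.0, 0
    for row, tgt in zip(logits, targets):
        if tgt == pad_id:
            continue  # padding positions contribute nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tgt]
        count += 1
    return total / max(count, 1)


# Second position is padding, so only the first one is scored.
print(masked_cross_entropy([[0.0, 0.0], [5.0, -5.0]], [1, PAD_ID]))
```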

hearty flicker
#

Yeah this is something I've just overlooked, super glad that I found it to be honest

#

It doesn't matter in the context of generating sequences < max_seq_len

sand nymph
#

Well

hearty flicker
#

But in this specific case it does matter a lot

sand nymph
#

That makes it sound like you're misconfiguring it

#

If a sequence has max sequence length, why is there a padding token at all?

hearty flicker
#

<P> tokens can only be found either directly after a <E> token, or within the range max_seq_len - 3 < x < max_seq_len, so as not to truncate a note midway

hearty flicker
#

I originally implemented it so that a note wasn't truncated midway (as it is described by 2 tokens)

quasi steppe
hearty flicker
#

Yah this is just something I've overlooked, glad I found it now

hearty flicker
sand nymph
hearty flicker
hearty flicker
hearty flicker
#

I just finished training another small version of the model on a slightly larger dataset. The new data is hand sequenced and has made a big difference - I'll upload some more samples throughout the day

hearty flicker
#

It seems to do a good job understanding musical form without being explicitly told about it, which is quite surprising. For those familiar with the fugue form, check out this non-cherrypicked sample. Still quite short since I'm using 512 sequence length. The real piece is Bach's C major fugue, BWV 846 - https://www.youtube.com/watch?t=131&v=_3qnL9ddHuw&feature=youtu.be

timber talon
#

hey @hearty flicker this sounds exciting and i'm into it, apologies for the delay. I really like the idea of 50% of the paper being focused on fine-tuning... there are so many tasks already out there that can test for musical comprehension, even beyond the ones we've already discussed for eval, but they don't really get incorporated into GenAI papers at all.

Can I ask a more basic/fundamental question though, that's been on my mind?

Applying generative LLMs to other benchmark tasks usually works because the tasks can be reformatted into seq2seq tasks. Like, for a document classification task, instead of calculating probability vector p(y | x), we can ask the LLM to just generate the name of the class.

It's less clear to me how to transform many of the MIREX tasks into seq2seq musical representations that a generative music model could output.

The alternative is to use the output embeddings of the decoder, and then put a linear classifier layer on top of that. However, in text stuff i've done, I've found that doing this with autoregressive GPT models produces a worse classifier than using a bidirectional encoder model like RoBERTa.

So my question is — do you think it's worth it to also think about masked modeling, or encoder/decoder setups for this part of the paper?

hearty flicker
timber talon
#

octupleMIDI encoding nice lol

#

this is a really cool paper, exactly what i was curious about. Their tasks are also lacking .. in theory we could improve every aspect of this paper.

I guess a counter argument to my question is that maybe there is a way to format a lot of tasks as seq2seq tasks, would just require some thought

hearty flicker
#

That is a big motivator for me, all symbolic music research I've seen lacks in one way or another. The library I've built is 'basically' perfect as I've not been lazy anywhere. My hope is that this pays off when it comes to downstream tasks.

#

Generatively speaking, I think it's pretty evident. My supervisor today wanted to submit some of the fugues I've generated, he was pretty shocked by how well they adhered to the musical form.

timber talon
#

here is a philip glass midi, btw — would be a great "easy" test to see if form-understanding is possible

hearty flicker
#

For more MIR stuff, I think that there are ways to introduce seq-to-seq. I haven't looked into it too much yet but I'm quite excited.

#

There are ways to use autoregressive models to do information retrieval, although you are right that the BERT-style encoders are normally better for obvious reasons.

timber talon
#

it's actually not uncommon in NLP to have a generative model generate stuff, and have a separate encoder model evaluate it. BERTScore for translation and BARTScore for summarization are two examples of that

hearty flicker
#

I am very interested in training an MLM version of my current model, it would only take a few hours to implement but I haven't done it yet.

#

I'm currently training a context length 2048 version of aria on the small classical dataset I've been using. I'll try the Philip Glass midi on that when it is done!

#

I'm about to go to the pub with my friend, but I will get back to you tomorrow about this! Maybe we should set up a meeting sometime soon : )

timber talon
hearty flicker
#

I actually started off doing music stuff from the angle of non-autoregressive BERT style stuff! You can think of aria as pretty much being 1-1 with something like gpt. It really wouldn't be much effort to fork a version that does MLM instead.

#

Just need to change how the loss is calculated, the format of the (src, target) tensors, and the causal mask
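
Roughly, those three changes look like the sketch below. `MASK_ID`, `IGNORE` and the function names are placeholders rather than anything in the aria repo, and BERT's 80/10/10 replacement rule is left out to keep it short.

```python
import random

MASK_ID = 1    # hypothetical id of a <MASK> token
IGNORE = -100  # target value meaning "do not score this position"


def causal_mask(n):
    """Autoregressive attention: position i may look at positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """MLM attention: every position may look at every other position."""
    return [[True] * n for _ in range(n)]


def mlm_example(tokens, mask_prob=0.15, rng=random):
    """Build (src, target): masked inputs, and targets only at masked positions."""
    src, tgt = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            src.append(MASK_ID)
            tgt.append(tok)       # the loss is computed here
        else:
            src.append(tok)
            tgt.append(IGNORE)    # the loss skips this position
    return src, tgt


print(causal_mask(3))
print(mlm_example([5, 6, 7, 8], mask_prob=0.5, rng=random.Random(0)))
```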

hearty flicker
#

When including hand sequenced data in the training set, the generative quality of the model degenerates in some sense.

#

It seems to latch onto patterns of repetition and repeat them over and over again. Makes sense why it would do that.

#

I think it kind of confirms my suspicion that scaling up the dataset increases its quality as a pretrained model, however if you want to use it for generative stuff you need to finetune away some of those unwanted behaviors.

hearty flicker
#

Hey guys, give this a listen. This is from a 2048 context length model. AI generated after ~12 seconds in

#

Unfortunately when training with longer contexts, the model seems to gravitate toward repeating itself over and over again when using normal sampling temps (0.6 - 0.8). As a result, I have to bump the temperature up to above 0.9 to get music with some variation in it. This results in some bad samples which you can hear in the music.

#

In this sample towards the end it degenerates too.

sand nymph
#

I can tell what you mean about the repetition

hearty flicker
#

I use top-p already. I think there must be a sweet spot top-p and temperature setting that will do the trick.

#

If I reduce the temp too much it will just repeat the same bar over and over.
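
For anyone following along, this is roughly how the two knobs interact. It is a generic temperature plus top-p (nucleus) sampler in plain Python, with made-up function names and default values, not the code in the aria sampling library.

```python
import math
import random


def top_p_filter(logits, temperature=0.9, top_p=0.95):
    """Return {token_id: prob} after temperature scaling and nucleus truncation."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, temperature=0.9, top_p=0.95, rng=random):
    """Draw one token id from the truncated, renormalized distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


print(top_p_filter([2.0, 1.0, 0.0, -1.0], temperature=0.7, top_p=0.9))
```

Lower temperature sharpens the distribution before truncation, so fewer tokens survive the top-p cut. That is consistent with the "same bar over and over" behaviour at low temps.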

#

Since notes are encoded over two consecutive tokens (pitch and duration), I think that beam search might be a good idea. It's a bit of a pain to implement though...

sharp quiver
#

I don't think beam search is a good idea for music generation

sand nymph
#

My mother is a classical pianist and she said

They are pretty songs but the playing lacks emotion and nuance.
[I told her it was AI]
Interesting because I didn't think that at first.

#

Regarding repetition

Sometimes songs do that but the emotion of the player helps. It was particularly noticeable towards the end with the repeated high note.

quasi steppe
#

@hearty flicker have you tried CFG?

hearty flicker
hearty flicker
#

It is also interesting to note that the dataset I'm using for tests comes from a transcription model (audio -> midi). So in some sense the only primitive for that sample is audio

hearty flicker
quasi steppe
#

oh nvm, CFG might actually make it worse (sticking more to the prompt)

hearty flicker
#

I haven't heard of that but I'll have a look

hearty flicker
#

It's possible that the regularization I'm using is causing the 'repeating' problem. I'm going to run another experiment with double model size and a previous value I was using for weight_decay in AdamW

timber talon
timber talon
# hearty flicker

lol it goes from debussy to chopin
i keep wanting it to resolve and it doesn't lol

#

well at least we know it definitely understands chords, even if the progressions themselves are kinda atypical to say the least

hearty flicker
#

CFG sounds pretty promising especially for my use case. Will implement this in the sampling library
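
For reference, a rough sketch of CFG applied to next-token logits: run the model once with the prompt and once without, then push the conditional log-probs away from the unconditional ones. `cfg_gamma` mirrors the parameter name mentioned later in the thread; the helper names and the exact formulation are assumptions, not Aria's sampling code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_logits(cond_logits, uncond_logits, cfg_gamma=1.5):
    """Guided logits: uncond + gamma * (cond - uncond), computed on log-probs.

    gamma = 1 recovers the conditional model, gamma = 0 the unconditional one,
    and gamma > 1 exaggerates whatever the prompt made more likely.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + cfg_gamma * (c - u) for c, u in zip(cond, uncond)]


print(cfg_logits([2.0, 0.0], [0.0, 0.0], cfg_gamma=2.0))
```

The guided values are unnormalized, so they go through temperature and softmax again before sampling.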

timber talon
#

yeah i was thinking about it more last night. the debussy example you gave is definitely repeating a lot, but it's also repeating a phrase that's pretty far away from the prompt — like, it never really returns to the prompt after drifting away. "it goes from debussy to chopin" is real but it feels a little more pop/video game music, even, it seems like it just reverts towards some central mean in the dataset

quasi steppe
#

CFG also reduces diversity so I actually see the arguments for both sides. Would be very interesting to find out exactly what will happen

tender dragon
# hearty flicker

I have a background in this field, but I'm more of a math-and-music-cognition person, and less skilled in ML.

I have a music generation project, but it's algorithmic rather than NN-based. My experience is that using temperature to avoid simple repetitions seems ill-founded (I'm using it too). I know more about the harmony aspect of temperature, but let's focus on phrasing since that is what seems to be giving trouble at the end of this piece. I'll talk about the algorithmic approach with naive temperature, what succeeds and falls down, then how these lessons transfer over to ML.

So let's consider the setting of my algorithmic generator. First, we have an existing phrase, and we generate another phrase that is similar to it. With temperature, the most likely result is simply a duplicate of that phrase. And this sounds bad. With sufficient grasp of harmony and rhythm, we can create a large body of good possibilities and then turn the temperature up. Fine - now there's a large diversity of sounds, and repetitions are infrequent. But when a repetition appears, it still sounds bad, most of the time.

What if we simply disallow repetitions? Now things are always new, but it still sounds bad. All the notes are consonant and novel, but it seems unstructured/meaningless. And it has no coherence.

Generating a stochastic mixture of new and not-new phrases, with the right proportions, occasionally sounds good, but it's clear that's by luck rather than intelligence. Usually it sounds bad.

So something about our approach is not working. The reason is that there's a hidden problem here, of tension and resolution, that this algorithmic approach is not modeling. Apparently the new and not-new phrases need to be in the appropriate places: sometimes the listener wants repetition and sometimes she doesn't. And randomness is the wrong approach.

#

So now what? For my algorithmic approach, I already identified the tension issue and solved about 3/4 of it, and I need to solve the remaining 1/4 to get good results. But for ML, the issue is different. Can the NN understand tension and resolution from existing compositions? I haven't seen ML music generators do well with it. But the answer should be yes: if the NN attends to the right things, it should be able to perceive tension and resolution easily.

The solution is to look at how humans perceive tension: it's caused by how human music memory works. And human music memory has two steps:

The first step is feature-matching, as in Feature Integration Theory. When comparing to a phrase in the past, the first few notes are compared to each other. This is captured decently by the attention of a transformer, although labeling metric stresses (as a bit of feature engineering) should make this attention more accurate. The second step in human music memory is retrieval, which is sequential in the forward direction. Which means that in human attention, if note A retrieves note B from the past as salient, then the note after A will retrieve the note after B as salient too. And a "resolution" happens when there is a strong connection between the present note and what the human's memory is attending to - usually a repetition in features. In ML terms, this means attention for phrase retrieval should also be sequential after the initial search, to accord with human music memory. So if a note has high attention, the next note should have its attention boosted in the next step. (There's annoying details with stream segregation and rhythmic stress letting memory skip notes, so it's probably safer to let the NN decide which of the next notes to boost.)

#

Being able to recognize and use tension requires the NN to see both the current and past phrases together. Then repetition becomes an active choice, conditioned on whether the NN wants tension or not. The highest probability choices shouldn't be dominated by repeated sequences, because that is a "bad" choice, not just an un-diverse choice. Then the purpose of temperature would not be to avoid repetition of phrases, but to avoid repetition of pieces.

It's not clear to me that as-is transformers can perform this comparison of phrases, because the attention of the current note depends on "what was the attention of the past note". That seems to require some sort of recurrence or information passing.

tender dragon
#

I looked at OctupleMIDI encoding. Its position information actually captures the "labeling metric stresses" issue I mentioned, absent syncopation, but a transformer can already spot syncopation. So no feature engineering for stresses is necessary, OctupleMIDI solves it automatically.

hearty flicker
#

Messing around with high values for cfg_gamma and temp produces some very weird stuff haha

#

This prompt is quite out of distribution for the training data (since the training data is all live performance), so it often produces weird results anyway

hearty flicker
#

It's also got a lot to do with the data. If I train/ft a transformer on some very structural music with well defined cadences, it doesn't really have a problem perfectly reproducing it.

#

The data I'm currently testing on is very very varied and all from live performance (only 5000 recordings), so the model has a harder time producing very structural music. I do wonder if/how much structure would emerge when scaling the dataset 100x.

quasi steppe
hearty flicker
#

It still has the repeating problem, however part of me thinks that this sort of issue is unavoidable when you are working with smaller datasets.

#

Luckily it's straightforward (in theory) to scale it up, but will take some engineering effort on my end.

#

I'm actually super interested in trying to apply some of the ideas they used in alphafold to improve the audio -> midi conversion models.

#

To my ear, the cfg stuff is making a big difference. Even if the model is not perfect.

hearty flicker
#

Giving no prompt apart from the composer name & high temp gives weird results too...

quasi steppe
#

Yeah repeat might be a separate issue from CFG.

hearty flicker
#

The repeating problem has gotten a lot better in the version I just trained, I think it was especially bad before because I cranked up the weight decay in AdamW by 10x

quasi steppe
#

So the lower weight decay the better, even for small models?

When I train 3-13B models I set weight decay to 0 and saw no difference in eval 😂

hearty flicker
#

I think the model was too small to have the large value I used

tender dragon
# hearty flicker One of my supervisors has a lot of experience with this algorithmic stuff. I sup...

I agree with your perspective and don't encourage people to generate music algorithmically. The main barrier that I see is that music cognition science is strongly necessary for algorithmic composition, but this field is not well-developed. I was forced into music cognition research myself to cover the missing theories; it's not something I intrinsically care about. It's much easier to let a NN figure out things by training on compositions.

tender dragon
hearty flicker
sand nymph
#

@hearty flicker How is the research question formulation coming along? Like, concretely what are you hoping to show with this model?

hearty flicker
hearty flicker
#

Hey guys, I've been extremely busy this week. I'll update when I get a chance.

#

Btw I have made a colab notebook if anyone is interested! I'm a bit wary of publicly sharing it as I don't want the model weights to get downloaded from GCP lots and lots (leaving me with a large bill). If you want a link, dm me : ) It's super easy to use!

hearty flicker
#

Does anyone have a sense of how hard it is to build something on top of ggml? I've got a little C experience but not too much. I think it would be cool to build an applet to generate and play music in real time on ARM Macs.

hearty flicker
#

@here

Hey guys, update here - I'm planning to have this project released in mid-December. I have this entire month free to dedicate, so I reckon it is a realistic timeframe.

The first thing I'm going to work on is improving the Audio->MIDI transcription models and subsequently using this to significantly expand the live performance dataset I'm using (by 3-5x). If I can't make any improvements, I will use the current SOTA (Kong et al.) for this purpose. Either way, this should make a huge difference. Hopefully, I can get some GPUs for this purpose, as the actual transcription and audio processing is quite intensive.

I'm going to write a pre-print focused on generative controllability and the impact of pre-training scale. My supervisor wants me to submit it to ICML/NeurIPS; however, who knows if it will get accepted. If that fails, I can also submit to ISMIR in Summer 2024. He reckons that both the generative capabilities and controllability (using fine-tuning/CFG/meta-tokens) are SOTA (though it is hard to measure quantitatively). There are quite a few models I can compare this against. I have access to ~50 music-related researchers that I can use for a qualitative study, so I should be able to get some data regarding this. While the audio files are being transcribed, I'm going to build on the inference library; there should be some really cool possibilities for controllability.

#

If possible, I'd like to produce two sets of pre-trained models. The first will only be trained on files with an open license (in line with EleutherAI's policy). As most of the good data falls under this license, this model should be powerful. I have explicit permission to use the MIDI dataset from ClassicalArchives too, which I would love to include in this version. I'd like to release these checkpoints publicly and use them for the analysis in the aforementioned paper. The second set will be trained on all the data I have access to. I understand that I might not be able to use Eleuther's resources here (or release the weights); however, I am actually really interested in how much of a difference using the bigger pre-training dataset will make. This is super valuable research IMO; for me, it would strongly inform the correct direction of further improving models like Aria. I have also curated a few high-quality fine-tuning datasets which should be really interesting to experiment with (and release checkpoints of). There are a bunch of improvements to the model that I have planned; I'm hoping to do this while the audio files are being transcribed.

I'm planning to release a blog post, similar to the MusicLM one (https://google-research.github.io/seanet/musiclm/examples/), along with the model weights (in Dec). Since Eleuther is involved in this project, I'd love to release it with their affiliation; however, I can also release it on my personal website.

quasi steppe
# hearty flicker @here Hey guys, update here - I'm planning to have this project released in mi...

Yeah evaluating any Gen AI is hard. Was watching James Betker's talk this weekend and even OAI only does human evaluation.
I guess EAI might have some resources to help organize human evaluations, though I'm not super familiar with the details.
When we wrote the CFG paper we made a simple UI and collected some preference data from the public. Was really nice to have some real world numbers in our paper and I guess you can do a similar thing too.

hearty flicker
sand nymph
#

@subtle lance can help with licensing & policy questions (just not today, as they're traveling to the UK for the AI Summit)

@timber talon is also very interested in helping I know.

hearty flicker
#

Hey guys, I'd just like to make it explicit that I'm more than happy to add anyone who has been contributing to the discussion in this Discord as a co-author (if you would like). This is particularly relevant for @quasi steppe, @timber talon, and @sand nymph. Going forward, there are some other areas which I see as ripe for collaboration:

  • Fixing/improving the repetition problem via different prompting/sampling methods (such as CFG)
  • Exploration of other prompting/sampling methods
  • Improvements to the data preprocessing and augmentation
  • Improvements to the model architecture (there are definitely changes to be made here)
  • Optimization of the training process and regularization
  • Expansion of pre-training/fine-tuning datasets
  • Generating ideas about improving the tokenizer and the format of the meta-tokens

There are lots of other topics as well; I'll mention them as they come to mind. As I said I'm going to focus on Audio->MIDI for the next few weeks, if anyone has any thoughts on that area

#

Also if anyone wants to try out the colab notebook, you can have it here. Both the checkpoint it uses and the sampling hyperparams are far from ideal, so beware 👻

tender dragon
#

https://github.com/EleutherAI/aria/blob/8a9b40814d5fc358254d311a84675ad227ffc26a/aria/tokenizer/tokenizer.py#L355
You are encoding the time of messages as offsets between start times, rather than absolute times.
I think this is wrong. Human memory follows the grid of the meter: so suppose a meter is 128 midi ticks. Then at the beginning of a new meter, long-term memory only retrieves certain notes: 256 ticks back, 384 ticks, 512 ticks, etc. This requires it be able to see the absolute time difference between two notes (and also know the meter distance, which you might want to do some feature engineering on). I can provide some demonstrations of this effect. With Wait messages, the transformer is unable to see absolute time offsets because it would need to sum the Waits of all the notes in-between.

That means here: https://github.com/EleutherAI/aria/blob/8a9b40814d5fc358254d311a84675ad227ffc26a/aria/tokenizer/tokenizer.py#L348
I recommend you append timestamps to tokens, and remove all the "wait" tokens. So the token becomes (_onset_time, _instrument, _pitch, _velocity). The attention head will want to see some sine-like timestamp difference.
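
To make the difference concrete, here is a toy conversion from wait-style events to absolute onsets. The `(wait_ticks, pitch)` event format is invented for the example and is not the tokenizer's real format.

```python
from itertools import accumulate


def waits_to_onsets(events):
    """Turn (wait_ticks, pitch) events into (onset_ticks, pitch) events.

    Each wait is the gap since the previous note's start, so the absolute
    onset is the running sum of all waits so far.
    """
    onsets = accumulate(wait for wait, _ in events)
    return [(t, pitch) for t, (_, pitch) in zip(onsets, events)]


events = [(0, 60), (128, 64), (128, 67), (256, 72)]
print(waits_to_onsets(events))
```

With waits, the distance between the first and last note (512 ticks here) only exists as a sum over every event in between. With onsets it is a single subtraction.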

Duration is a different matter; I also recommend you append the duration to the token, but this is much less clear. It doesn't actually belong in either place, and although I understand how duration affects harmony, I don't have a good idea of where the transformer wants it to be, because there is an architecture mismatch. I don't have a demonstration for this, and you should think twice before following this recommendation on duration. You could just leave it as-is for now since it is less effort.

https://github.com/EleutherAI/aria/blob/8a9b40814d5fc358254d311a84675ad227ffc26a/aria/tokenizer/tokenizer.py#L343
I'm not sure why you are quantizing the velocities and times. If it's for regularization, I think data augmentation would be better, just like you are augmenting pitch and velocity.

warm tangle
tender dragon
hearty flicker
tender dragon
hearty flicker
#

I mean that I don't want the size of the set of possible tokens to get too large.

tender dragon
#

I understand that you are constructing MIDI from performances rather than from scores. however, I think meter detection should still be possible. and over a short range, even an inaccurate meter guess would be better than no meter guess

#

for these 4 features, I expect the embedding to look like this: (float of onset_time, category of instrument, category of pitch, float of velocity). a 4D vector, where only the middle two dimensions are categorical. so I don't see how quantization is reducing the vocab

hearty flicker
#

I do agree with your point about the arithmetic possibly being a problem, however I'm not sure that it isn't practically solved by having multiple attention heads (as is standard). I've not noticed timing issues to be one of the problems with the models as it currently is.

#

When I fine-tune on chorales for instance, it perfectly reconstructs the beat-bar structure.

#

I do think it would be a cool idea to try out though, see if it makes any difference.

tender dragon
hearty flicker
#

So what is the loss function used to train the model? What does the output layer look like?

tender dragon
# hearty flicker I do agree with your point about the arithmetic possibly being a problem, howeve...

here's an incomplete specification. for output, you can have (token type which can be note/end of song/start of song/etc, four dimensions for a note). so I've put 5 dimensions in a vector. the loss is case-wise. first, cross-entropy on the token type. suppose the token is a note - then in the four dimensions of the note, loss is squared error in first and fourth dimensions, cross-entropy in second and third dimensions. if the token type is not "note", such as end-of-song, the loss ignores those four dimensions. or if the token is something else which uses one dimension, then append one dimension to the output layer to have 6 dimensions, and apply the relevant loss to that dimension when the token indicates the dimension is relevant
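
Read literally, that specification comes out as something like the sketch below. The dict layout, the `NOTE`/`END` ids and the unweighted sum are assumptions made to get something runnable; none of it is in the aria repo.

```python
import math

NOTE, END = 0, 1  # hypothetical token-type ids


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def casewise_loss(pred, target):
    """Cross-entropy on the token type, plus note terms only when the target is a note."""
    loss = cross_entropy(pred["type_logits"], target["type"])
    if target["type"] == NOTE:
        loss += (pred["onset"] - target["onset"]) ** 2        # dim 1: float
        loss += cross_entropy(pred["instrument_logits"], target["instrument"])
        loss += cross_entropy(pred["pitch_logits"], target["pitch"])
        loss += (pred["velocity"] - target["velocity"]) ** 2  # dim 4: float
    return loss


pred = {"type_logits": [0.0, 0.0], "onset": 1.0, "instrument_logits": [0.0, 0.0],
        "pitch_logits": [0.0] * 4, "velocity": 0.5}
print(casewise_loss(pred, {"type": END}))
```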

hearty flicker
#

I'm quite skeptical of approaches like this in general, as it decouples information that should be coupled

tender dragon
quasi steppe
hearty flicker
#

Timing, pitch, duration, and velocity are all highly connected when it comes to predicting the next note. Doing a linear sum over the losses does not accurately represent the situation. This is why most transformer models (and all music transformer models as far as I am aware) are trained with CEL where the token space is a product of sets.

#

If you want to try out your approach, I do invite you to. You can fork the repo and adjust the tokenizer and loss function.

tender dragon
# hearty flicker Timing, pitch, duration, and velocity are all highly connected when it comes to ...

ok, I see what you mean. it doesn't make sense to add cross-entropy of pitch to the other information. in that case, I'll adjust my model. so my changed proposal is: remove velocity from the vocab, express it as an output float attached to each possible categorical output. so for an output token (time, instrument, pitch), the output is (probability, velocity). and the loss is cross-entropy on probability, squared-error on velocity. this is because I do expect velocity to add linearly

hearty flicker
#

A loud melody over a quiet chord for instance.

#

I do think your idea about the meter has some merit, but I foresee a bunch of complications when implementing it that really, really complicate things.

#

And I'm really not sure that it would actually help anything. With just eight attention heads I haven't noticed timing being an issue, I'm really not sure that introducing a small meter would actually improve anything. If you want to experiment with it though, I would be interested in the results.

tender dragon
hearty flicker
#

Anyway, with the quantization I'm using I don't need to use this trick anyway. It's not really perceptible (both for timing and velocity) so why change it?

tender dragon
#

the second proposal has one velocity per possible token. so it has separate predictions for loud C and quiet A.

changing is to reduce vocab by representing the continuous dimension as a continuous rather than discrete variable. but it's your decision.

hearty flicker
#

If the output space is (p, v) (p \in R^n, v \in R_{>0}), are you suggesting to train on the loss L_1(p) + L_2(v) where L_1 is CEL?

tender dragon
#

in your existing formulation in your code, I assume that for each possible output (velocity, pitch, instrument), you output a probability through softmax. is that correct? so it's a ~127 x 127 x 10 vector

hearty flicker
#

Velocity is quantized so it's roughly 127x10x10

tender dragon
#

in my formulation, it's a 2 x 127 x 10 vector. the 2 is for (probability, velocity). it's CEL on probability as you describe. and the L_2(velocity) only kicks in if the note matches

hearty flicker
#

I don't think that is true. I think instead of 2 you need 127

tender dragon
#

for each (pitch, instrument), we output (probability, velocity)

#

and velocity is a float
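A rough sketch of this (probability, velocity) head, assuming one logit and one float velocity per (pitch, instrument) pair. The function name and the flat indexing are hypothetical; the sizes 127 and 10 are the ones quoted in the thread.

```python
import math

def prob_velocity_loss(logits, velocities, target_index, target_velocity):
    """Cross-entropy over all (pitch, instrument) entries plus squared error
    on the velocity attached to the true entry only."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    ce = log_z - logits[target_index]
    # Only the velocity of the true note has a target; the other predicted
    # velocities are left out of the loss and receive no gradient.
    se = (velocities[target_index] - target_velocity) ** 2
    return ce + se

N_PITCH, N_INSTRUMENT, N_VELOCITY_BINS = 127, 10, 10
n_entries = N_PITCH * N_INSTRUMENT             # one entry per (pitch, instrument)
output_size = 2 * n_entries                    # a logit and a velocity per entry
quantized_size = n_entries * N_VELOCITY_BINS   # classes when velocity is binned
```

Under these numbers the float-velocity head has 2,540 outputs, against 12,700 classes for the quantized version.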

hearty flicker
#

Oh I think I see what you mean now! I still have doubts though, you would need to use the velocity loss function in quite a weird way since you don't have targets for all bar one of the predictions. I really don't think that vocab size is a problem anyhow, but if you want to try it out I'd be interested to know if it does end up working.

#

I have tried a variety of ways of messing with the loss function in the past, and unfortunately all of them have gone badly. Turns out a bog standard transformer is pretty good at learning the structure of sequences, and is able to overfit the train set even with extreme data augmentation.

tender dragon
#

I'm very unlikely to try out model changes before your project finishes in December, but here's the extension of that loss function to time that I'm envisioning. first, time tokens are removed. then, for an output token (instrument, pitch), the output is (probability1, time1, velocity1, probability2, time2, velocity2, ...,). and let's say there are 4 of these triplets of (probability, time, velocity). what this does is mix the time in with the other note properties, so that it's no longer independent. the output dimension becomes 4 x 3 x 127 x 10. but it becomes complicated: it would probably generate notes out of order, and handling the loss function is probably not worthwhile in terms of improvement-to-effort ratio for your project. it's not a free win, and now that I know more about your project, it isn't something I necessarily encourage.

#

as for your advisor's opinion that your model is SOTA, I agree with it. I see some important medium-range correlational qualities in your generated music that I don't see in other models. I expect a large reason for this is the advantage of attending to simple MIDI messages, rather than attending to waveforms or waveform transformations.

hearty flicker
#

I really appreciate your input by the way, definitely good to have original ideas swimming around in my head

tender dragon
# hearty flicker Oh I think I see what you mean now! I still have doubts though, you would need t...

you would need to use the velocity loss function in quite a weird way since you don't have targets for all bar one of the predictions

For the loss function specifically, using velocity as a float is better than quantizing to 10 positions. The issue with float velocity is that only one predicted velocity is being trained. But with 10 quantized velocities, the same problem applies, only 10x more - you still only get training for one prediction, but now the prediction is split into 10 velocities, and only one of these 10 velocities is being trained. It's like fitting a histogram. Whereas with a float velocity, the gradient pushes the velocity to the right value directly.

hearty flicker
#

I haven't noticed any problems related to velocity to be honest. Having said that, we could run an experiment with it completely omitted from the tokenizer, and compare the generative results (as far as pitch and timing are concerned). Could be a good way to tell if the way velocity is currently being handled is causing any degradation. My gut feeling is that it isn't, but it's good to be safe.

timber talon
# hearty flicker Hey guys, I'd just like to make it explicit that I'm more than happy to add anyo...

@hearty flicker sounds great! I'm excited to get things working on my end, and see what kinds of hidden-layers get generated.

@quasi steppe feel free to message me about CFG or let's chat in this thread if we have interesting examples.

Regarding Audio->MIDI, @hearty flicker , can you say more? This is something I've been thinking about, too. Is the idea to be able to generate MIDI from raw audio files? I'm sure this is a whole line of work. Is the idea to do this for training data creation, or just for general interest?

hearty flicker
#

It's incredibly promising and can in theory be scaled up to as much high-quality (piano) audio as one can obtain.

#

I'm personally going to work on this specifically over the next few weeks. Hopefully we can then introduce some of the advancements in an async way whilst the expanded dataset is being built.

timber talon
#

ahh i remember we talked about this before

One comment on that: I think we should assume the transcription function will be noisy but possibly in a biased way, like certain elements (e.g. trills) will be consistently mistranscribed. In order to prevent our model from overfitting to that bias, it would be great to measure what kinds of errors transcribe(audio) = midi makes.

Trying to think how best to measure that. Possibly for a gold set of pieces that we have real audio AND real midi for (i.e. with the MAESTRO dataset), it would be interesting to see a list of the errors. And then, we'd have to think about how to de-bias. One way would simply be to warm up the model with pairs of clean/noisy midi, and then freeze parts of the model before training on the audio-transcribed data for which we don't have any gold MIDI.

In other words:

input: midi dataset X, audio for X (X_audio), audio dataset Y

// warmup
noisy_X = transcribe(X_audio)
model.train(concat([X, noisy_X]))
model.freeze(layers=[0, 1, 2, ...])

// training
model.train(transcribe(Y))
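To get the list of errors on a gold set, one option is to match transcribed notes against gold notes by pitch and onset. The sketch below assumes notes are plain `(onset_seconds, pitch)` tuples and uses a 50 ms onset tolerance; both the representation and the function name are assumptions for illustration, not part of any existing pipeline.

```python
def compare_transcription(gold, transcribed, onset_tol=0.05):
    """Greedily match notes by pitch and onset time.

    Returns (matched, missed, extra): matched pairs, gold notes with no
    counterpart, and transcribed notes that matched nothing.
    """
    unused = list(transcribed)
    matched, missed = [], []
    for onset, pitch in sorted(gold):
        candidates = [n for n in unused
                      if n[1] == pitch and abs(n[0] - onset) <= onset_tol]
        if candidates:
            # Take the closest onset and make sure it is not reused.
            best = min(candidates, key=lambda n: abs(n[0] - onset))
            unused.remove(best)
            matched.append(((onset, pitch), best))
        else:
            missed.append((onset, pitch))
    return matched, missed, unused
```

Aggregating `missed` and `extra` by pitch range, note density, or ornament type would then show whether the errors are systematic rather than random.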

Another comment: if we're looking for alternative sources of data, how about running OMR on IMSLP, or other freely available PDFs out there? Apologies if we talked about this before, I'm remembering we've definitely been down this route before...

quasi steppe
#

@hearty flicker
https://gist.github.com/honglu2875/f3a1c78970ad055e758d0a9fa8e09e47
I implemented kv caching here. Only renamed the model.py and sample.py and put them as gist because I haven't carefully written unit tests for the logits with/without kv-caching. But I generated a couple samples and they sound alright so this is likely correct. Also did some other minor optimizations. They speed up the generation by quite a lot.
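A toy version of the with/without-cache check mentioned above, written in plain Python for a single attention head. It is not the code in the gist; the function names are made up, and a real test would compare model logits rather than raw attention outputs.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / z for i in range(dim)]

def full_causal(qs, ks, vs):
    """Recompute causal attention for every position from scratch."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]

def cached_decode(qs, ks, vs):
    """Process one position at a time, appending its key/value to a cache."""
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs
```

In a real model the saving comes from not recomputing the key/value projections of earlier tokens at each step; the two paths should agree up to floating-point error.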

I also replaced the RoPE by the one from huggingface NeoX codes. To apply PI or NTK it should be a trivial swap using those in the same file in huggingface repo.
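For reference, a small sketch of how PI and NTK-aware scaling change the rotary angles. It assumes the usual RoPE frequencies `base ** (-2i / dim)` and the commonly used NTK-aware base adjustment; the function and parameter names are illustrative and are not taken from the Hugging Face code.

```python
def rope_angles(position, dim, base=10000.0, pi_scale=1.0, ntk_alpha=1.0):
    """Rotation angles for one position; dim is the (even) head dimension."""
    # NTK-aware scaling enlarges the base; ntk_alpha=1 leaves it unchanged.
    base = base * ntk_alpha ** (dim / (dim - 2))
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    # Position interpolation (PI) divides positions by the scale factor.
    return [(position / pi_scale) * f for f in inv_freq]
```

With `pi_scale=2`, position 4096 gets the same angles as position 2048 did in training. With `ntk_alpha=2`, the highest frequency is untouched while the lowest is halved, so short-range resolution is preserved.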

Need to go to bed now but if everything is totally alright I can submit a PR tomorrow.

quasi steppe
#

@hearty flicker @timber talon Listen to this one! Make sure you stay until the end! I just interpolated chopin with bach, not perfect but kinda interesting lol
@timber talon relevant to us because I used bach as "negative prompt" but made CFG < 1 (CFG=0.8) so negative of negative is positive (=interpolation)
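Why a guidance scale below 1 interpolates: under the standard classifier-free guidance mix (assumed here, since the exact formulation used for these samples isn't shown), the output logits are `negative + scale * (cond - negative)`, which is a weighted average of the two prompts whenever `0 < scale < 1`. The function name is hypothetical.

```python
def cfg_logits(cond, negative, scale):
    """Classifier-free guidance mix of two logit vectors.

    scale > 1 pushes away from the negative prompt (extrapolation),
    scale = 1 returns the conditional logits unchanged, and
    0 < scale < 1 blends the two prompts (interpolation).
    """
    return [n + scale * (c - n) for c, n in zip(cond, negative)]
```

With `scale=0.8` the result is 0.8 of the Chopin-conditioned logits plus 0.2 of the Bach-conditioned ones.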

quasi steppe
#

another way of interpolation allowing the end not to be exactly bach. This one is interesting (where is bach lol)

#

hmmm this one is quite cool. It somehow stuck to E flat until the end (that bach prelude is C major) but it tries to repeat a chord that relates them

#

This really made me feel that to generate great music we should traverse through the latent space along a path

quasi steppe
#

Dynamic YaRN s=8 (scale factor), GEN_LEN=8192. The attention scale uses llama2's parameter so likely not optimal.
The first sample completely broke down somehow...

hearty flicker
#

I don't want to break the notebook if we are going to make architecture changes

quasi steppe
hearty flicker
#

While we are at it we should probably make other changes in this vein

#

Any thoughts on Swiglu and mqa/gqa?

#

Also if there are any other issues with the implementation. I'm pretty sure it's up to date but idk

quasi steppe
hearty flicker
#

I might as well change swiglu I think, I don't see a reason not to

#

I originally just implemented the changes from llama1

#

@timber talon could we have a meeting about experiment planning? I've got some ideas already

#

If I want to use my department for a survey, I probably have to submit an ethics proposal or something along those lines

#

@quasi steppe have you got the context length extension stuff implemented?

#

I'll create a dev branch and we can merge prs into that

hearty flicker
#

Does it work? lol?

quasi steppe
#

not really working haha

hearty flicker
#

Current checkpoint is 2048 and about 100m params

quasi steppe
#

but I haven't double-checked carefully. Also the parameters are for llama2 and likely need tuning

hearty flicker
#

I think final version could be 4k/8k context with 200m - 800m params

quasi steppe
#

@hearty flicker what do you think about the interpolation experiments? I feel the quality really improved and it doesn't do boring stuff

hearty flicker
#

I think it's very very cool lol

#

My supervisor was going to work on exactly what you implemented, seems you have beaten him to it lol

quasi steppe
#

if we have enough context length, we can use like 4-5 prompts and interpolate them in different intervals and eventually come back to the first prompt. That could end up with a complete piece of music

hearty flicker
#

Yes and btw in the lit, this is a big unsolved problem.

#

Controllability of these types of models, that is.

#

So these results are very cool

#

I'm going to refactor the inference part of the lib so we keep better track of stuff.

hearty flicker
#

I have some at my university too but their system is really really annoying to use, so I mainly just use my own server (which has a 4090)

quasi steppe
hearty flicker
#

It's all in the repo actually

quasi steppe
#

right now there are 80x A100 idling lol

hearty flicker
#

All you need to do is build the datasets, I suppose I could also just give you a dl script for it too

#

I also have access to the SAI cluster, I didn't realise that there was that much idle compute wow

hearty flicker
#

Ok I'll tell you what I'll do. I'll update the HOWTO to explain how to train and finetune stuff

#

it's like 3 cli commands so not hard

#

I haven't tested that my train script (accelerate) implementation works in a distributed setting.

quasi steppe
#

how many tokens does your dataset have?

hearty flicker
#

idk it's like 3gb?

quasi steppe
#

if it's usual gpt2 tokenizer in typical LLM that's about 500M tokens I think but I don't know about your tokenizer

#

at this scale it's probably only gonna take 1 single node

hearty flicker
#

Nah it's way less than that because of how inefficiently the data is stored.

#

If you install the repo and run python tests/test_data.py you will see what the dataset files look like in tests/test_results

#

The checkpoint you are currently using took 24 hours on a 4090, so I doubt we would ever need more than one node.

#

If you really want to run a full experiment (for pretraining), we could run an experiment on the full dataset (about 50x larger than I've used for tests).

#

That does include non-classical data though, so it would be more of a pretraining checkpoint than anything else

#

I actually have class from 1pm today, but I'll work on this stuff now and get back to you in a few hours.

quasi steppe
#

yeah I will play with it later as well. Need to grab lunch

#

Oh you are in Europe as well?

hearty flicker
#

I'm in London : )

quasi steppe
#

Awesome

quasi steppe
#

I'm spending my lunch time listening to all my interpolations.
Man this one is like a real improvisation...

hearty flicker
#

It's cool that you are enjoying it lol, I also think it's pretty fun. I really want to create the best possible version of this stuff.

#

I think expanding the dataset will really make a huge difference. That is why I'm so focused on the audio->MIDI stuff rn

#

In theory you could just scale this up to all recorded piano on the internet

#

I mean, even with just 5000 recordings you can get pretty cool stuff

quasi steppe
hearty flicker
#

I'm on an AI-Music CDT

quasi steppe
#

Oh Queen Mary, I have a professor friend there

hearty flicker
#

Only a month in really, before this I was working on this stuff whilst being a software engineer / data scientist

#

Before that I did a ug/masters in mathematics (geometry) at Imperial

quasi steppe
#

cool!

timber talon
#

Oh man @hearty flicker you got @quasi steppe interested, this project will probably wrap up in a few weeks lol 😂 just like CFG, which took 3 weeks total

#

Sure, I can meet whenever. I'm pretty free today, tomorrow. Only thing is I'm in California

hearty flicker
#

@quasi steppe I should probably merge the fine-tuning code before running another experiment, should be sometime today I'll let you know

hearty flicker
#

fine-tuning code is merged, you can also find a decent explanation of how to train a model in the 'quick overview' section in HOWTO.md

#

I can also share others on the gcp bucket if people want : )

quasi steppe
timber talon
#

I know!! I've been thinking about that

#

I wonder how well we could learn the autoregressive hyperparameter

#

Or whether there's a parameter for the a section and b section

#

@hearty flicker I'm free any time in that window. 6pm gmt is fine. @quasi steppe are you free to join?

timber talon
hearty flicker
#

I'm off tonight actually, I was planning on tomorrow. I'll merge the pr tomorrow @quasi steppe

quasi steppe
#

oh sorry I didn't check. Yeah tomorrow sounds good

hearty flicker
#

Ok guys let's do 6pm today? lmk if 7 is better for you @timber talon

#

I'm writing a small tool on top of spotify_dl for downloading and keeping track of all of the music (for the transcription stuff).

hearty flicker
#

Actually guys let's reschedule for another day if possible. Some irl stuff has come up that I need to deal with tonight...

timber talon
#

oh shoot ok no worries

#

let's see. I'm probably doing an all-day train trip either tmrw or Friday. But I'm free whatever day I don't do it. And can do earlier too, like 5pm GMT

hearty flicker
#

Ok that sounds good, let us know which day you have free and we can try to schedule a time that works for @quasi steppe as well. I'm free pretty much every evening apart from tonight.

quasi steppe
#

yeah me too. Evening is good with me

timber talon
#

ok i'll let you know as soon as I decide. Will likely be in the next few hours

hearty flicker
#

Right now I'm compiling a huge list of spotify links to various classical piano recordings

#

I reckon I should be able to pretty easily 5x the GiantMIDI dataset. Really depends how much of this mind-numbing work I can take lol

timber talon
#

what does it entail?

#

I'm happy to split some mind-numbing work

hearty flicker
#

I actually am a bit burnt out on coding rn, so it's good for me haha

timber talon
#

are you literally clicking "download" for classical piano albums lol?

hearty flicker
#

Basically the pipeline will be: compile a huge list of links to spotify albums -> use spotify_dl to download as many as possible -> transcribe them to MIDI using (Kong et al.)

timber talon
#

ok cool

#

lmk if you need help, whether you wanna set up a google doc or something

hearty flicker
#

I'm going to write a proper script so that it can be managed properly and we won't have to redownload from scratch etc.
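One way such a script could avoid redownloading is a JSON manifest keyed by link. The sketch below is only an outline under that assumption: `download` and `transcribe` are placeholder callables supplied by the caller, standing in for the spotify_dl step and the Kong et al. transcription step, whose real interfaces are not shown here.

```python
import json
from pathlib import Path

def load_manifest(path):
    """Read the manifest if it exists, otherwise start with an empty one."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

def process_links(links, manifest_path, download, transcribe):
    """Download and transcribe each link once, recording progress on disk.

    download(link) -> audio path and transcribe(audio_path) -> midi path
    are hypothetical caller-supplied functions.
    """
    manifest = load_manifest(manifest_path)
    for link in links:
        entry = manifest.setdefault(link, {"audio": None, "midi": None})
        if entry["audio"] is None:
            entry["audio"] = download(link)
        if entry["midi"] is None:
            entry["midi"] = transcribe(entry["audio"])
        # Save after every link so an interrupted run can resume.
        Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Rerunning with a longer list only touches the new links, and the manifest doubles as a record of which recordings are in the dataset.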

#

I'm a bit burnt from coding today though, so I may as well start compiling the list haha

#

I think the best way to be systematic and efficient is to go by pianist instead of by composer. Easier to navigate that way I think.

timber talon
#

aw man i know that feeling haha

hearty flicker
#

But you have to be careful because you can only use piano-only recordings, no concertos or anything

timber talon
#

tricky, yeah... links can be whole albums, or just songs?

hearty flicker
#

With the GiantMIDI dataset, they took a lot of care to only have 1 version of each piece. For our purposes, this is actually not what we want at all

#

As many different versions of each piece as possible!

#

That's the best version of data-augmentation lol

timber talon
#

yeah... Yuja Wang has so many recordings of the same Rachmaninoff pieces lol

hearty flicker
#

I think that actually transcribing these recordings might be very GPU intensive, I might be able to persuade the head audio guy at SAI to help out with compute if it's too demanding for EAI.

#

My supervisor knows him decently well apparently

timber talon
#

cool. so why don't you shoot me a google docs link with what you have already, we can throw composers on there and divvy it up. I need a few hours this morning to get some models running for an unrelated thing, but should be free soon

#

ok

hearty flicker
#

Cool : )

#

We just have to make sure that it's piano only, no rach2 or anything like that

timber talon
#

for sure

hearty flicker
#

because then the models will try to transcribe the other instruments as piano too lmao

timber talon
#

hmm. I know we can find underutilized clusters.... what memory requirements are we talking about here? Just because of high-batch? or even batch size=1 is high mem?

hearty flicker
#

I'm not even sure yet. Apparently the GiantMIDI dataset took 300 GPU hours or something like that.

#

Tbh I'm not super familiar with audio stuff yet so I'm not sure.

#

I'm going to be researching this over the next few weeks pretty hardcore though

sand nymph
#

What are these 300 GPU hours for? Data augmentation?

hearty flicker
tender dragon
hearty flicker
#

The SOTA for doing this transcription uses a neural net

timber talon
sand nymph
#

300 what hours? A100? A40? 2080 Ti?

hearty flicker
timber talon
#

i would bet it's smaller GPUs. audio-models tend to be smaller CNNs

sand nymph
#

I expect this to be non-problematic to run

timber talon
#

whisper for instance can run on 12GB gpus

hearty flicker
#

I'm aiming to get this setup in an async way, so we can transcribe the audio while doing other stuff that needs to get done

timber talon
#

we have loads and loads of those just sitting around

hearty flicker
#

That's why I'm working on it now

#

Yeah I'm pretty sure vram isn't really an issue, so a100s are not really needed.

sand nymph
#

Cool, we can easily throw 16 2080s at this and do it in a day

hearty flicker
#

Yeah that would be cool, thanks @sand nymph

timber talon
#

definitely.... I'm sure we all have so many of those clusters just sitting around. another lab group at uni tried to donate some 2080s to the general university cluster, and no one even wanted to maintain them

timber talon
hearty flicker
#

I find it funny that there are only like 50-100 Chopin recordings in the current testing dataset

timber talon
#

clearly with a large enough dataset, we're not gonna be too bad, but there is the risk that everything we generate will start to sound like rachmaninoff, no matter what the prompt is

hearty flicker
#

Of just the etudes alone, imagine how many recordings exist lol

#

That's definitely an issue, the nice thing about using spotify_dl is that I can tag everything with metadata

#

So we will have a rough sense of the composition of the dataset. In Kong et al. they do this type of analysis too I believe.

timber talon
#

i agree there's a lot of potential, here! i worry a bit about transcription noise, but i'm looking at the Giant-MIDI paper and those eval numbers are pretty impressive

#

idk if you already posted it, if you did, my bad

quasi steppe
#

it works for any mp3? I'm thinking about some non-public sources

timber talon
#

there's nothing in the paper about inference memory requirements. they just say training takes 1 V100 card, 32GB

#

and the model is 20m params

tiny coral
#

Since you're looking at gathering data from more sources, is there any interest in exploring semantic data deduplication? I.e. create an embedding model trained on tokenized midi to measure similarity
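A minimal sketch of what embedding-based deduplication could look like once such an embedding model exists. The embeddings are taken as given (plain lists of floats); the 0.95 cosine threshold and the function names are arbitrary assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def deduplicate(embeddings, threshold=0.95):
    """Return indices of items kept: each item is kept unless it is at
    least `threshold`-similar to an item that was already kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

This is quadratic in the number of files, so a large corpus would need an approximate nearest-neighbour index instead of the nested comparison.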

hearty flicker
#

I mean, if nothing else we should be able to just expand GiantMIDI

hearty flicker
tender dragon
tiny coral
#

I recently noticed a data quality issue in GiantMIDI. Some youtube channels that are included in the data have outro tracks unrelated to the main piece. In retrospect, these explain a common strange behavior from the RWKV models I trained some time ago. For example, this is one of the source audios for the dataset: https://www.youtube.com/watch?v=Zj_psrTUW_w
I have confirmed that this is all transcribed into the raw midi file.

#

I suspect we will also find single files containing multiple movements in series. I'm going to add a step in my own preprocessor that splits large delays into multiple documents, and add a minimum document length filter.
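A sketch of that preprocessing step, assuming notes are `(onset_seconds, pitch)` tuples sorted by onset. The 10-second gap and the minimum length are placeholder values, and the gap is measured onset to onset for simplicity; a real version would more likely measure from the previous note's release.

```python
def split_on_gaps(notes, max_gap=10.0, min_notes=3):
    """Split a note list wherever the gap between consecutive onsets exceeds
    max_gap, then drop documents shorter than min_notes."""
    documents, current = [], []
    for note in notes:
        if current and note[0] - current[-1][0] > max_gap:
            documents.append(current)
            current = []
        current.append(note)
    if current:
        documents.append(current)
    return [d for d in documents if len(d) >= min_notes]
```

A short outro jingle after a long silence ends up as its own document and is removed by the length filter if it is short enough.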

hearty flicker
timber talon
#

i think it's because they download from Youtube, through some complicated pipeline where they first scrape IMSLP and then query youtube. they have a whole section on their error rate... section 5.2

#

i have a feeling that @hearty flicker 's approach with spotify will address both the outro issue and the mislabeled titles issue

#

I'm betting spotify's data is better quality than Youtube

hearty flicker
#

Yeah I also skimmed the paper a few hours ago

#

I feel like spotify_dl is a much better approach

timber talon
#

for sure

#

(what's the licensing if we do that? it's probably like, don't release anything, right?)

hearty flicker
#

I've just tested it out and it's surprisingly good, it gets the correct mp3 like 99% of the time

timber talon
#

cool!

hearty flicker
#

I mean it's all technically downloaded from youtube too, it's just done in a way where it only downloads if there is a complete match

#

They released the mp3 files for GiantMIDI, however I would be skeptical doing that myself.

timber talon
#

spotify_dl downloads from youtube? oh sorry, my bad

hearty flicker
#

Yeah it does, but only complete matches. Whoever made it did a really good job

timber talon
#

gotcha, sorry was misunderstanding, but yeah... it sounds better than GiantMIDI's approach

hearty flicker
#

I also think it's good to remember that ideally this stuff will be used for pre-training, higher quality datasets can be used to tune the specifics. My experience is that this sort of approach works in practice.

timber talon
#

got it, that makes sense

sharp quiver
#

Why use spotify_dl rather than anything else? From reading the conversation, I'm assuming piracy is not an issue, so why not just download the actual rips from CDs?

quasi steppe
#

My guess is that there might be legal risks to circulate those data empties

sharp quiver
#

I mean, if spotify_dl gets exact matches, it's probably getting official uploads which are probably identical to rips except degraded

quasi steppe
#

I bought a 1T hard drive 11 years ago just to save my collections

sharp quiver
#

Reminds me I need to rip the Bach collection someone gave me a while ago...
It's the full set, 155 CDs berk I'm not even a Bach kind of guy

hearty flicker
#

Whoever made this tool was pretty smart, I might look into the src tomorrow to see how exactly it's finding these matches

minor hamlet
#

Thought i would ask here, what's the best way to get midi embeddings? Similar to say a t5 encoder module for text.

#

could I get some from the rwkv model?

#

im not too sure how that works however

versed trench
#

@quasi steppe hey there! Do you have any other opensource implementations for your cfg experiments? Can take other discussions to DMs if needed

quasi steppe
hearty flicker
#

@timber talon nice job on the albums : )

timber talon
#

Oh thanks haha WIP

hearty flicker
#

I like this piece, it might make an interesting prompt

#

There is also audio post-processing which makes everything sound way better.

#

@here This sample is pretty cool! This prompt works well.

#

The audio post-processing also goes a long way

hearty flicker
#

Here is an updated version of the notebook that will automatically convert to mp3 (you can listen inside the notebook)

hearty flicker
#

Ok I just put the orchestral version of Rach 2 (2nd movement) and the results are weird as hell haha

timber talon
#

Darn, i mainly listen to music by composer, not by performer.
But compiling this spotify list, it makes sense to be performer-centric. It's CRAZY how many of the biggest performers all play the same few composers

#

Like, Chopin over and over again. A bit of Schubert. Rachmaninoff. Repeat. lol

hearty flicker
timber talon
#

I'm gonna try to get some Bach on there. Glenn Gould is the undisputed master but he hums along with most of his tracks......

hearty flicker
#

btw do not under any circumstances put Autumn Leaves into the model, if you want to keep your ears

timber talon
#

lmao

hearty flicker
#

To be fair, it does surprisingly well given the circumstances lol

hearty flicker
#

Once we get the inference library built out and the context length extended, I think my dream of a never ending fugue might become a reality ha!

hearty flicker
#

There is a bug with the MIDI processing which is resulting in some notes being played as staccato when they shouldn't be. Gonna look into this tomorrow morning...

#

I thought I fixed this bug ages ago but obviously not...

hearty flicker
hearty flicker
#

Schubert, this one has pretty decent long term structure before the end

hearty flicker
#

Hey @here, the following is a rough list of things for the next week:

  • I'm going to fix some bugs related to MIDI conversion, which occasionally cause weird staccato notes.
  • Adding automatic audio processing to the inference library. Additionally, I'll refactor this part of the codebase to make it more sustainable.
  • @timber talon and I are going to work on experiment planning. I'll try to create a first draft (Google Doc) today or tomorrow.
  • @quasi steppe and I are going to conduct a pre-training experiment. I need to make some changes to the TokenizedDataset code first. After that, we'll aim to run an experiment on the SAI cluster this week. It might be cool to use a slightly larger dataset for this purpose. While we're at it, we might as well conduct some experiments with fine-tuning.
  • @timber talon and I are going to continue compiling the list of Spotify links. Great work on this, by the way, @timber talon! 🙂
  • I'm going to write the code for downloading the audio (spotify_dl) and the code for performing the audio->MIDI transcription in parallel on multiple GPUs.
  • There are several composer people that I'm going to reach out to, to see if they are interested in the project. There is one guy in particular (https://www.youtube.com/@cedarvillemusic) who might be interested in the Fugue stuff. This might also tie into the paper, as @timber talon suggested in the meeting.
hearty flicker
quasi steppe
# hearty flicker Hey @here, the following is a rough list of things for the next week: - I'm goi...

I made the dataset according to what I proposed using Huggingface datasets api. It's a fixed parquet file with samples duplicated 20x and concatenated, regrouped (2048 length) and shuffled. Also applied your data augmentations as well. It's only 2Gb thanks to the parquet format and we have about 1.6B tokens in it.
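The duplicate, concatenate, regroup, and shuffle recipe can be sketched without any library, assuming each sample is already a flat list of token ids. The names `pack_dataset` and `augment` are hypothetical, and this does not use the Hugging Face datasets API that the actual parquet file was built with.

```python
import random

def pack_dataset(sequences, copies=20, block_size=2048, seed=0, augment=None):
    """Repeat every sequence `copies` times (optionally augmented each time),
    concatenate into one token stream, cut into fixed-length blocks, shuffle."""
    rng = random.Random(seed)
    stream = []
    for _ in range(copies):
        for seq in sequences:
            stream.extend(augment(seq, rng) if augment else seq)
    n_blocks = len(stream) // block_size  # the trailing remainder is dropped
    blocks = [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    rng.shuffle(blocks)
    return blocks
```

Applying a fresh random augmentation on each of the 20 passes is what makes the copies differ from one another; without it the duplication would only repeat identical data.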

But I really like your dataset codes and we don't have to make drastic changes. I can use Huggingface API to quickly train the medium model using my dataset, and let's see if it solves the quality degradation problem towards the end.

hearty flicker
#

If you can do it without code changes, then we can just use that!

#

The main reason I have MidiDataset and TokenizedDataset separated out is so that we have freedom to change TokenizedDataset

quasi steppe
hearty flicker
#

Since TokenizedDataset is only used for generating training datasets

quasi steppe
#

it's just that hugging face api is good for quick-and-dirty stuff 😂
But their datasets api is solid. The huggingface trainer is incredibly slow but with our scale it will be alright.

hearty flicker
#

TokenizedDataset is pretty similar to the hf datasets API.

#

It's actually in the backlog to make it inherit from that class

#

I just haven't gotten around to it yet haha

#

Ok cool : )

#

1.6b tokens is quite a lot haha

#

How many sets of data augmentation did you try?

#

Or is the 1.6b without any data augmentation in there

quasi steppe
#

I just followed the

[
  tokenizer.export_chord_mixup(),
  tokenizer.export_velocity_aug(1),
  tokenizer.export_pitch_aug(5),
  tokenizer.export_tempo_aug(0.2),
]

in your code

#

Did it for every sample

hearty flicker
#

I've replicated it in TokenizedDataset with the same name

quasi steppe
#

I actually have to write a little custom mp codes for these. Somehow huggingface doesn't like the nested tuples and lists (after tokenize)

hearty flicker
#

Datasets.set_transform()

quasi steppe
#

oh those don't work after tokens are encoded into numbers

#

and huggingface datasets doesn't like nested list/tuples

#

😂

hearty flicker
#

No they don't! That's kind of why I made my own class instead of using HF

#

I'm still trying to fix the staccato note bug, so we should wait to build the pre-training dataset until that is fixed

#

I'll push the fix to dev

quasi steppe
#

awesome!

hearty flicker
#

I think I have it figured out, but it's not quite working yet.

#

MIDI is such a shitty file format hahaah

#

full of eccentricities : )

quasi steppe
#

I will figure out training in the meantime. Hugging face trainer works with deepspeed and I should be able to set up something quick for multiple gpus

#

we can eventually improve the code base with or without those stuff but I'm really curious and want some quick-and-dirty results first 😂

quasi steppe
#

Wow it's fast. Got a few checkpoints already

hearty flicker
quasi steppe
hearty flicker
#

It should train incredibly fast on that

#

The last checkpoint I did was 24hrs on 1x4090

quasi steppe
#

the tflops stat is bad... I never had good experiences with huggingface's stuff

hearty flicker
#

In my training code?

quasi steppe
#

no

hearty flicker
#

Yeah I think it's not correct

#

Oh

quasi steppe
#

huggingface is bad

hearty flicker
#

Ah

#

How many tflops are you getting?

quasi steppe
#

Also the model is very small. This could also impact the efficiency

#

I did it manually and got about 100+ tflops but we got 8xA100 lol

hearty flicker
#

I think it's measuring per gpu btw

quasi steppe
#

no I calculate by hand haha

hearty flicker
#

100tflops on 8xa100 must be wrong

#

Because I was getting like 80 on my 4090

quasi steppe
#

I'm using huggingface Trainer class + accelerator to do the training.

hearty flicker
#

It must be per gpu, if not that is insanely low lol

#

There must be something seriously wrong in the code

#

We should defo fix all of this before we do the proper training run...

quasi steppe
#

ok the batch size per device is 16, and we get 2 batches per second

#

I will check my math haha
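
(A rough way to sanity-check the number by hand, using the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass of a dense transformer. It ignores attention FLOPs, so it undercounts somewhat at long sequence lengths. The batch size of 16 and 2 batches per second come from the messages above; the parameter count and sequence length in the usage note are placeholders, not values from this thread, and `estimate_tflops` is a hypothetical helper.)

```python
def estimate_tflops(
    n_params: int,
    batch_per_device: int,
    seq_len: int,
    batches_per_sec: float,
    n_devices: int = 1,
) -> float:
    """Approximate achieved training TFLOPs via the 6 * params * tokens rule."""
    tokens_per_sec = batch_per_device * seq_len * batches_per_sec * n_devices
    flops_per_sec = 6 * n_params * tokens_per_sec
    return flops_per_sec / 1e12
```

For example, `estimate_tflops(100_000_000, 16, 2048, 2)` gives the per-device figure for a placeholder 100M-parameter model at a placeholder 2048 sequence length; passing `n_devices=8` gives the aggregate. Whether a reported number is per GPU or summed over all GPUs changes it by 8x here, which is the ambiguity being discussed.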

hearty flicker
#

I think with medium on my 4090 with batch size 8 I was getting 5its/s

quasi steppe
#

can easily forget a zero or something

#

what? LOL

#

last year when I used huggingface Trainer I actually got something similar and never figured out what's wrong

hearty flicker
#

Yeah I don't use trainer, only accelerate

#

I'll sort this out when I figure out how to use SAI cluster lol

#

gotta read the docs

quasi steppe
#

to carefully do it we should just use torchrun FSDP

#

I had good experiences with that combo.

#

I don't know what benefit accelerate has

hearty flicker
#

Because the model size is so low, I don't actually think any sharding is required

#

Can just use DP

quasi steppe
#

I vaguely remember in torch docs they say everything can be wrapped with FSDP

#

even if you want to do dp only

#

I could be wrong though. Or we use DDP. I remember that DP is not supposed to be used.

#

hmm the model still gets repetitive towards the end. Maybe not that bad...

hearty flicker
#

I've realised what my bug is too, but it's actually pretty annoying to fix

hearty flicker
#

The staccato bug is fixed now, it wasn't a problem with the tokenizer, just with midi.dict_to_midi

#

So it should not affect the model in any way @quasi steppe

quasi steppe
#

This is awesome

#

I just submitted a couple jobs for 400m and 2b models. I will play a bit more with hparams

hearty flicker
#

If you wana do bigger jobs, do you want some more data?

#

I can give you access to the gcp bucket holding the data

hearty flicker
#

There are some larger collections I have already zipped

#

But they contain other data such as pop etc.

quasi steppe
#

2B tokens is very, very small for pretraining, plus it's already like 20 epochs

hearty flicker
#

Giant is 10k files

#

I have over 200k files

#

But a lot of it is pop music

quasi steppe
#

Oh that's nice

hearty flicker
#

Randomly scraped from the net

#

so lower quality

quasi steppe
#

Gotcha

hearty flicker
#

For pretraining purposes it should still be good though

#

Might need to ft the model after if you want classical

quasi steppe
#

Boarding a flight now. Will talk later

hearty flicker
#

Have a safe flight

hearty flicker
#

I can always add it back in later if we want to use the old method for ft.

#

@quasi steppe The dataset building you wanted is now the default behavior, you should be able to do the dataset building and truncation with the relevant code on dev now

tiny coral
#

have you done any data augmentation?

hearty flicker
#

There are a few forms currently implemented

hearty flicker
#

There is also now a MIDI->mp3/wav converter in aria.utils. It will run automatically if you use the aria sample cli entrypoint after pip installing the package.

quasi steppe
#

Wow I got CUDA OOM even with batch size 1 for a 2B model 🤦‍♂️. Ok my hacky solution using huggingface Trainer doesn't scale. Will work on your training code later this week.

#

Got a checkpoint for a "large" model. It still prefers to repeat toward the end though

quasi steppe
#

Made a 100x augmented dataset. Gonna see if it overfits or gets better 🤔

quasi steppe
#

oh wow @hearty flicker I think the model just needs more epochs. I'm listening to the completions by a large one trained on 2x augmented data

hearty flicker
#

I think the checkpoint currently on the colab is epoch 50 lol

#

With data aug being applied differently each time haha

#

And it didn't start overfitting

quasi steppe
#

yeah. It's half way through my 100-epoch dataset. The log-loss curve is showing signs of either overfitting or double-descent

hearty flicker
#

You will need to install fluidsynth (apt install fluidsynth)

#

Makes a huge difference to the quality

quasi steppe
#

I put some converted mp3 here
https://honglu.fan/files/
It's the 400M model (large.json) with about 3B tokens.
I used some other soundfont but will try yours later

#

they are small so I will keep updating

hearty flicker
#

Btw @quasi steppe I've found that when sampling, the better the prompt you use, the better the results

#

That chopin MIDI is a little off beat so the model continues that off beat nature

#

If you find a better recording you might get a better result

#

This website is pretty good

#

Try using this

#

Make sure that the composer name is in the name of the file too, so that the model pulls the composer meta information

quasi steppe
#

gonna try bach and beethoven haha

hearty flicker
#

the chopin you are prompting with is from audio-> MIDI

#

So it's a little weird

#

Honestly I've been experimenting with high quality MIDI files for prompts, the results are way way way better

#

Also, using the fluidsynth soundfont that I included in aria makes it sound so much better too

#

There is actually a bug in dev that means it's not using the correct soundfont, gonna push the fix now

#

Here is your chopin sample with the proper soundfont

#

Almost sounds like a real perf

quasi steppe
#

indeed

hearty flicker
#

The change is in dev now

hearty flicker
#

Weird lol

#

I always liked this piece by Chopin

#

But the repeated note might make the model do some weird stuff

#

Also this is a nice etude

hearty flicker
#

Very weird, maybe the staccato notes stuff isn't completely fixed yet

#

It's actually such an annoying bug to fix

quasi steppe
#

Overall these are actually pretty good.

#

wow empties. I haven't even started trying style interpolation yet.

hearty flicker
#

Ok, I also merged dev into the main branch

#

with kv-caching the code currently does 20 toks/sec on a t4

#

Pretty decent

#

It will work very well even with the free version of colab

#

I like this

#

I think there is still a slight issue with staccato notes which is frustrating... The way that MIDI deals with the sustain pedal is very strange

quasi steppe
#

Style interpolation. Sounds more interesting than before. It's clearly trying to mix them instead of gluing the two pieces together

hearty flicker
#

This prompt works really well for some reason

hearty flicker
#

I might add a key-meta token to the tokenizer

#

Could be good for controllability stuffs

quasi steppe
#

@hearty flicker what does this message mean? WARNING:root:Tried to decode unexpected note message

hearty flicker
#

That happens when detokenizing into a MidiDict object

#

It can happen when tokens are not in the expected order

#

For instance a note token not followed by a dur token

#

It's normally a sign of degenerate behaviour of the model

#

It should always get the order of the tokens correct
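
(To make the "unexpected order" case concrete, here is a simplified checker. The `(kind, value)` token layout and the rule that every `"note"` token is immediately followed by a `"dur"` token are assumptions based only on the description above; the real tokenizer's format may differ, and `find_order_violations` is a hypothetical helper.)

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, int]  # (kind, value): hypothetical simplified token layout


def find_order_violations(tokens: Sequence[Token]) -> List[int]:
    """Return indices of tokens that break the note -> dur pairing."""
    bad: List[int] = []
    for i, (kind, _) in enumerate(tokens):
        if kind == "note":
            # A note must be immediately followed by its duration.
            nxt = tokens[i + 1][0] if i + 1 < len(tokens) else None
            if nxt != "dur":
                bad.append(i)
        elif kind == "dur":
            # A duration must immediately follow a note.
            prev = tokens[i - 1][0] if i > 0 else None
            if prev != "note":
                bad.append(i)
    return bad
```

Counting violations per generated sample could be a cheap signal for how often the model drifts into degenerate output.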

quasi steppe
#

I was trying to go for 4096. It's actually not too crazy when going out-of-bound. It might have potential for YaRN. Gonna try it out later

hearty flicker
#

I mean in theory we could train 8k right?

#

With flash attention it would fit on an 80GB A100 pretty easily

#

If you find any that you like in particular lmk, because I'm compiling some good samples for an email rn

#

@quasi steppe Can I send you dataset files for a dataset larger than just GiantMIDI

#

I can build it without the padding as that is now the default behaviour

hearty flicker
#

I like this one

quasi steppe
#

Oooooh I made a mistake in the yarn parameter. Also found a bug in my code in sample.py (not affecting us because it bugs out only when cfg turned off). Will push the fix.
I just tested the ppl and YaRN should be working fine with our models. Will try some 4096 generations and see how it looks.

#

@hearty flicker Can I put use_yarn and some other parameters directly into ModelConfig?

hearty flicker
#

Yeah sure but you need to change all the relevant stuff

#

If you make a pr I'll review to make sure nothing breaks

quasi steppe
#

Oh by the way, the ModelConfig could have been a dataclass.
I think I could define a new YaRN config as an optional param (because there are quite a lot of params and I don't want it to overcomplicate the original ModelConfig)
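
(A sketch of what that could look like: a dataclass config with the YaRN settings as an optional nested dataclass, so the base config stays small. Every field name and default here is a placeholder for illustration, not the actual aria `ModelConfig`.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class YaRNConfig:
    # All fields are hypothetical placeholders.
    scale: float = 2.0
    beta_fast: int = 32
    beta_slow: int = 1
    dynamic: bool = False


@dataclass
class ModelConfig:
    # Hypothetical base fields.
    d_model: int
    n_heads: int
    n_layers: int
    max_seq_len: int
    # None means "no context extension", so existing configs keep working.
    yarn_config: Optional[YaRNConfig] = None

    def __post_init__(self) -> None:
        # Allow a plain dict (e.g. loaded from a JSON config) for the nested part.
        if isinstance(self.yarn_config, dict):
            self.yarn_config = YaRNConfig(**self.yarn_config)
```

With this shape, `ModelConfig(**json_dict)` works whether or not the JSON contains a `yarn_config` entry.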

hearty flicker
#

I think I originally had it as a dataclass but I changed it for some reason

#

You can change it back if you like

quasi steppe
#

hmm I start to feel that dynamic yarn is only making ppl artificially low. The stuff sounds weird

hearty flicker
#

What do you mean by people

#

We should also do a proper yarn ft

hearty flicker
#

ha!

quasi steppe
#

dynamic yarn has the advantage of not blowing up perplexity until 2x original max_length without finetuning

#

I'm testing this

#

but yeah with finetuning it should be all good because it at least sees some actual long samples
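
(For reference, "ppl" here is perplexity: the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming you already have natural-log probabilities for each target token; `perplexity` is just an illustrative helper, not something from the repo.)

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood) over the given target tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A low perplexity only says the model assigns high probability to held-out continuations, which is why it can look fine at long context while samples still sound off.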

timber talon
#

oh man there's a better way to do the spotify-dl... if you go composer first, and type "Prokofiev complete piano works", or something, into spotify, then you'll often get these massive albums with like 200 songs — every piano piece ever written by that one composer

#

anyway, I have ~20 of those sitting in a commit on my branch, waiting for the first PR to be merged

hearty flicker
#

I'm going to try to include some solo-piano detection in the pipeline like GiantMIDI, but it's best to be as safe as possible

#

That is a really good idea though Alex 🙂

#

I will merge your pr! Just got held up yesterday ha

hearty flicker
#

This is a great strategy btw, the nice thing about spotify_dl is that it can work using any playlist

#

I'm gonna add some regex stuff I think too, scraping the titles to exclude all other instruments apart from piano

#

Also to exclude other bad keywords like concerto etc.

#

A good strategy might also be to automatically skip any spotify tracks that have more than two artist credits

#

Ok I think instead of rejecting these albums based off of there being a single bad track, I'm going to use other methods to filter out non solo-piano
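
(A sketch of the title and credit heuristics described above. The keyword list is illustrative only, and `looks_like_solo_piano` is a hypothetical helper; it takes plain strings rather than any particular Spotify client's objects.)

```python
import re
from typing import Sequence

# Illustrative keywords that suggest a track is not solo piano.
_BAD_TITLE = re.compile(
    r"\b(concerto|violin|cello|orchestra|quartet|trio|symphony)\b",
    re.IGNORECASE,
)


def looks_like_solo_piano(
    title: str, artists: Sequence[str], max_artists: int = 2
) -> bool:
    """Heuristic filter: reject crowded credits and titles with bad keywords."""
    if len(artists) > max_artists:
        return False
    return _BAD_TITLE.search(title) is None
```

Filtering per track like this keeps the good tracks from a compilation instead of rejecting the whole album over one bad entry.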

#

I'm going to get the code for downloading the tracks and processing the metadata done this weekend. Then next week I'll try to get the transcription pipeline up and running.

timber talon
#

i'm asking some of my friends about more niche piano composers, to continue diversifying a bit

hearty flicker
#

I tried it out btw, works well on these compilations

#

I ran spotify_dl over one of them and got a ton of hits

#

About 50%

timber talon
#

oh great!

#

i wonder if this is something @sand nymph would consider tweeting out from the EleutherAI account?

"Who is your favorite classical piano composer (bonus points if they're niche or underrepresented, e.g. Clara Schumman, Scott Joplin, Fanny Mendelssohn)"

#

so we could crowdsource a wider range of piano composers for this dataset?

sharp quiver
#

Do you already have a list of composers you have data from or that you are planning to get data from?

ruby frost
#

can i beg for some scriabin

timber talon
#

oh yes, we have a lottt of scriabin

hearty flicker
#

We are mostly going by pianist as it's easier when gathering links

#

Most of the popular composers are well represented

hearty flicker
#

But this could be a more promising direction for doing the transcription.

#

This is the same method I had in mind to improve over Kong et al.

#

However I've just found it during lit review lol

#

If this model isn't open source, I think I could pretty easily reimplement it in the aria repo

#

Might even work well with the aria tokenizer for the decoder, instead of using MIDI representation

#

I think there are other improvements possible too, like using synthetic data and more data augmentation

sharp quiver
#

it looks like sony has a (SotA?) open source piano transcription model
https://arxiv.org/abs/2307.04305
https://github.com/sony/hFT-Transformer


hearty flicker
#

Good find !

#

I wonder if this was presented at ismir this year, I wasn't there (it's going on rn)

sharp quiver
#

yeah I've seen some people tweet about it, it looks pretty fun xD
https://fixupx.com/is_s_yun/status/1722416557559583231?s=20

Amazing night at @ISMIRConf - thanks to Uri, Romain, Genís, and André for playing with me. Also, "Fear of MATLAB" was the best thing I heard in years! 😭😭

hearty flicker
#

I know the guy playing guitar ha (well I met him once)

quasi steppe
#

@hearty flicker do messages like
[2023-11-09 15:28:02,465] aria.data.datasets: [INFO] MIDI at bigger_data/bitmidi/bitmidi/96166.mid failed preprocessing tests: [('max_programs', 12), ('max_instruments', 8), ('total_note_frequency', 37.87731885463125)]
matter?

hearty flicker
#

It's performing some filtering as part of the data lib

#

You can adjust them in the config file to your liking

#

Might be worth tuning for bitmidi

quasi steppe
#

I'm running it across bitmidi and got my terminal screen full of these scrolling nonstop lol

hearty flicker
#

Yeah that makes sense, I'd adjust the values to whatever you think would work

#

max_programs is the number of different instruments (MIDI programs) in the midi file

#

max_instruments is the number of different instrument families (e.g. saxophone and clarinet would both be classified as woodwind)

#

total_note_frequency is max midi notes per second

#

you can adjust in config/config.json under preprocessing tests I think

#

btw the code for building MidiDatasets is multithreaded so I'd suggest running it on a machine with multiple cores or it will take ages

quasi steppe
#

do "dataset_gen_args" matter any more?

hearty flicker
#

lmao it should take no time then

#

Dataset gen args are for making TokenizedDatasets

#

All you need to do is set the max_seq_len you want

#

So if you use aria tokenized-dataset then it will make the tokenized dataset based on dataset_gen_args in the config

#

That code is also multithreaded so should be fast

quasi steppe
#

how many processes by default? Is there an argument for that? Oh nvm, looks like it's max cpu count.
But it's still running 🤔

hearty flicker
#

It should just use all available cpu cores

#

On 16 cores I think GiantMIDI takes about 5-10mins

quasi steppe
#

strange... it just froze. The log should keep scrolling

hearty flicker
#

Hmmn, it shouldn't run out of ram or anything

#

weird

#

Maybe try setting the pool size to smaller than the max

#

Change Pool() to Pool(32) or something

#

I've never had it freeze on me before

quasi steppe
#

oh one process got an error

Traceback (most recent call last):
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/fsx/home-honglu/aria/aria/data/datasets.py", line 212, in _get_mididict
    failed_tests = _run_tests(mid_dict)
  File "/fsx/home-honglu/aria/aria/data/datasets.py", line 180, in _run_tests
    test_res, val = test_fn(_mid_dict, **test_args)
  File "/fsx/home-honglu/aria/aria/data/midi.py", line 660, in test_note_frequency
    notes_per_second = (num_notes * 1e3) / total_duration_ms
ZeroDivisionError: float division by zero
hearty flicker
#

Eek, I'll patch this

quasi steppe
#

got this when hitting ctrl-c. Maybe the pool got some deadlock caused by this error in one process

hearty flicker
#

Lemme have a look

#

Must be an edge case
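
(The edge case in the traceback is a file whose total duration is zero. One possible guard is sketched below. This is not the actual patch: `check_note_frequency` and its `(passed, value)` return shape are assumptions loosely modeled on the `test_res, val = test_fn(...)` line in the traceback.)

```python
from typing import Tuple


def check_note_frequency(
    num_notes: int, total_duration_ms: float, max_per_second: float
) -> Tuple[bool, float]:
    """Return (passed, notes_per_second), treating zero-length files as failures."""
    if total_duration_ms <= 0:
        # Empty or zero-length file: nothing to measure, so reject it
        # instead of dividing by zero.
        return False, 0.0
    notes_per_second = (num_notes * 1e3) / total_duration_ms
    return notes_per_second <= max_per_second, notes_per_second
```

Returning a failed result rather than raising also means a single bad file cannot take down a worker process.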

#

pull main and try again? I patched it I think.

#

@quasi steppe

quasi steppe
#

@hearty flicker Got

[2023-11-09 21:12:11,906] aria.data.datasets: [ERROR] Failed to tokenize midi_dict: note_msgs is empty after ignoring instruments
[2023-11-09 21:12:12,063] aria.data.datasets: [ERROR] Failed to tokenize midi_dict: note_msgs is empty after ignoring instruments
[2023-11-09 21:12:12,432] aria.data.datasets: [ERROR] Failed to tokenize midi_dict: note_msgs is empty after ignoring instruments
Traceback (most recent call last):
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fsx/home-honglu/aria/aria/run.py", line 179, in <module>
    main()
  File "/fsx/home-honglu/aria/aria/run.py", line 171, in main
    build_tokenized_dataset(args=_parse_tokenized_dataset_args())
  File "/fsx/home-honglu/aria/aria/run.py", line 138, in build_tokenized_dataset
    dataset = TokenizedDataset.build(
  File "/fsx/home-honglu/aria/aria/data/datasets.py", line 613, in build
    buffer += entry
TypeError: 'NoneType' object is not iterable

When running tokenized-dataset

#

but the jsonl dataset is done alright
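
(The `TypeError` comes from `buffer += entry` receiving `None` for a file that failed to tokenize. A minimal sketch of skipping those entries while packing tokens into fixed-length sequences; `chunk_entries` is a hypothetical stand-in, not the code in `TokenizedDataset.build`.)

```python
from typing import Iterable, Iterator, List, Optional


def chunk_entries(
    entries: Iterable[Optional[List[int]]], max_seq_len: int
) -> Iterator[List[int]]:
    """Concatenate tokenized entries and yield fixed-length chunks."""
    buffer: List[int] = []
    for entry in entries:
        if entry is None:
            # Tokenization failed for this file; skip it rather than crash.
            continue
        buffer += entry
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]
    # Any trailing tokens shorter than max_seq_len are dropped in this sketch.
```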

quasi steppe
#

also decoding long dicts with json seems incredibly slow. MidiDataset.load is taking an unbearable amount of time to load stuff

quasi steppe
#

The bitmidi jsonl file is huge. Parsing json is so slow that I got fluctuating performance between 20 and 100 lines/sec, but we have 90k lines there lol. MidiDataset probably needs to do lazy loading in order to hide the latency 🤦‍♂️.

hearty flicker
#

Ok let me look at this tomorrow...

#

That bug should be an easy fix too

#

MIDI is a messed up file format tbh, you can't imagine the amount of random bugs related to it

hearty flicker
#

There should be a way around it

quasi steppe
#

MidiDataset lazy loading seems like a small fix. Let me submit a PR and you check if it makes sense
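
(Roughly what lazy loading of a jsonl file means: a generator that decodes one line at a time instead of building the full list up front. `iter_jsonl` is a hypothetical helper, not the PR itself.)

```python
import json
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line, on demand."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

This keeps memory flat and lets downstream work start immediately, but it does not make the decoding itself any faster, so total wall time stays about the same if json parsing is the bottleneck.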

hearty flicker
#

Thanks : )

#

I can profile it tomorrow

quasi steppe
#

+12 -10 such a small change lol.
Hmm I tried it but doesn't seem to help much. Guess json decoding is a real bottleneck but I need to go to bed now.

hearty flicker
#

I haven't checked yet, but it could be due to having to load the json and then pickle it to be sent to the process

#

It's slightly annoying that the json loading has to run on the main process though

#

Once you have a tokenized dataset, using that should be really fast since the json loading doesn't run on the main process

hearty flicker
#

Would be an easy fix

#

I'm super busy during the day today but I'll push these fixes in the evening

hearty flicker
#

@quasi steppe I pushed a fix, can you see if this speeds it up before I look at your pr?

#

It should move the json loading into the multiprocessing

#

Also the bug with buffer += entry should be fixed but I haven't tested it

#

Oh btw if you are building a large tokenized dataset from a mididict one, you should definitely use the 'aria tokenized-dataset' cli as it doesn't load the entire thing into memory

quasi steppe
#

hmm the speed feels similar

hearty flicker
#

The bottleneck might be the pickle then

quasi steppe
#

I ran the cProfile and I'm still trying to understand the result.
For my own script I tokenize, transform and encode right away and make it a parquet file. Run this routine 100 times async and save to 100 different parquet files and then read and combine as the last step.

hearty flicker
#

There is no way around that to my knowledge

#

If it's not the json load then it's probably the pickle

quasi steppe
#

yeah

hearty flicker
#

I can't think of what else it could be

#

That's a bottleneck for mp in general, esp for this sort of thing where you are sending input to each process

quasi steppe
#

the biggest bottleneck was.... acquire Lock 🤔

#

wonder which thread did that
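
(A small sketch of profiling a single function with the standard library's `cProfile` and `pstats`. `profile_call` and `decode_lines` are illustrative helpers. A lock acquire at the top of a parent-process profile usually just means the main thread was blocked waiting on pool results, so it can help to profile the worker function directly like this instead.)

```python
import cProfile
import io
import json
import pstats


def profile_call(fn, *args, top: int = 10, **kwargs):
    """Run fn under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


def decode_lines(lines):
    """Stand-in for the per-item work done inside a worker process."""
    return [json.loads(line) for line in lines]
```

Calling `profile_call(decode_lines, lines)` on a sample of the jsonl lines shows how much time goes to json decoding alone, without any multiprocessing overhead mixed in.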

hearty flicker
#

Hmmmn

#

If you are getting 20/s that should be 1.5 hours

#

Kind of sucks but idk

#

It's doable
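
(Quick check of that estimate, using the roughly 90k lines and the 20 to 100 lines/sec range mentioned earlier; `eta_hours` is just an illustrative helper.)

```python
def eta_hours(n_items: int, items_per_sec: float) -> float:
    """Estimated wall time in hours at a constant processing rate."""
    return n_items / items_per_sec / 3600


slow = eta_hours(90_000, 20)   # 1.25 hours at the slow end
fast = eta_hours(90_000, 100)  # 0.25 hours at the fast end
```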

quasi steppe
#

yeah totally

#

I'm just curious what was going on

hearty flicker
#

It's defo related to mp btw

#

In the mp function, the tokenizer and the mididict are passed in

quasi steppe
#

yeah mp has a lot of overheads and I'm never fully comfortable about what it's doing