Neuro-Symbolic Music Models | EleutherAI | Page 3

hearty flicker Nov 10, 2023, 12:13 PM

#

So pickling the tokenizer might be the bottleneck too

quasi steppe Nov 10, 2023, 12:13 PM

#

oh you pass tokenizer in through pickling? I thought it cannot be pickled. I tried that when writing my own script and got an error

hearty flicker Nov 10, 2023, 12:14 PM

#

Not sure if there is a good way to share python objects between processes, I'm not super familiar with python mp in general I just threw this together to speed things up

#

Actually it shouldn't be pickling the tokenizer since I'm using partial

quasi steppe Nov 10, 2023, 12:15 PM

#

hearty flicker Not sure if there is a good way to share python objects between processes, I'm n...

we could just use a bunch of processes reading from a queue, and each one initializes tokenizer separately
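A minimal sketch of the fixed-worker idea above, assuming a picklable list of string items. `ToyTokenizer` and `tokenize_all` are hypothetical stand-ins, not the aria API. Each worker builds its own tokenizer once, so only the work items cross process boundaries.

```python
import multiprocessing as mp

_SENTINEL = None  # marks the end of the work queue


class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer."""

    def tokenize(self, text):
        return text.split()


def worker_loop(in_q, out_q, make_tokenizer=ToyTokenizer):
    # Built once per worker, inside the worker, so it is never pickled.
    tokenizer = make_tokenizer()
    while True:
        item = in_q.get()
        if item is _SENTINEL:
            break
        out_q.put(tokenizer.tokenize(item))


def tokenize_all(items, num_workers=4):
    """Tokenize a list of items with a fixed pool of workers (order not preserved)."""
    in_q, out_q = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=worker_loop, args=(in_q, out_q))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for item in items:
        in_q.put(item)
    for _ in workers:
        in_q.put(_SENTINEL)  # one stop signal per worker
    results = [out_q.get() for _ in items]  # drain before joining
    for w in workers:
        w.join()
    return results
```

On platforms that spawn rather than fork, `tokenize_all` should be called from under `if __name__ == "__main__":` in an importable module.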

hearty flicker Nov 10, 2023, 12:16 PM

#

I'll look into it later...

quasi steppe Nov 10, 2023, 12:17 PM

#

or like I said just have a script to build the dataset in memory directly, then save the 100x epochs to a giant file

hearty flicker Nov 10, 2023, 12:17 PM

#

Could also index the file, and read from it using mmap in each process separately
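A rough sketch of the index-plus-mmap idea, assuming the dataset is a newline-delimited JSON file. `build_line_index` and `MmapJsonl` are hypothetical names, not part of aria. The index is built once; each process then opens its own read-only mapping and fetches entries by number.

```python
import json
import mmap


def build_line_index(path):
    """Return the byte offset at which each non-empty line of a JSONL file starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            if line.strip():
                offsets.append(pos)
            pos += len(line)
    return offsets


class MmapJsonl:
    """Random access to JSONL entries; open one instance per process."""

    def __init__(self, path, offsets):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        start = self._offsets[i]
        end = self._mm.find(b"\n", start)
        if end == -1:  # last line has no trailing newline
            end = len(self._mm)
        return json.loads(self._mm[start:end])

    def close(self):
        self._mm.close()
        self._file.close()
```

Only the list of offsets needs to be shared between processes; the file contents stay on disk and are paged in on demand.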

hearty flicker Nov 10, 2023, 12:18 PM

#

quasi steppe or like I said just have a script to build the dataset in memory directly, then ...

How would that speed things up?

quasi steppe Nov 10, 2023, 12:20 PM

#

I mean we do the whole I/O, tokenization, transform, concat and chunking, encoding in the same process that draws from a queue or just do round-robin. Save the token ids to a file and next time we don't need to do anything

#

if the bottleneck is that we spin up too many processes and pickle the tokenizer too many times, this helps too

#

if the problem is still the json.decode this should be optimal anyway if we use a queue to let processes grab whenever they can.

hearty flicker Nov 10, 2023, 12:23 PM

#

Oh so, tokenized datasets are built to a file

#

So they only need to be built once and then they can be accessed whenever

#

And accessing them is fast

quasi steppe Nov 10, 2023, 12:24 PM

#

yeah

#

and put everything in one process to minimize communications

hearty flicker Nov 10, 2023, 12:24 PM

#

That's how I have it implemented currently actually

#

Unless I'm misunderstanding

quasi steppe Nov 10, 2023, 12:25 PM

#

you only save the tokenized dataset, not the token ids right?

#

tokenized sequences can do data augmentations which is nice. But I say we do augmentation when building it, and save a static file full of integers

#

or maybe I misunderstood

hearty flicker Nov 10, 2023, 12:27 PM

#

Could do, although I don't think that there are any CPU bottlenecks during training

#

Fetching, loading and doing data aug takes about 5ms per entry

#

So one cpu core can do like 100-200/s

#

And there is no pickle stuff in the mp, so with 8 or 16 cores it's never gonna be an issue

#

I'm not sure how this scales to multiple nodes though

quasi steppe Nov 10, 2023, 12:32 PM

#

oh yeah the dataloader itself is not multiprocessing for now... As long as this problem doesn't happen for training it's probably alright

quasi steppe Nov 10, 2023, 1:17 PM

#

ok it's faster with fixed workers that receive items from a queue. Tokenizer is only pickled once so maybe that's it.
But it's not as fast as I imagine though.

quasi steppe Nov 10, 2023, 1:34 PM

#

Doing fixed workers is about 1.5-2x faster
Actually for giant_midi, before 4000 samples the speed fluctuates a lot but afterwards it was so fast (like 3x). Happens for both methods. If you just do I/O and read the whole file to memory, the whole thing is like 5 seconds in single process.... So I bet there is still some massive overhead depending on sequence lengths.

hearty flicker Nov 10, 2023, 3:45 PM

#

I'll look at your commits, I'm interested in how you implemented it

#

I was trying to avoid loading the entire mididataset jsonl into memory at once

#

I was worried it would be too big to fit in memory for the larger datasets

quasi steppe Nov 10, 2023, 3:55 PM

#

hearty flicker I was worried it would be too big to fit in memory for the larger datasets

no I mean if you do one pass with the data from file to mem, the lower bound of latency should be about 5 sec. If it takes half an hour with varying speed depending on sample length, there is def something inefficient still going on.

#

not that I want to load everything actually to memory.... Just trying to rule out the possibility that the file system in the cluster node is doing some weird things

#

The jsonl file is also getting fat. The bitmidi is already 6G. If we do 50 epochs it's gonna be 300G but there will be a storage limit in SAI nodes. I definitely can't save 300G with my current quota

hearty flicker Nov 10, 2023, 4:00 PM

#

Yeah that's kinda why I have the data aug implemented dynamically

#

The dataset files get huge

#

Ha

quasi steppe Nov 10, 2023, 4:00 PM

#

yeah makes sense

#

I also throw in a jsonl.zst reader/writer. When I was working with pilev2 jsonl.zst was the go-to format that really brings storage down

hearty flicker Nov 10, 2023, 4:01 PM

#

I think the issue for going from MidiDataset -> tokenized dataset is the pickle

#

It makes sense that the profiler is saying acquire lock is the biggest timesink maybe

#

Because it's trying to acquire the lock on the mididict string / dict

#

Maybe idk, the pickle happens inside that perhaps?

#

Someone skilled in python mp can probably tell what is going on

#

The way that the MidiDataset and TokenizedDatasets are designed is very subpar in general

#

I built it pretty fast

#

I wish I had access to my server so I could run some tests

quasi steppe Nov 10, 2023, 4:07 PM

#

if there is a way to do mmap with cheap random access for zstd stream reader, we could stick to zstd all the time without worrying about storage. But not sure if it's possible

#

was trying that and failed

hearty flicker Nov 10, 2023, 4:09 PM

#

Ima have a look at this properly this weekend

#

Just to confirm 100%, you are concerned with the speed of 'aria tokenized-dataset' (aka the build method) and not the tokenized dataset class once it is built, right?

hearty flicker Nov 10, 2023, 4:12 PM

#

quasi steppe I also throw in a jsonl.zst reader/writer. When I was working with pilev2 jsonl....

This is great btw, I never got around to this

quasi steppe Nov 10, 2023, 4:14 PM

#

hearty flicker Just to confirm 100%, you are concerned with the speed of 'aria tokenized-datase...

haven't tried your training script yet, but from the code it looks like it should work great

quasi steppe Nov 10, 2023, 4:15 PM

#

hearty flicker Just to confirm 100%, you are concerned with the speed of 'aria tokenized-datase...

so yeah it's actually not that important but just trying to get familiar with your setting

hearty flicker Nov 10, 2023, 4:20 PM

#

If pickling is the problem, another alternative would be to chunk the file into n parts and then each process can convert only its chunk

#

That way, there would be no overhead with starting up and killing each process
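A sketch of the chunking alternative, assuming a newline-delimited file. `split_jsonl_ranges` and `read_range` are hypothetical helpers. Each process would be handed one `(start, end)` byte range and would read only the lines inside it.

```python
import os


def split_jsonl_ranges(path, n):
    """Split a file into at most n byte ranges that each end on a line boundary."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for k in range(1, n):
            target = size * k // n
            if target <= cuts[-1]:
                continue
            f.seek(target)
            f.readline()  # move to the end of the line that contains `target`
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def read_range(path, start, end):
    """Yield the raw lines whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line
```

Ranges are balanced by bytes rather than by entry count, so a few very long entries can still leave one worker with more work than the others.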

quasi steppe Nov 10, 2023, 4:23 PM

#

hearty flicker If pickling is the problem, another alternative would be to chunk the file into ...

pickling should be fine now. They will only be pickled once in my implementation.

#

was just thinking if there is anything else

#

@hearty flicker got a lot of aria.data.datasets: [ERROR] Failed to tokenize midi_dict: note_msgs is empty after ignoring instruments when trying tokenized-dataset for bitmidi.

#

oooooooh never mind... The config.json was changed for bitmidi but reverted back when pulling the updates.
But this made the workflow a little inconvenient.

hearty flicker Nov 10, 2023, 4:51 PM

#

That's expected

#

It's not really an error

quasi steppe Nov 10, 2023, 4:52 PM

#

oh doesn't help, still got that

hearty flicker Nov 10, 2023, 4:52 PM

#

It's not an error

quasi steppe Nov 10, 2023, 4:52 PM

#

gotcha

#

Traceback (most recent call last):
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fsx/home-honglu/aria/aria/run.py", line 178, in <module>
    main()
  File "/fsx/home-honglu/aria/aria/run.py", line 170, in main
    build_tokenized_dataset(args=_parse_tokenized_dataset_args())
  File "/fsx/home-honglu/aria/aria/run.py", line 137, in build_tokenized_dataset
    dataset = TokenizedDataset.build(
  File "/fsx/home-honglu/aria/aria/data/datasets.py", line 641, in build
    buffer += entry
TypeError: 'NoneType' object is not iterable

Also got this. Not sure if it's from my changes or not. Will try some debugging

hearty flicker Nov 10, 2023, 4:53 PM

#

It happens when you have a midi dict which only has instruments that should be removed according to the config

#

Uhhg this error again

#

I thought maybe it's happening because entry is None so I put in a check for that

#

But I guess not

#

I'll debug this tmo too

#

Properly

quasi steppe Nov 10, 2023, 4:55 PM

#

I run it on my branch that has a lot of changes on _get_tokenized_seqs_mp. Could be my problem too. Will let you know

hearty flicker Nov 10, 2023, 4:58 PM

#

If you get that with aria/main let me know

#

The error message was confusing me a bit tbh

#

I didn't see how the type error could apply to that line

#

buffer += entry
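For reference, one way that line can produce exactly this message, assuming `buffer` is a plain list: in-place `+=` on a list iterates over its right-hand side, so a `None` entry raises the "not iterable" TypeError. The `extend_buffer` guard is a hypothetical sketch, not the code in `datasets.py`.

```python
def extend_buffer(buffer, entry):
    """Append a tokenized entry to the buffer, skipping failed entries."""
    if entry is None:  # tokenization failed, nothing to add
        return buffer
    buffer += entry
    return buffer


def demo_error():
    """Reproduce the message from the traceback with a list buffer."""
    buffer = []
    try:
        buffer += None  # same shape as `buffer += entry` with entry=None
    except TypeError as exc:
        return str(exc)
    return None
```

`demo_error()` returns `'NoneType' object is not iterable`, which matches the last line of the traceback. This only shows that a `None` entry is sufficient to trigger it, not that it is the cause in this run.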

quasi steppe Nov 10, 2023, 5:02 PM

#

oh another problem, now it seems that sequences are concatenated and chunked in a fixed way (we save them to tokenized dataset), and then augmented.
But ideally the data needs to be shuffled first for every epoch, and then concatenated and chunked. There will be a difference when running multiple epochs

#

like
abcde and 123456789 becomes abcde12, 3456789 if sequence length is 7. But with multiple epochs it's gonna be always these two with augmentations.

#

ideally I might want 1234567 and 89abcde or even stuff that starts in the middle like cde1234, ...
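A small sketch of the shuffle-then-chunk order described above, using a hypothetical `build_epoch_chunks` helper. Sequences are shuffled with an epoch-dependent seed, concatenated into one stream, and then cut into fixed-length chunks, so the chunk boundaries change from epoch to epoch.

```python
import random


def build_epoch_chunks(seqs, seq_len, epoch, seed=0):
    """Shuffle sequences for this epoch, concatenate, and cut into seq_len chunks."""
    order = list(range(len(seqs)))
    random.Random(seed + epoch).shuffle(order)
    stream = []
    for i in order:
        stream.extend(seqs[i])
    # Drop the incomplete tail; padding it is the other common choice.
    n_full = len(stream) // seq_len
    return [stream[k * seq_len:(k + 1) * seq_len] for k in range(n_full)]
```

With the two sequences from the example and `seq_len=7`, an epoch yields either `abcde12`, `3456789` or `1234567`, `89abcde`, depending on the shuffle. Chunks that start mid-sequence (like `cde1234`) would additionally need a random offset into the stream, which this sketch leaves out.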

hearty flicker Nov 10, 2023, 6:07 PM

#

Ok added to the list, will be good to get this all squashed now ha

#

That might be a hard issue to fix. I wonder how it's done in NLP

#

I can't think of an obvious way to make it so that the sequences are concatenated differently for each epoch

#

There must be a way that it's done in NLP

#

Any thoughts on this @sand nymph ?

quasi steppe Nov 10, 2023, 6:51 PM

#

hearty flicker I can't think of an obvious way to make it so that the sequences are concatenate...

concatenate and build the samples on-the-fly. Or if it's enough to fit the memory, duplicate and shuffle and build one static dataset into files.

#

we are already close. Basically we just need to skip the tokenized-dataset step, and do this in data loader

hearty flicker Nov 10, 2023, 6:52 PM

#

Not sure how to integrate that with multiple workers / processes though

quasi steppe Nov 10, 2023, 6:54 PM

#

hearty flicker Not sure how to integrate that with multiple workers / processes though

mmap the original jsonl file, and let each worker generate random indices, do all the manipulation until buffer is full, encode everything and yield the result

#

basically start with one single process DataLoader that does all of these lazily. To scale to multi-gpus have another DataLoader wrapped around a few processes each targeting the original dataloader (or maybe torch dataloader has some built-in stuff for this). For multi-node I don't think the code needs to change because we sample at random, and global_rank doesn't really matter.
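A sketch of the lazy, single-process version of this idea. `stream_chunks` and `load_entry` are hypothetical names; `load_entry(idx)` stands for whatever reads, tokenizes, augments, and encodes one entry into a list of token ids. Entries are drawn at random with replacement, so there is no strict notion of an epoch here.

```python
import random


def stream_chunks(load_entry, num_entries, seq_len, seed):
    """Endlessly yield seq_len-token chunks built from randomly drawn entries."""
    rng = random.Random(seed)  # give each worker a different seed
    buffer = []
    while True:
        while len(buffer) < seq_len:
            idx = rng.randrange(num_entries)
            # load_entry must return a non-empty list, or this loop never ends
            buffer.extend(load_entry(idx))
        yield buffer[:seq_len]
        buffer = buffer[seq_len:]
```

Because each worker only needs its own seed, running several copies side by side does not require any coordination between them, which is the property the message relies on for the multi-node case.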

hearty flicker Nov 10, 2023, 6:58 PM

#

Ok that sort of makes sense I'll look into it tmo

quasi steppe Nov 10, 2023, 7:18 PM

#

hearty flicker Ok that sort of makes sense I'll look into it tmo

yeah no hurry. I will have some time Sunday and I could help. I have done these before actually. But now there are probably even better APIs from pytorch

quasi steppe Nov 12, 2023, 12:38 PM

#

Man.... Can't believe training a 400M model with 8192 length on a single node can only do batch_size=1... I'm sure pp will help but I thought we only need that when > 1B...

#

and it's 13sec per batch lol... total batch size 128 (grad acc 16 steps)
Running profiling tools now.... Didn't do the math but I'm sure the flops is incredibly bad
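A back-of-the-envelope version of "the math", using the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass. The inputs below are assumptions layered on the numbers in the messages: 13 seconds is taken as the time for the full batch of 128 sequences, the node is assumed to have 8 GPUs, and the per-GPU peak is assumed to be 312 TFLOP/s (A100 bf16 dense). The approximation also ignores attention FLOPs, which are not negligible at 8192 context.

```python
def approx_utilization(n_params, tokens_per_step, step_seconds, n_gpus, peak_flops_per_gpu):
    """Rough model-FLOPs utilization using the ~6 * params * tokens rule of thumb."""
    flops_per_step = 6 * n_params * tokens_per_step
    achieved_per_gpu = flops_per_step / step_seconds / n_gpus
    return achieved_per_gpu / peak_flops_per_gpu


tokens_per_step = 128 * 8192  # total batch size x sequence length from the messages
mfu = approx_utilization(
    n_params=400e6,              # model size as quoted in the message
    tokens_per_step=tokens_per_step,
    step_seconds=13.0,           # assumption: 13 s covers the whole batch of 128
    n_gpus=8,                    # assumption: one 8-GPU node
    peak_flops_per_gpu=312e12,   # assumption: A100 bf16 dense peak
)
print(f"approximate utilization: {mfu:.1%}")
```

Under these assumptions the estimate comes out in the single-digit percent range; changing any of the assumed inputs shifts it proportionally.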

hearty flicker Nov 12, 2023, 6:17 PM

#

The reason why might be that you're not using gradient checkpointing

#

If the issue is the vram, otherwise there must be a different issue

#

You can use the profiler to make sure that flash attention is working?

#

Also are you training in bf16?

quasi steppe Nov 12, 2023, 6:48 PM

#

hearty flicker Also are you training in bf16?

Tried both bf16 and fp16

quasi steppe Nov 12, 2023, 6:50 PM

#

hearty flicker The reason why might be that you're not using gradient checkpointing

Yeah. There is a bug with gradient checkpointing right now. I will fix that later.
But when I used to do pretraining we just went without gradient checkpointing at all and still fit a couple on 3-7B models

quasi steppe Nov 12, 2023, 6:54 PM

#

hearty flicker You can use the profiler to make sure that flash attention is working?

Already did. Also tried to change the code to explicitly use flash_attn. No difference in both vRAM usage and speed so flash attention is working well.

Used a profiling tool to dump a json for perfetto visualization. It seems it went alright. Didn't see a lot of bubbles other than low level stuff between ops inside kernels. I don't have an immediate idea about what to improve

#

For vRAM I'm sure it can improve with pp. Need to change code a little bit I think. I'm still really confused by flops

quasi steppe Nov 12, 2023, 9:04 PM

#

@hearty flicker DDP seems to have trouble with gradient checkpointing

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

Guess that's probably why people didn't recommend gradient checkpointing when I worked on pretraining.

quasi steppe Nov 12, 2023, 9:34 PM

#

FSDP and cpu offload seem to make it fit and automatically scale to multinodes. Looks slow but didn't do the math carefully. I thought I'd need to manually write the wrappers but it's nice that accelerate takes care of them already.
I should be able to start a bigger run tomorrow on the bitmidi data we have.

sand nymph Nov 13, 2023, 1:59 AM

#

@quasi steppe it looks like you can just pre-compute this? Build the two epochs separately and concatenate them. It's only tricky if you try to do it on the fly I think

quasi steppe Nov 13, 2023, 8:12 AM

#

sand nymph @quasi steppe it looks like you can just pre-compute this? Build the two...

The dataset? Yeah right now I do exactly this. I precompute everything for a few epochs and save to disks, then load sequentially to train. Huggingface datasets library is doing a great job at this scale.

quasi steppe Nov 13, 2023, 8:56 AM

#

@hearty flicker the xlarge sized model (I made a mistake, it's 700M rather than 400M) trained on the full bitmidi at around 8 epochs

#

the chord is fairly different from the start already.

hearty flicker Nov 13, 2023, 10:59 AM

#

Try it on some multi-track stuff

#

This is just trained on bitmidi right?

#

Not including the classical data?

hearty flicker Nov 13, 2023, 11:00 AM

#

sand nymph @quasi steppe it looks like you can just pre-compute this? Build the two...

Is this the way it's typically done? The issue is that I normally train for like 50 epochs...

#

The file size might get too big

#

My immediate goal is to get the audio downloading and transcription working, then I'll try to sort out the dataloading and training stuff

#

It would be nice @quasi steppe if eventually we could unify all of the code that we are using into aria/data/datasets.py and aria/train.py

#

Don't worry about it for now though, haha

#

What did the loss curve look like for giantmidi-700m?

quasi steppe Nov 13, 2023, 11:04 AM

#

hearty flicker What did the loss curve look like for giantmidi-700m?

log-loss. Very beautiful. But the model sucks lol

hearty flicker Nov 13, 2023, 11:04 AM

#

It might be undertrained

quasi steppe Nov 13, 2023, 11:04 AM

#

I'm listening to the generations and it really sucks. There are some good ones but failure cases are too many

quasi steppe Nov 13, 2023, 11:04 AM

#

hearty flicker It might be undertrained

very likely

hearty flicker Nov 13, 2023, 11:05 AM

#

It sounds like it from the sample you sent

#

btw if you test it, you should test it on pop music not classical

#

There is very little classical piano stuff in bitmidi

quasi steppe Nov 13, 2023, 11:07 AM

#

but the loss is already below 2 (which is where the "overfit" classical one got to). The loss value doesn't mean too much, but I would have expected classical to be more repetitive and structured than pop, so it should have a lower loss given the same training conditions

quasi steppe Nov 13, 2023, 11:07 AM

#

hearty flicker btw if you test it, you should test it on pop music not classical

already did. Don't want to spam the channel haha

#

not looking good

hearty flicker Nov 13, 2023, 11:08 AM

#

Weird haha

#

I think it's useful to have val loss just so we have an idea of how over/underfit the model is

#

My experience is that the best checkpoints occur when val loss stops decreasing

quasi steppe Nov 13, 2023, 11:10 AM

#

hearty flicker I think it's useful to have val loss just so we have an idea of how over/underfi...

I'm quite sure that the vram blows up if I evaluate. I remember huggingface did a terrible job on mem management when you have eval going on during training

#

or maybe they have fixed it by now. I will try

hearty flicker Nov 13, 2023, 11:11 AM

#

Hmmm, I've never used hf trainer to be fair

#

Seems to work fine with my train script, idk

quasi steppe Nov 13, 2023, 11:12 AM

#

hf trainer actually doesn't have fancy code. I glanced through it over the weekend. It's actually similar to yours, but the advantage is that it works with HF datasets out of the box and supports a lot of fancy wrappers like FSDP or deepspeed pp, etc.

#

(well I mean it's not hard to adapt codes for those stuff... just that it's a little faster to try out things quickly haha)

hearty flicker Nov 13, 2023, 11:14 AM

#

Yeah true, I'm going to adjust the dataloader stuff and train script this week

#

So that it does what we need

#

There must be some sort of issue going on, musenet was also trained on bitmidi

#

And the samples from that sound good

#

As a sanity check you could try to recreate the checkpoint we've been using on the notebook

#

50 epochs at medium

#

On giant midi + kunstderfuge + mutopia + maestro

#

On the bucket

quasi steppe Nov 13, 2023, 11:18 AM

#

I'm gonna do some basic data stats. There are a few possibilities:

10 epochs is not enough (but it kinda worked for classical so idk...)
I changed the optimizer from Adam to Adafactor for mem reduction but maybe Adam works better
model architecture? 96 layers is really really deep and honestly it's my first time training that deep of a network. I could add some monitoring of each layer and I suspect it's really too deep.
data quality issue. I did random sampling and many are short (but not short enough to be filtered).

hearty flicker Nov 13, 2023, 11:19 AM

#

There is a min length filter in the config that you can use btw

#

When building MidiDataset

quasi steppe Nov 13, 2023, 11:19 AM

#

hearty flicker There is a min length filter in the config that you can use btw

yeah, I noticed that. Didn't tweak that when building the dataset

#

this training job wasn't serious anyway. Just left it on yesterday and I can't believe it's not preempted for a whole day haha

hearty flicker Nov 13, 2023, 11:21 AM

#

It could also be that the concatenation stuff we are doing isn't helping

#

Maybe try to recreate the old checkpoint with the new training methods

quasi steppe Nov 13, 2023, 11:21 AM

#

hearty flicker It could also be that the concatenation stuff we are doing isn't helping

it helped for classical I think

hearty flicker Nov 13, 2023, 11:21 AM

#

And we can directly compare the quality

quasi steppe Nov 13, 2023, 11:21 AM

#

the 100x large was good IMO

hearty flicker Nov 13, 2023, 11:22 AM

#

Try with medium because it's hard to know exactly what is affecting what

#

Might be just using a larger model or more epochs

#

It's only 24 4090 hours so shouldn't be hard

quasi steppe Nov 13, 2023, 11:22 AM

#

oh by the way, this xlarge run used 8192 as max length....

hearty flicker Nov 13, 2023, 11:23 AM

#

Yeah lol crazy

#

Insane context length

quasi steppe Nov 13, 2023, 11:23 AM

#

I suspect that 8192 is too long and we got too few training steps so I actually think being undertrained is very likely

hearty flicker Nov 13, 2023, 11:24 AM

#

If you test the medium cp, test with 2k

quasi steppe Nov 13, 2023, 11:24 AM

#

yep

hearty flicker Nov 13, 2023, 11:24 AM

#

There are so many factors here, it's probs best to be systematic

#

Hard to know what is doing what haha

quasi steppe Nov 13, 2023, 11:25 AM

#

yeah absolutely

#

The sample length distribution of bitmidi. Super long tail and I need to zoom in a bit

#

actually most samples are very long. If we filter out anything below 2048 tokens, we are only removing about 3% of the tokens.......
And we got (only) a total of 700M tokens here.
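A small sketch of how the "only about 3% of the tokens" figure can be computed from a list of per-sample token counts. `token_fraction_removed` is a hypothetical helper and the lengths in the usage line are made up for illustration, not taken from bitmidi.

```python
def token_fraction_removed(lengths, min_len):
    """Fraction of all tokens that sit in samples shorter than min_len."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    removed = sum(n for n in lengths if n < min_len)
    return removed / total


# Made-up lengths: half the samples are short, yet they hold few of the tokens.
example = token_fraction_removed([100, 200, 3000, 6700], min_len=2048)
```

The point of the toy numbers is that the fraction of samples removed and the fraction of tokens removed can differ a lot when the length distribution has a long tail.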

quasi steppe Nov 13, 2023, 6:22 PM

#

multi-node training works. Still crappy flops though...
I will start with trying medium 2048 bitmidi later this week. DDP should be more than enough for that

hearty flicker Nov 13, 2023, 6:42 PM

#

Very nice stuff

timber talon Nov 14, 2023, 7:57 AM

#

question —

I got the wav -> midi transcription inference working, and it's outputting a set of files per wav file:

#

📎 valid_000_2nd.velocity 📎 valid_000_2nd.onset 📎 valid_000_2nd.offset

📎 valid_000_2nd.json 📎 valid_000_1st.onset 📎 valid_000_1st.velocity 📎 valid_000_1st.json

📎 valid_000_1st.offset

#

do you know, offhand, if these are canonical enough to work with, or should i get the rest of the inference scripts running as well? (these 10 files are all for the same song)

#

because of how bad the code is, the bar goes exponentially higher for getting the other processing scripts working — would probably need a full rewrite. And most of what the other processing scripts are doing is calculating F1 score and such, so just wanted to check

#

oh wait – reading section 4.3, now — definitely want to ignore all the *1st* files in those lists, those are just for the first-level loss calculation. I wonder whether the 2nd.json is enough? If the pitches look weird, there is a conversion function in another file: https://github.com/sony/hFT-Transformer/blob/master/evaluation/m_transcription.py#L98-L106 which converts the json to .txt for mir_eval and does a functional transformation on the pitch specifically

golden chasm Nov 14, 2023, 10:26 AM

#

@hearty flicker Hi, I have been trying to start a pretraining test using a small dataset created using some random midi files. I have encountered a crash at line 801 of file aria/tokenizer/tokenizer.py.
It crashes on this line, which raises an IndexError if the stack is empty:

stack[-1]["dur"] = tok

For the moment I solved it this way, but I am not sure it is the proper solution:

if len(stack) > 0:
    stack[-1]["dur"] = tok

Maybe it is just a problem with my midi files having content which is not supported? Could you share the midi files you used to train the model?

hearty flicker Nov 14, 2023, 10:31 AM

#

golden chasm @hearty flicker Hi, I have been trying to start a pretraining test using a...

Hey! I'll look at this, in a few hours

hearty flicker Nov 14, 2023, 10:33 AM

#

timber talon question — I got the wav -> midi transcription inference working, and it's outp...

Their code is so bad it's shocking

#

So the output is just a bunch of eval info, not a midi file

#

Very annoying

#

I could probably write a MIDI converter for this

#

Are those files really just for one song?

hearty flicker Nov 14, 2023, 10:43 AM

#

timber talon oh wait – reading section 4.3, now — definitely want to ignore all the `*1st*` f...

In the paper there is an algorithm for converting from the 2nd files to notes

#

I wonder if that is implemented in the repo

#

If this is too hard, we can just use Kong et al for now

hearty flicker Nov 14, 2023, 11:13 AM

#

@timber talon did you figure out what the 'mpe' acronym means?

#

This file seems to hold the code for converting to MIDI

#

https://github.com/sony/hFT-Transformer/blob/master/model/amt.py

#

If we can find the right config file, this might be all we need

#

I think you just have to run the AMT methods sequentially to get a MIDI out

timber talon Nov 14, 2023, 7:19 PM

#

great... yeah, I'll take a look. A config file was kinda the root of the problem yesterday, but I solved that

#

omg haha i totally missed this one little method all the way at the bottom of the file: https://github.com/sony/hFT-Transformer/blob/master/model/amt.py#L347-L355

#

that's not used anywhere else. ok, sweet, this is totally doable. sorry, i'm just not entirely fluent in midi, yet, so wasn't sure if the offsets/mpe files were something someone more familiar than me would look at and be like "oh, obviously, we can turn this into midi "

timber talon Nov 14, 2023, 9:33 PM

#

ok — not bad!

#

i tried doing some transcriptions for songs that were not in the training data (Paderewski isn't in MAESTRO)

timber talon Nov 14, 2023, 10:00 PM

#

Here's my fork of the repo along with the command to run the transcription on a directory:

https://github.com/alex2awesome/hFT-Transformer/tree/master

#

transcription is not super fast — 10-20 seconds per piece, on average. But fits on a 12GB GPU. I have access to loads of those, and can easily parallelize this processing

hearty flicker Nov 15, 2023, 10:50 AM

#

timber talon transcription is not super fast — 10-20 seconds per piece, on average. But fits ...

This is great news Alex, amazing work

#

Yeah we can parallelize this heavily.

#

I'm still working on the downloading pipeline

#

should be done by friday

hearty flicker Nov 15, 2023, 11:04 AM

#

timber talon

This sounds great

#

Here is what the midi sounds like once I converted it back into audio using fluidsynth

#

This is so promising lol, has gotten me excited !!

#

I love this piece by the way, I've never heard of this composer

hearty flicker Nov 15, 2023, 2:39 PM

#

@quasi steppe Over the next few days I'm going to implement a different version of TokenizedDataset that supports the functionality that you want

#

Aka all the epochs are concatenated into one big file.

#

During fine-tuning it might be best to have the original implementation with padding, so I'm going to keep that functionality in a different class

#

I'll also make the changes to the train script so we can work from there directly

#

Currently working on the spotify_dl stuff so alex and I can get our pipeline working

timber talon Nov 15, 2023, 5:30 PM

#

I spent a lot of time yesterday trying to test out some OMR systems — it would be another source of MIDI that is definitely open-domain, and would open up all of IMSLP. Was hard to find anything that looked credible

#

I tried this and despite it being under pretty constant/recent pushes, it was still not working: https://github.com/BreezeWhite/oemer

#

@hearty flicker do you know of any OMR libraries? Or is OMR just not really a thing?

hearty flicker Nov 15, 2023, 5:32 PM

#

Not something I've ever looked into

#

There might be some deep learning research on it though

#

the download stuff is nearly working btw

#

I've just forked spotify-dl and added the extra functions we need

timber talon Nov 15, 2023, 6:03 PM

#

I was disappointed when I tried to use the Spotify_dl in some trial runs. Especially for more diverse composers — women piano composers in the classical era, for instance— I was getting very few hits

#

That’s what motivated searching for some good OMR tools

#

I’ll keep looking around, but if you hear of anything (maybe your advisor knows?) that would be really cool

#

I signed up for IMSLP, btw, so that’s another potential source for public domain audio and midi

hearty flicker Nov 15, 2023, 6:10 PM

#

Yeah, I mean spotify_dl can only find stuff it matches on youtube

#

It really depends how much of it is on youtube

#

Any other way of sourcing solo piano recordings also works

#

We can use a combination of different methods. As the (audio -> midi) transcription is so good, any source of solo piano audio would be great

hearty flicker Nov 15, 2023, 6:29 PM

#

@timber talon You can use my fork of spotify-dl

#

https://github.com/loubbrad/spotify-dl

#

It automatically prunes out non solo piano recordings by ensuring that the number of artists on the spotify metadata is <= 2
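A tiny sketch of the pruning rule described above. It assumes each track's metadata is a dict with an `artists` list, which is how Spotify track objects are usually shaped; `looks_like_solo_piano` is a hypothetical name and not the function in the fork.

```python
def looks_like_solo_piano(track, max_artists=2):
    """Heuristic from the message above: keep tracks credited to at most two artists."""
    return len(track.get("artists", [])) <= max_artists


def prune_tracks(tracks, max_artists=2):
    """Keep only the tracks that pass the artist-count heuristic."""
    return [t for t in tracks if looks_like_solo_piano(t, max_artists)]
```

The limit of two presumably allows for a composer plus a performer; orchestral or chamber recordings tend to credit more names and get dropped.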

#

It also skips downloading duplicate files (only album/playlist wide at the moment, will improve this)

#

And it takes a text file of links with the --file arg

#

Actually it should skip downloading dupes for every album in the text file

timber talon Nov 16, 2023, 4:37 AM

#

some initial results using Audiveris to OMR on some sheet music.

It's not great, but if we're doing any data augmentation/ noising, it will fit in on that level

#

📎 Waltz_in_G_major_D.844.pdf 📎 test-2.mvt2.mxl 📎 test-2.mvt1.mxl 📎 3_Minuets_D.380_1st__2nd_Minuets_only.pdf 📎 test.mxl

hearty flicker Nov 16, 2023, 12:11 PM

#

How do the mxl files work?

timber talon Nov 16, 2023, 4:27 PM

#

Oh you load them into musenote to listen. They’re primarily for musical notation software. But they’re also convertible to MIDI, I can do that quickly

quasi steppe Nov 19, 2023, 12:39 PM

#

Got a lot of idle machines in the cluster so I sent a training job this morning. 64 A100, large.json (700M?), 100 epochs of bitmidi. I have gone back to 2048 context length. We can extend the length later easily.

#

That bump is interesting though. It does happen all the time in LLM pretraining

hearty flicker Nov 19, 2023, 2:53 PM

#

That's cool

#

I've been bogged down with a bunch of irl stuff, should have more time this next week, however I am moving apartment so we will see

#

The dataset stuff with @timber talon is looking very promising. Going to try to retrain the hft model on monday

quasi steppe Nov 19, 2023, 3:19 PM

#

25 epoch checkpoint is actually pretty good!!!

#

I think directly pretraining on 8192 is somehow hurting the quality

#

I'm more and more convinced an LM should pretrain on 2048 or less, and then extend procedurally

#

the prompt is 200 tokens (first 2-3 bars) from some random online stuff. No CFG, an honest 1.0 temp for the current one.

hearty flicker Nov 19, 2023, 3:41 PM

#

This is pretty good

quasi steppe Nov 19, 2023, 3:42 PM

#

this is how it handles classical

#

bitmidi is pop heavy right? How much classical could it have?

hearty flicker Nov 19, 2023, 3:44 PM

#

It might have some

#

hard for me to know

#

It's a web scrape of like 200k midi files

#

The context length thing is interesting

#

This sounds way way better than last time

quasi steppe Nov 19, 2023, 3:45 PM

#

hearty flicker This sounds way way better than last time

yeah totally agree

hearty flicker Nov 19, 2023, 3:45 PM

#

Could it be a problem with the freq used for rotary embs?

quasi steppe Nov 19, 2023, 3:46 PM

#

hearty flicker Could it be a problem with the freq used for rotary embs?

I think it's just that there are fewer gradient steps. 4x length means 25% gradient steps

#

I should probably have increased lr by 4x but I doubt it would help much. Each step is an attempt at searching the loss landscape, so fewer steps definitely cover less ground
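
To make the arithmetic concrete, here is a tiny sketch. It assumes the total token budget and the number of sequences per batch stay fixed while only the context length changes; the function name and the numbers are made up for illustration and are not the actual run config.

```python
def gradient_steps(total_tokens, context_length, sequences_per_batch):
    # One optimizer step consumes sequences_per_batch * context_length tokens.
    tokens_per_step = context_length * sequences_per_batch
    return total_tokens // tokens_per_step


# Hypothetical budget: 2**30 tokens, 256 sequences per batch.
steps_2k = gradient_steps(2**30, 2048, 256)
steps_8k = gradient_steps(2**30, 8192, 256)
ratio = steps_8k / steps_2k  # 4x the context length gives 1/4 of the steps
```

Under those assumptions each step sees 4x more tokens at 8192, so the same data yields a quarter of the updates.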

hearty flicker Nov 19, 2023, 3:56 PM

#

Will be cool to see if the new tokenizer improves the results

#

My suspicion is that it will make timing related stuff better

quasi steppe Nov 19, 2023, 4:19 PM

#

quasi steppe this is how it handles classical

chopin might still be kinda popular on the internet. Trying some less heard stuff.
This second one shows a smart strategy of just repeating the prompt lol

#

something different. So it knows to improvise at temperature 1.0. Better decoding param should give much better results (esp I haven't applied CFG and we haven't implemented beam search)

quasi steppe Nov 19, 2023, 4:54 PM

#

This is quite interesting. The job stopped and I had to use an earlier checkpoint to resume. The dataloader should also be resumed and deterministic, esp since most loss values show the same ups and downs. But there was a loss spike all of a sudden

hearty flicker Nov 19, 2023, 4:55 PM

#

That might be because the Adam params reset

quasi steppe Nov 19, 2023, 4:55 PM

#

no, optimizer states are recorded

hearty flicker Nov 19, 2023, 4:55 PM

#

Weird

quasi steppe Nov 19, 2023, 4:55 PM

#

everything is saved well with huggingface's Trainer. They did a good job on this

quasi steppe Nov 19, 2023, 4:56 PM

#

hearty flicker Weird

I suspect there is some minor floating-point issue in distributed training that is not deterministic. We didn't see these on single-node training jobs

#

I used to think those spikes in LLM pretraining are due to data outliers. Now at least this is ruled out

quasi steppe Nov 20, 2023, 7:18 AM

#

Nice. The 8-node job is still running. 36B tokens trained. We have more tokens than parameters by the way. If it ever overfits it shouldn't be that bad.

hearty flicker Nov 20, 2023, 1:33 PM

#

Was this with 2048 too?

quasi steppe Nov 20, 2023, 5:39 PM

#

hearty flicker Was this with 2048 too?

yeah it was the continuation of my previous run until 100x epochs

#

gonna be interesting

#

I saved every 5000 steps so we can even study how the latent space changes

quasi steppe Nov 20, 2023, 10:18 PM

#

125000 step checkpoint... This is quite amazing.
I generated 8 samples and all of them are amazing.

#

I did some random sampling of the bitmidi dataset and I'm fairly confident that this prompt is close to out of distribution. Meaning this "overfit" model generalizes very well

#

quasi steppe Nov 20, 2023, 11:20 PM

#

Traceback (most recent call last):                                                                                                                   
  File "/fsx/home-honglu/aria/generate_large.py", line 135, in <module>
    sample(
  File "/fsx/home-honglu/aria/generate_large.py", line 106, in sample
    res_midi_dict = tokenizer.detokenize(tokenized_seq)
  File "/fsx/home-honglu/aria/aria/tokenizer/tokenizer.py", line 79, in detokenize
    return self.detokenize_midi_dict(tokenized_seq)
  File "/fsx/home-honglu/aria/aria/tokenizer/tokenizer.py", line 469, in detokenize_midi_dict
    _channel = instrument_to_channel["drum"]
KeyError: 'drum'

Got this error

hearty flicker Nov 21, 2023, 9:35 AM

#

This is amazing work

#

Hmmm it looks like that issue happened because there was a drum token sampled, but it's not an instrument that is supposed to be there?

#

I can probs put a check for that, but technically it shouldn't ever happen

quasi steppe Nov 21, 2023, 11:44 AM

#

hearty flicker Hmmm it looks like that issue happened because there was a drum token sampled, b...

that was when I tried extrapolation out of context and interpolate styles. It could just be that the generation is corrupted. But it should probably send out a warning instead of erroring out the whole script.

#

maybe like adding a force=True so that we turn those errors into warnings and just stop detokenizing at the wrong token for each sample
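
Something along these lines, as a rough sketch of the idea. This is not the real aria tokenizer: the token format, the `INSTRUMENT_TO_CHANNEL` table and this `detokenize` signature are all made up for illustration.

```python
import warnings

# Hypothetical mapping; the real tokenizer's tables are not shown in the chat.
INSTRUMENT_TO_CHANNEL = {"piano": 0}


def detokenize(tokens, force=False):
    """Convert (instrument, pitch) tokens into (channel, pitch) notes.

    With force=True, an unknown instrument produces a warning and decoding
    stops at that token instead of raising.
    """
    notes = []
    for index, (instrument, pitch) in enumerate(tokens):
        if instrument not in INSTRUMENT_TO_CHANNEL:
            message = f"unexpected instrument {instrument!r} at token {index}"
            if not force:
                raise KeyError(message)
            warnings.warn(message + "; truncating output")
            break
        notes.append((INSTRUMENT_TO_CHANNEL[instrument], pitch))
    return notes
```

With the default `force=False` the behaviour stays strict, so a corrupted sample only gets truncated silently when the caller asks for it.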

hearty flicker Nov 21, 2023, 11:45 AM

#

Yup I can add a check

hearty flicker Nov 21, 2023, 3:06 PM

#

How is the Yarn stuff going btw @quasi steppe ?

#

It seems to work super well for nlp, will be amazing if we can get it working well in this context

#

I need to get a fluidsynth soundfont setup that correctly handles multitrack

quasi steppe Nov 22, 2023, 3:20 AM

#

hearty flicker It seems to work super well for nlp, will be amazing if we can get it working we...

Tried the non-finetuned version and it didn't extrapolate well. Will try the finetuned yarn and use a small portion of data. We will see

quasi steppe Nov 22, 2023, 11:24 AM

#

@hearty flicker is there a convenient UI that reads the model output as a stream and renders it into sound? Would be cool to write a model-based infinite music player if such a stream reader is already available in open source

hearty flicker Nov 22, 2023, 11:24 AM

#

I've had exactly that idea

#

And it's super possible

#

I'll build an API for it today maybe

#

It would be really cool and easy to write

quasi steppe Nov 22, 2023, 11:25 AM

#

I tried to run my model locally and it was about 15 tok per sec. If we quantize it it can be really fast

hearty flicker Nov 22, 2023, 11:26 AM

#

Yup, more than enough for live playing

quasi steppe Nov 22, 2023, 11:26 AM

#

Def good enough for streaming tokens

hearty flicker Nov 22, 2023, 11:26 AM

#

I was thinking of eventually building a ggml backend too

quasi steppe Nov 22, 2023, 11:26 AM

#

Yeah!

hearty flicker Nov 22, 2023, 11:26 AM

#

Did you fix the problem with the bucket perms btw? Maybe I was doing something wrong

quasi steppe Nov 22, 2023, 11:27 AM

#

hearty flicker Did you fix the problem with the bucket perms btw? Maybe I was doing something w...

Hmm I could write using that json certificate

#

From sai cluster it was alright

hearty flicker Nov 22, 2023, 11:27 AM

#

I kept getting 403s

quasi steppe Nov 22, 2023, 11:27 AM

#

I already put in all the model checkpoints

hearty flicker Nov 22, 2023, 11:27 AM

#

When I'm at my laptop I'll get back to you

quasi steppe Nov 22, 2023, 11:28 AM

#

Yeah I will try again later

hearty flicker Nov 22, 2023, 11:28 AM

#

Were you using gsutil mv?

quasi steppe Nov 22, 2023, 11:28 AM

#

cp

hearty flicker Nov 22, 2023, 11:28 AM

#

Alright I'll try that

quasi steppe Nov 22, 2023, 11:28 AM

#

Before it needs gcloud auth .... using that json

quasi steppe Nov 22, 2023, 7:24 PM

#

int8 quantization kinda works. Feels like 1.5x speed-up from fp16 but haven't measured it. I will push the code later.

Made a draft PR to fool around but we don't have to merge it. Still playing with quantization on my laptop.

hearty flicker Nov 23, 2023, 1:39 PM

#

Is there much quality degradation?

quasi steppe Nov 23, 2023, 3:26 PM

#

hearty flicker Is there much quality degradation?

There is.... I'm using pytorch fx now. It spits out that invalid drum token all the time

#

not sure what the best setup is for CPU inference

hearty flicker Nov 23, 2023, 3:28 PM

#

I don't think CPU will work with flash attention

#

I tried it a few weeks ago and it didn't work

quasi steppe Nov 23, 2023, 3:30 PM

#

hearty flicker I don't think CPU will work with flash attention

I would guess that the pytorch attention implementation falls back to the usual attention implementation?

#

It works on my computer

hearty flicker Nov 23, 2023, 3:30 PM

#

I'm moving flat today so am quite busy, next week I should be back to my normal schedule

#

Oh that's weird, maybe there was a patch

quasi steppe Nov 23, 2023, 3:31 PM

#

I have a draft PR for fooling around. It works fine on my computer. I refactored all those device stuff. You can play with it.

hearty flicker Nov 23, 2023, 3:31 PM

#

I tried it briefly and scaled_dot_product_attention was causing some issues

#

Might be dependent on my computer though ha

#

I will play with it

quasi steppe Nov 23, 2023, 3:33 PM

#

pytorch has some other quantization options. If we use the MultiheadAttention class instead of scaled_dot_product_attention, something seems to apply to that directly. Currently I only quantize all those dense layers (also need to sort out the degradation.... Hope it's just a parameter issue)

#

When the sequence gets long the speed is a loooot slower on my laptop and I fear without quantization the token stream wouldn't catch up with the player frontend, if we want to do what we talked about earlier

hearty flicker Nov 23, 2023, 4:30 PM

#

If we did it with ggml it would be fine

#

If we are aiming for arm macs

#

I mean you can run llama7b at 10-15 toks/s I think

#

So for our model there will be more than enough speed

sand nymph Nov 24, 2023, 3:37 PM

#

Whats the formula for model training cost (in A100-hours) in terms of dataset size and # params?
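
For reference, a common rule of thumb for dense transformers is that training takes roughly 6 * N * D FLOPs for N parameters and D training tokens. A back-of-the-envelope sketch follows; the 312 TFLOP/s figure is the A100's peak dense bf16 throughput, and the 40% utilization default is just an assumption that varies a lot between setups.

```python
def a100_hours(n_params, n_tokens, mfu=0.4, peak_flops=312e12):
    # Approximate training compute: 6 FLOPs per parameter per token.
    total_flops = 6 * n_params * n_tokens
    # Effective throughput of one A100 at the assumed utilization (mfu).
    seconds = total_flops / (peak_flops * mfu)
    return seconds / 3600


# Example call with numbers mentioned earlier in the channel
# (a ~700M parameter model, 36B tokens); treat the result as a rough estimate.
estimate = a100_hours(700e6, 36e9)
```

This ignores things like attention FLOPs at long context, data loading stalls and restarts, so real runs usually cost more.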

sharp quiver Nov 28, 2023, 6:27 PM

#

https://fxtwitter.com/alexfmckinney/status/1729459787614638295

FixTweet / FixupX

Releasing JAX x Equinox code and a 101M parameter model checkpoint for my homebrewed MIDI transformer TchAIkovsky ☕ Feel free to have a play around with it 🙂 It isn't SOTA but produces some fun results~ https://github.com/vvvm23/tchaikovsky

Alex McKinney (@alexfmckinney)


quasi steppe Nov 28, 2023, 6:47 PM

#

sharp quiver https://fxtwitter.com/alexfmckinney/status/1729459787614638295

seems undertrained but it's really nice! Sounds better than our early checkpoints in terms of that repetition problem

hearty flicker Dec 5, 2023, 5:56 PM

#

If anyone is interested, here are some samples I compiled a few weeks ago from the old version of the model @here

#

https://soundcloud.com/loua19/sets/aria-samples

SoundCloud

loua19

Aria Samples

Early classical samples from Aria v0.1


#

Will be interesting to see how much of a difference the improved tokenizer/datasets/scale/finetuning will make

#

Ok SAI keeps kicking me off and I'm stuck with using the terrible train script provided by this paper

#

Will run the transcription training on my home server instead

quasi steppe Dec 7, 2023, 11:26 PM

#

there are some idle machines in SAI today. Gonna try to see if I can get yarn finetuning working.
Managed to get 100 training steps done but it crashed when it tried to save optimizer states. I've had a lot of trouble with my training code and keep getting weird C++ errors all the time...
Gonna switch to your code and I'm building an 8192-length dataset now

hearty flicker Dec 8, 2023, 5:53 PM

#

quasi steppe there are some idle machines in SAI today. Gonna try to see if I can get yarn fi...

Do you think this was because of issues with my code?

#

Could be an accelerate issue maybe?

#

I haven't fully tested my training script on multiple nodes etc.

quasi steppe Dec 8, 2023, 5:57 PM

#

hearty flicker Do you think this was because of issues with my code?

nope I didn't use your training code. I was quickly trying out YaRN finetuning. Supposedly it should only take 1% of the original training data so I was trying on my training codes

#

it worked before but all of a sudden it breaks down.
Still working on it. I found some bugs in the yarn code. Will do a major PR for the YaRN component

#

but I don't know if those will resolve the c++ errors yet

hearty flicker Dec 8, 2023, 5:58 PM

#

Cool, I'm having a bug in the data augmentation stuff with the new tokenizer, so I'll probably not be able to merge it till monday. I'm away this weekend in cambridge unfortunately so I can't work on it

#

c++ errors scare me

#

haha

quasi steppe Dec 8, 2023, 5:59 PM

#

got something along the lines of "some kernel returned NULL without raising an error". Googled it and saw in pytorch forum a dev said he hasn't seen this error for years lol

hearty flicker Dec 8, 2023, 5:59 PM

#

Would be so amazing to get yarn finetuning working

hearty flicker Dec 8, 2023, 5:59 PM

#

quasi steppe got something along the lines of "some kernel returned NULL without raising an e...

Eeek

#

I was getting C++ errors when trying to get gradient checkpointing to work with torch.compile

quasi steppe Dec 8, 2023, 6:00 PM

#

yeah mine was from torch.compile too

hearty flicker Dec 8, 2023, 6:00 PM

#

Maybe we should add a flag to the train script to skip compiling

#

Since it seems to randomly cause weird issues

#

I need to remember to go over the transformer optimization document you sent also

quasi steppe Dec 8, 2023, 6:01 PM

#

I disabled that, the training got slower but it worked until it was trying to save the first checkpoint. When it got to optimizer states it got another c++ error along the line of "all gather failed because different devices have different values of something" lol

hearty flicker Dec 8, 2023, 6:02 PM

#

Very weird

#

Is this using hf Trainer?

quasi steppe Dec 8, 2023, 6:02 PM

#

hearty flicker Is this using hf Trainer?

yeah

hearty flicker Dec 8, 2023, 6:02 PM

#

Could be an issue on that end then

#

Was this also on multiple nodes?

quasi steppe Dec 8, 2023, 6:03 PM

#

hearty flicker Was this also on multiple nodes?

didn't try. It worked earlier without yarn so it could also have to do with that

#

and I upgraded accelerate in my conda env and that's another possible cause

#

will try to debug more today

hearty flicker Dec 8, 2023, 6:04 PM

#

I can also have a look on monday

#

I should probably read the Yarn paper properly too

#

did the iclr rebuttals get reviewed yet?

quasi steppe Dec 8, 2023, 6:05 PM

#

the yarn code was a bit confusing and I realized I need to copy a different class from our yarn repo if I want to do finetuning. I'm trying to refactor a bit to clean some stuff up.

quasi steppe Dec 8, 2023, 6:06 PM

#

hearty flicker did the iclr rebuttals get reviewed yet?

yeah for Yarn one reviewer flipped and we got 6 6 6 8. Should be able to get in

hearty flicker Dec 8, 2023, 6:06 PM

#

Oh lovely !

#

How about cfg?

quasi steppe Dec 8, 2023, 6:06 PM

#

hearty flicker How about cfg?

nobody responded

#

so probably a rejection

hearty flicker Dec 8, 2023, 6:06 PM

#

A shame, made a big impact for this project anyway haha

quasi steppe Dec 8, 2023, 6:07 PM

#

it's a bit unfair. One reviewer basically said, oh CFG exists in CV so you are not novel

hearty flicker Dec 8, 2023, 6:07 PM

#

Yeah but that's the entire point

quasi steppe Dec 8, 2023, 6:07 PM

#

We were like, that was the whole point lol

hearty flicker Dec 8, 2023, 6:07 PM

#

That's so annoying

quasi steppe Dec 8, 2023, 6:07 PM

#

and this guy never responded to our rebuttal.
Pretty sure the 2x 5's didn't spend more than 10min on our paper. The 6 guy raised a lot of good points. I really think if we can get qualified experts to read it, we should at least get overall 6

hearty flicker Dec 8, 2023, 6:08 PM

#

Pretty annoying reason to get a rejection

timber talon Dec 9, 2023, 1:32 AM

#

extremely annoying, yeah

#

@quasi steppe what's that secret part of your website again where you have all the generations you've been producing?

quasi steppe Dec 9, 2023, 1:35 AM

#

https://honglu.fan/files

#

These are from the giantmidi model. Haven't uploaded the bitmidi generations yet

quasi steppe Dec 9, 2023, 5:06 AM

#

Couldn't get YaRN working. The finetuned checkpoint is gibberish....

#

@hearty flicker Dataset was alright. Trying to run your script.
AttributeError: 'Namespace' object has no attribute 'train_data'. Did you mean: 'train_dir'?
This is probably a typo.
Will fix and try it again tomorrow.

hearty flicker Dec 9, 2023, 12:09 PM

#

Yeah must be a typo

#

Did this error happen when building PretrainingDataset from the cli?

#

Pushed a fix

hearty flicker Dec 9, 2023, 12:23 PM

#

quasi steppe These are from the giantmidi model. Haven't uploaded the bitmidi generations yet

These are nice, are they from a model that you just trained?

quasi steppe Dec 9, 2023, 2:21 PM

#

hearty flicker These are nice, are they from a model that you just trained?

It's the previous 100 epoch giant midi model

hearty flicker Dec 9, 2023, 2:21 PM

#

Ah ok cool

sand nymph Dec 9, 2023, 3:03 PM

#

It looks like we'll be able to get y'all at least 5k A100-hours of dedicated compute from Stability

hearty flicker Dec 9, 2023, 3:16 PM

#

sand nymph It looks like we'll be able to get y'all at least 5k A100-hours of dedicated com...

That's amazing news! Thanks so much for everything: )

quasi steppe Dec 9, 2023, 3:23 PM

#

finetuned YaRN generating a 7000 token sample. Maybe it has a stop token earlier. It's somehow shorter than I expected.
I feel the quality went down quite a bit.

#

oh crap towards the end it was completely gibberish

#

The final loss of YaRN finetuning is on par with the pre-training final loss. It's possible that the model just gamed the perplexity (it's well known that short repeating patterns can game the loss and show an artificially low perplexity) and biased towards those easy patterns that are weird to human.

hearty flicker Dec 9, 2023, 4:21 PM

#

It's quite curious that the loss is on target but the results are not

#

Was anything similar observed when applying YaRN to nlp?

quasi steppe Dec 9, 2023, 8:36 PM

#

It was ok with the limited amount of downstream task experiments. Like passkey is alright, basic completions are alright. But it's hard to find a super long task and it got prohibitively slow to do something fancy

#

For the 2048-8192 extension I remember it was all ok

quasi steppe Dec 10, 2023, 12:49 AM

#

A long range sample coming out of the yarn finetune (same as last night). I swept and tuned the attention scale a little bit. This one seems reasonable at the longer end.

hearty flicker Dec 10, 2023, 1:30 PM

#

This sounds pretty good to me

#

What range of extension was it, 2k->8k?

hearty flicker Dec 10, 2023, 1:45 PM

#

What changes did you have to make to get it working better?

quasi steppe Dec 10, 2023, 3:07 PM

#

hearty flicker What changes did you have to make to get it working better?

the attention weight temperature. Also fixed the code base quite a bit. There used to be a few very confusing names from the original yarn repo that got me confused again when setting the params.

hearty flicker Dec 10, 2023, 3:32 PM

#

Super excited about that

#

This sample is really good for how long it is

quasi steppe Dec 10, 2023, 3:57 PM

#

@narrow sorrel Want to share your demo video here so that people can see? It's just a thought. If not, don't worry.

#

Also, got a new YaRN finetune checkpoint using new mscale parameter, a lower LR and a longer initial warmup. Here is an 8000-token sample.
(is a bit weird towards the end....)

narrow sorrel Dec 10, 2023, 6:56 PM

#

quasi steppe @narrow sorrel Want to share your demo video here so that people can see?...

sure! here's the script i made over the weekend to play with aria: https://github.com/EleutherAI/aria/pull/79
thanks Honglu for training the models!

GitHub

feat: add real time playing script by maxreciprocate · Pull Request...

This PR adds a proof of concept script for playing turn-by-turn with aria model.
To try, run the following (currently only works on mac, tested on m1 max):
git clone https://github.com/maxreciproca...

#

hearty flicker Dec 11, 2023, 2:02 PM

#

narrow sorrel sure! here's the script i made over the weekend to play with aria: https://githu...

So cool, I've had this in the back of my mind for a while - will merge this tonight !

hearty flicker Dec 12, 2023, 3:11 PM

#

Hey guys, going to aim to get a pr in for doing supervised finetuning this week

#

If anyone has any experience on doing supervised finetuning with decoder only models, any references would be useful

#

The test example I am planning to implement is key detection

#

Also, expanding on @narrow sorrel's script I'd love to implement some real time I/O in the inference library - maybe next week

timber talon Dec 13, 2023, 5:58 AM

#

narrow sorrel

that is soo cool!!! what midi keyboard are you using, btw?

timber talon Dec 13, 2023, 5:59 AM

#

hearty flicker If anyone has any experience on doing supervised finetuning with decoder only mo...

hey @hearty flicker I've found the huggingface docs on AutoModelForCausalLM to be super straightforward

#

By SFT, I assume you mean just basic SFT, or continued pretraining. This is a great script:

https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py

In addition to SFT, there's another direction to go in, which is low-rank approximations. this PEFT library: https://github.com/huggingface/peft implements a lot of the standard LoRA approaches. These are potentially cool because you can maybe mix/match LoRAs. The Adapter library also claims to be a similar standardized repo: https://github.com/adapter-hub/adapters

#

i hope i'm answering your question and not just spouting stuff you already know — in general, anything FT-related is very, very standardized by now

hearty flicker Dec 14, 2023, 5:57 PM

#

Ok guys, the new tokenizer is implemented and the initial results are very promising - the timing problems seem to be mostly fixed

#

Running a proper test to get a better sense for the differences...

hearty flicker Dec 14, 2023, 7:03 PM

#

Here is an unprompted piece in the style of Chopin

#

New tokenizer seems to help with long term structure too

#

It's crazy to think these samples come from a model trained on the same testing dataset that I first used 5 months ago... sounds so different -- lots of progress

tiny coral Dec 14, 2023, 8:58 PM

#

Those samples sound incredible. What are the key differences between this new tokenizer and before?

quasi steppe Dec 14, 2023, 9:17 PM

#

tiny coral Those samples sound incredible. What are the key differences between this new to...

Before there was a wait token to wait for certain period of time. Now it encodes the absolute timing in a period of time I think. Haven't poked into it yet and @hearty flicker can tell us more

hearty flicker Dec 14, 2023, 9:48 PM

#

Yeah I got the idea randomly from a paper about automatic music transcription that came out last year

#

Will give a better update tomorrow... I'm training a model on a larger classical dataset overnight too - I'll update the notebook to use it

narrow sorrel Dec 14, 2023, 10:50 PM

#

timber talon that is soo cool!!! what midi keyboard are you using, btw?

that's a Seaboard, also it supports MPE (i think you mentioned it sometime ago) https://youtu.be/6SCug5kUsBs

YouTube

ROLI

Seaboard Block: Super Powered Keyboard

Seaboard BLOCK M is here! Order now: https://roli.com/products/blocks/seaboard-block-m

Make music your superpower with the most compact, portable, and affordable Seaboard ever made. It fits in a backpack so you can play it anywhere. Or use it on your desktop with ROLI Studio or Dashboard for third-party DAW integration with LIVE 11 or Logic.
...


hearty flicker Dec 15, 2023, 1:48 PM

#

Here is an unprompted piece in the style of Mozart @here

#

Quite simple but it definitely sounds like music and has the

hearty flicker Dec 15, 2023, 2:33 PM

#

This is also quite nice, it's super good at doing stuff like this

#

Creatively missing something but is nailing most 'musical' aspects of this continuation (first 20secs is prompted)

#

It's weird because the original piece honestly makes me feel very emotional, but this makes me feel nothing...

hearty flicker Dec 15, 2023, 3:18 PM

#

Here is the updated notebook if anyone is interested - https://colab.research.google.com/drive/1SmwmsSf92Bv30algvZ-D4rW8dtH0kJNL?usp=sharing

sand nymph Dec 15, 2023, 8:44 PM

#

@karmic skiff

#

@hearty flicker @quasi steppe reminder to go talk w/ Stability on their slack to get compute approval

quasi steppe Dec 17, 2023, 7:33 AM

#

urgh wrote a live player but cpu inference is not fast enough to keep up with playing...
Only medium model can work on my shitty laptop

quasi steppe Dec 17, 2023, 7:42 AM

#

hearty flicker

Forgot to ask, how many epochs have you trained to get those?

quasi steppe Dec 17, 2023, 8:48 AM

#

Implemented rolling window. We can have very very long music now, at the cost of forgetting.
On my computer I can only continuously stream 400 without catching up 😂

hearty flicker Dec 17, 2023, 7:18 PM

#

quasi steppe Forgot to ask, how many epochs have you trained to get those?

This was from 150 epochs I think? Val loss never bottomed out during that training run

hearty flicker Dec 18, 2023, 11:40 AM

#

unconditioned Chopin

timber talon Dec 19, 2023, 6:37 AM

#

quasi steppe Implemented rolling window. We can have very very long music now, at the cost of...

rolling window meaning rope?

quasi steppe Dec 19, 2023, 6:42 AM

#

timber talon rolling window meaning rope?

no, I mean applying a rolling window on kv-cache

#

basically it means if we generate all the way to 8192 with a window of 4096, the last token is generated as if the whole context is 4096 with the first half being removed

#

this is a cheap way to mitigate the context window limit and usually it's kinda okay. The model just forgets, but a good model should generate something that feels smooth. Here somehow it degenerates.
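
A toy sketch of the rolling window idea, for illustration only. The `RollingKVCache` class is hypothetical and not the code in the repo; a real implementation stores per-layer key/value tensors and also has to decide how positions are handled once old entries are dropped.

```python
from collections import deque


class RollingKVCache:
    """Keeps only the most recent `window` (key, value) entries."""

    def __init__(self, window):
        self.window = window
        # With maxlen set, appending to a full deque drops the oldest entry.
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        self.entries.append((key, value))


# Generate all the way to 8192 positions with a window of 4096.
cache = RollingKVCache(window=4096)
for position in range(8192):
    cache.append(position, position)  # stand-ins for real key/value tensors
```

After the loop only the second half of the sequence is still in the cache, which is the "first half being removed" behaviour described above.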

timber talon Dec 19, 2023, 10:59 PM

#

ohhh i see

#

thanks for clarifying!

#

Idk if you saw the Phi-2 lm that microsoft just released — https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ pretty cool stuff

Microsoft Research

Alyssa Hughes

Phi-2: The surprising power of small language models

Phi-2 is now accessible on the Azure model catalog. Its compact size and new innovations in model scaling and training data curation make it ideal for exploration around mechanistic interpretability, safety improvements, and fine-tuning experimentation on a variety of tasks.

#

their two takeaways in particular — about data quality as well as scaling up model training, are cool. We're already trying to get high-quality data (and in general, music data, which is usually a performance of music by a famous composer, is probably higher-quality than random text data we find on the web?). W.r.t the second point, about scaling up model training, I know there was discussion in the #research channel a few weeks ago about scaling up language models. Makes me wonder if we should think about some of these smaller models you are training as initialization-points for a larger aria model

narrow sorrel Dec 20, 2023, 12:39 PM

#

hearty flicker unconditioned Chopin

what do you mean by "unconditioned" here?

hearty flicker Dec 20, 2023, 1:31 PM

#

narrow sorrel what do you mean by "unconditioned" here?

As if you used gpt3 without any prompt apart from the <S> token

#

Actually in this case there is a composer token too specifying that the composer is Chopin

hearty flicker Dec 22, 2023, 2:16 PM

#

I tried out fine-tuning on some jazz music (about 400 tracks of jazz recordings)

#

Unfortunately the fine-tuning data is pretty poor quality, however the transfer learning is definitely working which is cool !

#

Here is a jazz version of Moon River

#

To get a sense of the quality of the fine-tuning data you can listen to the first 30 seconds or so of this mp3 - this is the prompt

#

@here

#

This only cost 1 A100 hour to fine-tune. It would be amazing to get access to some higher quality fine-tuning datasets

quasi steppe Dec 24, 2023, 5:18 PM

#

haha finally got a generation almost identical to the original (except towards the end). The bitmidi 100x epochs model actually memorizes jingle bell pretty well

quasi steppe Dec 24, 2023, 5:36 PM

#

this one shows pretty clearly how rolling window fails... a few bad samples accumulate into nonsense and once out of distribution, it can't find its way back

timber talon Dec 26, 2023, 11:00 PM

#

there are some interesting works in long-form story-telling, for NLP, coming out. this one in particular: https://arxiv.org/pdf/2311.15208.pdf

#

ending a music generation is always really hard, it always trails away until you artificially stop it

#

what struck me about this work was they introduced the concept of "ending" a narrative, and developed metrics to track what a narrative ending was

#

this story-generation work is pretty cool, how it uses a separate classifier (BERT-tiny) to toggle between local context and global context. Maybe that can be applied to what we're doing with CFG

sand nymph Jan 2, 2024, 6:25 PM

#

@hearty flicker @timber talon @quasi steppe Did y'all fill out the JIRA research grant request for Stability to get project allocation for aria?

timber talon Jan 2, 2024, 6:26 PM

#

hello stella and happy new years/holidays! I think i missed the discussion of this grant — i'm still waiting on SAI approval for slack. Did the discussion occur there? Anyway, I am happy to help with grant writing

#

if it occurred here and i just missed it, my apologies — I'll touch base with @quasi steppe and @hearty flicker

sand nymph Jan 2, 2024, 6:30 PM

#

Ah, Honglu and Louis got approved. I'll go nag about getting you in

#

The discussion was basically "this looks cool, submit a request"

#

There's an overview of what they're looking for here

#

The actual submission is submitted here

hearty flicker Jan 2, 2024, 7:07 PM

#

@sand nymph I submitted it before Christmas, Zach said that they were on a winter break but will look at it after they get back.

hearty flicker Jan 2, 2024, 7:10 PM

#

sand nymph There's an overview of what they're looking for [here](https://stabilityai.notio...

I didn't see this before I submitted! I can bolster the application and resubmit if it's a problem : )

sand nymph Jan 2, 2024, 7:11 PM

#

@hearty flicker Great

hearty flicker Jan 5, 2024, 4:51 PM

#

Here are two nice continuations of the first movement of Beethoven's moonlight sonata !

#

hearty flicker Jan 8, 2024, 3:22 PM

#

I also thought this was quite pretty. Any Aphex Twin fans @here? I rendered the piano in ableton to try to make it sound more like the original

#

Here is the original for reference https://www.youtube.com/watch?v=-LgYzva-xq8

timber talon Jan 9, 2024, 5:56 AM

#

wow ^ that's amazing! how it keeps the same rhythmic patterns throughout

hearty flicker Jan 9, 2024, 2:45 PM

#

timber talon wow ^ that's amazing! how it keeps the same rhythmic patterns throughout

The new tokenizer seems to have (anecdotally) solved these sorts of problems : )

sand nymph Jan 11, 2024, 5:02 AM

#

https://arxiv.org/abs/2401.04577

arXiv.org

Masked Audio Generation using a Single Non-Autoregressive Transformer

We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construc...

hearty flicker Jan 16, 2024, 8:28 PM

#

The dataset building me and Alex have been working on has gone incredibly well, just in time too

#

This is the product of an audio recording that has been turned into a symbolic form (think sheet music) via a neural net (similar to OCR in nlp), and then rerendered into audio using software

#

So essentially the data cap on our model is all audio piano recordings that we can obtain

#

Will be exciting to keep working on this after the ICML deadline : )

sand nymph Jan 18, 2024, 2:34 PM

#

I have gone and fought the monsters and come back victorious: you should have a compute authorization to train these models on the Stability AI cluster

timber talon Jan 18, 2024, 3:47 PM

#

hearty flicker The dataset building me and Alex have been working on has gone incredibly well, ...

does anyone have piano albums on Spotify they like? (besides the big names — Chopin, Mozart, Schubert, etc., we already have a lot of them)

hearty flicker Jan 18, 2024, 3:48 PM

#

sand nymph I have gone and fought the monsters and come back victorious: you should have a ...

!!!

#

Running this pretraining this evening in that case

hearty flicker Jan 18, 2024, 3:49 PM

#

sand nymph I have gone and fought the monsters and come back victorious: you should have a ...

Will the budget be under its own slurm account?

sand nymph Jan 18, 2024, 4:08 PM

#

hearty flicker Will the budget be under its own slurm account?

Because permissions are wonky I can't see your JIRA ticket. If you send me a screenshot of what you submitted I can answer that question though xD

hearty flicker Jan 23, 2024, 6:44 PM

#

Pretraining is definitely working. The optimal sampling parameters seem to be different than before, especially top_p
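
For anyone unfamiliar with top_p, here is a minimal sketch of nucleus (top-p) sampling in plain Python. It is not aria's sampling code; the function names and the toy distribution are made up for illustration.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    filtered = [0.0] * len(probs)
    for idx, p in kept:
        filtered[idx] = p / total
    return filtered

def sample_top_p(probs, top_p, rng=random):
    """Draw one token index from the truncated distribution."""
    filtered = top_p_filter(probs, top_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Toy next-token distribution over a 4-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))  # the least likely token is cut off
```

Lower top_p values cut off more of the tail, which usually gives safer but less varied continuations.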

hearty flicker Jan 23, 2024, 8:48 PM

#

timber talon Jan 24, 2024, 6:01 PM

#

hearty flicker

the slowing down and cadence at ~1:20 to end the phrase is pretty crazy

hearty flicker Jan 24, 2024, 6:02 PM

#

crazy good or crazy bad haha

#

I've been playing around with the style transfer stuff and it's super interesting. This is supposed to be a liszt inspired continuation of a chopin nocturne

#

A bit all over the place structurally, but sounds a bit like liszt to me

timber talon Jan 24, 2024, 6:16 PM

#

hearty flicker crazy good or crazy bad haha

i think it's pretty great — from a performance perspective, the ritardando is pretty well executed

#

add that to the musical coherence of the cadence

#

all in all, pretty impressive

hearty flicker Jan 24, 2024, 6:16 PM

#

This is actually from a half trained version of the smallest model

#

Am super excited to see what the large one is like

timber talon Jan 24, 2024, 6:19 PM

#

hearty flicker A bit all over the place structurally, but sounds a bit like liszt to me

lol crazy! and it keeps a pretty consistent bass line throughout. How are you doing this, are you doing the CFG mashup thing that @h used?

hearty flicker Jan 24, 2024, 6:20 PM

#

timber talon lol crazy! and it keeps a pretty consistent bass line throughout. How are you d...

Oh wait, yeah I just gave it a chopin nocturne and then changed the composer meta-token to liszt and the form meta-token to nocturne

#

So no dynamic cfg stuff really
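
A rough sketch of the meta-token swap described above, assuming the prompt is a flat token list that starts with metadata tokens. The `("composer", ...)` and `("form", ...)` tuples, the `<S>` marker, and the note placeholders are hypothetical stand-ins, not aria's actual token format.

```python
def swap_meta_tokens(tokens, overrides):
    """Return a copy of `tokens` where any (key, value) metadata token
    whose key appears in `overrides` gets the overriding value."""
    swapped = []
    for tok in tokens:
        if isinstance(tok, tuple) and len(tok) == 2 and tok[0] in overrides:
            swapped.append((tok[0], overrides[tok[0]]))
        else:
            swapped.append(tok)
    return swapped

# Hypothetical tokenized prompt: metadata first, then the notes of the nocturne.
prompt = [("composer", "chopin"), ("form", "nocturne"), "<S>", "note_a", "note_b"]
styled = swap_meta_tokens(prompt, {"composer": "liszt", "form": "nocturne"})
print(styled[:2])
```

The note tokens are left untouched, so the model continues the same music but is conditioned on different metadata.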

timber talon Jan 24, 2024, 6:21 PM

#

ohhh cool!

quasi steppe Jan 24, 2024, 6:26 PM

#

yeah this meta-token approach is definitely more natural.

#

I still don't know if dynamic cfg trick works better for a better base model or not. Maybe I need to try it out. The way I think of it is that it might be suitable for more complex guidance (like we want something to get somewhere at certain exact time, or mixing more than 2 styles)

hearty flicker Jan 24, 2024, 6:28 PM

#

cfg definitely is working amazing

#

These samples are normally produced with a cfg of 1.05 to 1.2
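
For reference, a minimal sketch of how a cfg scale in that range acts on next-token logits, assuming the usual classifier-free guidance formula (unconditional logits plus scale times the conditional-minus-unconditional difference). How aria builds the unconditional pass is not stated here, so the toy logits below are purely illustrative.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance: push the conditional logits
    further away from the unconditional ones by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 3-token vocabulary (made-up numbers).
cond = [2.0, 1.0, 0.0]      # with conditioning (e.g. metadata tokens)
uncond = [1.0, 1.0, 1.0]    # without conditioning
guided = cfg_logits(cond, uncond, 1.2)
print(softmax(cond)[0], softmax(guided)[0])
```

A scale of 1.0 reproduces the conditional logits exactly; values slightly above 1 sharpen the distribution toward what the conditioning favours.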

quasi steppe Jan 26, 2024, 8:57 AM

#

Our YaRN finetuning runs on MAESTRO (50 epochs, 16k context length) have also finished (2 larger runs remain but will be done soon).
OMG it's working sooooo well! I'm super excited!!! I mean, not as high quality as the original MAESTRO midis but I've never gotten such a coherent one before.
A little weird in the middle and I thought it's gonna go brain-dead like before, but it snapped out of it.

hearty flicker Jan 26, 2024, 7:32 PM

#

Some completely improvised jazz

sand nymph Jan 26, 2024, 8:09 PM

#

@hearty flicker you had pitched ICML as a target venue. The deadline is in six days which is doable but tight. Is that still the goal?

Have you started writing the paper already? I would 100% start now while the models still run.

hearty flicker Jan 26, 2024, 8:10 PM

#

I'm writing as we speak !

quasi steppe Jan 30, 2024, 8:14 AM

#

which one do you guys think is of higher quality? (first ~30-40 sec are the same)
goose11 for first
goose16 for second

(urgh need to render into mp3 first next time)

quasi steppe Jan 30, 2024, 10:33 AM

#

This one as well.
Anyone wants to vote?

sharp quiver Jan 30, 2024, 4:38 PM

#

quasi steppe which one do you guys think is of higher quality? (first ~30-40 sec are the same...

sharp quiver Jan 30, 2024, 4:40 PM

#

quasi steppe This one as well. Anyone wants to vote?

hearty flicker Jan 30, 2024, 10:23 PM

#

Hey @here, please take this survey if you are interested !

#

http://survey.loubbrad.com:8501/

sand nymph Feb 5, 2024, 5:56 PM

#

I'm doing a post-submission editing pass on the paper, including flagging areas where I think more substantive experimentation would be valuable.

@hearty flicker: do you think we could get some music professionals to give feedback on the music we generate?

hearty flicker Feb 5, 2024, 5:56 PM

#

That was one of the aims initially, but we didn't have time

#

I was in the process of reaching out to various people who are experts in a certain form of music that is simultaneously quite 'rule based' whilst also being very free. The model excels in creating this type of music, and I was curious about what a musicologist in this area would think.

#

I would like to have a section on this topic in the arxiv preprint

sand nymph Feb 5, 2024, 5:59 PM

#

Great

hearty flicker Feb 5, 2024, 6:02 PM

#

I would really like to have an extensive experiments and evaluation section in the preprint

#

We also didn't get to do many experiments with finetuning/alignment which is super important

sand nymph Feb 5, 2024, 6:03 PM

#

I strongly recommend not using -small -med and large labels and instead labeling models by their actual size

hearty flicker Feb 5, 2024, 6:04 PM

#

hearty flicker I was in the process of reaching out to various people who are experts in a cert...

This is the style of music that it does well at - https://www.youtube.com/watch?v=7Uv71oX4wG8 (original - https://www.youtube.com/watch?v=W5lOLZsjOp8)

hearty flicker Feb 5, 2024, 6:04 PM

#

sand nymph I strongly recommend not using `-small` `-med` and `large` labels and instead la...

noted

sand nymph Feb 5, 2024, 6:11 PM

#

Are the %-ages in the dataset section the amount of the actual pretraining data?

#

Why was dropout used @hearty flicker?

#

/ was it? I only see one passing reference

hearty flicker Feb 5, 2024, 6:14 PM

#

Dropout was used during training

sand nymph Feb 5, 2024, 6:15 PM

#

That's definitely something we need to disclose given how non-standard it is

hearty flicker Feb 5, 2024, 6:15 PM

#

I thought it was standard, but have realised now there are some issues with the training process we should have done differently

#

I thought it was standard for some reason

#

We also used GELU, which we forgot

sand nymph Feb 5, 2024, 6:16 PM

#

You specifically said otherwise in the paper

hearty flicker Feb 5, 2024, 6:16 PM

#

sand nymph You specifically said otherwise in the paper

What do you mean?

#

We used dropout during training, which is why the train loss is higher than the val loss

sand nymph Feb 5, 2024, 6:17 PM

#

Oh that was about GELU

hearty flicker Feb 5, 2024, 6:17 PM

#

Oh sorry, I meant that we used GELU instead of SwiGLU or whatever the modern activation function is

#

We stated it accurately in the paper, I just wish we trained without dropout and with a better activation function

#

Not that it matters so so much, but still

sand nymph Feb 5, 2024, 6:18 PM

#

You said you used the architecture from LLaMA. I'm currently changing that to be correct.

hearty flicker Feb 5, 2024, 6:20 PM

#

Oh I meant to say inspired by, the only difference is the activation function and dropout.

sand nymph Feb 5, 2024, 6:20 PM

#

Is this the correct config file? https://github.com/EleutherAI/aria/blob/main/config/models/large-wide.json

hearty flicker Feb 5, 2024, 6:20 PM

#

Yes

#

That's for the largest model, we also trained medium.json and small.json

sand nymph Feb 5, 2024, 6:21 PM

#

How much of the SAI grant was used for these three models?

hearty flicker Feb 5, 2024, 6:22 PM

#

I think that training large was roughly 2000 hours

#

And the others were roughly 1000 hours together

#

So we have quite a lot left, I think

sand nymph Feb 5, 2024, 6:23 PM

#

Cool, so we can do some ablations then 🙂

#

and/or some larger models

hearty flicker Feb 5, 2024, 6:24 PM

#

I was originally going to do that too, but ran out of time too

sand nymph Feb 5, 2024, 6:24 PM

#

No worries

hearty flicker Feb 5, 2024, 6:24 PM

#

Like training on all the data vs only a high quality subset

#

And training deep arch vs wide arch

#

Since MuseNet (openai) used 100+ layers for some reason

#

I did train a deep version of 'large' briefly, but after 5 epochs it was quite clear it was learning far slower, so I stopped it

#

Honestly these models (small and medium at least) are not that expensive to train, and fine-tuning is basically free

#

Lots of opportunity for ablation

#

I also have a lot of evals around the metadata and controllability stuff, about 2 days before the deadline someone reached out to me regarding this

#

They had basically judged the outputs of our model, using a music listening model (audio)

#

And found that for samples generated with a genre meta-data tag (e.g. jazz), the audio model said the same

sand nymph Feb 5, 2024, 6:41 PM

#

That's cool

timber talon Feb 5, 2024, 8:29 PM

#

sand nymph I'm doing a post-submission editing pass on the paper, including flagging areas ...

i am a music professional in the sense that I went to Juilliard for classical music performance, and also got paid for performing 🤷‍♂️ and my evals are in there lol

#

it's not my current field though, and I'm not deep into the musicological/theoretical side

#

it's actually a pity we didn't call that out in the paper — "one of our annotators was X". it would've given more credibility

sand nymph Feb 5, 2024, 8:36 PM

#

timber talon i am a music professional in the sense that I went to Juilliard for classical mu...

That's awesome! I didn't know that.

timber talon Feb 5, 2024, 8:39 PM

#

it was another lifetime, i also dropped out after 2 years — competing against hundreds of other musicians for the opportunity to play the same few classical music songs for a bunch of old white people wasn't a very appealing career path, long-term. but anyway, i think we're all in agreement about needing more different types of evals though!

sand nymph Feb 5, 2024, 8:41 PM

#

timber talon it's actually a pity we didn't call that out in the paper — "one of our annotato...

If we had systematically collected this info then yes, but it might vibe weird to say "BTW one of our evaluators did X"

timber talon Feb 5, 2024, 8:42 PM

#

ah yeah, agreed that makes sense

hearty flicker Feb 5, 2024, 8:44 PM

#

Alex's insight was pretty handy while we were writing

#

I have a meeting tomorrow with my supervisor, he also mentioned this idea. He might have some people in mind

hearty flicker Feb 8, 2024, 3:14 PM

#

Hey @here! We pretrained three models over the last couple of weeks, and it went well! I wanted to lay out my ideas for future directions, which mostly revolve around finetuning/alignment efforts. The pretrained models are quite powerful; however, we'd like to release models that are better aligned and easier to use.

Aligning these models is an interesting problem: there isn’t really data available in the equivalent of a question-answer format. We need to find other ways to align these models. The two main issues are:

1. The model acts as a continuation generator. If you give it a prompt that sounds mediocre or boring for whatever reason, it will generate a continuation in the same style. Just like aligned LLMs, I would like the continuation to be high quality no matter the quality of the prompt.
2. The most effective method for controllability is just giving the model a very high-quality prompt. This results in amazing continuations (seriously amazing) but limits how you can use the pretrained models. I want to improve both static (such as predefined genres/styles) and dynamic forms of controllability.

My ideas for how to solve these problems:

1. During finetuning, separate the prompt from the continuation using a <SEP> token. When training, randomly add different amounts of noise (degradation) to the prompt. I think this might help with both (1) and (2) as it implicitly separates the prompt from the continuation. At inference time, you could specify how much ‘noise’ the prompt has, so that the model knows how closely to treat it as ground truth.
2. By using a listening model (audio classifier), my supervisor has found that it’s possible to tag different moments in the training data (MIDI) with tags like ‘sad’, ‘slow’, ‘jazzy’, etc. These tags actually work very well, and I want to incorporate them in a similar fashion to the ‘diminish’ token <D>, which causes the piece to end 5-10secs from when it is seen.
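
A minimal sketch of the <SEP> idea with a degraded prompt, under several assumptions: notes are simplified to `(onset_ms, pitch, velocity)` tuples, the degradations (dropping notes, jittering onsets) and their magnitudes are invented for illustration, and the `<NOISE_x>` tag is a hypothetical way of telling the model how noisy the prompt is.

```python
import random

def degrade_prompt(notes, noise, rng):
    """Hypothetical degradation: drop some notes and jitter onsets,
    both scaled by `noise` in [0, 1]."""
    degraded = []
    for onset_ms, pitch, velocity in notes:
        if rng.random() < 0.2 * noise:          # drop up to 20% of notes
            continue
        jitter = rng.randint(-50, 50) * noise   # up to +/-50 ms at noise=1
        degraded.append((max(0, round(onset_ms + jitter)), pitch, velocity))
    return sorted(degraded)

def build_example(notes, split, rng):
    """Noisy prompt, then <SEP>, then the clean continuation as the target."""
    noise = rng.choice([0.0, 0.25, 0.5, 1.0])
    prompt = degrade_prompt(notes[:split], noise, rng)
    continuation = notes[split:]
    return [f"<NOISE_{noise}>"] + prompt + ["<SEP>"] + continuation

notes = [(0, 60, 80), (500, 64, 70), (1000, 67, 75), (1500, 72, 90)]
example = build_example(notes, 2, random.Random(0))
print(example)
```

Only the prompt side is corrupted, so the continuation the model learns to produce stays clean regardless of prompt quality.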

#

I really want to incorporate RLHF in the alignment process. This may be key for improving the consistency of unprompted samples. This feels like a big thing to take on, but with recent improvements (DPO) I think it could be possible to incorporate. I don't have a lot of experience with RLHF, but it's an area I'm personally interested in.

After we have researched solving these problems, I’d like to publish a full-length paper on arXiv expanding on the preprint that we have just submitted. This would also be a good point to publicly release the models and publish a blog post.

Secondary to this, @timber talon and I are currently researching AMT (audio -> MIDI conversion) and aim to get a model, dataset, and paper ready for ISMIR 2024 (deadline mid-April). The only bottleneck on the size of the dataset is the number of audio piano recordings that we can obtain. If anyone has any idea of how to acquire solo piano recordings in bulk, please let me know! We have good methods for pruning out non-solo piano recordings, so the data source doesn't need to be 100% clean. I had one idea for a crowdsourcing project where people contribute YouTube links of solo-piano music. We can use YouTube’s API and the AMT model we are working on to convert these to MIDI files.

We'd love to hear any feedback. Although the pretrained models are not publicly released yet, if anyone would like access - let me know and I'll dm you a (checkpoint) download link. Also, if anyone is interested in getting more involved (on a co-author level), dm me!

karmic skiff Feb 8, 2024, 6:22 PM

#

@hearty flicker still need feedback from classical music experts? i know someone who might be interested

sand nymph Feb 8, 2024, 6:26 PM

#

karmic skiff @hearty flicker still need feedback from classical music experts? i know s...

Yes

hearty flicker Feb 8, 2024, 6:30 PM

#

karmic skiff @hearty flicker still need feedback from classical music experts? i know s...

Yes yes! I would like to have a subsection in the arXiv paper evaluating these models from a musicological perspective.

#

Still working on a way to frame it exactly

#

As well as discord, my twitter dms are always open (https://twitter.com/loubbrad) and my email is [email protected]

hearty flicker Feb 8, 2024, 11:11 PM

#

Really cool site, I didn't know about this

#

Might be good to add to the pretraining dataset

#

I can finetune the model on the MIDIs from this website if you like

#

Or if you have compute resources yourself I can walk you through how to do it

timber talon Feb 8, 2024, 11:24 PM

#

@hearty flicker and I have talked about this a bunch but i think IMO the key here is that "alignment" in music prompting is to recover from unintentionally messy/mistake-filled inputs. It's not for recovering from atonal or otherwise intentionally stylistic-nonnormative inputs.

Kinda a tough needle to thread because, as compared with the language domain, music doesn't have the same notion of "toxicity" or "not in accordance to human values". And we don't want to give the impression that we are being Western-biased

#

@hearty flicker a question about the transcription — for the AMT paper, is the idea to stick with piano transcription, or to extend it to other instruments? That would potentially really widen what kinds of music we have access to — one of the things in collecting the dataset for the last paper was that piano music, especially in non-Western styles, was definitely more limited than other kinds of instrumental music

hearty flicker Feb 8, 2024, 11:25 PM

#

It's a pretty interesting topic to research

#

There is some research on multi-track transcription, there was a paper from google in 2023 or 2022 if I remember, I'll dig it up

#

2021 actually - https://arxiv.org/abs/2111.03017

#

Solid multi-track transcription is the dream. I'd settle for SOTA piano-only (I think we can do this too)

timber talon Feb 9, 2024, 1:28 AM

#

does multitrack apply to piano solo as well, since there are multiple voices?

#

I feel like if we take the kinda a Whisper-for-AMT approach we were talking about —
like, start with MIDI and generate tons and tons of augmented audio data using fluidsynth, and then try to learn the reverse audio -> MIDI mapping,
can't we also just take piano MIDI and also augment it into many different instruments?
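
A small sketch of the render side of that idea, assuming the fluidsynth command-line tool and a few SoundFont files are available locally. The file names are placeholders, and only the gain and SoundFont are varied here; reverb or other augmentation would be a separate step.

```python
import random
import subprocess

def render_command(midi_path, wav_path, soundfont, sample_rate=44100, gain=0.5):
    """Build a fluidsynth command that renders a MIDI file to a WAV file
    (-n: no MIDI input, -i: no shell, -F: output file, -r: rate, -g: gain)."""
    return ["fluidsynth", "-ni", "-F", str(wav_path), "-r", str(sample_rate),
            "-g", str(gain), str(soundfont), str(midi_path)]

def augmented_commands(midi_path, soundfonts, n, rng):
    """Create n render commands for one MIDI file, each with a randomly
    chosen SoundFont and gain, so one MIDI yields several audio variants."""
    cmds = []
    for i in range(n):
        soundfont = rng.choice(soundfonts)
        gain = round(rng.uniform(0.3, 1.0), 2)
        cmds.append(render_command(midi_path, f"{midi_path}.{i}.wav", soundfont, gain=gain))
    return cmds

def render_all(cmds):
    """Run the commands; requires fluidsynth to be installed."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)

# Placeholder paths, for illustration only.
cmds = augmented_commands("piece.mid", ["piano_a.sf2", "piano_b.sf2"], 3, random.Random(0))
print(cmds[0])
```

Each (audio, MIDI) pair produced this way is a training example for the reverse audio -> MIDI mapping.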

hearty flicker Feb 9, 2024, 1:58 PM

#

I wonder how well this would work, definitely worth looking into

#

We probably have to find a good way to automatically render the MIDIs in a realistic sounding way

#

Most DAWs don't have a cli or api to do this quickly, so we'd probably be stuck using something like fluidsynth again

#

If anyone has any references for aligning / finetuning programming llm models, send them my way!

#

I think it's the closest thing to what we are trying to do here

timber talon Feb 10, 2024, 2:43 AM

#

hearty flicker Most DAWs don't have a cli or api to do this quickly, so we'd probably be stuck ...

fluidsynth plus lots of augmentation/reverb/etc.? covering a wider input distribution domain might be just as good as finding something realistic

hearty flicker Feb 10, 2024, 2:14 PM

#

Yeah I think that would be the best bet, but I'm not sure how well it would work

#

We'll give it a try, I've been meaning to read that paper I linked anyway - I wonder how they did it

hearty flicker Feb 11, 2024, 4:10 PM

#

Thought these were quite nice

#

rustic dirge Feb 16, 2024, 10:37 PM

#

@hearty flicker would you like to try RWKV for this 🙂 i found it generates better midi than transformers

hearty flicker Feb 17, 2024, 2:37 PM

#

rustic dirge @hearty flicker would you like to try RWKV for this 🙂 i found it generate...

Hey! The models we trained last month are large enough to actually overfit the training data given enough epochs

#

So I'm not actively looking into architectural changes in that direction, if that makes sense

#

Right now I'm mostly concerned with alignment as well as data related stuff

#

It would be pretty easy to try out training RWKV code wise, it just would take a lot of compute haha. The largest model we (pre-)trained last month was roughly 2500 GPU hours, so not that cheap unfortunately

rustic dirge Feb 17, 2024, 2:45 PM

#

hearty flicker So I'm not actively looking into architectural changes in that direction, if tha...

i can train one to compare with your result if the training data is available 🙂

hearty flicker Feb 17, 2024, 2:48 PM

#

Do you normally train models on SAI's research cluster? I have the data on there already

rustic dirge Feb 17, 2024, 3:04 PM

#

i am on another cluster. can use "croc" to send it to me

rustic dirge Feb 19, 2024, 3:12 PM

#

https://twitter.com/BlinkDL_AI/status/1759596571316883480

BlinkDL (@BlinkDL_AI) on X

100% composed by RWKV-6 120M params MIDI model🎶Still takes multiple trials for such high quality outputs, but I will fix this🙂

hearty flicker Feb 19, 2024, 7:47 PM

#

You should link to the repo so we can get stars!

#

Very cool work btw

quasi steppe Feb 19, 2024, 8:00 PM

#

we could actually add some comparison between transformer and RWKV to the paper

hearty flicker Feb 19, 2024, 8:02 PM

#

Only thing I can't share publicly is data

#

I'm sure RWKV will do amazing on the full dataset, only issue is that I can't take it off SAI for obvious reasons

#

Any architecture good enough for language will be good enough for MIDI

#

Btw Blink trained this with aria's midi tokenizer / tool chain : )

quasi steppe Feb 19, 2024, 8:08 PM

#

I assume you don't mean we legally can't take off from SAI? If we really want to open-source data it's possible (but need to notify SAI of course). Just so that you know

hearty flicker Feb 19, 2024, 8:10 PM

#

We can open source most, but not all

#

Basically everything that me and Alex have worked on, we can open source

hearty flicker Feb 19, 2024, 8:11 PM

#

quasi steppe I assume you don't mean we legally can't take off from SAI? If we really want to...

Redistribution of music is legally tricky

#

However for the transcriptions (the stuff Alex and I worked on), there is precedent for redistributing

quasi steppe Feb 19, 2024, 8:14 PM

#

Yeah basically just want to say it's not our NDA that stops it. There can be other constraints of course

hearty flicker Feb 19, 2024, 8:15 PM

#

No it's not the NDA

hearty flicker Feb 19, 2024, 8:33 PM

#

We are aiming to release the aria models in April, I'm currently working on three separate ideas around alignment/scaling/editing that will be done by then I hope

#

@timber talon, @quasi steppe and I are also planning to submit 3 papers to ISMIR in April

#

I reckon overall this project will have a decent impact on the gen-music scene

rustic dirge Feb 19, 2024, 8:42 PM

#

hearty flicker You should link to the repo so we can get stars!

it's still using my midi data & tokenizer

hearty flicker Feb 19, 2024, 8:45 PM

#

Oh damn!

#

Well it sounds great : )

wind viper Feb 22, 2024, 9:16 PM

#

Hi all, I'm very excited to have found this project and will be following its development. I work in the music industry and have been curating MIDI data for many years. I'm interested in learning more about the topics discussed here and, if helpful, sharing some ideas.

Is Aria aimed to be an all-genre foundation model or solely focused on classical music?

sand nymph Feb 22, 2024, 9:25 PM

#

wind viper Hi all, I'm very excited to have found this project and will be following its de...

It's focused on classical music (and mostly piano music) for data reasons but if you have other data it should work on other forms of music

wind viper Feb 22, 2024, 10:09 PM

#

Good to know, although classical is what I collect the least of, with so many large datasets already available.

I'm still scouring through the comments here. In regards to ideas for rendering MIDI, you may want to explore Spotify's pedalboard for Python, which allows rendering with VST3 plugins.

hearty flicker Feb 22, 2024, 10:28 PM

#

wind viper Hi all, I'm very excited to have found this project and will be following its de...

Hey! Multiple people have reached out to me and have already successfully trained/finetuned their own models using the toolchain we have created

#

If that is something you are interested in (since you have your own data), I can guide you through it

#

Generally speaking it's quite easy to do this with the toolchain we have built

rustic dirge Feb 23, 2024, 12:31 AM

#

hearty flicker Well it sounds great : )

currently it's like gacha game, sometimes bad, sometimes great
i think we need RLHF dataset 🙂

hearty flicker Feb 23, 2024, 12:34 AM

#

Scaling and alignment are the key!

#

It's what I'm currently working on myself

narrow sorrel Feb 23, 2024, 3:42 PM

#

@neon hamlet

wind viper Feb 23, 2024, 6:36 PM

#

hearty flicker Generally speaking it's quite easy to do this with the toolchain we have built

Can it be fine-tuned or trained from scratch on a single GPU?

hearty flicker Feb 23, 2024, 8:04 PM

#

wind viper Can it be fine-tuned or trained from scratch on a single GPU?

Can be finetuned with a single GPU easily

wind viper Feb 25, 2024, 11:04 PM

#

hearty flicker Can be finetuned with a single GPU easily

I almost have the small model running. Can it be fine-tuned below max_seq_len? 4096 is quite large for an average GPU.

hearty flicker Feb 25, 2024, 11:46 PM

#

wind viper I almost have the small model running. Can it be fine-tuned below max_seq_len? ...

It should work. Try giving a -l argument of 2048 or 1024 when building the finetuning dataset. It might trigger an assert somewhere in the training script, I haven't tested it myself

#

Which GPU are you using?

wind viper Feb 25, 2024, 11:58 PM

#

I'll test 2048 to see what happens. I'm running on colab. Orange run is 100 classical piano samples and green is 1000. Batch size is 2.

hearty flicker Feb 26, 2024, 12:15 AM

#

If it doesn't trigger an assert you are probably good

#

although you should be able to train using the full context length on colab : )

#

Doesn't a t4 have 16gb of VRAM?

hearty flicker Feb 26, 2024, 12:17 AM

#

wind viper I'll test 2048 to see what happens. I'm running on colab. Orange run is 100 clas...

It's already trained on a lot of classical piano, so if that is what you want to use it for you can just use the pretrained checkpoints

hearty flicker Feb 27, 2024, 1:30 PM

#

This is one of the weirdest training bugs I've ever had to debug

#

I'm 90% sure it's something to do with the optimizer... Training in fp32 makes it a lot better, but still really confusing

wind viper Feb 28, 2024, 4:34 PM

#

I'm fine-tuning on pop piano music. This might take a while if the model has never seen pop melodies.

#

Is there any type of minimum length or note density for the midi data? Should it be accepting files with short durations like 8-16 bars?

hearty flicker Feb 28, 2024, 4:59 PM

#

During pretraining there was some pruning based off of those things

#

You can find them in config/config.json

#

https://github.com/EleutherAI/aria/blob/56b66eacfef72081ea0eaca8184fab4792b54d19/config/config.json#L3

#

The model does support any valid MIDI file though

#

The model has seen about ~50k multitrack pop during pretraining

#

Make sure to compare your finetuning results to the original model, would be interesting : )

obtuse lagoon Feb 28, 2024, 5:06 PM

#

hearty flicker I'm 90% sure it's something to do with the optimizer... Training in fp32 makes i...

I don’t know if you’ve seen this already but this might be a useful discussion : https://stackoverflow.com/questions/58633177/why-theres-a-big-jump-up-of-the-loss-curve-during-the-training

Stack Overflow

Why there's a big jump (up) of the loss curve during the training?

I've been training the exactly same model (with the exactly same training dataset) twice but results are very different, and I got confused about the behavior of their loss curves.

The loss curve...

hearty flicker Feb 28, 2024, 5:07 PM

#

I haven't actually seen such a jump during my experiments

#

The paper will hopefully be on arxiv soon

wind viper Mar 4, 2024, 7:19 PM

#

I don't understand the tokenization process that well, but I'm curious to know how much MIDI information is being compressed. Is 4096 tokens intended to cover an average song length? Also, is it possible to know how many MIDI files the model was trained on?

hearty flicker Mar 4, 2024, 7:50 PM

#

wind viper I don't understand the tokenization process that well, but I'm curious to know h...

I think the pretraining corpus was 144k sequences of length 4096
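
As a back-of-the-envelope check on those numbers, taking "144k" as 144,000 and using the three-tokens-per-note layout described in this thread (and ignoring <T> and metadata tokens, so the note count is an upper bound):

```python
num_sequences = 144_000   # "144k", quoted from memory, so approximate
seq_len = 4096            # tokens per training sequence

total_tokens = num_sequences * seq_len

tokens_per_note = 3       # note token + onset token + duration token
max_notes_per_sequence = seq_len // tokens_per_note

print(total_tokens)            # 589824000
print(max_notes_per_sequence)  # 1365
```

So one 4096-token window holds at most around 1,365 notes, which for dense piano music is a section of a piece, not necessarily a whole one.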

#

If you want a better idea about the tokenisation process, I'd recommend just tokenising some MIDI files and printing it out. You can find some in tests/test_data

#

Or you can read the source code - https://github.com/loubbrad/aria/blob/56b66eacfef72081ea0eaca8184fab4792b54d19/aria/tokenizer/tokenizer.py#L414

#

The main idea is that each note is represented by three tokens: (instrument, pitch, velocity), (onset in ms relative to the last <T> token), (duration in ms)

#

The reason for using <T> is to keep the total vocabulary size under control.
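
A toy illustration of that scheme. This is not the aria tokenizer: the 5000 ms window per <T> token, the `("onset", ...)` and `("dur", ...)` token shapes, and the lack of quantization are all simplifying assumptions.

```python
TIME_STEP_MS = 5000  # hypothetical window length covered by each <T> token

def tokenize(notes):
    """notes: list of (instrument, pitch, velocity, onset_ms, duration_ms),
    sorted by onset. Each note becomes three tokens; a <T> token is emitted
    whenever the onset moves into the next time window."""
    tokens = []
    window_start = 0
    for instrument, pitch, velocity, onset_ms, duration_ms in notes:
        while onset_ms - window_start >= TIME_STEP_MS:
            tokens.append("<T>")
            window_start += TIME_STEP_MS
        tokens.append((instrument, pitch, velocity))
        tokens.append(("onset", onset_ms - window_start))  # relative to last <T>
        tokens.append(("dur", duration_ms))
    return tokens

notes = [
    ("piano", 60, 80, 0, 500),
    ("piano", 64, 70, 4900, 300),
    ("piano", 67, 75, 5200, 1000),
]
tokens = tokenize(notes)
for tok in tokens:
    print(tok)
```

Because onsets are always measured from the most recent <T>, an onset token never needs a value of TIME_STEP_MS or more, which is what bounds the vocabulary.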

#

There is some preprocessing which is applied when building datasets, you can find the source for this here https://github.com/loubbrad/aria/blob/56b66eacfef72081ea0eaca8184fab4792b54d19/aria/data/datasets.py#L246

#

It is fully customisable using the config/config.json file

wind viper Mar 9, 2024, 1:14 AM

#

This was helpful, thanks. I'm testing some runs to see if I can get a tiny model to perform. These are my settings:

{
  "d_model": 384,
  "n_heads": 8,
  "n_layers": 16,
  "ff_mult": 4,
  "drop_p": 0.0,
  "max_seq_len": 2048,
  "grad_checkpoint": true
}
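
For a rough sense of how big that config is, here is a parameter estimate. It assumes a standard decoder block with four d_model x d_model attention matrices and a two-matrix GELU MLP of width ff_mult * d_model, and it leaves out embeddings, norms, and biases, so treat it as a lower-bound sketch and not an exact count for aria.

```python
config = {
    "d_model": 384,
    "n_heads": 8,
    "n_layers": 16,
    "ff_mult": 4,
    "drop_p": 0.0,
    "max_seq_len": 2048,
    "grad_checkpoint": True,
}

def approx_transformer_params(cfg):
    """Approximate weight count of the transformer blocks only."""
    d = cfg["d_model"]
    attn = 4 * d * d                      # Q, K, V and output projections
    mlp = 2 * d * (cfg["ff_mult"] * d)    # up- and down-projection
    return cfg["n_layers"] * (attn + mlp)

print(f"{approx_transformer_params(config):,}")  # 28,311,552
```

Under those assumptions the blocks come to about 28M parameters.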
