#Neuro-Symbolic Music Models


hearty flicker
#

So pickling the tokenizer might be the bottleneck too

quasi steppe
#

oh you pass tokenizer in through pickling? I thought it cannot be pickled. I tried that when writing my own script and got an error

hearty flicker
#

Not sure if there is a good way to share python objects between processes, I'm not super familiar with python mp in general I just threw this together to speed things up

#

Actually it shouldn't be pickling the tokenizer since I'm using partial

quasi steppe
hearty flicker
#

I'll look into it later...

quasi steppe
#

or like I said just have a script to build the dataset in memory directly, then save the 100x epochs to a giant file

hearty flicker
#

Could also index the file, and read from it using mmap in each process separately
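
A minimal sketch of that idea, assuming the dataset is a jsonl file with one entry per line. `build_line_index` and `read_entry` are hypothetical helper names; only the standard `mmap` and `json` modules are used:

```python
import json
import mmap


def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_entry(mm, offsets, i):
    """Decode the i-th JSON line from an mmap'd file."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(mm)
    return json.loads(mm[start:end])
```

Each process would open its own map with `mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)` and share only the list of offsets, so no entry data has to be pickled between processes.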

hearty flicker
quasi steppe
#

I mean we do the whole I/O, tokenization, transform, concat and chunking, encoding in the same process that draws from a queue or just do round-robin. Save the token ids to a file and next time we don't need to do anything

#

if the bottleneck is that we spin up too many processes and pickle the tokenizer too many times, this helps too

#

if the problem is still the json.decode this should be optimal anyway if we use a queue to let processes grab whenever they can.

hearty flicker
#

Oh so, tokenized datasets are built to a file

#

So they only need to be built once and then they can be accessed whenever

#

And accessing them is fast

quasi steppe
#

yeah

#

and put everything in one process to minimize communications

hearty flicker
#

That's how I have it implemented currently actually

#

Unless I'm misunderstanding

quasi steppe
#

you only save the tokenized dataset, not the token ids right?

#

tokenized sequences can do data augmentations which is nice. But I say we do augmentation when building it, and save a static file full of integers

#

or maybe I misunderstood

hearty flicker
#

Could do, although I don't think that there is any CPU bottlenecks during training

#

Fetching, loading and doing data aug takes about 5ms per entry

#

So one cpu core can do like 100-200/s

#

And there is no pickle stuff in the mp, so with 8 or 16 cores it's never gonna be an issue

#

I'm not sure how this scales to multiple nodes though

quasi steppe
#

oh yeah the dataloader itself is not multiprocessing for now... As long as this problem doesn't happen for training it's probably alright

quasi steppe
#

ok it's faster with fixed workers that receive items from a queue. Tokenizer is only pickled once so maybe that's it.
But it's not as fast as I imagine though.
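
The fixed-worker pattern described here can be sketched roughly as follows. This is an illustration under assumptions, not the code from the branch: `worker` and `run_fixed_workers` are hypothetical names, and `tokenize` stands for any picklable callable.

```python
import multiprocessing as mp

_SENTINEL = None  # tells a worker that no more items are coming


def worker(tokenize, in_q, out_q):
    """Pull (index, text) items until the sentinel arrives."""
    while True:
        item = in_q.get()
        if item is _SENTINEL:
            break
        idx, text = item
        out_q.put((idx, tokenize(text)))


def run_fixed_workers(tokenize, texts, n_workers=4):
    """Start n_workers long-lived processes that share one input queue."""
    in_q, out_q = mp.Queue(), mp.Queue()
    # `tokenize` is handed to each worker once, at start-up.
    procs = [
        mp.Process(target=worker, args=(tokenize, in_q, out_q))
        for _ in range(n_workers)
    ]
    for p in procs:
        p.start()
    for item in enumerate(texts):
        in_q.put(item)
    for _ in procs:
        in_q.put(_SENTINEL)
    results = [None] * len(texts)
    for _ in range(len(texts)):
        idx, toks = out_q.get()
        results[idx] = toks
    for p in procs:
        p.join()
    return results
```

Because workers take items whenever they are free, a few very long sequences do not stall the others, which may matter if the remaining overhead depends on sequence length.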

quasi steppe
#

Doing fixed workers is about 1.5-2x faster
Actually for giant_midi, before 4000 samples the speed fluctuates a lot but afterwards it was so fast (like 3x). Happens for both methods. If you just do I/O and read the whole file to memory, the whole thing is like 5 seconds in single process.... So I bet there is still some massive overhead depending on sequence lengths.

hearty flicker
#

I'll look at your commits I'm interested about how you implemented it

#

I was trying to avoid loading the entire mididataset jsonl into memory at once

#

I was worried it would be too big to fit in memory for the larger datasets

quasi steppe
#

not that I want to load everything actually to memory.... Just trying to rule out the possibility that the file system in the cluster node is doing some weird things

#

The jsonl file is also getting fat. The bitmidi is already 6G. If we do 50 epochs it's gonna be 300G but there will be a storage limit in SAI nodes. I definitely can't save 300G with my current quota

hearty flicker
#

Yeah that's kinda why I have the data aug implemented dynamically

#

The dataset files get huge

#

Ha

quasi steppe
#

yeah makes sense

#

I also throw in a jsonl.zst reader/writer. When I was working with pilev2 jsonl.zst was the go-to format that really brings storage down

hearty flicker
#

I think the issue for going from MidiDataset -> tokenized dataset is the pickle

#

It makes sense that the profiler is saying acquire lock is the biggest timesink maybe

#

Because it's trying to acquire the lock on the mididict string / dict

#

Maybe idk, the pickle happens inside that perhaps?

#

Someone skilled in python mp can probably tell what is going on

#

The way that the MidiDataset and TokenizedDatasets are designed is very subpar in general

#

I built it pretty fast

#

I wish I had access to my server so I could run some tests

quasi steppe
#

if there is a way to do mmap with cheap random access for zstd stream reader, we could stick to zstd all the time without worrying about storage. But not sure if it's possible

#

was trying that and failed
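
One workaround that is sometimes used for this is to compress records in independent blocks and keep an offset index, so a random read only decompresses one block. The sketch below uses the standard-library `zlib` as a stand-in for zstd so that it runs without extra dependencies; the file layout and the names `write_blocks` and `read_record` are assumptions made for illustration, and it is not known whether this matches what was tried.

```python
import json
import zlib


def write_blocks(path, records, block_size=1000):
    """Compress records in independent blocks; return [(offset, length), ...]."""
    index = []
    with open(path, "wb") as f:
        for i in range(0, len(records), block_size):
            block = records[i:i + block_size]
            raw = "\n".join(json.dumps(r) for r in block).encode()
            blob = zlib.compress(raw)
            index.append((f.tell(), len(blob)))
            f.write(blob)
    return index


def read_record(path, index, i, block_size=1000):
    """Fetch record i by decompressing only the block that contains it."""
    offset, length = index[i // block_size]
    with open(path, "rb") as f:
        f.seek(offset)
        raw = zlib.decompress(f.read(length))
    return json.loads(raw.split(b"\n")[i % block_size])
```

Smaller blocks make random access cheaper but compress worse, so `block_size` is the knob that trades storage against read latency.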

hearty flicker
#

Ima have a look at this properly this weekend

#

Just to confirm 100%, you are concerned with the speed of 'aria tokenized-dataset' (aka the build method) and not the tokenized dataset class once it is built, right?

hearty flicker
quasi steppe
quasi steppe
hearty flicker
#

If pickling is the problem, another alternative would be to chunk the file into n parts and then each process can convert only its chunk

#

That way, there would be no overhead with starting up and killing each process
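
A rough sketch of the chunking step, assuming a newline-delimited file. `split_byte_ranges` and `iter_lines` are hypothetical names; the point is that each process only needs a `(start, end)` pair of integers, not the data itself.

```python
import os


def split_byte_ranges(path, n_parts):
    """Split a file into up to n_parts (start, end) ranges that end on line breaks."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for k in range(1, n_parts):
            target = max(size * k // n_parts, bounds[-1])
            f.seek(target)
            f.readline()  # move forward to the start of the next line
            pos = min(f.tell(), size)
            if pos > bounds[-1]:
                bounds.append(pos)
    if bounds[-1] < size:
        bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def iter_lines(path, start, end):
    """Yield the raw lines that start inside [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline()
```

Each of the n processes would then call `iter_lines` on its own range, tokenize, and write its own output shard.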

quasi steppe
#

was just thinking if there is anything else

#

@hearty flicker got a lot of aria.data.datasets: [ERROR] Failed to tokenize midi_dict: note_msgs is empty after ignoring instruments when trying tokenized-dataset for bitmidi.

#

oooooooh never mind... The config.json was changed for bitmidi but reverted back when pulling the updates.
But this made the workflow a little inconvenient.

hearty flicker
#

That's expected

#

It's not really an error

quasi steppe
#

oh doesn't help, still got that

hearty flicker
#

It's not an error

quasi steppe
#

gotcha

#
Traceback (most recent call last):
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fsx/home-honglu/aria/aria/run.py", line 178, in <module>
    main()
  File "/fsx/home-honglu/aria/aria/run.py", line 170, in main
    build_tokenized_dataset(args=_parse_tokenized_dataset_args())
  File "/fsx/home-honglu/aria/aria/run.py", line 137, in build_tokenized_dataset
    dataset = TokenizedDataset.build(
  File "/fsx/home-honglu/aria/aria/data/datasets.py", line 641, in build
    buffer += entry
TypeError: 'NoneType' object is not iterable

Also got this. Not sure if it's from my changes or not. Will try some debugging

hearty flicker
#

It happens when you have a midi dict which only has instruments that should be removed according to the config

#

Uhhg this error again

#

I thought maybe it's happening because entry is None so I put in a check for that

#

But I guess not

#

I'll debug this tmo too

#

Properly

quasi steppe
#

I run it on my branch that has a lot of changes on _get_tokenized_seqs_mp. Could be my problem too. Will let you know

hearty flicker
#

If you get that with aria/main let me know

#

The error message was confusing me a bit tbh

#

I didn't see how the type error could apply to that line

#

buffer += entry
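
For reference, in-place `+=` on a Python list raises exactly this message when the right-hand side is `None`, so the traceback is at least consistent with `entry` being `None` at that line, assuming `buffer` is a list. A small sketch, with `extend_buffer` as a hypothetical helper name:

```python
def extend_buffer(buffer, entry):
    """Append a tokenized entry to the buffer, skipping failed (None) entries."""
    if entry is None:
        return buffer
    buffer += entry
    return buffer


try:
    buf = []
    buf += None  # same shape as `buffer += entry` with entry = None
except TypeError as exc:
    message = str(exc)  # "'NoneType' object is not iterable"
```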

quasi steppe
#

oh another problem, now it seems that sequences are concatenated and chunked in a fixed way (we save them to tokenized dataset), and then augmented.
But ideally the data needs to be shuffled first for every epoch, and then concatenated and chunked. There will be a difference when running multiple epochs

#

like
abcde and 123456789 becomes abcde12, 3456789 if sequence length is 7. But with multiple epochs it's gonna be always these two with augmentations.

#

ideally I might want 1234567 and 89abcde or even stuff that starts in the middle like cde1234, ...
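
The behaviour asked for here can be sketched as shuffle first, then concatenate, then chunk, with the shuffle seeded by the epoch number. `chunk_epoch` is a hypothetical name and the ragged tail is simply dropped, which is an assumption rather than a decision made in the thread:

```python
import random


def chunk_epoch(seqs, seq_len, epoch, seed=0):
    """Shuffle sequences with an epoch-dependent seed, concatenate, then chunk."""
    order = list(range(len(seqs)))
    random.Random(seed + epoch).shuffle(order)
    flat = [tok for i in order for tok in seqs[i]]
    # Drop the ragged tail so every chunk has exactly seq_len tokens.
    n_full = len(flat) // seq_len
    return [flat[k * seq_len:(k + 1) * seq_len] for k in range(n_full)]
```

Calling it with a different `epoch` can place the chunk boundaries differently, which is what the fixed pre-chunked file cannot do.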

hearty flicker
#

Ok added to the list, will be good to get this all squashed now ha

#

That might be a hard issue to fix. I wonder how it's done in NLP

#

I can't think of an obvious way to make it so that the sequences are concatenated differently for each epoch

#

There must be a way that it's done in NLP

#

Any thoughts on this @sand nymph ?

quasi steppe
#

we are already close. Basically we just need to skip the tokenized-dataset step, and do this in data loader

hearty flicker
#

Not sure how to integrate that with multiple workers / processes though

quasi steppe
#

basically start with one single process DataLoader that does all of these lazily. To scale to multi-gpus have another DataLoader wrapped around a few processes each targeting the original dataloader (or maybe torch dataloader has some built-in stuff for this). For multi-node I don't think the code needs to change because we sample at random, and global_rank doesn't really matter.
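
A single-process version of the lazy pipeline could look roughly like the generator below. `lazy_chunks` and `load_seq` are hypothetical names: `load_seq(i)` stands for whatever does the I/O, tokenization, and augmentation for entry `i`. Leftover tokens are carried across epoch boundaries, so chunks do not always start at the beginning of a piece.

```python
import random


def lazy_chunks(load_seq, n_seqs, seq_len, seed):
    """Yield fixed-length chunks forever, reshuffling the entries every epoch."""
    rng = random.Random(seed)
    buffer = []
    while True:
        order = list(range(n_seqs))
        rng.shuffle(order)
        for i in order:
            buffer.extend(load_seq(i))
            while len(buffer) >= seq_len:
                yield buffer[:seq_len]
                buffer = buffer[seq_len:]
```

Giving each worker or rank a different `seed` is one simple way to keep the streams from being identical.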

hearty flicker
#

Ok that sort of makes sense I'll look into it tmo

quasi steppe
quasi steppe
#

Man.... Can't believe training a 400M model with 8192 length on a single node can only do batch_size=1... I'm sure pp will help but I thought we only need that when > 1B...

#

and it's 13sec per batch lol... total batch size 128 (grad acc 16 steps)
Running profiling tools now.... Didn't do the math but I'm sure the flops is incredibly bad

hearty flicker
#

The reason why might be that you're not using gradient checkpointing

#

If the issue is the vram, otherwise there must be a different issue

#

You can use the profiler to make sure that flash attention is working?

#

Also are you training in bf16?

quasi steppe
quasi steppe
quasi steppe
# hearty flicker You can use the profiler to make sure that flash attention is working?

Already did. Also tried to change the code to explicitly use flash_attn. No difference in both vRAM usage and speed so flash attention is working well.

Used a profiling tool to dump a json for perfetto visualization. It seems it went alright. Didn't see a lot of bubbles other than low level stuff between ops inside kernels. I don't have an immediate idea about what to improve

#

For vRAM I'm sure it can improve with pp. Need to change code a little bit I think. I'm still really confused by flops

quasi steppe
#

@hearty flicker DDP seems to have trouble with gradient checkpointing

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

Guess that's probably why people didn't recommend gradient checkpointing when I worked on pretraining.

quasi steppe
#

FSDP and cpu offload seems to make it fit and automatically scale to multinodes. Looks slow but didn't do the math carefully. I thought I need to manually write the wrappers but it's nice that accelerate takes care of them already.
I should be able to start a bigger run tomorrow on the bitmidi data we have.

sand nymph
#

@quasi steppe it looks like you can just pre-compute this? Build the two epochs separately and concatenate them. It's only tricky if you try to do it on the fly I think

quasi steppe
quasi steppe
#

@hearty flicker the xlarge sized model (I made a mistake, it's 700M rather than 400M) trained on the full bitmidi at around 8 epochs

#

the chord is fairly different from the start already.

hearty flicker
#

Try it on some multi-track stuff

#

This is just trained on bitmidi right?

#

Not including the classical data?

hearty flicker
#

The file size might get too big

#

My immediate goal is to get the audio downloading and transcription working, then I'll try and sort out the dataloading and training stuff

#

It would be nice @quasi steppe if eventually we could unify all of the code that we are using into aria/data/datasets.py and aria/train.py

#

Don't worry about it for now though, haha

#

What did the loss curve look like for giantmidi-700m?

quasi steppe
hearty flicker
#

It might be undertrained

quasi steppe
#

I'm listening to the generations and it really sucks. There are some good ones but failure cases are too many

quasi steppe
hearty flicker
#

It sounds like it from the sample you sent

#

btw if you test it, you should test it on pop music not classical

#

There is very little classical piano stuff in bitmidi

quasi steppe
#

but the loss is already below 2 (which is where the "overfit" classical one got to). The loss value doesn't mean too much but I would have expected classical is more repetitive and structured than pop and should have a lower loss given same training conditions

quasi steppe
#

not looking good

hearty flicker
#

Weird haha

#

I think it's useful to have val loss just so we have an idea of how over/underfit the model is

#

My experience is that the best checkpoints occur when val loss stops decreasing

quasi steppe
#

or maybe they have fixed it by now. I will try

hearty flicker
#

Hmmm, I've never used hf trainer to be fair

#

Seems to work fine with my train script, idk

quasi steppe
#

hf trainer actually doesn't have fancy code. I tried to glance through it over the weekend. Actually similar to yours but the only advantage is that it works with HF datasets out-of-box and supports a lot of fancy wrappers like FSDP or deepspeed pp, etc.

#

(well I mean it's not hard to adapt codes for those stuff... just that it's a little faster to try out things quickly haha)

hearty flicker
#

Yeah true, I'm going to adjust the dataloader stuff and train script this week

#

So that it does what we need

#

There must be some sort of issue going on, musenet was also trained on bitmidi

#

And the samples from that sound good

#

As a sanity check you could try recreating the checkpoint we've been using on the notebook

#

50 epochs at medium

#

On giant midi + kunstderfuge + mutopia + maestro

#

On the bucket

quasi steppe
#

I'm gonna do some basic data stats. There are a few possibilities:

  • 10 epochs is not enough (but it kinda worked for classical so idk...)
  • I changed the optimizer from Adam to Adafactor for mem reduction but maybe Adam works better
  • model architecture? 96 layers is really really deep and honestly it's my first time training that deep of a network. I could add some monitoring of each layer and I suspect it's really too deep.
  • data quality issue. I did random sampling and many are short (but not short enough to be filtered).

hearty flicker
#

There is a min length filter in the config that you can use btw

#

When building MidiDataset

quasi steppe
#

this training job wasn't serious anyway. Just left it on yesterday and I can't believe it's not preempted for a whole day haha

hearty flicker
#

It could also be that the concatenation stuff we are doing isn't helping

#

Maybe try to recreate the old checkpoint with the new training methods

quasi steppe
hearty flicker
#

And we can directly compare the quality

quasi steppe
#

the 100x large was good IMO

hearty flicker
#

Try with medium because it's hard to know exactly what is affecting what

#

Might be just using a larger model or more epochs

#

It's only 24 4090 hours so shouldn't be hard

quasi steppe
#

oh by the way, this xlarge run used 8192 as max length....

hearty flicker
#

Yeah lol crazy

#

Insane context length

quasi steppe
#

I suspect that 8192 is too long and we got too few training steps so I actually think being undertrained is very likely

hearty flicker
#

If you test the medium cp, test with 2k

quasi steppe
#

yep

hearty flicker
#

There are so many factors here, it's probs best to be systematic

#

Hard to know what is doing what haha

quasi steppe
#

yeah absolutely

#

The sample length distribution of bitmidi. Super long tail and I need to zoom in a bit

#

actually most samples are very long. If we filter out anything below 2048 tokens, we are only removing about 3% of the tokens.......
And we got (only) a total of 700M tokens here.
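
The kind of check behind that 3% figure can be sketched in a couple of lines: compare the tokens held by short samples against the total, rather than counting samples. `token_fraction_below` is a hypothetical name and no real bitmidi numbers are reproduced here.

```python
def token_fraction_below(lengths, min_len):
    """Fraction of all tokens that sit in samples shorter than min_len."""
    total = sum(lengths)
    short = sum(n for n in lengths if n < min_len)
    return short / total if total else 0.0
```

With a long-tailed length distribution, the share of samples below the cut can be much larger than the share of tokens, so both numbers are worth looking at before picking a minimum length.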

quasi steppe
#

multi-node training works. Still crappy flops though...
I will start with trying medium 2048 bitmidi later this week. DDP should be more than enough for that

hearty flicker
#

Very nice stuff

timber talon
#

question —

I got the wav -> midi transcription inference working, and it's outputting a set of files per wav file:

#

do you know, offhand, if these are canonical enough to work with, or should i get the rest of the inference scripts running as well? (these 10 files are all for the same song)

#

because of how bad the code is, the bar goes exponentially higher for getting the other processing scripts working — would probably need a full rewrite. And most of what the other processing scripts are doing is calculating F1 score and such, so just wanted to check

#

oh wait – reading section 4.3, now — definitely want to ignore all the *1st* files in those lists, those are just for the first-level loss calculation. I wonder whether the 2nd.json is enough? If the pitches look weird, there is a conversion function in another file: https://github.com/sony/hFT-Transformer/blob/master/evaluation/m_transcription.py#L98-L106 which converts the json to .txt for mir_eval and does a functional transformation on the pitch specifically

golden chasm
#

@hearty flicker Hi, I have been trying to start a pretraining test using a small dataset created using some random midi files. I have encountered a crash at line 801 of file aria/tokenizer/tokenizer.py.
It crashes on this line, which raises an IndexError if stack is empty:

stack[-1]["dur"] = tok

For the moment I solved in this way, but I am not sure it is the proper solution:

if len(stack) > 0:
    stack[-1]["dur"] = tok

Maybe it is just a problem with my midi files having content which is not supported? Could you share the midi files you used to train the model?

hearty flicker
hearty flicker
#

So the output is just a bunch of eval info, not a midi file

#

Very annoying

#

I could probably write a MIDI converter for this

#

Are those files really just for one song?

hearty flicker
#

I wonder if that is implemented in the repo

#

If this is too hard, we can just use Kong et al for now

hearty flicker
#

@timber talon did you figure out what the 'mpe' acronym means?

#

This file seems to hold the code for converting to MIDI

#

If we can find the right config file, this might be all we need

#

I think you just have to run the AMT methods sequentially to get a MIDI out

timber talon
#

great... yeah, I'll take a look. There was a config file that was kinda the root of the problem yesterday, but I solved that

#

that's not used anywhere else. ok, sweet, this is totally doable. sorry, i'm just not entirely fluent in midi, yet, so wasn't sure if the offsets/mpe files were something someone more familiar than me would look at and be like "oh, obviously, we can turn this into midi "

timber talon
#

transcription is not super fast — 10-20 seconds per piece, on average. But fits on a 12GB GPU. I have access to loads of those, and can easily parallelize this processing

hearty flicker
#

Yeah we can parallelize this heavily.

#

I'm still working on the downloading pipeline

#

should be done by friday

hearty flicker
#

Here is what the midi sounds like once I converted it back into audio using fluidsynth

#

This is so promising lol, has gotten me excited !!

#

I love this piece by the way, I've never heard of this composer

hearty flicker
#

@quasi steppe Over the next few days I'm going to implement a different version of TokenizedDataset that supports the functionality that you want

#

Aka all the epochs are concatenated to one big file.

#

During fine-tuning it might be best to have the original implementation with padding, so I'm going to keep that functionality in a different class

#

I'll also make the changes to the train script so we can work from there directly

#

Currently working on the spotify_dl stuff so alex and I can get our pipeline working

timber talon
#

I spent a lot of time yesterday trying to test out some OMR systems — it would be another source of MIDI that is definitely open-domain, and would open up all of IMSLP. Was hard to find anything that looked credible

#

@hearty flicker do you know of any OMR libraries? Or is OMR just not really a thing?

hearty flicker
#

Not something I've ever looked into

#

There might be some deep learning research on it though

#

the download stuff is nearly working btw

#

I've just forked spotify-dl and added the extra functions we need

timber talon
#

I was disappointed when I tried to use the Spotify_dl in some trial runs. Especially for more diverse composers — women piano composers in the classical era, for instance— I was getting very few hits

#

That’s what motivated searching for some good OMR tools

#

I’ll keep looking around, but if you hear of anything (maybe your advisor knows?) that would be really cool

#

I signed up for IMSLP, btw, so that’s another potential source for public domain audio and midi

hearty flicker
#

Yeah, I mean spotify_dl can only find stuff it matches on youtube

#

It really depends how much of it is on youtube

#

Any other way of sourcing solo piano recordings also works

#

We can use a combination of different methods. As the (audio -> midi) transcription is so good, any source of solo piano audio would be great

hearty flicker
#

@timber talon You can use my fork of spotify-dl

#

It automatically prunes out non solo piano recordings by ensuring that the number of artists on the spotify metadata is <= 2

#

It also skips downloading duplicate files (only album/playlist wide at the moment, will improve this)

#

And it takes a text file of links with the --file arg

#

Actually it should skip downloading dupes for every album in the text file
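
The two filters described above (at most two artists, no repeated downloads) amount to something like the sketch below. This is not the fork's code: the metadata keys `"id"` and `"artists"` and the name `select_tracks` are assumptions made for illustration.

```python
def select_tracks(tracks, max_artists=2):
    """Keep tracks with at most max_artists artists, skipping repeated ids."""
    seen = set()
    kept = []
    for track in tracks:
        if len(track["artists"]) > max_artists:
            continue  # likely not a solo recording
        if track["id"] in seen:
            continue  # already queued from another album or playlist
        seen.add(track["id"])
        kept.append(track)
    return kept
```

Keeping `seen` across every link in the text file, rather than per album, is what would make the duplicate check global.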

timber talon
#

some initial results using Audiveris to OMR on some sheet music.

It's not great, but if we're doing any data augmentation/ noising, it will fit in on that level

hearty flicker
#

How do the mxl files work?

timber talon
#

Oh you load them into musenote to listen. They’re primarily for musical notation software. But they’re also convertible to MIDI, I can do that quickly

quasi steppe
#

Got a lot of idle machines in the cluster so I sent a training job this morning. 64 A100, large.json (700M?), 100 epochs of bitmidi. I have gone back to 2048 context length. We can extend the length later easily.

#

That bump is interesting though. It does happen all the time in LLM pretraining

hearty flicker
#

That's cool

#

I've been bogged down with a bunch of irl stuff, should have more time this next week however I am moving apartment so we will see

#

The dataset stuff with @timber talon is looking very promising. Going to try to retrain the hft model on monday

quasi steppe
#

25 epoch checkpoint is actually pretty good!!!

#

I think directly pretraining on 8192 is somehow hurting the quality

#

I'm more and more convinced an LM should pretrain on 2048 or less, and then extend procedurally

#

the prompt is 200 tokens (first 2-3 bars) from some random online stuff. No CFG, an honest 1.0 temp for the current one.

hearty flicker
#

This is pretty good

quasi steppe
#

bitmidi is pop heavy right? How much classical could it have?

hearty flicker
#

It might have some

#

hard for me to know

#

It's a webscrape of like 200k midi files

#

The context length thing is interesting

#

This sounds way way better than last time

quasi steppe
hearty flicker
#

Could it be a problem with the freq used for rotary embs?

quasi steppe
#

I should probably have increased lr by 4x but I doubt how much it helps. Each step is an attempt of searching the loss landscape, fewer amount definitely covers less ground

hearty flicker
#

Will be cool to see if the new tokenizer improves the results

#

My suspicion is that it will make timing related stuff better

quasi steppe
#

something different. So it knows to improvise at temperature 1.0. Better decoding param should give much better results (esp I haven't applied CFG and we haven't implemented beam search)

quasi steppe
#

This is quite interesting. The job stopped and I had to use an earlier checkpoint to resume. Dataloader should also be resumed and deterministic, esp since most loss values are having the same up and down. But there was a loss spike all of a sudden
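
One way a data order can be made reproducible across restarts is to derive it purely from a seed and the global step, so resuming needs no saved loader state. This is a sketch of that idea and not how the HF Trainer does it; `batch_indices` is a hypothetical name.

```python
import random


def batch_indices(n_items, batch_size, seed, global_step):
    """Return the item indices for a given global step, independent of history."""
    steps_per_epoch = n_items // batch_size
    epoch, step = divmod(global_step, steps_per_epoch)
    order = list(range(n_items))
    # The shuffle depends only on (seed, epoch), so it can be rebuilt on resume.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order[step * batch_size:(step + 1) * batch_size]
```

If the batches are identical before and after the restart, a loss spike that only appears after resuming has to come from somewhere other than the data order.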

hearty flicker
#

That might be because the Adam params reset

quasi steppe
#

no, optimizer states are recorded

hearty flicker
#

Weird

quasi steppe
#

everything is saved well with huggingface's Trainer. They did a good job on this

quasi steppe
# hearty flicker Weird

I suspect there is some minor floating point issue in distributed training that is not deterministic. We didn't see these on single node training job

#

I used to think those spikes in LLM pretraining are due to data outliers. Now at least this is ruled out

quasi steppe
#

Nice. The 8-node job is still running. 36B tokens trained. We have more tokens than parameters by the way. if it ever overfits it shouldn't be that bad.

hearty flicker
#

Was this with 2048 too?

quasi steppe
#

gonna be interesting

#

I saved every 5000 steps so we can even study how the latent space changes

quasi steppe
#

125000 step checkpoint... This is quite amazing.
I generated 8 samples and all of them are amazing.

#

I did some random sampling of bitmidi dataset and I'm fairly confident that this prompt is close to being out of distribution. Meaning this "overfit" model generalizes very well

quasi steppe
#
Traceback (most recent call last):                                                                                                                   
  File "/fsx/home-honglu/aria/generate_large.py", line 135, in <module>
    sample(
  File "/fsx/home-honglu/aria/generate_large.py", line 106, in sample
    res_midi_dict = tokenizer.detokenize(tokenized_seq)
  File "/fsx/home-honglu/aria/aria/tokenizer/tokenizer.py", line 79, in detokenize
    return self.detokenize_midi_dict(tokenized_seq)
  File "/fsx/home-honglu/aria/aria/tokenizer/tokenizer.py", line 469, in detokenize_midi_dict
    _channel = instrument_to_channel["drum"]
KeyError: 'drum'

Got this error

hearty flicker
#

This is amazing work

#

Hmmm it looks like that issue happened because there was a drum token sampled, but it's not an instrument that is supposed to be there?

#

I can probs put a check for that, but technically it shouldn't ever happen

quasi steppe
#

maybe like adding a force=True so that we turn those errors to warnings and just stop the detokenize at wrong token for each sample

hearty flicker
#

Yup I can add a check
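
The proposed `force=True` behaviour could look roughly like this toy version. It is not the aria tokenizer: tokens are assumed to be `(instrument, pitch)` tuples and `detokenize` here is a hypothetical standalone function, shown only to illustrate turning the error into a warning and keeping the valid prefix.

```python
import warnings


def detokenize(tokens, known_instruments, force=False):
    """Collect tokens until one names an instrument that is not expected."""
    notes = []
    for tok in tokens:
        instrument = tok[0]
        if instrument not in known_instruments:
            msg = f"unexpected instrument token: {instrument!r}"
            if not force:
                raise KeyError(msg)
            warnings.warn(msg)
            break  # keep what was decoded so far
        notes.append(tok)
    return notes
```

With `force=True` a sampled `("drum", 36)` token in a piano-only config would end the sample early instead of crashing the whole generation run.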

hearty flicker
#

How is the Yarn stuff going btw @quasi steppe ?

#

It seems to work super well for nlp, will be amazing if we can get it working well in this context

#

I need to get a fluidsynth soundfont setup that correctly handles multitrack

quasi steppe
quasi steppe
#

@hearty flicker is there a convenient UI that reads the model output as a stream and renders it into sound? Would be cool to write a model-based infinite music player if such a stream reader is already available in open source

hearty flicker
#

I've had exactly that idea

#

And it's super possible

#

I'll build an API for it today maybe

#

It would be really cool and easy to write

quasi steppe
#

I tried to run my model locally and it was about 15 tok per sec. If we quantize it it can be really fast

hearty flicker
#

Yup, more than enough for live playing

quasi steppe
#

Def good enough for streaming tokens

hearty flicker
#

I was thinking of eventually building a ggml backend too

quasi steppe
#

Yeah!

hearty flicker
#

Did you fix the problem with the bucket perms btw? Maybe I was doing something wrong

quasi steppe
#

From sai cluster it was alright

hearty flicker
#

I kept getting 403s

quasi steppe
#

I already put in all the model checkpoints

hearty flicker
#

When I'm at my laptop I'll get back to you

quasi steppe
#

Yeah I will try again later

hearty flicker
#

Were you using gsutil mv?

quasi steppe
#

cp

hearty flicker
#

Alright I'll try that

quasi steppe
#

Before it needs gcloud auth .... using that json

quasi steppe
#

int8 quantization kinda works. Feels like 1.5x speed-up from fp16 but haven't measured it. I will push the code later.

Made a draft PR to fool around but we don't have to merge it. Still playing with quantization on my laptop.

hearty flicker
#

Is there much quality degradation?

quasi steppe
#

not sure what the best setup is for CPU inferencing

hearty flicker
#

I don't think CPU will work with flash attention

#

I tried it a few weeks ago and it didn't work

quasi steppe
#

It works on my computer

hearty flicker
#

I'm moving flat today so am quite busy, next week I should be back to my normal schedule

#

Oh that's weird, maybe there was a patch

quasi steppe
#

I have a draft PR for fooling around. It works fine on my computer. I refactored all those device stuff. You can play with it.

hearty flicker
#

I tried it briefly and scaled_dot_product_attention was causing some issues

#

Might be my computer dependent though ha

#

I will play with it

quasi steppe
#

pytorch has some other quantization options. If we use the MultiheadAttention class instead of the scaled_dot_product_attention, something seems to apply to that directly. Currently I only quantize all those dense layers (also need to sort out the degradation.... Hope it's just a parameter issue)

#

When the sequence gets long the speed is a loooot slower on my laptop and I fear without quantization the token stream wouldn't catch up with the player frontend, if we want to do what we talked about earlier

hearty flicker
#

If we did it with ggml it would be fine

#

If we are aiming for arm macs

#

I mean you can run llama7b at 10-15 toks/s I think

#

So for our model there will be more than enough speed

sand nymph
#

What's the formula for model training cost (in A100-hours) in terms of dataset size and # params?
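
A commonly used rule of thumb for dense transformers is C ≈ 6·N·D FLOPs for N parameters and D training tokens, which gives A100-hours ≈ 6·N·D / (peak FLOP/s × utilization × 3600). The sketch below uses 312 TFLOP/s, the A100's dense bf16 peak, and an assumed 40% utilization; the utilization is a guess, and the rule ignores attention cost, which grows with context length.

```python
A100_PEAK_BF16_FLOPS = 312e12  # dense bf16 peak per A100


def a100_hours(n_params, n_tokens, mfu=0.4, peak_flops=A100_PEAK_BF16_FLOPS):
    """Estimate A100-hours from C ~ 6*N*D and an assumed utilization (mfu)."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops * mfu)
    return seconds / 3600
```

As an example with numbers mentioned in this thread, `a100_hours(700e6, 36e9)` comes out at about 337 A100-hours under the 40% assumption; at lower utilization the figure scales up proportionally.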

quasi steppe
hearty flicker
#

If anyone is interested, here are some samples I compiled a few weeks ago from the old version of the model @here

#

Will be interesting to see how much of a difference the improved tokenizer/datasets/scale/finetuning will make

#

Ok SAI keeps kicking me off and I'm stuck with using the terrible train script provided by this paper

#

Will run the transcription training on my home server instead

quasi steppe
#

there are some idle machines in SAI today. Gonna try to see if I can get yarn finetuning working.
Managed to get 100 training steps done but crashed when it tries to save optimizer states. I got a lot of troubles with my training codes and getting weird C++ errors all the time...
Gonna switch to your codes and I'm building a 8192-length dataset now

hearty flicker
#

Could be an accelerate issue maybe?

#

I haven't fully tested my training script on multiple nodes etc.

quasi steppe
#

it worked before but all of a sudden it breaks down.
Still working on it. I found some bugs in the yarn code. Will do a major PR for the YaRN component

#

but I don't know if those will resolve the c++ errors yet

hearty flicker
#

Cool, I'm having a bug in the data augmentation stuff with the new tokenizer, so I'll probably not be able to merge it till monday. I'm away this weekend in cambridge unfortunately so I can't work on it

#

c++ errors scare me

#

haha

quasi steppe
#

got something along the lines of "some kernel returned NULL without raising an error". Googled it and saw in pytorch forum a dev said he hasn't seen this error for years lol

hearty flicker
#

Would be so amazing to get yarn finetuning working

hearty flicker
#

I was getting C++ errors when trying to get gradient checkpointing to work with torch.compile

quasi steppe
#

yeah mine was from torch.compile too

hearty flicker
#

Maybe we should add a flag to the train script to skip compiling

#

Since it seems to randomly cause weird issues
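
A minimal sketch of what such a flag could look like, using only `argparse`. The flag name `--no-compile` and the helper `maybe_compile` are hypothetical and not taken from the actual train script. In a PyTorch script the `compile_fn` argument would be `torch.compile`; it is injected here so the sketch runs without dependencies.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="training script (illustrative)")
    # Hypothetical flag name; compilation stays enabled unless it is passed.
    parser.add_argument(
        "--no-compile",
        action="store_true",
        help="skip model compilation",
    )
    return parser


def maybe_compile(model, skip_compile, compile_fn):
    """Return the model unchanged when skip_compile is set, else compile it.

    In a real PyTorch script compile_fn would be torch.compile.
    """
    return model if skip_compile else compile_fn(model)
```

Keeping compilation as the default and making the skip opt-in means existing launch commands keep their current behaviour.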

#

I need to remember to go over the transformer optimization document you sent also

quasi steppe
#

I disabled that, the training got slower but it worked until it was trying to save the first checkpoint. When it got to optimizer states it got another c++ error along the line of "all gather failed because different devices have different values of something" lol

hearty flicker
#

Very weird

#

Is this using hf Trainer?

quasi steppe
hearty flicker
#

Could be an issue on that end then

#

Was this also on multiple nodes?

quasi steppe
#

and I upgraded accelerate in my conda env and that's another possible cause

#

will try to debug more today

hearty flicker
#

I can also have a look on monday

#

I should probably read the Yarn paper properly too

#

did the iclr rebuttals get reviewed yet?

quasi steppe
#

the yarn code was a bit confusing and I realized I need to copy a different class from our yarn repo if I want to do finetuning. I'm trying to refactor a bit to clean some stuff up.

quasi steppe
hearty flicker
#

Oh lovely !

#

How about cfg?

quasi steppe
#

so probably a rejection

hearty flicker
#

A shame, made a big impact for this project anyway haha

quasi steppe
#

it's a bit unfair. One reviewer basically said, oh CFG exists in CV so you are not novel

hearty flicker
#

Yeah but that's the entire point

quasi steppe
#

We were like, that was the whole point lol

hearty flicker
#

That's so annoying

quasi steppe
#

and this guy never responded to our rebuttal.
Pretty sure the 2x 5's didn't spend more than 10min on our paper. The 6 guy raised a lot of good points. I really think if we can get qualified experts to read it, we should at least get overall 6

hearty flicker
#

Pretty annoying reason to get a rejection

timber talon
#

extremely annoying, yeah

#

@quasi steppe what's that secret part of your website again where you have all the generations you've been producing?

quasi steppe
#

These are from the giantmidi model. Haven't uploaded the bitmidi generations yet

quasi steppe
#

Couldn't get YaRN working. The finetuned checkpoint is gibberish....

#

@hearty flicker Dataset was alright. Trying to run your script.
AttributeError: 'Namespace' object has no attribute 'train_data'. Did you mean: 'train_dir'?
This is probably a typo.
Will fix and try it again tomorrow.

hearty flicker
#

Yeah must be a typo

#

Did this error happen when building PretrainingDataset from the cli?

#

Pushed a fix

hearty flicker
quasi steppe
hearty flicker
#

Ah ok cool

sand nymph
#

It looks like we'll be able to get y'all at least 5k A100-hours of dedicated compute from Stability

hearty flicker
quasi steppe
#

finetuned YaRN generating a 7000 token sample. Maybe it has a stop token earlier. It's somehow shorter than I expected.
I feel the quality went down quite a bit.

#

oh crap towards the end it was completely gibberish

#

The final loss of YaRN finetuning is on par with the pre-training final loss. It's possible that the model just gamed the perplexity (it's well known that short repeating patterns can game the loss and show an artificially low perplexity) and biased towards those easy patterns that are weird to human.

hearty flicker
#

It's quite curious that the loss is on target but the results are not

#

Was anything similar observed when applying YaRN to nlp?

quasi steppe
#

It was ok with the limited amount of downstream task experiments. Like passkey is alright, basic completions are alright. But it's hard to find a super long task and it got prohibitively slow to do something fancy

#

For the 2048-8192 extension I remember it was all ok

quasi steppe
#

A long range sample coming out of the yarn finetune (same as last night). I swept and tuned the attention scale a little bit. This one seems reasonable at the longer end.

hearty flicker
#

This sounds pretty good to me

#

What range of extension was it, 2k->8k?

hearty flicker
#

What changes did you have to make to get it working better?

quasi steppe
hearty flicker
#

Super excited about that

#

This sample is really good for how long it is

quasi steppe
#

@narrow sorrel Want to share your demo video here so that people can see? It's just a thought. If not, don't worry.

#

Also, got a new YaRN finetune checkpoint using new mscale parameter, a lower LR and a longer initial warmup. Here is an 8000-token sample.
(is a bit weird towards the end....)
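
For context on `mscale`: the YaRN paper recommends (for LLaMA-family models) an attention temperature t that depends only on the context extension factor s, via sqrt(1/t) = 0.1·ln(s) + 1. Below is a minimal sketch of that default. The tuned value used for this checkpoint is not given here, so this only shows the paper's starting point, and the function name is made up for illustration.

```python
import math


def yarn_mscale(scale_factor):
    """Default YaRN attention scale: 0.1 * ln(s) + 1 for extension factor s.

    Returns 1.0 (no rescaling) when the context is not extended.
    """
    if scale_factor <= 1:
        return 1.0
    return 0.1 * math.log(scale_factor) + 1.0
```

For a 2k to 8k extension (s = 4) this gives a value of about 1.14, so sweeps around that number are a natural place to start.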

narrow sorrel
# quasi steppe @narrow sorrel Want to share your demo video here so that people can see?...

sure! here's the script i made over the weekend to play with aria: https://github.com/EleutherAI/aria/pull/79
thanks Honglu for training the models!

GitHub

This PR adds a proof of concept script for playing turn-by-turn with aria model.
To try, run the following (currently only works on mac, tested on m1 max):
git clone https://github.com/maxreciproca...

hearty flicker
hearty flicker
#

Hey guys, going to aim to get a pr in for doing supervised finetuning this week

#

If anyone has any experience on doing supervised finetuning with decoder only models, any references would be useful

#

The test example I am planning to implement is key detection
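
One common way to set up SFT for a decoder-only model is to concatenate prompt and target and mask the prompt positions out of the loss. Below is a minimal sketch of that, assuming token ids are plain integers. `build_sft_example` is a hypothetical helper, and `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. For key detection the target would be the key label token(s).

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_sft_example(prompt_ids, target_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and target; only target positions carry a label.

    The usual next-token shift of labels is left to the training loop.
    """
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [ignore_index] * len(prompt_ids) + list(target_ids)
    return input_ids, labels
```

With this layout the model still attends to the whole prompt, but gradients only come from predicting the target tokens.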

#

Also, expanding on @narrow sorrel's script I'd love to implement some real time I/O in the inference library - maybe next week

timber talon
timber talon
#

By SFT, I assume you mean just basic SFT, or continued pretraining. This is a great script:

https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py

In addition to SFT, there's another direction to go in, which is low-rank approximations. this PEFT library: https://github.com/huggingface/peft implements a lot of the standard LoRA approaches. These are potentially cool because you can maybe mix/match LoRAs. The Adapter library also claims to be a similar standardized repo: https://github.com/adapter-hub/adapters

#

i hope i'm answering your question and not just spouting stuff you already know — in general, anything FT-related is very, very standardized by now

hearty flicker
#

Ok guys, the new tokenizer is implemented and the initial results are very promising - the timing problems seem to be mostly fixed

#

Running a proper test to get a better sense for the differences...

hearty flicker
#

New tokenizer seems to help with long term structure too

#

It's crazy to think these samples come from a model trained on the same testing dataset that I first used 5 months ago... sounds so different -- lots of progress

tiny coral
#

Those samples sound incredible. What are the key differences between this new tokenizer and before?

quasi steppe
hearty flicker
#

Yeah I got the idea randomly from a paper about automatic music transcription that came out last year

#

Will give a better update tomorrow... I'm training a model on a larger classical dataset overnight too - I'll update the notebook to use it

narrow sorrel
# timber talon that is soo cool!!! what midi keyboard are you using, btw?

that's a Seaboard, also it supports MPE (i think you mentioned it sometime ago) https://youtu.be/6SCug5kUsBs

Seaboard BLOCK M is here! Order now: https://roli.com/products/blocks/seaboard-block-m

Make music your superpower with the most compact, portable, and affordable Seaboard ever made. It fits in a backpack so you can play it anywhere. Or use it on your desktop with ROLI Studio or Dashboard for third-party DAW integration with LIVE 11 or Logic.
...

hearty flicker
#

Here is an unprompted piece on the style of mozart @here

#

Quite simple but it definitely sounds like music and has the

hearty flicker
#

This is also quite nice, it's super good at doing stuff like this

#

Creatively missing something but is nailing most 'musical' aspects of this continuation (first 20secs is prompted)

#

It's weird because the original piece honestly makes me feel very emotional, but this makes me feel nothing...

hearty flicker
sand nymph
#

@karmic skiff

#

@hearty flicker @quasi steppe reminder to go talk w/ Stability on their slack to get compute approval

quasi steppe
#

urgh wrote a live player but cpu inference is not fast enough to keep up with playing...
Only medium model can work on my shitty laptop

quasi steppe
quasi steppe
#

Implemented rolling window. We can have very very long music now, at the cost of forgetting.
On my computer I can only continuously stream 400 without catching up 😂

hearty flicker
hearty flicker
quasi steppe
#

basically it means if we generate all the way to 8192 with a window of 4096, the last token is generated as if the whole context is 4096 with the first half being removed

#

this is a cheap way to mitigate the context window limit and usually it's kinda okay. The model just forgets, but a good model should generate something that feels smooth. Here somehow it degenerates.
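
The rolling window described above can be sketched as follows. This is a toy illustration, not the actual implementation: `next_token` stands in for a model call and is an assumption, as is the function name.

```python
def generate_rolling(prompt, next_token, total_len, window):
    """Generate up to total_len tokens, conditioning only on the last `window`.

    next_token is any callable mapping a context (list of tokens) to one token.
    Everything older than the window is simply invisible to the model.
    """
    tokens = list(prompt)
    while len(tokens) < total_len:
        context = tokens[-window:]  # older tokens are dropped, i.e. forgotten
        tokens.append(next_token(context))
    return tokens
```

With `total_len=8192` and `window=4096`, the final token is predicted from positions 4096 to 8191 only, which matches the "first half being removed" description. Because each step conditions on earlier samples, one bad stretch stays in the context until it slides out of the window.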

timber talon
#

ohhh i see

#

thanks for clarifying!

#

Idk if you saw the Phi-2 lm that microsoft just released — https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ pretty cool stuff

Phi-2 is now accessible on the Azure model catalog. Its compact size and new innovations in model scaling and training data curation make it ideal for exploration around mechanistic interpretability, safety improvements, and fine-tuning experimentation on a variety of tasks.

#

their two takeaways in particular — about data quality as well as scaling up model training, are cool. We're already trying to get high-quality data (and in general, music data, which is usually a performance of music by a famous composer, is probably higher-quality than random text data we find on the web?). W.r.t the second point, about scaling up model training, I know there was discussion in the #research channel a few weeks ago about scaling up language models. Makes me wonder if we should think about some of these smaller models you are training as initialization-points for a larger aria model

narrow sorrel
hearty flicker
#

Actually in this case there is a composer token too specifying that the composer is Chopin

hearty flicker
#

I tried out fine-tuning on some jazz music (about 400 tracks of jazz recordings)

#

Unfortunately the fine-tuning data is pretty poor quality, however the transfer learning is definitely working which is cool !

#

Here is a jazz version of Moon River

#

To get a sense of the quality of the fine-tuning data you can listen to the first 30 seconds or so of this mp3 - this is the prompt

#

@here

#

This only cost 1 A100 hour to fine-tune. It would be amazing to get access to some higher quality fine-tuning datasets

quasi steppe
#

haha finally got a generation almost identical to the original (except towards the end). The bitmidi 100x epochs model actually memorizes jingle bell pretty well

quasi steppe
#

this one shows pretty clearly how rolling window fails... a few bad samples get accumulated into nonsense and once out of distribution, it can't find its way back

timber talon
#

ending a music generation is always really hard, it always trails away until you artificially stop it

#

what struck me about this work was they introduced the concept of "ending" a narrative, and developed metrics to track what a narrative ending was

#

this story-generation work is pretty cool, how it uses a separate classifier (BERT-tiny) to toggle between local context and global context. Maybe that can be applied to what we're doing with CFG

sand nymph
#

@hearty flicker @timber talon @quasi steppe Did y'all fill out the JIRA research grant request for Stability to get project allocation for aria?

timber talon
#

hello stella and happy new years/holidays! I think i missed the discussion of this grant — i'm still waiting on SAI approval for slack. Did the discussion occur there? Anyway, I am happy to help with grant writing

#

if it occurred here and i just missed it, my apologies — I'll touch base with @quasi steppe and @hearty flicker

sand nymph
#

Ah, Honglu and Louis got approved. I'll go nag about getting you in

#

The discussion was basically "this looks cool, submit a request":

#

There's an overview of what they're looking for here

#

The actual submission is submitted here

hearty flicker
#

@sand nymph I submitted it before Christmas, Zach said that they were on a winter break but will look at it after they get back.

hearty flicker
sand nymph
#

@hearty flicker Great

hearty flicker
#

Here are two nice continuations of the first movement of Beethoven's moonlight sonata !

hearty flicker
#

I also thought this was quite pretty. Any Aphex Twin fans @here? I rendered the piano in ableton to try to make it sound more like the original

timber talon
#

wow ^ that's amazing! how it keeps the same rhythmic patterns throughout

hearty flicker
sand nymph
hearty flicker
#

The dataset building me and Alex have been working on has gone incredibly well, just in time too

#

This is the product of an audio recording that has been turned into a symbolic form (think sheet music) via a neural net (similar to OCR in nlp), and then rerendered into audio using software

#

So essentially the data cap on our model is all audio piano recordings that we can obtain

#

Will be exciting to keep working on this after the ICML deadline : )

sand nymph
#

I have gone and fought the monsters and come back victorious: you should have a compute authorization to train these model on the Stability AI cluster

timber talon
hearty flicker
#

Running this pretraining this evening in that case

hearty flicker
sand nymph
hearty flicker
#

Pretraining is definitely working. The optimal sampling parameters seem to be different than before, especially top_p
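
For anyone unfamiliar with `top_p`: nucleus sampling keeps the smallest set of most likely tokens whose probability mass reaches `top_p` and samples only from those. Below is a minimal sketch over a plain list of probabilities. The helper name is made up, and this is not the sampling code used for these samples.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept
```

A better calibrated model puts less mass on bad tokens, which is one plausible reason the best `top_p` can shift after retraining.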

hearty flicker
timber talon
# hearty flicker

the slowing down and cadence at ~1:20 to end the phrase is pretty crazy

hearty flicker
#

crazy good or crazy bad haha

#

I've been playing around with the style transfer stuff and it's super interesting. This is supposed to be a liszt inspired continuation of a chopin nocturne

#

A bit all over the place structurally, but sounds a bit like liszt to me

timber talon
#

add that to the musical coherence of the cadence

#

all in all, pretty impressive

hearty flicker
#

This is actually from a half trained version of the smallest model

#

Am super excited to see what the large one is like

timber talon
hearty flicker
#

So no dynamic cfg stuff really

timber talon
#

ohhh cool!

quasi steppe
#

yeah this meta-token approach is definitely more natural.

#

I still don't know if dynamic cfg trick works better for a better base model or not. Maybe I need to try it out. The way I think of it is that it might be suitable for more complex guidance (like we want something to get somewhere at certain exact time, or mixing more than 2 styles)

hearty flicker
#

cfg definitely is working amazingly

#

These samples are normally produced with a cfg of 1.05 to 1.2
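
To make the cfg number concrete, here is the usual formulation of classifier-free guidance on logits: run the model with and without the conditioning, then extrapolate from the unconditional logits towards the conditional ones. This is a minimal sketch on plain lists, and the function name is an assumption. The exact implementation used for these samples isn't shown here.

```python
def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), per token.

    scale = 1.0 reproduces the conditional logits; values above 1.0 push
    further in the direction the conditioning suggests.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

A scale of 1.05 to 1.2 is therefore only a small nudge beyond plain conditional sampling.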

quasi steppe
#

Our YaRN finetuning runs on MAESTRO (50 epochs, 16k context length) have also finished (2 larger runs remain but will be done soon).
OMG it's working sooooo well! I'm super excited!!! I mean, not as high quality as the original MAESTRO midis but I've never gotten such a coherent one before.
A little weird in the middle and I thought it was gonna go brain-dead like before, but it snapped out of it.

hearty flicker
sand nymph
#

@hearty flicker you had pitched ICML as a target venue. The deadline is in six days which is doable but tight. Is that still the goal?

Have you started writing the paper already? I would 100% start now while the models still run.

hearty flicker
#

I'm writing as we speak !

quasi steppe
#

which one do you guys think is of higher quality? (first ~30-40 sec are the same)
goose11 for first
goose16 for second

(urgh need to render into mp3 first next time)

quasi steppe
hearty flicker
#

Hey @here, please take this survey if you are interested !

sand nymph
#

I'm doing a post-submission editing pass on the paper, including flagging areas where I think more substantive experimentation would be valuable.

@hearty flicker: do you think we could get some music professionals to give feedback on the music we generate?

hearty flicker
#

That was one of the aims initially, but we didn't have time

#

I was in the process of reaching out to various people who are experts in a certain form of music that is simultaneously quite 'rule based' whilst also being very free. The model excels in creating this type of music, and I was curious about what a musicologist in this area would think.

#

I would like to have a section on this topic in the arxiv preprint

sand nymph
#

Great

hearty flicker
#

I would really like to have an extensive experiments and evaluation section in the preprint

#

We also didn't get to do many experiments with finetuning/alignment which is super important

sand nymph
#

I strongly recommend not using -small -med and large labels and instead labeling models by their actual size

sand nymph
#

Are the %-ages in the dataset section the amount of the actual pretraining data?

#

Why was dropout used @hearty flicker?

#

/ was it? I only see one passing reference

hearty flicker
#

Dropout was used during training

sand nymph
#

That's definitely something we need to disclose given how non-standard it is

hearty flicker
#

I thought it was standard, but have realised now there are some issues with the training process we should have done differently

#

I thought it was standard for some reason

#

We also used GELU, which we forgot

sand nymph
#

You specifically said otherwise in the paper

hearty flicker
#

We used dropout during training, which is why the train loss is higher than the val loss

sand nymph
#

Oh that was about GELU

hearty flicker
#

Oh sorry, I meant that we used GELU instead of SwiGLU or whatever the modern activation function is

#

We stated it accurately in the paper, I just wish we trained without dropout and with a better activation function

#

Not that it matters so so much, but still

sand nymph
#

You said you used the architecture from LLaMA. I'm currently changing that to be correct.

hearty flicker
#

Oh I meant to say inspired by, the only difference is the activation function and dropout.

sand nymph
hearty flicker
#

Yes

#

For the largest model, we also trained medium.json and small.json

sand nymph
#

How much of the SAI grant was used for these three models?

hearty flicker
#

I think that training large was roughly 2000 hours

#

And the others were roughly 1000 hours together

#

So we have quite a lot left, I think

sand nymph
#

Cool, so we can do some ablations then 🙂

#

and/or some larger models

hearty flicker
#

I was originally going to do that too, but ran out of time too

sand nymph
#

No worries

hearty flicker
#

Like training on all the data vs only a high quality subset

#

And training deep arch vs wide arch

#

Since MuseNet (openai) used 100+ layers for some reason

#

I did train a deep version of 'large' briefly, but after 5 epochs it was quite clear it was learning far slower, so I stopped it

#

Honestly these models (small and medium at least) are not that expensive to train, and fine-tuning is basically free

#

Lots of opportunity for ablation

#

I also have a lot of evals around the metadata and controllability stuff, about 2 days before the deadline someone reached out to me regarding this

#

They had basically judged the outputs of our model, using a music listening model (audio)

#

And found that for samples generated with a genre meta-data tag (e.g. jazz), the audio model said the same

sand nymph
#

That's cool

timber talon
#

it's not my current field though, and I'm not deep into the musicological/theoretical side

#

it's actually a pity we didn't call that out in the paper — "one of our annotators was X". it would've given more credibility

sand nymph
timber talon
#

it was another lifetime, i also dropped out after 2 years — competing against hundreds of other musicians for the opportunity to play the same few classical music songs for a bunch of old white people wasn't a very appealing career path, long-term. but anyway, i think we're all in agreement about needing more different types of evals though!

sand nymph
timber talon
#

ah yeah, agreed that makes sense

hearty flicker
#

Alex's insight was pretty handy while we were writing

#

I have a meeting tomorrow with my supervisor, he also mentioned this idea. He might have some people in mind

hearty flicker
#

Hey @here! We pretrained three models over the last couple of weeks, and it went well! I wanted to lay out my ideas for future directions, which mostly revolve around finetuning/alignment efforts. The pretrained models are quite powerful; however, we'd like to release models that are better aligned and easier to use.

Aligning these models is an interesting problem: there isn’t really data available in the equivalent of a question-answer format. We need to find other ways to align these models. The two main issues are:

  1. The model acts as a continuation generator. If you give it a prompt that sounds mediocre or boring for whatever reason, it will generate a continuation in the same style. Just like aligned LLMs, I would like the continuation to be high quality no matter the quality of the prompt.

  2. The most effective method for controllability is just giving the model a very high-quality prompt. This results in amazing continuations (seriously amazing) but limits how you can use the pretrained models. I want to improve both static (such as predefined genres/styles) and dynamic forms of controllability.

My ideas for how to solve these problems:

  • During finetuning, separate the prompt from the continuation using a <SEP> token. When training randomly add different amounts of noise (degradation) to the prompt. I think this might help with both (1) and (2) as it implicitly separates the prompt from the continuation. At inference time, you could specify how much ‘noise’ the prompt has, so that the model knows how closely to treat it as ground truth.

  • By using a listening model (audio classifier), my supervisor has found that it’s possible to tag different moments in the training data (MIDI) with tags like ‘sad’, ‘slow’, ‘jazzy’, etc. These tags actually work very well, and I want to incorporate them in a similar fashion to the ‘diminish’ token <D>, which causes the piece to end 5-10secs from when it is seen.

#
  • I really want to incorporate RLHF in the alignment process. This may be key for improving the consistency of unprompted samples. This feels like a big thing to take on, but with recent improvements (DPO) I think it could be possible to incorporate. I don't have a lot of experience with RLHF, but it's an area I'm personally interested in.
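
The first idea above (degrading the prompt and separating it with `<SEP>`) could be sketched roughly like this. Everything here is hypothetical: random token dropping is only a stand-in for whatever degradation ends up being used, and the `<NOISE_k>` tag is an invented way of telling the model how noisy the prompt is.

```python
import random

SEP = "<SEP>"


def build_noisy_example(prompt, continuation, noise_level, rng=None):
    """Build one finetuning sequence: [noise tag] + degraded prompt + SEP + target.

    noise_level in [0, 1] is the probability of dropping each prompt token.
    The continuation is always left clean, so the model learns to produce
    high-quality output regardless of prompt quality.
    """
    rng = rng or random.Random()
    noisy_prompt = [tok for tok in prompt if rng.random() >= noise_level]
    noise_tag = f"<NOISE_{round(noise_level * 10)}>"  # hypothetical control token
    return [noise_tag] + noisy_prompt + [SEP] + list(continuation)
```

At inference time the same tag would be set by hand to indicate how closely the prompt should be treated as ground truth.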

After we have researched solving these problems, I’d like to publish a full-length paper on arXiv expanding on the preprint that we have just submitted. This would also be a good point to publicly release the models and publish a blog post.

Secondary to this, @timber talon and I are currently researching AMT (audio -> MIDI conversion) and aim to get a model, dataset, and paper ready for ISMIR 2024 (deadline mid-April). The only bottleneck on the size of the dataset is the number of audio piano recordings that we can obtain. If anyone has any idea of how to acquire solo piano recordings in bulk, please let me know! We have good methods for pruning out non-solo piano recordings, so the data source doesn't need to be 100% clean. I had one idea for a crowdsourcing project where people contribute YouTube links of solo-piano music. We can use YouTube’s API and the AMT model we are working on to convert these to MIDI files.

We'd love to hear any feedback. Although the pretrained models are not publicly released yet, if anyone would like access - let me know and I'll dm you a (checkpoint) download link. Also, if anyone is interested in getting more involved (on a co-author level), dm me!

karmic skiff
#

@hearty flicker still need feedback from classical music experts? i know someone who might be interested

hearty flicker
#

Still working on a way to frame it exactly

hearty flicker
#

Really cool site, I didn't know about this

#

Might be good to add to the pretraining dataset

#

I can finetune the model on the MIDIs from this website if you like

#

Or if you have compute resources yourself I can walk you through how to do it

timber talon
#

@hearty flicker and I have talked about this a bunch but i think IMO the key here is that "alignment" in music prompting is to recover from unintentionally messy/mistake-filled inputs. It's not for recovering from atonal or otherwise intentionally stylistic-nonnormative inputs.

Kinda a tough needle to thread because, as compared with the language domain, music doesn't have the same notion of "toxicity" or "not in accordance to human values". And we don't want to give the impression that we are being Western-biased

#

@hearty flicker a question about the transcription — for the AMT paper, is the idea to stick with piano transcription, or to extend it to other instruments? That would potentially really widen what kinds of music we have access to — one of the things in collecting the dataset for the last paper was that piano music, especially in non-Western styles, was definitely more limited than other kinds of instrumental music

hearty flicker
#

It's a pretty interesting topic to research

#

There is some research on multi-track transcription, there was a paper from google in 2023 or 2022 if I remember, I'll dig it up

#

Solid multi-track transcription is the dream. I'd settle for SOTA piano-only (I think we can do this too)

timber talon
#

does multitrack apply to piano solo as well, since there are multiple voices?

#

I feel like if we take the kinda a Whisper-for-AMT approach we were talking about —
like, start with MIDI and generate tons and tons of augmented audio data using fluidsynth, and then try to learn the reverse audio -> MIDI mapping,
can't we also just take piano MIDI and also augment it into many different instruments?

hearty flicker
#

I wonder how well this would work, definitely worth looking into

#

We probably have to find a good way to automatically render the MIDIs in a realistic sounding way

#

Most DAWs don't have a cli or api to do this quickly, so we'd probably be stuck using something like fluidsynth again
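
As a rough sketch of the fluidsynth route: the command-line tool can render a MIDI file to audio non-interactively, so augmentation amounts to looping over soundfonts. The helper below only builds the command, and all paths are placeholders. It assumes the standard `fluidsynth` CLI options `-n`, `-i`, `-F` and `-r`.

```python
def fluidsynth_command(midi_path, soundfont_path, wav_path, sample_rate=44100):
    """Build a fluidsynth invocation that renders one MIDI file to a WAV file."""
    return [
        "fluidsynth",
        "-n",                      # do not open a MIDI input driver
        "-i",                      # do not start the interactive shell
        "-F", str(wav_path),       # fast-render to a file instead of the audio device
        "-r", str(sample_rate),    # output sample rate in Hz
        str(soundfont_path),
        str(midi_path),
    ]
```

The result can be passed to `subprocess.run(cmd, check=True)`. Rendering the same MIDI with several different soundfonts gives multiple audio versions per ground-truth file, though how realistic they sound depends entirely on the soundfonts.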

#

If anyone has any references for aligning / finetuning programming llm models, send them my way!

#

I think it's the closest thing to what we are trying to do here

timber talon
hearty flicker
#

Yeah I think that would be the best bet, but I'm not sure how well it would work

#

We'll give it a try, I've been meaning to read that paper I linked anyway - I wonder how they did it

hearty flicker
#

Thought these were quite nice

rustic dirge
#

@hearty flicker would you like to try RWKV for this 🙂 i found it generates better midi than transformers

hearty flicker
#

So I'm not actively looking into architectural changes in that direction, if that makes sense

#

Right now I'm mostly concerned with alignment as well as data related stuff

#

It would be pretty easy to try out training RWKV code wise, it just would take a lot of compute haha. The largest model we (pre-)trained last month was roughly 2500 GPU hours, so not that cheap unfortunately

rustic dirge
hearty flicker
#

Do you normally train models on SAI's research cluster? I have the data on there already

rustic dirge
#

i am on another cluster. can use "croc" to send it to me

rustic dirge
hearty flicker
#

You should link to the repo so we can get stars!

#

Very cool work btw

quasi steppe
#

we could actually add some comparison between transformer and RWKV to the paper

hearty flicker
#

Only thing I can't share publicly is data

#

I'm sure RWKV will do amazing on the full dataset, only issue is that I can't take it off SAI for obvious reasons

#

Any architecture good enough for language will be good enough for MIDI

#

Btw Blink trained this with aria's midi tokenizer / tool chain : )

quasi steppe
#

I assume you don't mean we legally can't take off from SAI? If we really want to open-source data it's possible (but need to notify SAI of course). Just so that you know

hearty flicker
#

We can open source most, but not all

#

Basically everything that me and Alex have worked on, we can open source

hearty flicker
#

However for the transcriptions (the stuff Alex and I worked on), there is precedent for redistributing

quasi steppe
#

Yeah basically just want to say it's not our NDA that stops it. There can be other constraints of course

hearty flicker
#

No it's not the NDA

hearty flicker
#

We are aiming to release the aria models in April, I'm currently working on three separate ideas around alignment/scaling/editing that will be done by then I hope

#

@timber talon, @quasi steppe and I are also planning to submit 3 papers to ISMIR in April

#

I reckon overall this project will have a decent impact on the gen-music scene

rustic dirge
hearty flicker
#

Oh damn!

#

Well it sounds great : )

wind viper
#

Hi all, I'm very excited to have found this project and will be following its development. I work in the music industry and have been curating MIDI data for many years. I'm interested in learning more about the topics discussed here and, if helpful, sharing some ideas.

Is Aria aimed to be an all-genre foundation model or solely focused on classical music?

sand nymph
wind viper
#

Good to know, although classical is what I collect the least of, with so many large datasets already available.

I'm still scouring through the comments here. In regards to ideas for rendering MIDI, you may want to explore Spotify's pedalboard for Python, which allows rendering with VST3 plugins.

hearty flicker
#

If that is something you are interested in (since you have your own data), I can guide you through it

#

Generally speaking it's quite easy to do this with the toolchain we have built

rustic dirge
hearty flicker
#

Scaling and alignment are the key!

#

It's what I'm currently working on myself

narrow sorrel
#

@neon hamlet

wind viper
hearty flicker
wind viper
hearty flicker
#

Which GPU are you using?

wind viper
#

I'll test 2048 to see what happens. I'm running on colab. Orange run is 100 classical piano samples and green is 1000. Batch size is 2.

hearty flicker
#

If it doesn't trigger an assert you are probably good

#

although you should be able to train using the full context length on colab : )

#

Doesn't a t4 have 16gb of VRAM?

hearty flicker
hearty flicker
#

This is one of the weirdest training bugs I've ever had to debug

#

I'm 90% sure it's something to do with the optimizer... Training in fp32 makes it a lot better, but still really confusing

wind viper
#

I'm fine-tuning on pop piano music. This might take a while if the model has never seen pop melodies.

#

Is there any type of minimum length or note density for the midi data? Should it be accepting files with short durations like 8-16 bars?

hearty flicker
#

During pretraining there was some pruning based off of those things

#

You can find them in config/config.json

#

The model does support any valid MIDI file though

#

The model has seen about ~50k multitrack pop during pretraining

#

Make sure to compare your finetuning results to the original model, would be interesting : )

obtuse lagoon
# hearty flicker I'm 90% sure it's something to do with the optimizer... Training in fp32 makes i...

I don’t know if you’ve seen this already but this might be a useful discussion : https://stackoverflow.com/questions/58633177/why-theres-a-big-jump-up-of-the-loss-curve-during-the-training

hearty flicker
#

I haven't actually seen such a jump during my experiments

#

The paper will hopefully be on arxiv soon

wind viper
#

I don't understand the tokenization process that well, but I'm curious to know how much MIDI information is being compressed. Is 4096 tokens intended to cover an average song length? Also, is it possible to know how many MIDI files the model was trained on?

hearty flicker
#

If you want a better idea about the tokenisation process, I'd recommend just tokenising some MIDI files and printing it out. You can find some in tests/test_data

#

The main idea is that each note is represented by three tokens: (instrument, pitch, velocity), (onset in ms relative to the last <T> token), (duration in ms)

#

The reason for using <T> is to keep the total vocabulary size under control.
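
A toy version of that scheme, to show why `<T>` keeps the vocabulary small: onsets are stored relative to the most recent `<T>`, so they never exceed the chunk length. This is a sketch and not the real tokenizer. The 5000 ms chunk length, the tuple layout and the token spellings are all assumptions.

```python
def tokenize_notes(notes, chunk_ms=5000):
    """Turn (instrument, pitch, velocity, onset_ms, duration_ms) notes into tokens.

    Each note becomes three tokens; a "<T>" token is emitted every time the
    timeline crosses into the next chunk, and onsets restart from zero there.
    """
    tokens = []
    chunk_start = 0
    for instrument, pitch, velocity, onset_ms, duration_ms in sorted(
        notes, key=lambda n: n[3]
    ):
        while onset_ms >= chunk_start + chunk_ms:
            tokens.append("<T>")
            chunk_start += chunk_ms
        tokens.append((instrument, pitch, velocity))
        tokens.append(("onset", onset_ms - chunk_start))
        tokens.append(("dur", duration_ms))
    return tokens
```

Without `<T>`, the onset vocabulary would need one entry for every possible absolute time in a piece rather than only for times inside a single chunk.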

#

It is fully customisable using the config/config.json file

wind viper
#

This was helpful, thanks. I'm testing some runs to see if I can get a tiny model to perform. These are my settings:

{
  "d_model": 384,
  "n_heads": 8,
  "n_layers": 16,
  "ff_mult": 4,
  "drop_p": 0.0,
  "max_seq_len": 2048,
  "grad_checkpoint": true
}
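
For a rough sense of scale, the transformer-block parameter count of a config like this can be estimated directly from its fields. The sketch below assumes a standard block with four attention projections and a two-matrix (GELU-style) MLP, and it ignores embeddings, norms and biases, since the vocabulary size isn't given here.

```python
import json

config = json.loads("""
{
  "d_model": 384,
  "n_heads": 8,
  "n_layers": 16,
  "ff_mult": 4,
  "drop_p": 0.0,
  "max_seq_len": 2048,
  "grad_checkpoint": true
}
""")


def approx_block_params(cfg):
    """Approximate parameter count of the transformer blocks only."""
    d = cfg["d_model"]
    attn = 4 * d * d                    # Q, K, V and output projections
    mlp = 2 * d * (cfg["ff_mult"] * d)  # up and down projections
    return cfg["n_layers"] * (attn + mlp)


head_dim = config["d_model"] // config["n_heads"]  # per-head dimension
```

Under these assumptions the blocks come to about 28M parameters with a head dimension of 48.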