#Neuro-Symbolic Music Models
oh you pass tokenizer in through pickling? I thought it cannot be pickled. I tried that when writing my own script and got an error
Not sure if there is a good way to share python objects between processes, I'm not super familiar with python mp in general I just threw this together to speed things up
Actually it shouldnt be pickling the tokenizer since I'm using partial
we could just use a bunch of processes reading from a queue, and each one initializes tokenizer separately
I'll look into it later...
or like I said just have a script to build the dataset in memory directly, then save the 100x epochs to a giant file
Could also index the file, and read from it using mmap in each process separately
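A minimal sketch of the index-plus-mmap idea, assuming the dataset is a jsonl file with one JSON object per line. The helper names (`build_line_index`, `read_entry`) and the demo file are made up for illustration and are not part of aria.

```python
import json
import mmap
import os
import tempfile


def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_entry(mm, offsets, i):
    """Decode the i-th JSON line from a memory-mapped file."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(mm)
    return json.loads(mm[start:end])


# Tiny demo file standing in for a real jsonl dataset (hypothetical content).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w") as f:
    for i in range(5):
        f.write(json.dumps({"id": i, "notes": list(range(i))}) + "\n")

offsets = build_line_index(demo_path)
with open(demo_path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    third = read_entry(mm, offsets, 3)
    mm.close()
```

Each process would open its own mmap and share only the offset list, so nothing large has to be pickled and random access does not require loading the whole file.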
How would that speed things up?
I mean we do the whole I/O, tokenization, transform, concat and chunking, encoding in the same process that draws from a queue or just do round-robin. Save the token ids to a file and next time we don't need to do anything
if the bottleneck is that we spin up too many processes and pickle the tokenizer too many times, this helps too
if the problem is still the json.decode this should be optimal anyway if we use a queue to let processes grab whenever they can.
Oh so, tokenized datasets are built to a file
So they only need to be built once and then they can be accessed whenever
And accessing them is fast
you only save the tokenized dataset, not the token ids right?
tokenized sequences can do data augmentations which is nice. But I say we do augmentation when building it, and save a static file full of integers
or maybe I misunderstood
Could do, although I don't think that there is any CPU bottlenecks during training
Fetching, loading and doing data aug takes about 5ms per entry
So one cpu core can do like 100-200/s
And there is no pickle stuff in the mp, so with 8 or 16 cores it's never gonna be an issue
I'm not sure how this scales to multiple nodes though
oh yeah the dataloader itself is not multiprocessing for now... As long as this problem doesn't happen for training it's probably alright
ok it's faster with fixed workers that receive items from a queue. Tokenizer is only pickled once so maybe that's it.
But it's not as fast as I imagine though.
Doing fixed workers is about 1.5-2x faster
Actually for giant_midi, before 4000 samples the speed fluctuates a lot but afterwards it was so fast (like 3x). Happens for both methods. If you just do I/O and read the whole file to memory, the whole thing is like 5 seconds in a single process.... So I bet there is still some massive overhead depending on sequence lengths.
I'll look at your commits I'm interested about how you implemented it
I was trying to avoid loading the entire mididataset jsonl into memory at once
I was worried it would be too big to fit in memory for the larger datasets
no I mean if you do one pass with the data from file to mem, the lower bound of latency should be about 5 sec. If it takes half an hour with varying speed depending on sample length, there is def something inefficient still going on.
not that I want to load everything actually to memory.... Just trying to rule out the possibility that the file system in the cluster node is doing some weird things
The jsonl file is also getting fat. The bitmidi is already 6G. If we do 50 epochs it's gonna be 300G but there will be a storage limit in SAI nodes. I definitely can't save 300G with my current quota
Yeah that's kinda why I have the data aug implemented dynamically
The dataset files get huge
Ha
yeah makes sense
I also throw in a jsonl.zst reader/writer. When I was working with pilev2 jsonl.zst was the go-to format that really brings storage down
I think the issue for going from MidiDataset -> tokenized dataset is the pickle
It makes sense that the profiler is saying acquiring the lock is the biggest time sink maybe
Because it's trying to acquire the lock on the mididict string / dict
Maybe idk, the pickle happens inside that perhaps?
Someone skilled in python mp can probably tell what is going on
The way that the MidiDataset and TokenizedDatasets are designed is very subpar in general
I built it pretty fast
I wish I had access to my server so I could run some tests
if there is a way to do mmap with cheap random access for zstd stream reader, we could stick to zstd all the time without worrying about storage. But not sure if it's possible
was trying that and failed
Ima have a look at this properly this weekend
Just to confirm 100%, you are concerned with the speed of 'aria tokenized-dataset' (aka the build method) and not the tokenized dataset class once it is built, right?
This is great btw, I never got around to this
haven't tried your training script yet, but from the code it looks like it should work great
so yeah it's actually not that important but just trying to get familiar with your setting
If pickling is the problem, another alternative would be to chunk the file into n parts and then each process can convert only its chunk
That way, there would be no overhead with starting up and killing each process
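One way the "chunk the file into n parts" idea could look, as a sketch: split the file into byte ranges that end on line boundaries, then hand one range to each process. The function names and the demo file are illustrative assumptions.

```python
import os
import tempfile


def split_on_lines(path, n_parts):
    """Split a file into up to n_parts byte ranges that end on newlines."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for k in range(1, n_parts):
            f.seek(size * k // n_parts)
            f.readline()  # skip ahead to the start of the next line
            pos = min(f.tell(), size)
            if pos > bounds[-1]:
                bounds.append(pos)
    if bounds[-1] < size:
        bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def read_range(path, start, end):
    """Return the complete lines stored in bytes [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()


# Demo: ten 10-byte lines split into three ranges.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "wb") as f:
    for i in range(10):
        f.write(b'{"id": %d}\n' % i)

ranges = split_on_lines(demo_path, 3)
parts = [read_range(demo_path, s, e) for s, e in ranges]
```

Each worker only needs its `(start, end)` pair, so the per-task payload is two integers rather than the data or the tokenizer.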
pickling should be fine now. They will only be pickled once in my implementation.
was just thinking if there is anything else
@hearty flicker got a lot of aria.data.datasets: [ERROR] Failed to tokenize midi_dict: note_msgs is empty after ignoring instruments when trying tokenized-dataset for bitmidi.
oooooooh never mind... The config.json was changed for bitmidi but reverted back when pulling the updates.
But this made the workflow a little inconvenient.
oh doesn't help, still got that
It's not an error
gotcha
Traceback (most recent call last):
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/admin/home-honglu/miniconda/envs/aria/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/fsx/home-honglu/aria/aria/run.py", line 178, in <module>
    main()
  File "/fsx/home-honglu/aria/aria/run.py", line 170, in main
    build_tokenized_dataset(args=_parse_tokenized_dataset_args())
  File "/fsx/home-honglu/aria/aria/run.py", line 137, in build_tokenized_dataset
    dataset = TokenizedDataset.build(
  File "/fsx/home-honglu/aria/aria/data/datasets.py", line 641, in build
    buffer += entry
TypeError: 'NoneType' object is not iterable
Also got this. Not sure if it's from my changes or not. Will try some debugging
It happens when you have a midi dict which only has instruments that should be removed according to the config
Uhhg this error again
I thought maybe it's happening because entry is None so I put in a check for that
But I guess not
I'll debug this tmo too
Properly
I run it on my branch that has a lot of changes on _get_tokenized_seqs_mp. Could be my problem too. Will let you know
If you get that with aria/main let me know
The error message was confusing me a bit tbh
I didn't see how the type error could apply to that line
buffer += entry
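For reference, this is how that exact message can come out of that line, assuming `buffer` is a plain list and `entry` ends up as `None`. Whether that is the real cause in this case is not established here; the guard at the end is just one possible workaround.

```python
buffer = []
entry = None  # e.g. what a failed tokenization might hand back

try:
    buffer += entry  # for a list, += behaves like buffer.extend(entry)
except TypeError as e:
    message = str(e)  # 'NoneType' object is not iterable

# One possible guard: skip missing entries before extending.
for entry in (["a", "b"], None, ["c"]):
    if entry is not None:
        buffer += entry
```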
oh another problem, now it seems that sequences are concatenated and chunked in a fixed way (we save them to tokenized dataset), and then augmented.
But ideally the data needs to be shuffled first for every epoch, and then concatenated and chunked. There will be a difference when running multiple epochs
like
abcde and 123456789 becomes abcde12, 3456789 if sequence length is 7. But with multiple epochs it's gonna be always these two with augmentations.
ideally I might want 1234567 and 89abcde or even stuff that starts in the middle like cde1234, ...
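The example above as a small sketch: concatenating in a fixed order always yields the same chunks, while reshuffling the order per epoch changes where the chunk boundaries fall. Function names and the seeding scheme are illustrative only.

```python
import random


def concat_and_chunk(seqs, seq_len):
    """Concatenate sequences and cut the stream into fixed-size chunks."""
    stream = [tok for seq in seqs for tok in seq]
    # Drop the incomplete tail so every chunk has exactly seq_len tokens.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def epoch_chunks(seqs, seq_len, epoch, seed=0):
    """Reshuffle the sequence order for each epoch before chunking."""
    order = list(seqs)
    random.Random(seed + epoch).shuffle(order)
    return concat_and_chunk(order, seq_len)


a, b = list("abcde"), list("123456789")
fixed = ["".join(c) for c in concat_and_chunk([a, b], 7)]
swapped = ["".join(c) for c in concat_and_chunk([b, a], 7)]
```

Here `fixed` is the abcde12 / 3456789 split and `swapped` is the 1234567 / 89abcde one. Chunks that start mid-sequence (like cde1234) would additionally need a random starting offset, which this sketch leaves out.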
Ok added to the list, will be good to get this all squashed now ha
That might be a hard issue to fix. I wonder how it's done in NLP
I can't think of an obvious way to make it so that the sequences are concatenated differently for each epoch
There must be a way that it's done in NLP
Any thoughts on this @sand nymph ?
concatenate and build the samples on-the-fly. Or if it's enough to fit the memory, duplicate and shuffle and build one static dataset into files.
we are already close. Basically we just need to skip the tokenized-dataset step, and do this in data loader
Not sure how to integrate that with multiple workers / processes though
mmap the original jsonl file, and let each worker generate random indices, do all the manipulation until buffer is full, encode everything and yield the result
basically start with one single process DataLoader that does all of these lazily. To scale to multi-gpus have another DataLoader wrapped around a few processes each targeting the original dataloader (or maybe torch dataloader has some built-in stuff for this). For multi-node I don't think the code needs to change because we sample at random, and global_rank doesn't really matter.
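A bare-bones sketch of the lazy, single-process version described above: draw random sequences, fill a buffer, and yield a chunk whenever it is full. Loading, tokenization and augmentation would happen where a sequence is drawn; here the sequences are already lists of tokens, and the names are hypothetical.

```python
import random
from itertools import islice


def stream_chunks(seqs, seq_len, seed=0):
    """Endlessly yield seq_len-token chunks built from randomly drawn sequences."""
    rng = random.Random(seed)
    buffer = []
    while True:
        # Keep drawing random sequences until a full chunk is available.
        while len(buffer) < seq_len:
            buffer.extend(seqs[rng.randrange(len(seqs))])
        yield buffer[:seq_len]
        buffer = buffer[seq_len:]


data = [list("abcde"), list("123456789")]
chunks = list(islice(stream_chunks(data, 7), 5))
```

With several workers, giving each one a different `seed` would be one way to keep them from producing identical streams.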
Ok that sort of makes sense I'll look into it tmo
yeah no hurries. I will have some time Sunday and I could help. I have done these before actually. But now there are probably even better api from pytorch
Man.... Can't believe training a 400M model with 8192 length on a single node can only do batch_size=1... I'm sure pp will help but I thought we only need that when > 1B...
and it's 13sec per batch lol... total batch size 128 (grad acc 16 steps)
Running profiling tools now.... Didn't do the math but I'm sure the flops is incredibly bad
The reason why might be that you're not using gradient checkpointing
If the issue is the vram, otherwise there must be a different issue
You can use the profiler to make sure that flash attention is working?
Also are you training in bf16?
Tried both bf16 and fp16
Yeah. There is a bug with gradient checkpointing right now. I will fix that later.
But when I used to do pretraining we just went without gradient checkpointing at all and still fit a couple of 3-7B models
Already did. Also tried to change the code to explicitly use flash_attn. No difference in both vRAM usage and speed so flash attention is working well.
Used a profiling tool to dump a json for perfetto visualization. It seems it went alright. Didn't see a lot of bubbles other than low level stuff between ops inside kernels. I don't have an immediate idea about what to improve
For vRAM I'm sure it can improve with pp. Need to change code a little bit I think. I'm still really confused by flops
@hearty flicker DDP seems to have trouble with gradient checkpointing
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Guess that's probably why people didn't recommend gradient checkpointing when I worked on pretraining.
FSDP and cpu offload seems to make it fit and automatically scale to multinodes. Looks slow but didn't do the math carefully. I thought I need to manually write the wrappers but it's nice that accelerate takes care of them already.
I should be able to start a bigger run tomorrow on the bitmidi data we have.
@quasi steppe it looks like you can just pre-compute this? Build the two epochs separately and concatenate them. It's only tricky if you try to do it on the fly I think
The dataset? Yeah right now I do exactly this. I precompute everything for a few epochs and save to disks, then load sequentially to train. Huggingface datasets library is doing a great job at this scale.
@hearty flicker the xlarge sized model (I made a mistake, it's 700M rather than 400M) trained on the full bitmidi at around 8 epochs
the chord is fairly different from the start already.
Try it on some multi-track stuff
This is just trained on bitmidi right?
Not including the classical data?
Is this the way it's typically done? The issue is that I normally train for like 50 epochs...
The file size might get too big
My immediate goal is to get the audio downloading and transcription working, then I'll try and sort out the dataloading and training stuff
It would be nice @quasi steppe if eventually we could unify all of the code that we are using into aria/data/datasets.py and aria/train.py
Don't worry about it for now though, haha
What did the loss curve look like for giantmidi-700m?
log-loss. Very beautiful. But the model sucks lol
It might be undertrained
I'm listening to the generations and it really sucks. There are some good ones but failure cases are too many
very likely
It sounds like it from the sample you sent
btw if you test it, you should test it on pop music not classical
There is very little classical piano stuff in bitmidi
but the loss is already below 2 (where the "overfit" classical one we got). The loss value doesn't mean too much but I would have expected classical is more repetitive and structured than pop and should have a lower loss given same training conditions
already did. Don't want to spam the channel haha
not looking good
Weird haha
I think it's useful to have val loss just so we have an idea of how over/underfit the model is
My experience is that the best checkpoints occur when val loss stops decreasing
I'm quite sure that the vram blows up if I evaluate. I remember huggingface did a terrible job on mem management when you have eval going on during training
or maybe they have fixed it by now. I will try
Hmmm, ive never used hf trainer to be fair
Seems to work fine with my train script, idk
hf trainer actually doesn't have fancy codes. I tried to glance through it in the weekend. Actually similar to yours but the only advantage is that it works with HF datasets out-of-box and supports a lot of fancy wrappers like FSDP or deepspeed pp, etc.
(well I mean it's not hard to adapt codes for those stuff... just that it's a little faster to try out things quickly haha)
Yeah true, I'm going to adjust the dataloader stuff and train script this week
So that it does what we need
There must be some sort of issue going on, musenet was also trained on bitmidi
And the samples from that sound good
As a sanity check you could try to recreate the checkpoint we've been using on the notebook
50 epochs at medium
On giant midi + kunderfug + mutopia + maestro
On the bucket
I'm gonna do some basic data stats. There are a few possibilities:
- 10 epoch is not enough (but it kinda worked for classical so idk...)
- I changed the optimizer from Adam to Adafactor for mem reduction but maybe Adam works better
- model architecture? 96 layers is really really deep and honestly it's my first time training that deep of a network. I could add some monitoring of each layer and I suspect it's really too deep.
- data quality issue. I did random sampling and many are short (but not short enough to be filtered).
There is a min length filter in the config that you can use btw
When building MidiDataset
yeah, I noticed that. Didn't tweak that when building the dataset
this training job wasn't serious anyway. Just left it on yesterday and I can't believe it's not preempted for a whole day haha
It could also be that the concatenation stuff we are doing isn't helping
Maybe try to recreate the old checkpoint with the new training methods
it helped for classical I think
And we can directly compare the quality
the 100x large was good IMO
Try with medium because it's hard to know exactly what is affecting what
Might be just using a larger model or more epochs
It's only 24 4090 hours so shouldn't be hard
oh by the way, this xlarge run used 8192 as max length....
I suspect that 8192 is too long and we got too few training steps so I actually think being undertrained is very likely
If you test the medium cp, test with 2k
yep
There are so many factors here, it's probs best to be systematic
Hard to know what is doing what haha
yeah absolutely
The sample length distribution of bitmidi. Super long tail and I need to zoom in a bit
actually most samples are very long. If we filter out anything below 2048 tokens, we are only removing about 3% of the tokens.......
And we got (only) a total of 700M tokens here.
multi-node training works. Still crappy flops though...
I will start with trying medium 2048 bitmidi later this week. DDP should be more than enough for that
Very nice stuff
question —
I got the wav -> midi transcription inference working, and it's outputting a set of files per wav file:
do you know, offhand, if these are canonical enough to work with, or should i get the rest of the inference scripts running as well? (these 10 files are all for the same song)
because of how bad the code is, the bar goes exponentially higher for getting the other processing scripts working — would probably need a full rewrite. And most of what the other processing scripts are doing is calculating F1 score and such, so just wanted to check
oh wait – reading section 4.3, now — definitely want to ignore all the *1st* files in those lists, those are just for the first-level loss calculation. I wonder whether the 2nd.json is enough? If the pitches look weird, there is a conversion function in another file: https://github.com/sony/hFT-Transformer/blob/master/evaluation/m_transcription.py#L98-L106 which converts the json to .txt for mir_eval and does a functional transformation on the pitch specifically
@hearty flicker Hi, I have been trying to start a pretraining test using a small dataset created using some random midi files. I have encountered a crash at line 801 of file aria/tokenizer/tokenizer.py.
It crashes on this line, which raises an IndexError if stack is empty:
stack[-1]["dur"] = tok
For the moment I solved in this way, but I am not sure it is the proper solution:
if len(stack) > 0:
    stack[-1]["dur"] = tok
Maybe it is just a problem with my midi files having content which is not supported? Could you share the midi files you used to train the model?
Hey! I'll look at this, in a few hours
Their code is so bad it's shocking
So the output is just a bunch of eval info, not a midi file
Very annoying
I could probably write a MIDI converter for this
Are those files really just for one song?
In the paper there is an algorithm for converting from the 2nd files to notes
I wonder if that is implemented in the repo
If this is too hard, we can just use Kong et al for now
@timber talon did you figure out what the 'mpe' acronym means?
This file seems to hold the code for converting to MIDI
If we can find the right config file, this might be all we need
I think you just have to run the AMT methods sequentially to get a MIDI out
great... yeah, I'll take a look. There was a config file that was kinda the root of the problem yesterday, but I solved that
omg haha i totally missed this one little method all the way at the bottom of the file: https://github.com/sony/hFT-Transformer/blob/master/model/amt.py#L347-L355
that's not used anywhere else. ok, sweet, this is totally doable. sorry, i'm just not entirely fluent in midi, yet, so wasn't sure if the offsets/mpe files were something someone more familiar than me would look at and be like "oh, obviously, we can turn this into midi "
ok — not bad!
i tried doing some transcriptions for songs that were not in the training data (Paderewski isn't in MAESTRO)
Here's my fork of the repo along with the command to run the transcription on a directory:
transcription is not super fast — 10-20 seconds per piece, on average. But fits on a 12GB GPU. I have access to loads of those, and can easily parallelize this processing
This is great news Alex, amazing work
Yeah we can parallelize this heavily.
I'm still working on the downloading pipeline
should be done by friday
This sounds great
Here is what the midi sounds like once I converted it back into audio using fluidsynth
This is so promising lol, has gotten me excited !!
I love this piece by the way, I've never heard of this composer
@quasi steppe Over the next few days I'm going to implement a different version of TokenizedDataset that supports the functionality that you want
Aka all the epochs are concatenated into one big file.
During fine-tuning it might be best to have the original implementation with padding, so I'm going to keep that functionality in a different class
I'll also make the changes to the train script so we can work from there directly
Currently working on the spotify_dl stuff so alex and I can get our pipeline working
I spent a lot of time yesterday trying to test out some OMR systems — it would be another source of MIDI that is definitely open-domain, and would open up all of IMSLP. Was hard to find anything that looked credible
I tried this and despite it being under pretty constant/recent pushes, it was still not working: https://github.com/BreezeWhite/oemer
End-to-end Optical Music Recognition (OMR) system. Transcribe phone-taken music sheet image into MusicXML, which can be edited and converted to MIDI.
@hearty flicker do you know of any OMR libraries? Or is OMR just not really a thing?
Not something I've ever looked into
There might be some deep learning research on it though
the download stuff is nearly working btw
I've just forked spotify-dl and added the extra functions we need
I was disappointed when I tried to use the Spotify_dl in some trial runs. Especially for more diverse composers — women piano composers in the classical era, for instance— I was getting very few hits
That’s what motivated searching for some good OMR tools
I’ll keep looking around, but if you hear of anything (maybe your advisor knows?) that would be really cool
I signed up for IMSLP, btw, so that’s another potential source for public domain audio and midi
Yeah, I mean spotify_dl can only find stuff it matches on youtube
It really depends how much of it is on youtube
Any other way of sourcing solo piano recordings also works
We can use a combination of different methods. As the (audio -> midi) transcription is so good, any source of solo piano audio would be great
@timber talon You can use my fork of spotify-dl
It automatically prunes out non solo piano recordings by ensuring that the number of artists on the spotify metadata is <= 2
It also skips downloading duplicate files (only album/playlist wide at the moment, will improve this)
And it takes a text file of links with the --file arg
Actually it should skip downloading dupes for every album in the text file
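The pruning and de-duplication rules described above could look roughly like this. The track dict layout (`"id"` and `"artists"` keys) is an assumption for illustration, and this is not the fork's actual code.

```python
def select_tracks(tracks, max_artists=2):
    """Keep tracks with at most max_artists artists, skipping repeated IDs."""
    seen = set()
    kept = []
    for track in tracks:
        if len(track["artists"]) > max_artists:
            continue  # likely an ensemble or orchestral recording
        if track["id"] in seen:
            continue  # already queued from an earlier album or playlist
        seen.add(track["id"])
        kept.append(track)
    return kept


# Hypothetical metadata for four tracks gathered from a list of links.
tracks = [
    {"id": "t1", "artists": ["A"]},
    {"id": "t2", "artists": ["A", "B", "C"]},
    {"id": "t1", "artists": ["A"]},
    {"id": "t3", "artists": ["A", "B"]},
]
kept_ids = [t["id"] for t in select_tracks(tracks)]
```

Keeping `seen` outside any per-album loop is what makes the duplicate check span every link in the text file rather than a single album.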
some initial results using Audiveris to OMR on some sheet music.
It's not great, but if we're doing any data augmentation/ noising, it will fit in on that level
How do the mxl files work?
Oh you load them into musenote to listen. They’re primarily for musical notation software. But they’re also convertible to MIDI, I can do that quickly
Got a lot of idle machines in the cluster so I sent a training job this morning. 64 A100, large.json (700M?), 100 epochs of bitmidi. I have gone back to 2048 context length. We can extend the length later easily.
That bump is interesting though. It does happen all the time in LLM pretraining
That's cool
I've been bogged down with a bunch of irl stuff, should have more time this next week however I am moving apartment so we will see
The dataset stuff with @timber talon is looking very promising. Going to try to retrain the hft model on monday
25 epoch checkpoint is actually pretty good!!!
I think directly pretraining on 8192 is somehow hurting the quality
I'm more and more convinced an LM should pretrain on 2048 or less, and then extend procedurally
the prompt is 200 token (first 2-3 bars) from some random online stuff. No CFG, an honest 1.0 temp for the current one.
This is pretty good
this is how it handles classical
bitmidi is pop heavy right? How much classical could it have?
It might have some
hard for me to know
It's a webscrape of like 200k midi files
The context length thing is interesting
This sounds way way better than last time
yeah totally agree
Could it be a problem with the freq used for rotary embs?
I think it's just that there are fewer gradient steps. 4x length means 25% gradient steps
I should probably have increased lr by 4x but I doubt how much it helps. Each step is an attempt of searching the loss landscape, fewer amount definitely covers less ground
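The arithmetic behind "4x length means 25% gradient steps", as a quick sketch. The 700M token count is the bitmidi figure quoted earlier; using the same batch size of 128 sequences for both context lengths is an assumption.

```python
def steps_per_epoch(total_tokens, batch_size, seq_len):
    """Optimizer steps needed to see total_tokens once."""
    return total_tokens // (batch_size * seq_len)


TOKENS = 700_000_000  # rough bitmidi token count quoted earlier
BATCH = 128           # sequences per optimizer step (assumed equal in both runs)

short = steps_per_epoch(TOKENS, BATCH, 2048)
long = steps_per_epoch(TOKENS, BATCH, 8192)
ratio = long / short
```

Under these assumptions an epoch is roughly 2670 steps at 2048 tokens versus 667 at 8192, so the longer context gets about a quarter of the updates for the same data.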
Will be cool to see if the new tokenizer improves the results
My suspicion is that it will make timing related stuff better
chopin might still be kinda popular on the internet. Trying some less heard stuff.
This second one shows a smart strategy of just repeating the prompt lol
something different. So it knows to improvise at temperature 1.0. Better decoding param should give much better results (esp I haven't applied CFG and we haven't implemented beam search)
This is quite interesting. The job stopped and I had to use an earlier checkpoint to resume. Dataloader should also be resumed and deterministic, esp since most loss values are having the same up and down. But there was a loss spike all of a sudden
That might be because the Adam params reset
no, optimizer states are recorded
Weird
everything is saved well with huggingface's Trainer. They did a good job on this
I suspect there is some minor floating-point issue in distributed training that is not deterministic. We didn't see these on single node training job
I used to think those spikes in LLM pretraining are due to data outliers. Now at least this is ruled out
Nice. The 8-node job is still running. 36B tokens trained. We have more tokens than parameters by the way. If it ever overfits it shouldn't be that bad.
Was this with 2048 too?
yeah it was the continuation of my previous run until 100x epochs
gonna be interesting
I saved every 5000 steps so we can even study how the latent space changes
125000 step checkpoint... This is quite amazing.
I generated 8 samples and all of them are amazing.
I did some random sampling of bitmidi dataset and I'm fairly confident that this prompt is close to being out of distribution. Meaning this "overfit" model generalizes very well
Traceback (most recent call last):
  File "/fsx/home-honglu/aria/generate_large.py", line 135, in <module>
    sample(
  File "/fsx/home-honglu/aria/generate_large.py", line 106, in sample
    res_midi_dict = tokenizer.detokenize(tokenized_seq)
  File "/fsx/home-honglu/aria/aria/tokenizer/tokenizer.py", line 79, in detokenize
    return self.detokenize_midi_dict(tokenized_seq)
  File "/fsx/home-honglu/aria/aria/tokenizer/tokenizer.py", line 469, in detokenize_midi_dict
    _channel = instrument_to_channel["drum"]
KeyError: 'drum'
Got this error
This is amazing work
Hmmm it looks like that issue happened because there was a drum token sampled, but it's not an instrument that is supposed to be there?
I can probs put a check for that, but technically it shouldn't ever happen
that was when I tried extrapolation out of context and interpolate styles. It could just be that the generation is corrupted. But it should probably send out a warning instead of erroring out the whole script.
maybe like adding a force=True so that we turn those errors to warnings and just stop the detokenize at wrong token for each sample
Yup I can add a check
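A sketch of what the suggested `force=True` behaviour might look like: raise by default, but with `force` enabled warn and stop decoding at the first bad token. `decode_instruments` and the toy mapping are hypothetical and much simpler than the real detokenizer.

```python
import warnings


def decode_instruments(tokens, instrument_to_channel, force=False):
    """Map instrument tokens to channels, optionally tolerating bad tokens."""
    channels = []
    for tok in tokens:
        if tok not in instrument_to_channel:
            if not force:
                raise KeyError(tok)
            # With force=True, warn and keep only what decoded cleanly.
            warnings.warn(f"unexpected token {tok!r}; truncating sequence")
            break
        channels.append(instrument_to_channel[tok])
    return channels


mapping = {"piano": 0, "bass": 1}  # toy mapping with no "drum" entry
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = decode_instruments(["piano", "drum", "bass"], mapping, force=True)
```

Truncating at the first bad token keeps the valid prefix of a corrupted generation instead of losing the whole sample.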
How is the Yarn stuff going btw @quasi steppe ?
It seems to work super well for nlp, will be amazing if we can get it working well in this context
I need to get a fluidsynth soundfont setup that correctly handles multitrack
Tried non-finetuned version and didn't extrapolate well. Will try the finetuned yarn and use a small portion of data. We will see
@hearty flicker is there a convenient UI that reads the model output as a stream and renders it into sound? Would be cool to write a model-based infinite music player if such a stream reader is already available in open source
I've had exactly that idea
And it's super possible
I'll build an API for it today maybe
It would be really cool and easy to write
I tried to run my model locally and it was about 15 tok per sec. If we quantize it it can be really fast
Yup, more than enough for live playing
Def good enough for streaming tokens
I was thinking of eventually building a ggml backend too
Yeah!
Did you fix the problem with the bucket perms btw? Maybe I was doing something wrong
Hmm I could write using that json certificate
From sai cluster it was alright
I kept getting 403s
I already put in all the model checkpoints
When I'm at my laptop I'll get back to you
Yeah I will try again later
Were you using gsutil mv?
cp
Alright I'll try that
Before it needs gcloud auth .... using that json
int8 quantization kinda works. Feels like 1.5x speed-up from fp16 but haven't measured it. I will push the code later.
Made a draft PR to fool around but we don't have to merge it. Still playing with quantization on my laptop.
Is there much quality degradation?
There is.... I'm using pytorch fx now. It spits out that invalid drum token all the time
not sure what the best setup is for CPU inference
I don't think CPU will work with flash attention
I tried it a few weeks ago and it didn't work
I would guess that the pytorch attention implementation reverts to the usual attention implementation?
It works on my computer
I'm moving flat today so am quite busy, next week I should be back to my normal schedule
Oh that's weird, maybe there was a patch
I have a draft PR for fooling around. It works fine on my computer. I refactored all those device stuff. You can play with it.
I tried it briefly and scaled_dot_product_attention was causing some issues
Might be dependent on my computer though ha
I will play with it
pytorch has some other quantization options. If we use the MultiheadAttention class instead of the scaled_dot_product_attention, something seems to apply to that directly. Currently I only quantize all those dense layers (also need to sort out the degradation.... Hope it's just a parameter issue)
When the sequence gets long the speed is a loooot slower on my laptop and I fear without quantization the token stream wouldn't catch up with the player frontend, if we want to do what we talked about earlier
If we did it with ggml it would be fine
If we are aiming for arm macs
I mean you can run llama7b at 10-15 toks/s I think
So for our model there will be more than enough speed
Whats the formula for model training cost (in A100-hours) in terms of dataset size and # params?
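A common rule of thumb is that training a dense transformer costs about C ≈ 6·N·D FLOPs, with N the parameter count and D the number of training tokens. Dividing by the sustained throughput gives GPU-hours. In the sketch below, 312e12 is the A100's peak dense bf16 rate, while the 40% utilization is only an assumed placeholder. The rule also ignores attention FLOPs, which grow with context length.

```python
def a100_hours(n_params, n_tokens, mfu=0.4, peak_flops=312e12):
    """Estimate training cost with the C ~ 6 * N * D rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    # mfu is the assumed fraction of peak throughput actually achieved.
    return total_flops / (peak_flops * mfu * 3600)


# Example with figures mentioned in this thread: ~700M params, 36B tokens.
hours = a100_hours(700e6, 36e9)
```

With those inputs the estimate comes out at a few hundred A100-hours. Measured utilization can be much lower than 40%, which scales the result up proportionally.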
Releasing JAX x Equinox code and a 101M parameter model checkpoint for my homebrewed MIDI transformer TchAIkovsky ☕
Feel free to have a play around with it 🙂 It isn't SOTA but produces some fun results~
https://github.com/vvvm23/tchaikovsky
seems undertrained but it's really nice! Sounds better than our early checkpoints in terms of that repetition problem
If anyone is interested, here are some samples I compiled a few weeks ago from the old version of the model @here
Will be interesting to see how much of a difference the improved tokenizer/datasets/scale/finetuning will make
Ok SAI keeps kicking me off and I'm stuck with using the terrible train script provided by this paper
Will run the transcription training on my home server instead
there are some idle machines in SAI today. Gonna try to see if I can get yarn finetuning working.
Managed to get 100 training steps done but crashed when it tries to save optimizer states. I got a lot of troubles with my training codes and getting weird C++ errors all the time...
Gonna switch to your codes and I'm building a 8192-length dataset now
Do you think this was because of issues with my code?
Could be an accelerate issue maybe?
I haven't fully tested my training script on multiple nodes etc
nope I didn't use your training code. I was quickly trying out YaRN finetuning. Supposedly it should only take 1% of the original training data so I was trying on my training codes
it worked before but all of a sudden it breaks down.
Still working on it. I found some bugs in the yarn code. Will do a major PR for the YaRN component
but I don't know if those will resolve the c++ errors yet
Cool, I'm having a bug in the data augmentation stuff with the new tokenizer, so I'll probably not be able to merge it till monday. I'm away this weekend in cambridge unfortunately so I can't work on it
c++ errors scare me
haha
got something along the lines of "some kernel returned NULL without raising an error". Googled it and saw in pytorch forum a dev said he hasn't seen this error for years lol
Would be so amazing to get yarn finetuning working
Eeek
I was getting C++ errors when trying to get gradient checkpointing to work with torch.compile
yeah mine was from torch.compile too
Maybe we should add a flag to the train script to skip compiling
Since it seems to randomly cause weird issues
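A minimal sketch of what such a flag could look like. The flag name `--no-compile` and the helper `maybe_compile` are made up for illustration and are not taken from the actual train script; the only real API used is `torch.compile`, imported lazily so the opt-out path never touches it:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="train script (sketch)")
    # Hypothetical flag: opt out of torch.compile when it causes C++ errors.
    parser.add_argument(
        "--no-compile",
        action="store_true",
        help="skip torch.compile (slower, but avoids compile-related crashes)",
    )
    return parser.parse_args(argv)


def maybe_compile(model, no_compile):
    """Return the model unchanged if compiling is disabled, else compile it."""
    if no_compile:
        return model
    import torch  # imported here so the no-compile path has no torch dependency

    return torch.compile(model)
```

Usage would be something like `model = maybe_compile(model, args.no_compile)` right after the model is built.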
I need to remember to go over the transformer optimization document you sent also
I disabled that, the training got slower but it worked until it was trying to save the first checkpoint. When it got to optimizer states it got another c++ error along the lines of "all gather failed because different devices have different values of something" lol
yeah
didn't try. It worked earlier without yarn so it could also have to do with that
and I upgraded accelerate in my conda env and that's another possible cause
will try to debug more today
I can also have a look on monday
I should probably read the Yarn paper properly too
did the iclr rebuttals get reviewed yet?
the yarn code was a bit confusing and I realized I need to copy a different class from our yarn repo if I want to do finetuning. I'm trying to refactor a bit to clean some stuff up.
yeah for Yarn one reviewer flipped and we got 6 6 6 8. Should be able to get in
A shame, made a big impact for this project anyway haha
it's a bit unfair. One reviewer basically said, oh CFG exists in CV so you are not novel
Yeah but that's the entire point
We were like, that was the whole point lol
That's so annoying
and this guy never responded to our rebuttal.
Pretty sure the 2x 5's didn't spend more than 10min on our paper. The 6 guy raised a lot of good points. I really think if we can get qualified experts to read it, we should at least get overall 6
Pretty annoying reason to get a rejection
extremely annoying, yeah
@quasi steppe what's that secret part of your website again where you have all the generations you've been producing?
Couldn't get YaRN working. The finetuned checkpoint is gibberish....
@hearty flicker Dataset was alright. Trying to run your script.
AttributeError: 'Namespace' object has no attribute 'train_data'. Did you mean: 'train_dir'?
This is probably a typo.
Will fix and try it again tomorrow.
Yeah must be a typo
Did this error happen when building PretrainingDataset from the cli?
Pushed a fix
These are nice, are they from a model that you just trained?
It's the previous 100 epoch giant midi model
Ah ok cool
It looks like we'll be able to get y'all at least 5k A100-hours of dedicated compute Stability
That's amazing news! Thanks so much for everything: )
finetuned YaRN generating a 7000 token sample. Maybe it has a stop token earlier. It's somehow shorter than I expected.
I feel the quality went down quite a bit.
oh crap towards the end it was completely gibberish
The final loss of YaRN finetuning is on par with the pre-training final loss. It's possible that the model just gamed the perplexity (it's well known that short repeating patterns can game the loss and show an artificially low perplexity) and biased towards those easy patterns that are weird to human.
It's quite curious that the loss is on target but the results are not
Was anything similar observed when applying YaRN to nlp?
It was ok with the limited amount of downstream task experiments. Like passkey is alright, basic completions are alright. But it's hard to find a super long task and it got prohibitively slow to do something fancy
For the 2048-8192 extension I remember it was all ok
A long range sample coming out of the yarn finetune (same as last night). I sweeped and tuned the attention scale a little bit. This one seems reasonable at the longer end.
What changes did you have to make to get it working better?
the attention weight temperature. Also fixed the code base quite a bit. There used to be a few very confusing names from the original yarn repo that got me confused again in setting the params.
@narrow sorrel Want to share your demo video here so that people can see? It's just a thought. If not, don't worry.
Also, got a new YaRN finetune checkpoint using new mscale parameter, a lower LR and a longer initial warmup. Here is an 8000-token sample.
(is a bit weird towards the end....)
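For context on the mscale / attention temperature knob: as far as I recall, the YaRN paper's recommended default for LLaMA-style models is `mscale = 0.1 * ln(s) + 1`, where `s` is the context extension factor. It is applied to both queries and keys (through the rotary cos/sin), so attention logits end up scaled by `mscale ** 2`. A minimal sketch of that default only; the values actually swept for the finetunes above are not stated here, and the function name is just for illustration:

```python
import math


def yarn_mscale(scale):
    """Default YaRN attention scaling factor for a context extension factor `scale`.

    Returns 1.0 (no change) when the context is not being extended.
    """
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


# Example: extending a 4096-token model to 8192 tokens gives scale = 2.
m = yarn_mscale(8192 / 4096)
print(m, m ** 2)  # q/k multiplier, and the resulting multiplier on the logits
```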
sure! here's the script i made over the weekend to play with aria: https://github.com/EleutherAI/aria/pull/79
thanks Honglu for training the models!
This PR adds a proof of concept script for playing turn-by-turn with aria model.
To try, run the following (currently only works on mac, tested on m1 max):
git clone https://github.com/maxreciproca...
So cool, I've had this in the back of my mind for a while - will merge this tonight !
Hey guys, going to aim to get a pr in for doing supervised finetuning this week
If anyone has any experience on doing supervised finetuning with decoder only models, any references would be useful
The test example I am planning to implement is key detection
Also, expanding on @narrow sorrel's script I'd love to implement some real time I/O in the inference library - maybe next week
that is soo cool!!! what midi keyboard are you using, btw?
hey @hearty flicker I've found the huggingface docs on AutoModelForCausalLM to be super straightforward
By SFT, I assume you mean just basic SFT, or continued pretraining. This is a great script:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
In addition to SFT, there's another direction to go in, which is low-rank approximations. this PEFT library: https://github.com/huggingface/peft implements a lot of the standard LoRA approaches. These are potentially cool because you can maybe mix/match LoRAs. The Adapter library also claims to be a similar standardized repo: https://github.com/adapter-hub/adapters
i hope i'm answering your question and not just spouting stuff you already know — in general, anything FT-related is very, very standardized by now
Ok guys, the new tokenizer is implemented and the initial results are very promising - the timing problems seem to be mostly fixed
Running a proper test to get a better sense for the differences...
Here is an unprompted piece in the style of Chopin
New tokenizer seems to help with long term structure too
It's crazy to think these samples come from a model trained on the same testing dataset that I first used 5 months ago... sounds so different -- lots of progress
Those samples sound incredible. What are the key differences between this new tokenizer and before?
Before there was a wait token to wait for certain period of time. Now it encodes the absolute timing in a period of time I think. Haven't poked into it yet and @hearty flicker can tell us more
Yeah I got the idea randomly from a paper about automatic music transcription that came out last year
Will give a better update tomorrow... I'm training a model on a larger classical dataset overnight too - I'll update the notebook to use it
that's a Seaboard, also it supports MPE (i think you mentioned it sometime ago) https://youtu.be/6SCug5kUsBs
Here is an unprompted piece in the style of Mozart @here
Quite simple but it definitely sounds like music and has the
This is also quite nice, it's super good at doing stuff like this
Creatively missing something but is nailing most 'musical' aspects of this continuation (first 20secs is prompted)
It's weird because the original piece honestly makes me feel very emotional, but this makes me feel nothing...
Here is the updated notebook if anyone is interested - https://colab.research.google.com/drive/1SmwmsSf92Bv30algvZ-D4rW8dtH0kJNL?usp=sharing
@karmic skiff
@hearty flicker @quasi steppe reminder to go talk w/ Stability on their slack to get compute approval
urgh wrote a live player but cpu inference is not fast enough to keep up with playing...
Only medium model can work on my shitty laptop
Forgot to ask, how many epochs have you trained to get those?
Implemented rolling window. We can have very very long music now, at the cost of forgetting.
On my computer I can only continuously stream 400 without catching up 😂
This was from 150 epochs I think? Val loss never bottomed out during that training run
unconditioned Chopin
rolling window meaning rope?
no, I mean applying a rolling window on kv-cache
basically it means if we generate all the way to 8192 with a window of 4096, the last token is generated as if the whole context is 4096 with the first half being removed
this is a cheap way to mitigate the context window limit and usually it's kinda okay. The model just forgets, but a good model should generate something that feels smooth. Here somehow it degenerates.
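A toy sketch of the rolling-window idea, assuming one cache entry per generated token. Real kv-caches are per-layer tensors and the positional encoding has to be handled as well, which this glosses over; the class name is made up for illustration:

```python
from collections import deque


class RollingKVCache:
    """Keeps only the most recent `window` (key, value) entries."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


# Generate 8 "tokens" with a window of 4: the first half is forgotten.
cache = RollingKVCache(window=4)
for t in range(8):
    cache.append(f"k{t}", f"v{t}")
print(list(cache.keys))
```

With a window of 4096 and generation out to 8192, the last token attends only to entries 4096 onwards, which matches the "first half removed" description.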
ohhh i see
thanks for clarifying!
Idk if you saw the Phi-2 lm that microsoft just released — https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ pretty cool stuff
their two takeaways in particular — about data quality as well as scaling up model training, are cool. We're already trying to get high-quality data (and in general, music data, which is usually a performance of music by a famous composer, is probably higher-quality than random text data we find on the web?). W.r.t the second point, about scaling up model training, I know there was discussion in the #research channel a few weeks ago about scaling up language models. Makes me wonder if we should think about some of these smaller models you are training as initialization-points for a larger aria model
what do you mean by "unconditioned" here?
As if you used gpt3 without any prompt apart from the <S> token
Actually in this case there is a composer token too specifying that the composer is Chopin
I tried out fine-tuning on some jazz music (about 400 tracks of jazz recordings)
Unfortunately the fine-tuning data is pretty poor quality, however the transfer learning is definitely working which is cool !
Here is a jazz version of Moon River
To get a sense of the quality of the fine-tuning data you can listen to the first 30 seconds or so of this mp3 - this is the prompt
@here
This only cost 1 A100 hour to fine-tune. It would be amazing to get access to some higher quality fine-tuning datasets
haha finally got a generation almost identical to the original (except towards the end). The bitmidi 100x epochs model actually memorizes jingle bell pretty well
this one shows pretty clearly how rolling window fails... a few bad samples get accumulated into nonsense and once out of distribution, it can't find its way back
there are some interesting works in long-form story-telling, for NLP, coming out. this one in particular: https://arxiv.org/pdf/2311.15208.pdf
ending a music generation is always really hard, it always trails away until you artificially stop it
what struck me about this work was they introduced the concept of "ending" a narrative, and developed metrics to track what a narrative ending was
this story-generation work is pretty cool, how it uses a separate classifier (BERT-tiny) to toggle between local context and global context. Maybe that can be applied to what we're doing with CFG
@hearty flicker @timber talon @quasi steppe Did y'all fill out the JIRA research grant request for Stability to get project allocation for aria?
hello stella and happy new years/holidays! I think i missed the discussion of this grant — i'm still waiting on SAI approval for slack. Did the discussion occur there? Anyway, I am happy to help with grant writing
if it occurred here and i just missed it, my apologies — I'll touch base with @quasi steppe and @hearty flicker
Ah, Honglu and Louis got approved. I'll go nag about getting you in
The discussion was basically "this looks cool, submit a request"
There's an overview of what they're looking for here
The actual submission is submitted here
@sand nymph I submitted it before Christmas, Zach said that they were on a winter break but will look at it after they get back.
I didn't see this before I submitted! I can bolster the application and resubmit if it's a problem : )
@hearty flicker Great
Here are two nice continuations of the first movement of Beethoven's moonlight sonata !
I also thought this was quite pretty. Any Aphex Twin fans @here? I rendered the piano in ableton to try to make it sound more like the original
Here is the original for reference https://www.youtube.com/watch?v=-LgYzva-xq8
wow ^ that's amazing! how it keeps the same rhythmic patterns throughout
The new tokenizer seems to have (anecdotally) solved these sorts of problems : )
We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construc...
The dataset building me and Alex have been working on has gone incredibly well, just in time too
This is the product of an audio recording that has been turned into a symbolic form (think sheet music) via a neural net (similar to OCR in nlp), and then rerendered into audio using software
So essentially the data cap on our model is all audio piano recordings that we can obtain
Will be exciting to keep working on this after the ICML deadline : )
I have gone and fought the monsters and come back victorious: you should have a compute authorization to train these models on the Stability AI cluster
does anyone have piano albums on Spotify they like? (besides the big names — Chopin, Mozart, Schubert, etc., we already have a lot of them)
!!!
Running this pretraining this evening in that case
Will the budget be under its own slurm account?
Because permissions are wonky I can't see your JIRA ticket. If you send me a screenshot of what you submitted I can answer that question though xD
Pretraining is definitely working. The optimal sampling parameters seem to be different than before, especially top_p
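For anyone unfamiliar with top_p (nucleus) sampling: it keeps the smallest set of most-likely tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. A minimal pure-Python sketch over a plain list of probabilities; this is a generic illustration, not the sampling code used in the project:

```python
def top_p_filter(probs, top_p):
    """Zero out everything outside the nucleus and renormalize the rest."""
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break  # nucleus is complete
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.75))
```

A lower `top_p` cuts off more of the tail, so a better-calibrated model can often tolerate a higher value.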
the slowing down and cadence at ~1:20 to end the phrase is pretty crazy
crazy good or crazy bad haha
I've been playing around with the style transfer stuff and it's super interesting. This is supposed to be a liszt inspired continuation of a chopin nocturne
A bit all over the place structurally, but sounds a bit like liszt to me
i think it's pretty great — from a performance perspective, the ritardando is pretty well executed
add that to the musical coherence of the cadence
all in all, pretty impressive
This is actually from a half trained version of the smallest model
Am super excited to see what the large one is like
lol crazy! and it keeps a pretty consistent bass line throughout. How are you doing this, are you doing the CFG mashup thing that @h used?
Oh wait, yeah I just gave it a chopin nocturne and then changed the composer meta-token to liszt and the form meta-token to nocturne
So no dynamic cfg stuff really
ohhh cool!
yeah this meta-token approach is definitely more natural.
I still don't know if dynamic cfg trick works better for a better base model or not. Maybe I need to try it out. The way I think of it is that it might be suitable for more complex guidance (like we want something to get somewhere at certain exact time, or mixing more than 2 styles)
cfg definitely is working amazing
These samples are normally produced with a cfg of 1.05 to 1.2
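To make the cfg number concrete: in the standard classifier-free guidance formulation, the guided logits are `uncond + scale * (cond - uncond)`, so a scale of 1.0 is plain conditional sampling and values like 1.05 to 1.2 push slightly further towards the conditioned distribution. A minimal sketch over plain lists; what exactly counts as the conditional and unconditional pass for these models is not spelled out here, so treat those names as placeholders:

```python
def cfg_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits with guidance `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [2.0, 0.0, -1.0]    # logits from the conditioned forward pass
uncond = [1.0, 0.0, 0.0]   # logits from the unconditioned forward pass
print(cfg_logits(cond, uncond, scale=1.2))
```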
Our YaRN finetuning runs on MAESTRO (50 epochs, 16k context length) have also finished (2 larger runs remain but will be done soon).
OMG it's working sooooo well! I'm super excited!!! I mean, not as high quality as the original MAESTRO midis but I've never gotten such coherent one before.
A little weird in the middle and I thought it's gonna go brain-dead like before, but it snapped out of it.
Some completely improvised jazz
@hearty flicker you had pitched ICML as a target venue. The deadline is in six days which is doable but tight. Is that still the goal?
Have you started writing the paper already? I would 100% start now while the models still run.
I'm writing as we speak !
which one do you guys think is of higher quality? (first ~30-40 sec are the same)
for first
for second
(urgh need to render into mp3 first next time)
This one as well.
Anyone wants to vote?
I'm doing a post-submission editing pass on the paper, including flagging areas where I think more substantive experimentation would be valuable.
@hearty flicker: do you think we could get some music professionals to give feedback on the music we generate?
That was one of the aims initially, but we didn't have time
I was in the process of reaching out to various people who are experts in a certain form of music that is simultaneously quite 'rule based' whilst also being very free. The model excels in creating this type of music, and I was curious about what a musicologist in this area would think.
I would like to have a section on this topic in the arxiv preprint
Great
I would really like to have an extensive experiments and evaluation section in the preprint
We also didn't get to do many experiments with finetuning/alignment which is super important
I strongly recommend not using -small -med and large labels and instead labeling models by their actual size
This is the style of music that it does well at - https://www.youtube.com/watch?v=7Uv71oX4wG8 (original - https://www.youtube.com/watch?v=W5lOLZsjOp8)
noted
Are the %-ages in the dataset section the amount of the actual pretraining data?
Why was dropout used @hearty flicker?
/ was it? I only see one passing reference
Dropout was used during training
That's definitely something we need to disclose given how non-standard it is
I thought it was standard, but have realised now there are some issues with the training process we should have done differently
I thought it was standard for some reason
We also used GELU, which we forgot
You specifically said otherwise in the paper
What do you mean?
We used dropout during training, which is why the train loss is higher than the val loss
Oh that was about GELU
Oh sorry, I meant that we used GELU instead of SwiGLU or whatever the modern activation function is
We stated it accurately in the paper, I just wish we trained without dropout and with a better activation function
Not that it matters so so much, but still
You said you used the architecture from LLaMA. I'm currently changing that to be correct.
Oh I meant to say inspired by, the only difference is the activation function and dropout.
Is this the correct config file? https://github.com/EleutherAI/aria/blob/main/config/models/large-wide.json
How much of the SAI grant was used for these three models?
I think that training large was roughly 2000 hours
And the others were roughly 1000 hours together
So we have quite a lot left, I think
I was originally going to do that too, but ran out of time too
No worries
Like training on all the data vs only a high quality subset
And training deep arch vs wide arch
Since MuseNet (openai) used 100+ layers for some reason
I did train a deep version of 'large' briefly, but after 5 epochs it was quite clear it was learning far slower, so I stopped it
Honestly these models (small and medium at least) are not that expensive to train, and fine-tuning is basically free
Lots of opportunity for ablation
I also have a lot of evals around the metadata and controllability stuff, about 2 days before the deadline someone reached out to me regarding this
They had basically judged the outputs of our model, using a music listening model (audio)
And found that for samples generated with a genre meta-data tag (e.g. jazz), the audio model said the same
That's cool
i am a music professional in the sense that I went to Juilliard for classical music performance, and also got paid for performing 🤷♂️ and my evals are in there lol
it's not my current field though, and I'm not deep into the musicological/theoretical side
it's actually a pity we didn't call that out in the paper — "one of our annotators was X". it would've given more credibility
That's awesome! I didn't know that.
it was another lifetime, i also dropped out after 2 years — competing against hundreds of other musicians for the opportunity to play the same few classical music songs for a bunch of old white people wasn't a very appealing career path, long-term. but anyway, i think we're all in agreement about needing more different types of evals though!
If we had systematically collected this info then yes, but it might vibe weird to say "BTW one of our evaluators did X"
ah yeah, agreed that makes sense
Alex's insight was pretty handy while we were writing
I have a meeting tomorrow with my supervisor, he also mentioned this idea. He might have some people in mind
Hey @here! We pretrained three models over the last couple of weeks, and it went well! I wanted to lay out my ideas for future directions, which mostly revolve around efforts to finetuning/alignment. The pretrained models are quite powerful; however, we'd like to release models that are better aligned and easier to use.
Aligning these models is an interesting problem: there isn’t really data available in the equivalent of a question-answer format, so we need to find other ways to align them. The two main issues are:
- The model acts as a continuation generator. If you give it a prompt that sounds mediocre or boring for whatever reason, it will generate a continuation in the same style. Just like aligned LLMs, I would like the continuation to be high quality no matter the quality of the prompt.
- The most effective method for controllability is just giving the model a very high-quality prompt. This results in amazing continuations (seriously amazing) but limits how you can use the pretrained models. I want to improve both static (such as predefined genres/styles) and dynamic forms of controllability.
My ideas for how to solve these problems:
- During finetuning, separate the prompt from the continuation using a <SEP> token. When training randomly add different amounts of noise (degradation) to the prompt. I think this might help with both (1) and (2) as it implicitly separates the prompt from the continuation. At inference time, you could specify how much ‘noise’ the prompt has, so that the model knows how closely to treat it as ground truth.
- By using a listening model (audio classifier), my supervisor has found that it’s possible to tag different moments in the training data (MIDI) with tags like ‘sad’, ‘slow’, ‘jazzy’, etc. These tags actually work very well, and I want to incorporate them in a similar fashion to the ‘diminish’ token <D>, which causes the piece to end 5-10secs from when it is seen.
- I really want to incorporate RLHF in the alignment process. This may be key for improving the consistency of unprompted samples. This feels like a big thing to take on, but with recent improvements (DPO) I think it could be possible to incorporate. I don't have a lot of experience with RLHF, but it's an area I'm personally interested in.
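To make the first idea (degraded prompt, then `<SEP>`, then clean continuation) a bit more concrete, here is a toy sketch. Everything in it is hypothetical: the `("<NOISE>", level)` tag, the use of onset jitter as the degradation, and the (pitch, onset_ms) note format are placeholders for illustration, not the tokenizer's real vocabulary or a settled design:

```python
import random

SEP = "<SEP>"  # separator between prompt and continuation


def make_finetuning_example(prompt, continuation, noise_ms, rng):
    """Build one training sequence with a degraded prompt.

    prompt / continuation -- lists of (pitch, onset_ms) tuples
    noise_ms              -- max onset jitter applied to the prompt only
    rng                   -- random.Random instance, for reproducibility
    """
    noisy_prompt = [
        (pitch, max(0, onset + rng.randint(-noise_ms, noise_ms)))
        for pitch, onset in prompt
    ]
    # The noise level is prepended so it can also be specified at inference.
    return [("<NOISE>", noise_ms)] + noisy_prompt + [SEP] + list(continuation)


rng = random.Random(0)
prompt = [(60, 0), (64, 500), (67, 1000)]
continuation = [(72, 1500), (71, 2000)]
print(make_finetuning_example(prompt, continuation, noise_ms=30, rng=rng))
```

The continuation is always left clean, so the training target stays high quality regardless of how degraded the prompt is.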
After we have researched solving these problems, I’d like to publish a full-length paper on arXiv expanding on the preprint that we have just submitted. This would also be a good point to publicly release the models and publish a blog post.
Secondary to this, @timber talon and I are currently researching AMT (audio -> MIDI conversion) and aim to get a model, dataset, and paper ready for ISMIR 2024 (deadline mid-April). The only bottleneck on the size of the dataset is the number of audio piano recordings that we can obtain. If anyone has any idea of how to acquire solo piano recordings in bulk, please let me know! We have good methods for pruning out non-solo piano recordings, so the data source doesn't need to be 100% clean. I had one idea for a crowdsourcing project where people contribute YouTube links of solo-piano music. We can use YouTube’s API and the AMT model we are working on to convert these to MIDI files.
We'd love to hear any feedback. Although the pretrained models are not publicly released yet, if anyone would like access - let me know and I'll dm you a (checkpoint) download link. Also, if anyone is interested in getting more involved (on a co-author level), dm me!
@hearty flicker still need feedback from classical music experts? i know someone who might be interested
Yes
Yes yes! I would like to have a subsection in the arXiv paper evaluating these models from a musicological perspective.
Still working on a way to frame it exactly
As well as discord, my twitter dms are always open (https://twitter.com/loubbrad) and my email is [email protected]
Really cool site, I didn't know about this
Might be good to add to the pretraining dataset
I can finetune the model on the MIDIs from this website if you like
Or if you have compute resources yourself I can walk you through how to do it
@hearty flicker and I have talked about this a bunch but i think IMO the key here is that "alignment" in music prompting is to recover from unintentionally messy/mistake-filled inputs. It's not for recovering from atonal or otherwise intentionally stylistic-nonnormative inputs.
Kinda a tough needle to thread because, as compared with the language domain, music doesn't have the same notion of "toxicity" or "not in accordance to human values". And we don't want to give the impression that we are being Western-biased
@hearty flicker a question about the transcription — for the AMT paper, is the idea to stick with piano transcription, or to extend it to other instruments? That would potentially really widen what kinds of music we have access to — one of the things in collecting the dataset for the last paper was that piano music, especially in non-Western styles, was definitely more limited than other kinds of instrumental music
It's a pretty interesting topic to research
There is some research on multi-track transcription, there was a paper from google in 2023 or 2022 if I remember, I'll dig it up
2021 actually - https://arxiv.org/abs/2111.03017
Solid multi-track transcription is the dream. I'd settle for SOTA piano-only (I think we can do this too)
does multitrack apply to piano solo as well, since there are multiple voices?
I feel like if we take the kinda a Whisper-for-AMT approach we were talking about —
like, start with MIDI and generate tons and tons of augmented audio data using fluidsynth, and then try to learn the reverse audio -> MIDI mapping,
can't we also just take piano MIDI and also augment it into many different instruments?
I wonder how well this would work, definitely worth looking into
We probably have to find a good way to automatically render the MIDIs in a realistic sounding way
Most DAWs don't have a cli or api to do this quickly, so we'd probably be stuck using something like fluidsynth again
If anyone has any references for aligning / finetuning programming llm models, send them my way!
I think it's the closest thing to what we are trying to do here
fluidsynth plus lots of augmentation/reverb/etc.? covering a wider input distribution domain might be just as good as finding something realistic
Yeah I think that would be the best bet, but I'm not sure how well it would work
We'll give it a try, I've been meaning to read that paper I linked anyway - I wonder how they did it
@hearty flicker would you like to try RWKV for this 🙂 i found it generates better midi than transformers
Hey! The models we trained last month are large enough to actually overfit the training data given enough epochs
So I'm not actively looking into architectural changes in that direction, if that makes sense
Right now I'm mostly concerned with alignment as well as data related stuff
It would be pretty easy to try out training RWKV code wise, it just would take a lot of compute haha. The largest model we (pre-)trained last month was roughly 2500 GPU hours, so not that cheap unfortunately
i can train one to compare with your result if the training data is available 🙂
Do you normally train models on SAI's research cluster? I have the data on there already
i am on another cluster. can use "croc" to send it to me
we could actually add some comparison between transformer and RWKV to the paper
Only thing I can't share publicly is data
I'm sure RWKV will do amazing on the full dataset, only issue is that I can't take it off SAI for obvious reasons
Any architecture good enough for language will be good enough for MIDI
Btw Blink trained this with aria's midi tokenizer / tool chain : )
I assume you don't mean we legally can't take off from SAI? If we really want to open-source data it's possible (but need to notify SAI of course). Just so that you know
We can open source most, but not all
Basically everything that me and Alex have worked on, we can open source
Redistribution of music is legally tricky
However for the transcriptions (the stuff Alex and I worked on), there is precedent for redistributing
Yeah basically just want to say it's not our NDA that stops it. There can be other constraints of course
No it's not the NDA
We are aiming to release the aria models in April, I'm currently working on three separate ideas around alignment/scaling/editing that will be done by then I hope
@timber talon, @quasi steppe and I are also planning to submit 3 papers to ISMIR in April
I reckon overall this project will have a decent impact on the gen-music scene
it's still using my midi data & tokenizer
Hi all, I'm very excited to have found this project and will be following its development. I work in the music industry and have been curating MIDI data for many years. I'm interested in learning more about the topics discussed here and, if helpful, sharing some ideas.
Is Aria aimed to be an all-genre foundation model or solely focused on classical music?
It's focused on classical music (and mostly piano music) for data reasons but if you have other data it should work on other forms of music
Good to know, although classical is what I collect the least of, with so many large datasets already available.
I'm still scouring through the comments here. In regards to ideas for rendering MIDI, you may want to explore Spotify's pedalboard for Python, which allows rendering with VST3 plugins.
Hey! Multiple people have reached out to me and have already successfully trained/finetuned their own models using the toolchain we have created
If that is something you are interested in (since you have your own data), I can guide you through it
Generally speaking it's quite easy to do this with the toolchain we have built
currently it's like gacha game, sometimes bad, sometimes great
i think we need RLHF dataset 🙂
@neon hamlet
Can it be fine-tuned or trained from scratch on a single GPU?
Can be finetuned with a single GPU easily
I almost have the small model running. Can it be fine-tuned below max_seq_len? 4096 is quite large for an average GPU.
It should work. Try giving a -l argument of 2048 or 1024 when building the finetuning dataset. It might trigger an assert somewhere in the training script, I haven't tested it myself
Which GPU are you using?
I'll test 2048 to see what happens. I'm running on colab. Orange run is 100 classical piano samples and green is 1000. Batch size is 2.
If it doesn't trigger an assert you are probably good
although you should be able to train using the full context length on colab : )
Doesn't a T4 have 16GB of VRAM?
It's already trained on a lot of classical piano, so if that is what you want to use it for you can just use the pretrained checkpoints
This is one of the weirdest training bugs I've ever had to debug
I'm 90% sure it's something to do with the optimizer... Training in fp32 makes it a lot better, but still really confusing
I'm fine-tuning on pop piano music. This might take a while if the model has never seen pop melodies.
Is there any type of minimum length or note density for the midi data? Should it be accepting files with short durations like 8-16 bars?
During pretraining there was some pruning based off of those things
You can find them in config/config.json
The model does support any valid MIDI file though
The model has seen roughly 50k multitrack pop during pretraining
Make sure to compare your finetuning results to the original model, would be interesting : )
I don’t know if you’ve seen this already but this might be a useful discussion : https://stackoverflow.com/questions/58633177/why-theres-a-big-jump-up-of-the-loss-curve-during-the-training
I haven't actually seen such a jump during my experiments
The paper will hopefully be on arxiv soon
I don't understand the tokenization process that well, but I'm curious to know how much MIDI information is being compressed. Is 4096 tokens intended to cover an average song length? Also, is it possible to know how many MIDI files the model was trained on?
I think the pretraining corpus was 144k sequences of length 4096
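For a rough sense of scale, here's the arithmetic on that figure. Both numbers are just the ones quoted above (from memory, not verified), so treat the result as a ballpark:

```python
# Back-of-the-envelope size of the pretraining corpus.
# Both inputs are the approximate figures quoted in the chat, not verified values.
num_sequences = 144_000
seq_len = 4096

total_tokens = num_sequences * seq_len
print(f"{total_tokens:,} tokens (~{total_tokens / 1e6:.0f}M)")
```

So somewhere around 590M tokens, if the 144k figure is right.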
If you want a better idea about the tokenisation process, I'd recommend just tokenising some MIDI files and printing it out. You can find some in tests/test_data
Or you can read the source code - https://github.com/loubbrad/aria/blob/56b66eacfef72081ea0eaca8184fab4792b54d19/aria/tokenizer/tokenizer.py#L414
The main idea is that each note is represented by three tokens: (instrument, pitch, velocity), (onset in ms relative to the last <T> token), (duration in ms)
The reason for using <T> is to keep the total vocabulary size under control.
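A toy sketch of that scheme, in case it helps. This is NOT the real Aria tokenizer: `toy_tokenize`, the `"onset"`/`"dur"` labels and the 5000 ms value of `TIME_STEP_MS` are all made up for illustration, and any time quantisation the real code might do is left out.

```python
# Toy illustration of the three-tokens-per-note idea. NOT the real Aria tokenizer.
# TIME_STEP_MS is an assumed value chosen for this example.
TIME_STEP_MS = 5000


def toy_tokenize(notes):
    """notes: list of (instrument, pitch, velocity, start_ms, duration_ms),
    sorted by start_ms."""
    tokens = []
    window_start = 0  # absolute time of the most recent <T> (or sequence start)
    for instrument, pitch, velocity, start_ms, duration_ms in notes:
        # Emit <T> tokens until the onset falls inside the current window.
        while start_ms - window_start >= TIME_STEP_MS:
            tokens.append("<T>")
            window_start += TIME_STEP_MS
        tokens.append((instrument, pitch, velocity))        # token 1: what is played
        tokens.append(("onset", start_ms - window_start))   # token 2: onset relative to last <T>
        tokens.append(("dur", duration_ms))                 # token 3: duration
    return tokens


notes = [
    ("piano", 60, 80, 0, 500),
    ("piano", 64, 70, 4900, 300),
    ("piano", 67, 90, 5200, 1000),  # crosses the 5000 ms boundary, so a <T> comes first
]
for tok in toy_tokenize(notes):
    print(tok)
```

In this sketch onset values always stay below `TIME_STEP_MS`, so the number of distinct onset tokens is bounded no matter how long the piece is. That's the vocabulary-size point.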
There is some preprocessing which is applied when building datasets, you can find the source for this here: https://github.com/loubbrad/aria/blob/56b66eacfef72081ea0eaca8184fab4792b54d19/aria/data/datasets.py#L246
It is fully customisable using the config/config.json file
This was helpful, thanks. I'm testing some runs to see if I can get a tiny model to perform. These are my settings:
{
"d_model": 384,
"n_heads": 8,
"n_layers": 16,
"ff_mult": 4,
"drop_p": 0.0,
"max_seq_len": 2048,
"grad_checkpoint": true
}
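Quick sanity check on those settings, as a rough sketch. It assumes a standard transformer block (four d_model x d_model attention projections plus a two-matrix feed-forward of width ff_mult * d_model) and ignores embeddings, biases and norms. The actual Aria architecture may differ, so this is only an order-of-magnitude estimate:

```python
import json

# The settings posted above, parsed as JSON.
config = json.loads("""
{
  "d_model": 384,
  "n_heads": 8,
  "n_layers": 16,
  "ff_mult": 4,
  "drop_p": 0.0,
  "max_seq_len": 2048,
  "grad_checkpoint": true
}
""")

d = config["d_model"]

# d_model must split evenly across the attention heads.
assert d % config["n_heads"] == 0
head_dim = d // config["n_heads"]

# Assumed layout: Q, K, V and output projections, then up/down feed-forward matrices.
attn_params = 4 * d * d
ff_params = 2 * config["ff_mult"] * d * d
approx_params = config["n_layers"] * (attn_params + ff_params)

print(f"head_dim = {head_dim}")
print(f"approx. non-embedding params = {approx_params:,}")
```

Under those assumptions it comes out to a head dimension of 48 and roughly 28M non-embedding parameters.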