#Myna: Small Music Representation Model
I don't think I'm in a position to pick up more projects at this moment, but I'm interested in seeing this move forward
It seems you and the other collaborator have already done some work - maybe you could share more details about the project plan at a high level, including concrete things potential contributors could look at/do?
I would love to share some details! We are training with a contrastive objective, and the model would probably benefit a lot just from scaling up and from some architecture improvements. We are currently using a variant of Minz Won's self-attention architecture (https://arxiv.org/abs/1906.04972) as the base model. I am a bit hesitant to share more details publicly since we've been working on this for roughly a year (I don't know how this usually goes in the research world).
As for contributors and what you could do:
- We could definitely benefit from some architecture changes to make the model more capable
- There are definitely parts of my methodology that could be improved
- We should evaluate on the MARBLE benchmark (https://marble-bm.shef.ac.uk/) - we still need to write the evaluation code for this
- General ideas to improve performance. Everyone else in the field is training on at least 16 GPUs right now and I have an RTX 4080. The results are good for a single GPU, and I wonder whether scaling up properly would yield a competitive model that is an order of magnitude smaller and faster.
Would love to hear your thoughts!
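For anyone unfamiliar with contrastive objectives, here is a minimal sketch of one common choice, the InfoNCE loss. This is an assumption for illustration only: the thread does not say which contrastive loss, temperature, or batching the project actually uses, and the names `l2_normalize` and `info_nce_loss` are hypothetical. It is plain Python so it runs without a deep learning framework.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are embeddings of two views of the same
    clip; every positives[j] with j != i serves as a negative for anchors[i].
    The temperature value is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the correct class.
        total += log_denom - logits[i]
    return total / len(a)


if __name__ == "__main__":
    views_a = [[1.0, 0.0], [0.0, 1.0]]
    views_b = [[0.9, 0.1], [0.1, 0.9]]
    print(info_nce_loss(views_a, views_b))
```

The loss is small when each embedding is closest to its own pair and large when it is closer to another clip's embedding, which is the behavior any contrastive setup of this kind relies on.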