I'm currently trying to build an AI that generates music, using techno DJ sets as training data.
I have trained the VQ-VAE and get okay-ish results there: the music is clearly recognizable, and the model generalizes quite well, working even for music quite different from my training dataset. Now I'm building the Transformer that should predict the next token in order to generate new music. The training data for the Transformer consists of token sequences of length 2**15 (each token takes one of 2048 values), and from each of those a random subsequence of 2048 tokens is chosen. I might increase the sequence length in the future, but for computational speed that's my context length at the moment.
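A minimal sketch of that sampling step, to make the setup concrete. It is not the attached script: `random_crop` is a hypothetical helper, the token sequence is random stand-in data, and it assumes the targets are simply the inputs shifted by one token, as is usual for next-token prediction.

```python
import random

FULL_LEN = 2**15     # length of each stored token sequence (from the post)
CONTEXT_LEN = 2048   # transformer context length (from the post)
VOCAB_SIZE = 2048    # number of distinct token values (from the post)


def random_crop(tokens, context_len=CONTEXT_LEN, rng=random):
    """Pick a random window and return (inputs, targets) shifted by one token.

    Hypothetical helper for illustration only.
    """
    # We need context_len + 1 tokens so every input position has a target.
    max_start = len(tokens) - (context_len + 1)
    if max_start < 0:
        raise ValueError("sequence is shorter than context_len + 1")
    start = rng.randint(0, max_start)  # randint is inclusive on both ends
    window = tokens[start:start + context_len + 1]
    return window[:-1], window[1:]


# Stand-in data: random token ids instead of real VQ-VAE codes.
rng = random.Random(0)
sequence = [rng.randrange(VOCAB_SIZE) for _ in range(FULL_LEN)]

inputs, targets = random_crop(sequence, rng=rng)
```

With this layout, `targets[i]` is the token that follows `inputs[i]`, so the model is scored on all 2048 positions of the window.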
TransformerDec(vocab_size=2048, embed_size=1024, n_layers=6, forward_expansion=4, n_heads=8, pad_idx=-1, dropout=0.3, device=device, max_seq_len=2048)
The transformer seems to work: I tested it with a synthetic dataset. I also once got to 7% accuracy, but haven't been able to reproduce that. I also tried increasing the number of layers and the embedding size, with no improvement. Does anyone have an idea what I am doing wrong? My transformer implementation and training script are attached.