https://arxiv.org/abs/2501.00663
This seems like a big step at first glance!
"We observe that our Titan architecture
outperforms all modern recurrent models as well as their hybrid variants (combining with sliding-window attention) across
a comprehensive set of benchmark"
arXiv.org
Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accur...