https://arxiv.org/pdf/2305.07759.pdf
In the current environment, May 2023 is a long time ago 😂 but I've been pondering this paper for a while. There are some (I think) unsupported claims, but also some really fascinating insights into what transformers are learning and why, along with an interesting dataset. Is anyone who read it still thinking about it? I'd be interested in everybody's take.