PRISM - Replacing Attention with Harmonic Phase-Locking for Linearithmic Scaling (O(N log N)) | EleutherAI | Page 1

scenic cloud Dec 2, 2025, 6:20 PM

The code is on Zenodo with a DOI, but for ease of use: https://github.com/AlperYildirim1/Pay-Attention-Later/tree/main
I am open to criticism.

If it is appropriate, I would also like to open a post for my next work on the community projects page soon. I want to explore the architecture further and try PRISM-Transformer encoder hybrids.

formal abyss Dec 9, 2025, 3:10 AM

Advice re modern NLP: BLEU is not a very good metric, since it only does n-gram matching. You would be better off looking at perplexity on the validation subset of some well-known web-scale pretraining dataset. If you want to compare against a well-optimized baseline, you could compare against modded-nanogpt (i.e. the nanogpt speedrun) and see if you can get a better loss in its setting.

For the finetuning setting, you would be better off testing on modern SFT datasets and seeing if you can beat the benchmarks on those. Of course, that would require pretraining a sizable model.
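
For reference, the perplexity suggested above is the exponential of the mean negative log-likelihood per token on held-out text. A minimal sketch, assuming you already have the natural-log probability the model assigned to each reference token (the function name and the toy input are illustrative, not taken from the PRISM repository):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy check: a model that spreads probability uniformly over 4 choices
# at every step has a perplexity of about 4.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))
```

Lower is better, and values are only comparable between models that share a tokenizer and a validation set.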

Isn't your attention replacement the same as FNet: https://research.google/pubs/fnet-mixing-tokens-with-fourier-transforms/ ? They also use the FFT.

scenic cloud Dec 9, 2025, 4:26 AM

formal abyss Advice re modern nlp: BLEU is not a very good metric as it does ngram matching, ...

Thanks for the advice. I will add perplexity as another metric.
FNet uses the FFT, but it completely ignores the imaginary part. My work extends RoPE to semantics. So yes, FNet uses the math of waves, but I am using the physics of waves.
The most important difference is that FNet still uses vectors. PRISM does not have vectors; it has waves.
Regarding nanogpt and SFT datasets, those are for decoder-only prediction models. PRISM is an encoder, so I am testing it on classic WMT14 translation by attaching a standard transformer decoder to it. My work is about the semantic map, especially the relations between token representations, and I thought translation was a good way to isolate that, because we can check whether, for example, German "apfel" resonates similarly to English "apple". In the few-shot learning tests, I am checking exactly whether the model can make new made-up German words resonate with actual English words, like Lichtkasten -> Television.
I am currently upgrading the paper. I am training hybrid models, and PRISM seems to work incredibly well for few-shot learning in hybrid models.
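
To make the FNet point above concrete: FNet's mixing sublayer applies a DFT along the hidden and sequence dimensions and keeps only the real part. The sketch below is not PRISM's code and not FNet's official implementation. It is a small NumPy illustration (the function names are made up here) of what dropping the imaginary part loses: a real input and its circularly reversed copy have complex-conjugate spectra, so their real parts are identical and only the phase tells them apart.

```python
import numpy as np

def fnet_style_mix(x):
    """FNet-style token mixing: 2D DFT over (sequence, hidden), real part only."""
    return np.fft.fft2(x).real

def circular_flip(x):
    """Return y with y[n, d] = x[(-n) mod N, (-d) mod D]."""
    return np.roll(x[::-1, ::-1], shift=(1, 1), axis=(0, 1))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))   # toy input: 8 tokens, hidden size 4
y = circular_flip(x)              # a different input (reversed order)

# The real-part-only mixer cannot distinguish x from y...
print(np.allclose(fnet_style_mix(x), fnet_style_mix(y)))
# ...while the full complex spectra differ (they are complex conjugates).
print(np.allclose(np.fft.fft2(x), np.fft.fft2(y)))
```

In the full FNet model the surrounding feed-forward layers and positional embeddings still see the original ordering, so this only shows what the mixing step itself discards.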

cold bronze Dec 13, 2025, 7:21 AM

Do I understand correctly that PRISM is not just swapping self-attention for an FFT-based mixer, but actually changing how meaning is stored, so the model mostly adjusts amplitudes and phases instead of pushing vector points around, which helps it learn new things without heavily forgetting old ones?

scenic cloud Dec 13, 2025, 7:51 PM

cold bronze Do I understand correctly that PRISM is not just swapping self attention for an ...

Yes, correct. My narrative is flawed and I am fixing it right now, but you figured it out perfectly. Let me make it clearer.
At first, I thought I could make this selective like Hyena. But I was already arguing that the first layers should not be selective and that this is better for few-shot learning. I still tried Hyena-like gating mechanisms a few days ago, and they performed worse.
So yes, I am replacing attention on the encoder side, but abandoning the goal of selectivity and leaving that to the transformer decoder. PRISM is for changing how meaning is represented, as waves, so that the model learns new embeddings a lot faster, but it does not have selectivity.
Lately I trained hybrid PRISM-Transformer encoders, and these hybrid encoders performed even better, because this way I get both rapid phase locking and selectivity.
