#PRISM - Replacing Attention with Harmonic Phase-Locking for Linearithmic Scaling (O(N log N))


scenic cloud
formal abyss

Advice on modern NLP evaluation: BLEU is not a very good metric because it only does n-gram matching. You would be better off looking at perplexity on the validation subset of a well-known web-scale pretraining dataset. If you want to compare against a well-optimized baseline, you could compare against modded-nanogpt (i.e. the nanoGPT speedrun) and see if you can get a better loss in its setting.
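For reference, perplexity is just the exponential of the mean per-token negative log-likelihood. A minimal sketch, assuming you already have natural-log token probabilities from the model (the function name and the toy inputs are made up for illustration):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical example: a model that assigns probability 0.25 to every
# target token behaves like a uniform guess over 4 options.
uniform_over_4 = [math.log(0.25)] * 8
print(perplexity(uniform_over_4))  # approximately 4
```

Lower is better, and the number is only comparable between models that share the same tokenizer and validation split.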

For the finetuning setting, you would be better off testing on modern SFT datasets and seeing whether you can beat the benchmarks for those. Of course, that would require pretraining a sizable model first.

Isn't your attention replacement the same as FNet (https://research.google/pubs/fnet-mixing-tokens-with-fourier-transforms/)? They also use the FFT.
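For context, FNet's mixing sublayer is small enough to write out: a DFT along the hidden dimension and along the sequence dimension, keeping only the real part. The sketch below uses NumPy, with made-up shapes and function names. The second function only shows what keeping the complex output looks like; it is not PRISM's code.

```python
import numpy as np

def fnet_mixing(x):
    """FNet-style token mixing: 2D DFT over (sequence, hidden), real part only."""
    return np.fft.fft2(x, axes=(-2, -1)).real

def complex_mixing(x):
    """Same transform, but the complex output (amplitude and phase) is kept."""
    return np.fft.fft2(x, axes=(-2, -1))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))  # hypothetical (seq_len, hidden_dim) embeddings

real_only = fnet_mixing(x)
full = complex_mixing(x)

print(real_only.dtype, full.dtype)
# The imaginary part that FNet drops is generally non-zero:
print(np.abs(full.imag).max() > 0)
```

Both variants cost O(N log N) in sequence length because of the FFT, so the scaling alone does not tell the two approaches apart.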

scenic cloud
# formal abyss Advice re modern nlp: BLEU is not a very good metric as it does ngram matching, ...

Thanks for the advice. I will add perplexity as another metric.
FNet uses the FFT, but it discards the imaginary part entirely. My work extends RoPE to semantics. So yes, FNet uses the math of waves, but I am using the physics of waves.
The most important difference is that FNet still works with vectors. PRISM does not have vectors; it has waves.
Regarding nanoGPT and SFT datasets: they are for decoder-only prediction models, and PRISM is an encoder. So I am testing on classic WMT14 translation by attaching a standard transformer decoder to it. My work is about the semantic map, especially the relations between token representations. I thought translation was a good way to isolate that, because we can check whether, for example, German "Apfel" resonates similarly to English "apple". And in the few-shot learning tests, I am checking exactly whether the model can make newly made-up German words resonate with actual English words, like Lichtkasten -> Television.
I am currently upgrading the paper. I am training hybrid models, and PRISM seems to work incredibly well for few-shot learning in hybrid models.
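One possible way to put a number on "resonates", sketched here with the phase-locking value from signal analysis: take the per-dimension phase difference between two complex representations and measure how consistent it is. This is an assumption for illustration only, not necessarily the metric used in the paper; the function name and the toy "Apfel"/"apple" arrays are made up.

```python
import numpy as np

def phase_locking_value(a, b):
    """|mean(exp(i * (phase(a) - phase(b))))|: 1.0 means a constant phase offset."""
    diff = np.angle(a) - np.angle(b)
    return float(np.abs(np.mean(np.exp(1j * diff))))

rng = np.random.default_rng(0)
dim = 64

# Hypothetical complex token representations (amplitude * exp(i * phase)).
apple = rng.uniform(0.5, 1.5, dim) * np.exp(1j * rng.uniform(-np.pi, np.pi, dim))
# "Apfel": different amplitudes, same phases up to a fixed offset of 0.3 rad.
apfel = rng.uniform(0.5, 1.5, dim) * np.exp(1j * (np.angle(apple) + 0.3))
# An unrelated token with independent random phases.
other = rng.uniform(0.5, 1.5, dim) * np.exp(1j * rng.uniform(-np.pi, np.pi, dim))

locked = phase_locking_value(apfel, apple)
unlocked = phase_locking_value(other, apple)
print(locked, unlocked)
```

The value ignores amplitudes entirely, so it isolates the phase relationship; a cosine similarity on the complex vectors would mix both.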

cold bronze

Do I understand correctly that PRISM is not just swapping self attention for an FFT based mixer, but actually changing how meaning is stored so the model mostly adjusts amplitudes and phases instead of pushing vector points around, which helps it learn new things without heavily forgetting old ones?

scenic cloud
# cold bronze Do I understand correctly that PRISM is not just swapping self attention for an ...

Yes, correct. My narrative is flawed and I am fixing it right now, but you figured it out perfectly. Let me make it clearer.
At first, I thought I could make this selective like Hyena. But I was already arguing that the first layers should not be selective and that this is better for few-shot learning. I still tried Hyena-like gating mechanisms a few days ago, and they performed worse.
So yes, I am replacing attention on the encoder side, but abandoning the goal of selectivity and leaving that to the transformer decoder. PRISM is about representing meaning as waves so that it learns new embeddings a lot faster, but it does not have selectivity.
So lately I trained hybrid PRISM-Transformer encoders, and these hybrid encoders performed even better, because this way I have both rapid phase locking and selectivity.