Thanks for the advice. I will add perplexity as another metric.
FNet uses the FFT, but it completely discards the imaginary part. My work extends RoPE to semantics. So yes, FNet uses the math of waves, but I am using the physics of waves.
The most important difference is that FNet still uses vectors. PRISM does not have vectors; it has waves.
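To make the FNet point concrete, here is a minimal sketch of FNet-style token mixing in plain Python: a 2D DFT over the hidden and sequence dimensions, followed by keeping only the real part. The function names (`dft`, `dft2`, `fnet_mixing`) and the toy input are illustrative assumptions, not code from FNet or PRISM, and the naive O(n^2) DFT stands in for a real FFT.

```python
import cmath

def dft(seq):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(seq)
    return [
        sum(seq[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def dft2(x):
    """2D DFT: transform each row (hidden dim), then each column (sequence dim)."""
    rows = [dft(row) for row in x]
    cols = [dft([rows[i][j] for i in range(len(rows))])
            for j in range(len(rows[0]))]
    # Transpose back to (seq_len, hidden) layout.
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(rows))]

def fnet_mixing(x):
    """FNet-style mixing: keep only the real part of the 2D DFT."""
    return [[z.real for z in row] for row in dft2(x)]

# Toy input with seq_len=3 and hidden=2 (made-up values, for illustration only).
x = [[1.0, 2.0], [3.0, 5.0], [0.0, -1.0]]

spectrum = dft2(x)                                   # complex-valued
mixed = fnet_mixing(x)                               # real-valued, what FNet passes on
discarded = [[z.imag for z in row] for row in spectrum]  # what gets thrown away
```

In this sketch `discarded` is generally non-zero, so the output of the mixing step is an ordinary real vector per token and the phase information in the spectrum is lost.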
Regarding nanoGPT and SFT datasets: those are for decoder-only prediction models, and PRISM is an encoder. So I am testing it on the classic WMT14 translation task by attaching a standard Transformer decoder. My work is about the semantic map, especially the relations between token representations, and I thought translation would be a good way to isolate that, because we can check whether, for example, German "Apfel" resonates similarly to English "apple". In the few-shot learning tests, I am checking exactly whether the model can make new, made-up German words resonate with actual English words, e.g. Lichtkasten -> Television.
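As a rough illustration of what such a check could look like, the sketch below scores two complex-valued token representations by the normalized magnitude of their complex inner product. This is an assumption made for illustration only: the `resonance` function, the choice of score, and all numeric values are hypothetical and are not the measure or the outputs used in PRISM.

```python
import cmath
import math

def resonance(u, v):
    """Hypothetical score in [0, 1]: |<u, v>| / (||u|| * ||v||) for complex vectors."""
    inner = sum(a * b.conjugate() for a, b in zip(u, v))
    norm_u = math.sqrt(sum(abs(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(abs(b) ** 2 for b in v))
    return abs(inner) / (norm_u * norm_v)

# Toy representations (made-up amplitudes and phases, not model outputs).
apple = [cmath.rect(1.0, 0.3), cmath.rect(0.5, 1.2), cmath.rect(0.8, -0.7)]

# Same wave as "apple" up to a global phase shift: should score close to 1.
apfel = [z * cmath.exp(0.4j) for z in apple]

# Same amplitudes but unrelated relative phases: should score much lower.
other = [cmath.rect(1.0, 0.3), cmath.rect(0.5, -1.9), cmath.rect(0.8, 2.4)]

score_pair = resonance(apple, apfel)
score_other = resonance(apple, other)
```

With this particular score, a global phase shift does not change the result, while a change in the relative phases between components does. Whether that is the right invariance for a translation pair is a design choice, not something this sketch settles.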
I am currently updating the paper. I am training hybrid models, and PRISM seems to work incredibly well for few-shot learning in hybrid models.