Hey everyone! I’m an independent researcher looking for technical feedback.
I’ve been developing an architecture that aims to decouple LLM scaling costs from vocabulary size (the O(V) output softmax) and sequence length (O(N^2) attention) using Stochastic Partition Estimation and Kronecker Sketching.
Core Mechanics:
MAXIS Loss: A stochastic partition estimator that samples a subset of the vocabulary and uses a "Ghost Logit" (dynamic variance estimation) to stand in for the missing probability mass of the unsampled tail. On a T4 it achieves a 17.5x speedup over the Liger Kernel with ~39% less VRAM.
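For readers who want a concrete picture of the ghost-logit idea, here is a minimal NumPy sketch. It is not the code from the repo. It assumes uniform negative sampling and a Gaussian moment-matching form for the ghost logit, log(n_tail) + mean + var/2, which is only one plausible reading of "dynamic variance estimation". The names `sampled_nll_with_ghost` and `logsumexp` are hypothetical.

```python
import numpy as np

def logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def sampled_nll_with_ghost(logits, target, k, rng):
    """Estimate -log softmax(logits)[target] from k sampled negatives.

    ASSUMPTION: the unsampled tail is summarized by one extra "ghost" logit,
    log(n_tail) + mu + var / 2, i.e. the log of n_tail * E[exp(z)] if the
    tail logits were Gaussian with the sample mean and variance.
    """
    V = logits.shape[0]
    # Uniformly sample k negatives without replacement, excluding the target.
    candidates = np.delete(np.arange(V), target)
    neg = rng.choice(candidates, size=k, replace=False)
    z_neg = logits[neg]

    terms = [logits[target:target + 1], z_neg]
    n_tail = V - 1 - k  # number of logits that were never looked at
    if n_tail > 0:
        mu, var = z_neg.mean(), z_neg.var()
        ghost = np.log(n_tail) + mu + 0.5 * var
        terms.append(np.array([ghost]))

    log_z = logsumexp(np.concatenate(terms))  # estimated log partition function
    return log_z - logits[target]

rng = np.random.default_rng(0)
logits = rng.standard_normal(20000)  # toy logits, one position
target = 123
exact = logsumexp(logits) - logits[target]
approx = sampled_nll_with_ghost(logits, target, k=512, rng=rng)
print(exact, approx)
```

In this toy setup only 513 of the 20,000 logits are read. With k = V - 1 the tail is empty and the estimate reduces to the exact loss. How well the Gaussian assumption holds for real LM logits is a separate question that this sketch does not address.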
RandNLA Attention: A bifurcated Top-K + Sketching approach that keeps throughput roughly flat (~35k tokens/s) as context length grows, with better NLL stability at longer contexts.
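To make the Top-K + Sketching split concrete, here is a toy single-query NumPy sketch. It is not the repo's implementation. It assumes the tail is compressed along the sequence axis (Linformer-style) with a random Kronecker-structured projection, and that the top-k keys are picked by exact score. The names `kron_sketch` and `topk_plus_sketch_attention` and all sizes are hypothetical.

```python
import numpy as np

def kron_sketch(X, S1, S2):
    """Compute np.kron(S1, S2) @ X without materializing the Kronecker product."""
    (m1, n1), (m2, n2) = S1.shape, S2.shape
    d = X.shape[1]
    Xr = X.reshape(n1, n2, d)  # row i * n2 + j of X becomes Xr[i, j]
    out = np.einsum("ai,bj,ijd->abd", S1, S2, Xr)
    return out.reshape(m1 * m2, d)

def topk_plus_sketch_attention(q, K, V, k, S1, S2):
    """One query: exact attention on the top-k keys, sketched summary of the rest."""
    N, d = K.shape
    # Toy selection: this scores every key, so it is still O(N) per query.
    scores = K @ q / np.sqrt(d)
    top = np.argpartition(scores, -k)[-k:]
    tail = np.ones(N, dtype=bool)
    tail[top] = False
    # Zero the top-k rows so they are not counted twice, then compress the tail
    # from N rows down to m1 * m2 summary rows.
    K_s = kron_sketch(K * tail[:, None], S1, S2)
    V_s = kron_sketch(V * tail[:, None], S1, S2)
    # Softmax over the k exact scores plus the m1 * m2 sketched scores.
    all_scores = np.concatenate([scores[top], K_s @ q / np.sqrt(d)])
    w = np.exp(all_scores - all_scores.max())
    w /= w.sum()
    return w @ np.concatenate([V[top], V_s])

rng = np.random.default_rng(0)
n1, n2, m1, m2, d = 32, 32, 4, 4, 16  # N = 1024 keys, 16 sketch rows
N = n1 * n2
# Gaussian factors scaled so that E[S^T S] = I for S = kron(S1, S2).
S1 = rng.standard_normal((m1, n1)) / np.sqrt(m1)
S2 = rng.standard_normal((m2, n2)) / np.sqrt(m2)
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
q = rng.standard_normal(d)
out = topk_plus_sketch_attention(q, K, V, k=8, S1=S1, S2=S2)
print(out.shape)
```

The point of the Kronecker structure is that the two small factors (4 x 32 each here) replace a dense 16 x 1024 sketching matrix, and the sketched keys and values can be computed once and shared across queries. A real implementation would also need a top-k selection step cheaper than scoring every key, which this toy leaves out.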
I have two draft technical reports with the formal math and ablation studies (validated on a 40M-parameter prototype).
Repo/Technical Reports: https://github.com/yousef-rafat/MaximusLLM