Simulating Partition Mass via Ghost Logits & Sketching Context via Causal Kronecker | EleutherAI | Page 1

silver kestrel Mar 17, 2026, 7:14 PM


Hey everyone! I’m an independent researcher looking for technical feedback.

I’ve been developing an architecture to decouple LLM compute cost from vocabulary size (the O(V) output softmax) and from sequence length (O(N^2) attention) via Stochastic Partition Estimation and Kronecker Sketching.

Core Mechanics:

MAXIS Loss: A stochastic partition estimator that uses a "Ghost Logit" (dynamic variance estimation) to simulate the missing probability mass of the unsampled tail. It achieves a 17.5x speedup over the Liger Kernel with ~39% less VRAM on a T4.

RandNLA Attention: A bifurcated Top-K + Sketching approach that keeps throughput flat (~35k tokens/s) as context length grows, with superior NLL stability at longer contexts.
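
For readers unfamiliar with sampled partition estimation, here is a minimal pure-Python sketch of the general idea behind the MAXIS description above. It is not taken from the repo or the technical reports. The function names (`log_sum_exp`, `exact_nll`, `ghost_nll`) are hypothetical, and the ghost term is an assumption: it treats the unsampled logits as Gaussian with the mean and variance of the sampled negatives, which may differ from the actual MAXIS estimator.

```python
import math
import random


def log_sum_exp(terms):
    """Numerically stable log(sum(exp(t)))."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def exact_nll(logits, target):
    """Full-softmax cross-entropy for one token: O(V) work."""
    return log_sum_exp(logits) - logits[target]


def ghost_nll(logits, target, k, rng):
    """Sampled cross-entropy: target + k sampled negatives + one ghost term.

    ASSUMPTION (not from the MAXIS report): the V - 1 - k unsampled logits are
    treated as Gaussian with the sampled negatives' mean and variance, so their
    expected total mass is (V - 1 - k) * exp(mu + var / 2).
    """
    vocab = len(logits)
    if not 2 <= k < vocab - 1:
        raise ValueError("need 2 <= k < V - 1")
    # Draw k distinct negatives in O(k): sample from V - 1 slots, skip the target.
    picks = rng.sample(range(vocab - 1), k)
    negatives = [logits[i if i < target else i + 1] for i in picks]
    mu = sum(negatives) / k
    var = sum((x - mu) ** 2 for x in negatives) / (k - 1)
    # One extra "ghost" logit stands in for the whole unsampled tail.
    ghost_logit = math.log(vocab - 1 - k) + mu + var / 2.0
    log_z = log_sum_exp([logits[target]] + negatives + [ghost_logit])
    return log_z - logits[target]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    logits[7] = 5.0  # pretend token 7 is the correct next token
    print("exact:", exact_nll(logits, 7))
    print("ghost:", ghost_nll(logits, 7, 256, rng))
```

The sampled version touches only k + 1 logits per token instead of V. How close it gets to the exact loss depends on how well the Gaussian assumption fits the real logit distribution, which is exactly the kind of thing the ablations would need to show.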

I have two technical reports/drafts with the formal math and ablation studies (validated on a 40M prototype).

Repo/Technical Reports: https://github.com/yousef-rafat/MaximusLLM

High-throughput long-context LLMs. Scaling context via RandNLA and massive vocab capacity through MAXIS Loss and Fisher-SVD. - yousef-rafat/MaximusLLM

little portal Mar 17, 2026, 8:07 PM


You should try scaling the model dimension. 40M is too small to give a meaningful signal.

silver kestrel Mar 17, 2026, 8:49 PM


I agree that in general 40M is too small for reasoning, but for an architecture PoC it works to prove that RandNLA throughput effectively decouples from sequence length and that MAXIS efficiency is inherent to the algorithm.
Regarding accuracy, the signal lies in the relative convergence delta: MAXIS recovers ~96.4% of the supervision signal of exact Cross-Entropy by simulating the partition mass, while RandNLA reaches an NLL 3.18 lower than standard GQA at 8K context.
If the compression or sampling were 'lossy' in a way that destroyed learning, we would see it as diverging loss curves even at 40M. Instead, the results suggest the architecture acts as a structural regularizer, maintaining superior semantic stability compared to the baseline.

tight barn Mar 17, 2026, 9:07 PM


silver kestrel I agree that in general 40M is too small for reasoning, but for an architecture ...

What do the last two sentences mean?
