interesting idea, real engineering effort from what i can see, but i think the eval might be too narrow
wikitext2 perplexity is useful, but it is not a strong proxy for downstream long-context reasoning, retrieval-heavy QA, instruction following, or multi-document use
needle in a haystack is easy to over-interpret: perfect retention of a distinctive token-like “needle” mainly proves the exact bank can preserve a sharp item under that test, not that the model handles messy real contexts well
main thing is that there is little evidence here on multi-hop reasoning, document QA, long dialogue, code, or instruction tasks
i think it's promising as a systems paper, but as for the claim to solve long context economically, i don't think it's quite there yet
further, the throughput story seems... mixed? it's good that the paper is honest that bounded prefill is much slower than baseline because bank updates are expensive. but your own table shows medium-cache prefill far below baseline, with decode also slower than baseline, and that matters a lot.
a bounded KV design is only really compelling if, in practice, at least one of these holds: memory savings unlock a deployment regime that was otherwise impossible, OR latency/throughput stays competitive enough to be worth it.
this paper clearly helps with the first, but does not yet clearly win on the second.
big elephant in the room tho, the quality cost is still pretty large
your bounded Mistral-7B result is 5.94 PPL vs 4.81 baseline, about +23.5%, which is NOT trivial.
for a memory-compression paper, maybe it's acceptable as an early result, but it's still a meaningful degradation.
your paper does frame the flatness across context lengths as a strength, which is fair, but the absolute gap remains
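for anyone who wants to sanity-check the gap, here is a minimal sketch of the arithmetic behind the +23.5% figure. the function and variable names are just mine for illustration, not anything from the paper. the log-ratio line assumes the usual definition of perplexity as exp of the mean per-token negative log-likelihood, in which case the PPL ratio translates directly into an average per-token loss gap in nats.

```python
import math


def relative_ppl_gap(bounded_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase of bounded over baseline, in percent."""
    return (bounded_ppl - baseline_ppl) / baseline_ppl * 100.0


def nll_gap_nats(bounded_ppl: float, baseline_ppl: float) -> float:
    """Per-token loss gap in nats, assuming PPL = exp(mean negative log-likelihood)."""
    return math.log(bounded_ppl / baseline_ppl)


# the two Mistral-7B perplexities quoted in this review
bounded, baseline = 5.94, 4.81

gap_pct = relative_ppl_gap(bounded, baseline)
gap_nats = nll_gap_nats(bounded, baseline)

print(f"relative PPL gap: +{gap_pct:.1f}%")
print(f"per-token NLL gap: {gap_nats:.3f} nats")
```

the nats view is sometimes easier to compare across models, since it doesn't depend on how low the baseline perplexity already is.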
also, some of your claims are projections
- projected 70b scaling
- 1000x compression at very long contexts
- consumer gpu deployment implications
those seem to be mostly extrapolations rather than demonstrated experimental results