#Peer Review/Feedback: Constrained Orthogonal Differential Attention - 9.5x KV Cache Compression


clever gull
#

Disclaimer: This is my own research. Looking to raise awareness and gather feedback and review from AI researchers and practitioners.
Abstract: We trained CoDA-GQA-L on Mistral-7B: a bounded KV cache mechanism achieving 9.5× memory compression (218 KB/layer) with only +23.5% PPL overhead. Uses learnable orthogonal differential attention + dual memory banks (exact landmark + EMA summary) compiled to 2 fused Triton kernels.

Results: 100% needle-in-haystack at 16K tokens, 5.7× lower bounded penalty than standard GQA, minimal context-length degradation (5.94 at 2K vs 5.95 at 4K).
Checkpoint and all code (56 passing tests) are open-sourced. Aimed at enabling long-context inference on constrained hardware.
Paper:
https://www.researchgate.net/publication/401306672_CoDA-GQA-L_Constrained_Orthogonal_Differential_Attention_with_Grouped-Query_Value-Routed_Landmark_Banks

wintry cedar
#

interesting idea, real engineering effort from what i can see, but i think the eval might be too narrow

wikitext2 perplexity is useful, but it is not a strong proxy for downstream long-context reasoning, retrieval-heavy QA, instruction following, or multi-document use
needle in haystack is easy to over-interpret, perfect retention of a distinctive token-like “needle” mainly proves the exact bank can preserve a sharp item under that test, not that the model handles messy real contexts well

main thing is that there is little evidence here on multi-hop reasoning, document QA, long dialogue, code, or instruction tasks

i think its promising as a systems paper, but as for the claim of solving long context economically, i dont think its quite there yet

further, the throughput story seems... mixed? its good that the paper is honest that bounded prefill is much slower than baseline because bank updates are expensive, but your own table shows medium-cache prefill far below baseline, with decode also slower than baseline. that matters a lot.

a bounded KV design is only really compelling in practice if at least one of these holds: the memory savings unlock a deployment regime that was otherwise impossible, OR latency/throughput stays competitive enough to be worth it.

This paper clearly helps with the first, but does not yet clearly win on the second.

big elephant in the room tho, the quality cost is still pretty large

your bounded Mistral-7B result is 5.94 PPL vs 4.81 baseline, about +23.5%, which is NOT trivial.
for a memory-compression paper, maybe it's acceptable as an early result, but it's still a meaningful degradation.
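
For reference, the +23.5% figure follows directly from the two perplexities quoted above. A minimal sketch of that arithmetic (the function name is just for illustration); since perplexity is the exponential of the mean per-token loss, the same gap can also be read as a difference in nats per token:

```python
import math


def relative_overhead(bounded_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase of the bounded model over the baseline."""
    return (bounded_ppl - baseline_ppl) / baseline_ppl


bounded_ppl = 5.94   # bounded Mistral-7B result quoted in the thread
baseline_ppl = 4.81  # baseline quoted in the thread

overhead = relative_overhead(bounded_ppl, baseline_ppl)
# PPL = exp(mean loss), so the loss gap is the log of the PPL ratio.
loss_gap_nats = math.log(bounded_ppl / baseline_ppl)

print(f"PPL overhead: {overhead:+.1%}")
print(f"loss gap: {loss_gap_nats:.3f} nats/token")
```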

your paper does frame the flatness across context lengths as a strength, which is fair, but the absolute gap remains there

also, some of your claims are projections

  1. projected 70b scaling
  2. 1000x compression at very long contexts
  3. consumer gpu deployment implications
those seem to be mostly extrapolations rather than demonstrated experiments, but id be more careful with phrasing
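
To make the "projection" point concrete: with a fixed-size bounded cache, the compression ratio against an ordinary KV cache grows linearly with context length, so very large ratios at long contexts are arithmetic rather than measurements. A rough sketch, assuming a Mistral-7B-like attention shape (8 KV heads, head dimension 128, fp16) and reading "218 KB/layer" as 218 × 1024 bytes. Those shape and unit choices are assumptions, the thread does not say at which context length the 9.5× figure is measured, and the paper's own accounting may differ, so this does not try to reproduce that number:

```python
# Assumed attention shape and precision (not stated in the thread).
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16

# Bounded cache size per layer, taken from the abstract (KB assumed to be 1024 bytes).
BOUNDED_BYTES_PER_LAYER = 218 * 1024


def baseline_kv_bytes_per_layer(num_tokens: int) -> int:
    """Unbounded cache: one key and one value vector per KV head per token."""
    return 2 * num_tokens * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def compression_ratio(num_tokens: int) -> float:
    """Ratio of unbounded to bounded per-layer cache size at a given context length."""
    return baseline_kv_bytes_per_layer(num_tokens) / BOUNDED_BYTES_PER_LAYER


for n in (2_048, 16_384, 131_072):
    print(f"{n:>7} tokens -> {compression_ratio(n):.1f}x")
```

Under these assumptions the ratio is a straight line in the token count, which is why a large headline ratio says little unless quality is also measured at that context length.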

#

finally

#

the training protocol is expensive and... somewhat awkward?

the two phase training is clearly necessary because cold-swapping bounded mode catastrophically fails

that's a nice thing to know, because it also means this is not a lightweight drop-in serving optimization. it's closer to a partial architectural retrofit requiring dedicated adaptation, which should be noted, and which in turn needs a lot more proven gains to justify in practice, or this just ends up being a nice-in-theory paper

as to specific technical thoughts and bits

#

in regards to CoDA, using a learnable orthogonal rotation instead of a second query projection sounds really neat
it reduces parameter overhead and keeps the “signal minus inhibitory stream” idea alive

BUT I would like to know:

how much of the gain comes from true differential selectivity, instead of just extra capacity / altered optimization dynamics / gating.

the factorial helps a bit, but a deeper mechanistic analysis would be nice
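
A toy sketch of the "signal minus inhibitory stream" idea discussed above, where the inhibitory query is an orthogonal rotation of the signal query rather than a separate projection. This is an illustration under assumptions, not the paper's implementation: it uses 2-D vectors, a single rotation angle `theta`, a scalar weight `lam`, and omits the usual 1/sqrt(d) scaling. All of those names and choices are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate(q, theta):
    """Apply a 2-D rotation (an orthogonal map, so the norm of q is preserved)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * q[0] - s * q[1], s * q[0] + c * q[1]]


def differential_weights(q, keys, theta, lam):
    """Attention from q minus lam times attention from the rotated query."""
    signal = softmax([dot(q, k) for k in keys])
    inhibitory = softmax([dot(rotate(q, theta), k) for k in keys])
    return [s - lam * i for s, i in zip(signal, inhibitory)]


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]

weights = differential_weights(query, keys, theta=math.pi / 2, lam=0.5)
plain = differential_weights(query, keys, theta=math.pi / 2, lam=0.0)
print(weights)
print(plain)  # lam = 0 recovers ordinary softmax attention
```

In a sketch like this, `lam = 0` and `theta = 0` are the natural ablation points for separating "differential selectivity" from plain extra capacity, which is the kind of isolation the question above asks for.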

#

also, the exact + summary banks, nice for hybridised sparse vs dense retention
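
A toy sketch of what an exact + summary split can look like: a fixed number of slots kept verbatim, with anything evicted folded into a single exponential moving average (EMA) vector. The class name, the score-based eviction rule, and the `decay` parameter are assumptions for illustration; the paper's actual routing and update rules may be different.

```python
class DualBank:
    """Bounded memory: a few exact entries plus one EMA summary vector."""

    def __init__(self, exact_slots, dim, decay=0.9):
        self.exact_slots = exact_slots
        self.decay = decay
        self.exact = []              # (score, vector) pairs stored verbatim
        self.summary = [0.0] * dim   # EMA over everything that was evicted

    def add(self, score, vector):
        self.exact.append((score, list(vector)))
        if len(self.exact) > self.exact_slots:
            # Assumed policy: evict the lowest-scoring entry into the summary.
            idx = min(range(len(self.exact)), key=lambda i: self.exact[i][0])
            _, evicted = self.exact.pop(idx)
            self.summary = [
                self.decay * s + (1.0 - self.decay) * v
                for s, v in zip(self.summary, evicted)
            ]

    def num_vectors(self):
        """Stored vectors never exceed exact_slots + 1, whatever the input length."""
        return len(self.exact) + 1


bank = DualBank(exact_slots=2, dim=1, decay=0.5)
for score, value in [(1.0, [1.0]), (5.0, [5.0]), (3.0, [3.0]), (0.5, [8.0])]:
    bank.add(score, value)

print(bank.exact)    # the two highest-scoring entries survive exactly
print(bank.summary)  # evicted entries are only visible through the average
```

The sparse vs dense trade-off shows up directly: a sharp item with a high score is retained exactly, while everything else is blurred into the summary.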

#

lf/hf handling for RoPE rotated keys: that low-frequency-only summary-key trick is at least intellectually coherent, but that part feels more speculative to me than the value-routing part. i'd like stronger empirical isolation of how much this specific design choice helps.

#

tldr:
needs better task coverage

  1. long context qa
  2. multidoc retrieval
  3. code completion with long repos/files
  4. long dialogue/instructions tasks
  5. maaaybe passkey/needle tasks that are less toy-like and more generally messy

also needs comparisons against strong practical baselines, not just baseline mistral but serious long-context-efficient baselines and serving-oriented methods whatever they may be

also, more scaling evidence, not just prelim smollm2 notes and arithmetic projections

finally, stronger latency accounting, this is what i feel most strongly about, separate prefill, decode, bank-update cost, training overhead, serving tradeoffs, whatever else, the memory part is pretty solid now but the latency bit is less so
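
One way to get the per-stage breakdown asked for above is to time each stage separately instead of reporting a single end-to-end number. A minimal sketch using only the standard library; `StageTimer` and `fake_work` are hypothetical stand-ins, and the stage names are just labels, not the paper's code:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time and call counts per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self):
        return {
            name: {
                "calls": self.calls[name],
                "total_s": self.totals[name],
                "mean_s": self.totals[name] / self.calls[name],
            }
            for name in self.totals
        }


def fake_work(n):
    """Placeholder workload standing in for a real model call."""
    return sum(i * i for i in range(n))


timer = StageTimer()
with timer.stage("prefill"):
    fake_work(50_000)
for _ in range(4):
    with timer.stage("decode"):
        fake_work(5_000)
    with timer.stage("bank_update"):
        fake_work(10_000)

for name, stats in timer.report().items():
    print(f"{name:12s} calls={stats['calls']} "
          f"total={stats['total_s']:.4f}s mean={stats['mean_s']:.4f}s")
```

On a GPU, kernel launches are asynchronous, so a real harness would need to synchronize the device before reading the clock at each stage boundary (for example `torch.cuda.synchronize()` in PyTorch), otherwise the stage times get attributed to the wrong place.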

#

regardless, this is a strong effort

#

sorry if ive been rambling around and around