#Attention Is All You Need, But All You Can't Afford | Hybrid Attention


dull compass
#

https://codeberg.org/JohannaJuntos/Sisyphus
Building a small Rust-focused LM from scratch in PyTorch. Not a finetune — byte-level, random init, Rust-heavy corpus.
The run

25.6M params • 512 ctx • 173.5MB corpus • 30k steps • RTX 4060 Ti 8GB
Train loss 0.5834 / Val loss 0.8217 / Perplexity 2.15
286.6 tok/s — 51.47x vs full attention
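
For reference, a minimal sketch of how the loss numbers above relate to perplexity. It assumes the losses are mean cross-entropy per byte in nats (the PyTorch default for cross-entropy); the post does not state this. Under that assumption exp(0.8217) comes out near 2.27, a little above the posted 2.15. The post does not say how 2.15 was computed, so it may come from a different checkpoint or eval batch. The function names are made up for illustration.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Per-byte perplexity from a mean cross-entropy loss in nats (assumed unit)."""
    return math.exp(loss_nats)


def bits_per_byte(loss_nats: float) -> float:
    """The same loss expressed in bits per byte."""
    return loss_nats / math.log(2)


# Loss values copied from the post; everything derived from them is arithmetic only.
for name, loss in [("train", 0.5834), ("val", 0.8217)]:
    print(f"{name}: loss={loss:.4f}  ppl={perplexity(loss):.2f}  "
          f"bits/byte={bits_per_byte(loss):.2f}")

print(f"val - train gap: {0.8217 - 0.5834:.4f} nats")
```

Because the vocabulary is raw bytes, these are per-byte figures and are not directly comparable to per-token perplexities from models with a subword tokenizer.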

Architecture
Byte-level GPT decoder (vocab 256). Each layer uses HybridAttention: local windowed attention + GRU-like recurrent path + learned gate mixing both.
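
A rough PyTorch sketch of what a layer like this could look like, going only by the one-line description above. It is not the code from the Sisyphus repo: the class name, the sizes, the use of `nn.GRU` for the "GRU-like" path, and the sigmoid gate over the concatenated paths are all assumptions.

```python
import math

import torch
import torch.nn as nn


class HybridAttentionSketch(nn.Module):
    """Illustrative only: local causal attention + GRU path + learned gate."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, window: int = 16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.window = window  # assumed window size, not taken from the post
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)  # stand-in recurrent path
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h, dh = self.n_heads, d // self.n_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, n, d) -> (b, h, n, dh)
        q, k, v = (t.view(b, n, h, dh).transpose(1, 2) for t in (q, k, v))

        # Banded causal mask: position i sees positions j with 0 <= i - j < window.
        # A dense n x n mask is used for clarity; it does NOT give the O(n*W)
        # cost, which needs a kernel that only computes the band.
        idx = torch.arange(n, device=x.device)
        dist = idx[:, None] - idx[None, :]
        keep = (dist >= 0) & (dist < self.window)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(dh)
        scores = scores.masked_fill(~keep, float("-inf"))
        local = torch.softmax(scores, dim=-1) @ v
        local = local.transpose(1, 2).reshape(b, n, d)

        # Recurrent path carries information from beyond the window.
        recurrent, _ = self.gru(x)

        # Learned per-feature gate mixes the two paths.
        g = torch.sigmoid(self.gate(torch.cat([local, recurrent], dim=-1)))
        return self.out(g * local + (1 - g) * recurrent)


layer = HybridAttentionSketch()
print(layer(torch.randn(2, 32, 64)).shape)
```

Both paths are causal, so the output at a position depends only on that position and earlier ones, which is what a decoder needs.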
What actually moved the needle
Corpus expansion from 31MB → 173MB (top 500 crates) mattered more than any architectural change.
Inference
Full attention: 5.6 tok/s | HybridAttention + KV cache: 286.6 tok/s
No quality loss. Fast enough for real interactive use on consumer hardware.
Disabled all experimental tricks (gradient quantization, activation compression, selective backprop). Clean baseline was enough.
Happy to answer questions on the hybrid attention design or corpus construction.

#

TLDR 51x faster inference
Because of the hybrid architecture: linear into quadratic into linear

dull compass
#
JohannaAlmeida

I have been building a small Rust-focused language model from scratch in PyTorch. This is not a finetune. It is byte-level, trained from random initialization on a Rust-heavy corpus assembled here: https://codeberg.org/JohannaJuntos/Sisyphus
Model and training setup
The model has 25.6M parameters with a 512 context length. It uses a byte level voc...

dull compass
#

HN FRONT PAGE

#

YO WTF

dull compass
#

not bad

#

was on the front page for a bit

quaint iron
#

umhhhhh, don't you think you kinda overfitted your model? how did you get the perplexity to ~2 and loss to below 0.8?

that's wayyyy too low

dull compass
#

Probably but when I buy a better gpu I will test at a bigger scale

#

The compute gains on inference check out if you work through the big-O complexity math

#

Full attention O(n²): 17.96s / 5.6 tok/s

HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
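
A quick way to sanity-check those figures, as a sketch. The timing and tok/s numbers are the ones quoted above. `WINDOW` and `STATE_DIM` are placeholders, because W and D are not given in this thread, so the cost ratios only show how the two formulas scale with n and are not a measurement.

```python
def full_attention_cost(n: int) -> int:
    """Leading term of full attention: n^2 query-key pairs."""
    return n * n


def hybrid_cost(n: int, window: int, state_dim: int) -> int:
    """Leading terms quoted above: n*W for the local window, n*D for the recurrent path."""
    return n * window + n * state_dim


WINDOW = 64      # placeholder, not from the post
STATE_DIM = 64   # placeholder, not from the post

for n in (512, 2048, 8192):
    ratio = full_attention_cost(n) / hybrid_cost(n, WINDOW, STATE_DIM)
    print(f"n={n}: full/hybrid cost ratio = {ratio:.1f}")

# Numbers quoted in the thread.
full_s, hybrid_s = 17.96, 0.35
full_rate, hybrid_rate = 5.6, 286.6

speedup_from_time = full_s / hybrid_s
speedup_from_rate = hybrid_rate / full_rate
tokens_full = full_s * full_rate
tokens_hybrid = hybrid_s * hybrid_rate

print(f"speedup from times: {speedup_from_time:.2f}x")
print(f"speedup from rates: {speedup_from_rate:.2f}x")
print(f"tokens generated:   {tokens_full:.0f} vs {tokens_hybrid:.0f}")
```

Both runs work out to roughly 100 generated tokens, and both ratios land a little above 51x. The small difference from the posted 51.47x is consistent with the times and rates having been rounded. The measured number also includes the KV cache, so it reflects more than the asymptotic terms alone.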

#

Was on the hacker news post

#

Also just noticed second place in Google seo

#

Next to hugging face and a PhD holder

#

I only have a bachelor's in software engineering funny af

#

Gemini is using my hacker news post for summary

#

That number is from my experiments

quaint iron
#

can you decrease the epochs or steps? if you upload your stats for every 5-15 ish steps to my DM or here, i could help you.

you don't need stronger hardware OR have to scale up to fix this 🙂
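
One common way to act on that suggestion is early stopping on the validation loss: log train and val loss at regular intervals and keep the checkpoint where val loss was lowest. A minimal sketch follows; the history values are toy numbers invented for the example and are not from the actual run, and `best_step` is a hypothetical helper.

```python
def best_step(history, patience=3):
    """Return (step, val_loss) at the lowest validation loss seen so far.

    Stops scanning once val loss has failed to improve for `patience`
    consecutive evaluations, which is the usual early-stopping rule.
    """
    best = None
    stale = 0
    for step, _train_loss, val_loss in history:
        if best is None or val_loss < best[1]:
            best = (step, val_loss)
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


# Toy (step, train_loss, val_loss) log, invented for illustration only.
toy_history = [
    (0, 5.50, 5.50),
    (1000, 1.20, 1.30),
    (2000, 0.90, 1.00),
    (3000, 0.80, 0.95),
    (4000, 0.70, 0.97),
    (5000, 0.65, 0.99),
    (6000, 0.60, 1.02),
]

print(best_step(toy_history))
```

In the toy log, train loss keeps falling while val loss turns around after step 3000, which is the pattern that would point to overfitting.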

#

omds, these discord autofixes

dull compass
#

I took a break from programming tbh

#

Maybe in some days

#

I will bet you some vc backed Chinese ai lab is reverse engineering this as we speak

#

My Triton kernel changes

#

Someone will steal the implementation and repackage it

#

It is what it is

#

Happy to help the field anyways

#

And the source of truth has been documented as me thru my hacker news and reddit posts

quaint iron
dull compass
#

Not a joke lmaoo

#

Someone is probably doing it rn

#

😭

#

I changed how attention works it's probably worth some money

#

But for me it's just research

#

I'm not a business person

quaint iron
#

ig... same with me. You should probably license your work (always). use a restrictive license or apache 2.0 or MIT (if opensource). so if you did it first, then you can defend yourself + get some money from them for stealing content without consent

quaint iron
#

ohh yeah, gg...

#

never do MIT. always do apache 2.0 unless... if you want people to do what you said

dull compass
#

It's fine

#

If they steal it it's because it's worth stealing

quaint iron
#

yeah!

dull compass
#

I almost died doing this push

#

My health is more important than my intellectual property lmao

quaint iron
#

you good?

dull compass
#

Nope

quaint iron
#

what happened?

dull compass
#

Nervous system died

#

My legs are shot too

#

Have to do stretches

quaint iron
#

ohh boy...
wth💀
did you get medications? don't say it here.

dull compass
#

No just stretches and walking

#

All natural

#

I'm just a solo researcher

#

I had zero institutional backing

#

Until the hn front page

#

Google front page

#

Now I do

#

I'm not a company

#

I'm a single human being

quaint iron
#

well, same, don't overwhelm yourself. and take small steps ig.

dull compass
#

David vs Goliath

dull compass
#

For some days

#

Or weeks