https://codeberg.org/JohannaJuntos/Sisyphus
Building a small Rust-focused LM from scratch in PyTorch. Not a finetune — byte-level, random init, Rust-heavy corpus.
The run
25.6M params • 512 ctx • 173.5MB corpus • 30k steps • RTX 4060 Ti 8GB
Train loss 0.5834 / Val loss 0.8217 / Perplexity 2.15
286.6 tok/s — 51.47x vs full attention
Architecture
Byte-level GPT decoder (vocab 256). Each layer uses HybridAttention: local windowed attention + GRU-like recurrent path + learned gate mixing both.
What actually moved the needle
Corpus expansion from 31MB → 173MB (top 500 crates) mattered more than any architectural change.
Inference
Full attention: 5.6 tok/s | HybridAttention + KV cache: 286.6 tok/s
No quality loss. Fast enough for real interactive use on consumer hardware.
Disabled all experimental tricks (gradient quantization, activation compression, selective backprop). Clean baseline was enough.
Happy to answer questions on the hybrid attention design or corpus construction.