#general
1 messages · Page 1 of 1 (latest)
This is cool! Part of me wants to write up the long list of things that kind-of worked but didn’t move the needle on the contest 🔥
I had this odd architecture that was inspired by fast weights PKM, but with more of a hash/ngram prior
I’m curious if anyone else has some favorite failures 😅
I’ll do a write up a bit later on and post it to git. A lot of the stuff I worked on improved BPC early on, but didn’t necessarily gel with the more optimized baselines. I really liked watching all the leaderboard approaches stream in 🔥
Tiny readme on one of the failing approaches that I liked working on: fast weight hash memory. I took Instant NGP + fast weight PKM as priors, built something out with Claude. I hoped fast weights with a multiscale ngram prior would augment context + provide better local language modeling. Early results showed modest BPB improvement, but nothing on par with stronger ngram approaches or later optimized contest submissions. I think the n-gram approach might have limited the benefits of adaptive/online memory — could be cool to re-explore the design sometime.
curious what other approaches got tried and left out of submission, it was a lot of fun doing this
https://youtube.com/shorts/LEEucshktFg?si=Wm_M3FaULDDB42jU
this reminds me of what some of these approaches like PP and deed and gram we’re trying to do use a cheaper model where the next token is more likely easier to predict
Speculative decoding makes LLM inference 2-3× faster with identical output — the trick has been public since 2022, and Google just baked it into Gemma 4 as architecture (MTP drafter). Here is how two language models running side by side beat one running alone.
🚀 Want to learn agentic coding with live daily events and workshops?
Check...
Maybe N gram is useful after all
Can llama.cpp do speculative decoding?
Yes. llama.cpp supports speculative decoding, including a small draft model that predicts ahead of the main model, plus no-extra-model n-gram speculative methods. Its docs explicitly describe llama-server speculative decoding and note that n-gram pattern matching can be useful for code rewriting because repeated patterns often appear in generated text.
For your use case, llama.cpp is good for:
- local GGUF models
- quantized models
- simple OpenAI-compatible serving
- speculative decoding with a smaller draft model
- coding-agent clients like Cline, Roo, Continue, VS Code chat, etc.
It is weaker for:
- high-throughput multi-user serving
- advanced tool-calling fidelity
- long-context scheduling
- MTP/EAGLE research workflows
- diffusion-model integration
Now byte level models are in the News , using speculative decoding