#RWKV-papers
RWKV 7 paper is already complete
i mean what about rwkv 8? is that going to happen in the next year?
yeah rwkv8 paper = early next year
1 year release cycle sounds chill
11-month release cycle:
RWKV-4: 2305.xxxx
RWKV-5/6: 2404.xxxx
RWKV-7: 2503.xxxx
RWKV-8: 2602.xxxx?
Personally I like this almost-a-year-but-not-quite release cycle
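The 11-month figure can be checked from the arXiv prefixes listed above, since arXiv identifiers begin with YYMM. A minimal sketch; the RWKV-8 entry is only the guess from the chat, and the elided ".xxxx" suffixes are left out:

```python
# arXiv identifiers start with YYMM; the ".xxxx" part is elided in the chat.
arxiv_prefixes = {
    "RWKV-4": "2305",
    "RWKV-5/6": "2404",
    "RWKV-7": "2503",
    "RWKV-8 (guess from the chat)": "2602",
}

def months_since_2000(yymm: str) -> int:
    """Convert a YYMM prefix into a running month count."""
    year, month = int(yymm[:2]), int(yymm[2:])
    return year * 12 + month

stamps = [months_since_2000(p) for p in arxiv_prefixes.values()]
gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
print(gaps)  # [11, 11, 11]
```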
Something big is coming
From openai?
From RWKV for sure!
honestly it feels like less than 11 months
oh - because 5/6 was merged together in 1 paper... definitely felt like less than 11 months per version haha
are they using hybrid / rnn?
It's possible that this just helps focus attention, because maybe the model expects the instructions to be at either position and this reduces the chance that the model will "miss" them.
Also the "above" context makes sense not just in the case of RNNs, because causal attention can only attend to previous tokens, so having instructions at the beginning will be the difference between the model having a superposition of possible instructions vs. instruction information propagating from the start.
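The point about causal attention can be illustrated with a toy mask. This is a sketch only: the sequence length and the instruction positions are made-up values, and it shows nothing more than which positions are allowed to attend to the instruction tokens.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def positions_seeing(mask, span):
    """Positions that can attend to at least one token inside `span`."""
    return [i for i, row in enumerate(mask) if any(row[j] for j in span)]

n = 8                      # toy sequence length (illustrative)
mask = causal_mask(n)
instr_first = range(0, 2)  # instruction placed at positions 0-1
instr_last = range(6, 8)   # instruction placed at positions 6-7

print(positions_seeing(mask, instr_first))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(positions_seeing(mask, instr_last))   # [6, 7]
```

With the instruction last, positions 0 to 5 are encoded without any access to it, which is the "superposition of possible instructions" case described above.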
crazy if true, hope they open source 4.1
I believe they are using hybrid
gpt4o likely already hybrid
What evidence do you have for this?
See Blink's evidence
it's most likely, if anything, a local-global hybrid
true
so just something like sliding window attention?
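For readers unfamiliar with the term, here is a rough sketch of how a sliding-window (local) mask differs from a full causal (global) mask. The sizes are arbitrary, and nothing here reflects what any particular production model does; a "local-global hybrid" is assumed to mean some layers use one mask and some the other.

```python
def global_causal_mask(n):
    """Every position attends to itself and all earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    """Position i attends only to the last `window` positions, itself included."""
    return [[(i - window) < j <= i for j in range(n)] for i in range(n)]

n, window = 6, 3  # illustrative sizes
visible_global = [sum(row) for row in global_causal_mask(n)]
visible_local = [sum(row) for row in sliding_window_mask(n, window)]

print(visible_global)  # [1, 2, 3, 4, 5, 6]  grows with position
print(visible_local)   # [1, 2, 3, 3, 3, 3]  capped at the window size
```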
This is not meaningful evidence.
Just some speculation.
We are not writing an article, just chatting, so no need to ask evidence for every claim.
let's test #off-topic message
@obsidian quest @void quartz I'm trying to track down Guanyu Song, do you have his email address / contact details?
Do you happen to know the discord handle? is it @steady ether ?
@misty igloo
He is @steady ether
He is Guangyu, not Guanyu
(I already DM'd Stella contact info)
https://colmweb.org/dates.html
RWKV-7 paper rebuttal will start next week. Stay tuned
ughhh not fun timing for me
I will do.
thanks! I will be able to work on it too, I just have to move (physically) and do the rebuttal for RADLADS at the same time
The reviews will be out in a few hours.
I strongly believe that our paper will get high scores
(If I were the reviewer, I would give a score between "good accept" and "highlight")
https://github.com/Benjamin-Walker/structured-linear-cdes let's fix their rwkv7 result
Reviews are out!
Score: 6,8,8,8
That probably means 100% accept and top 2% of all papers and top 5% of the accepted papers
rebuttals seem pretty easy too, mostly clarifications and a few ablations
We have them in the appendix
good job guys
Yeah I think we're solid here and mostly need to point out the existing ablation studies in the appendix to reviewer UGzf to address their concern #1
Fantastic job, everybody! I knew the paper was really solid, but it's still nice to see it confirmed via double blind review
good job!
likewise for reviewer QvXt concern #2, but it seems like they may want even more
I feel like there's no way to address #2. Sorry, I didn't see you meant the other reviewer
yeah but QvXt gave us an 8 already
the only score we really need to improve is UGzf, and even if they keep it we're still fine
oh fair enough
The reviewers seem not to have read the paper carefully
it's a long paper with a million appendices
I agree
I think reviewer GjhW's questions are solid and should be addressed first
I've created a google doc where we can work on the responses https://docs.google.com/document/d/17BCQcG5gH28fqmipBxjM2Z-vlf37I6YlJfE6776gSqk/edit?usp=sharing
@bronze frost I think we need some clarification on kernel speed benchmarks. Could you please test kernels on specific settings encountered in training, such as (12,64,4096)?
i updated figure3 for both arxiv and colm paper, as requested by 1 reviewer
with or without the black rectangle ?
i prefer without unless you can add "see insert" text or the like
"see left" i guess
then without, more similar to the original and addresses the points from reviewer
https://github.com/HazyResearch/zoology
certainly a wrong implementation of rwkv7. let's fix it
@gusty condor @steady ether
corroborated by data: https://bsky.app/profile/colmweb.org/post/3lq6acxagpk2w
https://github.com/HazyResearch/zoology/issues/34
Firstly, RWKV-7 state_size is exaggerated by 4x and 16x for d_model=128 and d_model=256
Here in the state_size function: https://github.com/HazyResearch/zoology/blob/main/zoology/mixers/rwkv7.py the value should be self.num_heads * self.head_dim * self.head_dim. So RWKV-7 state_size is exaggerated to...
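A small sketch of the corrected formula quoted above, together with one way the 4x and 16x figures could arise. The head size of 64 and the "inflated" formula are assumptions made for illustration only; they are not taken from the zoology source.

```python
def rwkv7_state_size(num_heads: int, head_dim: int) -> int:
    """Per-layer state: one head_dim x head_dim matrix per head (formula from the issue)."""
    return num_heads * head_dim * head_dim

def hypothetical_inflated_size(num_heads: int, d_model: int) -> int:
    """ASSUMPTION: illustrative mistake that uses d_model where head_dim belongs."""
    return num_heads * d_model * d_model

HEAD_DIM = 64  # assumed head size, not stated in the chat

ratios = []
for d_model in (128, 256):
    num_heads = d_model // HEAD_DIM
    correct = rwkv7_state_size(num_heads, HEAD_DIM)
    inflated = hypothetical_inflated_size(num_heads, d_model)
    ratios.append(inflated // correct)
    print(d_model, correct, inflated)

print(ratios)  # [4, 16]
```

Under these assumptions the overcount equals num_heads squared, which matches the 4x and 16x reported for d_model=128 and d_model=256.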
https://arxiv.org/pdf/2505.23735
They got RWKV-7 formula wrong.
and they are using wrong RWKV-7 state_size from zoology, in figure 7
All done. I think it's time to reply to the reviewers.
Amazing work!
The other funny thing is that it seems that the lowest score review was disregarded due to review quality, so we ended up with maybe the equivalent of an amazing 8,8,8 for scores!
so how is the camera ready coming along? (is it at all?)
Not hurrying, deadline is August 7, 2025.
@iron parrot do you have the pg19 code used to run the models and produce the plots in the Eagle and Finch paper?
I believe Smerky ran the pg19 test for the Eagle and Finch paper, while I handled the one for the Goose paper.
I have the code to reproduce the results from the Goose paper. Do you need it?
yup thx
we can work on a rwkv7s paper
seamless upgrading rwkv7-g1a 0.1b
+de (orange) vs +de+dea (blue, a bit difficult for a trained model to utilize in the beginning, then works)
x axis is billion tokens?
yes
DE is basically a form of hash routing moe and I think we can make it a lot more efficient, see my comment in the rwkv discord #1109810049607532555 message
You have not unveiled the newest design of DE and DEA
4 days to go!
I'm editing the radlads paper atm but I should be done with that and can start on the rwkv paper soon too
We need to make some changes to satisfy things we promised the reviewers we'd do, and reorg a bit for the authors and new 10p limit
did a first pass on all that...
```
total 191.034624 M
activated 140.702976 M
read 768 numbers per token (embed)
rwkv7+DE 0.1b
total 997.852416 M
activated 142.226688 M
read 13056 numbers per token (embed+DE)
rwkv7+DE+DEA 0.1b
total 1806.753024 M
activated 145.833216 M
read 25344 numbers per token (embed+DE+DEA)
state size = 589824 + 768 * ctxlen = 768 * (768 + ctxlen)
```
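The arithmetic in the numbers above can be sanity-checked with a few lines. This is a sketch: reading 768 as the model width of the 0.1b model is an assumption, and the configuration names are just labels for the three blocks of figures.

```python
D_MODEL = 768  # assumption: the 768 in the chat is the model width

def state_size(ctxlen: int) -> int:
    """Fixed part (589824) plus 768 numbers per context token, as written in the chat."""
    return 589824 + D_MODEL * ctxlen

# The two forms of the state-size expression agree because 589824 == 768 * 768.
assert 589824 == D_MODEL * D_MODEL
assert state_size(4096) == D_MODEL * (D_MODEL + 4096)

# Numbers read per token, expressed as multiples of 768.
reads_per_token = {"embed": 768, "embed+DE": 13056, "embed+DE+DEA": 25344}
multiples = {name: n // D_MODEL for name, n in reads_per_token.items()}
print(multiples)  # {'embed': 1, 'embed+DE': 17, 'embed+DE+DEA': 33}

# Share of parameters activated per token, in millions (total, activated).
params_m = {
    "first block": (191.034624, 140.702976),
    "rwkv7+DE": (997.852416, 142.226688),
    "rwkv7+DE+DEA": (1806.753024, 145.833216),
}
shares = {name: round(act / total, 2) for name, (total, act) in params_m.items()}
print(shares)
```

So DE and DEA each add 16 x 768 numbers read per token, while the activated parameter count stays nearly flat as the total grows.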
@here Anyone know if there is anything preventing us from submitting the camera ready to COLM? If there is, please let me know so we can fix it!
don't think @here works, gonna have to ping individually
Okay, well I have submitted the current version as our camera ready. I'll attach the pdf here (but if you're an author you should be able to obtain it from openreview, too!). If people think there are issues please bring them to my attention immediately!
@fresh mulch @gusty condor @iron parrot @obsidian quest @rose mango @tropic minnow @brisk bronze @paper dove @sonic horizon @steady ether @crystal hull @hushed orchid @bronze frost @keen tartan sorry, not sure what other author handles I'm missing - please make sure everything is cool with the pdf above, or look at the current version via your openreview account. Final camera ready submission ends Thursday.
what's with the gap here? i assume this is the superscript b but it's way more noticeable in the third section than the fourth
seeing as we have some space left i wonder whether we can bring figure 7 back up next to figure 4: they were once next to each other and we moved it to the appendix on space concerns
minor nits though generally lgtm
This is probably a better use of the space actually ^
Two lines, actually
good catch!
do we want to include references to G0 here?
not sure.. I guess the problem is we can't really include all that stuff in the paper
and yet, it probably won't go in any other paper until RWKV-8
there could be a paper on using state tuning for RLHF and RLVR
Then there should be RWKV fast batch inference and HF-compatible API (No FLA)
are there any thoughts on using RWKV for vision? there is a recent paper that got an ICLR spotlight for vision-RWKV, but i assume it was an independent endeavor by the authors and not from here. any thoughts on improving it or finding a novel direction in the vision domain?
please check https://rwkv.com for 100+ papers
Hmm RADLADS and RWKV-7 are both in poster session 4. @tropic minnow and I could probably split those up since we're both on both papers, but who else is going to COLM who was an author on RWKV-7?
I am, but am not sure how much I would be able to contribute
Maybe we could negotiate with the org to have both of them placed next to each other? lol
Sounds good - could you reach out to them to ask?
I don't want either of us to have to miss out on either poster!
Could even ask to switch poster sessions at that rate
Good!
Sadly, I cannot go to COLM. It does not make sense to me to travel 10,000 km and skip two classes for an NLP conference, given that I am a math student. My advisor doesn't agree with that either.
Yeah it's a lot of travel. I'm sad that you won't be there though - it would have been fun to meet up in person!
As RNNs start to gain momentum, I will share a framework to improve RNNs (RWKV-8 and beyond), or, how to write 100 architecture papers. Hope it could be useful for researchers interested in the field: https://t.co/UdmOSudvu0
Thanks for sharing such a comprehensive framework with the community!
However, I don't understand why larger state is an improvement as it's just moving along the Pareto frontier.
yes i won't do this, but most researchers seem to like it, such as headsz=256
Larger state is cool if it's dynamically allocable
and can combine 1 + 2
have we got a poster ready?
Not finished yet, I'll send you a link if you'd like to help work on it!
@obsidian quest I'm not sure if we can fit it, but if we show any newer RWKV-7 results or DE/DEA preview on the COLM poster, what would you like those to be?
let's show G1a results first?
RWKV-7 Poster presentation at COLM went great last night. A bunch of people were excited and told me it was the best thing they'd seen at the conference!
The RADLADS poster session is tonight.
That's awesome feedback! I'm quite curious what the interest level in the complexity theory stuff was
Mixed... People I met initially didn't believe it and then got convinced
I put a short proof sketch on the poster
That's curious because I was thinking mostly "did people care"
Did you get more theory-oriented people showing up?
Some people said well obviously transformers are more expressive
So I got to retort with "we prove the opposite"
Not really no
There was one group who had some stuff on the topic in a new paper but I didn't catch which it is
Ah that's more what I expected
and RWKV-8 is the genuine transformer killer: https://x.com/BlinkDL_AI/status/1975922536492716103
please mention this too https://x.com/BlinkDL_AI/status/1975946959715471656 RNN magic
RWKV7 7.2B fp16 decoding on 5090 can reach 10000+ token/s now (bsz960, and 9000+ token/s for bsz320). Always const speed & VRAM because it's RNN. Try https://t.co/E8cfZH64nO in https://t.co/oMxIrwVVEN
Let's do a real scaling laws suite for this architecture
Retrieval Oriented/Optimized State/Slot/Sparse Attention
given the timing likely related to DeepSeek V3.2 Attn
RWKV-7 G1a 2.9B more evals: https://t.co/X2R2f6EeRB MMLU Pro 42% (+CoT), GSM8K 77%, MATH 50%. Note this is a base model, no mid-training, no post-training. I just add everything to pretraining dataset.
what do you mean by adding everything to the pre-training dataset
reasoning/instruction/chat data, not test set
ah makes sense
all 4 words wrong
anyone from this group who went to COLM gonna write a blog post afterwards?
now if i had to guess I'd say it's just python simulating induction head.
but i don't know the acronym
you are correct. the key is to make it work for more scenarios
Hi guys, we were just thinking: in the general case, even a soft version could be built, using a partition function where prefix length becomes the energy level, so it is a weighted average instead of a single option; this would recover the current discrete formulation at temperature=0
This shows there is a nice connection between attention and this variant: attention uses the inner product as the similarity function (and one would hope this becomes contextual over the layers), whereas we would operate directly on token identities and subsequences
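One way the soft version described above could look, as a toy sketch. It assumes the discrete rule is "find the earlier position whose preceding tokens share the longest suffix with the current context, and emit the token found there", which the chat does not spell out. The function names, the tie-breaking rule, and the choice to weight by exp(match_length / temperature) are all illustrative assumptions.

```python
import math

def suffix_match_len(seq, j, t):
    """Length of the longest common suffix of seq[:j] and seq[:t] (requires j < t)."""
    n = 0
    while n < j and seq[j - 1 - n] == seq[t - 1 - n]:
        n += 1
    return n

def soft_next_token(seq, vocab_size, temperature):
    """Distribution over the next token; match length plays the role of (negative) energy."""
    t = len(seq)  # needs at least two tokens of context
    cands = list(range(1, t))  # seq[j] is the candidate continuation
    lengths = [suffix_match_len(seq, j, t) for j in cands]
    longest = max(lengths)
    if temperature == 0:
        # Hard limit: keep only the most recent longest match (tie-break is an assumption).
        best = max(j for j, n in zip(cands, lengths) if n == longest)
        weights = [1.0 if j == best else 0.0 for j in cands]
    else:
        # Boltzmann weights, shifted by the maximum for numerical stability.
        weights = [math.exp((n - longest) / temperature) for n in lengths]
    z = sum(weights)  # the partition function
    dist = [0.0] * vocab_size
    for j, w in zip(cands, weights):
        dist[seq[j]] += w / z
    return dist

seq = [1, 2, 3, 1, 2]  # the suffix "1 2" was followed by 3 earlier
hard = soft_next_token(seq, vocab_size=4, temperature=0)
soft = soft_next_token(seq, vocab_size=4, temperature=1.0)
print(hard)  # [0.0, 0.0, 0.0, 1.0]
print(soft)
```

At temperature 1 the mass is spread over every earlier token, with the longest match still dominating; as the temperature shrinks the distribution collapses onto the single longest match.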
Does anyone want to write a short paper on this connection and maybe try a few experiments (language modelling or synthetic tasks) ?
another Q is the expressivity of RWKV7+ROSA => will it be practically (not limited by float precision) Turing complete, if we allow some CoT
I had also drawn the comparison to a discretized version of linear attention here, maybe also helps think about how these discrete methods relate to continuous ones
@tropic minnow
I'm definitely up to do some experiments
RWKV8 ROSA training demo - the first serious neurosymbolic LM? for a new era in AI. Code: https://t.co/j0eFQDISvu
Would the associative operator just be using the values from later tokens, and defaulting to earlier ones if empty?
not sure what you mean about associative operator or earlier or later tokens... this just writes to and retrieves from individual slots in an array
pseudocode for recurrent version was shown, prefill/training would be implemented slightly differently
In the simplest case of linear attention, the associative operator would be adding the earlier and later matrices together. It lets you parallelize along the sequence dimension for training and prefill
This kind of thing is what Iβm referring to: https://docs.jax.dev/en/latest/_autosummary/jax.lax.associative_scan.html
sure, this is associative as well.. instead of addition you could consider the operator being set(a,b) = b if b>0 else a, applied to each int in a vector of ints, where zero is a special sentinel value meaning that argmax did not choose that slot
I think that's what you meant in your original message too
(but the idea was not to construct a real parallelizable machine, but rather to create a theoretical stepping stone to ROSA based on linear attention)
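The set(a,b) operator from the exchange above can be checked for associativity by brute force. This is a pure-Python sketch with made-up slot values; `itertools.accumulate` runs sequentially, and associativity is the property that would let a parallel scan (such as the `jax.lax.associative_scan` linked above) compute the same prefix results.

```python
from itertools import accumulate, product

def set_op(a, b):
    """Elementwise: take the later value unless it is the 0 sentinel ("no write")."""
    return tuple(y if y > 0 else x for x, y in zip(a, b))

# Brute-force associativity check over all 2-slot vectors with values in {0, 1, 2}.
values = list(product(range(3), repeat=2))
is_associative = all(
    set_op(set_op(a, b), c) == set_op(a, set_op(b, c))
    for a in values for b in values for c in values
)
print(is_associative)  # True

# Per-step writes into two slots; 0 means "this step did not write to the slot".
writes = [(5, 0), (0, 7), (0, 0), (9, 0)]
states = list(accumulate(writes, set_op))
print(states)  # [(5, 0), (5, 7), (5, 7), (9, 7)]
```

For plain linear attention the same scan structure applies with addition of matrices in place of `set_op`.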
RWKV8 ROSA simply scales, producing mysterious new languages. Training small LMs soon. Code: https://t.co/j0eFQDJql2
RWKV7+ROSA 1M params solving 40 digits +/- with 99% digit accuracy, without CoT. demo: https://t.co/j0eFQDISvu
RWKV7+ROSA with 40K params (L2-D32) reversing 1-60 digits input with 99.8% digit accuracy. demo: 251105_reverse_run.py in https://t.co/j0eFQDJql2
It has been over a month since the ROSA proposal. When will a ROSA language model be trained?
