There’s also https://openreview.net/forum?id=BGnm7Lo8oW
#RWKV-papers
Ya I mean if ur gonna post train it I assumed u would put this thinking between the question and answer
Also prompt and chat template I’m not sure is too relevant to the idea of latro
How do u plan on doing it to pretraining data
Where will the CoT be placed
if you're willing to compute 2 forward passes, perhaps where both the loss and the entropy of the predicted token distribution, taken over all tokens with probability at least p, are high
restrict it to settings where the next token isn't necessarily ambiguous, there's a small set of choices (at most 1/p), and the model is uncertain among those choices and could perhaps reason about which would be better. At the same time, it might still end up being very wasteful, as you'd be trying to reason about stuff like which of 2 synonyms is a better choice
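A rough sketch of that gating idea, under my own reading of it: keep only next-token candidates with probability at least p (there can be at most 1/p of them), renormalize, and flag a position when both the loss and the entropy among those candidates are high. The function names and thresholds are hypothetical, and the second forward pass is not modelled here.

```python
import math

def candidate_set(probs, p):
    """Tokens whose predicted probability is at least p (at most 1/p of them can exist)."""
    return [q for q in probs if q >= p]

def candidate_entropy(probs, p):
    """Entropy (in nats) of the renormalized distribution over the candidate set."""
    cands = candidate_set(probs, p)
    if len(cands) < 2:
        return 0.0
    z = sum(cands)
    return -sum((q / z) * math.log(q / z) for q in cands)

def should_think(probs, target_index, p=0.2, min_entropy=0.5, min_loss=0.5):
    """Hypothetical gate: several plausible candidates, high entropy among them, high loss."""
    loss = -math.log(probs[target_index])
    n_candidates = len(candidate_set(probs, p))
    return (n_candidates >= 2
            and candidate_entropy(probs, p) >= min_entropy
            and loss >= min_loss)

# Two near-tied choices: a position where reasoning might help.
print(should_think([0.45, 0.45, 0.10], target_index=0))
# One dominant choice: nothing to deliberate about.
print(should_think([0.97, 0.02, 0.01], target_index=0))
```

The synonym problem mentioned above still applies: two near-tied candidates pass this gate whether or not the choice between them matters.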
Anywhere (the model is reinforced to learn where to put its thought)
How would u do that efficiently on pretraining data?
Just seems like between the question and answer is 99% of the time most optimal
Easiest thing to train and model to learn quickly
Better user experience
And u keep prefill efficiency
Just read my PDF
Oh ok didn’t realise lol
Hm tbh still don’t see how u would do it efficiently from the pdf
Although maybe the inefficiency is fine
Not really efficient, but worth trying
this is correct, however for some strange reason, the training will nan after some time if we apply 1.6x
i noticed this before. dont have time to debug it yet lol
similar to this
are any of the Goose models (1.5B, 3B or 7B) on huggingface to experiment with? I was thinking of running some long context experiments
1b5 can do niah at least to 32k, bigger v7 models coming soon
to clarify, it can only do this once finetuned
the others arent done yet, but 3b will be done feb 10
I decided to put this table into the introduction of the RWKV-7 paper. However, I don't understand exactly how TTT-linear and Titans update their states. I think TTT involves a mini-batch gradient descent, but I have no idea how to write the state evolution formula in a suitable format.
no idea, but I am pretty sure rwkv 5 and 6 dont have diag(w) in them, I think its just w
see eq 7 of https://arxiv.org/pdf/2407.04620 for TTT (except exclude the $x_t$). TTT is quite different from all the other techniques (it essentially maintains a state for 16 steps), so maybe just pretend the mini-batch size is 1 and add a comment as a footnote?
Effectively it would be
$$S_t = S_{t-1} (I - 2\eta k_t k_t^T) + 2 \eta v_t k_t^T$$
or equivalently
$$S_t = S_{t-1} -2\eta(S_{t-1} k_t - v_t) k_t^T$$
Where $\eta$ is a scalar
(You may have to switch the transposes based on convention of row vs column vectors? For instance $v_t^T k_t$ in RWKV-6 would result in a single scalar using column vectors conventions, but it should actually be a matrix, so I assume you are using row vector conventions)
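A quick numerical check that the two forms of the mini-batch-size-1 update above agree. This is a sketch in plain Python using the column-vector convention ($S$ is $d_v \times d_k$); the helper names are made up for illustration.

```python
import random

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def outer(u, v):
    """Outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]

def update_form_1(S, k, v, eta):
    """S_t = S_{t-1} (I - 2*eta*k k^T) + 2*eta*v k^T"""
    n = len(k)
    kk = outer(k, k)
    M = [[(1.0 if i == j else 0.0) - 2 * eta * kk[i][j] for j in range(n)] for i in range(n)]
    SM = matmul(S, M)
    vk = outer(v, k)
    return [[SM[i][j] + 2 * eta * vk[i][j] for j in range(n)] for i in range(len(v))]

def update_form_2(S, k, v, eta):
    """S_t = S_{t-1} - 2*eta*(S_{t-1} k - v) k^T"""
    Sk = [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(v))]
    err = [Sk[i] - v[i] for i in range(len(v))]
    ek = outer(err, k)
    return [[S[i][j] - 2 * eta * ek[i][j] for j in range(len(k))] for i in range(len(v))]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
d_v, d_k, eta = 3, 4, 0.1
S = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_v)]
k = [random.uniform(-1, 1) for _ in range(d_k)]
v = [random.uniform(-1, 1) for _ in range(d_v)]
print(max_abs_diff(update_form_1(S, k, v, eta), update_form_2(S, k, v, eta)))
```

The second form makes the gradient-descent reading explicit: $S_{t-1} k_t - v_t$ is the prediction error, and the state moves against it along $k_t$.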
Titans seems to be almost exactly the same (also using minibatch), except it has
$$S_t = S_{t-1} (w_t I - 2\eta_t k_t k_t^T) + 2 \eta_t v_t k_t^T$$
Where $w_t$ and $\eta_t$ are learnable scalars
Actually $w_t$ and $\eta_t$ may be input-dependent vectors so you would have to wrap these with diag
(0,1)
I want to have RWKV-7 paper posted on arxiv before March 1st. Currently, @paper dove , @iron parrot , @dawn pewter and I are working on it. Does anyone have suggestions on the current paper?
@steady ether Have you tested RWKV-7 MQAR?
It seems that the special initialization of RWKV models is not used, which may affect performance.
I tested it a while back. What’s the special initialization you’re referring to? I might have missed that—can you clarify? Here’s the current code:
https://github.com/guangyusong/zoology_fork/blob/rwkv7/zoology/mixers/rwkv7.py
I think we should try to submit to COLM , and if we submit a preprint to arxiv first that will be disallowed due to anonymity periods
I am quite confident that this is false. That would be a much more stringent policy than they had last year, flies in the face of mainstream attitudes in ML, and there's nothing I can find on their website indicating it.
oh good, I must have misremembered
COLM will use the following policy, adapted from NeurIPS: 'Non-anonymous preprints (on arXiv, social media, websites, etc.) are permitted. We recommend you indicate “preprint”, rather than the “final” option in the template. Reviewers will be instructed not to actively look for such preprints, but encountering them will not constitute a conflict of interest.'
Yep, looks fine!
this part
# !!! initialize if you are using RWKV_Tmix_x070 in your code !!!
# self.receptance.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
# self.key.weight.data.uniform_(-0.05/(C**0.5), 0.05/(C**0.5))
# self.value.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
# self.output.weight.data.zero_()
also, the 'suggestion' values for the LORAs would be good to follow, as those are what are actually used for the models
Ah, I completely forgot. Thanks!
Added dataset details to the paper.
@obsidian quest is this the correct URL for Buzz-V1.2? https://huggingface.co/datasets/H-D-T/Buzz-V1.2
also, @obsidian quest could you describe what hardware and batchsizes etc. were used for the training
and one more question: when you continued the World v2.0 models on World v2.1, how exactly did that work? It was just an additional 0.3T tokens trained? I know for RWKV-7 World v3.0 you trained again on the whole 3.1T World 3.0 corpus...
Uncommenting this would just not work - Initialization is handled by def _init_weights( at line 73 of model.py.
I suggest you further add this at line 81:
if 'rwkv' in block_type.lower():
    # initialize embedding and head
    ...
    return
@steady ether Add Channel mix too. Your code did not use Channel Mix, and it is very different from GLU
You didn't handle properly the v_first term either.
yes
full world v2.1 again
oh, so it was 1.1T World 2.0, then 1.4T World 2.1, then 3.1T World v3? the models have seen a total of 5.6T tokens?
Exactly!
ok, will have updated the manuscript accordingly 😉
I added an intro and did a bunch of checking and edits, will try to add a background section tomorrow and edit more things that I know aren't correct yet
Also added trained models section
although i think this might be slightly weaker than 5.6T + full LR schedule 🙂 just poor man's compute saving method
sure, I'm just describing what was done accurately
though really I don't know that multiple epochs is bad
especially when it was only 1-3 epochs, spread out by trillions each time
@gusty condor are you going to train a RWKVMusic for v7?
multiepoch is fine. just that architecture upgrade & LR restart has some cost
so i think it's more like 1.1/2 + 1.4/2 + 3.1 🙂
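The arithmetic behind that estimate, as a small sketch. The per-stage token counts are the ones quoted above; the 1/2 discount on the earlier stages is just the rough weighting suggested in the message, not a measured quantity.

```python
# (stage name, tokens in trillions, rough weight suggested in the chat)
stages = [
    ("World v2", 1.1, 0.5),    # discounted: architecture upgrade and LR restart came later
    ("World v2.1", 1.4, 0.5),  # discounted for the same reason
    ("World v3", 3.1, 1.0),    # the final full run counts fully
]

raw_total = sum(tokens for _, tokens, _ in stages)
effective_total = sum(tokens * weight for _, tokens, weight in stages)

print(f"raw tokens seen:        {raw_total:.2f}T")
print(f"rough effective tokens: {effective_total:.2f}T")
```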
Not really, some RWKV-7 models are trained from scratch.
well I can only know what Bo says... which models were trained from scratch that are not Pile?
0.1B is trained from scratch (likely)
I don't know if 0.4B is converted from RWKV-5
@obsidian quest which models were trained from scratch? and which were converted from v5 and v6 and which ones are from world v2 etc?
0.1B - from scratch? just world v3?
0.4B - are all the others from v6 world2.1 upgraded?
1.5B
2.9B
and were those v6 world 2.1 all from v6 world2? or from v5
0.1B: 1.0T
0.4B: ? + 2.0T
1.5B, 3B: likely v6 world v2.1 + 3.1T
all updated from previous models
0.1 from v5 world, 0.4 from v5 world 2, 1.5 2.9 from v6 world 2.1
so for 0.1B and 0.4B did you upgrade to v7, then train world v3 directly for those? so they are only 1.1T + 3.1T?
0.1B is likely world v1
he just wrote above that it was from v5 world 2...
There is no world v2 0.1B model
That's a very compelling point
RWKV-4 paper does not show the number of tokens or contents of World dataset
does anyone know this info?
I guess we can live without the contents, but the token count would be good to show
@obsidian quest can you provide the values for this chart for v7 World 3 training
(I think I have the config for Pile)
World v1 is 0.59T
We also need the data behind the loss plot
Something like this
let me know if you think this is correct
I think it's better to use "RWKV-7" rather than "Goose"
World v1: 0.59
world v2: 1.12
v2.1: 1.42
v3: 3.119
good to use the same precision consistently, so I think it should be either one or two decimal places
I have updated it to use a single decimal place for now
@gusty condor let's discuss whether to use "state of the art" versus "state of the open" - what are the closed source models at these scales against which we are competing, and can we find equivalent benchmarks for them?
I'm not 100% certain on even state of the open, but I haven't seen any models that beat RWKV-7 at the 3B scale
maybe some hybrids might? we need to check this
My 4 cents:
Do we plan a section on speed/memory benchmarks like sec. 9 of the Eagle/Finch paper? I see it is currently commented in the LaTeX source.
I would also suggest we reformat Sec. 4.1.1 for clarity, because we introduce a dozen or so RWKV-specific variables and it's easy to forget the first few times around. I find myself frequently referring back to it for variable meaning and faster lookup would be great.
Similarly I would like to see intuitive explanations for some design choices throughout Sec. 4, and connections of ways in which Goose design choices can be considered similar (or different!) to other linear attention architectures, like how Eagle/Finch Sec. 4 did it, to contextualize the work in the broader linear attention landscape. (Maybe this will be covered in the background Sec. 2)
Spitballing on this last one, but I also wonder whether we can come up with any simple explanations and visuals. Technical media makes good papers, but simple media makes good blogposts, which in turn makes good (maybe even viral) publicity. I really like Figure 2 in this regard and wonder if we can expand on that.
For example Fig 1 from Attention hits about as many Google Search results as the whole query "RWKV". If we have something that is easily accessible in that regard, I think it will do wonders for RWKV's publicity. (Figure 4 is too intimidating, IMO.)
on another note is anyone testing Goose for music or audio modeling atm? if not I'd like to contribute that
thanks for the feedback, I will add the intuitive explanations into an appendix when I write the background Sec 2 because they are hard to get past reviewers without experimental substantiation
Definitely contribute some audio modeling if you can! Maybe you'd like to get the linaspeech code and rework it for v7? Or if @gusty condor doesn't have time to do it he could share the music modeling code with you to attempt that one
Good point on 4.1.1 - I did some reworking here already but more is needed. As usual, the problem is balancing total pagecount with readability. For arxiv it doesn't matter, so I could just add in a reference sheet, but I think it'd be nicer for the full paper to match somewhat
I will also include a code version in the appendix, with legible naming and comments
took some work since I'm bad at LaTeX, but does this help for 4.1.1?
https://x.com/BlinkDL_AI/status/1893123273871036670
I'll take a look at linaspeech. Mostly I wanted to replicate sec.s 10.1 and 11 from the Eagle/Finch paper to capture the generational leap but I'll see whatever I can make work
I like this more, yeah. I have some more nitpicks (e.g. using alpha in the definition of replacement boosted key before we define it two lines down; no, actually, this makes sense) but I imagine this is a difficult problem to balance as you mentioned, and of course it's easier to point out problems than to fix them
other ideas include semantically spacing the lines, e.g. putting a \vspace{0.5em} between g, d, v, a and r, k, v
Keep telling me problems and I'll try to find fixes!
hm well it is really weird to me that we use two different font 'v's (well, \nu and v). For instance it is very difficult to tell them apart on my phone screen - could we do \bar{v} or something clearer for value without residual?
Good call. Was too cute using a different Greek letter that looks like v
They are really very different ... In physics classes we were taught $E = h \nu$, not $E = h v$.
$\nu$ has a sharper angle at the bottom.
It also has the serif on the top left. But it is for the sake of clarity: for example, if I lean back in my chair, or open Discord on a phone, I cannot distinguish them in this image.
It's just that I want a different style from the RWKV-5/6 paper. I think RWKV-7 state visualization and probing would be interesting.
That is more valuable than music modeling ( @iron parrot can do that)
it was my mistake, I shouldn't have tried to use a similar looking letter and will change it to have a tilde or hat or something instead
same issue with kappa versus k
I just wanted something that looked like a k, since its related... but maybe it's better as a completely different letter
the problem is these things go through a few steps, so for example we already needed kappa hat
same issue with alpha and a btw
I'm really not sure what I would replace them with though
blink calls kappa 'kk' in the code lol
RWKV-7 MQAR
L=512 KV=64 D=64 98.43%
L=512 KV=64 D=128 >99%
L=512 KV=64 D=256 >99%
L=512 KV=64 D=512 >99%
L=1024 KV=128 D=64 95.01%
L=1024 KV=128 D=128 >99%
L=1024 KV=128 D=256 >99%
L=1024 KV=128 D=512 >99%
L=2048 KV=256 D=64 72.93%
L=2048 KV=256 D=128 94.97%
L=2048 KV=256 D=256 98.97%
L=2048 KV=256 D=512 >99%
@misty igloo What is "relaxed replacement semantics" in the abstract?
The amount of the key that gets replaced can vary between the in-context learning rate and 1.0
We can rephrase if you like
The inits did make a noticeable difference!
@steady ether How many learning rates are tested?
I use $$ LR = \frac{(1.0, 2.0, 4.0)}{\sqrt{d_{\mathrm{model}}} \cdot \mathrm{sequence\_length}} $$
Is our batch size aligned?
does your inner monologue just run in latex?
See https://arxiv.org/abs/2407.05872 (and some others) for why I use such a formula
After testing d_model = 64 at sequence length 1024, I transferred the LR across all runs
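For reference, the learning-rate formula above evaluated for the tested setting (d_model = 64, sequence length 1024). This is only a sketch of the formula itself; the function name is made up and nothing here comes from the zoology code.

```python
import math

def candidate_lrs(d_model, sequence_length, multipliers=(1.0, 2.0, 4.0)):
    """LR = multiplier / (sqrt(d_model) * sequence_length), one value per multiplier."""
    scale = math.sqrt(d_model) * sequence_length
    return [m / scale for m in multipliers]

for lr in candidate_lrs(d_model=64, sequence_length=1024):
    print(f"{lr:.6e}")
```

With these inputs the denominator is 8 * 1024 = 8192, so the three candidates are 1/8192, 2/8192 and 4/8192.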
The batch size looks correct. Everything else is just zoology defaults from page 67 (https://arxiv.org/pdf/2312.04927). Let's go with the run you posted.
Yes, you modified my original formula and made it very complex 🙂
I moved the LayerNorm (GroupNorm) to the prior section, where it belongs because it is per head (see eq 7)
I think this is much simpler than the new formula you added
This way we can keep everything per head in 4.1.2 and full vectors of size D in 4.1.3
4.1.3 is not really dimension D
This is summed per-head:
$$ \langle r_{t, j} \mathrm{diag}(\rho_j), k_{t, j} \rangle $$
I don't think adding tons of subscripts is making the paper better
it's just harder to follow
when the reality is that its a bunch of hadamard products and an addition
oh sorry
yes I made a mistake here
and I got busy and forgot to correct it
give me a few minutes to look over it - I'll put it back if I don't find a better solution
but it would be nice if 4.1.3 was less complex looking
More subscripts in this paper https://arxiv.org/pdf/2406.06484
I love songlin's papers but they are very challenging for non-mathematicians to read through
my original formula that was in the paper before it got changed was more like this
putting the heads together doesn't really happen until the very last step, right before multiplying with W_o
It's 2025 and people are getting used to these
but the actual situation is that everything is per head until W_o... everything except tokenshift can be considered and written per head
so there is simply no need for head subscripts in any part of the paper
I'm trying to keep things simple here so that it can be quickly and easily understood
per head:
r (u \odot k)^T
would be that same sum, right?
its an inner product
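A tiny check that the two per-head notations are the same number, assuming the $u$ in the Hadamard form plays the role of $\rho_j$ in the subscripted form (that correspondence is my assumption). Values are arbitrary.

```python
def bonus_diag_form(r, rho, k):
    """<r diag(rho), k>: scale r elementwise by rho, then take the inner product with k."""
    scaled = [ri * pi for ri, pi in zip(r, rho)]
    return sum(si * ki for si, ki in zip(scaled, k))

def bonus_hadamard_form(r, u, k):
    """r (u ⊙ k)^T: Hadamard product of u and k, then the inner product with r."""
    uk = [ui * ki for ui, ki in zip(u, k)]
    return sum(ri * x for ri, x in zip(r, uk))

r = [1.0, 2.0, 3.0]
rho = [0.5, -1.0, 2.0]
k = [4.0, 0.5, -1.0]
print(bonus_diag_form(r, rho, k), bonus_hadamard_form(r, rho, k))
```

Both reduce to the sum over i of r_i * rho_i * k_i within one head, so the choice between them is purely about notation.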
Why did you add it? It's not part of RWKV-7
I know it's part of some proofs
so I left it in the proofs as an extension
This is naturally extendible and exists in some of @iron parrot 's experiments.
Should be considered a hyperparameter
I don't think it should be in the official formulas if it's not in any RWKV-7 code that has ever publicly existed
I agree that it should be listed as an extension in the paper though
when people read the paper and see the code they shouldn't be surprised that the code does not have parts that are in the paper
if you still think it should be in the main formulas, we can just ask Blink if he wants it to be a part of the official RWKV-7 definition or listed as an extension
C being set to 1 is more like a compromise
Originally it was 2
That caused NaNs in rc2
you don't think it will be a problem for readers that the code literally doesn't have this in it?
I think it's pretty bad when I read a paper and the code does not conform to it
extremely confusing to the reader when that happens
we can add this in the code, instead of removing this from the paper
okay, let's just ask Blink if he wants to do that
I'm definitely fine with having it in the main formulas if it's in the code (in that case it would actually have to be in the formulas!)
@obsidian quest what do you think? should we put this additional c parameter (it would be 1.0 in all existing models) into the codebases?
I ran some loss tests on PG19 with different models. Surprisingly, the loss doesn't seem to get better with longer context lengths, even with newer models
Looks like this is something specific to this dataset
interesting - did you try it on any other non-RWKV model? maybe it never gets better over ctxlen for other models, either
That's what I suspect. I'm currently testing RWKV on the Proof Pile dataset, and the loss goes down as context length increases
yes, I suppose PG19 was a poor choice 😭 - I ran that experiment at the very last minute for the Eagle/Finch paper, I think at Blink's request but I forget if Blink asked that I use that dataset or I chose it. Good to know that it's probably not ideal!
@gusty condor I made a correction to your bonus formula... maybe this is a mistake in the model, but the code uses \tilde{k} for it
There is also a mistake where it does not apply the gate before the output matrix. I have corrected that as well but it does not fit nicely with your head indexing
I'm about to run NIAH tests. Which NIAH variant should we use, or should we go with RULER instead?
just simple passkey in garbage - at least rwkv-7 solved
@brisk bronze ran this NIAH style passkey in garbage test and has results - she has an updated version of @iron parrot 's mamba repo that uses exact token counts instead of an approximation based on the average tokenizer bytes per token, as well as some other updates
I asked her to add these to the paper a couple days ago but I guess she hasn't gotten around to it yet
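For anyone unfamiliar with the setup, a minimal sketch of a passkey-in-garbage prompt builder sized by token count. Everything here is illustrative: the filler text, the prompt wording and the whitespace "tokenizer" are stand-ins, and this is not the code from the repo mentioned above. A real run would count tokens with the model's own tokenizer, which is the point about exact counts.

```python
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again.")

def count_tokens(text):
    """Stand-in tokenizer (whitespace split); swap in the model's tokenizer for exact counts."""
    return len(text.split())

def build_passkey_prompt(target_tokens, depth, passkey):
    """Place a passkey sentence at a relative depth (0.0 = start, 1.0 = end) inside filler text."""
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."
    question = "What is the pass key? The pass key is"
    budget = target_tokens - count_tokens(needle) - count_tokens(question)
    n_fill = max(budget // count_tokens(FILLER), 0)
    n_before = int(n_fill * depth)
    parts = [FILLER] * n_before + [needle] + [FILLER] * (n_fill - n_before) + [question]
    return " ".join(parts)

prompt = build_passkey_prompt(target_tokens=2000, depth=0.5, passkey=12345)
print(count_tokens(prompt))
```

Scoring is then just whether the model's continuation contains the passkey, swept over context lengths and depths.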
updated to "a relaxed value replacement rule"
Sounds good, let's go with her version then
long-context loss tests on the Proof Pile dataset
We also finetuned a version for longer context by training it on 128k data and it increased its NIAH scores. Let me try to get you that in case you'd like to try it on proof pile
Would love to test it, as it seems the current world models are a bit 'overfitted' to 4k context lengths.
this is the 1.5B model that we extended - https://huggingface.co/m8than/rwkv7-1b5-128k
I am training G1 0.1/0.4/1.5/2.9B ("Goose One" 🪿) simultaneously on world-3.5 (5.16T tokens), continuing from previous RWKV-7 "Goose" world-3 checkpts. Release soon🙂even L12-D768 can reason.
Is the data listed in the rwkv news channel everything you used? We can include in the paper
will provide latest list soon
I suggest we open a separate new github link for all our experiments:
- RWKV-7 training code (should only include RWKV-7)
- MQAR testing
- lm-eval code
- state visualization
Does the new RWKV have a working HF implementation yet?
yeah will add this today, sorry for the delay!
Yes in fla-hub there are models
I have older ones too, and we have an upcoming simplified Rwkv-Blocks repo that implements it too, but I recommend the fla-hub versions at this point
There is a slight performance degradation
In inference or training?
Inference mainly
can we put in non-FLA pure pytorch inference code to fix the problem, like in my original repo:
https://huggingface.co/SmerkyG/RWKV7-Goose-0.4B-Pile-HF/blob/02778effb99287d220d5d9494af4acf2af686296/modeling_rwkv7.py#L358
I created it.
But I suspect the main problem lies in the logit head and outputs, which increases ppl a little
hmm what aspect of those do you suspect increases it? just curious so I don't make that mistake in the future
Haven't inspected it yet
oh, one thing that used to matter in v6 was doing the normalization in fp32
I spent about 24 hours tracking that one down at one point
like you have to call torch.nn.functional.layer_norm or group_norm with the weights upcasted to float potentially (even though they are stored as bf16) so that the calculation is done in float precision
that could easily be the issue
it's not currently being done in o = self.g_norm(rearrange(o, '... h d -> ... (h d)'))
see this line in ChatRWKV for proof
https://github.com/BlinkDL/ChatRWKV/blob/626367863cf5860268c2fda81a5d43d423a69ebf/rwkv_pip_package/src/rwkv/model.py#L657
GLUE looks problematic
There are several subtasks in math and CS of MMLU that RWKV-7 lags behind Qwen by over 20%
qwen2.5 has much higher MMLU compared with llama3.2 too
they have lots of synthetic data
I think there is some slight data leakage. Qwen-2.5's mmlu can be discounted by around 6%.
what else should be included? I think RWKV-6 2.1, any other models?
This was done using the FLA hf so maybe the fix will help a bit. Glue was having some issues so I need to double check the methodology there. @brisk bronze where did we end up on that, did you manually average the accuracy based entries?
The GLUE result for 2.9B is 55.19 using the FLA implementation. The stats in the picture are not based on FLA
No. It was I who tested with RWKV-7 pip.
oh ok, janna was running these evals yesterday so I didn't know who put them in
It might be related to pad tokens
{
"model": "/home/zhangping/zrc/RWKV-x070-World-2.9B-v3-20250211-ctx4096",
"tasks": [
"glue"
],
"num_fewshot": 0,
"results": {
"glue": {
"f1,none": 0.6928835730390357,
"f1_stderr,none": 0.0036436664119938256,
"acc,none": 0.6684581943782754,
"acc_stderr,none": 0.0016703858434417523,
"mcc,none": 0.05185503773957725,
"mcc_stderr,none": 0.032600944586408685,
"alias": "glue"
},
"cola": {
"mcc,none": 0.05185503773957725,
"mcc_stderr,none": 0.032600944586408685,
"alias": " - cola"
},
"mnli": {
"acc,none": 0.39449821701477333,
"acc_stderr,none": 0.004933523584717906,
"alias": " - mnli"
},
"mnli_mismatch": {
"acc,none": 0.4044955248169243,
"acc_stderr,none": 0.004949946753591583,
"alias": " - mnli_mismatch"
},
"mrpc": {
"acc,none": 0.7794117647058824,
"acc_stderr,none": 0.020553105287596057,
"f1,none": 0.8534201954397395,
"f1_stderr,none": 0.016157946331836814,
"alias": " - mrpc"
},
"qnli": {
"acc,none": 0.5678198791872597,
"acc_stderr,none": 0.006702886134456929,
"alias": " - qnli"
},
"qqp": {
"acc,none": 0.8064803363838734,
"acc_stderr,none": 0.0019647755361788884,
"f1,none": 0.6912635151132507,
"f1_stderr,none": 0.003676786809851292,
"alias": " - qqp"
},
"rte": {
"acc,none": 0.7472924187725631,
"acc_stderr,none": 0.026157719758464693,
"alias": " - rte"
},
"sst2": {
"acc,none": 0.893348623853211,
"acc_stderr,none": 0.010458867008246837,
"alias": " - sst2"
},
"wnli": {
"acc,none": 0.5352112676056338,
"acc_stderr,none": 0.0596130578497224,
"alias": " - wnli"
}
}
}
These are my tests, definitely better
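On the averaging question: a small sketch comparing a plain unweighted mean of the per-task accuracies in the JSON above with the group-level "acc" reported there. The numbers are copied from that output; cola is left out because it reports mcc rather than acc. That the gap comes from size-weighted aggregation is my guess, not something verified against lm-eval.

```python
# Per-task accuracies copied from the results JSON above (cola excluded: it reports mcc).
task_acc = {
    "mnli": 0.39449821701477333,
    "mnli_mismatch": 0.4044955248169243,
    "mrpc": 0.7794117647058824,
    "qnli": 0.5678198791872597,
    "qqp": 0.8064803363838734,
    "rte": 0.7472924187725631,
    "sst2": 0.893348623853211,
    "wnli": 0.5352112676056338,
}
reported_group_acc = 0.6684581943782754  # the "glue" -> "acc,none" entry

unweighted_mean = sum(task_acc.values()) / len(task_acc)
print(f"unweighted mean of task acc: {unweighted_mean:.4f}")
print(f"reported group acc:          {reported_group_acc:.4f}")
```

The two differ by a few points, so whichever convention goes into the paper's table should be stated and applied to every model in it.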
we are using FLA HF for NIAH etc. too so we really need that repo to work properly
yeah yours does better compared to fla, looks like cola was the most different. Everything else is a percentage point or so
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results/fla-hub__rwkv7-2.9B-world/results_2025-02-24T02-53-42.327679.json
What is NIAH?
needle in a haystack
and qnli, qqp too
Reference:
https://github.com/howard-hou/VisualRWKV/blob/main/VisualRWKV-v6/v6.0/eval/run_lm_eval.py
(change to RWKV_PAD = [0] to align with rwkv fla)
This thing of using RWKV_PAD before the text is pretty weird, and has been discussed as problematic before.
If we are going to do that for evals we should simply change the model to have it be in the starting state.
Where in the FLA HF code does it put [0] in the starting state?
bos_token = eos_token = pad_token = 0 = '<|rwkv_tokenizer_end_of_text|>'
then automatically handled by lm_eval
yeah that's fine, as long as we tell that to lm-eval
doesn't it have a 'add_bos_token' option
so we would need to set that when running like pretrained=MODEL,add_bos_token=True
I don't think it will add it automatically without that commandline setting
lets get the bugs fixed in the fla hf implementation so we can be using it for evals, and so that others who use it will not get bad results
setting add_bos_token=TRUE didn't change the score for lmbda.o for fla rwkv7 1.5B ftr
True is not capitalized in python, I dunno if that matters
I mean the first letter is but not the rest
"True"
Also lambada is typically less sensitive to this somehow
I find the bos token impacts different evals differently with the average ending up not really different
The biggest issue is most likely the group norm bugfix
yeah, I don't think the True mattered
add a column: trained tokens 🙂
and show number of activated parameters (then rwkv params will be less)
I think jellyfish already added this in section 6.2?
@obsidian quest to follow up on this, do you want to show the 'c' constant in the main RWKV-7 formulas? or just in the proofs
got paper link?
actually i dont know what is c
yeah, it's not in the code at all for rc4a but it's a generalization that maybe appeared in some earlier versions
that's why I wanted to ask your opinion
remove "c" then (and change figure 2 too)
I thought so too - @gusty condor can make the argument for it to you if he still thinks it's important
we should mention that expanding eigenvalue is useful for https://github.com/Jellyfish042/RWKV_Othello
and that is a slightly different formula
I think that's what the 'c' variable is for - to allow expanded eigenvalue range
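A simplified illustration of what "expanded eigenvalue range" means. Assume a unit-norm key direction, a scalar a, and no decay: the transition is then I - a * kk kk^T, whose eigenvalue along kk is 1 - a. With a = sigmoid(x) that lies in (0, 1); with a = 2 * sigmoid(x) it lies in (-1, 1), so negative eigenvalues become possible. The real model uses per-channel vectors and a decay term, which this sketch ignores.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def eigenvalue_along_key(logit, scale=1.0):
    """Eigenvalue of I - a * kk kk^T in the kk direction, with a = scale * sigmoid(logit)."""
    return 1.0 - scale * sigmoid(logit)

def apply_transition(vec, key, a):
    """Compute (I - a * key key^T) vec for a unit-norm key."""
    proj = sum(ki * vi for ki, vi in zip(key, vec))
    return [vi - a * proj * ki for ki, vi in zip(key, vec)]

for logit in (-5.0, 0.0, 5.0):
    print(logit, eigenvalue_along_key(logit, scale=1.0), eigenvalue_along_key(logit, scale=2.0))

# With a = 1.5 the component along the key flips sign: (0.6, 0.8) maps to about (-0.3, -0.4).
print(apply_transition([0.6, 0.8], [0.6, 0.8], a=1.5))
```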
extending eigenvalue:
```
orig:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 )
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*a)
new:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*(a.float()*torch.exp(-torch.exp(w.float()))).to(dtype=torch.bfloat16))
or (try both)
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk*torch.exp(-torch.exp(w.float())).to(dtype=torch.bfloat16), kk*a)
```
it's a bit different
yeah, it wasn't my addition 🙂 I just don't like it when the formulas dont match the code because it confuses readers
I believe semantically the constant c matters. It's just a compromise for training stability that we changed it to 1. Adding that c can better explain the motivation of RWKV-7.
I can't reproduce Qwen's results on arc-c, winogrande and hellaswag on https://arxiv.org/pdf/2412.15115 .
but i am not using it
however this is useful
The proof of RWKV-7's NC1 needs c>1
you can use this @gusty condor
OK, will add a subsection in the appendix to discuss that
we could still keep the c variable in the proof section, and introduce it as an extension that we keep it as 1 in the main model - that was the compromise I struck in my earlier edit
I added an initial draft of a background section just now.
There is too much in "others." How much instruction data and how many Chinese novels are there?
@obsidian quest could you elaborate on "others"?
world-3.0
science+wiki 222.7
math 32.3
law&gov 19.0
fiction 192.6
poetry+lyric 1.7
chat+qa+instruction 110.0
code 258.4
web 1945.2
total 3119.2
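A quick sanity check on that list (values copied as posted; the unit is presumably billions of tokens, given the 3.119T figure quoted earlier). The listed categories do not add up to the stated total, and how the remainder should be labelled is not stated here.

```python
# Category sizes copied from the world-3.0 list above.
listed = {
    "science+wiki": 222.7,
    "math": 32.3,
    "law&gov": 19.0,
    "fiction": 192.6,
    "poetry+lyric": 1.7,
    "chat+qa+instruction": 110.0,
    "code": 258.4,
    "web": 1945.2,
}
stated_total = 3119.2

listed_sum = round(sum(listed.values()), 1)
unlisted = round(stated_total - listed_sum, 1)

print(f"sum of listed categories: {listed_sum}")
print(f"not covered by the list:  {unlisted}")
for name, size in listed.items():
    print(f"{name:22s} {100 * size / stated_total:5.1f}% of stated total")
```

The listed rows sum to 2781.9, leaving 337.3 of the stated 3119.2 unaccounted for.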
Those numbers don't match the table currently, PSA
ok could someone please combine v2 + v2.1 + v3 items and arrange them to approximately match this list and i will fix on top of it because there are so many components
I could attempt doing it.
https://colab.research.google.com/drive/1Ic9RT-VzqEbdff350xPlXtJufBZJjHOK?usp=sharing
https://docs.google.com/spreadsheets/d/1HnwASXkgL6N3mLJQ5-8nkqJbs-yhJhKNFYJw6gpHoSs/edit
If you need it the refined and checked v2.1 and v3 info is in the current paper draft (I went and got all the urls, cleaned up the names, etc.)
Where is the current paper draft?
I was looking at the Transition Matrix Stability Proof in the paper. Normally a contraction matrix is defined as having norm less than 1, not eigenvalues in (-1,1). It's misleading to say it's a contraction, since being a contraction would imply that the state cannot blow up. However, the state can blow up.
Additionally, I wrote a ~100 line standalone implementation of RWKV-7 inference in numpy (to avoid hiding things in torch functions). It's verified numerically against the pip rwkv package.
Sorry, I was mistaken.
Actually, because it's similar to a symmetric matrix, if we fix a, it is indeed a contraction, since spectral norm is equal to the eigenvalue with largest absolute value for symmetric matrices.
It's just that a isn't really fixed; it is a_t. Since I removed the subscripts in the problem statement, I just forgot about that dynamic dependence😭
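To make the distinction in this exchange concrete, a 2x2 sketch with arbitrary example matrices: for a symmetric matrix the spectral norm equals the largest absolute eigenvalue, while a non-symmetric matrix can have all eigenvalues inside (-1, 1) and still have norm above 1. This only illustrates the general fact; it says nothing about the specific RWKV-7 transition matrix or the proof in the paper.

```python
import math

def sym_eigs(a, b, d):
    """Eigenvalues (low, high) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def spectral_norm(m):
    """Largest singular value of a 2x2 matrix: sqrt of the top eigenvalue of m^T m."""
    (p, q), (r, s) = m
    g11 = p * p + r * r
    g12 = p * q + r * s
    g22 = q * q + s * s
    return math.sqrt(sym_eigs(g11, g12, g22)[1])

symmetric = [[0.5, 0.3], [0.3, -0.2]]
print("symmetric: max |eig| =", max(abs(e) for e in sym_eigs(0.5, 0.3, -0.2)),
      " norm =", spectral_norm(symmetric))

# Upper triangular, so both eigenvalues are the diagonal entries (0.5 and 0.5).
non_symmetric = [[0.5, 2.0], [0.0, 0.5]]
print("non-symmetric: eigenvalues = 0.5, 0.5  norm =", spectral_norm(non_symmetric))
```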
I'm not sure who is adding the chat examples, but we should discuss this before you do. Using the "base Gradio 7B model" (whatever that is) is not appropriate for a paper that does not include any 7B model.
I'm also not sure we want to show chat examples in this paper.
Agreed. I think we are playing with the big boys now and are way past needing to show that it can talk nicely.
I added a draft appendix section on design decisions, walking people through how it all works and why.
Now it can even rival Qwen-2.5 as a base model (not a chat model)
@misty igloo @gusty condor I was going to read through the paper and do a suggestions / editing pass today. Is there anything in particular you'd like me to focus on?
@keen tartan
pls add lora dimensions suggestions as in RWKV-LM
more suggestions: wd 0.1 // adam beta (0.9, 0.99) // adam eps 1e-18
add RWKV-4 1.5b to Compression rate% eval
v7 0.1/1.5/2.9/0.4 loss curves
0.4 - bsz 240 lr_init 5e-4 => bsz 480 lr_init 6e-4
1.5 - bsz 480 lr_init 4e-4 => bsz 672 lr_init 4.5e-4 => bsz 1152 lr_init 6.1e-4
2.9 - bsz 640 lr_init 4e-4 => bsz 1008 lr_init 5e-4 => bsz 1120 lr_init 5.4e-4 => bsz 2016 lr_init 8e-4
all - wd 0.1 // adam beta (0.9, 0.99) // adam eps 1e-18 // lr_final 1e-5
done
let's try v7 for sudoku too @iron parrot
Can you add final loss too? And what exactly is the learning rate curve?
TODO:
5. Add limitations and acknowledgements
green = LR curve
I mean, is there a formula for it?
this. cosine decay. i change LR & BSZ when number of compute nodes changes.
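Since a formula was asked for: a generic cosine decay from lr_init to lr_final, as a sketch. The lr_final of 1e-5 and the lr_init of 6e-4 are taken from the settings posted above. Any warmup, and how the schedule is re-anchored each time the batch size and lr_init change, are not specified in this thread, so they are left out.

```python
import math

def cosine_lr(step, total_steps, lr_init, lr_final=1e-5):
    """Cosine decay: lr_init at step 0, lr_final at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of steps in one segment
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total, lr_init=6e-4):.3e}")
```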
how could I contribute to the manuscript? (from the fla group)
@obsidian quest It looks like I don't have permissions to view the overleaf
Can you add me?
i am not its owner 😂
Who is?
@tropic minnow is the owner, but that URL will allow you access
Does Overleaf have a dark/night mode? It is so bright! o.O
Updated the world datasets itemized lists:
https://colab.research.google.com/drive/1Ic9RT-VzqEbdff350xPlXtJufBZJjHOK#scrollTo=udl8RkeeM-yE
https://docs.google.com/spreadsheets/d/1HnwASXkgL6N3mLJQ5-8nkqJbs-yhJhKNFYJw6gpHoSs/edit?gid=1049532087#gid=1049532087
Was able to figure out citations for most of them.
I noticed that the DeepMind Mathematics (dm_math) dataset is part of The Pile 1 and was already included in world-v2 I assume, but it seems to be not mentioned in the Eagle & Finch paper. Where should it be placed?
mention we added it but missed mentioning
I see. I'll try to incorporate it somehow.
Should it perhaps be added as an Errata to the Eagle & Finch paper too?
v7 0.4b = v5 0.4b + subsampled 2T tokens from world-3
That would be great! I don't have a particular area in mind - I did a lot of the abstract/intro/background/description writing in just the past few days, so they are all essentially early drafts. I'm very open to any kind of perspective you can lend on general flow, narrative, and what should be emphasized.
Also, a lot of the evals are still preliminary or missing. We are working on some discrepancy issues we've found to ensure everything is really solid.
Do you need any additional hands on paper writing/editing?
Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is doing it, may I take on this part of the test?
Would it make sense to mention the D512 and D576 variants of the 0.1B model?
I'm not sure we need this specific benchmark for the paper, but you're welcome to provide it if you like. It will likely end up in the Appendix if so. If you do decide to add it, you will need updated benchmark results for modern models such as Qwen2.5 3B and Llama3.2 3B and 1.5B, as well as any other top tier models in those sizes.
It is mostly there simply because I copied over the results from benchmarks in the Eagle/Finch paper.
Also, if you decide to run it you might consider using our extended context finetunes of 1.5B and 2.9B.
I think this complicates the paper unnecessarily. Did Blink ever even release these?
I recall that they are narrower but deeper.
I think it's @brisk bronze who is doing these benchmarks
Yes, released. See https://huggingface.co/BlinkDL/temp-latest-training-models/tree/main
which files? I don't see
What's your opinion on listing them? I think it may complicate the paper unnecessarily
We could add a section on depth versus width ablations, but I'm also not sure that this is really a RWKV specific result
Oh here it is! https://huggingface.co/BlinkDL/rwkv-7-pile/tree/main
Is there previous research about this?
I wasn't running bamboo.. not sure who is doing it
I think bamboo will be pretty low signal for base models
Thank you. I'll first see how the existing methods work.
They mostly have instruct benchmarks in their paper and the tasks are structured in a way that base models will do poorly
the problem is that it's very niche - we don't have comparison models of other architectures with these changes
so we could show it in its own separate ablations section I suppose, but it wont be relative to other architectures
sounds like Baber thinks this specific benchmark won't be valuable on base models, so let's skip it (he's in a good position to know, since he works on lm eval harness!)
Skip it, ok
I tested gsm8k and found that it's very sensitive to response format, so I decided to skip it
It’s quite apparent here
But maybe the instruct models have context extension idk
Yeah we also have context-extended versions that we just trained, but we already will show NIAH for that
@brisk bronze is doing those, with her fork of what was originally jellyfish's revision of the mamba test 🤣
Yeah. I think niah single needle and maybe one other. Multi key/query depending on the framing
it's relevant for other models too because smollm is deep+narrow as well
smollm 1.7B is 2048 x 24
smollm 135M is 576 x 30
yeah sorry typo
there's no really good comparison point because we don't have a 'normal' depth smollm2
but we can show them all side by side
just cant really draw much of a conclusion
also SmolLM is not trained on pile
@obsidian quest what kind of comparison with SmolLM were you thinking we would show?
State visualization for v6. Working on v5 and v7
Visualizations of S @ ones(64,1) for each head, arranged per layer, for some text generated with 0.1B, normalized the same way as the cryscan webgpu state visualization demo
Visualizations of S @ ones(64,1) for each head, arranged per layer, for some text generated with 0.1B, colored the same way as the state visuals above
No, this is too large for the paper
Interestingly, the stable rank of the WKV matrix in RWKV-7 has been shown to be lower than that of RWKV-5 and RWKV-6.
this is strange. if you check state visualization, rwkv7 states look much more "random" while rwkv6 states are more like checkerboards (rank 1)
You can rerun those experiments. Actually in some layers of RWKV-7, the state is very concentrated.
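For reference, stable rank is commonly defined as the squared Frobenius norm divided by the squared spectral norm. A minimal NumPy sketch under that assumption (the function name `stable_rank` and the toy matrices are illustrative only, not the code used for the measurement discussed here):

```python
import numpy as np

def stable_rank(S: np.ndarray) -> float:
    """Stable rank ||S||_F^2 / ||S||_2^2 (assumed definition).

    It is 1 for a rank-1 matrix and n for the n x n identity.
    """
    fro_sq = np.linalg.norm(S, "fro") ** 2
    spec_sq = np.linalg.norm(S, 2) ** 2  # largest singular value, squared
    return float(fro_sq / spec_sq)

rng = np.random.default_rng(0)
u, v = rng.standard_normal(64), rng.standard_normal(64)
rank_one = np.outer(u, v)   # a rank-1 state
identity = np.eye(64)       # all 64 directions used equally

print(stable_rank(rank_one), stable_rank(identity))
```

A state that looks "random" can still have a low stable rank if a few singular values dominate, so the visual impression and the metric need not agree.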
Have you started using Muon optimizer or some other related new optimizer?
This may be relevant:
https://docs.modula.systems/examples/weight-erasure/
https://x.com/jxbz/status/1845146681274478856
https://x.com/ssnl_tz/status/1845179813755224406
got code?
We tested Muon, but Muon may not be efficient for RWKV's LoRA gates.
DMed you
@misty igloo I got the formula for parameters correct:
$$ \#(\mathrm{Params}) = 2DV + 4D + LD \left(12D + 2\left(d_w + d_a + d_v + d_g \right) + 19 \right) - (2Dd_v + D) $$
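A small sketch that evaluates the formula above, useful for sanity-checking against a checkpoint. The reading of the symbols is an assumption (D as model dimension, V as vocabulary size, L as number of layers, d_w/d_a/d_v/d_g as the low-rank dimensions), and the numbers in the example are toy values, not a real model configuration:

```python
def rwkv7_param_count(D: int, V: int, L: int,
                      d_w: int, d_a: int, d_v: int, d_g: int) -> int:
    """Evaluate 2DV + 4D + LD(12D + 2(d_w + d_a + d_v + d_g) + 19) - (2 D d_v + D)."""
    embeddings_and_head = 2 * D * V + 4 * D
    per_layer = D * (12 * D + 2 * (d_w + d_a + d_v + d_g) + 19)
    correction = 2 * D * d_v + D  # subtracted once, as in the formula
    return embeddings_and_head + L * per_layer - correction

# Toy values only (hypothetical, chosen so the arithmetic is easy to follow)
print(rwkv7_param_count(D=4, V=10, L=2, d_w=1, d_a=1, d_v=1, d_g=1))
```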
Zhang Ruichong
Please double-check Appendix E. I'm finishing in a few hours!
please check this @uneven blade 🙂
looks correct, I had missed v0 earlier - thanks for updating that
@obsidian quest did you really use adam_eps=1e-18 for the entire runs?
yes
ok probably sometimes it NaN in 1 step because of this 😂 maybe 1e-16 will avoid this
lol - did that happen? if so how did you fix it?
i just rewind a bit with cleared optimizer states
also, what are those learning rates shown that deviate from the schedule? and the schedule doesn't look like cosine, what was it?
this. i change bsz because of hardware constraint (number of nodes)
but.. it still doesn't look like a cosine 🙂
different cosines patched together
it looks like a time stretched cosine
do you have a formula?
visually it looks something like cos(t**2)
=(1e-4)*(0.01+0.495*(1+COS(x*PI))) is this cosine decay
oh it's because i am using log axis for y
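The spreadsheet formula above is a standard cosine decay from 1e-4 down to 1e-6. A minimal sketch of the general form (the names `cosine_lr`, `lr_init` and `lr_final` are illustrative; how the separate segments are stitched together when batch size changes is not specified here and is not modeled):

```python
import math

def cosine_lr(progress: float, lr_init: float, lr_final: float) -> float:
    """Cosine decay from lr_init (progress=0) to lr_final (progress=1)."""
    progress = min(max(progress, 0.0), 1.0)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))

def spreadsheet_lr(x: float) -> float:
    """The formula quoted in the chat, transcribed directly."""
    return 1e-4 * (0.01 + 0.495 * (1 + math.cos(x * math.pi)))

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, cosine_lr(x, 1e-4, 1e-6), spreadsheet_lr(x))
```

Plotted on a log y-axis, this curve falls slowly at first and steeply near the end, which is why it does not look like a cosine at first glance.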
I added a reference for that
@gusty condor I think we should mention that we increase the number of compute nodes as training progresses
@obsidian quest how many nodes and what kind of GPU was used total?
Yes, let's make it an advantage
This approach not only enhances training efficiency but also utilizes GPU resources economically. After smaller models complete their training, additional GPU resources become available for the later stages of training larger models. This cascading resource allocation ensures that computational power is dynamically reallocated, maximizing hardware utilization and reducing idle time.
Great work! This section is really looking good.
Should we provide the FLOPs counts? I know it has been useful for people in the past, including Quentin
And it can be helpful if we want to put in a table comparing total trained FLOPs vs quality, like we had in the Eagle/Finch paper
I think that will most clearly show the pareto improvement of RWKV7 over these other heavily trained models
We can make it short and simple instead of the longwinded version that we had earlier.
I think the simple formula 6 * model size * training tokens suffices
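As a sketch, the approximation is a one-liner. The parameter and token counts in the example are hypothetical placeholders, not the actual training figures:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as 6 * model size * training tokens."""
    return 6.0 * n_params * n_tokens

# Hypothetical example: a 1.5e9-parameter model trained on 1e12 tokens
print(f"{training_flops(1.5e9, 1e12):.2e} FLOPs")
```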
@misty igloo I think there is a paper named "regular languages in nc1" and you can cite that. (assuming that Wu Tianyi's proof is good)
@bronze frost and I have been discussing the proof at length
we think it needs some revision, but there may be something we can claim that exceeds the abilities of transformers
it also may be able to be simplified quite a bit
Yes I agree
for example, @bronze frost has a very simple construction for showing that you can create true transpositions (row-pair permutation matrices) with RWKV-7
I think NC1 can be achieved by just householder matrices (I'm not an expert in complexity theory)
unfortunately, we think true full permutation matrix requires multiple tokens
I agree with that
but within a single token, having a two-row permutation should exceed transformers abilities
so for example, we should be able to solve swaps on S5 using only incoming (prefill) tokens, which afaict transformers cannot do
supposedly this makes it so that we can correctly claim being NC1 complete under reduction by AC0
I'm not even a novice in this stuff, let alone an expert tho 🙂
Yes, @iron parrot tested on the parity experiment, RWKV-7 can grok while transformers can't
@young sparrow do you have a complexity theorist you could recommend to help us with this aspect of the paper? I'm muddling through but we really need someone who can easily cut through it all and validate our claims and (not yet rewritten) proofs
btw we found that with a slightly larger allowable range on decay, the two-row swap permutation matrix would be possible to achieve even with c=1
was jellyfish's parity experiment done with c=2?
Yes, c=2
(due to normalization, you can use decay instead of c to achieve similar things)
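A quick numerical check of the two-row swap idea, assuming a per-token transition of the form diag(w) + a^T b with w = 1. With the unit vector u = (e_i - e_j)/sqrt(2) and c = 2, the Householder matrix I - c u u^T is exactly the transposition of rows i and j. This sketch ignores whatever range constraints the real model places on w, a and b; the helper name `transition` is illustrative:

```python
import numpy as np

def transition(w: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Diagonal-plus-rank-1 matrix diag(w) + a^T b."""
    return np.diag(w) + np.outer(a, b)

n, i, j = 4, 1, 3
u = np.zeros(n)
u[i], u[j] = 1 / np.sqrt(2), -1 / np.sqrt(2)   # unit vector along e_i - e_j
c = 2.0

M = transition(np.ones(n), -c * u, u)          # I - 2 u u^T

P = np.eye(n)
P[[i, j]] = P[[j, i]]                          # reference swap of rows i and j

print(np.allclose(M, P))
```

With c = 1 and w = 1 the same construction gives a projection rather than a swap, which fits the remark that a wider decay range would be needed in that case.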
@misty igloo Will Merrill is the expert on this topic. Let me ping him and set up an introduction.
thanks, that would be great!
Yes I've been looking through his papers lately
I'm a little worried that I'm at too low a level of comprehension of this stuff to be the right one to discuss with him, but I do have a well informed broad view of what we're trying to achieve somewhat generally, and the mechanisms involved
COLM 2025
OpenReview submission site opens: February 27, 2025
Abstract deadline: March 20, 2025
Full paper submission deadline: March 27, 2025
Rebuttal period: May 27 to June 10, 2025
Decision notifications: July 7, 2025
Conference dates: October 7-10, 2025
nor, are you somewhat well versed in complexity enough to help us with the paper? it sounds like you might be
If so, let's chat
Have you asked Riccardo by any chance? I think very few people might be versed in this
I don't know him, but he's in the FLA discord right?
yes
final loss and # nodes
what GPU? H800?
I found this set of synthetic tasks which seems relevant (https://arxiv.org/pdf/2403.17844). I ran a few of them and v7 is performing quite well. Here's an early plot (only ~10% complete but looking promising)
Also, someone should try the scaling experiments too but that looks like it will cost $$$$
Not sure about settings if anyone can check:
https://github.com/guangyusong/mad-lab/commit/7daaf1f0b143ea21a07f7aa042d7736d114459b1
yes
That compression accuracy lmao
Need to tag ffmpeg guy
It's only like 10% done so we haven't gotten to the hard stuff yet
Compression here being a repeat after me?
It's encoding a sequence into a token and then decoding it
You can use transferable LR (like what I did) to save time.
Like state tune overfitting but for a single embedding?
Sentence Autoencoder?
Thanks! I think we're good for right now since things are probably going to move around a bunch still in order to cram everything into the usual 9 page limit. But once that happens we might need some help massaging it all together so that it flows well. I'll reach out if and when we do!
And feedback is always welcome!
The writing process has been fairly organized so far, which has been great. I'd like to keep it that way and have editing proceed in an organized manner, with people mostly just adding sections, or working together directly on a section. We may need a wider edit beyond that and what I can provide at some point in the near future, I just want to avoid the 'the whole paper gets rewritten every day' thing that happened towards the end last year.
That's a good way to put it.
Right, but with random tokens
I think the arxiv v1 version can be uploaded in a few hours.
imo figures 3, 8, 9, 13, and 16 should be remade with the same theme and bigger font for clarity and cohesion at some point
They are from 5 different people
If they could share the data, I or someone else can remake the graphs using the same theme
currently, it seems like there is a mix of excel and python-generated plots...
Figure 16? all right...
MAD tasks for v7 finished, this looks more reasonable now. Still pretty good.
Why are you in such a rush to upload it today? I don't think it's ready.
And I'm not comfortable yet with the exact claims we can make for the complexity class, yet that should be something we claim in the abstract.
I'm also not sure we have properly shown SOTA that we claim. I am working on a FLOPS chart, which will likely show that we have a new pareto frontier here, which would be a desirable claim.
Some other notes: I think you need to show Mamba-2 Pile in section K. It's not fair to compare it to the older model only.
I also don't think the paper will realistically be ready for arxiv today, but we should get it ready as soon as possible
We need everyone to fill in the author contributions section, it's barely started.
@here Yes, in the spirit of getting ready as soon as possible: If you made significant contributions to the paper and want to be listed as an author, please list your name and affiliations at the top in the authors section and begin putting in the details of your contributions into Appendix A: Author Contributions. I don't think we currently left anyone out of the list of authors, but definitely let me know if we did. Please also let me know your email address.
yeah need a few more days
i think our avg will be the best 🙂
It is!
I have an AudioRWKV experiment in the oven but I doubt it'll be ready for v1, still trying to figure out Goose state tuning. In the meantime, it looks like sec 2 is still a draft, so I might work on that - do I need to ask to make changes?
I was about to go through the whole thing and edit and move stuff as needed, DM me and we can figure out how to collaborate on it!
let's call it a generalized FWP (fast weight programmer) RNN to respect Schmidhuber 😂
RWKV-7 is a generalized version because i am using deformed keys etc.
@brisk bronze please use lm-eval 0.4.3 and fp32 to evaluate mamba2
lets sort rows by avg
Would recommend adding gated deltanet here to show the advantage of the vector lr and decay
Done. Will also experiment with inits, this one was just a 'naive' run so we have room for improvement.
Good point. Let's see if they have results we can borrow. All depends on how much time we have
@brisk bronze You didn't test https://huggingface.co/state-spaces/mamba2-370m
yeah pls use my init (such as https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py line 847-850, 966-967 and you should see much faster convergence)
compress score looks wrong
or share your code so i can check
I think it's more like wrong LR / Adam epsilon
@obsidian quest Which ZeRO stage is RWKV-7 trained on?
Is RWKV trained without pipeline parallelism?
zero2
@fresh mulch and I are currently doing ablations against all changes from gated deltanet
they just arent in the manuscript yet
these are the differences we're ablating - please let us know if there are others you think are important to show:
- making the gating (decay) w vector-valued instead of scalar
- making the removal kk and replacement k keys different from one another
- making the in-context learning rate a a vector instead of scalar
- adding bonus (last part of code)
@fresh mulch I know we discussed this but plz make sure you're following blinks recommendation here
Oh nice
Would say the parametrization of the decay
Gated deltanet uses the mamba way iirc so can compare with the rwkv with bias
Do you mean something different than reducing the decay to scalar per head? That's what I meant we are ablating in the first bullet point
Ya like the calculation of the decay, mamba uses a specific init and multiplication style which gated deltanet use (songlin mentioned this was pretty important)
@quaint quiver are you referring to training a gated deltanet for table 7, or using the gated deltanet init for our rwkv7 ablations, or something else?
Mainly for table 7 as apparently it was important although could also be done as an ablation
added!
more models in https://huggingface.co/spaces/Jellyfish042/UncheatableEval
Changing inits did improve to 46. Here's the code:
got loss curve comparison? 🙂
could you explain this 🙂
@fringe egret are you waiting on results for reportsumsort/showssort bamboo benchmarks or did every model legitimately just score a 0 on them?
pls mention contents in this https://www.rwkv.com/images/RWKV-7.png
and this https://x.com/BlinkDL_AI/status/1861796264620572859/photo/1
feel free to change my text
Given two sequences of vectors $\{k_t\}$ and $\{v_t\}$, RWKV-7 will test-time-train an internal model $v \approx k S^\top$ via in-context gradient descent w.r.t. the L2 loss $\mathcal{L} = \frac{1}{2}\Vert\, v - k S^\top\Vert^2$.
The gradient is:
\[\frac{\partial \mathcal{L}}{\partial S} = S k^\top k - v^\top k\]
The gradient descent formula (with dynamic weight decay $w_t$ and learning rate $\eta_t$) is:
\[S_t = S_{t-1} \operatorname{diag}(w_t) - (S_{t-1} k_t^\top k_t - v_t^\top k_t)\operatorname{diag}(\eta_t)\]
which equals:
\[S_t = S_{t-1} \left(\operatorname{diag}(w_t) - k_t^\top k_t\operatorname{diag}(\eta_t)\right) + v_t^\top k_t\operatorname{diag}(\eta_t)\]
In RWKV-7 I use the generalized formula:
\[S_{t} = S_{t-1} (\operatorname{diag}(w_t) + \textbf{a}_t^\top \textbf{b}_t) + \textbf{v}_t^\top \textbf{k}_t\]
where a reasonable choice of initial values is $\textbf{a} = -k$, $\textbf{b} = k\cdot\eta$, $\textbf{v} = v$, $\textbf{k} = k \cdot \eta$.
(update: basically diagonal + rank1 because it's good for parallelization. we can do rankn by adding more terms but it will be slower)
\textbf{RWKV-7 uses $\{k_t, v_t\}$ to test-time-train an internal model and uses $\{r_t\}$ as input for this model.} It overcomes the $\mathsf{TC^0}$ limitation of QKV-softmax-attention transformers (and RWKV-6, Mamba, Mamba-2, xLSTM, GLA, ...), while still being efficiently trainable on GPUs.
Such ideas can be traced back to fast weights (1991) by Jürgen Schmidhuber, delta rule (1959) by Bernard Widrow, hebbian learning (1949) by Donald Hebb. RWKV-7 is a generalized scalable version with more tricks to make it actually great at LLM. Details are in my open-source code.
Because the internal model is $v \approx k S^\top$, the output for input $r$ is $r S^\top$, and the pseudocode is:
\begin{lstlisting}
for t in range(T):
    sab = torch.einsum("ik,k,j->ij", state, a[t], b[t])
    state = state * w[t] + sab + torch.einsum("j,i->ij", k[t], v[t])
    out[t] = torch.einsum("j,ij->i", r[t], state)
\end{lstlisting}
\vspace{-8pt}
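A runnable NumPy version of the pseudocode above, together with a check that the suggested initial choice (a = -k, b = k·η, replacement key k·η) reproduces the explicit gradient-descent update. This is a single-head sketch under those assumptions, not the optimized kernel; the function names, shapes and random inputs are illustrative:

```python
import numpy as np

def rwkv7_recurrence(r, w, k, v, a, b, state):
    """Run S_t = S_{t-1} (diag(w_t) + a_t^T b_t) + v_t^T k_t and read out r_t S_t^T.

    r, w, k, a, b: arrays of shape (T, N); v: shape (T, M); state: shape (M, N).
    """
    out = np.empty((r.shape[0], state.shape[0]))
    for t in range(r.shape[0]):
        sab = np.einsum("ik,k,j->ij", state, a[t], b[t])   # S_{t-1} a_t^T b_t
        state = state * w[t] + sab + np.einsum("j,i->ij", k[t], v[t])
        out[t] = np.einsum("j,ij->i", r[t], state)         # r_t S_t^T
    return out, state

def gradient_step(state, w, k, v, eta):
    """One explicit step: S diag(w) - (S k^T k - v^T k) diag(eta)."""
    grad = state @ np.outer(k, k) - np.outer(v, k)
    return state * w - grad * eta

# Toy inputs (arbitrary shapes and values)
rng = np.random.default_rng(0)
T, N = 6, 4
r, k, v = (rng.standard_normal((T, N)) for _ in range(3))
w = rng.uniform(0.9, 1.0, (T, N))
eta = rng.uniform(0.0, 0.5, (T, N))

_, S_general = rwkv7_recurrence(r, w, k * eta, v, -k, k * eta, np.zeros((N, N)))

S_explicit = np.zeros((N, N))
for t in range(T):
    S_explicit = gradient_step(S_explicit, w[t], k[t], v[t], eta[t])

print(np.allclose(S_general, S_explicit))
```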
this should go into section 3
use log scale Y-axis for (a) RMS of RWKV state entries @gusty condor
@fringe egret we were very clear that we don't want bamboo featured in the paper - I'm sorry if that was somehow not communicated properly, but we had a whole public discussion of it here in this channel after you asked and before you added it, and I removed the old one from the paper already
Table 3 4 9 17 19, put stronger models on top
oh why
Baber thinks it's not well suited for base models
From this discussion
however our bamboo results are good so we can show them 🙂
was mostly going by this
yes that is what was in the eagle/finch paper
good compared to what? the current results don't even include relevant recent models like mamba 2 or Qwen2.5 or Llama 3.2
anyway it really doesn't matter to me whether or not we include bamboo, but if we show bamboo it has to include those models.
and its weird that it got added again with no further interaction after the previous discussion about it ended with this comment
lets test their base models
okay, @fringe egret please go ahead and test those base models if you'd like to include this benchmark in the paper
Okay, I'll try testing it on other base models.
figure 3 15 16, please test pg19 (not proofpile) @gusty condor
proofpile likely has bad information density. and test rwkv WORLD models
Yes, I tested World models.
Figure 3 from @iron parrot
ok please test pg19 for figure 15 16
cant find init code
I just made a fix
oh why 0.8+ but you mentioned 46%
@obsidian quest @gusty condor this was what happened when @iron parrot ran it on PG19
We group results by config parameters, find the best accuracy run, and average the best-per-config accuracies.
This was jellyfish's PG19 result - we could use that in figure 3 if you prefer it over proofpile...
Can you elaborate?
it's very strange because loss 1024-2047 should definitely be lower than 0-1023
probably code bug
please test other models too
In the last paper I skipped the first 2048 of PG19 I think because it was all weird formatting stuff a lot
ok then we should pick the "middle 16384 token" for each data item
if the length = X, pick token X*0.5-8192 to X*0.5+8192
theyre books so the beginning is often some fairly standard preamble, contents etc
yeah then we should do this. and only for data items with length > 32k
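The selection rule above can be sketched as follows. Treating "32k" as 32768 tokens is an assumption, and the function name `middle_window` is illustrative:

```python
def middle_window(tokens, window=16384, min_len=32768):
    """Return the middle `window` tokens of a document, or None if it is too short.

    Implements: for length X, take tokens X*0.5 - 8192 to X*0.5 + 8192,
    only for items with length > 32k (assumed here to mean 32768).
    """
    n = len(tokens)
    if n <= min_len:
        return None
    start = n // 2 - window // 2
    return tokens[start:start + window]

doc = list(range(40000))          # stand-in for a tokenized book
chunk = middle_window(doc)
print(len(chunk), chunk[0], chunk[-1])
```

Taking the middle of each book skips the preamble and table of contents at the start, which is the formatting issue raised above.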
How I understood it is:
For each training run config group excluding learning rate and weight decay, we find the highest scoring one, and we take a simple average of them.
There's also some info on page 7 and 11: https://arxiv.org/pdf/2403.17844
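The scoring procedure as described in this thread can be sketched like this. The record layout (`config`, `accuracy`) and the swept key names (`lr`, `weight_decay`) are hypothetical, and this is a reading of the description above rather than the benchmark's own code:

```python
from collections import defaultdict

def mad_score(runs, swept=("lr", "weight_decay")):
    """Group runs by config (ignoring swept keys), keep the best accuracy per group, average."""
    if not runs:
        raise ValueError("no runs given")
    best = defaultdict(float)
    for run in runs:
        key = tuple(sorted((name, value) for name, value in run["config"].items()
                           if name not in swept))
        best[key] = max(best[key], run["accuracy"])
    return sum(best.values()) / len(best)

# Made-up runs for illustration only
runs = [
    {"config": {"seq_len": 128, "lr": 1e-3, "weight_decay": 0.0}, "accuracy": 0.5},
    {"config": {"seq_len": 128, "lr": 1e-4, "weight_decay": 0.1}, "accuracy": 0.9},
    {"config": {"seq_len": 256, "lr": 1e-3, "weight_decay": 0.0}, "accuracy": 0.4},
    {"config": {"seq_len": 256, "lr": 1e-4, "weight_decay": 0.1}, "accuracy": 0.6},
]
print(mad_score(runs))
```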
Hey, I talked with Ruichong, probably you could use some of my help for long context benchmark?
Oh, thank you so much! I'd appreciate your help with the benchmark. We can discuss the details when you are free.
I'll try this first, then see how Mamba performs. I think there's something off about the data distribution in the PG19 dataset.
RWKV-7 World seems to be a bit 'overfitted' to 4k ctx; when the context exceeds 4k tokens, the ppl increases.
@iron parrot its due to this reason
Here's how RWKV World performs on PG19 (16k ctx) after removing the first 2048 tokens
After 16k, RWKV-7's ppl increases
That's the 'overfitting to 4k' phenomenon I mentioned before
Don't label these with the range, that makes it very hard to read.
If you're going to report average loss, you should not use a line plot. A histogram would be more appropriate
Okay, I will change it to a histogram
pls verify Mamba2-1.3B glue 46.1 in table 17. check its individual components vs rwkv7
looks reasonable
still much better than mamba & transformers
just show this range, up to 16k
@brisk bronze
@iron parrot we have our context-length extended version of 2.9B available at https://huggingface.co/SmerkyG/RWKV7-2.9B-World3-128k-250225 - seems like you might want to try that one too for PG19
we found that it was quite a bit better than the base model for NIAH as context length grew, so hopefully it should be for PG19 loss as well
This is really bad. You can't deliberately crop the graph at the point performance starts to degrade. I don't see any reason to do this other than to mislead the reader.
ok 🙂 then we should compare with other models
highlight all RWKV model names in Table 5
pls search for my ID here to see all suggestions
So far, my test results on the proof pile and PG19 show:
All pile models (v4 v5 v6 v7) show decreasing loss as sequence length increases.
For world models (trained on way more tokens), the behavior varies: v4's loss explodes with longer sequences, v5 and v6's loss decreases then stabilizes, while v7's loss slightly increases after about 16k tokens (long-context fine-tuning can fix this).
This is why I call it some kind of "overfitting", more training actually hurts generalization. The severity ranking is: v4 > v7 > v6 = v5.
I think we should show separate loss charts for pile and world models, discuss this issue, and include comparisons with other models.
@obsidian quest @young sparrow @misty igloo
ok we can show these. still much better than transformers
Here are the PG19 test results, the fine-tuned version performs much better after 16k tokens
Maybe it's time for us to shrink the state size once V7 has better state utilization.
https://arxiv.org/pdf/2410.07145 Based on some of this paper's findings
when you do ctxlen extension, use LONG data, for much better results
RWKV-7 World 0.1B and 0.4B LM Evaluation Harness Benchmarks
English focus
Multilang focus
MMLU 0-shot or 5-shot?
0-shot.
I tested 5-shot (in order to match Qwen's performance in the technical report)
I see. Yes, Qwen models are strong on MMLU.
I'll check.
Usually 0-shot is 1-2% worse than 5-shot
That's a great result! Glad the fine-tuned version helped!
@hushed orchid just bringing this to your attention from Blink for reference on future ctxlen extension attempts
I agree, let's show both and discuss the difference. Interesting that the World models become overfit on a specific length.
Transformers, too, (including Qwen) are typically post-trained to increase context length so while this isn't exactly a win for us it's interestingly comparable.
a bit noisy since pg19 test set only has 100 samples
MMLU has been recomputed with 5-shot.
I think that very few people would be interested in eval results of such small models. But you can add them to the paper.
actually we may need these for @brisk bronze and my upcoming FLOPS vs acc plot
she was running them too but I think she had some tech issues
Should figure 4,5,6,7 be unified into a large figure?
Also please include v6 results too
@gusty condor the pawsx number for 2.9B looks incorrect to me... could you check that it was entered correctly?
By the way, I think these two pixels (as seen in other states) are used to pin the GroupNorm, preventing it from drifting. Now that v7 has O(1) state size, may we remove that GroupNorm?
Or, I think we can use GroupRMSNorm for that. Therefore we need one value to pin an RMSNorm, instead of two values to pin a LayerNorm.
I checked pawsx, and it looked correct.
okay, weird that it got so low?
like the bigger model and more training made it a lot worse than it was previously
There are some inverse scaling problems
I was going to recommend this. I'm having trouble remembering which papers have plots I like, but something like this? People also have been shading regions & drawing Pareto frontier lines, which seems like a good idea
I disagree actually. Powerful small models are popular.
yes, we are working on it now (thanks! this was also your very helpful suggestion last year and I think it's a great plot to have)
Most of these models can't get a nontrivial score on MMLU
still working on it, but interesting initial fit lines
(somehow excel is being annoying but the unlabeled ones are the other goose world3 models)
So let's plot something more interesting than MMLU score 🙂
pls add 1.5b 2.9b too
pawsx is noisy because it's using a format unseen in usual training data, which can be seen from other models' numbers (llama 3b << llama 1b). i suggest remove it
maybe multi shot would help smooth out that issue across models
yeah can try
There have also been some formatting issues identified with paws-x:
https://github.com/EleutherAI/lm-evaluation-harness/issues/2442
I can add a PR with the fixes if anyone wants to try it
@gusty condor I ran glue for rwkv7-1.47b-pile in table 17 and the average of the subscores is like 8 percentage points higher than the glue score it computes, and rwkv7-421M-pile is 12 percentage points off. the glue score it computes is same as in paper
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-1.47B-pile/results_2025-03-05T00-33-06.json
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-1.47B-pile/results_2025-03-05T01-44-07.json
lambada.o was exactly the same tho
mamba2 on the other hand, subscores of glue and computed glue score are only different by like 0.5 percentage points, probably due to rounding. they were run with lm-eval-0.4.3 and fp32 too
thats strange 😂 so the real GLUE of rwkv7 should be 8/12 percent higher?
yeah basically, although 1.47b has a lower glue score than 421m pile 🤨 (48.0 for 1.47b and 50.3 for 421m)
It's weighted average, I think
do you happen to know the weights
fwiw rwkv7 1.47b had higher scores on subtasks except for 1 or 2 iirc on glue
I think: like each single problem is given equal weight, rather than each task
i think the indices for equation 8 are wrong
u_{t,j} should be only u_t and be a scalar, not of head dimension, as it is an inner product
and if we follow dirac notation (bra-ket) for the inner product, there should be a , or a | between r and diag(rho)
but using dirac notation and einstein notation in the same equation is a bit confusing imo
so: dirac notation: "add , and remove j subindex" or einstein: "remove <> and diag() "
votes?
either way, the in R^{D/h} should be in R so im making that change already
thoughts for changing Fresh for novel / new / recent ?
I think we can fuse the gating in 4.1.4 into equation 11; similar to equation 10
in the Pseudocode For RWKV-7 (appendix G) i would separate the weight projections from the time recurrent operation for clarity
happy to take this task
Recent ✅
Novel ❌
Novel has a different meaning
ok recent
github repo needs to be updated
to include v7 code from blinks repo
I think part of Appendix C (until theorem) can be moved to methods and the proof can be kept in the appendix
I originally wrote them in the methods and Smerky moved them into the appendix.
@misty igloo @gusty condor if we have a bit more time (say 20 hours) I would like to include a more theoretical motivation and comparison of rwkv7 vs rwkv6 vs other linear RNNs. I think this can give the paper a more theoretical grounding rather than the empirical vibe "we mixed 30 things and the result is cool ~sota"
same for appendix D
No problem. Do you mean this?
yes similar. I think in point 3 we can motivate the decisions well, including a table similar to that
I tried deriving that and found the online objective being overly complex.
in fact, in that table Longhorn's claimed squared associative objective is wrong, as their simplification for practical considerations effectively makes it an inner product objective
yes @gusty condor bc rwkv7 is explicit gradient descent, and those objectives are derived from implicit gradient descent algorithm
so a proper explanation is what i want to include
i added a citation for the OG scaling laws on lstm paper by Baidu in 2017 in the introduction
seems nan is unrelated to adam eps
They're using lm eval version 0.4.3 - are there other relevant fixes that have occurred since then?
yeah. a major one was this
https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
sounded reasonable to me. paws-x has a quite non-standard format
Okay, I made that change. The real question is which style of readout we prefer. I think the 'alternate' one is a lot easier to understand without all the subscripts (and as you noted, some of the subscripts were incorrect and it was too hard to notice that)
seems like pawsx is a mess - I think we should either rerun it with the latest updates or drop it as an eval
Is this part of the paper correct? I had added it:
Despite the general stability of our loss curves, our use of such an extremely low AdamW $\epsilon$ value did sometimes cause NaN loss across a single training step. When this occurs, we rewind the training to the prior checkpoint, clear optimizer states, and continue from that point.
i think this is probably related to adam eps. so further investigation is required
okay, changed it to:
Despite the general stability of our loss curves, we did sometimes observe NaN loss across a single training step, which we theorize may be due to our use of such an extremely low AdamW $\epsilon$. When this occurs, we rewind the training to the prior checkpoint, clear optimizer states, and continue from that point.
At the time, this and some other changes got the paper into the 9 page limit for COLM submission. Since then a lot has been added and we are way over that limit again. I have been waiting to see the full set of experiments before moving more things into the appendix.
We have no limit for arxiv, so we should submit to arxiv asap.
Yes, though I think we should be purposeful about which experiments are shown and in which order in the main section, so that it is most impactful for the reader.
The newest version of lm-eval won't show the total score of GLUE, which is bad
We could just add the total score of all the subtasks ourselves from the results, right?
Not really. We need the formula, otherwise it may be inconsistent
Oh, I see. Perhaps we extract the formula from the previous version of lm_eval codebase where it was present (v 0.4.3)
from the docs in v 0.4.3:
weight_by_size: bool = True whether to perform micro- averaging (True) or macro- (False) averaging of subtasks' accuracy scores when reporting the group's metric.
```python
class AggMetricConfig(dict):
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    # list of filter names which should be incorporated into the aggregated metric.
    filter_list: Optional[Union[str, list]] = "none"
```
notice it defaults to False
@nova frost I only see `group` set for each task in this 0.4.3 version of GLUE, so would it end up not being weighted by size?
btw, we should try our best to have this paper submitted to arxiv by this time tomorrow.
We have promised somewhere in RWKV.cn that RWKV-7 paper will be available "Early March" (in Chinese: 3月上旬, before March 10th). Just in case if the paper goes "on hold" for several days.
Who made this promise?
And why are we only learning of it one day before you say there is a deadline?
This is really not okay.
yeah. we added micro-averaging mostly to deal with MMLU. the default is simple mean of the subtask (same) metrics
@obsidian quest btw, i've been running ablations on some of the design choices (appendix K.2 at the moment) and find that using the same removal/replacement (k, kk) keys has competitive performance with current baseline Goose. For instance it gets higher acc on minipile validation. What kind of difference have you seen here in your experiments, or is there an intuitive reason why to do it?
no need to hurry. quality is important
i have run extended tests and i will provide more loss data. just too busy at the moment 😂
no problem, thanks! just curious, it's also the least intrusive ablation i tested
pls check this
pls update paper 🙂
subscores are the same as what I got yeah
Post your subscore please
And mamba-2's subscore
hey just checking in, did you end up running bamboo on the other models we need to include it in the paper?
Also, I'm a little confused about the author contributions section - what is the "Compilation of the RWKV World 3.X Corpus?" I am the one who put together the listing, is there some dataset you put on huggingface or something?
I will explain. He sent BlinkDL some data and contributed to World-3.5 corpus this way.
I see. The World-3.5 Corpus is not featured in this paper though.
And neither are the chat examples.
And we need more recent models like mamba 2 if we are going to include the bamboo results.
I think this is a very good idea and it's not clear to me why there is such a rush. This is an artificial deadline right? Having deadlines to motivate work is good, but releasing a worse paper than we could do a few days later due to them is bad.
After continuing to read the messages it seems like most people are on the same page as the above. Also, I don't think anyone's going to get too upset if we do hypothetically miss a deadline promised in a blog post by a few days. That said, this is why EleutherAI has a standing policy of not committing to release dates ahead of time.
I apologize for pushing too hard on an artificial deadline earlier. I am aware that deadlines can be motivating, but the urgency can harm cooperation. Thank you again for your patience.
Moving forward, I will put paper quality first and avoid imposing too much burden on others.
Thank you - just know that we all appreciate the immense amount of hard work you're putting into this paper!
looks like lm-eval 0.4.3 was using weighted average for glue by number of problems so the glue score it outputs is correct.
from api/metrics.py:
```python
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # A helper function that is used to aggregate
    # subtask scores cross-task.
    # TODO: does not hold for non-mean aggregations
    if not weight_by_size:
        sizes = [1] * len(sizes)

    assert len(metrics) == len(sizes)

    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
```
`validation split sizes:
cola: 1043
mnli_matched: 9815
mnli_mismatched: 9832
mrpc: 408
qnli: 5463
qqp: 40430
rte: 277
sst2: 872
stsb: 1500
wnli: 71`
(also explains why mamba2-1.3b was higher on glue even though its subtasks scores don't look super different at first glance)
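To make the effect of the weighting concrete, here is a small sketch that applies the same size-weighted formula to the validation split sizes listed above. The per-task accuracies are made up purely for illustration (they are not results from any model), and whether the harness group actually includes every one of these subtasks is not something this sketch checks.

```python
# Validation split sizes as listed above.
GLUE_VALIDATION_SIZES = {
    "cola": 1043,
    "mnli_matched": 9815,
    "mnli_mismatched": 9832,
    "mrpc": 408,
    "qnli": 5463,
    "qqp": 40430,
    "rte": 277,
    "sst2": 872,
    "stsb": 1500,
    "wnli": 71,
}


def aggregate(metrics, sizes, weight_by_size=True):
    """Size-weighted (micro) mean, or plain (macro) mean if weight_by_size is False."""
    if not weight_by_size:
        sizes = [1] * len(sizes)
    assert len(metrics) == len(sizes)
    return sum(m * s for m, s in zip(metrics, sizes)) / sum(sizes)


def size_share(task, sizes_by_task):
    """Fraction of all validation items that belong to one subtask."""
    return sizes_by_task[task] / sum(sizes_by_task.values())


# Hypothetical accuracies: 0.5 everywhere except QQP, to isolate QQP's influence.
fake_acc = {task: 0.5 for task in GLUE_VALIDATION_SIZES}
fake_acc["qqp"] = 0.8

tasks = list(GLUE_VALIDATION_SIZES)
metrics = [fake_acc[t] for t in tasks]
sizes = [GLUE_VALIDATION_SIZES[t] for t in tasks]

micro = aggregate(metrics, sizes, weight_by_size=True)
macro = aggregate(metrics, sizes, weight_by_size=False)
qqp_share = size_share("qqp", GLUE_VALIDATION_SIZES)
print(f"micro={micro:.4f} macro={macro:.4f} qqp_share={qqp_share:.4f}")
```

With these made-up numbers, a change in QQP alone moves the size-weighted score far more than the plain mean, which is the mechanism behind a model looking better on GLUE without its subtask scores differing much.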
Agree, we dont use einsum anywhere else
after looking into what GLUE is made of, I think we should remove it from the paper
many of the sub-tasks contribute only a percent to the total so the weightings make no sense, causing the numbers reported to be more like 75% QQP than any kind of actual average
and QQP is a pretty weird benchmark, which should probably be run multi-shot to really work well
since we're not going to do that, let's just remove GLUE entirely
I also want to remove paws-x from the paper:
paws-x was broken in v0.4.3, see lm-eval https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
and baber has NEW fixes, that aren't even in the most recent lm-eval: https://github.com/EleutherAI/lm-evaluation-harness/pull/2759
Seems to me that it's too messed up and should be removed.
Later versions of lm-eval don't even spit out an aggregate score for GLUE, and our aggregate score doesn't even include all the subtasks. The other evals have been more stable across lm-eval versions, which will help future authors compare to our results. These two benchmarks are simply too wild and messy.
These are all fair points that you are raising.
QQP is the Quora Question Pairs paraphrase subtask.
The task is whether two sentences are semantically equivalent.
Yeah, it basically asks 'Do these two questions have the same meaning' Yes/No
For paws-x we can right now use the pawsxx branch https://github.com/EleutherAI/lm-evaluation-harness/tree/pawsxx
But we should really think well about which tasks to show.
We can compute them anyway and consider to use them or not. Any suggestions for substitute tasks?
I don't think we need a substitute.
Yeah, we could also just drop them. True. Less hassle in the end.
this is a wrong choice... should use avg of different tasks
Cherry picking harms academic integrity!
I suggest blimp (but the scores will be really high)
Yeah I wouldn't have wanted to drop either of them bc of that concern, but both evals just seem like a mess in general
and tbh dropping glue seems to harm us a bit on the flops vs acc chart, so at least its not really in our favor anyway
blimp is too simple for LLMs
glue and superglue are all noisy
dropping badly designed datasets (shown to be bad for llama3 too) is reasonable
@gusty condor What happens if the value of c is large (e.g., equal to the wkv matrix dimension)? I found that if c gets this big, it might be possible to simulate (reverse) the Boolean transition matrix with a single step transition
The range of WKV will explode
RWKV-7 World v3 corpus as itemized and annotated list on HF:
https://huggingface.co/datasets/hevok/Goose-World-v3
Alternatively the corpus as a HF Collection: https://huggingface.co/collections/hevok/rwkv-world-v3-corpus-67be08105ff513c71632e9dd
Additionally, here is a collection of RWKV-7 related resources: https://huggingface.co/collections/hevok/rwkv-7-goose-67c9dd2154d811c24a093f0c
@gusty condor @keen tartan is Table 13 correct? Not sure where this breakdown comes from...
If I recall correctly @obsidian quest provided it.
here, yeah
but @keen tartan weren't you working to accomplish this item above?
so it is in fact not correct (yet)
I categorized all datasets with these classes.
I intend to automatically classify all individual datasets.
Like world languages, artificial or natural, and categories.
okay, did you co-ordinate the results with Blink? if not, please do so we can get the updated table of categories into the paper
I will do so.
thanks!
(I'm just going through making sure we have everything right and aren't missing things that need updates)
Understandable.
@obsidian quest do you happen to have a checkpoint for the Pile models at 300B tokens rather than 332B?
i always train full pile 332B, and other models are probably doing this too and say 300B for simplicity
Unfortunately, it seems that Mamba and possibly others followed Eleuther's Pythia approach which was to limit training to 300B of the Pile. It appears that it is not just rounding.
I was surprised by this.
My understanding is it was so that people could compare directly to Pythia.
What, never heard of it
Take a close look at the Mamba 2 paper - and Stella confirmed that Pythia used only 300B instead of the full pile.
No! Do we have to retrain these models?
Doesn't sound feasible to me 😦
I'm not sure what to do, but it's fine in our FLOPs vs Acc plot since it's adjusted for training length
latest version of that, not final tho - I still have some work to do on deciding the exact flops counts
Mamba uses tied word embeddings but RWKV does not.
yeah, there are definitely differences in the models that make it hard to compare fairly
and counting flops is kind of only a vaguely correct metric in general - it doesn't directly dictate how fast GPUs run the model
Your chart is not accurate, you should subtract the embeddings of RWKV
yes like I said above, I still have some work to do on deciding the exact flops counts
that may push them apart slightly
(we also did not include GLUE in this average because I think its a broken eval)
I can help you with that
thanks, that would be great!
it's not super clear exactly which FLOPs formulas we should use... the attention mechanisms add a small amount, esp because Mamba does the 2x expansion thing
Can you send me the source
this will make rwkv look bad because rwkv7 avg eval @ 90% trained is almost same as 100%
models trained using different amt of data cant be compared like this
models trained with smaller amt of data (such as mamba2) will appear far better than those with more data (such as qwen llama), because if we want optimal loss vs flops we need to follow scaling laws, which no one follows in practice for obvious reasons
I think for readers unfamiliar with RWKV architecture, the "Blocks" label within the L Blocks notation in this diagram might cause confusion. Specifically, there's ambiguity about whether "Blocks" refers to the entire module or a specific component (like the Time Mix unit) within it. To enhance clarity, perhaps relocating the "L" designation outside the block representation would create a more intuitive visual hierarchy.
I see
@dawn pewter
WorldRWKV: https://github.com/JL-er/WorldRWKV/tree/main This demonstrates RWKV7's strong comprehension ability, capable of accepting any modality and performing excellently on benchmarks. Can this be included in the RWKV7 paper?
We already feature VisualRWKV in the paper, which does Image QA and gets higher results, and I don't think we should feature two of these. But would you like to add Audio QA to the paper as a new multimodal experiments subsection?
If so, it's probably important to do experiments comparing the results with some other architecture. (could be RWKV-6, but even better if it's something else)
If you don't have time for this now, it could still be possible for us to add it in a future version of the paper if it's complete before we submit to COLM. COLM deadlines are March 20th for abstract, March 27th for paper.
@iron parrot Please adjust figure 3 and 4 so that the colors of v7, v6, v5, v4, mamba, mamba-2, v7-128k are consistent across two images
I told you this when we wrote the original RWKV paper and then again when we wrote the second.
To clarify, you mean that these other models train an actual 300B tokens, correct? Not that they just write 300B but train 332B.
(I know you said that pythia does actual 300B)
From the Pythia paper
I think yes, but I think very few people are aware of that. I bet no reviewer will raise questions on this specific point.
Yes
Well, the goal is to do everything correctly to the best of our ability. Not to fool reviewers.
And to show correct scientific results that do not contain known errors.
We can show anything we have, as long as we point out the distinctions.
All RWKV models there are trained with 332B Pile, so comparisons are still valid
OK, done.
For now, I adjusted the Pile ablations table and discussion to remove Pythia, Mamba, and Mamba2
If someone has the resources and we want to compare to those others in table format we could train just RWKV7-1.47B on 300B tokens of Pile. But imho the comparison this way is kind of pointless because they all use somewhat different parameter counts. Probably mainly due to differences in weight tying, at least in Mamba's case.
Depending on what the claims we want to make are, I don't see a huge issue in using models with slightly different parameter counts and slightly different training token counts
I could attempt doing it if it is seen as useful. Have some spare compute.
Need to estimate the requirements.
Do we have The Pile binidx files somewhere already?
I added a tokens column, in case we want to do that. Not quite sure what claims we could make though, if any.
The reality is that these RWKV-7 models are both more parameters and 10% more tokens trained than the ones being compared to.
Yes, and ask Blink for his compute
https://huggingface.co/BlinkDL/rwkv7-g1/blob/main/rwkv7-g1-0.1b-20250307-ctx4096.pth
announcement a bit later
are there any objections - if not I will go ahead and remove these two broken and/or messed up benchmarks
@keen tartan I'm seeing open-web-math, algebraic-stack (both of which point to proof-pile-2) and FLAN got added to the v3 dataset listing and citations - do you know why these were added? Afaict they were not in my original v3 list, based on what Blink originally sent me
From looking at the document history, it appears you added them to the dataset listing on 26th February, 1:26 pm ET
Don't remove glue. Pawsx can be removed
You can't only pick a benchmark when it's advantageous for you.
I'm definitely not trying to do that - and I agree that's bad
But let's not use glue in any future paper, because it's not well constructed
I think we can add it to the flops chart (I have no idea if the result will benefit or harm rwkv there) by applying the weighting formula manually
@brisk bronze please take a look at how we can do that if you get a chance to
use avg for glue, not weighted by number of items in each subset @misty igloo @gusty condor
because weighted by items makes no sense here
WorldRWKV has a stronger visual QA benchmark, but there are currently no machine experiments—updates will be made later. The audio QA benchmark has already reached SOTA and does not need to be compared with RWKV6. I believe WorldRWKV should appear as a whole rather than being split up, as this is meant to demonstrate RWKV7's ability to understand any modality.
I will provide a stronger benchmark, and you can update it in the subsequent RWKV7 paper.
Also ccnews. I made an initial itemized list of the World v3 corpus in an Excel sheet, as was suggested, and Blink added the missing datasets. Based on this I added them to the paper as well. I understood that the original objective was to identify missing datasets. We also found that the DeepMind Mathematics dataset dm_math was part of World v2 as a constituent of The Pile, but it was not mentioned in the Eagle & Finch paper. I tried to mention it as a footnote (b) to the table about World v2.1, but perhaps there is a better place for it.
Just curious, how do you normally convert a dataset to binidx files? I have a Rust script https://github.com/cahya-wirawan/json2bin that would convert the 825GB Pile dataset in about 40 minutes (using an M2 Mac mini) instead of 45 hours using the Python script
I use your json2bin implementation already. It is blazing fast. Thank you so much for making it!
For The Pile Comparison experiment we need to segment text with the GPTNeoX Tokenizer. I have been using the Rust implementation only with the World tokenizer. How would you go about specifying a different tokenizer?
I think we need to change the library to support other tokenizers, if I am not mistaken. I see `let tokenizer = rwkv_tokenizer::WorldTokenizer::new(None).unwrap();` is hardcoded right now. This matters in particular with a view to supporting other modalities too.
By the way, I did identify a preprocessed version of The Pile with the GPTNeoX tokenizer on HuggingFace as a dataset: https://huggingface.co/datasets/RichardErkhov/RWKV-LM_pile_binidx_dataset
From my tests it seems to be correct as we can decode the original text from it. It is however split across many binidx files rather than a single pair of files.
Added SmolLM2-1.7B
Removed PAWS-X and added SmolLM2-1.7B too.
@brisk bronze @gusty condor We need to share the lm_eval results files to calculate MMLU with either the weighted or non-weighted average ourselves.
I implemented only the world tokenizer because I thought we will not use the old tokenizer anymore:) I will have a look how to add other tokenizer too
WorldRWKV can be written in a separate paper.
Overleaf in dark/night mode. Finally eye strain reduced!
I agree that item-weighted makes no sense, because that way 75% of it is one subset and many others are 1% or less. We are also excluding parts of it that do not have accuracy as a result.
This weighting is a result of a mistake in the old version of lm eval harness used.
Okay, that's great! Thank you for doing that and verifying that Blink made the additions.
I updated the links to proof pile to point to the proper subdirectories
Why use a plain average for GLUE but not for MMLU?
last time I checked mmlu was missing here - do you have those results as well?
Probably overwritten?
Oh, I found them
0.1B and 0.4B tested by @keen tartan
I put evals results on HF. Just need to organize them. On it right now.
I like this figure, It's a very intuitive way of showing the equation
I painted it
It is beautiful. I like its simplicity and color composition. It is a piece of art!
RWKV 0.1B and 0.4B models evals: https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval
I ran 0-shot and 5 shot for MMLU separately. Sometimes I ran evals separately per task to better organize. This is why there are multiple files per model. I organize each eval set in a folder per model.
I add the other reference models evals too.
I think we should keep the eval as-is unless we have a very good reason. Making slight modifications to those evals will arouse the attention of reviewers, putting us at risk of rejection.
We should provide convenience for reviewers to verify our results. i.e., using the default averaging method of the evaluation framework without making any tweaks to the results. This simplifies the reproducibility process and avoids potential accusations of "tweaking" the results to favor our model.
I never requested that we change the weighting (Blink did tho), I requested that we drop the eval entirely because it uses a bad weighting in that specific lm eval version
in newer lm eval versions it does not print an average at all for glue
glue also includes other components which do not contain accuracy at all, and these are not reflected in the accuracy score
This is probably originally my fault for including glue in the prior paper without checking it thoroughly beforehand
We can still include it anyway, because we are already averaging over 9 benchmarks.
updated bolding for smollm2-1.7b on arcC column to reflect better score
Who did the evaluations for Llama3.2 1B/3B and Qwen2.5 1.5B/3B?
@misty igloo I like Figure 3: FLOPs vs. Average Accuracy. I think the title is redundant with the figure caption.
Better label the axes with average accuracy and log scale compute in TFLOPs instead of a title at the top.
Perhaps express the accuracy in %
I also suggest trying to mitigate the overplotting of labels on each other for Mamba and RWKV7-Pile. I know plotting software often makes it hard to position them apart.
I think the point labels could be shorter just indicating the size of the model as the architecture/dataset is specified in the legend and encoded in color already, e.g. 0.1B, 0.4B, 1.5B, 2.9B, etc.
Should we add transformers to the PG19 long range context loss plots?
updated, now including size-weighted GLUE results like the rest of the paper contains
actually, im going to increase the text size...
Yes, please. There is plenty of space.
hehe there isn't much sadly bc the labels will overlap
Make the labels shorter.
Just the size of the model please.
Put the legend in bottom right or top left.
Make the axis tick text a bit bigger, and the axis labels too.
hm I dislike it with just the size, but I understand your reasoning
working on it...
Please put the % in parentheses after the axis label, i.e. "Average Accuracy (%)", and remove it from the y-axis numbers.
updated
my formula was accidentally off before, and this revealed that mistake so its a happy accident that it got lost
not sure I can do that
this is yet another case where GLUE is messing up something
this time llama is gonna look horrible as a result
Hmm
I really hate this benchmark, at least the way we're using it (which is terrible imho)
I might remove llama entirely because I think it's a completely unfair representation of it
it literally scores worse than its own 1B, that was DISTILLED from the same model, on GLUE the way we calculate it
We can iterate over it until it is correct.
Let me look into it.
How do you calculate the compute complexity? I mean estimate.
it took a bunch of work
I can imagine.
the basic formula is 6 x params x tokens
but there are variations in the models that matter
like some use tied embeddings, and embedding doesnt really take flops (it's essentially a lookup table) but de-embedding for the lm_head does
and rwkv was upgraded from prior models which had to be calculated separately
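A rough sketch of the rule of thumb above, with the embedding caveat folded in. This is only an illustration of the 6 x params x tokens estimate, not the formula used for the paper's chart: the function names and the example numbers are hypothetical, it ignores the extra cost of attention or its replacement, and it does not handle models that were upgraded from earlier checkpoints.

```python
def compute_params(total_params, vocab_size, d_model, tied_embeddings):
    """Parameters that actually participate in matmuls.

    Untied: the input embedding table is only a lookup, so it is subtracted,
    while the separate lm_head matrix stays in the count.
    Tied: the single shared matrix is still used by the lm_head matmul, so
    nothing is subtracted.
    """
    table = vocab_size * d_model
    return total_params if tied_embeddings else total_params - table


def training_flops(total_params, vocab_size, d_model, tied_embeddings, tokens):
    """Approximate training FLOPs as 6 * (matmul params) * (training tokens)."""
    return 6 * compute_params(total_params, vocab_size, d_model, tied_embeddings) * tokens


# Illustrative numbers only (not taken from any specific model card).
example = dict(total_params=1_500_000_000, vocab_size=65_536, d_model=2_048, tied_embeddings=False)
flops_332b = training_flops(tokens=332_000_000_000, **example)
flops_300b = training_flops(tokens=300_000_000_000, **example)
print(f"332B tokens: {flops_332b:.3e} FLOPs, ratio vs 300B: {flops_332b / flops_300b:.4f}")
```

Under this estimate the token count enters linearly, so training on 332B instead of 300B tokens costs about 10.7% more compute for the same model.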
params x tokens is already a good rule of thumb. Yeah, the devil is in the details.
Llama 3.2 1B/3B were distilled from Llama 3.2 8B right?
other minor differences include the cost of the attention calculation or replacement thereof
i had them in the chart originally but it makes no sense since there is no true FLOPS used to train them
they were distilled from 3.1 8B not 3.2
there is no 3.2 8B
All right.
Let me think about it.
Distillation is kind of like cheating.
Include SmolLM2
It was trained from scratch via pretraining.
No distillation.
They weren't just distilled - the starting point was actually 3.1 8B cut up into smaller parts!
I will think about the issue and look around.
I don't really want to add more models to this plot though
Mostly I'm just continually annoyed by glue messing up all the results
Yeah, don't worry too much right now. You did already pretty well with all those obstacles.
I will get back with some concrete solution suggestions.
I did.
RWKV7-G1 "GooseOne" first release: reasoning @ 0.1b params, pure RNN (attention-free), fully multilingual. Demo & weights on https://t.co/fZ7rmVKsKj 🪿 Larger G1 training in progress.
Who is doing these experiments?
Do you have the results files for those too? I could not find them.
The x-axis of (c) is not consistent with others
I have them, saved in another txt. I will try to retrieve them
Ok, very well. Yes, please.
Uploaded
That is great. We can parse it. Thank you!
Suggestions:
1. #1103039376184852622 message
2. #1103039376184852622 message
3. #1103039376184852622 message
yes im doing #3
#3 is already in the document, but not grouped together all in one place in this way, and we should credit schmidhuber/widrow/hebb etc.
(it's instead presented in the order of the current narrative)
feel free to fix it up tho!
the linked repo: https://github.com/RWKV/RWKV-LM needs a fresh pull from blinks'
@misty igloo any opposition to moving Appendix D and E into 3. Architecture?
These are @brisk bronze 's - we are having some trouble with nathan's ctx-extended 1.5B repo (probably because it was done using a very old version of FLA), but I'm working on fixing it so she can run the part up to 15k
i think Figure 2 (like it a lot !!!) could also be moved to 3. Architecture, where the recurrence formulation is first laid out
wrt Appendix D, it could be okay for theorem 2, but my current view is that Theorem 3 is not realistic under actual conditions for RWKV-7
there is ongoing work to find a proof that would work without extra tokens, but I am somewhat doubtful it will happen
and the current proof of Theorem 3 is still not explicit enough about this fact that it is impossible for actual RWKV models to execute without injecting multiple tokens in between each input token
I also think that generally when someone reads the paper they want the overview not every detail inline
The main paper is already 18 pages long
following the structure of [Intro][method][results][additional] i propose the following reordered abstract:
```
We present RWKV-7 "Goose", a new iteration of linear RNNs featuring a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show this architecture can solve problems outside of TC0 under standard complexity conjectures, exceeding the capabilities of transformers while retaining parallelizability of training.
We trained models up to 3B parameters on a new dataset that we name World-V3; which exhibit improved performance across a wide range of benchmarks and state of the art downstream tasks despite being trained on dramatically fewer tokens than other models in its class, including LLaMA 3.2 and QWen-2.5.
To foster openness, reproduction, and adoption, we release all our models on Huggingface, and our training and inference GitHub; all under Apache 2.0 License.
```
The intention is to release it as a kind of "meta-dataset" (a dataset of datasets), as the majority of subsets are available on HuggingFace. Those that are missing we could add as separate dataset repos and link them all together.
I don't know why the paper currently claims we'll release a 1% slice of it - I'm up for it if Blink wants to, but typically in the past he has not wanted to.
I have a comment in the doc asking about this...
I think you should always lead with your best foot forward, and this has a two sentence lead-in about other models and their problems, instead of immediately describing why the Goose is great or giving a hook to the reader.
There was some recent discussion here #general message from @young sparrow in the general channel about how to write a compelling abstract that might be useful
I agree, the first sentence is very important and should captivate the reader immediately.
I am asking Blink for that.
... on a new dataset that we name World-V3; which exhibit improved performance ...
What is this "which" referring to?
The models. But can be rephrased to improve clarity. Some readers might be puzzled as well when reading it
I don't think that the two opening sentences are too problematic, I'm more worried about the fact that the third sentence is about something most ML people don't care about
The circuit complexity stuff should be an aside, and probably the second to last sentence in the abstract. Right before the comment about releasing stuff
Abstracts are kind of like the first few seconds of a YouTube video. You should start with a hook, otherwise viewers (i.e. readers) will drop off before getting any further.
The current first sentence is very catchy.
I think we agree?
Yes, the complexity stuff should not be at the beginning perhaps.
But it is also important to highlight what is novel of the suggested architecture and how it was achieved.
I agree.
@keen tartan is it possible to put together a list of which sub-datasets within the entire World v3 corpus are no longer available online?
Is this correct? I thought you guys said it had to get updated
Afaict it never did
Yes, this is indeed correct
Yes, it is kinda already done.
All available datasets are linked. I will get the list down to the problematic ones.
this was blinks original message and afterward blink asked Hevok to combine all the items so he could go over it and fix it up
@gusty condor yet the table has not changed since then
that's why I temporarily commented out the table, because it never got updated after that
if I misunderstood, let me know - it seemed like Blink wanted to update it to be accurate in some way
I did combine all and Blink went over and added missed datasets.
Datasets that seem to be no longer available include, for example, https://huggingface.co/datasets/marianna13/random_quora
I am still searching for those.
It might be this one: https://huggingface.co/datasets/marianna13/random_dataset
I will pin down all.
Yeah I know you guys got that part done, which is great! But wasn't there also something about updating the categories summary as a result?
It was my suggestion. As the categories seem rather arbitrary.
For instance something can be both code as well as web.
Like StackOverFlow data.
An ontology might be helpful in such a case.
Well we don't have to put this summarized list of category breakdowns into the paper. Let's only put it in if we have one that we think is helpful and correct.
I think it is very important. Our dataset is rich in novels and fiction, but falls short on math and code compared to the Qwen2.5 series. I think this is an important piece of information.
All right. Updated: https://huggingface.co/datasets/hevok/Goose-World-v3
Almost all are readily available, with only a few exceptions:
1. Wikipedia: Loader not working anymore https://huggingface.co/datasets/olm/wikipedia
2. Guanaco: Was taken down because it included private data https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
3. Books3: Taken down because of copyright issues https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz https://huggingface.co/datasets/defunct-datasets/the_pile_books3
1 can be easily fixed.
For 2 & 3 I am not certain yet how to resolve.
Does anyone have backup copies of Guanaco and/or Books3?
I may have perhaps on some old drive, not sure.