#RWKV-papers
Fantastic! Thx
I think this is the missing books3 dataset: https://huggingface.co/datasets/SaylorTwift/the_pile_books3_minus_gutenberg
Cool, are those three everything that's unavailable, including in all the parts of world v1/2 as well?
Yes, as far as I can assess. I'm going to double-check all 88 sets again just to be sure.
MMLU results seem to be missing for Llama-3.2 1B/3B and Qwen-2.5 1.5B/3B. Could you please check whether you can find them as well?
Added a list with links to all datasets in the README for convenience: https://huggingface.co/datasets/hevok/Goose-World-v3
It is right under the pic.
I quickly hacked together a script to convert the lm_eval markdown text outputs to json outputs: https://huggingface.co/spaces/hevok/evals/blob/main/txt2json.py
I used it to convert the Llama-3.2 1B/3B and Qwen-2.5 1.5B/3B text files into json format for easier processing with software: https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval
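For readers who want to see the idea behind such a conversion, here is a minimal sketch. It is not the linked txt2json.py. It assumes the usual lm_eval markdown table with Tasks, Filter, Metric, Value and Stderr columns; the function name parse_lm_eval_table and the sample table (values rounded from the lambada_openai results further down in this log) are illustrative only.

```python
import json


def parse_lm_eval_table(markdown: str) -> dict:
    """Convert an lm_eval-style markdown results table into a nested dict."""
    results = {}
    header = None  # column name -> index, taken from the header row
    task = None    # last seen task name; continuation rows leave it blank
    for raw in markdown.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            if "Metric" in cells and "Value" in cells:
                header = {name: i for i, name in enumerate(cells) if name}
            continue
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as |---|---:|
        task = cells[header["Tasks"]] or task
        metric = cells[header["Metric"]]
        filt = cells[header["Filter"]]
        entry = results.setdefault(task, {})
        entry[f"{metric},{filt}"] = float(cells[header["Value"]])
        try:
            entry[f"{metric}_stderr,{filt}"] = float(cells[header["Stderr"]])
        except ValueError:
            pass  # stderr can be blank or N/A for some metrics
    return results


# Hypothetical table in the assumed layout.
SAMPLE = """
|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4914|±  |0.0070|
|              |       |none  |     0|perplexity|↓  |12.3626|±  |0.3691|
"""

print(json.dumps(parse_lm_eval_table(SAMPLE), indent=2))
```

The keys follow the "metric,filter" naming that appears in the JSON pastes in this channel, so the output can be compared directly with files written via --output_path.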
@gusty condor By the way, in case you were not aware, you can pass the --output_path <output_folder_name> flag to lm_eval to have it write JSON files directly, with more metadata.
yeah improved in world-3.5
I will rerun them
If σ means the sigmoid, should we use it uniformly? For example, both use σ or both use sigmoid.
Should the ranges of ξ and α also be specified? We've only described what they do, but we haven't described their ranges.
Yes, you can change it
Now, I have finally found the main reason for the performance degradation of converting RWKV7 models to Flash-Linear-Attention format.
It is not related to numerical precision; it is related to the prompt format.
The code used by Blink and HowardHou (VisualRWKV) adds an extra BOS token [0] before the text. However, lm-eval does not add that extra token.
Upon further inspection, I found the failure pattern: The model is unable to perform recall for the very first token it receives, witnessed by these examples:
2728: "Mathews lifted a dark brow. \"Are you sure about that? I mean, wouldn't it be better to wait until Dale is home safe and sound?\"\n\n\"The longer I wait to tell her, the worse it will be for both of us.\"\n\n\"Good luck. You're going to need it,\" said Mathews"
1225: "Seth traced the dirt with the end of a stick. \u201cYou say I\u2019m stubborn\u201d I laughed and he continued, \u201cListen, I don\u2019t even know if it\u2019s true or not. There\u2019s no need for me to worry any of you. That\u2019s why I didn\u2019t say anything.\u201d\n\u201cI still don\u2019t care, Seth"
3999: "Sirona tried to quell her sense of disappointment. \u201cWhat, then? Why did I see what I saw?\u201d\n\u201cThe young woman you observed being sacrificed,\u201d her teacher asked, \u201cDid she appear distraught, or did she go along with the ritual willingly?\u201d\n\u201cI\u2019m certain she was terrified,\u201d said Sirona"
The inability to recall the first token is probably related to WKV state initialization.
After removing the bos token [0] from VisualRWKV code, the performance matches FLA implementation.
Now, I have these questions:
- Should I write that into the paper?
- Does lm-eval have an option of adding a BOS token before the text?
@misty igloo @keen tartan
this is like bos token
add_bos_token=True in the model_args.
uses tokenizer.bos_token_id
Gemma series of models also seem to need a bos_token added.
On a related note, I had to set adapter = EvalHarnessAdapter() adapter.custom_prefix_token_id = None when evaluating RWKV models to get some benchmarks working. Otherwise an undefined variable error was raised somewhere. I'll try to reproduce it. It might already be gone in newer versions.
@gusty condor RWKV6-3B v2.1 multilingual appears to be missing. Could you please check whether you can find them?
overwritten, I test it now
Oh ok, very well.
Setting adapter.custom_prefix_token_id = None or not changes results a tiny bit for perplexity but not accuracy. However, values are identical up to 7 decimal places (e.g. 12.59587346 versus 12.59587348). So it is probably not significant.
Please print the tokens at line 442 of flash-linear-attention/fla/models/rwkv7/modeling_rwkv7.py to see if BOS token is properly added.
token_ids = input_ids.flatten().tolist()
print(token_ids)
looks like custom_prefix_token is only used for loglikelihood_rolling (perplexity) tasks right now (like wikitext)
RWKV7-G1-0.1B drops 1% (49.1% -> 48.1%) without [0] token for lambada_openai
I am getting 0.4898 right now (so ~ 49.0%)
Are you using g1
yes
You converted the model to FLA format?
Use fp32 for more accurate results
I did set the strategy to use fp32
strategy = 'cuda fp32'
@nova frost Here is the error I get if not setting adapter.custom_prefix_token_id = None: ```
/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py in loglikelihood(self, requests, disable_tqdm)
361 # BOS or EOS as context
362 context_enc, continuation_enc = (
--> 363 [self.prefix_token_id],
364 self.tok_encode(continuation),
365 )
/usr/local/lib/python3.10/dist-packages/lm_eval/models/huggingface.py in prefix_token_id(self)
360 def prefix_token_id(self):
361 # it is used as prefix for loglikelihood
--> 362 if self.custom_prefix_token_id is not None:
363 return self.custom_prefix_token_id
364 if self.tokenizer.bos_token_id is not None:```
Full traceback here: https://huggingface.co/spaces/hevok/evals/blob/main/errors/custom_prefix_token_id.txt
looks like its to do with EvalHarnessAdapter
Yes
It should be the BOS token there, right?
In RWKV we have eos_token_id == bos_token_id == 0, I suppose.
yeah, should be fine i think as long as tokenizer.bos_token_id == 0
it's added through tokenizer.encode(string, add_special_tokens=add_bos_token)
I think we only specify tokenizer.eos_token_id = 0 in the tokenizer wrapper right now. Gonna set the bos_token_id there too.
RWKV7 adapter code, without [0]:
{
"lambada_openai": {
"perplexity,none": 13.835651924377974,
"perplexity_stderr,none": 0.4269067454771951,
"acc,none": 0.4812730448282554,
"acc_stderr,none": 0.006961090021795178,
"alias": "lambada_openai"
}
}
RWKV7 adapter code, with [0]:
{
"lambada_openai": {
"perplexity,none": 12.362614971985607,
"perplexity_stderr,none": 0.36913900917528986,
"acc,none": 0.4913642538327188,
"acc_stderr,none": 0.006964938588638406,
"alias": "lambada_openai"
}
}
RWKV7 FLA, without [0]:
{
"lambada_openai": {
"perplexity,none": 13.835802857719031,
"perplexity_stderr,none": 0.4368222446505151,
"acc,none": 0.4814671065398797,
"acc_stderr,none": 0.00696119082972564,
"alias": "lambada_openai"
}
}
RWKV7 FLA, with [0] (code hacking):
{
"lambada_openai": {
"perplexity,none": 12.364938863860012,
"perplexity_stderr,none": 0.3773660539093024,
"acc,none": 0.49117019212109453,
"acc_stderr,none": 0.006964891360529564,
"alias": "lambada_openai"
}
}
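To put numbers on the four runs above, here is a quick sketch. The values are copied from the JSON above; the dictionary layout and the helper name bos_effect are illustrative.

```python
# lambada_openai results for RWKV7-G1-0.1B, copied from the four runs above.
runs = {
    ("adapter", "without [0]"): {"ppl": 13.835651924377974, "acc": 0.4812730448282554},
    ("adapter", "with [0]"): {"ppl": 12.362614971985607, "acc": 0.4913642538327188},
    ("fla", "without [0]"): {"ppl": 13.835802857719031, "acc": 0.4814671065398797},
    ("fla", "with [0]"): {"ppl": 12.364938863860012, "acc": 0.49117019212109453},
}


def bos_effect(impl):
    """Return (accuracy gain in points, perplexity change in percent) from adding [0]."""
    without = runs[(impl, "without [0]")]
    with_bos = runs[(impl, "with [0]")]
    acc_gain_points = (with_bos["acc"] - without["acc"]) * 100
    ppl_change_pct = (with_bos["ppl"] - without["ppl"]) / without["ppl"] * 100
    return acc_gain_points, ppl_change_pct


for impl in ("adapter", "fla"):
    acc_gain, ppl_change = bos_effect(impl)
    print(f"{impl}: acc {acc_gain:+.2f} points, perplexity {ppl_change:+.1f}%")
```

Both implementations move by roughly one accuracy point when [0] is prepended, and they agree closely with each other once the prompt format is the same.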
Oh, that is indeed a significant impact!!!
I added this line at line 440 for https://github.com/fla-org/flash-linear-attention/blob/main/fla/models/rwkv7/modeling_rwkv7.py#L440:
input_ids = torch.cat((torch.tensor([[0]], dtype=input_ids.dtype, device=input_ids.device), input_ids), dim=1)
Great find. We need to fix this issue.
I am not able to reproduce the RWKV7 adapter code results for RWKV7 G0. As baseline I get: { "lambada_openai": { "perplexity,none": 12.596602010153171, "perplexity_stderr,none": 0.3822718659650309, "acc,none": 0.48961769842810016, "acc_stderr,none": 0.006964475739361981, "alias": "lambada_openai" } }
Hold on
I prepare some code.
Use this
RWKV_PAD = [0] # you can try using [0] as pad
@keen tartan FLA model is here: https://huggingface.co/fla-hub/rwkv7-0.1B-g1/
By default it uses RWKV_PAD = pipeline.tokenizer.encode('\n')
What about the STOP_TOKEN?
Default is STOP_TOKEN = RWKV_PAD + pipeline.tokenizer.encode('\n\n')
Fun fact: k_k can even be more than 4 and less than -4
yeah that's why it's useful
$\xi$ is a learned parameter representing the removal key multiplier, which transforms the original key into a version to be removed from the state.
This is the description in the paper, which may leave the reader a little confused: why can the removal key multiplier even be greater than 1 or less than 0?
Kaguya
@gusty condor Getting similar but not identical results now for RWKV7 EvalHarnessAdapter with PAD = [0]: ```{
"lambada_openai": {
"perplexity,none": 12.364936956333898,
"perplexity_stderr,none": 0.3764505210612126,
"acc,none": 0.49117019212109453,
"acc_stderr,none": 0.006964891360529504,
"alias": "lambada_openai"
}
}
An error of 0.02% is not significant at all.
Yeah, I think so too.
The average value of k_k is still between 0.7 and 0.8
@gusty condor I noticed you have been using STOP_TOKEN = [535] which will be decoded as +). Is there a specific reason for this choice?
But wait, that is for the Pile models! I wonder what the proper value would be for the World tokenizer/models. \n\n might terminate long-form generations.
@nova frost Even with tokenizer.bos_token_id = 0 set, the exception is still raised. I'll try setting adapter.custom_prefix_token_id = 0. Hope this makes sense.
yeah should be fine. custom_prefix_token_id isn't even used in any of the eval tasks
All right. Thanks!
Sorry guys, got sick and may not be much help for the next two days. I'll try to put in an updated flops plot soon.
Please rest and take care of your health. It is most important!
That is unused in the code for evaluating lambada_openai, piqa, mmlu et al.
That is a relief, but for other tasks we should have it set to reasonable values.
[261] for '\n\n' in rwkv_vocab_v20230424
@misty igloo In the case of infections, try to consume high amounts of fruits and berries (things that are rich in vitamin C) as well as consider supplementing zinc. Get well soon!
Well? One bottle of vitamin C (100 tablets, 100mg x 100) costs only $0.5 in China.
That is a good deal. It is however recommended to also get vitamins from natural food sources as there is other stuff inside that increases bioavailability. It is extremely difficult to overdose on Vitamin C as it is extremely water-soluble. So consuming it from both food and supplements is fine.
Use smaller plot and relatively larger font size
add w_0 too
Implemented your fix in the fla code...
Replicated your RWKV7 FLA 0.1B-G1 with code hacking results:
"lambada_openai":{
"perplexity,none": 12.364936711373161,
"perplexity_stderr,none": 0.37736600715379043,
"acc,none": 0.49117019212109453,
"acc_stderr,none": 0.006964891360529504
}
}```
BUT RWKV7 FLA 1.5B-World with the same code hack gets much higher results than in the paper:
```{
"lambada_openai":{
"perplexity,none": 4.136933117540389,
"perplexity_stderr,none": 0.0886568308581175,
"acc,none": 0.6931884339219871,
"acc_stderr,none": 0.006425006782127488
}
}```
RWKV7 1.5B-World (with adapter I assume) in the paper:
```{
"lambada_openai":{
"perplexity,none": 3.4,
"acc,none": 0.483
}
}```
According to this issue, pip rwkv7-1.5B-world gets 0.6931 on lambada_openai
https://github.com/fla-org/flash-linear-attention/issues/198
https://arxiv.org/pdf/2503.06121
They used my figure (figure 2) with neither citation nor my consent!
interesting, k_a can take on big values like 13, -20
As models grow larger, the average k_k looks like going up, while k_a seems to trend downward
For RWKV7 World 1.5B via HarnessAdapter I get the following results, depending on the specified PAD token IDs, with the Jupyter notebook I provided:
RWKV_PAD = [11] (tokenizer encoded '\n'): https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-1.5B-v3-20250127-ctx4096/lambada_openai_2025-03-11T10-50-30.361472.json
"lambada_openai": {
"perplexity,none": 4.174870788924788,
"perplexity_stderr,none": 0.09003244838599012,
"acc,none": 0.6951290510382302,
"acc_stderr,none": 0.006413613926848405,
}
RWKV_PAD = [0] (only special token in World tokenizer, often denoted as '<|endoftext|>' or '<EOS>' ): https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-1.5B-v3-20250127-ctx4096/lambada_openai_2025-03-11T11-04-07.000221.json
"lambada_openai": {
"perplexity,none": 4.133062406363742,
"perplexity_stderr,none": 0.08879176331605698,
"acc,none": 0.6933824956336115,
"acc_stderr,none": 0.006423873526429436,
}
move Figure 7 to section 3 (Architecture) because it highlights the limits of attention & mamba
I selected a subset of Lambada (142 problems) that satisfies these requirements:
- The answer is the first word;
- the first word does not appear again in the middle of the text.
The differences are very significant:
v7 0.1B world 2.8:
No padding: ppl=357 acc=9.15
padding with [0]: ppl=16.4 acc=36.6
padding with [0,0]: ppl=10.7 acc=43.7
This is so significant that it is worth writing up in the paper
Examples:
{"text": "Beth smoothed her wiry half-black, half-gray hair from her makeup-free face. In New Mexico, the natural look was common. Standing next to Cindy Fanucci, she felt like a disaster. She hid her ragged nails under the sleeves of her sweatshirt.\n\u201cHi, I\u2019m Cindy. It\u2019s so nice to meet you, Beth"}
{"text": "Cooper groaned, and his body sagged back.\n\n\"You weren't supposed to be first,\" Deuce snarled as he lifted the gun and took aim at Cooper's prone form. \"But if that's the way you want it, old buddy...\"\n\n\"No!\" Gabrielle threw her body forward and wrapped her arms around Cooper"}
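The selection criteria above can be sketched as follows. This is a guess at the filter, not the script that was actually used: it assumes words are split on whitespace and compared exactly, so a possessive such as Cooper's does not count as a repeat of Cooper. The function name and the toy texts are made up for illustration.

```python
def first_word_recall_subset(texts):
    """Keep texts whose answer (last word) is the first word and is not repeated in between."""
    selected = []
    for text in texts:
        words = text.split()  # assumption: plain whitespace tokenization
        if len(words) < 3:
            continue
        first, middle, last = words[0], words[1:-1], words[-1]
        if last == first and first not in middle:
            selected.append(text)
    return selected


# Hypothetical toy inputs, not taken from Lambada.
toy_texts = [
    "Beth smiled. It is so nice to meet you, Beth",  # kept
    "Tom saw Tom and then waved at Tom",             # dropped: first word repeats in the middle
    "Ann met Bob at the station",                    # dropped: answer is not the first word
]

print(first_word_recall_subset(toy_texts))
```

On such a subset the model can only answer correctly by recalling the very first token it saw, which is what makes the padding comparison above so sensitive.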
I've been discussing the expressivity of RWKV-7 behind the scenes with @misty igloo , @dawn pewter and William Merrill, and
we finally have a proof that RWKV-7 can recognize any regular language!
This is significantly stronger than our prior claims, and doesn't rely on assumptions such as c = 2, "multi-step computations", or a special BOS token. This result clearly motivates our use of a data-dependent and elementwise ICLR "a". Prior works could only simulate permutation DFAs, while we can simulate general DFAs, because of this "a". RWKV-7 might be the first model to use diagonal + low-rank updates, and still be able to recognize regular languages.
The proof is a bit involved (~4 pages, added as Appendix E), but I tried to write it in a way where the core ideas appear early, and the complicated details appear later. A core insight is that multiple layers are needed. Numerical experiments indicate that 2 layers should be enough, but my construction uses 4 layers to simplify the proof.
There were some interesting insights from the proof of simulating DFAs with RWKV-7:
- Because "a" is applied on the right instead of the left, we actually simulate the reversed DFA (the DFA which recognizes the reversed language). EDIT: Sorry, this was actually incorrect, it is "a" on the left which simulates backwards. Thanks Merrill for finding this mistake.
- For DFA simulation, we often want to extract a single row of the wkv state. But because the receptance "r" is applied on the right instead of the left of the state, the readout requires simulating many identical wkv heads, where each head reads out a single element of the wkv state.
- For DFA simulation, we do not need element-wise control or data-dependence for "w".
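For context on the permutation vs. general DFA distinction: a DFA step can be written as multiplying a one-hot state vector by a 0/1 transition matrix, and that matrix is a permutation only if no two states are merged by the symbol. The sketch below only illustrates that target computation with a made-up three-state DFA (strings over {a, b} that contain "ab"). It is not the RWKV-7 construction from Appendix E, and all names are hypothetical.

```python
def transition_matrix(delta, symbol, n_states):
    """Row s has a single 1 in column delta[(s, symbol)]."""
    matrix = [[0] * n_states for _ in range(n_states)]
    for state in range(n_states):
        matrix[state][delta[(state, symbol)]] = 1
    return matrix


def step(state_vec, matrix):
    """One DFA step: one-hot row vector times transition matrix."""
    n = len(matrix)
    return [sum(state_vec[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def is_permutation(matrix):
    """True if every column also contains exactly one 1."""
    return all(sum(row[j] for row in matrix) == 1 for j in range(len(matrix)))


def accepts(word, delta, n_states, start, accepting):
    """Run the DFA as a sequence of matrix products on a one-hot state."""
    symbols = {symbol for _, symbol in delta}
    matrices = {sym: transition_matrix(delta, sym, n_states) for sym in symbols}
    state_vec = [1 if i == start else 0 for i in range(n_states)]
    for ch in word:
        state_vec = step(state_vec, matrices[ch])
    return any(state_vec[s] for s in accepting)


# Hypothetical DFA: state 0 = nothing seen, 1 = last char was "a", 2 = "ab" seen.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}

print(accepts("bbab", DELTA, 3, start=0, accepting={2}))
print(is_permutation(transition_matrix(DELTA, "a", 3)))
```

Here the matrix for "a" sends both state 0 and state 1 to state 1, so it is not invertible. Models restricted to permutation-like state updates cannot represent such a merge, which is the gap the "a" term is said to close.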
i think its because of tokenizer, and the effect is only large for tiny models
great work! do you have suggestions for increasing rwkv7 expressivity
amazing! I'll read it later!
yeah, I thought lemma 3 would be a "well known" result, but I couldn't find a reference, so I cooked up a construction myself. If you can find a simpler proof without requiring the reader to know graph theory terms, that would be great.
Recognizing regular languages is already very strong, things beyond that are usually clearly impossible in constant time per token. For example, some NC1 problems require linearly growing state size in the sequence length. However, the current construction uses huge state sizes and lookup tables of size vocabulary^(DFA states), which is probably limiting which regular languages can be simulated in practice.
Points 1. and 2. above indicate that we might want to experiment with readout on the "value" dimension of the state, even though this breaks the intuition from linear transformers. And maybe also apply "a" on the left side.
My way to avoid c = 2 is based on the group normalization immediately after the wkv heads, there might exist other/better normalizations of the state which could also improve performance (like how rwkv-6c normalization was great).
yeah I still maintain that a more balanced construction of the overall formula with implicit normalization has the potential to improve performance
but I'm not sure whether that will improve or harm these regular language abilities
gotta wait a day or so until my brain works fully again to think about it 🤣
@obsidian quest In summary, regular languages basically include everything we can reasonably do, and we can already technically solve regular languages (so we can do state tracking and basically what classical RNNs can do). However, the way we currently simulate them can be very inefficient. So most further improvement in expressivity probably comes from decreasing the required number of heads / head size / precision / etc.
A practical limitation on the expressivity of RWKV-7 wkv heads is that it applies all vectors to the "key" dimension of the state. This makes the slots in the "value" dimension independent. Some mixing also in the value dimension could potentially make the wkv heads more powerful (while making parallelization a bit more tricky).
Am I correct in understanding that, rather than aiming to simulate each individual step of an arbitrary DFA, we are now adopting a block-wise approach where we simulate the corresponding n-step emulation results of the DFA at every n-step interval?
OK I am testing Qwen-0.5B and SmolLM, the results make a difference but not statistically significant, p=0.09
I haven't read the final proof yet, but the idea from a couple of days ago was that all size-n blocks up until the final 2n-1 tokens would be deferred by one block size (think 'pipelining') and evaluated in a non-block manner, as a deferred set of per-token elementary matrices
Then, the final block is done block-wise so that it does not require deferral (and therefore no extra tokens are required)
[update: looks like icecuber simplified it to 2n tokens instead of 2n-1, but seems like otherwise same idea]
@gusty condor I saw you pushed the missing RWKV6 World 3B multilingual results. Thx! We appear to still miss RWKV7 World 1.5B/2.9B results files for lambada_openai, hellaswag, piqa, arc_easy, arc_challenge, winogrande, and sciq. Please check.
For future runs I suggest to also output a bit more metadata for better reproducibility including lm_eval version and special token IDs:
from importlib.metadata import version
# ...
output_dict = dict(
    model=MODEL_NAME,
    tasks=eval_tasks,
    num_fewshot=num_fewshot,
    lm_eval_version=version('lm_eval'),
    bos_token_id=adapter.tokenizer.bos_token_id,
    eos_token_id=adapter.tokenizer.eos_token_id,
    custom_prefix_token_id=adapter.custom_prefix_token_id,
    pad_token_ids=RWKV_PAD,
    stop_token_ids=STOP_TOKEN,
    results=results['results'],
)
# ...
Note: I added bos_token_id to the TokenizerWrapper and am currently assigning adapter.custom_prefix_token_id = RWKV_PAD[0].
# ...
class TokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.bos_token_id = 0
        self.eos_token_id = 0
# ...
adapter = EvalHarnessAdapter()
adapter.custom_prefix_token_id = RWKV_PAD[0]
# ...
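As an illustration of the behavior discussed earlier (tokenizer.encode(string, add_special_tokens=add_bos_token) prepending the BOS id), here is a minimal self-contained sketch. It is not the actual adapter or TokenizerWrapper code: ToyTokenizer and BosTokenizerWrapper are hypothetical stand-ins, and the only facts taken from the discussion are that bos_token_id and eos_token_id are both 0 for the World tokenizer.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one token id per character."""

    def encode(self, text):
        return [ord(ch) for ch in text]


class BosTokenizerWrapper:
    """Wraps a tokenizer and optionally prepends a BOS id on encode."""

    def __init__(self, tokenizer, bos_token_id=0, eos_token_id=0):
        self.tokenizer = tokenizer
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id

    def encode(self, text, add_special_tokens=False):
        ids = list(self.tokenizer.encode(text))
        if add_special_tokens:
            # Prepend exactly one BOS id, mirroring the [0] prefix discussed above.
            ids = [self.bos_token_id] + ids
        return ids


wrapper = BosTokenizerWrapper(ToyTokenizer())
print(wrapper.encode("hi"))
print(wrapper.encode("hi", add_special_tokens=True))
```

Printing the ids in both modes, as was done with the real tokenizer in this thread, is the quickest way to confirm whether the prefix is really being added.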
@brisk bronze did you guys figure out why add_bos_token=True is not working when calling lm eval harness from cmdline?
we shouldn't need an adapter or any of this stuff
should be able to just run lm eval from cmdline and have it work
No idea. I am not a maintainer of lm-eval
Did someone actually check that this doesn't work
Well @nova frost is, so @brisk bronze could you try to work with him to figure out what's going on here
yes, this doesn't work: adding add_bos_token=True in model_args doesn't change performance (the result is the same)
both in 0.4.3 and 0.4.7
happy to help out, but my understanding was that y'all are using a custom model adapter?
we were using lm-eval 0.4.3
this is the output when I do add_bos_token=True, and the 0's are not prepended.
[6624, 31220, 32227, 28471, 98, 22748, 332, 45675, 40240, 22068, 45, 32227, 4569, 22748, 46896, 39867, 21265, 56287, 32487, 21811, 47, 20147, 1853, 31220, 32232, 21795, 332, 21365, 4706, 332, 51929, 45, 21400, 22464, 53137, 32227, 32234, 32251, 22464, 40219, 21751, 37678, 45301, 32487, 21820, 45, 32227, 22464, 38128, 56798, 39944, 56939, 47, 265, 24241, 46372, 45, 21265, 22464, 31220, 32227, 22464, 38897, 28471, 98, 45, 21400, 32230, 29042, 1950, 59901, 101, 47, 20885, 22748, 4788, 37704, 32227, 22464, 63613, 31929, 45, 31578, 32232, 32462, 22249, 4706, 30721, 7714, 4706, 28471, 98, 45, 22903, 6833, 21556, 28310, 31391, 55521, 4811, 4660, 50931, 45, 29042, 35, 23978, 22226, 47342, 308, 31223, 4706, 32227, 40219, 4435, 46439, 4811, 38127, 4833, 45865, 39052, 4811, 51745, 332, 31670, 122, 39712, 57103, 56198, 21265, 22269, 46787, 21357, 37921, 4811, 39740, 45619, 4596, 39931, 45793, 544, 19878, 26846, 59350, 21569, 2030, 47, 261, 35, 24326, 21413, 308, 22441, 799, 36623, 57486, 21823, 30180, 21265, 53348, 21823, 38660, 47, 269, 24349, 22572, 51514, 4424, 32499, 575, 261, 40327, 19878, 26846, 38128, 52732, 45, 20276, 2007, 460, 40139, 31901, 30917, 46301, 22590, 38717, 47, 261, 35, 1297, 59888, 799, 22464, 4855, 25779, 47, 269, 24326, 4491, 22799, 31391, 22799, 21556, 461, 31059, 21273, 0, 0, 0, 24043, 8828, 21795, 30259, 22590, 31254, 46795, 4811, 32451, 39944, 45447, 45,
the output looks the same when I set add_bos_token=False
[6624, 31220, 32227, 28471, 98, 22748, 332, 45675, 40240, 22068, 45, 32227, 4569, 22748, 46896, 39867, 21265, 56287, 32487, 21811, 47, 20147, 1853, 31220, 32232, 21795, 332, 21365, 4706, 332, 51929, 45, 21400, 22464, 53137, 32227, 32234, 32251, 22464, 40219, 21751, 37678, 45301, 32487, 21820, 45, 32227, 22464, 38128, 56798, 39944, 56939, 47, 265, 24241, 46372, 45, 21265, 22464, 31220, 32227, 22464, 38897, 28471, 98, 45, 21400, 32230, 29042, 1950, 59901, 101, 47, 20885, 22748, 4788, 37704, 32227, 22464, 63613, 31929, 45, 31578, 32232, 32462, 22249, 4706, 30721, 7714, 4706, 28471, 98, 45, 22903, 6833, 21556, 28310, 31391, 55521, 4811, 4660, 50931, 45, 29042, 35, 23978, 22226, 47342, 308, 31223, 4706, 32227, 40219, 4435, 46439, 4811, 38127, 4833, 45865, 39052, 4811, 51745, 332, 31670, 122, 39712, 57103, 56198, 21265, 22269, 46787, 21357, 37921, 4811, 39740, 45619, 4596, 39931, 45793, 544, 19878, 26846, 59350, 21569, 2030, 47, 261, 35, 24326, 21413, 308, 22441, 799, 36623, 57486, 21823, 30180, 21265, 53348, 21823, 38660, 47, 269, 24349, 22572, 51514, 4424, 32499, 575, 261, 40327, 19878, 26846, 38128, 52732, 45, 20276, 2007, 460, 40139, 31901, 30917, 46301, 22590, 38717, 47, 261, 35, 1297,
Janna's tests were with no adapter, just cmdline
But she'll have to describe the exact details of what was run and how- I didn't run it myself
afaik rwkv7 were run with the adapter code because running fla converted rwkv in lm-eval had degraded results
this is the command I used: lm_eval --model hf --model_args pretrained=fla-hub/rwkv7-1.5B-world,trust_remote_code=True,add_bos_token=True,dtype=float32 --tasks lambada_openai --batch_size 8 --output_path /workspace/lm-evaluation-harness/results
But aren't we discussing your tests of rwkv7 FLA via cmdline lmeval?
oh, then should have notified you of this: https://github.com/EleutherAI/lm-evaluation-harness/pull/2781
basically some HF tokenizers need to be initialized with add_bos_token=True
but the performance for fla-rwkv7 was the same with add_bos_token=True vs. False. I can test again
rwkv7 world tokenizer
ok I'll run the tests again
yeah. this wouldn't add the bos before:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("fla-hub/rwkv7-1.5B-world")
tokenizer.encode("hello", add_special_tokens=True)
# [34550]
initializing with tokenizer = AutoTokenizer.from_pretrained("fla-hub/rwkv7-1.5B-world", add_bos_token=True) works properly
looks like it's pre-pending the 0 now!
[0, 6624, 31220, 32227, 28471, 98, 22748, 332,
Well that finally resolves an issue that's only taken a year to figure out 🤣
eval results match up too
"lambada_openai": {
"alias": "lambada_openai",
"perplexity,none": 4.136982815151818,
"perplexity_stderr,none": 0.08865873813063398,
"acc,none": 0.6931884339219871,
"acc_stderr,none": 0.006425006782127488
}
}```
Has anyone managed to successfully run the 128k fine-tuning version? We're encountering conflicts when using the world version environment.
What kind of conflicts? Sorry, not sure what you mean by the world version environment... but @brisk bronze used the redone 1.5b version recently. Do I need to update the 2.9b for some reason like a change to the FLA repo?
Does this one https://huggingface.co/SmerkyG/RWKV7-1.5B-World3-128k-250309 work for you?
python: /project/Lib/Tools/LinearLayout.cpp:562: mlir::triton::LinearLayout mlir::triton::LinearLayout::reshapeOuts(llvm::ArrayRef<std::pair<mlir::StringAttr, int>>) const: Assertion `getTotalOutDimSize() == std::accumulate( newOutDims.begin(), newOutDims.end(), 1, [&](int32_t acc, auto &outDim) { return acc * outDim.second; })' failed.
I'll try this, thank you.
Please send an issue to FLA
use triton nightly
2. for Figure [FLOPs vs. Average Benchmark Accuracy], add [active params vs avg acc] too```
This is the problem: Installing some new package may override the triton-nightly installation with triton 3.2.0. So it is better to have the code work properly for triton 3.2.0 and later versions.
install from scratch or wait for the next version
this is triton's bug. I won't fix it because large number of warp is crucial for performance
I see. You have already uninstalled Triton so I shall not bother you for that.
Thank you both, I'll replace the package.
Ok, I'll bring up this issue.
Do we have the context extended checkpoints also as non-converted HF models (i.e. normal rwkv models) available somewhere?
How does RWKV-7 behave past its training context length? Does state collapse still happen?
It can extrapolate. RWKV-7 trained with 4k context extrapolates to 32k+ #1083107245971226685 message
In the paper we have currently Long Context Experiments with PG19 dataset as well as single needle-in-the-haystack.
There is perhaps some kind of overfitting on short context phenomenon for the world models but not pile models that was reported by @iron parrot #1103039376184852622 message
From @paper dove :
I have a small suggestion. I saw RWKV-6C mentioned in the discussion, but people who are not familiar with the rwkv version may not understand what it means.
Upgraded Finch (RWKV-6c) should be explained somewhere
Perhaps we can refer to the GoldFinch paper for this.
It is mentioned in the Additional Architecture Discussion Architecture Details Section
RWKV-6c is first mentioned in the Method section, 4.1.1 Weight Preparation. I named it Upgraded Finch there and referred to the Additional Architecture Discussion, where it is introduced under the same name together with its version number for now. Hope this makes it a bit clearer.
I don't think it's in the goldfinch paper - it was adapted from the ideas in that paper but ended up being a later designation by Blink for a RWKV6 variant that never was trained for anything much
It's not called Upgraded Finch anywhere in the world, so I don't think we should use that name here
Ah, ok.
iono maybe I did rename it 6c in GoldFinch? I don't think so tho - checking now
yeah in GoldFinch we had a version that included other changes called Finch-C2
its not Finch-C2 sorry
its Finch-C / v6c
there are differences, which is why I named the goldfinch version Finch-C2
Got it.
blink's internal designation is x060c
yeah
but yeah this isn't really described in any paper other than GoldFinch, which is where the idea for it came from
I'll take a look and add more descriptive content around it
Merged all lm_eval results files from benchmarks table 3 and table 4 per model. Parsed merged files, created a pandas dataframe with combined average accuracy across English and multilingual tasks, and plotted it with matplotlib.
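A minimal sketch of the merge-and-average step described above, using only the standard library rather than pandas. It assumes each results file has a top-level "results" mapping from task name to metrics, with accuracy stored under "acc,none" as in the JSON pastes in this channel. The function names and the placeholder numbers are illustrative, not real benchmark results.

```python
from statistics import mean


def merge_results(payloads):
    """Merge the 'results' sections of several lm_eval output files for one model."""
    merged = {}
    for payload in payloads:
        merged.update(payload.get("results", {}))
    return merged


def average_accuracy(merged, metric="acc,none"):
    """Unweighted mean of one metric over all tasks that report it."""
    scores = [metrics[metric] for metrics in merged.values() if metric in metrics]
    return mean(scores)


# Placeholder payloads standing in for parsed JSON files (made-up values).
english = {"results": {"task_a": {"acc,none": 0.5}}}
multilingual = {"results": {"task_b": {"acc,none": 0.7}, "task_c": {"perplexity,none": 4.0}}}

merged = merge_results([english, multilingual])
print(sorted(merged), average_accuracy(merged))
```

Tasks that only report perplexity are skipped by the average, so a different metric key would be needed if normalized accuracy is what the tables use.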
i dont get it, this is just a merged combo of the existing two flops plots?
also why would you multiply tokens by params instead of calculating actual compute
Yeah, kinda. Just tinkering around to make concrete suggestions. It was just a simple quick approximation.
Perhaps scaling the dots size to parameter size might make it look more informative.
Here I multiplied parameters in billions with 100 and set as marker size.
I saw similar plots where the dots size represented model size in papers and I liked it.
I am also suggesting adding a bit of transparency. Helps with overplotting issue.
Used alpha=0.5 above.
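A sketch of the plotting choices mentioned above (marker size equal to parameters in billions times 100, and alpha=0.5). This is not the script that produced the figure: the function names, axis labels, log-scaled x axis and output path are assumptions, and matplotlib is imported inside the plotting function so the size helper can be used without it.

```python
def marker_sizes(params_in_billions, scale=100.0):
    """Marker area per model: parameter count in billions times a constant."""
    return [p * scale for p in params_in_billions]


def plot_accuracy(flops, avg_acc, params_in_billions, labels, path="acc_vs_flops.png"):
    """Scatter average accuracy against FLOPs with size-coded, semi-transparent dots."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(5, 4))
    ax.scatter(flops, avg_acc, s=marker_sizes(params_in_billions), alpha=0.5)
    for x, y, label in zip(flops, avg_acc, labels):
        ax.annotate(label, (x, y))
    ax.set_xscale("log")
    ax.set_xlabel("FLOPs")
    ax.set_ylabel("Average benchmark accuracy")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    return path


print(marker_sizes([0.1, 1.5, 2.9]))
```

Since the s argument of scatter is an area in points squared, a linear scale like this makes the largest models visually dominant; a square-root scale would be a reasonable alternative.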
2. for Figure [FLOPs vs. Average Benchmark Accuracy], add [active params vs avg acc] too```
could you describe params vs acc? I can show you what it would look like but I don't think it's very informative
x = log(active params), y = avg acc
define active? Like do you want to double tied embeddings?
non-embedding params (so related to inference flops)
actually can use [inference flops] vs [avg acc]
@brisk bronze did you test RWKV-7 1.5B and 2.9B on English evals?
Okay so no embed and no lm head either
lm head are active parameters
nvm I am still sick and clearly not thinking well
Someone else better do this chart
@brisk bronze maybe you can take care of it tomorrow?
Should be ez to copy our existing google plot to make it
I can do this chart
I've run them but with 0.4.7 not 0.4.3 so they should probably be re-run
https://github.com/jannalulu/lm-evaluation-harness/tree/main/results
shall we run everything with 0.4.7?
I don't really see the point? Pawsx and the bos_token are both getting fixed in 0.4.8
Shall we run 0.4.8?
@misty igloo So we do have some reason to run 0.4.8, since Paws-X and bos_token are fixed, and enhanced reproducibility as it's the newest version. And we can run glue with averaging too as requested by Bo,
yeah but didn't we have trouble reproducing qwen results or something
typo, the minimum of w_t is exp(-exp(-0.5)). I fixed it
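A quick numerical check of that bound. The sketch assumes the decay is parametrized as w_t = exp(-e^{-0.5} * sigmoid(d_t)), so that the minimum is approached as the sigmoid goes to 1; that form and the function name decay are assumptions on my part, so please check them against the paper's definition.

```python
import math

SCALE = math.exp(-0.5)  # the e^{-0.5} factor in the assumed parametrization


def decay(d):
    """Assumed form: w = exp(-e^{-0.5} * sigmoid(d))."""
    sigmoid = 1.0 / (1.0 + math.exp(-d))
    return math.exp(-SCALE * sigmoid)


w_min = math.exp(-math.exp(-0.5))  # limit as sigmoid(d) -> 1
print(f"w_min = {w_min:.4f}")
print(f"decay(50) = {decay(50.0):.4f}, decay(-50) = {decay(-50.0):.4f}")
```

Under this assumption each channel's decay stays between roughly 0.545 and 1.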
@brisk bronze @keen tartan
Now I found a big problem: <bos> is added for RWKV-7 but not for other models like Qwen and Llama, so it's not a fair comparison.
But actually, RWKV-7 adding a [0] can enhance the performance of lambada by 0.6% but harms performance of arc by 2-3%.
I think a fair comparison should be conducted without [0] for all models. This also matches RWKV-FLA performance.
w/o [0]:
"arc_challenge": {
"alias": "arc_challenge",
"acc,none": 0.43430034129692835,
"acc_stderr,none": 0.014484703048857371,
"acc_norm,none": 0.4658703071672355,
"acc_norm_stderr,none": 0.014577311315231023
},
"arc_easy": {
"alias": "arc_easy",
"acc,none": 0.7706228956228957,
"acc_stderr,none": 0.008627087045485938,
"acc_norm,none": 0.7584175084175084,
"acc_norm_stderr,none": 0.008783247004042158
}
w/ [0]:
"arc_easy": {
"acc,none": 0.7584175084175084,
"acc_stderr,none": 0.008783247004042158,
"acc_norm,none": 0.7079124579124579,
"acc_norm_stderr,none": 0.009330705616569084,
"alias": "arc_easy"
},
"arc_challenge": {
"acc,none": 0.40784982935153585,
"acc_stderr,none": 0.01436109728844968,
"acc_norm,none": 0.42406143344709896,
"acc_norm_stderr,none": 0.014441889627464344,
"alias": "arc_challenge"
},
Lol this is exactly what I concluded last year so I used all the lm eval cmdline results for that paper, not giving rwkv a BOS
OK so don't give RWKV a bos then
And used bfloat16 not float32
lm_eval automatically adds a BOS token for the Gemma family of models.
Interesting
There is a comment in source. let me try to reference it.
Line 222
"...part of the Gemma family--a BOS token will be used as Gemma underperforms without it." is what gets logged right under it.
Its probably slightly better this way with a newer version, but I don't think it matters too much if it's annoying or slow to do. As for Glue, the new version doesn't give the non-weighted average as an output; you have to compute it manually anyway, which we can do just as easily using the existing results
I'm agnostic about BOS token usage... I think it's fine but as ZhangRC points out and I found last year, it helps some evals and hurts others so it kind of doesn't make a difference overall for RWKV
Adding a [0] gives -0.4% overall
I don't think that it's important to have the same thing for every model.
Think of it this way: when you have two chat models with different chat prompts, is it more fair to use the chat prompt model A expects for both models because it's the same input, or is it more fair to give each model the chat prompt it expects.
Not really. Should be done next morning
Abstract deadline within one week!
Full paper deadline within 2 weeks!
@everyone
- The COLM abstract submission deadline is on March 20
- We need authors to DM me their openreview ID or email address used for their openreview account.
- If you don't have an openreview account, you need to open one and get it approved ASAP
- If you are not currently listed as an author, and think you should be, now is the time to let us know. Authorship will be extended only to those who have contributed significantly to the paper by supplying experimental data that are included therein and/or doing significant writing. (but not for just having fixed some spelling or reworded a few things)
Table 1 feels cluttered with scalar annotations. Could we move them elsewhere or drop them without misleading the reader or losing nuance?
they're important so we can't drop them, but maybe there's a better way to indicate this?
I would say create a separate column for which variables are scalar, but we're almost at width as it is
The S and I variables are the only matrices right? Perhaps it would be cleaner to use bold to indicate "not a scalar" and note in the caption that S and I are matrices
we already mess with the notation for consistency with sec 4 so the latter would probably not be great. I think bold "not a scalar" works best considering boldface for vectors is convention in some fields
related consistency question: is it Delta Net or DeltaNet
@obsidian quest I added multilang and eng acc vs inference active params charts to the paper... the english one is a bit messy
@misty igloo I think it would be valuable to show the paper to someone who hasn't worked with RWKV much but follows this space and see how accessible the methodological explanation is to them. One of the things I consistently hear from people who work with Mamba and not RWKV is that finding the exposition inaccessible is a major reason why they use Mamba
I agree - @fresh mulch gave some initial feedback on that but more would be helpful
Maybe just drop a draft in #research and ask for feedback on this point as a starting point
@granite pike could provide that feedback if they're up for it
The quality of the diagrams has substantially improved, which y'all should be proud of
+1. Being (formerly, still kind of) that person, not having a clear picture of how RWKV works made me lean towards Mamba.
The question of why RWKV isn't as popular as Mamba came up some time ago. I still believe most of it is accessibility - particularly things like blogposts, etc. that spread the word to the "lay user", i.e. someone who won't read the paper but would use RWKV in their applications
Actually, I think it would be helpful to prepare a blogpost or X thread or similar to release concurrently with the paper, like "here's the technical report and here's a simpler intuitive explanation". E.g. Songlin did this for GDN
I've left some comments on the first half of the paper and will be back to do more later
Perhaps try to combine English and multilingual benchmarks in one chart as suggested previously to make it more robust.
I put improved versions in the paper
I check.
i tried compressing and revising sections 1-3, will be back later for more. things that still stand out to me:
- scalars in table 1, as before
- table 1's caption is really unwieldily long
- we use a lot of terminology in ways that would be familiar to someone in the space that might not be immediately obvious to an outside reader, such as using DeltaNet-specific terms in the introduction and the general idea of key-value retrieval in Section 2 (though maybe the latter is more obvious to people)
- section 2 flows well but still feels delineated at the "Concurrent work" paragraph - subsection break here?
- section 3 "Architecture" feels like it should be a subsection of section 4, or I don't see why it deserves its own section. It describes architectural changes over other methods, so I feel like it belongs at the beginning of the part where we describe the architecture in technical detail
- section 4.1.1, after the big table of parameter definitions, could use some better structuring
- I really love the new figures, they're super simple!
Should we already aim to compress the main part of the paper into 9 pages or at least plan ahead?
I am suggesting a project web page for the paper like it is nowadays very popular. Considering even some interactive visual elements.
Could be a simple Github page dedicated to the RWKV-7 Goose Model.
As well as hands-on tutorials on how to get started.
The RWKV Language Model
Yeah, we have the official website, true!
That is a good place for tutorials.
Probably just fleshing out the official website is the best way.
Who is managing it right now?
We also have the wiki: https://wiki.rwkv.com
It is just a bit outdated but a good starting point
I originally did this but there seems to be no appetite for it for the arxiv version so let's wait.
Updating and fleshing out website and wiki is a good idea. Also making it easy for people to get started, ie fewest steps to a (preferably customizable) working model with a walkthrough. Does the fla-hub kernel work with HF Transformers?
fla-hub has its own HF implementations that use its kernels yes
they will probably become the official RWKV HF implementations, at least temporarily
Good news: Will Merrill had some time to go through and do an initial pass merging and polishing the proofs in Appendix D. I think he still wants to do another pass, but it's something that could wait for v2.
@gusty condor The following are the results from experiments to check the impact of RWKV_PAD tokens.
RWKV7-0.1B 11 is with \n as PAD tokens ([11]) which is the default one recommended.
RWKV7-0.1B 0 is with the special <|endoftext|> token as PAD tokens ([0]).
RWKV7-0.1B None is with no PAD tokens at all ([]).
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| Model           | Tokens | lmb.o | hella  | piqa | arcE | arcC | glue | WG   | sciq | mmlu | avg  |
+=================+========+=======+========+======+======+======+======+======+======+======+======+
| (Name)          | (T)    | acc↑  | acc_n↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B 11   | 1.6    | 48.1  | 42.1   | 67.3 | 59.3 | 25.5 | 48.1 | 52.7 | 86.3 | 25.4 | 50.5 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B 0    | 1.6    | 49.0  | 42.2   | 67.1 | 56.6 | 23.6 | 46.3 | 52.6 | 86.2 | 25.8 | 49.9 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B None | 1.6    | 47.4  | 41.9   | 67.5 | 59.1 | 25.2 | 46.3 | 52.2 | 86.1 | 25.5 | 50.1 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| Model           | lmb.m_p | lmb.m | pwasx | xcopa | xnli | xsClz | xwin | avg  |
+=================+=========+=======+=======+=======+======+=======+======+======+
| (Name)          | ppl↓    | acc↑  | acc↑  | acc↑  | acc↑ | acc↑  | acc↑ | acc↑ |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B 11   | 166     | 31.6  | 46.1  | 53.3  | 37.6 | 52.6  | 64.1 | 47.5 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B 0    | 167     | 31.6  | 46.5  | 53.0  | 37.4 | 52.5  | 64.0 | 47.5 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B None | 177     | 31.2  | 46.6  | 53.0  | 37.4 | 52.4  | 63.0 | 47.3 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
Using lm_eval version 0.4.8. GLUE subtasks were simply averaged without weighting.
It seems that the default \n as PAD tokens are preferred across benchmarks.
Hypothesis: RWKV uses \n kind of like an element of its chat template as this is frequently occurring to separate utterances in its training data.
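The three settings compared above differ only in which token IDs are placed in front of the context before scoring. A rough sketch of that step, assuming the harness simply concatenates the pad list with the encoded context; `prepend_pad` and the example context IDs are hypothetical, not the actual harness code.

```python
from typing import List, Optional

# Pad settings from the experiment: [11] is "\n", [0] is <|endoftext|>, [] is no padding.
PAD_SETTINGS = {"11": [11], "0": [0], "None": []}

def prepend_pad(context_ids: List[int], pad: Optional[List[int]] = None) -> List[int]:
    """Return a new list with the pad tokens placed in front of the context."""
    return list(pad or []) + list(context_ids)

context = [4821, 332, 59]  # made-up token IDs standing in for an encoded prompt
for name, pad in PAD_SETTINGS.items():
    print(name, prepend_pad(context, pad))
```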
the pad effect should be less for larger rwkv7 (1.5b 2.9b). you can check that
moreover please check mamba too
please mention <|endoftext|> is always token 0 for all rwkv models
I found RWKV's exposition more accessible than mamba. By solely reading the mamba paper I can't formulate mamba in my brain.
i am building 100K and 1M random items from world-v3 dataset for reference
@gusty condor
I check for other models as well. Gonna also try \n\n for PAD.
pls check mamba 1 & 2 too
is it possible to release the tools you use to put it together? that way other people can easily replicate the entire dataset
There are a few parts of the paper that look like intimidating walls of text during a cursory sweep of the paper. Would it be worth breaking these up by \subsection or \paragraph, or is this not an issue?
lol aren't you the one who removed the 'concurrent work' subheading in Background
it wasn't there originally! I put it in yesterday, then commented it out myself because I didn't know whether we wanted it or not
gotcha, thats funny
there have been so many changes I can't remember which is which
@obsidian quest what's the flops/mfu/whatever we get during training?
Appendix J, state transitions. What is meant by comparing "the order of O(1)" to "the order of thousands"?
also will we mention QRWKV at all in this paper @misty igloo as further proof it works at scale
Heh it actually doesn't work at scale for me there
Gets unstable
Also I'm hoping to submit qrwkv paper to COLM separately
I assume it means average state length per element, maybe in an L2 kind of sense, maybe per column? Seems like it could be worded better
@dawn pewter or @gusty condor could you clarify
what are "such ideas" that can be traced back to fast weights and hebbian learning? @misty igloo
i want to modify this section a bit to motivate (our use of) the delta rule via deltanet, if that's fine with you
That was a line @obsidian quest asked to put in earlier in this chat, but it doesn't have to be stated in exactly that way or in that location.
What kind of motivation are you thinking of adding?
Fast weights is basically the idea of test time training the state
reformulating this bit to focus more on whatever deltanet does with the delta rule or something. it's just that the delta rule is not really emphasized anywhere in our background discussion, despite being foundational to the whole architecture
we talk about it a lot throughout but to me it reads like we assume the reader is familiar with the delta rule's role in the development of linear attention
You seem to have it backwards maybe? Delta rule has nothing to do with the development of linear attention
Linear Attention is a form of fast weights tho
oops, yes, that is backwards
Obviously this means my explanation in the text is maybe lacking
The way I was attempting to construct the narrative was:
Transformers, then Linear Attention and its issues, then delta rule fixes those issues, then we innovate on that
I guess my point is this: I like the flow of the discussion of the problem of numerically increasing state, but jumping into delta rule (DeltaNet was the first to...) after that is a shock, and it is not immediately clear to me what the transition is. Is it that the delta rule enables this fix, or it is this fix, or...?
Yes delta rule is one variety of fix
Maybe I didn't make that clear
I did say exactly how it solves that issue in the second sentence tho...
Maybe I should basically swap the order of sentences 1 and 2
oh... does sentence 2 describe delta rule in general (the way it is phrased makes it seem DeltaNet-specific)
I'm probably too tired for this right now
And I think it's a good point that fast weights applies to linear attention too... not quite sure how to shoehorn that in tho.
Well the delta rule is a general rule, applied to stuff... But deltanet applies it to the state (which in its case is the same kind state as linear attention has)
I see, so sentence 4 (basically what you just said) is describing the process sentence 2 describes?
not sure I understand that 2 vs 4 comparison, but generally any messiness like that is bc I was trying to fit in the things Blink required us to say about it
ic okay. yeah the more I talk about this the more confused I get lmao
that's not great, means I probably have some fixing to do
I'd like to make sure it makes sense to readers, even if they have no delta rule background
not sure if explaining linear attention is outside the scope of the paper tho
linear attention at large probably (definitely?) is
No, it means WKV state entries.
https://huggingface.co/BlinkDL/temp-latest-training-models/tree/main/data_sample
data_sample is random subsample of world dataset. note: due to technical reasons (very complicated due to my horrible messy code), some distill instruct data are not included, and only subsamples of these instruct data are included: flan, Buzz-V12, WebInstructSub, SKGInstruct, PIPPA, COIG-PC-core
I think jsonl could be better
ok you can build them from my binidx
for depth 1 & long length, i think we need better optimizer or data design (can try curriculum learning) for rwkv7 to grok @crystal hull
Built. Imagine: RWKV trained on data of this quality can surpass llama3 and Qwen2.5
And very little Chinese (less than 1%)
@misty igloo @obsidian quest https://huggingface.co/datasets/rwkv-x-dev/rwkv-world-v3-subsample
and how many heads are you using? we need at least 5 heads for single layer rwkv7 to solve S5. so can use multiple small heads π
@crystal hull
pls add an alternative version diag(w)(I - ckak) where one is free to use c=2 (this version was used in othello and found to be useful)
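One way to read the c=2 case: if k is a unit vector and a = 1, then I - 2 k k^T is a Householder reflection, so the transition can flip the state component along k instead of only erasing it. The sketch below checks this numerically in plain Python. Treating the transition as diag(w) (I - c * k (a*k)^T) with unit-norm k is an assumed reading of the message above, not a confirmed formula, and the helper names are hypothetical.

```python
import math

def transition(w, k, a, c=1.0):
    """Build diag(w) @ (I - c * outer(k, a*k)) as a list of rows (assumed form)."""
    n = len(k)
    ak = [a[j] * k[j] for j in range(n)]
    return [[w[i] * ((1.0 if i == j else 0.0) - c * k[i] * ak[j]) for j in range(n)]
            for i in range(n)]

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in m]

n = 4
raw = [1.0, -2.0, 0.5, 3.0]
norm = math.sqrt(sum(v * v for v in raw))
k = [v / norm for v in raw]  # unit-norm key
ones = [1.0] * n             # w = 1 (no decay) and a = 1

erase = transition(ones, k, ones, c=1.0)    # projects out the k direction
reflect = transition(ones, k, ones, c=2.0)  # Householder reflection across k

print([round(v, 6) for v in matvec(erase, k)])    # k is mapped to (numerically) zero
print([round(v, 6) for v in matvec(reflect, k)])  # k is mapped to -k
```

Under these assumptions the eigenvalue along k is 0 for c = 1 and -1 for c = 2.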
@gusty condor subsampled 100k as jsonl: https://huggingface.co/datasets/hevok/rwkv-world-v3-subsample-100k Maybe combine as different subsets in one dataset
Got it working. Just need a yaml configuration defining the subsets in the README: https://huggingface.co/datasets/hevok/rwkv-world-v3-subsample
Made index a subset as well: https://huggingface.co/datasets/hevok/Goose-World-v3
I don't think that we should be supplying this data with the paper if it does not properly represent the actual data used to train the models, due to the limitations Blink mentioned.
@young sparrow in your opinion is this better than nothing, or is including something that doesn't fully match worse than not providing it at all?
(some distill instruct data are not included, and only subsamples of these instruct data are included: flan, Buzz-V12, WebInstructSub, SKGInstruct, PIPPA, COIG-PC-core)
like each element? the specific meaning of the big O notation here is confusing to me
I'd like to understand it better since I think there is similar notation used here:
The $\tilde{k}_t$ in the formula can be regarded as a "normalized key", a design to ensure that the state of $\bm{wkv}$ contains columns of $O(1)$ size.
and I had changed that from 'entries' to 'columns'
the columns in the state represent values, basically - so I'm not sure that a per-element analysis is really the best metric around keeping things normalized. A vector kept in the usual form in pytorch has L2 Norm of sqrt(vector_dim)
wait for me to provide a patch
O(1): not growing over context length, and no outliers
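The per-entry and per-column readings of O(1) discussed here differ by a factor of sqrt(dim): a vector whose entries are each of order 1 has an L2 norm of about sqrt(dim). A small sketch with Gaussian entries, which are only a stand-in for real state columns:

```python
import math
import random

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

rng = random.Random(0)
for dim in (64, 1024, 4096):
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # unit-variance entries
    max_entry = max(abs(x) for x in vec)
    print(f"dim={dim:5d}  norm={l2_norm(vec):7.2f}  "
          f"sqrt(dim)={math.sqrt(dim):6.2f}  max|x|={max_entry:.2f}")
```

The norm grows with the dimension while the largest entry stays small, so "O(1) entries" and "O(1) columns" are different statements unless the norm is divided by sqrt(dim).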
For COLM we are apparently required to designate one of our authors as a 'reciprocal reviewer', and I'm not qualified to be that person:
Reviewers must have research experience equivalent to a second-year graduate student in machine learning or a related field. They must have been a primary author* on at least two peer-reviewed conference or journal papers published in a related venue (e.g., ACL, NAACL, EMNLP, ICML, NeurIPS, ICLR, JMLR, TMLR, CVPR, ICCV; this is not an exhaustive list).
Please let us know if you're an author of the RWKV7 paper who meets the criteria above and would be willing to do this for us. This is a requirement - we need someone to do it in order to submit to COLM 2025.
Update: I think we have this covered now - thanks to everyone for reaching out!
@paper dove
What is going to be the result of this patch? A genuinely representative sample of the data?
@misty igloo @gusty condor @obsidian quest If the data isn't going to get released you can't say that the "RWKV v3 World public corpus" is a contribution of the paper
how about a detailed description of?
That's not a contribution to the scientific literature
Yes, and I think at least 1% of the total amount is required
You also can't refer to it as an "open source corpus"
We could take inspiration from the Allen AI Institute's OLmo (Open Language Model) Project.
They tried to address open source as best as they could
They released their dataset
Dolma?
Yes
And the way they talk about the licensing of their dataset has misled a lot of people into thinking it's openly licensed
Yes, I am looking into how they did it and try to follow their guide.
I don't know what you mean by that
I am thinking about it.
What they did was release the data and wrap the entire thing, as a collection, in a database license. That database license is open source and the way they did messaging around it led people to think that the data was openly licensed.
I now understand. We should avoid such pitfalls. Thus, also learn from their mistakes. I think it is a great project though.
I will not let us fall into those pitfalls
If I'm getting authorship I'm happy to do it. If you need me to do some writing to qualify I can add some of the things I've suggested in comments.
Thanks, apparently we can list Will Merrill - so I think we're in a good place now
Describing the dataset in such a way that people can replicate it isn't a contribution to the scientific literature?
Maybe the phrasing needs to be a bit clearer around enabling replicability?
This is pretty much the exact same thing we did in the last paper, so I'm not sure why it's not valid this time
One sec (edit: actually I gotta run, be back later)
does rwkv7 show grokking without softmax?
@obsidian quest please pull https://github.com/RWKV/RWKV-LM to sync with your upstream repo
done
Shall we put RWKV7 code into RWKV-v7 folder?
No problem, let me know when you can. Also if you followed up on the 'missing' three (?) datasets please let me know where that ended up. We're trying to get the paper on arxiv as soon we can, and I want to make sure we have this dataset stuff ironed out to your satisfaction.
I think we shouldn't let the dataset issue delay the release of the other valuable information in the paper. The dataset can be dealt with later, but many people outside this channel are waiting for the paper.
just call it dataset preview, for now. will fix it when i am less busy
All 3 missing datasets were found.
- Wikipedia: Loader not working anymore #1103039376184852622 message
- Guanaco #1103039376184852622 message
- Books3 #1103039376184852622 message
@misty igloo
Guanaco is taken down because of Josepheus. Has a reputation of rugpulling on datasets
Have been grinding through all the RWKV World v3 corpus components and made sure that it is possible to download and sample from each component.
Here is the updated annotated dataset: https://huggingface.co/datasets/hevok/Goose-World-v3
Good news is: There are no major obstacles for reconstruction.
cool!
Just a few tiny details are lacking that would be helpful to eliminate ambiguity.
should I be copying this to the official RWKV HF
Yes, I think moving it to official RWKV HF.
I could just rename it if I am member of RWKV HF.
So it keeps the statistics from the original one as it already had quite some traction.
I can't give you that access, unfortunately
How about I move it to another org repo and you move it from there from org to org.
I could just create an org and add you to it.
sure if that's doable!
but then you wont be able to edit it any more
oh
Moved it to temporary Organization for now: https://huggingface.co/datasets/Goose-World/Goose-World-v3
ok well it's fine, let's just wait and add it in the next version of the paper
that way you have time to edit it a bit more
I will move those too.
They are also in the main one as subsets.
It has 3 subsets: index, 100k, and 1m
Moved all and assigned you admin rights.
I guess we just gotta update the links once we move orgs
@keen tartan do you want me to put it in RWKV now, or wait so you can keep editing
is there some way to add in the up/down sampling frequency info from the Eagle/Finch paper in as a column here for those that weren't just used as-is?
Let me check.
I can provide the code to generate the tables.
There is column for world version already.
Looking into how to add up/down sampling frequency column too.
the amounts are listed in the attached wiki.txt for the Eagle/Finch paper ... not sure its possible to include this
I check.
and oscar.txt
yeah seems tough to do, only reasonable way would be to maybe pre-process those datasets to create the filtered versions
and provide those separately as components
but if we did that it'd make the whole thing quite reproducible I think!
since those are the only specially sampled items
Is it only for the Wikipedia and OSCAR23.01 datasets that certain languages were randomly subsampled, right?
yeah
Then it seems doable.
I mean, review the Eagle/Finch paper to be sure, but that's my recollection
I will do so.
HuggingFace Hub is based on Git. So contributors outside of the organization should be able to make pull requests (called Discussions).
cool
I just dont want to make it harder for you to edit while you're still doing it a bunch
I am flexible. Whenever you think it is adequate to move it. I can work with Git. Just someone in the org needs to approve requests.
Wait.
"Discussions and Pull Requests are currently enabled for this dataset. Members of the community can propose changes to this repository."
Only members can make pull requests. I misinterpreted it.
You are right. Any members of the HuggingFace community can open discussions / make pull requests
@everyone no changes to the manuscript at this time, please - we are going to try to put it on arxiv
We will update our eval results for arxiv v2.
+2 points on ARC-e and ARC-c each, and small gains in MMLU.
We may surpass Qwen2.5 this time with lm-eval 0.4.8
I don't think changing the evals in a future revision is a good look. Let's do it now or never.
@obsidian quest How many tokens, and on which dataset, did you tune v7-world3-2b9-preview into v7-world3-2b9?
the former seems to be higher on certain evals like glue, gsm8k, and several others.
I used the markdown package (which may require lualatex which is incompatible for arxiv). I will change it
Hi everyone, Xingjian DU and I are still working on the audio modeling task. We were wondering if there might be any space available in the Evaluation section or appendix, either in this version or a future one, to include our work on this task? Of course, we fully respect your timeline and will align with your schedule.
Sadly, @misty igloo missed the deadline. This means that we have another 24 hours to go.
So, we should focus on:
- evaluations
- Audio modeling tasks if applicable.
Yes, please go ahead and insert your audio modeling subsection in the Multimodal Evaluations section
You have 23 hours left.
Haha try to do it much sooner than 23 hours though, please
I have to sleep now. I will check the evaluation section.
By the way, @obsidian quest please tell the difference between v7-2b9-preview and v7-2b9-release
@keen tartan if we're using these new results I'll need evals for Qwen2.5-7B as well for the FLOPs chart
also, I don't understand how your glue results have an average.. I thought those aren't given in later LM-eval versions
is this really using 0.4.8 for all of these?
I don't see the glue overall in the rwkv results for example
Is this the actual missing dataset, or is the missing one the original books3 that included the gutenberg portions?
Oh, ok. I try to get the evals together. Evaluation of a 7B takes a bit longer. But we try to complete in time.
I use average function based on @brisk bronze's extracted source when processing the results files. It is not a big deal to calculate it at all.
Yes, I have more results already.
it is using 0.4.8
mom
Trying to share relevant evaluations for the paper I run there.
hmm so 0.4.8 like gives glue averages sometimes but not others? are these the size-weighted ones or non-weighted?
bc either way we are going to have to manually calculate the averages for the ones it didn't print them for
It only gives results for the individual subsets, but it is easy to just average them. The function allows toggling between weighted and non-weighted averaging.
I share relevant code block.
but I see glue averages in some of them
code block?
you're running these not via the cmdline?
that explains it
I aggregate all results files to make tables and plots.
I figured you ran the RWKV tests via lm-eval cmdline now, just like the rest of the evals
since in 0.4.8 it properly uses the flag for BOS token
I can do either way.
One moment please.
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # A helper function that is used to aggregate
    # subtask scores cross-task.
    if not weight_by_size:
        sizes = [1] * len(sizes)
    assert len(metrics) == len(sizes)
    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
That is the function I am using to calculate the average in post-processing.
glue_val_split = {
    'cola': 1043,
    'mnli': 9815,  # _matched
    'mnli_mismatch': 9832,  # ed
    'mrpc': 408,
    'qnli': 5463,
    'qqp': 40430,
    'rte': 277,
    'sst2': 872,
    'stsb': 1500,
    'wnli': 71,
}
These are the individual subtask names and counts.
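As a usage sketch, the helper and split sizes above give two different GLUE averages depending on `weight_by_size`. The per-subtask accuracies below are placeholders, not real results, and only three subtasks are shown to keep it short.

```python
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # Same logic as the helper above: size-weighted or plain mean.
    if not weight_by_size:
        sizes = [1] * len(sizes)
    assert len(metrics) == len(sizes)
    return sum(m * s for m, s in zip(metrics, sizes)) / sum(sizes)

# Validation split sizes taken from glue_val_split above.
sizes = {"qqp": 40430, "rte": 277, "wnli": 71}
# Placeholder accuracies for illustration only.
scores = {"qqp": 0.60, "rte": 0.50, "wnli": 0.40}

names = list(sizes)
metric_list = [scores[n] for n in names]
size_list = [sizes[n] for n in names]

weighted = aggregate_subtask_metrics(metric_list, size_list)
plain = aggregate_subtask_metrics(metric_list, size_list, weight_by_size=False)
print(f"size-weighted: {weighted:.4f}")
print(f"unweighted:    {plain:.4f}")
```

Because qqp dominates the sample count, the weighted average sits close to the qqp score, while the unweighted one treats all subtasks equally.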
which variety is shown in the glue outputs you currently have in this folder?
It is not in the results files.
it is actually
yes that's why this question arises
bc it generally doesnt show up there for 0.4.8 cmdline
I only see the statistics of the subtasks.
Anyway, calculating the average is not a big deal.
Hold on.
and as far as I know, that means it was not run under 0.4.8
This file was generated from conversion of markdown table from @gusty condor
It was from a previous experiment using 0.4.3
but.. you said that this folder had all 0.4.8 results
Sorry for the confusion.
this is why I am trying to make sure everything is done correctly
because it's clear to me that there is mismatching data
Then there should be '0.4.8' in the file name.
I don't see any files like that in this repo
in sub folders.
I mean for anything other than rwkv
I have not run it yet for all models.
I have results for reference models that I have not pushed there yet.
oh ok
sorry, that was the mixup - I didn't realize that not everything was done yet
I have results for the reference models like SmolLM2, Llama, and Qwen as well.
Trying right now to organize them.
I did run those a week ago, but thought they will not be used in the paper.
yeah I didn't expect us to change to this in the last 24 hrs before publishing
I don't have Qwen 7B yet.
I try to share what I have before going sleeping.
@gusty condor and @brisk bronze and anyone else who likes can complement results.
she's busy until tomorrow, unfortunately, so probably not enough time for her to contribute to those
Good to know.
By the way I figured out we can speed evaluation with multiple GPUs.
there are a few ways to do that using the cmdline
Using accelerate, e.g.
```bash
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 'auto'
```
yep
Also set batch size to 'auto', then it tries to calculate ideal batch size itself.
that 'auto' bsz tends to break (or it used to) for mmlu
but works on many normal evals
I also have a version of the RWKV eval harness thing that supported batched inference and is much faster
but I don't want to use it here
that's why I wanted to use the lm-eval cmdline version for RWKV, so we get multi-gpu acceleration and batching
anyway it doesnt matter, since you finished those
@nova frost Isn't the lm_eval version specified in the output json file's metadata?
no but we do log the git_hash
oh, that might be helpful to differentiate them.
yeah you can either checkout or browse https://github.com/EleutherAI/lm-evaluation-harness/tree/<git_hash>
"git_hash": null, -.-*
damn. lol. I'll add the lm_eval version going forward
but that probably meant it wasn't run from a git dir, so installed from pypi
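A small sketch of how one might check this across a pile of results files. It relies only on the `git_hash` field mentioned above, assumed here to sit at the top level of the results JSON; the helper name and the example hash are hypothetical, and no other metadata keys are assumed.

```python
import json

def describe_provenance(results: dict) -> str:
    """Summarize what the logged git_hash says about an lm_eval results dict."""
    git_hash = results.get("git_hash")
    if git_hash:
        return f"run from a git checkout at {git_hash}"
    return "git_hash missing or null: probably not run from a git dir (e.g. PyPI install)"

# Inline stand-ins for loaded results files; "abc1234" is a made-up hash.
from_pypi = json.loads('{"git_hash": null}')
from_git = json.loads('{"git_hash": "abc1234"}')

print(describe_provenance(from_pypi))
print(describe_provenance(from_git))
```

For a real file, the inline strings would be replaced by `json.load` on the opened results file.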
I cannot tell for sure what version of lm_eval I ran the reference models evaluations for SmolLM2, Llama, Qwen some time ago.
We may need to recompute them to be certain.
LR decay 1e-5 to 1e-7, on randomly sampled 100G tokens. slightly improves loss & uncheatable eval
Tried to organize most of the evaluations I ran: https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval
Not sure about the reference model's eval versions. Gonna try to get a few hours sleep.
@misty igloo Should mention in the paper
Running the reference models (qwen, llama, smollm) on 0.4.8, results will be here
https://github.com/jannalulu/lm-evaluation-harness/tree/main/results-0.4.8
Note: perplexity of lambada_multilingual should be the geometric mean over 5 languages, not the arithmetic average. Strange that even lm_eval was mistaken on that.
What is your source for this?
I am awake. I focus now on SmolLM2 model series evaluations.
I think it's almost obvious, from the definition of perplexity.
The geometric mean of perplexity is equal to the exponential of average negative log likelihood loss.
On the other hand, the arithmetic average of perplexity has no clear semantic meaning.
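A short numeric illustration of this point, assuming each language contributes equally to the average. The per-language losses are invented for the example.

```python
import math

# Hypothetical mean negative log-likelihoods (in nats) for 5 languages.
nll = {"en": 2.0, "de": 3.0, "es": 3.2, "fr": 2.9, "it": 3.4}

# Perplexity is the exponential of the mean negative log-likelihood.
ppl = {lang: math.exp(loss) for lang, loss in nll.items()}

arithmetic_mean = sum(ppl.values()) / len(ppl)
geometric_mean = math.exp(sum(math.log(p) for p in ppl.values()) / len(ppl))
exp_mean_nll = math.exp(sum(nll.values()) / len(nll))

print(f"arithmetic mean of ppl: {arithmetic_mean:.3f}")
print(f"geometric mean of ppl:  {geometric_mean:.3f}")
print(f"exp(mean NLL):          {exp_mean_nll:.3f}")
```

The last two values coincide, while the arithmetic mean is pulled upward by the highest-perplexity language.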
Awesome! The Qwen ones are almost done. Qwen 3B is missing 5-shot MMLU and Qwen 0.5B also needs to be calculated.
Llama 1B and 3B is also needed.
SmolLM2 135M, 360M, and 1.7B also required.
You've done that.
I check.
The empty items in the tables 3 and 4 are currently missed.
We should recheck pile models too
Qwen 2.5 7B sciq and 5-shot MMLU is also missing. Any one running those? I could attempt, but my runs take always so long to finish.
I try Qwen 2.5 7B sciq
Done, now trying 5-shot MMLU (but it seems to take over 5h). I am abandoning 7B and focusing on the smaller models first for now.
Qwen2.5 7B mmlu 5-shot=74.2 confirmed by both https://arxiv.org/pdf/2412.15115 and https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/Qwen__Qwen2.5-7B/results_2025-03-06T22-02-41.880180.json
is the tab:audiorwkv_results table ready for this? It's not currently in the document
We will need that in the next few hours in order for your experiments to be a part of the arxiv pre-print in this version
The COLM deadline is also soon, and I will need your open review ID's and/or emails used to sign up with open review
I also think you need more explanation of how your "approach enhances RWKV-7's capabilities to interpret and process complex, high-dimensional spectrogram features"
If you make claims in the paper they need to come with evidence.
The text currently does not describe anything at all about how or what AudioRWKV-7 does, except that it uses spectrograms.
Considering the timeline here I am going to comment this out of the paper for now. If you think you have what's needed before say 4pm UTC today, let us know here and we can consider if there is time to put it in.
This is a very late addition, and I'm not guaranteeing that it will be able to become a part of the paper. That will depend on both when you have the full writeup ready, and what the quality is like.
this seems on the high side. Are you batching correctly?
I set it to --batch_size="auto".
Used an NVIDIA L40S with 48 GB of memory. It occupied almost all the VRAM, so I assumed it was batching it correctly.
For another run with Qwen-2.5 3B I got OOM with the same lm_eval command args but on a different, smaller dual T4 GPU setup. Gonna try setting batch size to 1 or using a single GPU in this case.
yeah auto can sometimes be unreliable
PRs welcome if anyone can improve on it: https://github.com/EleutherAI/lm-evaluation-harness/blob/fa1ce2c665aa4d079a822fbb6fae905d531aca1f/lm_eval/models/huggingface.py#L736
I am interested in contributing to lm-eval! Need to recover my GitHub account first...
can also do auto:N so that it recomputes the batch size N times. But this is mostly helpful if you're running multiple tasks (so more variation in seq lengths)
Oh, gonna try that too. Thanks! Default is auto:1.
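For reference, a small sketch of how such a run could be assembled. The helper name `build_lm_eval_command` and the output folder are hypothetical; the flags themselves (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, `--output_path`) are the lm_eval CLI options discussed in this channel. The command is only built and printed here, not executed.

```python
import shlex

def build_lm_eval_command(pretrained, tasks, batch_size="auto:4",
                          output_path="results", num_fewshot=None):
    """Build an lm_eval argv list (hypothetical helper, for illustration)."""
    argv = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        # "auto:N" asks the harness to re-estimate the batch size N times.
        "--batch_size", batch_size,
        # Writes JSON results (with metadata) into this folder.
        "--output_path", output_path,
    ]
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    return argv

cmd = build_lm_eval_command("Qwen/Qwen2.5-3B", ["mmlu"], num_fewshot=5)
print(shlex.join(cmd))
```

If auto batching still runs out of memory, passing `batch_size="1"` to the same helper gives the fallback mentioned above.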
Urgent: who is testing llama3.2?
it's possible Janna is running those, (but she's in PT timezone, and it's still 7:16am there, and she just got back from a late night flight)
last night she said "probably would also do llama and smollm"
I had asked her to coordinate with @keen tartan tho so maybe he knows if they're in progress
That is possible. I was very tired/exhausted last night.
I am testing Llama 3.2 1B and 3B.
I can push the results I got so far.
not all of it is completed yet.
The lm_eval version is in the file names. I'll move them into folders later on.
I am testing too. ETA 50 min before deadline
Oh, you mean it will complete 50 min before the submission deadline? ugh
once we have all the data I have to see if christian can regenerate the flops plots he made
I have submitted our abstract to COLM.
this should not be very difficult... if the numbers don't drastically change, the same formatting should work, and that's most of the complexity in making the graphs
hey, wait, llama3.2 is not in the FLOPS charts
Great! We are almost done.
yeah just a matter of your timing availability considering the tight timeline around the 16:00 UTC deadline
What are we still missing revised number for? Just SmolLM?
I have a meeting for the next hour but after that will be available to update charts, which should work out
SmolLM numbers are done, see table 3 and 4
only llama 3.2 3b left
okay I'll start plugging in the data into my google sheet, and Christian can copy from that later
and eta 1h
we dont use llama on this sheet so no problem
@misty igloo Please redact links in the abstract for COLM for anonymity!
there is no new pawsx test in @brisk bronze 's output for Qwen
also, I need some numbers for lambada.m on Qwen 7B if we are calculating it some new way
We present RWKV-7 "Goose", a new sequence modeling architecture featuring constant memory usage, constant inference time per token, state-of-the-art downstream performance on multilingual tasks, and near SoTA English language performance at the 3 billion parameter scale despite being trained on dramatically fewer tokens than top models in its class. To accomplish this, RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. We also present an extended open source 3.1 trillion token multilingual corpus. We trained a set of models from 0.19 billion to 2.9 billion parameters on this dataset and find they exhibit exceptional performance across a range of common benchmarks.
To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.
We present RWKV-7 "Goose", a new sequence modeling architecture. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. To test RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and trained four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. These models exhibit state-of-the-art downstream performance on multilingual tasks, and near SoTA English language performance at the 3 billion parameter scale despite being trained on dramatically fewer tokens than top models in its class. Still, RWKV-7 models remain at constant memory usage and constant inference time per token.
To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.
Which version is better?
Thanks for the reminder; we are still working on that. That version is just a placeholder, and an updated version with more model details is being written. I will finish it ASAP
please wait to add it until after 1600 UTC - we will be submitting the current version of the paper without it to arxiv at that time
we can still potentially add it later for v2
Please refer to hevok's tests
@gusty condor chart looks wrong for Qwen 3B multilingual
could you check that all the numbers in the manuscript are really correct for that one?
or maybe 1.5B numbers are too high
I attempt to test Qwen-2.5 7B lambada.m.
I think the numbers exist, I just need to know the final avg bc it seems like @gusty condor changed the formula
All right. Indeed, we have it already. Use geometric mean for averaging.
I try.
Where are these new results for RWKV Pile coming from? I don't see them anywhere
oh forgot to run that, will run it now
it got run already, see Hevok's data above
oh
import numpy as np

# define custom function
def g_mean(x):
    a = np.log(x)
    return np.exp(a.mean())

# calculate geometric mean
g_mean([41.524835786735544, 3.70873656895629, 67.94895756237318, 23.454130938244965, 31.073140477732952])
Output: 23.793823761545397
That's only for perplexity
lol
sorry, doing a million edits right now - this is nuts trying to change this whole sheet and all its sub evals at the last minute
without making mistakes
@gusty condor Where are these new results for RWKV Pile coming from? I don't see the source data anywhere
is this just recalculating glue via normal avg?
Seems like tables 3 and 4 are fully completed. Is anything missing by now?
I'm gonna try to double check the values.
I think I have everything done on the google sheet
but I'm still concerned that Qwen line looks bad
I look at Qwen 3B now.
@misty igloo good for me to transfer numbers to mine?
at least provisionally... qwen multilingual seems like some weirdness but otherwise it should be correct
qwen has the same behavior with a relative dip at 3B multilingual in the previous data, but less pronounced ig
plus that's the easy one to change
it may be correct, but it looks sus to me
Seems correct. Checking pawsx again
Yes, and it turns out there is no difference between lm-eval 0.4.3 and 0.4.8 on these english benchmarks
could also be that 1.5B is the one that's wrong, which makes 3B look like it dips
or 7b
hey, who removed subfigure captions on 3 and 4? we need those for some of the crossreferences
I did - janna had asked if we needed it
I'll comment it back in, sorry
You haven't updated figure 4 yet
it is updated if you recompile
Nope, looks like something wrong with your plot. Now RWKV7 should be higher than Qwen2.5 at 3B
smerky's average sheet says 71.0 average RWKV7 2.9B and 71.4 average qwen2.5 3B
which sheet
@misty igloo
which data are you getting this idea from?
@keen tartan suggested [11] for pad
Why do we show Qwen2.5-7B for English in table 3 but not for multilingual in table 4? I think we should consider commenting it out of table 3, as we don't yet have an RWKV model of this class to compare with.
there is no reason to show it in the tables, it's there in the charts to show how it changes with further scaling
I put that in for reference. Should definitely comment out
Yeah, I can see this. Was also toying around in thought with extrapolating how RWKV7 7B and 14B would perform
Apologies for the confusion.
Please be specific about what you're comparing - the plots in the document are outdated
the plots in the document as of latest recompile are updated to the latest data in your sheet
so the data we are working with does not appear to support that claim
I agree it looks wrong
checking
@fresh mulch I updated the RWKV7 1.5 and 2.9B numbers
they were old
fixed, uploaded
@gusty condor do you think we can put SoTA now for both, instead of 'near SoTA' english
I'm a little leery of making the claim of SoTA on english
because we don't establish a new SoTA, except on a per tokens trained basis
which should matter but... I just don't want to overclaim
I don't mind this reordering, but it leaves the juiciest part until the very end of the paragraph
It would be nice to lead with our best foot forward: that we have the best 3B LLM for way less training, and demolish everything on multilingual
Bo might have some different opinions: Architecture is the juiciest part, the model serves as a tool to demonstrate the architecture.
It's good for the first sentence to include the best results
I agree, but the only way anyone can judge the architecture is via the results
but not significant enough to claim the new sota
How about something like this:
We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and trained four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset.
To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.
To test -> to demonstrate
'with LLMs' bothers me a bit.. not sure how to rephrase that
maybe 'with released LLMs'?
okay are we ready for publishing?
Yes!
LGTM
@gusty condor can I remove your footnotesize on the multilang table?
trying that now
post error msg asap so that we can debug
🪿 🥳
anyone got an idea of what ACM class we are?
I.2.7 maybe
I.2.7 Natural Language Processing
I guess that's what I'll put
Sounds alright
I think I.2.0 (this is where general architectures should live). But I.2.7 is still great.
Yeah, as it can be applied to other modalities as well, not only NLP.
Goose-tastic! (or even better: Honk-tastic!)
yup, time for that annoying process
We shall have a good rest
I'm going to sleep then 😴
you deserve a good sleep!!! gnite! Great work!
great work
please test mamba for <|endoftext|> effect as i predict it will be strong too.
Time's up! How is it going?
We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B mo...
It just went out two min ago!
how do we submit it to HF daily papers? maybe tweet at Akhaliq?
oh its here https://huggingface.co/papers/submit but I somehow don't have a paper listed so maybe someone else here can who does
Done!
Have you posted to r/LocalLlama and/or r/MachineLearning?
Nope! Plz do so if you can
Sure.
We are competing with 2 papers, #2 of which looks so clickbaity. Yet such papers receive lots of upvotes. This is unfair.
Now we are #2 and Impossible Videos is #1.
One more upvote and we are #1!
It got nominated #1 paper of the day on HF. Of course. RWKV7 is more fundamental than a "merely deep fake generator". A "bimodal" benchmark is also no real competitor.
We may consider adding some illustrative visuals to the abstract page for the next version.
your account seems to have been suspended
also r/localllama is horrible with their moderation, I always have difficulty posting about RWKV
@willow condor You can recover your account. Contact Reddit support and submit an appeal.
Thanks!
does rwkv work with attention as a hybrid?
Yes, but that adds no benefit. You will barely see decreased loss or benchmark improvements.
I think you will see a benefit. But I don't have experimental evidence for v7 yet
@paper dove did some: Adding one layer of attention to L12/D768 RWKV-7 decreases loss by around 0.0008 (not significant).
Interesting!
I'm not having the best results with rwkv, it's worse than attention in my experiments
What did you try?
wdym
I mean what experiments did you conduct and what was the result compared to the expected results?
Which models did you use for instance?
I tried it on my custom dataset for language modeling, but it's totally different than typical LM datasets. I compared it to transformer with rope, value residual and muon optimizer
So you trained model from scratch or fine-tuned?
There are lots of things to consider for training.
That is a very small model. Interesting!
What trainer have you used? The RWKV-LM repo?
no, my own code
I see. If you'd like to share, we can provide feedback on getting better results.
There must be something wrong with the code. 0.3 in loss is too significant.
I literally replaced attention layer with Rwkv7Attention from fla
RWKV7Attention(
    mode="chunk",
    hidden_size=hidden_size,
    head_dim=64,
    num_heads=None,
    decay_low_rank_dim=64,
    gate_low_rank_dim=128,
    a_low_rank_dim=64,
    v_low_rank_dim=32,
    # v_low_rank_dim=16,
    norm_eps=1e-5,
    fuse_norm=True,
    layer_idx=layer_idx
)
The initialization of FLA-RWKV7 does not function properly.
Parameter Initializations. Proper parameter initialization is crucial for ensuring training stability and achieving optimal performance for language models. RWKV-7 employs a carefully designed initialization strategy tailored to its architecture. The detailed initialization scheme is beyond the scope here but can be found in the official code repository. We emphasize that using the recommended initialization is essential for replicating the results in this paper. Deviations from the prescribed initialization may lead to performance degradation.
good to know lmao
so riddle me this
how was this trained? https://huggingface.co/fla-hub/rwkv7-191M-world
It was converted from RWKV checkpoint.
ah ok
It was trained with code from this repo: https://github.com/BlinkDL/RWKV-LM
Check the RWKV-v5 folder. There is the training code for RWKV7. I know it is a bit confusing.
where is initialization code?
RWKV-v5/src/model.py
This function is extremely obfuscated, but the main purpose is:
- initialize down projections with 0
- initialize embedding with very small numbers
- orthogonally initialize up projections, r, k, v and output head with relatively small gains
- initialize token shifting with some magic numbers
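The four bullet points above can be sketched roughly as follows. This is not the code from RWKV-v5/src/model.py: the function names, shapes, gains, and the embedding range are all assumptions chosen for illustration, NumPy stands in for PyTorch, and the token-shift constants are deliberately left out because they are not given here.

```python
import numpy as np

def orthogonal(rows, cols, gain, rng):
    """Return a (rows, cols) matrix with orthonormal rows or columns, scaled by gain."""
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)          # q has orthonormal columns
    q = q * np.sign(np.diag(r))     # fix column signs so the result is not biased
    if rows < cols:
        q = q.T
    return gain * q

def init_sketch(vocab, d_model, d_low, seed=0):
    """Illustrative initialization only; every numeric value is an assumption."""
    rng = np.random.default_rng(seed)
    return {
        # embedding: very small numbers
        "emb": rng.uniform(-1e-4, 1e-4, size=(vocab, d_model)),
        # low-rank down projection: zeros
        "down_proj": np.zeros((d_low, d_model)),
        # low-rank up projection: orthogonal with a small gain
        "up_proj": orthogonal(d_model, d_low, 0.1, rng),
        # r, k, v and output head: orthogonal with relatively small gains
        "receptance": orthogonal(d_model, d_model, 0.5, rng),
        "key": orthogonal(d_model, d_model, 0.1, rng),
        "value": orthogonal(d_model, d_model, 0.5, rng),
        "head": orthogonal(vocab, d_model, 0.5, rng),
    }

params = init_sketch(vocab=50, d_model=16, d_low=4)
for name, w in params.items():
    print(name, w.shape, float(np.abs(w).max()))
```

For real training runs the values in the official repository should be used; the sketch only shows the structure of the scheme.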
moreover use LayerNorm for rwkv7 (not RMSnorm)
yeah I use ln
moreover you can modify https://github.com/BlinkDL/modded-nanogpt-rwkv
(note this is a variation of rwkv7)
you can verify this
https://x.com/BlinkDL_AI/status/1855245097094517181
RWKV-7 can also reach 2.27xx in 3200 steps (originally 5100 steps) reproducible code & log: https://t.co/cuH0pItsPy #RWKV #RNN
I mean it's kind of interesting because on some other dataset which was more prone to overfit on some implicit concepts, it behaved better
0.3 loss difference certainly means something is wrong
if it is not better than your transformer, the code is buggy
yeah but in this case it was better by 0.3
are you comparing train loss, or val loss?
val
how about train loss
got train loss curve comparison?
@dusty skiff please do these first
yeah I have to figure out the code, or maybe you've got some idea how I can modify this
def _initialize_weights(self, module: nn.Module):
    if getattr(module, "_is_hf_initialized", False):
        return
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    # note: a module is never an nn.Parameter, so this branch does not run when called on modules
    if isinstance(module, nn.Parameter):
        nn.init.xavier_uniform_(module, gain=2 ** -2.5)
    module._is_hf_initialized = True
sorry for the mess haha
should I use this?
# !!! initialize if you are using RWKV_Tmix_x070 in your code !!!
# self.receptance.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
# self.key.weight.data.uniform_(-0.05/(C**0.5), 0.05/(C**0.5))
# self.value.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
# self.output.weight.data.zero_()
I'm lost lol
idk I have this and there's little difference in loss
yeah uncomment this block and there's another block you need to uncomment
ctrl+f for the first comment line
dont use if name == 'xxx'
use if 'xxx' in name
and print all names, and print() sth inside if, to make sure these ifs are called
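A small sketch of that debugging advice. The parameter names below are made up for illustration (real names depend on how the model is wrapped), and `match_rules` is a hypothetical helper: it uses substring matching instead of equality and reports any rule that never fired.

```python
def match_rules(param_names, rule_keys):
    """Map each rule key to the parameter names that contain it as a substring."""
    hits = {key: [] for key in rule_keys}
    for name in param_names:
        print("param:", name)                 # print all names
        for key in rule_keys:
            if key in name:                   # substring match, not name == key
                print("  matched rule:", key) # confirm the branch is reached
                hits[key].append(name)
    unused = [key for key, names in hits.items() if not names]
    return hits, unused

# Hypothetical parameter names, as a wrapper module might expose them.
names = [
    "model.layers.0.attn.receptance.weight",
    "model.layers.0.attn.key.weight",
    "model.layers.0.attn.output.weight",
]

hits, unused = match_rules(names, ["receptance", "key", "output", "value"])
exact = [n for n in names if n == "receptance"]  # what an == test would find
print("rules that never fired:", unused)
print("exact-match hits:", exact)
```

Here the `value` rule never fires and the exact-match test finds nothing, which are the two silent failures the advice is meant to catch.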
@dusty skiff let's move this out of the paper writing channel and into the rwkv discord or rwkv channel here
but generally speaking, wrt to the papers, we really need to provide an easy to use training code (FLA?) with proper inits
or else everyone will have this experience
the RWKV-Block repo could become this, but someone needs to devote time to making sure it's really perfect first
and that someone will not be me 🤣
I think improving the FLA code specifically is important, since that's probably what people will try first
@gusty condor I don't know if you have time to help fix that but it'd be great if you do
I currently copied the FLA models to the official RWKV HF, so it's the default implementation right now that people find
does it have correct initialization
nope, but @random granite expressed interest in getting it to
I think the problem is their setup for all the FLA models isn't currently well suited towards special inits and needs some changes to support that (to be fair, our code for that is a horrible mess)
@sonic horizon When do you expect to have the full AudioRWKV experimental results and additional baselines ready? Please keep in mind that the final COLM submission date is about a week away.
If featured, just know that it will almost certainly end up in an appendix for that paper. Space is at an extreme premium, as the entire paper must fit in 9 pages.
Great work on the paper everyone
We are training a larger model to match the parameter scale of the new baselines. Since the audio embedding training starts from scratch, it takes several days to complete. Of course we will pay attention to the COLM deadline.
uhm looks like yes? (I hate the notation we used)
will update the paper
Yes
The main problem is that FLA's initialization conflicts with RWKV-LM's initialization. If some layers initialized with RWKV, others handled by FLA, the model can't train properly.
i added a table to https://github.com/BlinkDL/RWKV-LM
let's fill in all details for a version in paper
Todo:
- #1103039376184852622 message
- #1103039376184852622 message
- #1103039376184852622 message
- #1103039376184852622 message
I ran a quick experiment to test Mamba2 with and without add_bos_token flag and found no difference in accuracy and no significant difference in perplexity.
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-370m
Perhaps the lm_eval's add_bos_token option is buggy for Mamba models as well and did not actually add it.
Gonna try again with installing from GitHub directly.
I don't see what value adding this to the paper has
I think it is meant for the RWKV-LM repo so that users will not run into issues with wrong initialization when attempting to train models from scratch as the table provides sane recommendations for implementations.
@obsidian quest You were right! 43.9% versus 43.5% in accuracy and 16.8 versus 17.1 in perplexity.
No bos:
| Tasks        |Version|Filter|n-shot| Metric   |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4392|±  |0.0069|
|              |       |none  |     0|perplexity|↓  |16.8289|±  |0.5443|
With bos:
| Tasks        |Version|Filter|n-shot| Metric   |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4353|±  |0.0069|
|              |       |none  |     0|perplexity|↓  |17.1491|±  |0.5539|
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m/0.4.8
No significant difference. Why right?
RWKV-7 has better scores with BOS.
The effect, although small, seems to be the reverse of the one for RWKV: no bos is preferred over bos for Mamba2.
test RWKV-G1 too: that is very significant.
I did already.
Might need to organize it better.
The difference of scores there is within the margin of error. The correct conclusion is that there isn't a meaningful preference.
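A quick check of that statement using the values and standard errors from the two lambada_openai tables above. It assumes the two runs can be treated as independent, which is a simplification: both runs score the same examples, so a paired test would be the more careful choice.

```python
import math

def z_score(a, se_a, b, se_b):
    """z statistic for the difference of two estimates with known standard errors."""
    return (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Mamba2-130m, lambada_openai: no bos vs. with bos.
z_acc = z_score(0.4392, 0.0069, 0.4353, 0.0069)
z_ppl = z_score(16.8289, 0.5443, 17.1491, 0.5539)

# |z| below 1.96 means no significant difference at the 5% level.
print(f"accuracy   z = {z_acc:.2f}")
print(f"perplexity z = {z_ppl:.2f}")
```

Both statistics come out at roughly 0.4 in absolute value, far below the usual 1.96 threshold.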
Yeah.
It is a wild-goose chase (pun intended). ^^
please test those 142 special problems
Which 142 special problems?
Perhaps the sample size is too small. So testing it on more evaluation tasks might reveal a statistically significant difference.
Ah I see what you mean.
You mean the 142 examples where the answer is the first token as identified by @gusty condor. Gonna check those.
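One way such a subset could be selected, as a rough sketch. The exact criterion used to pick the 142 examples is not stated here, so the rule below (the final word of the passage equals its first word, compared at word level rather than token level) is an assumption, and `answer_is_first_word` is a hypothetical helper.

```python
PUNCT = "\"'.,!?;:"

def answer_is_first_word(text):
    """True if the last word of a LAMBADA-style passage equals its first word."""
    words = text.split()
    if len(words) < 2:
        return False
    return words[0].strip(PUNCT) == words[-1].strip(PUNCT)

examples = [
    # Shortened from the example quoted earlier in this channel.
    "Mathews lifted a dark brow. \"Good luck. You're going to need it,\" said Mathews",
    # Made-up counterexample.
    "She opened the door and looked outside for the dog",
]

subset = [t for t in examples if answer_is_first_word(t)]
print(len(subset), "of", len(examples), "examples selected")
```

The selected passages can then be written out as a separate dataset and evaluated with and without the extra BOS token.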
where do people want to put this? in an appendix on alternate designs? we're still working on fitting the paper into 9 pages for COLM so adding more to the main paper is probably infeasible, but maybe we could add this into Table 1
@obsidian quest do you have a name for this formula? RWKV7-alt? lol
(Also, did you try it for language modeling? I was always hoping we'd move the w outside of the evolution formula, if you recall!)
ok lets call it RWKV-7a
it causes more NaNs
Did you ever capture a replicable NaN?
Created custom task for special 142 samples of lambada-openai and tested Mamba2 again. No significant effect it seems:
no bos:
| Tasks      |Version|Filter|n-shot| Metric   |   | Value |   |Stderr|
|------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada-142 |      1|none  |     0|acc       |↑  | 0.4085|±  |0.0414|
|            |       |none  |     0|perplexity|↓  |14.5471|±  |2.6646|
add_bos:
| Tasks      |Version|Filter|n-shot| Metric   |   | Value |   |Stderr|
|------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada-142 |      1|none  |     0|acc       |↑  | 0.4085|±  |0.0414|
|            |       |none  |     0|perplexity|↓  |14.1444|±  |2.6250|
I basically used the following code, but tested Mamba2 rather than SmolLM2: https://colab.research.google.com/drive/1nle-APaWJ12uA-WS9zgLEqAmY6cGWRHm?usp=sharing
I may have missed something perhaps.
That's perfectly normal for mamba 2.
we're now #1 on weekly papers on HF too
https://huggingface.co/papers/month/2025-03 11 votes to be #2 of the month
@misty igloo Hi, sorry to keep you waiting. I have updated the final experiment results for audio modeling and finished the AudioRWKV writeup, in \paragraph{RWKV for Audio Modeling}, RWKV-7 (preliminary).
If there isn't enough room in the main text, we could include it in the appendix. However, I'm not sure which section would be the most appropriate. Could you advise?
Awesome! Those results look great!
For COLM I will definitely have to put it in the appendix. But it might be okay to leave it in the main paper for the Arxiv version - I'll take a look and move it if needed.
Are DeepRes and HST-AT the transformer based architectures?
I'm unclear on the difference between HST-AT and HST-AT pretrained
Also, could you explain what is meant by
Note that we did not use the ensemble trick in this experiment, resulting in a slight drop in performance compared with results reported in \citet{rwkv6_colm}.
I'm not sure I understand what the ensemble trick is or what results you're saying were in the rwkv5/6 paper
DeepRes is based on a Deep Residual Network; HST-AT is transformer-based. HST-AT pretrained means its weights are initialized from pretrained vision models, which is a commonly used trick to improve performance.
For the ensemble trick, it provides a bigger ensemble result by using models with different patch settings. We used it in the audio modeling section in RWKV6.
how many heads are you using? we need at least 5 heads for single layer rwkv7 to solve S5. so can use multiple small heads @crystal hull
p.s. note i chose v^T k instead of k^T v because it fits the L2 loss
I can add these details in the writing.
I already added the RWKV7a formula to Table 1, not sure where else we should mention it
The proofs do currently show a version with c=2, and then show how to remove it while maintaining proof correctness
please mention RWKV-7a is found to be useful for othello @iron parrot
I'm a little confused bc I don't remember an audio section in the Eagle Finch paper, and can't find it in there now either?
(If it's not in that paper, maybe you could cite your repo for those results instead of the colm rwkv6 paper?)
Great! Thanks
I just checked and found that the results are in Arxiv v4 of the Eagle Finch paper. If this version hasn't been widely shared, I can remove the part about 'comparing with RWKV6' so as not to confuse the reader.
oh you could just cite the arxiv paper
but this thing where the results are worse isn't good
Could you train the RWKV7 version with the ensemble trick? It's important to show that it does better than the RWKV6 version, and will presumably improve your results vs the other models
Yes, we can do this. But it may take some more time.
I guess the other option would be to train the v6 without the ensemble trick, to show that RWKV7 is an improvement
Not sure if that's faster, but of course the better result is preferable
I think it's quite important to show some apples to apples comparison with v6 though
Otherwise you haven't really demonstrated anything about v7, which is the goal of putting this into the paper
That's a good idea. We can train RWKV6 without trick.
No tricks in training RWKV7 please, or use tricks for both.
Time to work for COLM submission!
@tropic minnow and I already have it mostly done
sorry, posted the wrong paper link a moment ago... will get one asap
fair point about variable definitions, though this notation is standard isn't it
i guess we still ought to define our terms before using them
also is one of these $\kappa_t$ supposed to be $\hat{\kappa}_t$
Christian Azinn
I have updated the paper accordingly.
doesn't really change anything to list it one way or the other.. its just a normalized version
view colm paper https://www.overleaf.com/read/vhrvqrmmztgj#d06246
this plot (right side) might benefit from nonoverlapping text with grid, and making rwkv more orange, less yellow
Omitting Mamba and RWKV-Pile on the left looks weird at a glance. I know it's because of the minimal multilingual content in the pile, but you should explicitly say that in the caption so someone who glances at the plot has that context. If there are numbers for those models, I would recommend including them even if they're bad tbh. Most plots should be optimized to be easily digestible at a glance / to people skimming
"nonoverlapping text with grid" ie the rwkv7-pile caption?
caption it is, unless we have numbers @misty igloo (i forget who did evals)
We don't have those numbers
If you make edits, please do so only on the arxiv version
I will port them to the COLM document after validating the final choices made
otherwise it becomes really hard to track what changed
makes sense. i also need to change axis titles and fix alignment. will do in an hour or so
Woohoo! I finally got it all to fit in 9 pages with all the figures and tables we need.
"However, training for multi-query associative recall (MQAR) is highly unstable and strongly dependent on initialization and hyperparameter settings
some guy read this and said RWKV7 is bad at MQAR so we dont provide MQAR chart
so let's add chart for this
in this style (show 1024 & 2048 if possible)
This is proved by some paper #1103039376184852622 message
I just want to avoid suppressing the baseline for other models, as shown by xLSTM paper. The default initialization of MQAR is clearly suboptimal for RWKV-7 and a few other models, but without knowing their correct initialization and implementation I decide to not put them in at all.
let's simply give all models better initialization
and lr too (I used transferable lr https://arxiv.org/abs/2407.05872 for RWKV-7 based on observations so I didn't sweep on the whole LR interval)
Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parame...
@iron parrot did you use RWKV7a with c=2 in the Othello experiments? I added a couple of sentences there - please review and expand on whether or not this was what the code did.
I checked and it's accurate. I tried both c=1 and c=2, I can include the loss curve comparison in the paper if needed
From the results in Appendix C, c=1.545239211892605 ( 1+exp(-exp(-0.5)) ) is the maximum value of c that ensures stability.
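The arithmetic behind that value can be reproduced directly. This only evaluates the closed-form expression quoted above; the stability argument itself lives in Appendix C and is not reproduced here.

```python
import math

# Upper bound on c quoted from Appendix C: 1 + exp(-exp(-0.5)).
c_max = 1.0 + math.exp(-math.exp(-0.5))

print(c_max)
print("c = 1 within bound:", 1.0 <= c_max)
print("c = 2 within bound:", 2.0 <= c_max)
```

By this expression c=1 lies below the bound and c=2 lies above it.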
Everyone, please read through the COLM paper https://www.overleaf.com/read/vhrvqrmmztgj#d06246
and let us know if there is anything that's wrong
However, training for multi-query associative recall (MQAR) is highly unstable and strongly dependent on initialization and hyperparameter settings. We observe significant variability in performance under identical configurations across different studies
this is not true for rwkv so we shouldnt mention it, otherwise people think it's rwkv issue
lets fix table 10
this is not in the COLM paper yet
but I will fix it now
I have now removed this paragraph
Fixed and added to COLM Appendix
got link for COLM version?
Typo: a product a product of elementary transition matrices
these proofs are undergoing revisions right now, and are probably the last thing that will change before I publish the final COLM version
Proof revisions integrated and COLM and ArXiV versions submitted.
(Updated ArXiV version supposedly going out March 31)
RWKV 7 can be made Turing Complete using permutation matrices and state dependent (not just data dependent) transition matrices.
I think the next RWKV should include matrices that aren't just diagonal but rather subdiagonal etc., which would reduce parallelizability for maximal expressivity. End the war with "DeltaFunction"s.
To expand, I mean explicitly give RWKV a way to simulate cellular automata in a continuous, differentiable way. For example, the formula for calculating Rule 110 (Turing-complete) is state + (state @ right) - state * (state @ right) * (1 + (state @ left)) where left, right are the last dimension left and right shifted versions of state (equivalent to multiplying by a subdiagonal matrix or a superdiagonal matrix)
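A small sketch of that formula, written with explicit neighbour lookups instead of shift matrices. It assumes `state @ right` brings each cell's right-hand neighbour into position and `state @ left` the left-hand one; the function name and the fixed edge values are illustrative. On binary inputs the polynomial reproduces the Rule 110 lookup table.

```python
import math

def rule110_step(state, left_edge=0.0, right_edge=0.0):
    """One update of c + r - c*r*(1 + l) with constant values beyond the edges."""
    n = len(state)
    out = []
    for i in range(n):
        c = state[i]
        l = state[i - 1] if i > 0 else left_edge
        r = state[i + 1] if i < n - 1 else right_edge
        out.append(c + r - c * r * (1 + l))
    return out

# Rule 110 truth table, keyed by (left, centre, right).
RULE110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

for (l, c, r), expected in RULE110.items():
    got = rule110_step([l, c, r])[1]   # middle cell sees both neighbours
    print((l, c, r), "->", got, "expected", expected)

# With all three inputs equal to x, the update is 2x - x^2 (1 + x);
# x = 1/phi satisfies x = 2x - x^2 (1 + x), i.e. it is a fixed point.
x = (math.sqrt(5) - 1) / 2
residual = 2 * x - x * x * (1 + x) - x
print("fixed-point residual at 1/phi:", residual)
```

Because the update is a polynomial in the neighbouring values, it stays differentiable when the state takes values between 0 and 1.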
Rule 110, when the state and everything else is bound between (0, 1), displays interesting converging properties where in 3D it converges to 1/phi for all coordinates, while if instead of right shifting or left shifting and treating the edges as constants a or b, a acts like the learning rate and b as the point which the rule converges to. See this Desmos graph if interested:
nobody complains about transformer expressiveness, we should improve rwkv's memory first
I analyzed the download data (only counting non-quantized models), and the results are roughly as follows:
| Organization | Downloads | Likes |
|---|---|---|
| meta-llama | 26,369,349 | 41,742 |
| Qwen | 21,092,745 | 25,817 |
| deepseek-ai | 12,927,530 | 36,137 |
| HuggingfaceTB | 2,439,032 | 3,107 |
| RWKV (including FLA) | 70,705 | 537 |
Vision-Language Models (VLMs) are very popular. The top models for both Qwen and HuggingfaceTB are VLMs.
For Qwen, Llama, and RWKV, their most popular models are all 7B-sized.
Based on this data, RWKV should release a 7B model as soon as possible.
This is why I've been doing the conversions. I have a 7B model distilled from Qwen 72B that we can release this week with the arxiv version of the RADLADS conversion paper.
If people want to look at the RADLADS (QRWKV) paper before I put it on arxiv, here's a link: https://www.overleaf.com/read/ytntsmbjwtdr#8bd0d4
any thoughts on making the weight matrix non-linear, by test time training an mlp instead?
maybe a large memory bank which is sparsely activated? im sure these ideas have come up before
sounds like pkm https://arxiv.org/abs/1907.05242
This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead. Its design and access pattern is based on product keys, which enable fast and exac...
similar to the idea they describe. the keys should be read/write accessible for RWKV 7 somehow to aid the state in storing intermediates etc., which is likely a direction for improvement?
What if you had two states and did product keys on that
pkm is
# mostly copied from https://github.com/facebookresearch/memory/blob/main/lingua/product_key/memory.py but I removed some stuff for simplicity
import torch
import torch.nn.functional as F

def pkm(q, keys1, keys2, topk, values):
    # assumed shapes: q (bs, dim), keys1/keys2 (nkeys, dim // 2), values (nkeys ** 2, value_dim)
    bs = q.shape[0]
    nkeys = keys1.shape[0]
    q1, q2 = q.chunk(2, dim=-1)
    scores1, indices1 = torch.topk(q1 @ keys1.mT, topk, dim=-1)  # (bs, topk)
    scores2, indices2 = torch.topk(q2 @ keys2.mT, topk, dim=-1)  # (bs, topk)
    # cartesian product on best candidate keys
    all_scores = (
        scores1.view(bs, topk, 1).expand(bs, topk, topk)
        + scores2.view(bs, 1, topk).expand(bs, topk, topk)
    ).view(bs, -1)  # (bs, topk ** 2)
    all_indices = (
        indices1.view(bs, topk, 1).expand(bs, topk, topk) * nkeys
        + indices2.view(bs, 1, topk).expand(bs, topk, topk)
    ).view(bs, -1)  # (bs, topk ** 2)
    # select overall best scores and indices
    scores, best_indices = torch.topk(
        all_scores, k=topk, dim=-1, largest=True, sorted=True
    )  # (bs, topk)
    indices = all_indices.gather(-1, best_indices)  # (bs, topk)
    # embedding_bag takes the indices first, then the value table; per_sample_weights needs mode="sum"
    return F.embedding_bag(indices, values, per_sample_weights=scores, mode="sum")
rwkv7 handles the state S like
# from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v7/rwkv_v7_numpy.py
...
S = S * w.mT - S @ kk * (kk*a).mT + v * k.mT
y = S @ r
...
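For readers less used to the broadcast form: the update above should be equivalent to multiplying the state by a "diagonal plus rank-one" transition matrix and then adding an outer product. A small single-head NumPy sketch, assuming column vectors of shape (N, 1) and an N x N state; the function names and random inputs are made up for illustration and are not taken from the reference file.

```python
import numpy as np

def state_update_broadcast(S, w, k, v, kk, a):
    """Same expression as in the snippet above, with an explicit transpose helper."""
    T = lambda x: np.swapaxes(x, -1, -2)  # stands in for .mT
    return S * T(w) - S @ kk * T(kk * a) + v * T(k)

def state_update_explicit(S, w, k, v, kk, a):
    """S @ (diag(w) - kk (kk*a)^T) + v k^T, written with explicit matrices."""
    transition = np.diag(w[:, 0]) - kk @ (kk * a).T
    return S @ transition + v @ k.T

rng = np.random.default_rng(0)
N = 8  # hypothetical head size
S = rng.normal(size=(N, N))
w, k, v, kk, a = (rng.normal(size=(N, 1)) for _ in range(5))
r = rng.normal(size=(N, 1))

S_new = state_update_broadcast(S, w, k, v, kk, a)
y = S_new @ r  # readout, shape (N, 1)
```

Writing it this way makes it easier to see where a different transition structure would slot in.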
maybe you could do
S, ind = pkm(x, k1, k2, topk, bigS)
bigS[ind] = S * w.mT - S @ kk * (kk*a).mT + v * k.mT
y = S @ r
it's probably not correct as is but something similar maybe
@sly agate tried state MoE, and prefix+suffix state tuning
What about TTT mlps? Titans tried that I think and they had good results. Intuitively sparse activations would prevent catastrophic forgetting as the gradients simply wouldn't propagate to irrelevant information.
I tried pseudo State MoE. Fixed gating. Suffix tuning works well for multi-turn QA
does pseudo state moe work?
My method is an attempt to use multiple trained states(Prefix + Suffix) simultaneously during inference.
So it is not MoE (that's why I call it pseudo-MoE).
It works for my purposes (characterization, knowledge, agent).
By adding routers, we can achieve state sparsity, which may bring us closer to State-MoE.
I previously experimented with the non-state MLPSparse MoE on LoRA.
v7 0.4B(World v2.9) + Router + 4MLPLoRA(r=256) = 0.6B
Due to the dynamic LoRA merge, there were problems with the inference speed, but as a benchmark (Japanese), it improved slightly.
The basic design of MoE is based on Flock of Finches, and the HashRouter has been removed.
pls add these to paper appendix
Thanks to FLA, RWKV v6 and v7 can perform 384 batch inferences on a single RTX4090. This means that there is almost no degradation in inference speed even when inferring multiple states simultaneously.
@obsidian quest about multiple-State-inference? or Prefix + Suffix Tuning?
Multiple state inference is experimental and cannot be guaranteed to be mathematically correct (but the implementation is simple).
add all as experiments
I have an idea. What if, we had an external memory that is separate from the state but which can only be read from in a way that automatically changes it? This is more similar to how human memory works where recalling a fact increases its strength, and would allow for better parallelization.
k = key generated from state
v = expected value generated from state
return dot(memory @ k, v) * v to state
memory += k^Tv
or something more generalized.
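One possible reading of the pseudocode above, as a hedged NumPy sketch. Assumptions: `k` and `v` are plain 1D vectors supplied by the caller (here they are not generated from any state), the memory is stored as the sum of outer products `k^T v`, and the read is written as `k @ memory` so that the shapes line up with that storage layout. The class name `ReadStrengthenedMemory` is made up for illustration.

```python
import numpy as np

class ReadStrengthenedMemory:
    """External memory where every read also reinforces what was read."""

    def __init__(self, d_k, d_v):
        self.memory = np.zeros((d_k, d_v))

    def read(self, k, v):
        recalled = k @ self.memory      # what the memory currently holds for this key
        out = np.dot(recalled, v) * v   # expected value, scaled by how well it matches
        self.memory += np.outer(k, v)   # memory += k^T v, so recall strengthens the trace
        return out

mem = ReadStrengthenedMemory(4, 4)
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
outs = [mem.read(k, v) for _ in range(3)]
```

With unit-norm `k` and `v`, the returned vector grows by one multiple of `v` per read, which is the "recalling a fact increases its strength" behaviour described above. A real version would presumably need some decay or normalization to keep the memory bounded.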
Updated arxiv paper is live.
Thank you for your excellent drafts. Is there a follow up plan to extend this method to multi-lingual and/or math reasoning to study further applications?
@misty igloo
We have a converted QwQ model, but I haven't tried it specifically with multi-lingual or math! You could see how the QwQ model works - it's available on our featherless.ai platform or at https://huggingface.co/featherless-ai/Qwerky-QwQ-32B
As mentioned in the paper, I generally found that post-training with a different dataset resulted in a 'confused' model. But maybe there are workarounds for this that could be discovered.
is this draft submitted to COLM?
Yeah. I'm just getting all the open source parts together so I can put it on arxiv
btw please add a link to the RWKV paper for all models in https://huggingface.co/collections/RWKV/rwkv-v7-67d43835efa125006183fece
https://huggingface.co/featherless-ai/Qwerky-QwQ-32B/discussions/1 someone is already asking for the source code and data
Here is @sly agate's experiment log
https://docs.google.com/document/d/1sgX-BpM6RYW0eym_ucPN--WTLs7bTODMhho_k7sTl0I/edit?tab=t.bng3px5w2lfb#heading=h.ne4yo6k6bcp1
Pseudo-MoE Technique: Introduction to Multi Recurrent State Sampling (MRSS) Abstract This paper introduces Multi Recurrent State Sampling (MRSS), a novel pseudo-Mixture of Experts approach for enhancing inference diversity in recurrent neural network architectures. By strategically combining mu...
@misty igloo
I think this is different than state offset tuning. Very interesting that it seems to work well.
# normal matrix-evolution recurrence
for t in range(timesteps):
    h = G @ h + k.mT @ v
    y = r @ h
# offset tuning recurrence
for t in range(timesteps):
    h = G @ h + k.mT @ v
    y = r @ (h + self.offset)
# offset tuning removed from the recurrence (kernel)
for t in range(timesteps):
    h = G @ h + k.mT @ v
    y = r @ h
# post-kernel step
y = y + r @ self.offset
# OpenMOSE method
for t in range(timesteps):
    h = G @ h + k.mT @ v
    y = r @ h
y = y * (1 + self.time_offset_y)
y = groupnorm(y)
y = self.output(y * g)
# plus another change after the output
y = y * (1 + self.time_offset)
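The second and third variants above should give identical outputs, since `r @ (h + offset) = r @ h + r @ offset`. A quick NumPy check under assumed shapes (row vectors `r`, `k`, `v` of shape (1, N) and an N x N state); the random inputs and the `run` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 5, 4  # hypothetical sequence length and head size
G = rng.normal(size=(T, N, N)) * 0.3
r, k, v = (rng.normal(size=(T, 1, N)) for _ in range(3))
offset = rng.normal(size=(N, N))

def run(offset_inside):
    """Run the recurrence with the offset applied inside or after the readout."""
    h = np.zeros((N, N))
    ys = []
    for t in range(T):
        h = G[t] @ h + k[t].T @ v[t]
        if offset_inside:
            y = r[t] @ (h + offset)          # offset tuning recurrence
        else:
            y = r[t] @ h + r[t] @ offset     # kernel output plus post-kernel step
        ys.append(y)
    return np.stack(ys)
```

This is why the offset can be pulled out of the kernel without changing results. The multiplicative changes in the last variant are applied to `y` rather than added to `h`, so they do not reduce to this form.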
This is why we should open at least a subsample!
sorry, this discussion was about RADLADS not RWKV7
So the dataset in question is the unknown Qwen dataset, not RWKV World v3
We do provide a world v3 subsample (though it's imperfect) see https://huggingface.co/datasets/RWKV/RWKV-World-Listing
I think it's at least 1/3 code + Fineweb + Fineweb-Edu-CN + OpenWebMath + ProofPile-2 + C4
(Simulating Qwen pretraining distribution)
However, I have no idea about the instruction data (likely proprietary; I heard from some Zhihu user that Qwen and DeepSeek share the same proprietary English instruction dataset)
@obsidian quest have you considered test time training something similar to this instead of delta rule (30%@1 on arc agi with minimal inductive biases): https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html#how-to-derive-our-solution-method
i have tried using their arch after replacing all of the multitensor stuff with pure vectors and lobotomizing many softmax cummax layers to generalize it, but it is hard to get the symmetry and weight tying back. seems like it would perfectly complement rwkv, so i was wondering if you or someone else already knew about this and had tried to incorporate it into in context gradient descent models
let's try to keep this channel for paper related discussion, and use either the eleuther 'rwkv' channel or the rwkv discord for architecture ideas
this blogpost did get brought up in the RWKV discord previously
happy to discuss there more
understood, thanks
I do, but even transformers can handle pure ICL tasks, rwkv cannot
you won't improve memory, it's just fundamentally impossible with such parallelization. Memory also requires expressiveness, but we won't achieve this without making the models sequential at least to some degree
that's why no one will get true length extrapolation with 1 forward pass over 13421512532k tokens bullshit
RWKV-7 paper just got its first citation: https://arxiv.org/pdf/2503.21614
Second citation: https://arxiv.org/pdf/2504.03289
cleaned rwkv7 training reference code
https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v7/train_temp
My code is more aggressive:
- https://github.com/Triang-jyed-driung/rwkv7mini (completely restructured dataset loading)
- https://github.com/Triang-jyed-driung/my-pretrain (pretraining code, applicable for HF-compatible models, including RWKV-7 FLA, and supports pytorch-lightning from 1.9.5 to 2.5.1)
Is there an RWKV paper at ICLR? @void quartz @misty igloo
None that I know of
Apparently there is one on vision-rwkv
what is "RWKV-like"
They just incorporate RWKV