#RWKV-papers


misty igloo
#

Cool, are those three everything that's unavailable, including in all the parts of world v1/2 as well?

keen tartan
#

MMLU results seem to be missing for Llama-3.2 1B/3B and Qwen-2.5 1.5B/3B. Could you please check whether you can find them as well?

#

@gusty condor By the way, in case you were not aware, you can pass the --output_path <output_folder_name> flag to lm_eval to have it write JSON files with more metadata directly.

obsidian quest
#

yeah improved in world-3.5

gusty condor
dawn pewter
#

If σ means the sigmoid, should we use it uniformly? For example, use σ in both places or sigmoid in both.

dawn pewter
#

Should the ranges of ξ and α also be specified? We've only described what they do, but we haven't described their ranges.

gusty condor
#

Yes, you can change it

gusty condor
#

Now, I have finally found the main reason for the performance degradation of converting RWKV7 models to Flash-Linear-Attention format.
It is not related to numerical precision; it is related to the prompt format.
The code used by Blink and HowardHou (VisualRWKV) adds an extra BOS token [0] before the text. However, lm-eval does not add that extra token.
Upon further inspection, I found the failure pattern: The model is unable to perform recall for the very first token it receives, witnessed by these examples:

2728: "Mathews lifted a dark brow. \"Are you sure about that? I mean, wouldn't it be better to wait until Dale is home safe and sound?\"\n\n\"The longer I wait to tell her, the worse it will be for both of us.\"\n\n\"Good luck. You're going to need it,\" said Mathews"

1225: "Seth traced the dirt with the end of a stick. \u201cYou say I\u2019m stubborn\u201d I laughed and he continued, \u201cListen, I don\u2019t even know if it\u2019s true or not. There\u2019s no need for me to worry any of you. That\u2019s why I didn\u2019t say anything.\u201d\n\u201cI still don\u2019t care, Seth"

3999: "Sirona tried to quell her sense of disappointment. \u201cWhat, then? Why did I see what I saw?\u201d\n\u201cThe young woman you observed being sacrificed,\u201d her teacher asked, \u201cDid she appear distraught, or did she go along with the ritual willingly?\u201d\n\u201cI\u2019m certain she was terrified,\u201d said Sirona"

The inability to recall the first token is probably related to WKV state initialization.
After removing the bos token [0] from VisualRWKV code, the performance matches FLA implementation.

Now, I have these questions:

  1. Should I write that into the paper?
  2. Does lm-eval have an option of adding a BOS token before the text?

@misty igloo @keen tartan
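
A minimal sketch of the prompt-format difference described above, assuming token ID 0 is the BOS/EOS token as stated in this thread. `prepend_bos` is a hypothetical helper for illustration; it is not part of lm-eval or of the RWKV adapter code.

```python
from typing import List

BOS_TOKEN_ID = 0  # assumption: the RWKV World tokenizer uses 0 for BOS/EOS


def prepend_bos(token_ids: List[int], bos_id: int = BOS_TOKEN_ID) -> List[int]:
    """Return a copy of token_ids that starts with a BOS token.

    If the sequence already starts with bos_id, it is returned unchanged,
    so calling this twice does not stack a second BOS token.
    """
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)
    return [bos_id] + list(token_ids)


# Hypothetical token IDs standing in for an encoded LAMBADA passage.
context = [6624, 31220, 32227]
print(prepend_bos(context))
```

With a helper like this, the adapter path and the lm-eval path can be made to feed the model the same first token, which is the variable that differed between the two setups.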

obsidian quest
#

this is like bos token

keen tartan
#

Gemma series of models also seem to need a bos_token added.

keen tartan
# nova frost `add_bos_token=True` in the model_args.

On a related note, I had to set adapter = EvalHarnessAdapter(); adapter.custom_prefix_token_id = None when evaluating RWKV models to get some benchmarks working. Otherwise an undefined variable error was raised somewhere. I'll try to reproduce it. It might already be gone in newer versions.

#

@gusty condor RWKV6-3B v2.1 multilingual appears to be missing. Could you please check whether you can find them?

keen tartan
keen tartan
#

Setting adapter.custom_prefix_token_id = None or not changes the perplexity results a tiny bit, but not accuracy. However, the values are identical up to 7 decimal places (e.g. 12.59587346 versus 12.59587348), so it is probably not significant.

gusty condor
#

Please print the tokens at line 442 of flash-linear-attention/fla/models/rwkv7/modeling_rwkv7.py to see if BOS token is properly added.

#
token_ids = input_ids.flatten().tolist()
print(token_ids)
nova frost
gusty condor
#

RWKV7-G1-0.1B drops 1% (49.1% -> 48.1%) without [0] token for lambada_openai

keen tartan
gusty condor
#

Are you using g1

keen tartan
#

yes

gusty condor
#

You converted the model to FLA format?

keen tartan
#

No. Not yet.

#

Just RWKV7 pth models directly with adapters like Blink and HowardHou.

gusty condor
keen tartan
#

I did set the strategy to use fp32

#

strategy = 'cuda fp32'

#

@nova frost Here is the error I get if not setting adapter.custom_prefix_token_id = None:
```
/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py in loglikelihood(self, requests, disable_tqdm)
361 # BOS or EOS as context
362 context_enc, continuation_enc = (
--> 363 [self.prefix_token_id],
364 self.tok_encode(continuation),
365 )

/usr/local/lib/python3.10/dist-packages/lm_eval/models/huggingface.py in prefix_token_id(self)
360 def prefix_token_id(self):
361 # it is used as prefix for loglikelihood
--> 362 if self.custom_prefix_token_id is not None:
363 return self.custom_prefix_token_id
364 if self.tokenizer.bos_token_id is not None:
```

Full traceback here: https://huggingface.co/spaces/hevok/evals/blob/main/errors/custom_prefix_token_id.txt

nova frost
keen tartan
#

Yes

#

It should be the BOS token there, right?

#

In RWKV we have eos_token_id == bos_token_id == 0, I suppose.

nova frost
#

yeah, should be fine i think as long as tokenizer.bos_token_id == 0

#

it's added through tokenizer.encode(string, add_special_tokens=add_bos_token)

keen tartan
#

I think we only specify tokenizer.eos_token_id = 0 in the tokenizer wrapper right now. Gonna set the bos_token_id there too.

gusty condor
#

RWKV7 adapter code, without [0]:

{
  "lambada_openai": {
    "perplexity,none": 13.835651924377974,
    "perplexity_stderr,none": 0.4269067454771951,
    "acc,none": 0.4812730448282554,
    "acc_stderr,none": 0.006961090021795178,
    "alias": "lambada_openai"
  }
}

RWKV7 adapter code, with [0]:

{
  "lambada_openai": {
    "perplexity,none": 12.362614971985607,
    "perplexity_stderr,none": 0.36913900917528986,
    "acc,none": 0.4913642538327188,
    "acc_stderr,none": 0.006964938588638406,
    "alias": "lambada_openai"
  }
}

RWKV7 FLA, without [0]:

{
  "lambada_openai": {
    "perplexity,none": 13.835802857719031,
    "perplexity_stderr,none": 0.4368222446505151,
    "acc,none": 0.4814671065398797,
    "acc_stderr,none": 0.00696119082972564,
    "alias": "lambada_openai"
  }
}

RWKV7 FLA, with [0] (code hacking):

{
  "lambada_openai": {
    "perplexity,none": 12.364938863860012,
    "perplexity_stderr,none": 0.3773660539093024,
    "acc,none": 0.49117019212109453,
    "acc_stderr,none": 0.006964891360529564,
    "alias": "lambada_openai"
  }
}
keen tartan
#

Oh, that is indeed a significant impact!!!

gusty condor
keen tartan
#

Great find. We need to fix this issue.

#

I am not able to reproduce the RWKV7 adapter code results for RWKV7 G0. As baseline I get: { "lambada_openai": { "perplexity,none": 12.596602010153171, "perplexity_stderr,none": 0.3822718659650309, "acc,none": 0.48961769842810016, "acc_stderr,none": 0.006964475739361981, "alias": "lambada_openai" } }

#

Hold on

#

I'll prepare some code.

gusty condor
#

Use this
RWKV_PAD = [0] # you can try using [0] as pad

keen tartan
#

Ahhh

#

That's it!!!

gusty condor
keen tartan
#

By default it uses RWKV_PAD = pipeline.tokenizer.encode('\n')

#

What about the STOP_TOKEN?

#

Default is STOP_TOKEN = RWKV_PAD + pipeline.tokenizer.encode('\n\n')

dawn pewter
#

Fun fact: k_k can even be more than 4 and less than -4

obsidian quest
dawn pewter
#

$\xi$ is a learned parameter representing the removal key multiplier, which transforms the original key into a version to be removed from the state.

This is the description in the paper, which may leave the reader a little confused: why can the removal key multiplier even be greater than 1 or less than 0?

silent urchinBOT
#

Kaguya

keen tartan
#

@gusty condor Getting similar but not identical results now for RWKV7 EvalHarnessAdapter with PAD = [0]:
```{
"lambada_openai": {
"perplexity,none": 12.364936956333898,
"perplexity_stderr,none": 0.3764505210612126,
"acc,none": 0.49117019212109453,
"acc_stderr,none": 0.006964891360529504,
"alias": "lambada_openai"
}
}```

gusty condor
#

An error of 0.02% is not significant at all.
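
To put that 0.02% in concrete terms, here is a small sketch that converts the two accuracies back into counts of correct passages. It assumes the lambada_openai test split has 5153 passages; under that assumption the gap between the adapter run with [0] and the run above works out to a single passage.

```python
# Assumption: the lambada_openai test split contains 5153 passages.
N_EXAMPLES = 5153

acc_adapter = 0.4913642538327188   # RWKV7 adapter code, with [0]
acc_harness = 0.49117019212109453  # EvalHarnessAdapter with PAD = [0]


def correct_count(acc: float, n: int = N_EXAMPLES) -> int:
    """Convert an accuracy back into a whole number of correct examples."""
    return round(acc * n)


diff_examples = correct_count(acc_adapter) - correct_count(acc_harness)
diff_points = (acc_adapter - acc_harness) * 100  # percentage points
print(f"{diff_examples} example(s), {diff_points:.3f} percentage points")
```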

keen tartan
dawn pewter
#

The average value of k_k is still between 0.7 and 0.8

keen tartan
#

@gusty condor I noticed you have been using STOP_TOKEN = [535], which decodes to +). Is there a specific reason for this choice?
But wait, that is for Pile models! I wonder what the proper value would be for the World tokenizer/models. '\n\n' might terminate long-form generations.

#

@nova frost Even setting tokenizer.bos_token_id = 0 still raises the exception. I'll try setting adapter.custom_prefix_token_id = 0. Hope this makes sense.

nova frost
misty igloo
#

Sorry guys, got sick and may not be much help for the next two days. I'll try to put in an updated flops plot soon.

keen tartan
gusty condor
keen tartan
gusty condor
#

[261] for '\n\n' in rwkv_vocab_v20230424
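
A rough sketch of how the PAD and stop IDs discussed here fit together for the World tokenizer, using only the IDs quoted in this thread (11 for '\n', 261 for '\n\n', 0 as the special token). `truncate_at_stop` is a hypothetical helper, and treating STOP_TOKEN as a set of single IDs checked one token at a time is an assumption about the eval script, not something confirmed here.

```python
from typing import List, Sequence

# Token IDs quoted in this thread for the World tokenizer (rwkv_vocab_v20230424).
NEWLINE_ID = 11          # '\n'
DOUBLE_NEWLINE_ID = 261  # '\n\n'

RWKV_PAD = [0]                               # override suggested above
STOP_TOKEN = RWKV_PAD + [DOUBLE_NEWLINE_ID]  # mirrors RWKV_PAD + encode('\n\n')


def truncate_at_stop(generated: Sequence[int], stop_ids: Sequence[int]) -> List[int]:
    """Keep generated tokens up to, but not including, the first stop token."""
    kept: List[int] = []
    for tok in generated:
        if tok in stop_ids:  # assumption: stop check is per-token membership
            break
        kept.append(tok)
    return kept


# A single '\n' (11) passes through; '\n\n' (261) ends the generation.
print(truncate_at_stop([34550, NEWLINE_ID, 4706, DOUBLE_NEWLINE_ID, 332], STOP_TOKEN))
```

This also illustrates the concern raised above: with 261 in the stop list, any long-form generation ends at its first blank line.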

keen tartan
#

@misty igloo In the case of infections, try to consume high amounts of fruits and berries (things that are rich in vitamin C) as well as consider supplementing zinc. Get well soon!

gusty condor
#

Well? One bottle of vitamin C (100 tablets, 100mg x 100) costs only $0.5 in China.

keen tartan
gusty condor
#

add w_0 too

brisk bronze
# gusty condor RWKV7 adapter code, without `[0]`: ```json { "lambada_openai": { "perplexi...

Implemented your fix in the fla code...

Replicated your RWKV7 FLA 0.1B-G1 with code hacking results:

```{
  "lambada_openai":{
      "perplexity,none": 12.364936711373161,
      "perplexity_stderr,none": 0.37736600715379043,
      "acc,none": 0.49117019212109453,
      "acc_stderr,none": 0.006964891360529504
  }
}```

BUT RWKV7 FLA 1.5B-World with the same code hack gets much higher results than in the paper:
```{
  "lambada_openai":{
      "perplexity,none": 4.136933117540389,
      "perplexity_stderr,none": 0.0886568308581175,
      "acc,none": 0.6931884339219871,
      "acc_stderr,none": 0.006425006782127488
  }
}```

RWKV7 1.5B-World (with adapter I assume) in the paper:
```{
  "lambada_openai":{
      "perplexity,none": 3.4,
      "acc,none": 0.483
  }
}```
obsidian quest
gusty condor
dawn pewter
#

interesting, k_a can take on big values like 13, -20

#

As models grow larger, the average k_k appears to go up, while k_a seems to trend downward

keen tartan
# brisk bronze Implemented your fix in the fla code... Replicated your RWKV7 FLA 0.1B-G1 with ...

For RWKV7 World 1.5B via HarnessAdapter, I get the following results depending on the specified PAD token IDs, using the Jupyter notebook I provided:

RWKV_PAD = [11] (tokenizer encoded '\n'): https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-1.5B-v3-20250127-ctx4096/lambada_openai_2025-03-11T10-50-30.361472.json

"lambada_openai": {
  "perplexity,none": 4.174870788924788,
  "perplexity_stderr,none": 0.09003244838599012,
  "acc,none": 0.6951290510382302,
  "acc_stderr,none": 0.006413613926848405,
}

RWKV_PAD = [0] (only special token in World tokenizer, often denoted as '<|endoftext|>' or '<EOS>' ): https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-1.5B-v3-20250127-ctx4096/lambada_openai_2025-03-11T11-04-07.000221.json

"lambada_openai": {
  "perplexity,none": 4.133062406363742,
  "perplexity_stderr,none": 0.08879176331605698,
  "acc,none": 0.6933824956336115,
  "acc_stderr,none": 0.006423873526429436,
}
obsidian quest
#

move Figure 7 to section 3 (Architecture) because it highlights the limits of attention & mamba

gusty condor
#

I selected a subset of Lambada (142 problems) that satisfies these requirements:

  1. The answer is the first word;
  2. the first word does not appear again in the middle of the text.

The differences are very significant:
v7 0.1B world 2.8:
No padding: ppl=357 acc=9.15
padding with [0]: ppl=16.4 acc=36.6
padding with [0,0]: ppl=10.7 acc=43.7
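
The two selection requirements above can be sketched as a simple filter. This is an illustration only: `is_first_word_recall` is a hypothetical name, and splitting on whitespace with surrounding punctuation stripped is an assumption about how "word" was defined, so it will not necessarily reproduce the exact 142-problem subset.

```python
import string
from typing import List

# Characters stripped from both ends of each word (ASCII punctuation plus curly quotes).
STRIP_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019"


def split_words(text: str) -> List[str]:
    """Split on whitespace and strip surrounding punctuation from each word."""
    cleaned = (w.strip(STRIP_CHARS) for w in text.split())
    return [w for w in cleaned if w]


def is_first_word_recall(text: str) -> bool:
    """True if the final word (the LAMBADA answer) equals the first word
    and that word does not occur anywhere in between."""
    words = split_words(text)
    if len(words) < 3:
        return False
    first, middle, answer = words[0], words[1:-1], words[-1]
    return first == answer and answer not in middle


# Made-up passages for illustration, not taken from the dataset.
examples = [
    "Beth smoothed her hair. It is so nice to meet you, Beth",
    "Beth smiled. Beth waved. Nice to meet you, Beth",
]
subset = [t for t in examples if is_first_word_recall(t)]
print(len(subset))
```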
#

This is so significant that it is worth writing up in the paper

#

Examples:

{"text": "Beth smoothed her wiry half-black, half-gray hair from her makeup-free face. In New Mexico, the natural look was common. Standing next to Cindy Fanucci, she felt like a disaster. She hid her ragged nails under the sleeves of her sweatshirt.\n\u201cHi, I\u2019m Cindy. It\u2019s so nice to meet you, Beth"}
{"text": "Cooper groaned, and his body sagged back.\n\n\"You weren't supposed to be first,\" Deuce snarled as he lifted the gun and took aim at Cooper's prone form. \"But if that's the way you want it, old buddy...\"\n\n\"No!\" Gabrielle threw her body forward and wrapped her arms around Cooper"}
bronze frost
#

I've been discussing the expressivity of RWKV-7 behind the scenes with @misty igloo , @dawn pewter and William Merrill, and

we finally have a proof that RWKV-7 can recognize any regular language!

This is significantly stronger than our prior claims, and doesn't rely on assumptions such as c = 2, "multi-step computations", or a special BOS token. This result clearly motivates our use of a data-dependent and elementwise ICLR "a". Prior works could only simulate permutation DFAs, while we can simulate general DFAs, because of this "a". RWKV-7 might be the first model to use diagonal + low-rank updates, and still be able to recognize regular languages.

The proof is a bit involved (~4 pages, added as Appendix E), but I tried to write it in a way where the core ideas appear early, and the complicated details appear later. A core insight is that multiple layers are needed. Numerical experiments indicate that 2 layers should be enough, but my construction uses 4 layers to simplify the proof.

There were some interesting insights from the proof of simulating DFAs with RWKV-7:

  1. Because "a" is applied on the right instead of the left, we actually simulate the reversed DFA (the DFA which recognizes the reversed language). EDIT: Sorry, this was actually incorrect, it is "a" on the left which simulates backwards. Thanks Merrill for finding this mistake.
  2. For DFA simulation, we often want to extract a single row of the wkv state. But because the receptance "r" is applied on the right instead of the left of the state, the readout requires simulating many identical wkv heads, where each head reads out a single element of the wkv state.
  3. For DFA simulation, we do not need element-wise control or data-dependence for "w".
obsidian quest
obsidian quest
bronze frost
#

yeah, I thought lemma 3 would be a "well known" result, but I couldn't find a reference, so I cooked up a construction myself. If you can find a simpler proof that doesn't require the reader to know graph theory terms, that would be great.

bronze frost
# obsidian quest great work 🙂 do you have suggestions for increasing rwkv7 expressivity

Recognizing regular languages is already very strong, things beyond that are usually clearly impossible in constant time per token. For example, some NC1 problems require linearly growing state size in the sequence length. However, the current construction uses huge state sizes and lookup tables of size vocabulary^(DFA states), which is probably limiting which regular languages can be simulated in practice.
Points 1. and 2. above indicate that we might want to experiment with readout on the "value" dimension of the state, even though this breaks the intuition from linear transformers. And maybe also apply "a" on the left side.
My way to avoid c = 2 is based on the group normalization immediately after the wkv heads, there might exist other/better normalizations of the state which could also improve performance (like how rwkv-6c normalization was great).

misty igloo
#

yeah I still maintain that a more balanced construction of the overall formula with implicit normalization has the potential to improve performance

#

but I'm not sure whether that will improve or harm these regular language abilities
gotta wait a day or so until my brain works fully again to think about it 🤣

bronze frost
#

@obsidian quest In summary, regular languages basically include everything we can reasonably do, and we can already technically solve regular languages (so we can do state tracking and basically what classical RNNs can do). However, the way we currently simulate them can be very inefficient. So most further improvement in expressivity probably comes from decreasing the required number of heads / head size / precision / etc.
A practical limitation on the expressivity of RWKV-7 wkv heads is that it applies all vectors to the "key" dimension of the state. This makes the slots in the "value" dimension independent. Some mixing also in the value dimension could potentially make the wkv heads more powerful (while making parallelization a bit more tricky 🙂 ).

dawn pewter
gusty condor
misty igloo
# dawn pewter Am I correct in understanding that, rather than aiming to simulate each individu...

I haven't read the final proof yet, but the idea from a couple of days ago was that all size-n blocks up until the final 2n-1 tokens would be deferred by one block size (think 'pipelining') and evaluated in a non-block manner, as a deferred set of per-token elementary matrices

Then, the final block is done block-wise so that it does not require deferral (and therefore no extra tokens are required)

[update: looks like icecuber simplified it to 2n tokens instead of 2n-1, but seems like otherwise same idea]

keen tartan
#

@gusty condor I saw you pushed the missing RWKV6 World 3B multilingual results. Thx! We appear to still miss RWKV7 World 1.5B/2.9B results files for lambada_openai, hellaswag, piqa, arc_easy, arc_challenge, winogrande, and sciq. Please check.

keen tartan
#

For future runs I suggest also outputting a bit more metadata for better reproducibility, including the lm_eval version and special token IDs:

from importlib.metadata import version
# ...
output_dict = dict(
    model=MODEL_NAME,
    tasks=eval_tasks,
    num_fewshot=num_fewshot,
    lm_eval_version=version('lm_eval'),
    bos_token_id=adapter.tokenizer.bos_token_id,
    eos_token_id=adapter.tokenizer.eos_token_id,
    custom_prefix_token_id=adapter.custom_prefix_token_id,
    pad_token_ids=RWKV_PAD,
    stop_token_ids=STOP_TOKEN,
    results=results['results']
)
#...

Note: I added bos_token_id to the TokenizerWrapper and am currently assigning adapter.custom_prefix_token_id = RWKV_PAD[0].

# ...
class TokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.bos_token_id = 0
        self.eos_token_id = 0
# ...
adapter = EvalHarnessAdapter()
adapter.custom_prefix_token_id = RWKV_PAD[0]
# ...
misty igloo
#

we shouldn't need an adapter or any of this stuff

#

should be able to just run lm eval from cmdline and have it work

gusty condor
#

No idea. I am not a maintainer of lm-eval

young sparrow
misty igloo
brisk bronze
#

both in 0.4.3 and 0.4.7

nova frost
brisk bronze
#

this is the output when I do add_bos_token=True, and the 0's are not prepended.
[6624, 31220, 32227, 28471, 98, 22748, 332, 45675, 40240, 22068, 45, 32227, 4569, 22748, 46896, 39867, 21265, 56287, 32487, 21811, 47, 20147, 1853, 31220, 32232, 21795, 332, 21365, 4706, 332, 51929, 45, 21400, 22464, 53137, 32227, 32234, 32251, 22464, 40219, 21751, 37678, 45301, 32487, 21820, 45, 32227, 22464, 38128, 56798, 39944, 56939, 47, 265, 24241, 46372, 45, 21265, 22464, 31220, 32227, 22464, 38897, 28471, 98, 45, 21400, 32230, 29042, 1950, 59901, 101, 47, 20885, 22748, 4788, 37704, 32227, 22464, 63613, 31929, 45, 31578, 32232, 32462, 22249, 4706, 30721, 7714, 4706, 28471, 98, 45, 22903, 6833, 21556, 28310, 31391, 55521, 4811, 4660, 50931, 45, 29042, 35, 23978, 22226, 47342, 308, 31223, 4706, 32227, 40219, 4435, 46439, 4811, 38127, 4833, 45865, 39052, 4811, 51745, 332, 31670, 122, 39712, 57103, 56198, 21265, 22269, 46787, 21357, 37921, 4811, 39740, 45619, 4596, 39931, 45793, 544, 19878, 26846, 59350, 21569, 2030, 47, 261, 35, 24326, 21413, 308, 22441, 799, 36623, 57486, 21823, 30180, 21265, 53348, 21823, 38660, 47, 269, 24349, 22572, 51514, 4424, 32499, 575, 261, 40327, 19878, 26846, 38128, 52732, 45, 20276, 2007, 460, 40139, 31901, 30917, 46301, 22590, 38717, 47, 261, 35, 1297, 59888, 799, 22464, 4855, 25779, 47, 269, 24326, 4491, 22799, 31391, 22799, 21556, 461, 31059, 21273, 0, 0, 0, 24043, 8828, 21795, 30259, 22590, 31254, 46795, 4811, 32451, 39944, 45447, 45,

#

the output looks the same when I set add_bos_token=False

[6624, 31220, 32227, 28471, 98, 22748, 332, 45675, 40240, 22068, 45, 32227, 4569, 22748, 46896, 39867, 21265, 56287, 32487, 21811, 47, 20147, 1853, 31220, 32232, 21795, 332, 21365, 4706, 332, 51929, 45, 21400, 22464, 53137, 32227, 32234, 32251, 22464, 40219, 21751, 37678, 45301, 32487, 21820, 45, 32227, 22464, 38128, 56798, 39944, 56939, 47, 265, 24241, 46372, 45, 21265, 22464, 31220, 32227, 22464, 38897, 28471, 98, 45, 21400, 32230, 29042, 1950, 59901, 101, 47, 20885, 22748, 4788, 37704, 32227, 22464, 63613, 31929, 45, 31578, 32232, 32462, 22249, 4706, 30721, 7714, 4706, 28471, 98, 45, 22903, 6833, 21556, 28310, 31391, 55521, 4811, 4660, 50931, 45, 29042, 35, 23978, 22226, 47342, 308, 31223, 4706, 32227, 40219, 4435, 46439, 4811, 38127, 4833, 45865, 39052, 4811, 51745, 332, 31670, 122, 39712, 57103, 56198, 21265, 22269, 46787, 21357, 37921, 4811, 39740, 45619, 4596, 39931, 45793, 544, 19878, 26846, 59350, 21569, 2030, 47, 261, 35, 24326, 21413, 308, 22441, 799, 36623, 57486, 21823, 30180, 21265, 53348, 21823, 38660, 47, 269, 24349, 22572, 51514, 4424, 32499, 575, 261, 40327, 19878, 26846, 38128, 52732, 45, 20276, 2007, 460, 40139, 31901, 30917, 46301, 22590, 38717, 47, 261, 35, 1297,

misty igloo
#

But she'll have to describe the exact details of what was run and how- I didn't run it myself

brisk bronze
#

afaik rwkv7 were run with the adapter code because running fla converted rwkv in lm-eval had degraded results

#

this is the command I used: lm_eval --model hf --model_args pretrained=fla-hub/rwkv7-1.5B-world,trust_remote_code=True,add_bos_token=True,dtype=float32 --tasks lambada_openai --batch_size 8 --output_path /workspace/lm-evaluation-harness/results

misty igloo
nova frost
#

basically some HF tokenizers need to be initialized with add_bos_token=True

brisk bronze
nova frost
#

what tokenizer are you using?

#

I just merged it today

brisk bronze
#

ok I'll run the tests again

nova frost
#

yeah. this wouldn't add the bos before:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("fla-hub/rwkv7-1.5B-world")
tokenizer.encode("hello", add_special_tokens=True)
# [34550]

initializing with tokenizer = AutoTokenizer.from_pretrained("fla-hub/rwkv7-1.5B-world", add_bos_token=True) works properly

brisk bronze
misty igloo
#

Well that finally resolves an issue that's only taken a year to figure out 🤣

brisk bronze
#

eval results match up too

    "lambada_openai": {
      "alias": "lambada_openai",
      "perplexity,none": 4.136982815151818,
      "perplexity_stderr,none": 0.08865873813063398,
      "acc,none": 0.6931884339219871,
      "acc_stderr,none": 0.006425006782127488
    }
}```
fringe egret
#

Has anyone managed to successfully run the 128k fine-tuning version? We're encountering conflicts when using the world version environment.

misty igloo
fringe egret
#

python: /project/Lib/Tools/LinearLayout.cpp:562: mlir::triton::LinearLayout mlir::triton::LinearLayout::reshapeOuts(llvm::ArrayRef<std::pair<mlir::StringAttr, int>>) const: Assertion `getTotalOutDimSize() == std::accumulate(newOutDims.begin(), newOutDims.end(), 1, [&](int32_t acc, auto &outDim) { return acc * outDim.second; })' failed.

obsidian quest
gusty condor
# random granite use triton nightly

This is the problem: Installing some new package may override the triton-nightly installation with triton 3.2.0. So it is better to have the code work properly for triton 3.2.0 and later versions.

random granite
#

install from scratch or wait for the next version

#

this is triton's bug. I won't fix it because a large number of warps is crucial for performance

gusty condor
fringe egret
fringe egret
keen tartan
#

Do we have the context extended checkpoints also as non-converted HF models (i.e. normal rwkv models) available somewhere?

jovial meteor
#

How does RWKV-7 behave past its training context length? Does state collapse still happen?

keen tartan
#

In the paper we currently have long-context experiments with the PG19 dataset as well as single needle-in-a-haystack.

#

There is perhaps some kind of overfitting-on-short-context phenomenon for the World models, but not the Pile models, that was reported by @iron parrot (#1103039376184852622 message)

gusty condor
#

From @paper dove :

I have a small suggestion. I saw RWKV-6C mentioned in the discussion, but people who are not familiar with the rwkv version may not understand what it means.

keen tartan
#

Perhaps we can refer to the GoldFinch paper for this.

#

It is mentioned in the Additional Architecture Discussion Architecture Details Section

#

RWKV-6c is first mentioned in the Method section 4.1.1 Weight Preparation. I named it Upgraded Finch there and referred to the Additional Architecture Discussion, where it is introduced under the same name along with its version, for now. Hope this makes it a bit more clear.

misty igloo
#

It's not called Upgraded Finch anywhere in the world, so I don't think we should use that name here

misty igloo
#

iono maybe I did rename it 6c in GoldFinch? I don't think so tho - checking now

#

yeah in GoldFinch we had a version that included other changes called Finch-C2

keen tartan
#

So we can name it Finch-C2 or maybe Finch-C1

#

??

misty igloo
#

it's not Finch-C2 sorry

#

it's Finch-C / v6c

#

there are differences, which is why I named the goldfinch version Finch-C2

keen tartan
#

Got it.

misty igloo
#

blink's internal designation is x060c

keen tartan
#

yeah

misty igloo
#

but yeah this isn't really described in any paper other than GoldFinch, which is where the idea for it came from

#

I'll take a look and add more descriptive content around it

keen tartan
#

Merged all lm_eval results files from benchmarks table 3 and table 4 per model. Parsed merged files, created a pandas dataframe with combined average accuracy across English and multilingual tasks, and plotted it with matplotlib.
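
A stdlib-only sketch of the merge-and-average step described above (the actual workflow used pandas). It assumes each lm_eval output file has a top-level "results" object keyed by task name and that accuracy is stored under "acc,none"; `merge_results` and `average_accuracy` are hypothetical helper names.

```python
import json
from statistics import mean
from typing import Dict, Iterable


def merge_results(files: Iterable[str]) -> Dict[str, dict]:
    """Merge the per-task "results" sections of several lm_eval JSON files."""
    merged: Dict[str, dict] = {}
    for path in files:
        with open(path, encoding="utf-8") as fh:
            merged.update(json.load(fh)["results"])
    return merged


def average_accuracy(results: Dict[str, dict], key: str = "acc,none") -> float:
    """Unweighted mean of `key` over all tasks that report it."""
    return mean(task[key] for task in results.values() if key in task)


# Placeholder numbers for illustration only, not results from the paper.
demo = {
    "task_a": {"acc,none": 0.4},
    "task_b": {"acc,none": 0.6},
    "task_c": {"perplexity,none": 3.0},  # no accuracy, so it is skipped
}
print(average_accuracy(demo))
```

The mean here is unweighted across tasks; weighting by task size would be a different design choice and would change the combined number.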

misty igloo
keen tartan
#

Perhaps scaling the dots size to parameter size might make it look more informative.

#

Here I multiplied the parameter count in billions by 100 and used it as the marker size.

#

I saw similar plots where the dots size represented model size in papers and I liked it.

#

I am also suggesting adding a bit of transparency. Helps with overplotting issue.

#

Used alpha=0.5 above.
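
For reference, a sketch of the plotting choices mentioned above: marker size equal to parameters in billions times 100, and alpha=0.5 against overplotting. The function names, the dictionary layout, and the demo values are assumptions for illustration; matplotlib is only imported when `plot_models` is called.

```python
from typing import Dict, List


def marker_sizes(params_b: List[float], scale: float = 100.0) -> List[float]:
    """Marker size per model: parameters in billions times `scale`."""
    return [p * scale for p in params_b]


def plot_models(models: Dict[str, Dict[str, float]], path: str = "acc_vs_params.png") -> None:
    """Scatter average accuracy against parameter count (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    names = list(models)
    params = [models[n]["params_b"] for n in names]
    accs = [models[n]["avg_acc"] for n in names]

    fig, ax = plt.subplots()
    # alpha=0.5 keeps overlapping dots readable
    ax.scatter(params, accs, s=marker_sizes(params), alpha=0.5)
    for name, x, y in zip(names, params, accs):
        ax.annotate(name, (x, y))
    ax.set_xlabel("Parameters (billions)")
    ax.set_ylabel("Average accuracy")
    fig.savefig(path)
    plt.close(fig)


# Placeholder values for illustration only, not results from the paper.
demo = {
    "model-a": {"params_b": 0.1, "avg_acc": 0.40},
    "model-b": {"params_b": 1.5, "avg_acc": 0.55},
}
print(marker_sizes([m["params_b"] for m in demo.values()]))
```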

obsidian quest
misty igloo
obsidian quest
misty igloo
obsidian quest
#

actually can use [inference flops] vs [avg acc]

gusty condor
misty igloo
gusty condor
#

lm head are active parameters

misty igloo
#

nvm I am still sick and clearly not thinking well

#

Someone else better do this chart

#

@brisk bronze maybe you can take care of it tomorrow?

#

Should be ez to copy our existing google plot to make it

gusty condor
#

I can do this chart

brisk bronze
gusty condor
#

shall we run everything with 0.4.7?

brisk bronze
gusty condor
#

Shall we run 0.4.8?

gusty condor
#

@misty igloo So we do have some reasons to run 0.4.8: Paws-X and bos_token are fixed, and reproducibility is better since it's the newest version. We can also run glue with averaging, as requested by Bo.

brisk bronze
gusty condor
#

No trouble!

#

I can rerun them

dawn pewter
#

typo, the minimum of w_t is exp(-exp(-0.5)). I fixed it
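
A quick numeric check of that bound, assuming the decay has the form w_t = exp(-exp(-0.5) * sigmoid(d_t)); that form is an assumption stated here for illustration, and `decay` is a hypothetical helper. Under it, w_t approaches exp(-exp(-0.5)), roughly 0.545, as d_t grows, and approaches 1 as d_t goes to minus infinity.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decay(d: float) -> float:
    """w = exp(-exp(-0.5) * sigmoid(d)); assumed form of the decay formula."""
    return math.exp(-math.exp(-0.5) * sigmoid(d))


w_min = math.exp(-math.exp(-0.5))  # limit of decay(d) as d -> +infinity
print(f"w_min = {w_min:.4f}")
```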

gusty condor
#

Thank you!

#

Bad news: two of our authors are currently sick (both Bo and Smerky).

gusty condor
#

@brisk bronze @keen tartan
Now I found a big problem: <bos> is added for RWKV-7 but not for other models like Qwen and Llama, so it's not a fair comparison.
But actually, RWKV-7 adding a [0] can enhance the performance of lambada by 0.6% but harms performance of arc by 2-3%.
I think a fair comparison should be conducted without [0] for all models. This also matches RWKV-FLA performance.
w/o [0]:

"arc_challenge": {
    "alias": "arc_challenge",
    "acc,none": 0.43430034129692835,
    "acc_stderr,none": 0.014484703048857371,
    "acc_norm,none": 0.4658703071672355,
    "acc_norm_stderr,none": 0.014577311315231023
  },
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.7706228956228957,
    "acc_stderr,none": 0.008627087045485938,
    "acc_norm,none": 0.7584175084175084,
    "acc_norm_stderr,none": 0.008783247004042158
  }

w/ [0]:

  "arc_easy": {
      "acc,none": 0.7584175084175084,
      "acc_stderr,none": 0.008783247004042158,
      "acc_norm,none": 0.7079124579124579,
      "acc_norm_stderr,none": 0.009330705616569084,
      "alias": "arc_easy"
    },
    "arc_challenge": {
      "acc,none": 0.40784982935153585,
      "acc_stderr,none": 0.01436109728844968,
      "acc_norm,none": 0.42406143344709896,
      "acc_norm_stderr,none": 0.014441889627464344,
      "alias": "arc_challenge"
    },
misty igloo
gusty condor
#

OK so don't give RWKV a bos then

misty igloo
#

And used bfloat16 not float32

keen tartan
#

lm_eval automatically adds a BOS token for the Gemma family of models.

keen tartan
#

There is a comment in source. let me try to reference it.

#

Line 222

#

"...part of the Gemma family--a BOS token will be used as Gemma underperforms without it." is what gets logged right under it.

misty igloo
misty igloo
#

I'm agnostic about BOS token usage... I think it's fine but as ZhangRC points out and I found last year, it helps some evals and hurts others so it kind of doesn't make a difference overall for RWKV

gusty condor
#

Adding a [0] gives -0.4% overall

young sparrow
#

I don't think that it's important to have the same thing for every model.

#

Think of it this way: when you have two chat models with different chat prompts, is it more fair to use the chat prompt model A expects for both models because it's the same input, or is it more fair to give each model the chat prompt it expects.

gusty condor
gusty condor
#

Abstract deadline within one week!
Full paper deadline within 2 weeks!

misty igloo
#

@everyone

  • The COLM abstract submission deadline is on March 20
  • We need authors to DM me their openreview ID or email address used for their openreview account.
  • If you don't have an openreview account, you need to open one and get it approved ASAP
  • If you are not currently listed as an author, and think you should be, now is the time to let us know. Authorship will be extended only to those who have contributed significantly to the paper by supplying experimental data that are included therein and/or doing significant writing. (but not for just having fixed some spelling or reworded a few things)
fresh mulch
#

Table 1 feels cluttered with scalar annotations. Could we move them elsewhere or drop them without misleading the reader or losing nuance?

misty igloo
#

they're important so we can't drop them, but maybe there's a better way to indicate this?

fresh mulch
#

I would say create a separate column for which variables are scalar, but we're almost at width as it is

young sparrow
#

The S and I variables are the only matrices right? Perhaps it would be cleaner to use bold to indicate "not a scalar" and note in the caption that S and I are matrices

fresh mulch
#

we already mess with the notation for consistency with sec 4 so the latter would probably not be great. I think bold "not a scalar" works best considering boldface for vectors is convention in some fields

#

related consistency question: is it Delta Net or DeltaNet

misty igloo
#

@obsidian quest I added multilang and eng acc vs inference active params charts to the paper... the english one is a bit messy

young sparrow
#

@misty igloo I think it would be valuable to show the paper to someone who hasn't worked with RWKV much but follows this space and see how accessible the methodological explanation is to them. One of the things I consistently hear from people who work with Mamba and not RWKV is that finding the exposition inaccessible is a major reason why they use Mamba

misty igloo
young sparrow
#

Maybe just drop a draft in #research and ask for feedback on this point as a starting point

misty igloo
#

@granite pike could provide that feedback if they're up for it

young sparrow
#

The quality of the diagrams has substantially improved which y'all should be proud of 🙂

fresh mulch
#

+1. Being (formerly, still kind of) that person, not having a clear picture of how RWKV works made me lean towards Mamba.

#

The question of why RWKV isn't as popular as Mamba came up some time ago. I still believe most of it is accessibility - particularly things like blogposts, etc. that spread the word to the "lay user", i.e. someone who won't read the paper but would use RWKV in their applications

fresh mulch
young sparrow
#

I've left some comments on the first half of the paper and will be back to do more later

keen tartan
misty igloo
keen tartan
#

I check.

fresh mulch
#

i tried compressing and revising sections 1-3, will be back later for more. things that still stand out to me:

  • scalars in table 1, as before
  • table 1's caption is really unwieldily long
  • we use a lot of terminology in ways that would be familiar to someone in the space that might not be immediately obvious to an outside reader, such as using DeltaNet-specific terms in the introduction and the general idea of key-value retrieval in Section 2 (though maybe the latter is more obvious to people)
  • section 2 flows well but still feels delineated at the "Concurrent work" paragraph - subsection break here?
  • section 3 "Architecture" feels like it should be a subsection of section 4, or I don't see why it deserves its own section. It describes architectural changes over other methods, so I feel like it belongs at the beginning of the part where we describe the architecture in technical detail
  • section 4.1.1, after the big table of parameter definitions, could use some better structuring
  • I really love the new figures, they're super simple!
keen tartan
#

Should we already aim to compress the main part of the paper into 9 pages or at least plan ahead?

keen tartan
#

Could be a simple Github page dedicated to the RWKV-7 Goose Model.

#

As well as hands-on tutorials on how to get started.

keen tartan
#

That is a good place for tutorials.

#

Probably just fleshing out the official website is the best way.

#

Who is managing it right now?

#

It is just a bit outdated but a good starting point

misty igloo
fresh mulch
#

Updating and fleshing out website and wiki is a good idea. Also making it easy for people to get started, ie fewest steps to a (preferably customizable) working model with a walkthrough. Does the fla-hub kernel work with HF Transformers?

misty igloo
#

they will probably become the official RWKV HF implementations, at least temporarily

#

Good news: Will Merrill had some time to go through and do an initial pass merging and polishing the proofs in Appendix D. I think he still wants to do another pass, but it's something that could wait for v2.

keen tartan
#

@gusty condor The following are the results from experiments to check the impact of RWKV_PAD tokens.
RWKV7-0.1B 11 is with \n as PAD tokens ([11]) which is the default one recommended.
RWKV7-0.1B 0 is with the special <|endoftext|> token as PAD tokens ([0]).
RWKV7-0.1B None is with no PAD tokens at all ([]).

+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| Model           | Tokens | lmb.o | hella  | piqa | arcE | arcC | glue | WG   | sciq | mmlu | avg  |
+=================+========+=======+========+======+======+======+======+======+======+======+======+
| (Name)          | (T)    | acc↑  | acc_n↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B 11   | 1.6    | 48.1  | 42.1   | 67.3 | 59.3 | 25.5 | 48.1 | 52.7 | 86.3 | 25.4 | 50.5 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B 0    | 1.6    | 49.0  | 42.2   | 67.1 | 56.6 | 23.6 | 46.3 | 52.6 | 86.2 | 25.8 | 49.9 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B None | 1.6    | 47.4  | 41.9   | 67.5 | 59.1 | 25.2 | 46.3 | 52.2 | 86.1 | 25.5 | 50.1 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
#
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| Model           | lmb.m_p | lmb.m | pawsx | xcopa | xnli | xsClz | xwin | avg  |
+=================+=========+=======+=======+=======+======+=======+======+======+
| (Name)          | ppl↓    | acc↑  | acc↑  | acc↑  | acc↑ | acc↑  | acc↑ | acc↑ |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B 11   | 166     | 31.6  | 46.1  | 53.3  | 37.6 | 52.6  | 64.1 | 47.5 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B 0    | 167     | 31.6  | 46.5  | 53.0  | 37.4 | 52.5  | 64.0 | 47.5 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B None | 177     | 31.2  | 46.6  | 53.0  | 37.4 | 52.4  | 63.0 | 47.3 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
#

Using lm_eval version 0.4.8. GLUE subtasks were simply averaged without weighting.
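It seems that the default \n PAD token gives the best average on the English benchmarks (50.5 vs. 49.9 and 50.1) and ties for best on the multilingual ones (47.5, 47.5, 47.3), although it does not win on every individual task.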

#

Hypothesis: RWKV uses \n kind of like an element of its chat template, since it frequently occurs as a separator between utterances in its training data.

obsidian quest
gusty condor
obsidian quest
#

i am building 100K and 1M random items from world-v3 dataset for reference

#

@gusty condor

keen tartan
obsidian quest
misty igloo
fresh mulch
#

There are a few parts of the paper that look like intimidating walls of text during a cursory sweep of the paper. Would it be worth breaking these up by \subsection or \paragraph, or is this not an issue?

misty igloo
fresh mulch
misty igloo
#

gotcha, thats funny

#

there have been so many changes I can't remember which is which 🙂

young sparrow
#

@obsidian quest what's the flops/mfu/whatever we get during training?

fresh mulch
#

Appendix J, state transitions. What is meant by comparing "the order of O(1)" to "the order of thousands"?

#

also will we mention QRWKV at all in this paper @misty igloo as further proof it works at scale

misty igloo
#

Gets unstable

#

Also I'm hoping to submit qrwkv paper to COLM separately

misty igloo
#

@dawn pewter or @gusty condor could you clarify

fresh mulch
#

what are "such ideas" that can be traced back to fast weights and hebbian learning? @misty igloo

#

i want to modify this section a bit to motivate (our use of) the delta rule via deltanet, if that's fine with you

misty igloo
#

Fast weights is basically the idea of test time training the state

fresh mulch
#

we talk about it a lot throughout but to me it reads like we assume the reader is familiar with the delta rule's role in the development of linear attention

misty igloo
#

You seem to have it backwards maybe? Delta rule has nothing to do with the development of linear attention

#

Linear Attention is a form of fast weights tho

fresh mulch
#

oops, yes, that is backwards

misty igloo
#

Obviously this means my explanation in the text is maybe lacking

#

The way I was attempting to construct the narrative was:
Transformers, then Linear Attention and its issues, then delta rule fixes those issues, then we innovate on that

fresh mulch
#

I guess my point is this: I like the flow of the discussion of the problem of numerically increasing state, but jumping into delta rule (DeltaNet was the first to...) after that is a shock, and it is not immediately clear to me what the transition is. Is it that the delta rule enables this fix, or it is this fix, or...?

misty igloo
#

Yes delta rule is one variety of fix

#

Maybe I didn't make that clear

#

I did say exactly how it solves that issue in the second sentence tho...

#

Maybe I should basically swap the order of sentences 1 and 2

fresh mulch
#

oh... does sentence 2 describe delta rule in general (the way it is phrased makes it seem DeltaNet-specific)

#

I'm probably too tired for this right now 🤣

misty igloo
#

And I think it's a good point that fast weights applies to linear attention too... not quite sure how to shoehorn that in tho.

misty igloo
fresh mulch
#

I see, so sentence 4 (basically what you just said) is describing the process sentence 2 describes?

misty igloo
#

not sure I understand that 2 vs 4 comparison, but generally any messiness like that is bc I was trying to fit in the things Blink required us to say about it

fresh mulch
#

ic okay. yeah the more I talk about this the more confused I get lmao

misty igloo
#

that's not great 🙂 means I probably have some fixing to do

#

I'd like to make sure it makes sense to readers, even if they have no delta rule background

#

not sure if explaining linear attention is outside the scope of the paper tho

fresh mulch
#

linear attention at large probably (definitely?) is

gusty condor
obsidian quest
gusty condor
#

I think jsonl could be better

obsidian quest
#

for depth 1 & long length, i think we need better optimizer or data design (can try curriculum learning) for rwkv7 to grok @crystal hull

gusty condor
#

And very little Chinese (less than 1%)

obsidian quest
#

@crystal hull

obsidian quest
#

pls add an alternative version diag(w)(I - ckak) where one is free to use c=2 (this version was used in othello and found to be useful)
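
A minimal sketch of what the `c=2` variant buys, under my own reading of the shorthand: I take `diag(w)(I - ckak)` to mean diag(w)(I - c k̂ (a ⊙ k̂)ᵀ) with a unit-norm key k̂, acting on column vectors. The function name `apply_transition` and that convention are assumptions for illustration, not the paper's code. With w = 1 and a = 1, c = 1 erases the component along k̂, while c = 2 flips its sign (a Householder reflection, so an eigenvalue of -1 becomes available).

```python
import math

def apply_transition(w, k, a, c, x):
    """Return diag(w) (I - c * k_hat (a*k_hat)^T) x using plain Python lists.

    Hypothetical helper: the column-vector convention is an assumption.
    """
    norm = math.sqrt(sum(ki * ki for ki in k))
    k_hat = [ki / norm for ki in k]
    # Scalar projection of x onto the a-weighted key: (a * k_hat) . x
    proj = sum(ai * ki * xi for ai, ki, xi in zip(a, k_hat, x))
    # Remove c times that component along k_hat, then apply the decay w.
    return [wi * (xi - c * ki * proj) for wi, ki, xi in zip(w, k_hat, x)]

w = [1.0, 1.0, 1.0]   # no decay, to isolate the rank-one part
a = [1.0, 1.0, 1.0]   # full in-context learning rate
k = [3.0, 4.0, 0.0]   # normalizes to k_hat = [0.6, 0.8, 0.0]

# c = 2: the component along k_hat is reflected, the rest is untouched.
reflected = apply_transition(w, k, a, 2.0, [0.6, 0.8, 0.0])
untouched = apply_transition(w, k, a, 2.0, [-0.8, 0.6, 0.0])
# c = 1: the component along k_hat is erased instead.
erased = apply_transition(w, k, a, 1.0, [0.6, 0.8, 0.0])
print(reflected, untouched, erased)
```

With c restricted to 1 the rank-one factor can only shrink a direction to zero; allowing c = 2 lets it negate a direction, which is the extra freedom the message asks to document.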

keen tartan
keen tartan
keen tartan
misty igloo
#

@young sparrow in your opinion is this better than nothing, or is including something that doesn't fully match worse than not providing it at all?

#

(some distill instruct data are not included, and only subsamples of these instruct data are included: flan, Buzz-V12, WebInstructSub, SKGInstruct, PIPPA, COIG-PC-core)

misty igloo
#

I'd like to understand it better since I think there is similar notation used here:
The $\tilde{k}_t$ in the formula can be regarded as a "normalized key", a design to ensure that the state of $\bm{wkv}$ contains columns of $O(1)$ size.
and I had changed that from 'entries' to 'columns'

#

the columns in the state represent values, basically - so I'm not sure that a per-element analysis is really the best metric around keeping things normalized. A vector kept in the usual form in pytorch has L2 Norm of sqrt(vector_dim)
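
A quick sketch of the norm claim above, assuming "the usual form" means entries with roughly unit variance (that reading, the dimension 4096, and the helper name `l2_norm` are my assumptions): such a vector has L2 norm close to sqrt(vector_dim), whereas a unit-norm vector of the same dimension has entries of size about 1/sqrt(vector_dim).

```python
import math
import random

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

rng = random.Random(0)
dim = 4096  # illustrative dimension, not taken from the paper

# A vector whose entries are O(1): standard normal samples.
vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
ratio = l2_norm(vec) / math.sqrt(dim)   # should be close to 1

# The same vector rescaled to unit norm: entries shrink to about 1/sqrt(dim).
unit = [v / l2_norm(vec) for v in vec]
rms_entry = math.sqrt(sum(u * u for u in unit) / dim)

print(f"norm / sqrt(dim) = {ratio:.3f}")
print(f"RMS entry of unit-norm vector = {rms_entry:.5f}, 1/sqrt(dim) = {1 / math.sqrt(dim):.5f}")
```

So "columns of O(1) size" and "entries of O(1) size" differ by a factor of about sqrt(vector_dim), which is why the choice of word matters in that sentence.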

obsidian quest
gusty condor
misty igloo
#

For COLM we are apparently required to designate one of our authors as a 'reciprocal reviewer', and I'm not qualified to be that person:

Reviewers must have research experience equivalent to a second-year graduate student in machine learning or a related field. They must have been a primary author* on at least two peer-reviewed conference or journal papers published in a related venue (e.g., ACL, NAACL, EMNLP, ICML, NeurIPS, ICLR, JMLR, TMLR, CVPR, ICCV – this is not an exhaustive list).

Please let us know if you're an author of the RWKV7 paper who meets the criteria above and would be willing to do this for us. This is a requirement - we need someone to do it in order to submit to COLM 2025.

Update: I think we have this covered now - thanks to everyone for reaching out!

gusty condor
#

@paper dove

young sparrow
#

@misty igloo @gusty condor @obsidian quest If the data isn't going to get released you can't say that the "RWKV v3 World public corpus" is a contribution of the paper

misty igloo
young sparrow
#

That's not a contribution to the scientific literature

gusty condor
young sparrow
#

You also can't refer to it as an "open source corpus"

keen tartan
#

We could take inspiration from the Allen AI Institute's OLmo (Open Language Model) Project.

#

They tried to address open source as best as they could

young sparrow
#

They released their dataset

keen tartan
#

Dolma?

young sparrow
#

Yes

#

And the way they talk about the licensing of their dataset has misled a lot of people into thinking it's openly licensed

keen tartan
#

Yes, I am looking into how they did it and try to follow their guide.

young sparrow
#

I don't know what you mean by that

keen tartan
young sparrow
#

What they did was release the data and wrap the entire thing, as a collection, in a database license. That database license is open source and the way they did messaging around it led people to think that the data was openly licensed.

keen tartan
young sparrow
#

I will not let us fall into those pitfalls 🙂

young sparrow
misty igloo
misty igloo
#

Maybe the phrasing needs to be a bit clearer around enabling replicability?

#

This is pretty much the exact same thing we did in the last paper, so I'm not sure why it's not valid this time

young sparrow
#

One sec (edit: actually I gotta run, be back later)

jovial meteor
misty igloo
gusty condor
#

Shall we put RWKV7 code into RWKV-v7 folder?

misty igloo
# young sparrow One sec (edit: actually I gotta run, be back later)

No problem, let me know when you can. Also if you followed up on the 'missing' three (?) datasets please let me know where that ended up. We're trying to get the paper on arxiv as soon we can, and I want to make sure we have this dataset stuff ironed out to your satisfaction.

gusty condor
obsidian quest
#

just call it dataset preview, for now. will fix it when i am less busy

gusty condor
#

All 3 missing datasets were found.

  1. Wikipedia: Loader not working anymore #1103039376184852622 message
  2. Guanaco #1103039376184852622 message
  3. Books3 #1103039376184852622 message
    @misty igloo
pure pike
keen tartan
#

Have been grinding through all the RWKV World v3 corpus components and made sure that it is possible to download and sample from each component.

#

Good news is: There are no major obstacles for reconstruction.

misty igloo
#

cool!

keen tartan
#

Just a few tiny details are lacking that would be helpful to eliminate ambiguity.

misty igloo
#

should I be copying this to the official RWKV HF

keen tartan
#

I could just rename it if I am a member of RWKV HF.

#

So it keeps the statistics from the original one as it already had quite some traction.

misty igloo
#

I can't give you that access, unfortunately

keen tartan
#

How about I move it to another org repo and you move it from there from org to org.

#

I could just create an org and add you to it.

misty igloo
#

sure if that's doable!

keen tartan
#

So migrating it 2 times.

#

I will do it.

misty igloo
#

but then you wont be able to edit it any more

keen tartan
#

oh

misty igloo
#

ok well it's fine, let's just wait and add it in the next version of the paper

#

that way you have time to edit it a bit more

misty igloo
#

thanks!

#

the subsamples are still in your account

keen tartan
#

I will move those too.

#

They are also in the main one as subsets.

#

It has 3 subsets: index, 100k, and 1m

#

Moved all and assigned you admin rights.

misty igloo
#

I guess we just gotta update the links once we move orgs

#

@keen tartan do you want me to put it in RWKV now, or wait so you can keep editing

#

is there some way to add in the up/down sampling frequency info from the Eagle/Finch paper in as a column here for those that weren't just used as-is?

keen tartan
#

I can provide the code to generate the tables.

#

There is a column for world version already.

#

Looking into how to add up/down sampling frequency column too.

misty igloo
#

the amounts are listed in the attached wiki.txt for the Eagle/Finch paper ... not sure its possible to include this

keen tartan
#

I check.

misty igloo
#

and oscar.txt

#

yeah seems tough to do, only reasonable way would be to maybe pre-process those datasets to create the filtered versions

#

and provide those separately as components

#

but if we did that it'd make the whole thing quite reproducible I think!

#

since those are the only specially sampled items

keen tartan
keen tartan
#

Then it seems doable.

misty igloo
#

I mean, review the Eagle/Finch paper to be sure, but that's my recollection

keen tartan
#

I will do so.

#

HuggingFace Hub is based on Git. So contributors outside of the organization should be able to make pull requests (called Discussions).

misty igloo
#

cool

#

I just dont want to make it harder for you to edit while you're still doing it a bunch

keen tartan
#

I am flexible. Whenever you think it is adequate to move it. I can work with Git. Just someone in the org needs to approve requests.

#

Wait.

#

"Discussions and Pull Requests are currently enabled for this dataset. Members of the community can propose changes to this repository."

#

Only members can make pull requests. I misinterpreted it.

misty igloo
#

'members of the community'

#

not sure that means anyone or members of the org

keen tartan
#

Yeah, it is ambiguous.

#

I just test.

#

^^

keen tartan
misty igloo
#

@everyone no changes to the manuscript at this time, please - we are going to try to put it on arxiv

gusty condor
#

We will update our eval results for arxiv v2.
+2 points on ARC-e and ARC-c each, and small gains in MMLU.

#

We may surpass Qwen2.5 this time with lm-eval 0.4.8

misty igloo
gusty condor
#

@obsidian quest How many tokens, and on which dataset, did you tune v7-world3-2b9-preview into v7-world3-2b9?

#

the former seems to be higher on certain evals like glue, gsm8k, and several others.

gusty condor
#

I used the markdown package (which may require lualatex, which is incompatible with arxiv). I will change it

sonic horizon
#

Hi everyone, Xingjian DU and I are still working on the audio modeling task. We were wondering if there might be any space available in the Evaluation section or appendix, either in this version or a future one, to include our work on this task? Of course, we fully respect your timeline and will align with your schedule.

gusty condor
#

Sadly, @misty igloo missed the deadline. This means that we have another 24 hours to go.
So, we should focus on:

  1. evaluations
  2. Audio modeling tasks if applicable.
misty igloo
gusty condor
#

You have 23 hours left.

misty igloo
#

Haha try to do it much sooner than 23 hours though, please 🙂

gusty condor
#

I have to sleep now. I will check the evaluation section.
By the way, @obsidian quest please tell the difference between v7-2b9-preview and v7-2b9-release

misty igloo
#

@keen tartan if we're using these new results I'll need evals for Qwen2.5-7B as well for the FLOPs chart

#

also, I don't understand how your glue results have an average.. I thought those aren't given in later LM-eval versions

#

is this really using 0.4.8 for all of these?

#

I don't see the glue overall in the rwkv results for example

misty igloo
keen tartan
#

I use average function based on @brisk bronze's extracted source when processing the results files. It is not a big deal to calculate it at all.

keen tartan
#

it is using 0.4.8

#

mom

#

Trying to share relevant evaluations for the paper I run there.

misty igloo
#

hmm so 0.4.8 like gives glue averages sometimes but not others? are these the size-weighted ones or non-weighted?

#

bc either way we are going to have to manually calculate the averages for the ones it didn't print them for

keen tartan
#

It only gives results for the individual subsets, but it is easy to just average them. The function lets you toggle between size-weighted and unweighted averaging.

#

I share relevant code block.

misty igloo
#

but I see glue averages in some of them 🙂

#

code block?

#

you're running these not via the cmdline?

#

that explains it

keen tartan
#

I aggregate all results files to make tables and plots.

misty igloo
#

I figured you ran the RWKV tests via lm-eval cmdline now, just like the rest of the evals

#

since in 0.4.8 it properly uses the flag for BOS token

keen tartan
#

I can do either way.

#

One moment please.

#
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # A helper function that is used to aggregate
    # subtask scores cross-task.

    if not weight_by_size:
        sizes = [1] * len(sizes)

    assert len(metrics) == len(sizes)

    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
#

That is the function I am using to calculate the average in post-processing.

#
glue_val_split = {
  'cola': 1043,
  'mnli': 9815, # _matched
  'mnli_mismatch': 9832, # ed
  'mrpc': 408,
  'qnli': 5463,
  'qqp': 40430,
  'rte': 277,
  'sst2': 872,
  'stsb': 1500,
  'wnli': 71,
}
#

These are the individual subtask names and counts.
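
To make the weighted vs. unweighted distinction concrete, here is a small self-contained sketch that restates the helper above and applies it to two subtasks. The sizes come from the listing above; the two accuracy values are made-up placeholders, not measured results.

```python
import math

def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    """Average subtask scores, optionally weighting each by its example count."""
    if not weight_by_size:
        sizes = [1] * len(sizes)
    assert len(metrics) == len(sizes)
    return sum(m * s for m, s in zip(metrics, sizes)) / sum(sizes)

# Validation-set sizes for two GLUE subtasks, copied from the listing above.
sizes = {"rte": 277, "qqp": 40430}
# Placeholder accuracies, for illustration only.
scores = {"rte": 0.60, "qqp": 0.40}

names = list(sizes)
metric_list = [scores[n] for n in names]
size_list = [sizes[n] for n in names]

unweighted = aggregate_subtask_metrics(metric_list, size_list, weight_by_size=False)
weighted = aggregate_subtask_metrics(metric_list, size_list, weight_by_size=True)

# Unweighted treats both tasks equally; weighted is dominated by the larger QQP set.
print(f"unweighted={unweighted:.4f} weighted={weighted:.4f}")
```

Because QQP is so much larger than the other subtasks, the size-weighted GLUE number mostly tracks QQP, so the two conventions should not be mixed within one table.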

misty igloo
#

which variety is shown in the glue outputs you currently have in this folder?

keen tartan
#

It is not in the results files.

misty igloo
#

it is actually

keen tartan
#

I calculate it afterwards.

#

Oh

#

is it?

#

perhaps

misty igloo
#

yes that's why this question arises 🙂

#

bc it generally doesnt show up there for 0.4.8 cmdline

keen tartan
#

I only see the statistics of the subtasks.

#

Anyway, calculating the average is not a big deal.

misty igloo
#

this one has the avg under 'glue'

keen tartan
#

Hold on.

misty igloo
#

and as far as I know, that means it was not run under 0.4.8

keen tartan
#

This file was generated from conversion of markdown table from @gusty condor

#

It was from a previous experiment using 0.4.3

misty igloo
#

but.. you said that this folder had all 0.4.8 results

keen tartan
#

Sorry for the confusion.

misty igloo
#

this is why I am trying to make sure everything is done correctly

#

because it's clear to me that there is mismatching data

keen tartan
#

Then there should be '0.4.8' in the file name.

misty igloo
#

I don't see any files like that in this repo

keen tartan
#

in sub folders.

misty igloo
#

I mean for anything other than rwkv

keen tartan
#

I have not run it for all models yet.

#

I have results for reference models that I have not pushed there yet.

misty igloo
#

oh ok

keen tartan
#

I was focusing on RWKV

#

For RWKV World models I have all 0.4.8 results already.

misty igloo
#

sorry, that was the mixup - I didn't realize that not everything was done yet

keen tartan
#

I have results for the reference models like SmolLM2, Llama, and Qwen as well.

#

Trying right now to organize them.

#

I did run those a week ago, but thought they will not be used in the paper.

misty igloo
#

yeah I didn't expect us to change to this in the last 24 hrs before publishing

keen tartan
#

I don't have Qwen 7B yet.

#

I try to share what I have before going sleeping.

#

@gusty condor and @brisk bronze and anyone else who likes can complement results.

misty igloo
#

she's busy until tomorrow, unfortunately, so probably not enough time for her to contribute to those

keen tartan
#

By the way I figured out we can speed evaluation with multiple GPUs.

misty igloo
#

there are a few ways to do that using the cmdline

keen tartan
#

Using accelerate, e.g.
```bash
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 'auto'
```

misty igloo
#

yep

keen tartan
#

Also set batch size to 'auto', then it tries to calculate ideal batch size itself.

misty igloo
#

that 'auto' bsz tends to break (or it used to) for mmlu

#

but works on many normal evals

#

I also have a version of the RWKV eval harness thing that supported batched inference and is much faster

#

but I don't want to use it here

#

that's why I wanted to use the lm-eval cmdline version for RWKV, so we get multi-gpu acceleration and batching

#

anyway it doesnt matter, since you finished those

keen tartan
#

@nova frost Isn't the lm_eval version specified in the output json file's metadata?

keen tartan
nova frost
keen tartan
#

"git_hash": null, -.-*

nova frost
#

damn. lol. I'll add the lm_eval version going forward

#

but that probably meant it wasn't run from a git dir, so installed from pypi

keen tartan
#

I cannot tell for sure which version of lm_eval I used when I ran the reference model evaluations for SmolLM2, Llama, and Qwen some time ago.

#

We may need to recompute them to be certain.
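
One way to avoid this ambiguity in future runs is to stamp each aggregated results file with the installed package version at post-processing time. This is only a sketch: the helper names and the `lm_eval_version` key are hypothetical (not fields that lm_eval itself writes), and I am assuming the installed distribution is discoverable under the name `lm_eval`.

```python
from importlib import metadata

def installed_version(dist_name="lm_eval"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def tag_results(results, dist_name="lm_eval"):
    """Return a copy of a results dict with the harness version recorded."""
    tagged = dict(results)
    # Hypothetical key name, chosen for this sketch only.
    tagged["lm_eval_version"] = installed_version(dist_name)
    return tagged

example = tag_results({"arc_easy": {"acc": 0.5}})  # placeholder score
print(example["lm_eval_version"])
```

This records the version even when `git_hash` is null because the harness was installed from PyPI rather than run from a git checkout.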

obsidian quest
keen tartan
gusty condor
brisk bronze
gusty condor
#

Note: perplexity of lambada_multilingual should be the geometric mean over 5 languages, not the arithmetic average. Strange that even lm_eval was mistaken on that.

young sparrow
keen tartan
#

I am awake. I focus now on SmolLM2 model series evaluations.

gusty condor
# young sparrow What is your source for this?

I think it's almost obvious, from the definition of perplexity.
The geometric mean of perplexity is equal to the exponential of average negative log likelihood loss.
On the other hand, the arithmetic average of perplexity has no clear semantic meaning.
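
A small numeric check of that identity, using made-up per-language losses (the three values are placeholders, and each language is weighted equally, which is an assumption): the geometric mean of the perplexities equals the exponential of the mean loss, while the arithmetic mean is pulled up by the worst language.

```python
import math

# Hypothetical average negative log-likelihoods for three languages.
losses = [3.0, 1.0, 4.0]
perplexities = [math.exp(loss) for loss in losses]

# Geometric mean of perplexities: exp of the mean of their logs.
geometric = math.exp(sum(math.log(p) for p in perplexities) / len(perplexities))
# Exponential of the mean loss, computed directly from the losses.
from_mean_loss = math.exp(sum(losses) / len(losses))
# Arithmetic mean of perplexities, for comparison.
arithmetic = sum(perplexities) / len(perplexities)

print(f"geometric={geometric:.3f} exp(mean loss)={from_mean_loss:.3f} arithmetic={arithmetic:.3f}")
```

The first two numbers agree up to floating-point error; the arithmetic mean is larger whenever the per-language perplexities differ.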

keen tartan
#

Llama 1B and 3B are also needed.

#

SmolLM2 135M, 360M, and 1.7B also required.

gusty condor
keen tartan
gusty condor
#

The empty cells in Tables 3 and 4 are the results that are still missing.

#

We should recheck pile models too

keen tartan
#

Qwen 2.5 7B sciq and 5-shot MMLU are also missing. Anyone running those? I could attempt it, but my runs always take so long to finish.

keen tartan
#

I try Qwen 2.5 7B sciq

keen tartan
#

Done, trying 5-shot MMLU now (but it seems to take over 5h). I am abandoning 7B and focusing on the smaller models first for now.

misty igloo
#

We will need that in the next few hours in order for your experiments to be a part of the arxiv pre-print in this version

#

The COLM deadline is also soon, and I will need your open review ID's and/or emails used to sign up with open review

#

I also think you need more explanation of how your "approach enhances RWKV-7's capabilities to interpret and process complex, high-dimensional spectrogram features"

#

If you make claims in the paper they need to come with evidence.

#

The text currently does not describe anything at all about how or what AudioRWKV-7 does, except that it uses spectrograms.

#

Considering the timeline here I am going to comment this out of the paper for now. If you think you have what's needed before say 4pm UTC today, let us know here and we can consider if there is time to put it in.

#

This is a very late addition, and I'm not guaranteeing that it will be able to become a part of the paper. That will depend on both when you have the full writeup ready, and what the quality is like.

nova frost
keen tartan
#

Used an NVIDIA L40S with 48 GB memory. It occupied almost all the VRAM, so I assumed it was batching it correctly.

#

For another run with Qwen-2.5 3B I got OOM with the same lm_eval command args but on a smaller dual T4 GPU setup. Gonna try setting batch size to 1 or using a single GPU in this case.

nova frost
#

yeah auto can sometimes be unreliable

keen tartan
nova frost
#

can also do auto:N so that it recomputes the batch size N number of times. But this is mostly helpful if you're running multiple tasks (so more variation in seq lengths)

keen tartan
gusty condor
#

Urgent: who is testing llama3.2?

misty igloo
#

it's possible Janna is running those, (but she's in PT timezone, and it's still 7:16am there, and she just got back from a late night flight)

#

last night she said "probably would also do llama and smollm"
I had asked her to coordinate with @keen tartan tho so maybe he knows if they're in progress

keen tartan
keen tartan
#

I can push the results I got so far.

#

it is not all completed yet.

#

lm_eval version is in file names. I move to folders later on.

gusty condor
misty igloo
#

once we have all the data I have to see if christian can regenerate the flops plots he made

#

I have submitted our abstract to COLM.

fresh mulch
#

hey, wait, llama3.2 is not in the FLOPS charts

gusty condor
#

Great! We are almost done.

misty igloo
#

What are we still missing revised numbers for? Just SmolLM?

fresh mulch
#

I have a meeting for the next hour but after that will be available to update charts, which should work out

gusty condor
#

only llama 3.2 3b left

misty igloo
gusty condor
misty igloo
#

we dont use llama on this sheet so no problem

gusty condor
#

@misty igloo Please redact links in the abstract for COLM for anonymity!

misty igloo
#

thanks, forgot that

#

updated.

misty igloo
#

there is no new pawsx test in @brisk bronze 's output for Qwen

#

also, I need some numbers for lambada.m on Qwen 7B if we are calculating it some new way

gusty condor
#
We present RWKV-7 "Goose", a new sequence modeling architecture featuring constant memory usage, constant inference time per token, state-of-the-art downstream performance on multilingual tasks, and near SoTA English language performance at the 3 billion parameter scale despite being trained on dramatically fewer tokens than top models in its class. To accomplish this, RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. We also present an extended open source 3.1 trillion token multilingual corpus. We trained a set of models from 0.19 billion to 2.9 billion parameters on this dataset and find they exhibit exceptional performance across a range of common benchmarks.

To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.
#
We present RWKV-7 "Goose", a new sequence modeling architecture. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. To test RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and trained four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. These models exhibit state-of-the-art downstream performance on multilingual tasks, and near SoTA English language performance at the 3 billion parameter scale despite being trained on dramatically fewer tokens than top models in its class. Still, RWKV-7 models remain at constant memory usage and constant inference time per token.

To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.

Which version is better?

sonic horizon
misty igloo
#

we can still potentially add it later for v2

gusty condor
misty igloo
#

@gusty condor chart looks wrong for Qwen 3B multilingual

#

could you check that all the numbers in the manuscript are really correct for that one?

#

or maybe 1.5B numbers are too high

keen tartan
misty igloo
keen tartan
keen tartan
#

I try.

misty igloo
#

Where are these new results for RWKV Pile coming from? I don't see them anywhere

brisk bronze
misty igloo
brisk bronze
keen tartan
#
import numpy as np

#define custom function
def g_mean(x):
    a = np.log(x)
    return np.exp(a.mean())

#calculate geometric mean
g_mean([41.524835786735544, 3.70873656895629, 67.94895756237318, 23.454130938244965, 31.073140477732952])

Output: 23.793823761545397

gusty condor
misty igloo
#

uh right

#

sorry I guess we dont need it for avg

#

I just gotta average them all

keen tartan
#

lol

misty igloo
#

sorry, doing a million edits right now - this is nuts trying to change this whole sheet and all its sub evals at the last minute

#

without making mistakes

#

@gusty condor Where are these new results for RWKV Pile coming from? I don't see the source data anywhere

#

is this just recalculating glue via normal avg?

keen tartan
#

Seems like tables 3 and 4 are fully completed. Is anything still missing?

#

I gonna try to double check values.

misty igloo
#

I think I have everything done on the google sheet

#

but I'm still concerned that Qwen line looks bad

keen tartan
#

I look at Qwen 3B now.

fresh mulch
#

@misty igloo good for me to transfer numbers to mine?

misty igloo
fresh mulch
#

qwen has the same behavior with a relative dip at 3B multilingual in the previous data, but less pronounced ig

#

plus that's the easy one to change

misty igloo
#

it may be correct, but it looks sus to me

gusty condor
gusty condor
misty igloo
#

could also be that 1.5B is the one that's wrong, which makes 3B look like it dips

fresh mulch
#

hey, who removed subfigure captions on 3 and 4? we need those for some of the crossreferences

misty igloo
#

I'll comment it back in, sorry

gusty condor
fresh mulch
#

it is updated if you recompile

gusty condor
#

Nope, looks like something wrong with your plot. Now RWKV7 should be higher than Qwen2.5 at 3B

fresh mulch
#

smerky's average sheet says 71.0 average RWKV7 2.9B and 71.4 average qwen2.5 3B

gusty condor
#

which sheet

fresh mulch
#

@misty igloo

fresh mulch
gusty condor
keen tartan
#

Why do we show Qwen2.5-7B for English in table 3 but not for multilingual in table 4? I think we should consider commenting it out of table 3, as we don't yet have an RWKV model of this class to compare with.

misty igloo
gusty condor
#

I put that in for reference. Should definitely comment out

keen tartan
#

Yeah, I can see this. Was also toying around in thought with extrapolating how RWKV7 7B and 14B would perform

gusty condor
#

Apologies for the confusion.

misty igloo
fresh mulch
#

so the data we are working with does not appear to support that claim

misty igloo
#

I agree it looks wrong

#

checking

#

@fresh mulch I updated the RWKV7 1.5 and 2.9B numbers

#

they were old

fresh mulch
#

ah okay

#

english only or both

misty igloo
#

eng, checking multi now

#

the multi look ok

fresh mulch
#

fixed, uploaded

misty igloo
#

@gusty condor do you think we can put SoTA now for both, instead of 'near SoTA' english

#

I'm a little leery of making the claim of SoTA on english

#

because we don't establish a new SoTA, except on a per tokens trained basis

#

which should matter but... I just don't want to overclaim

gusty condor
#

SoTA-level?

#

we are somehow on par with sota

misty igloo
#

It would be nice to lead with our best foot forward: that we have the best 3B LLM for way less training, and demolish everything on multilingual

gusty condor
#

Bo might have some different opinions: Architecture is the juiciest part, the model serves as a tool to demonstrate the architecture.

misty igloo
#

It's good for the first sentence to include the best results

misty igloo
gusty condor
misty igloo
#

How about something like this:

We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and trained four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset.

To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.

gusty condor
#

To test -> to demonstrate

misty igloo
#

'with LLMs' bothers me a bit.. not sure how to rephrase that

#

maybe 'with released LLMs'?

gusty condor
#

along with language models

#

Nevertheless, RWKV-7 models use ...

misty igloo
#

okay are we ready for publishing?

gusty condor
#

Yes!

fresh mulch
#

LGTM

misty igloo
#

@gusty condor can I remove your footnotesize on the multilang table?

gusty condor
#

OK, that is up to you

#

Is compilation successful?

misty igloo
#

trying that now

gusty condor
#

post error msg asap so that we can debug

misty igloo
#

Submission processed OK

#

take a look and let me know if anyone sees anything wrong

gusty condor
#

🎉 🪿 🎉 🥳

crystal hull
misty igloo
#

anyone got an idea of what ACM class we are?

#

I.2.7 maybe

#

I.2.7 Natural Language Processing

#

I guess that's what I'll put

keen tartan
#

Sounds alright

gusty condor
#

I think I.2.0 (this is where general architectures should live). But I.2.7 is still great.

keen tartan
#

Yeah, as it can be applied to other modalities as well, not only NLP.

misty igloo
#

I can put multiple

#

I'll do both

#

Computation and Language (cs.CL)
Cross lists (optional):
Artificial Intelligence (cs.AI)
Machine Learning (cs.LG)

#
  • Article submitted
keen tartan
#

Goose-tastic! (or even better: Honk-tastic!)

misty igloo
#

🎉

fresh mulch
#

🥳

#

time to compress for colm?

misty igloo
#

yup, time for that annoying process

gusty condor
#

We shall have a good rest

misty igloo
#

don't edit the current doc for that, we will make a new one

#

Great job, everyone!!!!

gusty condor
#

I go sleeping then 🛌 😴

misty igloo
#

you deserve a good sleep!!! gnite! Great work!

obsidian quest
#

great work 🙂
please test mamba for <|endoftext|> effect as i predict it will be strong too.

gusty condor
#

Time's up! How is it going?

misty igloo
#
#

It just went out two min ago!

#

how do we submit it to HF daily papers? maybe tweet at Akhaliq?

gusty condor
#

Done!

willow condor
#

Have you posted to r/LocalLlama and/or r/MachineLearning?

misty igloo
#

Nope! Plz do so if you can

willow condor
#

Sure.

gusty condor
#

We are competing with 2 papers, #2 of which looks so clickbaity. Yet such papers receive lots of upvotes. This is unfair 😂

gusty condor
#

Now we are #2 and Impossible Videos is #1.

gusty condor
#

One more upvote and we are #1!

crystal hull
keen tartan
#

It got nominated #1 paper of the day on HF. Of course. RWKV7 is more fundamental than a "merely deep fake generator". A "bimodal" benchmark is also no real competitor.

#

We may want to consider adding some illustrative visuals to the abstract page for the next version.

spring epoch
#

also r/localllama is horrible with their moderation, I always have difficulty posting about RWKV

keen tartan
dusty skiff
#

does rwkv work with attention as a hybrid?

gusty condor
#

Yes, but that adds no benefit. You will barely see decreased loss or benchmark improvements.

misty igloo
gusty condor
#

@paper dove did some: Adding one layer of attention to L12/D768 RWKV-7 decreases loss by around 0.0008 (not significant).

misty igloo
#

Interesting!

dusty skiff
#

I'm not having the best results with rwkv, it's worse than attention in my experiments

dusty skiff
#

wdym

keen tartan
# dusty skiff wdym

I mean what experiments did you conduct and what was the result compared to the expected results?

#

Which models did you use for instance?

dusty skiff
#

I tried it on my custom dataset for language modeling, but it's totally different than typical LM datasets. I compared it to transformer with rope, value residual and muon optimizer

keen tartan
dusty skiff
#

the result is it's worse 0.3 in loss

#

from scratch

#

3.5m model

keen tartan
#

There are lots of things to consider for training.

keen tartan
#

What trainer have you used? The RWKV-LM repo?

dusty skiff
#

no, my own code

keen tartan
gusty condor
dusty skiff
#

I literally replaced attention layer with Rwkv7Attention from fla

#

RWKV7Attention(
    mode="chunk",
    hidden_size=hidden_size,
    head_dim=64,
    num_heads=None,
    decay_low_rank_dim=64,
    gate_low_rank_dim=128,
    a_low_rank_dim=64,
    v_low_rank_dim=32,
    # v_low_rank_dim=16,
    norm_eps=1e-5,
    fuse_norm=True,
    layer_idx=layer_idx
)

gusty condor
#

The initialization of FLA-RWKV7 does not function properly.

Parameter Initializations: Proper parameter initialization is crucial for ensuring training stability and achieving optimal performance for language models. RWKV-7 employs a carefully designed initialization strategy tailored to its architecture. The detailed initialization scheme is beyond the scope here but can be found in the official code repository. We emphasize that using the recommended initialization is essential for replicating the results in this paper. Deviations from the prescribed initialization may lead to performance degradation.
dusty skiff
#

good to know lmao

#

so riddle me this

keen tartan
#

It was converted from RWKV checkpoint.

dusty skiff
#

ah ok

keen tartan
#

Check the RWKV-v5 folder. There is the training code for RWKV7. I know it is a bit confusing.

dusty skiff
#

where is initialization code?

keen tartan
#

RWKV-v5/src/model.py

gusty condor
#

This function is extremely obfuscated, but the main purpose is:

  • initialize down projections with 0
  • initialize embedding with very small numbers
  • orthogonally initialize up projections, r, k, v and output head with relatively small gains
  • initialize token shifting with some magic numbers
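
A rough sketch of the four bullets above, written in NumPy for brevity. Every name, gain, and scale here is a placeholder chosen for illustration, not a value taken from RWKV-v5/src/model.py; the official code remains the reference for the real numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal(rows, cols, gain):
    # Orthogonal init via QR of a Gaussian matrix, scaled by `gain`.
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)           # q has orthonormal columns
    q = q * np.sign(np.diag(r))      # sign fix so the result is uniformly distributed
    if rows < cols:
        q = q.T
    return gain * q

def init_sketch(vocab, d_model, d_lora):
    # All keys, gains, and scales below are hypothetical placeholders.
    return {
        "emb": rng.uniform(-1e-4, 1e-4, (vocab, d_model)),     # very small embedding
        "w_down": np.zeros((d_lora, d_model)),                 # down projection starts at 0
        "w_up": orthogonal(d_model, d_lora, gain=0.1),         # small-gain orthogonal
        "receptance": orthogonal(d_model, d_model, gain=1.0),
        "key": orthogonal(d_model, d_model, gain=0.1),
        "value": orthogonal(d_model, d_model, gain=1.0),
        "head": orthogonal(vocab, d_model, gain=0.5),
        "token_shift": np.linspace(0.0, 1.0, d_model),         # stand-in for the "magic numbers"
    }

params = init_sketch(vocab=50, d_model=16, d_lora=4)
```

With the down projection at zero, each low-rank branch contributes nothing at step 0, so the block starts out close to its simplest form and the extra paths switch on gradually during training.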
obsidian quest
#

moreover use LayerNorm for rwkv7 (not RMSnorm)

dusty skiff
#

yeah I use ln

obsidian quest
dusty skiff
#

I mean it's kind of interesting because on some other dataset which was more prone to overfit on some implicit concepts, it behaved better

obsidian quest
#

0.3 loss difference certainly means something is wrong 😂

obsidian quest
dusty skiff
obsidian quest
#

are you comparing train loss, or val loss?

dusty skiff
#

val

obsidian quest
#

how about train loss

dusty skiff
#

same story

#

byte-level tokenizer btw

obsidian quest
#

got train loss curve comparison?

obsidian quest
dusty skiff
#

yeah I have to figure out the code, or maybe you've got some idea how I can modify this

    def _initialize_weights(self, module: nn.Module):
        if getattr(module, "_is_hf_initialized", False):
            return
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        if isinstance(module, nn.Parameter):
            nn.init.xavier_uniform_(module, gain=2 ** -2.5)
        module._is_hf_initialized = True
dusty skiff
dusty skiff
#

should I use this?

            # !!! initialize if you are using RWKV_Tmix_x070 in your code !!!
            # self.receptance.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.key.weight.data.uniform_(-0.05/(C**0.5), 0.05/(C**0.5))
            # self.value.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.output.weight.data.zero_()
#

I'm lost lol

#

idk I have this and there's little difference in loss

fresh mulch
#

ctrl+f for the first comment line

obsidian quest
#

and print all names, and print() sth inside if, to make sure these ifs are called

misty igloo
#

@dusty skiff let's move this out of the paper writing channel and into the rwkv discord or rwkv channel here

obsidian quest
#

you should see dramatically better loss after these

#

#rwkv

misty igloo
#

but generally speaking, wrt to the papers, we really need to provide an easy to use training code (FLA?) with proper inits

#

or else everyone will have this experience

#

the RWKV-Block repo could become this, but someone needs to devote time to making sure it's really perfect first

#

and that someone will not be me 🀣

#

I think improving the FLA code specifically is important, since that's probably what people will try first
@gusty condor I don't know if you have time to help fix that but it'd be great if you do

#

I currently copied the FLA models to the official RWKV HF, so it's the default implementation right now that people find

obsidian quest
misty igloo
#

I think the problem is their setup for all the FLA models isn't currently well suited towards special inits and needs some changes to support that (to be fair, our code for that is a horrible mess)

#

@sonic horizon When do you expect to have the full AudioRWKV experimental results and additional baselines ready? Please keep in mind that the final COLM submission date is about a week away.

#

If featured, just know that it will almost certainly end up in an appendix for that paper. We are at an extreme premium for space, as the entire paper must fit in 9 pages.

young sparrow
#

Great work on the paper everyone 🙂

sonic horizon
obsidian quest
misty igloo
#

will update the paper

gusty condor
#

Yes

gusty condor
obsidian quest
obsidian quest
#

Todo:

  1. #1103039376184852622 message

  2. #1103039376184852622 message

  3. #1103039376184852622 message

  4. #1103039376184852622 message

keen tartan
# obsidian quest great work 🙂 please test mamba for <|endoftext|> effect as i predict it will be...

I ran a quick experiment to test Mamba2 with and without add_bos_token flag and found no difference in accuracy and no significant difference in perplexity.
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-370m

Perhaps the lm_eval's add_bos_token option is buggy for Mamba models as well and did not actually add it.

#

Gonna try again with installing from GitHub directly.

young sparrow
keen tartan
#

@obsidian quest You were right! 43.9% versus 43.5% in accuracy and 16.8 versus 17.1 in perplexity.

No bos:

|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4392|±  |0.0069|
|              |       |none  |     0|perplexity|↓  |16.8289|±  |0.5443|

With bos:

|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4353|±  |0.0069|
|              |       |none  |     0|perplexity|↓  |17.1491|±  |0.5539|

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m/0.4.8

gusty condor
keen tartan
gusty condor
#

test RWKV-G1 too: that is very significant.

young sparrow
keen tartan
#

It is a wild-goose chase (pun intended). ^^

obsidian quest
keen tartan
#

Perhaps the sample size is too small. Testing it on more evaluation tasks might reveal a statistically significant difference.
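
As a quick sanity check on that, here is a minimal sketch of a two-sample z statistic on the lambada_openai accuracies from the tables above. It assumes the two runs are independent, which is not strictly true since both use the same examples; a paired test on per-example outputs would be more sensitive.

```python
import math

def z_score(acc_a, se_a, acc_b, se_b):
    # Two-sample z statistic, treating the two runs as independent.
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)

# Mamba2-130m, no BOS vs. with BOS (values and stderrs from the lm_eval tables)
z = z_score(0.4392, 0.0069, 0.4353, 0.0069)
print(f"z = {z:.2f}")  # prints "z = 0.40"
```

A value around 0.4 is far below the usual 1.96 threshold, so under this assumption the 0.4-point accuracy gap is within noise on a single task.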

keen tartan
#

You mean the 142 examples where the answer is the first token as identified by @gusty condor. Gonna check those.

gusty condor
misty igloo
misty igloo
#

@obsidian quest do you have a name for this formula? RWKV7-alt? lol

#

(Also, did you try it for language modeling? I was always hoping we'd move the w outside of the evolution formula, if you recall!)

obsidian quest
gusty condor
#

Did you ever capture a replicable NaN?

keen tartan
#

Created custom task for special 142 samples of lambada-openai and tested Mamba2 again. No significant effect it seems:

no bos:

|   Tasks    |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada-142 |      1|none  |     0|acc       |↑  | 0.4085|±  |0.0414|
|            |       |none  |     0|perplexity|↓  |14.5471|±  |2.6646|

add_bos:

|   Tasks    |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada-142 |      1|none  |     0|acc       |↑  | 0.4085|±  |0.0414|
|            |       |none  |     0|perplexity|↓  |14.1444|±  |2.6250|

I basically used the following code, but tested Mamba2 rather than SmolLM2: https://colab.research.google.com/drive/1nle-APaWJ12uA-WS9zgLEqAmY6cGWRHm?usp=sharing
I may have missed something perhaps.

gusty condor
#

That's perfectly normal for mamba 2.

brisk bronze
#

we're now #1 on weekly papers on HF too

gusty condor
sonic horizon
#

@misty igloo Hi, sorry to keep you waiting. I have updated the final experimental results for audio modeling and finished writing up AudioRWKV, in \paragraph{RWKV for Audio Modeling}, RWKV-7 (preliminary).

#

If there isn't enough room in the main text, we could include it in the appendix. However, I'm not sure which section would be the most appropriate. Could you advise?

misty igloo
misty igloo
#

Are DeepRes and HST-AT the transformer based architectures?

#

I'm unclear on the difference between HST-AT and HST-AT pretrained

#

Also, could you explain what is meant by

Note that we did not use the ensemble trick in this experiment, resulting in a slight drop in performance compared with results reported in \citet{rwkv6_colm}.
I'm not sure I understand what the ensemble trick is or what results you're saying were in the rwkv5/6 paper

sonic horizon
#

For the ensemble trick, it provides a bigger ensemble result by using models with different patch settings. We used it in the audio modeling section in RWKV6.

obsidian quest
#

how many heads are you using? we need at least 5 heads for single layer rwkv7 to solve S5. so can use multiple small heads @crystal hull

p.s. note i chose v^T k instead of k^T v because it fits the L2 loss

sonic horizon
#

I can add these details in the writing .

misty igloo
obsidian quest
misty igloo
misty igloo
sonic horizon
misty igloo
#

but this thing where the results are worse isn't good

misty igloo
sonic horizon
misty igloo
#

I think it's quite important to show some apples to apples comparison with v6 though

#

Otherwise you haven't really demonstrated anything about v7, which is the goal of putting this into the paper

sonic horizon
gusty condor
#

No tricks in training RWKV7 please, or use tricks for both.

gusty condor
#

Time to work for COLM submission!

misty igloo
misty igloo
#

sorry, posted the wrong paper link a moment ago... will get one asap

fresh mulch
#

fair point about variable definitions, though this notation is standard isn't it

#

i guess we still ought to define our terms before using them

fresh mulch
#

also is one of these $\kappa_t$ supposed to be $\hat{\kappa}_t$

silent urchinBOT
#

Christian Azinn

misty igloo
misty igloo
tropic minnow
#

this plot (right side) might benefit from text that doesn't overlap with the grid, and from making rwkv more orange, less yellow

young sparrow
#

Omitting Mamba and RWKV-Pile on the left looks weird at a glance. I know it's because of the minimal multilingual content in the pile, but you should explicitly say that in the caption so someone who glances at the plot has that context. If there are numbers for those models, I would recommend including them even if they're bad tbh. Most plots should be optimized to be easily digestible at a glance / to people skimming

fresh mulch
fresh mulch
misty igloo
#

If you make edits, please do so only on the arxiv version

#

I will port them to the COLM document after validating the final choices made

#

otherwise it becomes really hard to track what changed

fresh mulch
#

makes sense. i also need to change axis titles and fix alignment. will do in an hour or so

misty igloo
#

Woohoo! I finally got it all to fit in 9 pages with all the figures and tables we need.

obsidian quest
#

"However, training for multi-query associative recall (MQAR) is highly unstable and strongly dependent on initialization and hyperparameter settings
some guy read this and say RWKV7 is bad at MQAR so we dont provide MQAR chart 😂

#

so let's add chart for this

#

in this style (show 1024 & 2048 if possible)

gusty condor
#

I just want to avoid suppressing the baselines for other models, as was shown in the xLSTM paper. The default initialization of MQAR is clearly suboptimal for RWKV-7 and a few other models, but without knowing their correct initialization and implementation I decided not to put them in at all.

obsidian quest
#

let's simply give all models better initialization

gusty condor
#

and lr too (I used transferable lr https://arxiv.org/abs/2407.05872 for RWKV-7 based on observations so I didn't sweep on the whole LR interval)

misty igloo
#

@iron parrot did you use RWKV7a with c=2 in the Othello experiments? I added a couple of sentences there - please review and expand on whether or not this was what the code did.

iron parrot
dawn pewter
#

From the results in Appendix C, c=1.545239211892605 ( 1+exp(-exp(-0.5)) ) is the maximum value of c that ensures stability.
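
The constant quoted above is easy to reproduce. This snippet only evaluates the expression from the message numerically; it does not re-derive the stability bound from Appendix C.

```python
import math

# c_max = 1 + exp(-exp(-0.5)), as written in the message above
c_max = 1.0 + math.exp(-math.exp(-0.5))
print(c_max)
```

So c=2 lies above this bound, while any c up to roughly 1.545 stays within it.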

misty igloo
obsidian quest
#

lets fix table 10

misty igloo
#

but I will fix it now

misty igloo
misty igloo
obsidian quest
willow condor
#

Typo: a product a product of elementary transition matrices

misty igloo
#

these proofs are undergoing revisions right now, and are probably the last thing that will change before I publish the final COLM version

misty igloo
#

Proof revisions integrated and COLM and ArXiV versions submitted.

#

(Updated ArXiV version supposedly going out March 31)

willow condor
#

RWKV 7 can be made Turing Complete using permutation matrices and state dependent (not just data dependent) transition matrices.

I think the next RWKV should include matrices that aren't just diagonal but rather subdiagonal etc., which would reduce parallelizability for maximal expressivity. End the war with "DeltaFunction"s.

willow condor
#

To expand, I mean explicitly give RWKV a way to simulate cellular automata in a continuous, differentiable way. For example, the formula for calculating Rule 110 (Turing-complete) is state + (state @ right) - state * (state @ right) * (1 + (state @ left)) where left, right are the last dimension left and right shifted versions of state (equivalent to multiplying by a subdiagonal matrix or a superdiagonal matrix)
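
A small NumPy sketch of that formula, checked against the Rule 110 lookup table. Two assumptions: the shift matrices act on a column-vector state (shift @ state rather than state @ shift), and cells beyond the edges are treated as 0. The function names are made up for illustration.

```python
import numpy as np

def rule110_step(state):
    # state: 1-D array of cell values in [0, 1]; cells beyond the edges count as 0.
    n = len(state)
    right = np.eye(n, k=1) @ state     # right[i] = state[i + 1] (superdiagonal shift)
    left = np.eye(n, k=-1) @ state     # left[i]  = state[i - 1] (subdiagonal shift)
    return state + right - state * right * (1 + left)

def rule110_lookup(state):
    # Reference: read each new cell from the bits of the number 110.
    padded = [0, *state, 0]
    return [
        (110 >> (4 * padded[i - 1] + 2 * padded[i] + padded[i + 1])) & 1
        for i in range(1, len(padded) - 1)
    ]

cells = [0, 0, 0, 1, 0, 1, 1, 0]
print(rule110_step(np.array(cells, dtype=float)).astype(int).tolist())
print(rule110_lookup(cells))
```

On binary inputs the two agree, and because the polynomial is smooth in each cell it also accepts values strictly between 0 and 1, which is what makes it usable as a differentiable update.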

#

Rule 110, when the state and everything else is bound between (0, 1), displays interesting converging properties where in 3D it converges to 1/phi for all coordinates, while if instead of right shifting or left shifting and treating the edges as constants a or b, a acts like the learning rate and b as the point which the rule converges to. See this Desmos graph if interested:

https://www.desmos.com/3d/mhlntzgruo

obsidian quest
#

nobody complains about transformer expressiveness 😂 we should improve rwkv's memory first

crystal hull
gusty condor
#

I analyzed the download data (only counting non-quantized models), and the results are roughly as follows:

| Organization         |  Downloads |  Likes |
|----------------------|-----------:|-------:|
| meta-llama           | 26,369,349 | 41,742 |
| Qwen                 | 21,092,745 | 25,817 |
| deepseek-ai          | 12,927,530 | 36,137 |
| HuggingfaceTB        |  2,439,032 |  3,107 |
| RWKV (including FLA) |     70,705 |    537 |

Vision-Language Models (VLMs) are very popular. The top models for both Qwen and HuggingfaceTB are VLMs.
For Qwen, Llama, and RWKV, their most popular models are all 7B-sized.
Based on this data, RWKV should release a 7B model as soon as possible.

misty igloo
#

This is why I've been doing the conversions. I have a 7B model distilled from Qwen 72B that we can release this week with the arxiv version of the RADLADS conversion paper.

willow condor
#

maybe a large memory bank which is sparsely activated? im sure these ideas have come up before

remote elbow
# willow condor maybe a large memory bank which is sparsely activated? im sure these ideas have ...
willow condor
remote elbow
# willow condor similar to the idea they describe. the keys should be read/write accessible for ...

What if you had two states and did product keys on that
pkm is

# mostly copied from https://github.com/facebookresearch/memory/blob/main/lingua/product_key/memory.py but I removed some stuff for simplicity
import torch
import torch.nn.functional as F

def pkm(q, keys1, keys2, topk, values):
    # assumed shapes: q (bs, dim), keys1/keys2 (nkeys, dim // 2), values (nkeys ** 2, vdim)
    bs = q.shape[0]
    nkeys = keys1.shape[0]
    q1, q2 = q.chunk(2, dim=-1)
    scores1, indices1 = torch.topk(q1 @ keys1.mT, topk, dim=-1)  # (bs, topk)
    scores2, indices2 = torch.topk(q2 @ keys2.mT, topk, dim=-1)  # (bs, topk)
    # cartesian product on best candidate keys
    all_scores = (
        scores1.view(bs, topk, 1).expand(bs, topk, topk)
        + scores2.view(bs, 1, topk).expand(bs, topk, topk)
    ).view(bs, -1)  # (bs, topk ** 2)
    all_indices = (
        indices1.view(bs, topk, 1).expand(bs, topk, topk) * nkeys
        + indices2.view(bs, 1, topk).expand(bs, topk, topk)
    ).view(bs, -1)  # (bs, topk ** 2)

    # select overall best scores and indices
    scores, best_indices = torch.topk(
        all_scores, k=topk, dim=1, largest=True, sorted=True
    )  # (bs, topk)
    indices = all_indices.gather(1, best_indices)  # (bs, topk)
    # per_sample_weights is only supported with mode="sum"
    return F.embedding_bag(indices, values, per_sample_weights=scores, mode="sum")

rwkv7 handles the state S like

# from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v7/rwkv_v7_numpy.py
...
S = S * w.mT - S @ kk * (kk*a).mT + v * k.mT
y = S @ r
...

maybe you could do

S, ind = pkm(x, k1, k2, topk, bigS)
bigS[ind] = S * w.mT - S @ kk * (kk*a).mT + v * k.mT
y = S @ r
#

it's probably not correct as is but something similar maybe

obsidian quest
obsidian quest
willow condor
#

What about TTT mlps? Titans tried that I think and they had good results. Intuitively sparse activations would prevent catastrophic forgetting as the gradients simply wouldn't propagate to irrelevant information.

sly agate
#

I tried pseudo State MoE. Fixed gating. Suffix tuning works well for multi-turn QA

misty igloo
#

I'll re-export for arxiv now

remote elbow
sly agate
# remote elbow does pseudo state moe work?

My method is an attempt to use multiple trained states(Prefix + Suffix) simultaneously during inference.

So it is not MoE.(thats why i call pseudo moe)

It works for my purposes (characterization, knowledge, agent).

By adding routers, we can achieve state sparsity, which may bring us closer to State-MoE.

I previously experimented with the non-state MLPSparse MoE on LoRA.

v7 0.4B(World v2.9) + Router + 4MLPLoRA(r=256) = 0.6B
Due to the dynamic LoRA merge, there were problems with the inference speed, but as a benchmark (Japanese), it improved slightly.
The basic design of MoE is based on Flock of Finches, and the HashRouter has been removed.

obsidian quest
#

pls add these to paper appendix

sly agate
#

Thanks to FLA, RWKV v6 and v7 can perform 384 batch inferences on a single RTX4090. This means that there is almost no degradation in inference speed even when inferring multiple states simultaneously.

@obsidian quest about multiple-State-inference? or Prefix + Suffix Tuning?
Multiple state inference is experimental and cannot be guaranteed to be mathematically correct.(But the implementation is simple)

obsidian quest
#

add all as experiments 🙂

willow condor
#

I have an idea. What if, we had an external memory that is separate from the state but which can only be read from in a way that automatically changes it? This is more similar to how human memory works where recalling a fact increases its strength, and would allow for better parallelization.

k = key generated from state
v = expected value generated from state

return dot(memory @ k, v) * v to state
memory += k^Tv

or something more generalized.

misty igloo
#

Updated arxiv paper is live.

tough crane
misty igloo
#

As mentioned in the paper, I generally found that post-training with a different dataset resulted in a 'confused' model. But maybe there are workarounds for this that could be discovered.

tough crane
misty igloo
gusty condor
obsidian quest
#
#

@misty igloo

misty igloo
# obsidian quest Here are <@1078605512138043403> 's experiment log https://docs.google.com/docume...

I think this is different than state offset tuning. Very interesting that it seems to work well.

# normal matrix-evolution recurrence
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ h
  
# offset tuning recurrence
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ (h + self.offset)

# offset tuning removed from the recurrence (kernel)
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ h
# post-kernel step
y = y + r @ self.offset
 
# OpenMOSE method
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ h
y = y * (1 + self.time_offset_y)
y = groupnorm(y)
y = self.output(y * g)
# plus another change after the output
y = y * (1 + self.time_offset)
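
The second and third variants above give the same output by linearity, since r @ (h + offset) = r @ h + r @ offset. Here is a small NumPy check of that, with arbitrary shapes and a fixed diagonal G standing in for the real transition (both are assumptions for illustration, and k[t], v[t], r[t] are made per-timestep explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, V = 5, 4, 3                       # timesteps, key dim, value dim (arbitrary)
G = 0.9 * np.eye(K)                     # stand-in for the state transition
r = rng.standard_normal((T, K))
k = rng.standard_normal((T, K))
v = rng.standard_normal((T, V))
offset = rng.standard_normal((K, V))    # learned state offset

h = np.zeros((K, V))
y_inside, y_plain = [], []
for t in range(T):
    h = G @ h + np.outer(k[t], v[t])
    y_inside.append(r[t] @ (h + offset))    # offset applied inside the recurrence
    y_plain.append(r[t] @ h)                # unmodified kernel output

y_inside = np.stack(y_inside)
y_post = np.stack(y_plain) + r @ offset     # offset added in one post-kernel step
print(np.allclose(y_inside, y_post))        # True
```

This is why the offset can live outside the kernel. The OpenMOSE variant is a different operation: it rescales y multiplicatively before and after the norm and output projection, so it cannot be folded into the state this way.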
gusty condor
misty igloo
gusty condor
#

(Simulating Qwen pretraining distribution)

#

However, I have no idea about the instruction data (likely proprietary; I heard from a Zhihu user that Qwen and DeepSeek own the same proprietary English instruction dataset)

willow condor
#

i have tried using their arch after replacing all of the multitensor stuff with pure vectors and lobotomizing many softmax cummax layers to generalize it, but it is hard to get the symmetry and weight tying back. seems like it would perfectly complement rwkv, so i was wondering if you or someone else already knew about this and had tried to incorporate it into in-context gradient descent models

misty igloo
#

this blogpost did get brought up in the RWKV discord previously

#

happy to discuss there more

dusty skiff
#

you won't improve memory, it's just fundamentally impossible with such parallelization. Memory also requires expressiveness, but we won't achieve this without making the models sequential at least to some degree

#

that's why no one will get true length extrapolation with 1 forward pass over 13421512532k tokens bullshit

gusty condor
gusty condor
obsidian quest
gusty condor
#

My code is more aggressive:

  1. https://github.com/Triang-jyed-driung/rwkv7mini (completely restructured dataset loading)
  2. https://github.com/Triang-jyed-driung/my-pretrain (pretraining code, applicable for HF-compatible models, including RWKV-7 FLA, and supports pytorch-lightning from 1.9.5 to 2.5.1)

young sparrow
#

Is there an RWKV paper at ICLR? @void quartz @misty igloo

steady ether
#

Apparently there is one on vision-rwkv

fresh mulch
#

what is "RWKV-like"

steady ether