RWKV-papers | EleutherAI | Page 10

gusty condor Mar 9, 2025, 4:20 PM

#

https://modelscope.cn/datasets/AI-ModelScope/GuanacoDataset

keen tartan Mar 9, 2025, 4:23 PM

#

gusty condor 2. https://modelscope.cn/datasets/AI-ModelScope/GuanacoDataset

Fantastic! Thx

#

I think this is the missing books3 dataset: https://huggingface.co/datasets/SaylorTwift/the_pile_books3_minus_gutenberg

misty igloo Mar 9, 2025, 9:21 PM

#

Cool, are those three everything that's unavailable, including in all the parts of world v1/2 as well?

keen tartan Mar 9, 2025, 10:35 PM

#

misty igloo Cool, are those three everything that's unavailable, including in all the parts ...

Yes, as far as I can tell. I'm going to double-check all 88 sets again just to be sure.

#

MMLU results seem to be missing for Llama-3.2 1B/3B and Qwen-2.5 1.5B/3B. Could you please check whether you can find them as well?

#

Added a list with links to all datasets in the README for convenience: https://huggingface.co/datasets/hevok/Goose-World-v3
It is right under the pic.

#

I quickly hacked together a script to convert the lm_eval markdown text outputs to json outputs: https://huggingface.co/spaces/hevok/evals/blob/main/txt2json.py
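For readers who have not opened the script, the kind of conversion it describes can be sketched roughly as below. This is not the code from txt2json.py. It assumes the pipe-table layout that lm_eval prints (columns named Tasks, Filter, Metric, Value, Stderr); the function name `markdown_table_to_results` and the sample table are made up for illustration.

```python
import json
from typing import Dict, List

# Hypothetical sample in the pipe-table style that lm_eval prints (values are illustrative).
SAMPLE = """
|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4913|±  |0.0070|
|              |       |none  |     0|perplexity|↓  |12.3626|±  |0.3691|
"""


def markdown_table_to_results(table: str) -> Dict[str, Dict[str, float]]:
    """Convert a pipe table into {task: {"metric,filter": value, ...}}."""
    rows: List[List[str]] = [
        [cell.strip() for cell in line.strip()[1:-1].split("|")]
        for line in table.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    col = {name: header.index(name)
           for name in ("Tasks", "Filter", "Metric", "Value", "Stderr")}

    results: Dict[str, Dict[str, float]] = {}
    task = ""
    for cells in body:
        # Continuation rows leave the task cell empty and inherit the previous task.
        task = cells[col["Tasks"]] or task
        metric, filt = cells[col["Metric"]], cells[col["Filter"]]
        entry = results.setdefault(task, {})
        entry[f"{metric},{filt}"] = float(cells[col["Value"]])
        try:
            entry[f"{metric}_stderr,{filt}"] = float(cells[col["Stderr"]])
        except ValueError:
            pass  # some metrics report no numeric stderr
    return results


print(json.dumps(markdown_table_to_results(SAMPLE), indent=2))
```

The "metric,filter" key style mirrors the keys in the JSON results posted in this channel (for example "acc,none" and "acc_stderr,none").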

#

I used it to convert the Llama-3.2 1B/3B and Qwen-2.5 1.5B/3B text files into json format for easier processing with software: https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval

#

@gusty condor By the way, in case you were not aware, you can pass the --output_path <output_folder_name> flag to lm_eval to have it write JSON files directly, with more metadata.

obsidian quest Mar 10, 2025, 12:01 AM

#

yeah improved in world-3.5

gusty condor Mar 10, 2025, 2:40 AM

#

keen tartan <@803473343705514025> By the way, in the case you were not aware, you can specif...

I will rerun them

gusty condor Mar 10, 2025, 4:38 AM

#

updated https://github.com/Triang-jyed-driung/myevals

dawn pewter Mar 10, 2025, 8:03 AM

#

If σ means the sigmoid, should we use it uniformly? For example, use either σ or sigmoid in both places.

dawn pewter Mar 10, 2025, 8:20 AM

#

Should the ranges of ξ and α also be specified? We've only described what they do, but we haven't described their ranges.

gusty condor Mar 10, 2025, 9:23 AM

#

Yes, you can change it

gusty condor Mar 10, 2025, 11:06 AM

#

Now, I have finally found the main reason for the performance degradation of converting RWKV7 models to Flash-Linear-Attention format.
It is not related to numerical precision; it is related to the prompt format.
The code used by Blink and HowardHou (VisualRWKV) adds an extra BOS token [0] before the text. However, lm-eval does not add that extra token.
Upon further inspection, I found the failure pattern: The model is unable to perform recall for the very first token it receives, witnessed by these examples:

2728: "Mathews lifted a dark brow. \"Are you sure about that? I mean, wouldn't it be better to wait until Dale is home safe and sound?\"\n\n\"The longer I wait to tell her, the worse it will be for both of us.\"\n\n\"Good luck. You're going to need it,\" said Mathews"

1225: "Seth traced the dirt with the end of a stick. \u201cYou say I\u2019m stubborn\u201d I laughed and he continued, \u201cListen, I don\u2019t even know if it\u2019s true or not. There\u2019s no need for me to worry any of you. That\u2019s why I didn\u2019t say anything.\u201d\n\u201cI still don\u2019t care, Seth"

3999: "Sirona tried to quell her sense of disappointment. \u201cWhat, then? Why did I see what I saw?\u201d\n\u201cThe young woman you observed being sacrificed,\u201d her teacher asked, \u201cDid she appear distraught, or did she go along with the ritual willingly?\u201d\n\u201cI\u2019m certain she was terrified,\u201d said Sirona"

The inability to recall the first token is probably related to WKV state initialization.
After removing the bos token [0] from VisualRWKV code, the performance matches FLA implementation.
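A minimal sketch of the difference between the two prompt formats, assuming token ID 0 is the BOS token as described above. The helper name `prepend_bos` is hypothetical and is not taken from the VisualRWKV, FLA, or lm-eval code.

```python
from typing import List, Sequence

BOS_TOKEN_ID = 0  # assumption from the discussion: RWKV World uses token 0 as BOS


def prepend_bos(token_ids: Sequence[int], bos_id: int = BOS_TOKEN_ID) -> List[int]:
    """Return the token list with a single BOS token in front.

    If the list already starts with BOS it is returned unchanged, so the
    helper can be applied safely to inputs of either format.
    """
    ids = list(token_ids)
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids


# Without the extra token the first word of the text is the very first input the
# model sees; with it, the first word arrives as the second input.
print(prepend_bos([6624, 31220, 32227]))
```

Printing the token IDs before and after such a step is a quick way to check which of the two formats an evaluation pipeline actually feeds to the model.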

Now, I have these questions:

Should I write that into the paper?
Does lm-eval have an option of adding a BOS token before the text?

@misty igloo @keen tartan

obsidian quest Mar 10, 2025, 11:06 AM

#

this is like bos token

nova frost Mar 10, 2025, 11:12 AM

#

gusty condor Now, I have finally found the main reason for the performance degradation of con...

add_bos_token=True in the model_args.

#

uses tokenizer.bos_token_id

#

https://github.com/EleutherAI/lm-evaluation-harness/blob/4890e881031a8ff00fd3136f938c4cf1ae101de4/lm_eval/models/huggingface.py#L799

keen tartan Mar 10, 2025, 11:24 AM

#

Gemma series of models also seem to need a bos_token added.

keen tartan Mar 10, 2025, 11:30 AM

#

nova frost `add_bos_token=True` in the model_args.

On a related note, after creating adapter = EvalHarnessAdapter() I had to set adapter.custom_prefix_token_id = None when evaluating RWKV models to get some benchmarks working. Otherwise an undefined-variable error was raised somewhere. I'll try to reproduce it; it might already be gone in newer versions.

#

@gusty condor RWKV6-3B v2.1 multilingual appears to be missing. Could you please check whether you can find them?

gusty condor Mar 10, 2025, 11:41 AM

#

keen tartan <@803473343705514025> RWKV6-3B v2.1 multilingual appears to be missing. Could yo...

overwritten, I test it now

keen tartan Mar 10, 2025, 11:41 AM

#

gusty condor overwritten, I test it now

Oh ok, very well.

keen tartan Mar 10, 2025, 12:05 PM

#

Setting adapter.custom_prefix_token_id = None or not changes the results a tiny bit for perplexity but not for accuracy. However, the values are identical up to 7 decimal places (e.g. 12.59587346 versus 12.59587348), so it is probably not significant.

gusty condor Mar 10, 2025, 12:18 PM

#

Please print the tokens at line 442 of flash-linear-attention/fla/models/rwkv7/modeling_rwkv7.py to see if BOS token is properly added.

#

token_ids = input_ids.flatten().tolist()
print(token_ids)

nova frost Mar 10, 2025, 12:31 PM

#

keen tartan Setting `adapter.custom_prefix_token_id = None` or not changes results a tiny bi...

looks like custom_prefix_token is only used for loglikelihood_rolling (perplexity) tasks right now (like wikitext)

gusty condor Mar 10, 2025, 12:33 PM

#

RWKV7-G1-0.1B drops 1% (49.1% -> 48.1%) without [0] token for lambada_openai

keen tartan Mar 10, 2025, 12:35 PM

#

gusty condor RWKV7-G1-0.1B drops 1% (49.1% -> 48.1%) without `[0]` token for `lambada_openai`

I am getting 0.4898 right now (so ~ 49.0%)

gusty condor Mar 10, 2025, 12:38 PM

#

Are you using g1

keen tartan Mar 10, 2025, 12:38 PM

#

yes

gusty condor Mar 10, 2025, 12:38 PM

#

You converted the model to FLA format?

keen tartan Mar 10, 2025, 12:38 PM

#

No. Not yet.

#

Just RWKV7 pth models directly with adapters like Blink and HowardHou.

gusty condor Mar 10, 2025, 12:41 PM

#

keen tartan I am getting 0.4898 right now (so ~ 49.0%)

Use fp32 for more accurate results

keen tartan Mar 10, 2025, 12:41 PM

#

I did set the strategy to use fp32

#

strategy = 'cuda fp32'

#

@nova frost Here is the error I get if not setting adapter.custom_prefix_token_id = None:
```
/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py in loglikelihood(self, requests, disable_tqdm)
    361 # BOS or EOS as context
    362 context_enc, continuation_enc = (
--> 363 [self.prefix_token_id],
    364 self.tok_encode(continuation),
    365 )

/usr/local/lib/python3.10/dist-packages/lm_eval/models/huggingface.py in prefix_token_id(self)
    360 def prefix_token_id(self):
    361 # it is used as prefix for loglikelihood
--> 362 if self.custom_prefix_token_id is not None:
    363 return self.custom_prefix_token_id
    364 if self.tokenizer.bos_token_id is not None:
```

Full traceback here: https://huggingface.co/spaces/hevok/evals/blob/main/errors/custom_prefix_token_id.txt

nova frost Mar 10, 2025, 12:53 PM

#

keen tartan <@328142664476131330> Here is the error I get if not setting `adapter.custom_pre...

looks like its to do with EvalHarnessAdapter

keen tartan Mar 10, 2025, 12:53 PM

#

Yes

#

It should be the BOS token there, right?

#

In RWKV we have eos_token_id == bos_token_id == 0, I suppose.

nova frost Mar 10, 2025, 12:57 PM

#

yeah, should be fine i think as long as tokenizer.bos_token_id == 0

#

it's added through tokenizer.encode(string, add_special_tokens=add_bos_token)

keen tartan Mar 10, 2025, 12:59 PM

#

I think we only specify tokenizer.eos_token_id = 0 in the tokenizer wrapper right now. Gonna set the bos_token_id there too.

gusty condor Mar 10, 2025, 12:59 PM

#

RWKV7 adapter code, without [0]:

{
  "lambada_openai": {
    "perplexity,none": 13.835651924377974,
    "perplexity_stderr,none": 0.4269067454771951,
    "acc,none": 0.4812730448282554,
    "acc_stderr,none": 0.006961090021795178,
    "alias": "lambada_openai"
  }
}

RWKV7 adapter code, with [0]:

{
  "lambada_openai": {
    "perplexity,none": 12.362614971985607,
    "perplexity_stderr,none": 0.36913900917528986,
    "acc,none": 0.4913642538327188,
    "acc_stderr,none": 0.006964938588638406,
    "alias": "lambada_openai"
  }
}

RWKV7 FLA, without [0]:

{
  "lambada_openai": {
    "perplexity,none": 13.835802857719031,
    "perplexity_stderr,none": 0.4368222446505151,
    "acc,none": 0.4814671065398797,
    "acc_stderr,none": 0.00696119082972564,
    "alias": "lambada_openai"
  }
}

RWKV7 FLA, with [0] (code hacking):

{
  "lambada_openai": {
    "perplexity,none": 12.364938863860012,
    "perplexity_stderr,none": 0.3773660539093024,
    "acc,none": 0.49117019212109453,
    "acc_stderr,none": 0.006964891360529564,
    "alias": "lambada_openai"
  }
}
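The four runs above can be compared side by side with a few lines of Python. The numbers are copied from the JSON results above; the dictionary layout and the `bos_effect` helper are only for illustration.

```python
# (implementation, BOS [0] prepended?) -> metrics copied from the JSON results above
RESULTS = {
    ("adapter", False): {"perplexity": 13.835651924377974, "acc": 0.4812730448282554},
    ("adapter", True): {"perplexity": 12.362614971985607, "acc": 0.4913642538327188},
    ("fla", False): {"perplexity": 13.835802857719031, "acc": 0.4814671065398797},
    ("fla", True): {"perplexity": 12.364938863860012, "acc": 0.49117019212109453},
}


def bos_effect(impl: str) -> dict:
    """Accuracy gain (percentage points) and relative perplexity drop from adding [0]."""
    without, with_bos = RESULTS[(impl, False)], RESULTS[(impl, True)]
    return {
        "acc_gain_pp": 100 * (with_bos["acc"] - without["acc"]),
        "ppl_drop_pct": 100 * (without["perplexity"] - with_bos["perplexity"])
        / without["perplexity"],
    }


for impl in ("adapter", "fla"):
    effect = bos_effect(impl)
    print(f"{impl}: +{effect['acc_gain_pp']:.2f} pp accuracy, "
          f"-{effect['ppl_drop_pct']:.1f}% perplexity")
```

Both implementations move by about the same amount when [0] is added, which is consistent with the gap coming from the prompt format rather than from the conversion to FLA.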

keen tartan Mar 10, 2025, 1:00 PM

#

Oh, that is indeed a significant impact!!!

gusty condor Mar 10, 2025, 1:02 PM

#

I added this line at line 440 for https://github.com/fla-org/flash-linear-attention/blob/main/fla/models/rwkv7/modeling_rwkv7.py#L440:

input_ids = torch.cat((torch.tensor([[0]], dtype=input_ids.dtype, device=input_ids.device), input_ids), dim=1)

keen tartan Mar 10, 2025, 1:16 PM

#

Great find. We need to fix this issue.

#

I am not able to reproduce the RWKV7 adapter code results for RWKV7 G0. As baseline I get: { "lambada_openai": { "perplexity,none": 12.596602010153171, "perplexity_stderr,none": 0.3822718659650309, "acc,none": 0.48961769842810016, "acc_stderr,none": 0.006964475739361981, "alias": "lambada_openai" } }

#

Hold on

#

I prepare some code.

#

https://colab.research.google.com/drive/1Wywmv_c5rBjrcytLnf4ZWPrskJayBxBZ?usp=sharing

gusty condor Mar 10, 2025, 1:21 PM

#

Use this
RWKV_PAD = [0] # you can try using [0] as pad

keen tartan Mar 10, 2025, 1:21 PM

#

Ahhh

#

That's it!!!

gusty condor Mar 10, 2025, 1:21 PM

#

@keen tartan FLA model is here: https://huggingface.co/fla-hub/rwkv7-0.1B-g1/

keen tartan Mar 10, 2025, 1:22 PM

#

By default it uses RWKV_PAD = pipeline.tokenizer.encode('\n')

#

What about the STOP_TOKEN?

#

Default is STOP_TOKEN = RWKV_PAD + pipeline.tokenizer.encode('\n\n')

dawn pewter Mar 10, 2025, 1:28 PM

#

Fun fact: k_k can even be more than 4 and less than -4

obsidian quest Mar 10, 2025, 1:31 PM

#

dawn pewter Fun fact: k_k can even be more than 4 and less than -4

yeah that's why it's useful

dawn pewter Mar 10, 2025, 1:32 PM

#

$\xi$ is a learned parameter representing the removal key multiplier, which transforms the original key into a version to be removed from the state.

This is the description in the paper, which may leave the reader a little confused, why can the removal key multiplier even be greater than 1 and less than 0

silent urchinBOT Mar 10, 2025, 1:32 PM

#

Kaguya

keen tartan Mar 10, 2025, 1:37 PM

#

@gusty condor Getting similar but not identical results now for RWKV7 EvalHarnessAdapter with PAD = [0]:
```
{
  "lambada_openai": {
    "perplexity,none": 12.364936956333898,
    "perplexity_stderr,none": 0.3764505210612126,
    "acc,none": 0.49117019212109453,
    "acc_stderr,none": 0.006964891360529504,
    "alias": "lambada_openai"
  }
}
```

gusty condor Mar 10, 2025, 1:39 PM

#

An error of 0.02% is not significant at all.
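The 0.02% figure can be checked against the numbers posted above. The message does not say whether it refers to accuracy or perplexity, so this sketch computes both; the variable names are illustrative.

```python
# lambada_openai with [0] prepended: the earlier "RWKV7 adapter code, with [0]" run
# versus the EvalHarnessAdapter rerun with PAD = [0] (values copied from above).
reference = {
    "perplexity": 12.362614971985607,
    "acc": 0.4913642538327188,
    "acc_stderr": 0.006964938588638406,
}
rerun = {"perplexity": 12.364936956333898, "acc": 0.49117019212109453}

# Accuracy gap in percentage points and perplexity gap relative to the reference.
acc_gap_pp = 100 * abs(reference["acc"] - rerun["acc"])
ppl_gap_pct = 100 * abs(reference["perplexity"] - rerun["perplexity"]) / reference["perplexity"]

# The same accuracy gap expressed in units of the reported standard error.
gap_in_stderr = abs(reference["acc"] - rerun["acc"]) / reference["acc_stderr"]

print(f"accuracy gap:   {acc_gap_pp:.3f} percentage points")
print(f"perplexity gap: {ppl_gap_pct:.3f} % relative")
print(f"accuracy gap / stderr: {gap_in_stderr:.3f}")
```

Either way the gap is a small fraction of the reported standard error of the accuracy.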

keen tartan Mar 10, 2025, 1:39 PM

#

gusty condor An error of 0.02% is not significant at all.

Yeah, I think so too.

dawn pewter Mar 10, 2025, 1:47 PM

#

The average value of k_k is still between 0.7 and 0.8

keen tartan Mar 10, 2025, 1:51 PM

#

@gusty condor I noticed you have been using STOP_TOKEN = [535] which will be decoded as +). Is there a specific reason for this choice?
But wait, that is for Pile models! I wonder what the proper value would be for the World tokenizer/models; \n\n might terminate long-form generations.

#

@nova frost Setting tokenizer.bos_token_id = 0 still raises the exception. I'll try setting adapter.custom_prefix_token_id = 0 instead. Hope this makes sense.

nova frost Mar 10, 2025, 1:59 PM

#

keen tartan <@328142664476131330> Even when setting the `tokenizer.bos_token_id = 0` still r...

yeah should be fine. custom_prefix_token_id isn't even used in any of the eval tasks

keen tartan Mar 10, 2025, 2:00 PM

#

nova frost yeah should be fine. `custom_prefix_token_id` isn't even used in any of the eval...

All right. Thanks!

misty igloo Mar 10, 2025, 2:03 PM

#

Sorry guys, got sick and may not be much help for the next two days. I'll try to put in an updated flops plot soon.

keen tartan Mar 10, 2025, 2:04 PM

#

misty igloo Sorry guys, got sick and may not be much help for the next two days. I'll try to...

Please rest and take care of your health. It is most important!

gusty condor Mar 10, 2025, 2:04 PM

#

keen tartan <@803473343705514025> I noticed you have been using `STOP_TOKEN = [535]` which w...

That is unused in the code for evaluating lambada_openai, piqa, mmlu et al.

keen tartan Mar 10, 2025, 2:06 PM

#

gusty condor That is unused in the code for evaluating `lambada_openai, piqa, mmlu` et al.

That is a relief, but for other tasks we should set it to reasonable values.

gusty condor Mar 10, 2025, 2:06 PM

#

[261] for '\n\n' in rwkv_vocab_v20230424

keen tartan Mar 10, 2025, 2:09 PM

#

@misty igloo In the case of infections, try to consume high amounts of fruits and berries (things that are rich in vitamin C) as well as consider supplementing zinc. Get well soon!

gusty condor Mar 10, 2025, 2:11 PM

#

Well? One bottle of vitamin C (100 tablets, 100mg x 100) costs only $0.5 in China.

keen tartan Mar 10, 2025, 2:13 PM

#

gusty condor Well? One bottle of vitamin C (100 tablets, 100mg x 100) costs only $0.5 in Chin...

That is a good deal. It is however recommended to also get vitamins from natural food sources as there is other stuff inside that increases bioavailability. It is extremely difficult to overdose on Vitamin C as it is extremely water-soluble. So consuming it from both food and supplements is fine.

gusty condor Mar 10, 2025, 3:14 PM

#

dawn pewter The average value of k_k is still between 0.7 and 0.8

Use smaller plot and relatively larger font size

#

add w_0 too

brisk bronze Mar 11, 2025, 1:10 AM

#

gusty condor RWKV7 adapter code, without `[0]`: ```json { "lambada_openai": { "perplexi...

Implemented your fix in the fla code...

Replicated your RWKV7 FLA 0.1B-G1 with code hacking results:

```
{
  "lambada_openai":{
      "perplexity,none": 12.364936711373161,
      "perplexity_stderr,none": 0.37736600715379043,
      "acc,none": 0.49117019212109453,
      "acc_stderr,none": 0.006964891360529504
  }
}
```

BUT RWKV7 FLA 1.5B-World with the same code hack gets much higher results than in the paper:
```
{
  "lambada_openai":{
      "perplexity,none": 4.136933117540389,
      "perplexity_stderr,none": 0.0886568308581175,
      "acc,none": 0.6931884339219871,
      "acc_stderr,none": 0.006425006782127488
  }
}
```

RWKV7 1.5B-World (with adapter I assume) in the paper:
```
{
  "lambada_openai":{
      "perplexity,none": 3.4,
      "acc,none": 0.483
  }
}
```

#

According to this issue, pip rwkv7-1.5B-world gets 0.6931 on lambada_openai
https://github.com/fla-org/flash-linear-attention/issues/198

obsidian quest Mar 11, 2025, 3:02 AM

#

correct setup for MQAR https://openreview.net/pdf?id=CcqAd5RPk5

gusty condor Mar 11, 2025, 4:54 AM

#

https://arxiv.org/pdf/2503.06121
They used my figure (figure 2) with neither citation nor my consent!

dawn pewter Mar 11, 2025, 9:39 AM

#

interesting, k_a can take on big values like 13, -20

#

As models grow larger, the average k_k looks like going up, while k_a seems to trend downward

keen tartan Mar 11, 2025, 11:16 AM

#

brisk bronze Implemented your fix in the fla code... Replicated your RWKV7 FLA 0.1B-G1 with ...

For RWKV7 World 1.5B via HarnessAdapter, I get the following results depending on the specified PAD token IDs, using the Jupyter notebook I provided:

RWKV_PAD = [11] (tokenizer encoded '\n'): https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-1.5B-v3-20250127-ctx4096/lambada_openai_2025-03-11T10-50-30.361472.json

"lambada_openai": {
  "perplexity,none": 4.174870788924788,
  "perplexity_stderr,none": 0.09003244838599012,
  "acc,none": 0.6951290510382302,
  "acc_stderr,none": 0.006413613926848405,
}

RWKV_PAD = [0] (only special token in World tokenizer, often denoted as '<|endoftext|>' or '<EOS>' ): https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-1.5B-v3-20250127-ctx4096/lambada_openai_2025-03-11T11-04-07.000221.json

"lambada_openai": {
  "perplexity,none": 4.133062406363742,
  "perplexity_stderr,none": 0.08879176331605698,
  "acc,none": 0.6933824956336115,
  "acc_stderr,none": 0.006423873526429436,
}

obsidian quest Mar 11, 2025, 12:27 PM

#

move Figure 7 to section 3 (Architecture) because it highlights the limits of attention & mamba

gusty condor Mar 11, 2025, 12:53 PM

#

I selected a subset of Lambada (142 problems) that satisfies these requirements:

1. The answer is the first word;
2. the first word does not appear again in the middle of the text.

The differences are very significant:
v7 0.1B world 2.8:
No padding: ppl=357 acc=9.15
padding with [0]: ppl=16.4 acc=36.6
padding with [0,0]: ppl=10.7 acc=43.7
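A rough sketch of how such a subset could be selected. The exact matching rule used for the 142 problems is not stated, so this assumes whole-word comparison after stripping surrounding punctuation (so "Cooper's" does not count as a repeat of "Cooper"); the function name `is_first_word_recall` is hypothetical.

```python
import string
from typing import List

# Punctuation stripped from both ends of a word, including curly quotes.
_STRIP_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019"


def is_first_word_recall(text: str) -> bool:
    """True if the last word (the LAMBADA answer) equals the first word
    and that word does not occur anywhere in between."""
    words: List[str] = [w.strip(_STRIP_CHARS) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    if len(words) < 3:
        return False
    first, middle, answer = words[0], words[1:-1], words[-1]
    return answer == first and first not in middle


examples = [
    "Beth smiled. \u201cIt\u2019s so nice to meet you, Beth",
    "Beth smiled at Beth. \u201cIt\u2019s so nice to meet you, Beth",
    "Seth laughed. \u201cI still don\u2019t care, Mathews",
]
subset = [text for text in examples if is_first_word_recall(text)]
print(len(subset), "of", len(examples), "examples selected")
```

On such a subset the model can only answer correctly by recalling the very first token it received, which is what makes the padding comparison above so sharp.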

#

This is so significant that it is worth writing up in the paper

#

Examples:

{"text": "Beth smoothed her wiry half-black, half-gray hair from her makeup-free face. In New Mexico, the natural look was common. Standing next to Cindy Fanucci, she felt like a disaster. She hid her ragged nails under the sleeves of her sweatshirt.\n\u201cHi, I\u2019m Cindy. It\u2019s so nice to meet you, Beth"}
{"text": "Cooper groaned, and his body sagged back.\n\n\"You weren't supposed to be first,\" Deuce snarled as he lifted the gun and took aim at Cooper's prone form. \"But if that's the way you want it, old buddy...\"\n\n\"No!\" Gabrielle threw her body forward and wrapped her arms around Cooper"}

bronze frost Mar 11, 2025, 1:18 PM

#

I've been discussing the expressivity of RWKV-7 behind the scenes with @misty igloo , @dawn pewter and William Merrill, and

we finally have a proof that RWKV-7 can recognize any regular language!

This is significantly stronger than our prior claims, and doesn't rely on assumptions such as c = 2, "multi-step computations", or a special BOS token. This result clearly motivates our use of a data-dependent and elementwise ICLR "a". Prior works could only simulate permutation DFAs, while we can simulate general DFAs, because of this "a". RWKV-7 might be the first model to use diagonal + low-rank updates, and still be able to recognize regular languages.

The proof is a bit involved (~4 pages, added as Appendix E), but I tried to write it in a way where the core ideas appear early, and the complicated details appear later. A core insight is that multiple layers are needed. Numerical experiments indicate that 2 layers should be enough, but my construction uses 4 layers to simplify the proof.

There were some interesting insights from the proof of simulating DFAs with RWKV-7:

1. Because "a" is applied on the right instead of the left, we actually simulate the reversed DFA (the DFA which recognizes the reversed language). EDIT: Sorry, this was actually incorrect, it is "a" on the left which simulates backwards. Thanks Merrill for finding this mistake.
2. For DFA simulation, we often want to extract a single row of the wkv state. But because the receptance "r" is applied on the right instead of the left of the state, the readout requires simulating many identical wkv heads, where each head reads out a single element of the wkv state.
3. For DFA simulation, we do not need element-wise control or data-dependence for "w".
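For readers less familiar with the terminology, the difference between a permutation DFA and a general DFA can be seen in a toy example. This is only the textbook picture of a DFA as a product of 0/1 transition matrices, not the construction from Appendix E and not the RWKV-7 state update; the language ("contains the substring ab") and all names are chosen for illustration.

```python
from typing import Dict, List

Matrix = List[List[int]]

N_STATES = 3      # 0: start, 1: last symbol was 'a', 2: "ab" already seen
ACCEPTING = {2}

# DELTA[symbol][state] -> next state
DELTA: Dict[str, Dict[int, int]] = {
    "a": {0: 1, 1: 1, 2: 2},
    "b": {0: 0, 1: 2, 2: 2},
}


def transition_matrix(delta: Dict[int, int]) -> Matrix:
    """0/1 matrix with m[src][dst] = 1 for every transition src -> dst."""
    m = [[0] * N_STATES for _ in range(N_STATES)]
    for src, dst in delta.items():
        m[src][dst] = 1
    return m


MATRICES = {sym: transition_matrix(d) for sym, d in DELTA.items()}


def is_permutation(m: Matrix) -> bool:
    """True if every column also contains exactly one 1 (rows always do)."""
    return all(sum(row[j] for row in m) == 1 for j in range(N_STATES))


def step(state: List[int], m: Matrix) -> List[int]:
    """One recurrence step: one-hot row vector times transition matrix."""
    return [sum(state[i] * m[i][j] for i in range(N_STATES)) for j in range(N_STATES)]


def accepts(text: str) -> bool:
    state = [1, 0, 0]  # one-hot encoding of the start state
    for sym in text:
        state = step(state, MATRICES[sym])
    return any(state[s] for s in ACCEPTING)


print(accepts("bbab"), accepts("ba"))
print({sym: is_permutation(m) for sym, m in MATRICES.items()})
```

Here neither transition matrix is a permutation: reading a symbol can merge two states into one, which is exactly the kind of transition that a model restricted to permutation DFAs cannot represent.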

obsidian quest Mar 11, 2025, 1:24 PM

#

gusty condor I selected a subset of Lambada (142 problems) that satisfies these requirements:...

i think its because of tokenizer, and the effect is only large for tiny models

obsidian quest Mar 11, 2025, 1:34 PM

#

bronze frost I've been discussing the expressivity of RWKV-7 behind the scenes with <@1007072...

great work 🙂 do you have suggestions for increasing rwkv7 expressivity

dawn pewter Mar 11, 2025, 1:36 PM

#

bronze frost I've been discussing the expressivity of RWKV-7 behind the scenes with <@1007072...

amazing! I'll read it later!

bronze frost Mar 11, 2025, 1:39 PM

#

yeah, I thought lemma 3 would be a "well known" result, but I couldn't find a reference, so I cooked up a construction myself. If you can find a simpler proof without requiring the reader to know graph theory terms, that would be great.

bronze frost Mar 11, 2025, 1:47 PM

#

obsidian quest great work 🙂 do you have suggestions for increasing rwkv7 expressivity

Recognizing regular languages is already very strong, things beyond that are usually clearly impossible in constant time per token. For example, some NC1 problems require linearly growing state size in the sequence length. However, the current construction uses huge state sizes and lookup tables of size vocabulary^(DFA states), which probably limits which regular languages can be simulated in practice.
Points 1. and 2. above indicate that we might want to experiment with readout on the "value" dimension of the state, even though this breaks the intuition from linear transformers. And maybe also apply "a" on the left side.
My way to avoid c = 2 is based on the group normalization immediately after the wkv heads, there might exist other/better normalizations of the state which could also improve performance (like how rwkv-6c normalization was great).

misty igloo Mar 11, 2025, 2:05 PM

#

yeah I still maintain that a more balanced construction of the overall formula with implicit normalization has the potential to improve performance

#

but I'm not sure whether that will improve or harm these regular language abilities
gotta wait a day or so until my brain works fully again to think about it 🤣

bronze frost Mar 11, 2025, 2:21 PM

#

@obsidian quest In summary, regular languages basically include everything we can reasonably do, and we can already technically solve regular languages (so we can do state tracking and basically what classical RNNs can do). However, the way we currently simulate them can be very inefficient. So most further improvement in expressivity probably comes from decreasing the required number of heads / head size / precision / etc.
A practical limitation on the expressivity of RWKV-7 wkv heads is that it applies all vectors to the "key" dimension of the state. This makes the slots in the "value" dimension independent. Some mixing also in the value dimension could potentially make the wkv heads more powerful (while making parallelization a bit more tricky 🙂 ).

dawn pewter Mar 11, 2025, 2:45 PM

#

bronze frost I've been discussing the expressivity of RWKV-7 behind the scenes with <@1007072...

Am I correct in understanding that, rather than aiming to simulate each individual step of an arbitrary DFA, we are now adopting a block-wise approach where we simulate the corresponding n-step emulation results of the DFA at every n-step interval?

gusty condor Mar 11, 2025, 2:47 PM

#

gusty condor Examples: ```json {"text": "Beth smoothed her wiry half-black, half-gray hair fr...

OK I am testing Qwen-0.5B and SmolLM, the results make a difference but not statistically significant, p=0.09

misty igloo Mar 11, 2025, 3:21 PM

#

dawn pewter Am I correct in understanding that, rather than aiming to simulate each individu...

I haven't read the final proof yet, but the idea from a couple of days ago was that all size-n blocks up until the final 2n-1 tokens would be deferred by one block size (think 'pipelining') and evaluated in a non-block manner, as a deferred set of per-token elementary matrices

Then, the final block is done block-wise so that it does not require deferral (and therefore no extra tokens are required)

[update: looks like icecuber simplified it to 2n tokens instead of 2n-1, but seems like otherwise same idea]

keen tartan Mar 11, 2025, 4:19 PM

#

@gusty condor I saw you pushed the missing RWKV6 World 3B multilingual results. Thx! We appear to still miss RWKV7 World 1.5B/2.9B results files for lambada_openai, hellaswag, piqa, arc_easy, arc_challenge, winogrande, and sciq. Please check.

keen tartan Mar 11, 2025, 4:39 PM

#

For future runs I suggest also outputting a bit more metadata for better reproducibility, including the lm_eval version and special token IDs:

from importlib.metadata import version
# ...
output_dict = dict(
    model=MODEL_NAME,
    tasks=eval_tasks,
    num_fewshot=num_fewshot,
    lm_eval_version=version('lm_eval'),
    bos_token_id=adapter.tokenizer.bos_token_id,
    eos_token_id=adapter.tokenizer.eos_token_id,
    custom_prefix_token_id=adapter.custom_prefix_token_id,
    pad_token_ids=RWKV_PAD,
    stop_token_ids=STOP_TOKEN,
    results=results['results']
)
#...

Note: I added bos_token_id to the TokenizerWrapper and am currently assigning adapter.custom_prefix_token_id = RWKV_PAD[0].

# ...
class TokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.bos_token_id = 0
        self.eos_token_id = 0
# ...
adapter = EvalHarnessAdapter()
adapter.custom_prefix_token_id = RWKV_PAD[0]
# ...

misty igloo Mar 11, 2025, 5:21 PM

#

keen tartan For future runs I suggest to also output a bit more metadata for better reproduc...

@brisk bronze did you guys figure out why add_bos_token=True is not working when calling lm eval harness from cmdline?

#

we shouldn't need an adapter or any of this stuff

#

should be able to just run lm eval from cmdline and have it work

gusty condor Mar 11, 2025, 5:52 PM

#

No idea. I am not a maintainer of lm-eval

young sparrow Mar 11, 2025, 5:58 PM

#

misty igloo <@533592838529744917> did you guys figure out why add_bos_token=True is not work...

Did someone actually check that this doesn't work

misty igloo Mar 11, 2025, 6:34 PM

#

gusty condor No idea. I am not a maintainer of lm-eval

Well @nova frost is, so @brisk bronze could you try to work with him to figure out what's going on here

brisk bronze Mar 11, 2025, 6:37 PM

#

young sparrow Did someone actually check that this doesn't work

yes, this doesn't work: adding add_bos_token=True to model_args doesn't improve performance (the result is the same)

#

both in 0.4.3 and 0.4.7

nova frost Mar 11, 2025, 6:45 PM

#

misty igloo Well <@328142664476131330> is, so <@533592838529744917> could you try to work wi...

happy to help out, but my understanding was that y'all are using a custom model adapter?

brisk bronze Mar 11, 2025, 6:45 PM

#

nova frost happy to help out, but my understanding was that y'all are using a a custom mode...

we were using lm-eval 0.4.3

#

this is the output when I do add_bos_token=True, and the 0's are not prepended.
[6624, 31220, 32227, 28471, 98, 22748, 332, 45675, 40240, 22068, 45, 32227, 4569, 22748, 46896, 39867, 21265, 56287, 32487, 21811, 47, 20147, 1853, 31220, 32232, 21795, 332, 21365, 4706, 332, 51929, 45, 21400, 22464, 53137, 32227, 32234, 32251, 22464, 40219, 21751, 37678, 45301, 32487, 21820, 45, 32227, 22464, 38128, 56798, 39944, 56939, 47, 265, 24241, 46372, 45, 21265, 22464, 31220, 32227, 22464, 38897, 28471, 98, 45, 21400, 32230, 29042, 1950, 59901, 101, 47, 20885, 22748, 4788, 37704, 32227, 22464, 63613, 31929, 45, 31578, 32232, 32462, 22249, 4706, 30721, 7714, 4706, 28471, 98, 45, 22903, 6833, 21556, 28310, 31391, 55521, 4811, 4660, 50931, 45, 29042, 35, 23978, 22226, 47342, 308, 31223, 4706, 32227, 40219, 4435, 46439, 4811, 38127, 4833, 45865, 39052, 4811, 51745, 332, 31670, 122, 39712, 57103, 56198, 21265, 22269, 46787, 21357, 37921, 4811, 39740, 45619, 4596, 39931, 45793, 544, 19878, 26846, 59350, 21569, 2030, 47, 261, 35, 24326, 21413, 308, 22441, 799, 36623, 57486, 21823, 30180, 21265, 53348, 21823, 38660, 47, 269, 24349, 22572, 51514, 4424, 32499, 575, 261, 40327, 19878, 26846, 38128, 52732, 45, 20276, 2007, 460, 40139, 31901, 30917, 46301, 22590, 38717, 47, 261, 35, 1297, 59888, 799, 22464, 4855, 25779, 47, 269, 24326, 4491, 22799, 31391, 22799, 21556, 461, 31059, 21273, 0, 0, 0, 24043, 8828, 21795, 30259, 22590, 31254, 46795, 4811, 32451, 39944, 45447, 45,

#

the output looks the same when I set add_bos_token=False

[6624, 31220, 32227, 28471, 98, 22748, 332, 45675, 40240, 22068, 45, 32227, 4569, 22748, 46896, 39867, 21265, 56287, 32487, 21811, 47, 20147, 1853, 31220, 32232, 21795, 332, 21365, 4706, 332, 51929, 45, 21400, 22464, 53137, 32227, 32234, 32251, 22464, 40219, 21751, 37678, 45301, 32487, 21820, 45, 32227, 22464, 38128, 56798, 39944, 56939, 47, 265, 24241, 46372, 45, 21265, 22464, 31220, 32227, 22464, 38897, 28471, 98, 45, 21400, 32230, 29042, 1950, 59901, 101, 47, 20885, 22748, 4788, 37704, 32227, 22464, 63613, 31929, 45, 31578, 32232, 32462, 22249, 4706, 30721, 7714, 4706, 28471, 98, 45, 22903, 6833, 21556, 28310, 31391, 55521, 4811, 4660, 50931, 45, 29042, 35, 23978, 22226, 47342, 308, 31223, 4706, 32227, 40219, 4435, 46439, 4811, 38127, 4833, 45865, 39052, 4811, 51745, 332, 31670, 122, 39712, 57103, 56198, 21265, 22269, 46787, 21357, 37921, 4811, 39740, 45619, 4596, 39931, 45793, 544, 19878, 26846, 59350, 21569, 2030, 47, 261, 35, 24326, 21413, 308, 22441, 799, 36623, 57486, 21823, 30180, 21265, 53348, 21823, 38660, 47, 269, 24349, 22572, 51514, 4424, 32499, 575, 261, 40327, 19878, 26846, 38128, 52732, 45, 20276, 2007, 460, 40139, 31901, 30917, 46301, 22590, 38717, 47, 261, 35, 1297,

misty igloo Mar 11, 2025, 6:46 PM

#

nova frost happy to help out, but my understanding was that y'all are using a a custom mode...

Janna's tests were with no adapter, just cmdline

#

But she'll have to describe the exact details of what was run and how- I didn't run it myself

brisk bronze Mar 11, 2025, 6:48 PM

#

afaik rwkv7 were run with the adapter code because running fla converted rwkv in lm-eval had degraded results

#

this is the command I used: lm_eval --model hf --model_args pretrained=fla-hub/rwkv7-1.5B-world,trust_remote_code=True,add_bos_token=True,dtype=float32 --tasks lambada_openai --batch_size 8 --output_path /workspace/lm-evaluation-harness/results
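For scripted reruns, the same command can be assembled from its parts. This is only a convenience sketch; the helper name `build_lm_eval_command` is hypothetical, and the flags and values are exactly the ones in the command above.

```python
import shlex
from typing import Dict, List


def build_lm_eval_command(
    model_args: Dict[str, object],
    tasks: List[str],
    batch_size: int,
    output_path: str,
) -> List[str]:
    """Assemble an lm_eval command line for the hf model type as an argv list."""
    # model_args is passed as one comma-separated key=value string.
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", joined_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]


cmd = build_lm_eval_command(
    model_args={
        "pretrained": "fla-hub/rwkv7-1.5B-world",
        "trust_remote_code": True,
        "add_bos_token": True,
        "dtype": "float32",
    },
    tasks=["lambada_openai"],
    batch_size=8,
    output_path="/workspace/lm-evaluation-harness/results",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, and flipping add_bos_token in the dictionary gives the comparison run with everything else held fixed.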

misty igloo Mar 11, 2025, 6:48 PM

#

brisk bronze afaik rwkv7 were run with the adapter code because running fla converted rwkv in...

But aren't we discussing your tests of rwkv7 FLA via cmdline lmeval?

nova frost Mar 11, 2025, 6:49 PM

#

oh, then I should have notified you of this: https://github.com/EleutherAI/lm-evaluation-harness/pull/2781

#

basically some HF tokenizers need to be initialized with add_bos_token=True

brisk bronze Mar 11, 2025, 6:50 PM

#

nova frost basically _some_ HF tokenizers need to be initialized with `add_bos_token=True`

but the performance for fla-rwkv7 was the same with add_bos_token=True vs. False. I can test again

nova frost Mar 11, 2025, 6:50 PM

#

what tokenizer are you using?

#

I just merged it today

brisk bronze Mar 11, 2025, 6:51 PM

#

nova frost what tokenizer are you using?

rwkv7 world tokenizer

#

ok I'll run the tests again

misty igloo Mar 11, 2025, 6:52 PM

#

nova frost what tokenizer are you using?

https://huggingface.co/fla-hub/rwkv7-2.9B-world/blob/main/tokenizer_config.json

tokenizer_config.json · fla-hub/rwkv7-2.9B-world at main

nova frost Mar 11, 2025, 6:55 PM

#

yeah. this wouldn't add the bos before:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("fla-hub/rwkv7-1.5B-world")
tokenizer.encode("hello", add_special_tokens=True)
# [34550]

initializing with tokenizer = AutoTokenizer.from_pretrained("fla-hub/rwkv7-1.5B-world", add_bos_token=True) works properly

brisk bronze Mar 11, 2025, 6:56 PM

#

nova frost I just merged it today

looks like it's prepending the 0 now!
[0, 6624, 31220, 32227, 28471, 98, 22748, 332,

misty igloo Mar 11, 2025, 6:58 PM

#

Well that finally resolves an issue that's only taken a year to figure out 🤣

brisk bronze Mar 11, 2025, 7:01 PM

#

eval results match up too

```
{
    "lambada_openai": {
      "alias": "lambada_openai",
      "perplexity,none": 4.136982815151818,
      "perplexity_stderr,none": 0.08865873813063398,
      "acc,none": 0.6931884339219871,
      "acc_stderr,none": 0.006425006782127488
    }
}
```

fringe egret Mar 12, 2025, 12:43 AM

#

Has anyone managed to successfully run the 128k fine-tuning version? We're encountering conflicts when using the world version environment.

#

https://huggingface.co/SmerkyG/RWKV7-2.9B-World3-128k-250225

SmerkyG/RWKV7-2.9B-World3-128k-250225 · Hugging Face

misty igloo Mar 12, 2025, 2:46 AM

#

fringe egret Has anyone managed to successfully run the 128k fine-tuning version? We're encou...

What kind of conflicts? Sorry, not sure what you mean by the world version environment... but @brisk bronze used the redone 1.5b version recently. Do I need to update the 2.9b for some reason like a change to the FLA repo?

#

Does this one https://huggingface.co/SmerkyG/RWKV7-1.5B-World3-128k-250309 work for you?

SmerkyG/RWKV7-1.5B-World3-128k-250309 · Hugging Face

fringe egret Mar 12, 2025, 3:00 AM

#

python: /project/Lib/Tools/LinearLayout.cpp:562: mlir::triton::LinearLayout mlir::triton::LinearLayout::reshapeOuts(llvm::ArrayRef<std::pair<mlir::StringAttr, int>>) const: Assertion `getTotalOutDimSize() == std::accumulate(newOutDims.begin(), newOutDims.end(), 1, [&](int32_t acc, auto &outDim) { return acc * outDim.second; })' failed.

fringe egret Mar 12, 2025, 3:02 AM

#

misty igloo Does this one https://huggingface.co/SmerkyG/RWKV7-1.5B-World3-128k-250309 work ...

I'll try this, thank you.

gusty condor Mar 12, 2025, 3:45 AM

#

fringe egret python: /project/Lib/Tools/LinearLayout.cpp:562:mlir::triton::LinearLayout mlir:...

Please send an issue to FLA

random granite Mar 12, 2025, 3:56 AM

#

fringe egret python: /project/Lib/Tools/LinearLayout.cpp:562:mlir::triton::LinearLayout mlir:...

use triton nightly

obsidian quest Mar 12, 2025, 3:56 AM

#

obsidian quest move Figure 7 to section 3 (Architecture) because it highlights the limits of at...


2. for Figure [FLOPs vs. Average Benchmark Accuracy], add [active params vs avg acc] too

gusty condor Mar 12, 2025, 3:58 AM

#

random granite use triton nightly

This is the problem: Installing some new package may override the triton-nightly installation with triton 3.2.0. So it is better to have the code work properly for triton 3.2.0 and later versions.

random granite Mar 12, 2025, 4:16 AM

#

install from scratch or wait for the next version

#

https://github.com/triton-lang/triton/issues/5609

GitHub

Assertion failure in LinearLayoutConversions on H100s when num_warp...

Describe the bug I'm getting many errors related to Linear Layouts when num_wraps=8. After commit e57b468 python: triton/lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp:1008: mlir::triton:...

#

this is triton's bug. I won't work around it because a large number of warps is crucial for performance

gusty condor Mar 12, 2025, 4:36 AM

#

random granite this is triton's bug. I won't fix it because large number of warp is crucial for...

I see. You have already uninstalled Triton so I shall not bother you for that.

fringe egret Mar 12, 2025, 5:52 AM

#

random granite use triton nightly

Thank you both, I'll replace the package.

fringe egret Mar 12, 2025, 5:53 AM

#

gusty condor Please send an issue to FLA

Ok, I'll bring up this issue.

keen tartan Mar 12, 2025, 12:59 PM

#

Do we have the context extended checkpoints also as non-converted HF models (i.e. normal rwkv models) available somewhere?

jovial meteor Mar 12, 2025, 1:25 PM

#

How does RWKV-7 behave past its training context length? Does state collapse still happen?

keen tartan Mar 12, 2025, 2:24 PM

#

jovial meteor How does RWKV-7 behave past its training context length? Does state collapse sti...

It can extrapolate. RWKV-7 trained with 4k context extrapolates to 32k+ #1083107245971226685 message

#

The paper currently has Long Context Experiments on the PG19 dataset as well as a single needle-in-a-haystack test.

#

There may be some kind of overfitting to short context in the World models, but not in the Pile models, as reported by @iron parrot #1103039376184852622 message

gusty condor Mar 12, 2025, 2:36 PM

#

From @paper dove :

I have a small suggestion. I saw RWKV-6C mentioned in the discussion, but people who are not familiar with the rwkv version may not understand what it means.

keen tartan Mar 12, 2025, 2:38 PM

#

gusty condor From <@1072058174552686632> : > I have a small suggestion. I saw RWKV-6C mention...

Upgraded Finch (RWKV-6c) should be explained somewhere

#

Perhaps we can refer to the GoldFinch paper for this.

#

It is mentioned in the Additional Architecture Discussion, in the Architecture Details section

#

RWKV-6c is first mentioned in Method section 4.1.1, Weight Preparation. I named it Upgraded Finch there and referred to the Additional Architecture Discussion, where it is introduced under the same name in addition to its version, for now. Hope this makes it a bit clearer.

misty igloo Mar 12, 2025, 4:44 PM

#

keen tartan Perhaps we can refer to the GoldFinch paper for this.

I don't think it's in the goldfinch paper - it was adapted from the ideas in that paper but ended up being a later designation by Blink for a RWKV6 variant that never was trained for anything much

#

It's not called Upgraded Finch anywhere in the world, so I don't think we should use that name here

keen tartan Mar 12, 2025, 4:45 PM

#

misty igloo It's not called Upgraded Finch anywhere in the world, so I don't think we should...

Ah, ok.

misty igloo Mar 12, 2025, 4:45 PM

#

iono maybe I did rename it 6c in GoldFinch? I don't think so tho - checking now

#

yeah in GoldFinch we had a version that included other changes called Finch-C2

keen tartan Mar 12, 2025, 4:47 PM

#

So we can name it Finch-C2 or maybe Finch-C1

#

??

misty igloo Mar 12, 2025, 4:47 PM

#

it's not Finch-C2, sorry

#

it's Finch-C / v6c

#

there are differences, which is why I named the goldfinch version Finch-C2

keen tartan Mar 12, 2025, 4:47 PM

#

Got it.

misty igloo Mar 12, 2025, 4:47 PM

#

blink's internal designation is x060c

keen tartan Mar 12, 2025, 4:48 PM

#

yeah

misty igloo Mar 12, 2025, 4:48 PM

#

but yeah this isn't really described in any paper other than GoldFinch, which is where the idea for it came from

#

I'll take a look and add more descriptive content around it

keen tartan Mar 12, 2025, 8:41 PM

#

Merged all lm_eval results files from benchmarks table 3 and table 4 per model. Parsed merged files, created a pandas dataframe with combined average accuracy across English and multilingual tasks, and plotted it with matplotlib.

misty igloo Mar 12, 2025, 9:28 PM

#

keen tartan Merged all lm_eval results files from benchmarks table 3 and table 4 per model. ...

i dont get it, this is just a merged combo of the existing two flops plots?
also why would you multiply tokens by params instead of calculating actual compute

keen tartan Mar 12, 2025, 9:47 PM

#

misty igloo i dont get it, this is just a merged combo of the existing two flops plots? also...

Yeah, kinda. Just tinkering around to make concrete suggestions. It was just a simple quick approximation.

#

Perhaps scaling the dot size with parameter count might make it look more informative.

#

Here I multiplied the parameter count in billions by 100 and used it as the marker size.

#

I have seen similar plots in papers where the dot size represented model size, and I liked it.

#

I also suggest adding a bit of transparency. It helps with overplotting.

#

Used alpha=0.5 above.
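
A minimal sketch of the plotting idea described above (marker size = parameters in billions times 100, alpha=0.5). The model names, FLOPs, and accuracies are placeholders for illustration, not results from the paper, and the dict keys are hypothetical; matplotlib is assumed to be installed only if `plot_accuracy` is actually called.

```python
def marker_sizes(params_billions, scale=100.0):
    """Map parameter counts (in billions) to scatter marker areas."""
    return [p * scale for p in params_billions]


def plot_accuracy(models, out_path="acc_vs_flops.png"):
    """Scatter plot of average accuracy vs. compute, one dot per model.

    `models` is a list of dicts with the hypothetical keys
    name, params_b (billions), flops, avg_acc.
    """
    import matplotlib  # imported lazily so the helper above works without it
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(
        [m["flops"] for m in models],
        [m["avg_acc"] for m in models],
        s=marker_sizes([m["params_b"] for m in models]),
        alpha=0.5,  # transparency reduces overplotting
    )
    for m in models:
        ax.annotate(m["name"], (m["flops"], m["avg_acc"]))
    ax.set_xscale("log")
    ax.set_xlabel("FLOPs")
    ax.set_ylabel("Average accuracy (%)")
    fig.savefig(out_path)


# Placeholder data for illustration only.
EXAMPLE_MODELS = [
    {"name": "model-a", "params_b": 0.5, "flops": 1e21, "avg_acc": 50.0},
    {"name": "model-b", "params_b": 1.5, "flops": 5e21, "avg_acc": 60.0},
    {"name": "model-c", "params_b": 3.0, "flops": 2e22, "avg_acc": 65.0},
]
```

Calling `plot_accuracy(EXAMPLE_MODELS)` would write the figure to `acc_vs_flops.png`.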

obsidian quest Mar 13, 2025, 1:36 AM

#

obsidian quest ```1. move [Figure: Minimum number of layers (lower is better) required to attai...


2. for Figure [FLOPs vs. Average Benchmark Accuracy], add [active params vs avg acc] too```

misty igloo Mar 13, 2025, 1:49 AM

#

obsidian quest ```1. move [Figure: Minimum number of layers (lower is better) required to attai...

could you describe params vs acc? I can show you what it would look like but I don't think it's very informative

obsidian quest Mar 13, 2025, 1:54 AM

#

misty igloo could you describe params vs acc? I can show you what it would look like but I d...

x = log(active params), y = avg acc

misty igloo Mar 13, 2025, 2:02 AM

#

obsidian quest x = log(active params), y = avg acc

define active? Like do you want to double tied embeddings?

obsidian quest Mar 13, 2025, 2:02 AM

#

misty igloo define active? Like do you want to double tied embeddings?

non-embedding params (so related to inference flops)

#

actually can use [inference flops] vs [avg acc]

gusty condor Mar 13, 2025, 2:41 AM

#

keen tartan <@803473343705514025> I saw you pushed the missing RWKV6 World 3B multilingual r...

@brisk bronze did you test RWKV-7 1.5B and 2.9B on English evals?

misty igloo Mar 13, 2025, 2:44 AM

#

obsidian quest non-embedding params (so related to inference flops)

Okay so no embed and no lm head either

gusty condor Mar 13, 2025, 2:45 AM

#

lm head are active parameters
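
One possible reading of "active params" from the exchange above, as a sketch: the input embedding is a table lookup and is excluded, while the LM head is a real matrix multiply and is kept. The function name, the tied-embedding handling, and the example numbers (1.5B total, vocab 65536, width 2048) are assumptions for illustration, not figures from the paper.

```python
def active_params(total_params, vocab_size, d_model, tied_embeddings=False):
    """Parameters that take part in per-token matrix multiplies.

    The input embedding (vocab_size x d_model) is only indexed, so it is
    subtracted. With tied embeddings the single shared matrix also serves
    as the LM head, so it stays in the count.
    """
    if tied_embeddings:
        return total_params
    return total_params - vocab_size * d_model


# Hypothetical model: 1.5B total parameters, untied embeddings.
print(active_params(1_500_000_000, 65536, 2048))
```

The x-axis value would then be the log of this number, per "x = log(active params)".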

misty igloo Mar 13, 2025, 2:54 AM

#

nvm I am still sick and clearly not thinking well

#

Someone else better do this chart

#

@brisk bronze maybe you can take care of it tomorrow?

#

Should be ez to copy our existing google plot to make it

gusty condor Mar 13, 2025, 3:29 AM

#

I can do this chart

brisk bronze Mar 13, 2025, 3:31 AM

#

I've run them but with 0.4.7 not 0.4.3 so they should probably be re-run
https://github.com/jannalulu/lm-evaluation-harness/tree/main/results

GitHub

lm-evaluation-harness/results at main · jannalulu/lm-evaluation-har...

A framework for few-shot evaluation of language models. - jannalulu/lm-evaluation-harness

gusty condor Mar 13, 2025, 3:34 AM

#

shall we run everything with 0.4.7?

brisk bronze Mar 13, 2025, 3:37 AM

#

gusty condor shall we run everything with 0.4.7?

I don't really see the point? Pawsx and the bos_token are both getting fixed in 0.4.8

gusty condor Mar 13, 2025, 3:38 AM

#

Shall we run 0.4.8?

gusty condor Mar 13, 2025, 4:08 AM

#

@misty igloo So we do have some reason to run 0.4.8: Paws-X and bos_token are fixed, and using the newest version improves reproducibility. We can also run GLUE with averaging, as requested by Bo.

brisk bronze Mar 13, 2025, 4:40 AM

#

gusty condor <@1007072846960410685> So we do have some reason to run 0.4.8, since Paws-X and ...

yeah but didn't we have trouble reproducing qwen results or something

gusty condor Mar 13, 2025, 5:01 AM

#

No trouble!

#

I can rerun them

dawn pewter Mar 13, 2025, 12:14 PM

#

typo, the minimum of w_t is exp(-exp(-0.5)). I fixed it
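
For reference, the bound can be checked numerically. The sketch below assumes the decay has the form w = exp(-e^{-0.5} * sigmoid(d)); that form is an assumption here, while the message above only states the minimum itself.

```python
import math


def decay_lower_bound():
    """The stated minimum of w_t: exp(-exp(-0.5))."""
    return math.exp(-math.exp(-0.5))


def decay(d):
    """Assumed decay form: w = exp(-e^{-0.5} * sigmoid(d))."""
    sigmoid = 1.0 / (1.0 + math.exp(-d))
    return math.exp(-math.exp(-0.5) * sigmoid)


print(round(decay_lower_bound(), 4))
```

Under that assumption the decay stays between roughly 0.545 (sigmoid near 1) and 1 (sigmoid near 0).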

gusty condor Mar 13, 2025, 12:39 PM

#

Thank you!

#

Bad news: two of our authors are currently sick (both Bo and Smerky).

gusty condor Mar 13, 2025, 1:10 PM

#

@brisk bronze @keen tartan
Now I found a big problem: <bos> is added for RWKV-7 but not for other models like Qwen and Llama, so it's not a fair comparison.
But actually, RWKV-7 adding a [0] can enhance the performance of lambada by 0.6% but harms performance of arc by 2-3%.
I think a fair comparison should be conducted without [0] for all models. This also matches RWKV-FLA performance.
w/o [0]:

"arc_challenge": {
    "alias": "arc_challenge",
    "acc,none": 0.43430034129692835,
    "acc_stderr,none": 0.014484703048857371,
    "acc_norm,none": 0.4658703071672355,
    "acc_norm_stderr,none": 0.014577311315231023
  },
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.7706228956228957,
    "acc_stderr,none": 0.008627087045485938,
    "acc_norm,none": 0.7584175084175084,
    "acc_norm_stderr,none": 0.008783247004042158
  }

w/ [0]:

  "arc_easy": {
      "acc,none": 0.7584175084175084,
      "acc_stderr,none": 0.008783247004042158,
      "acc_norm,none": 0.7079124579124579,
      "acc_norm_stderr,none": 0.009330705616569084,
      "alias": "arc_easy"
    },
    "arc_challenge": {
      "acc,none": 0.40784982935153585,
      "acc_stderr,none": 0.01436109728844968,
      "acc_norm,none": 0.42406143344709896,
      "acc_norm_stderr,none": 0.014441889627464344,
      "alias": "arc_challenge"
    },
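
A small sketch that turns the two dumps above into differences, using the `acc,none` values and their standard errors as posted. Treating the two runs as independent when combining standard errors is a simplifying assumption (both runs use the same test items), so the ratio is only a rough guide.

```python
import math

# (acc, stderr) copied from the result dumps above.
WITHOUT_BOS = {
    "arc_challenge": (0.43430034129692835, 0.014484703048857371),
    "arc_easy": (0.7706228956228957, 0.008627087045485938),
}
WITH_BOS = {
    "arc_challenge": (0.40784982935153585, 0.01436109728844968),
    "arc_easy": (0.7584175084175084, 0.008783247004042158),
}


def compare(task):
    """Return (delta in percentage points, delta / combined stderr)."""
    acc_a, se_a = WITHOUT_BOS[task]
    acc_b, se_b = WITH_BOS[task]
    delta = acc_b - acc_a
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return 100 * delta, delta / combined_se


for task in WITHOUT_BOS:
    points, ratio = compare(task)
    print(f"{task}: {points:+.2f} points ({ratio:+.2f} combined stderr)")
```

On `acc`, adding [0] costs about 2.6 points on arc_challenge and about 1.2 points on arc_easy, each on the order of one combined standard error.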

misty igloo Mar 13, 2025, 2:15 PM

#

gusty condor <@533592838529744917> <@371036620008194048> Now I found a big problem: `<bos>...

Lol this is exactly what I concluded last year so I used all the lm eval cmdline results for that paper, not giving rwkv a BOS

gusty condor Mar 13, 2025, 2:16 PM

#

OK so don't give RWKV a bos then

misty igloo Mar 13, 2025, 2:16 PM

#

And used bfloat16 not float32

keen tartan Mar 13, 2025, 2:17 PM

#

lm_eval automatically adds a BOS token for the Gemma family of models.

misty igloo Mar 13, 2025, 2:17 PM

#

keen tartan lm_eval adds automatically BOS token for Gemma family of models.

Interesting

keen tartan Mar 13, 2025, 2:18 PM

#

There is a comment in the source. Let me try to reference it.

#

https://github.com/EleutherAI/lm-evaluation-harness/blob/4890e881031a8ff00fd3136f938c4cf1ae101de4/lm_eval/models/huggingface.py#L222

#

Line 222

#

"...part of the Gemma family--a BOS token will be used as Gemma underperforms without it." is what gets logged right under it.

misty igloo Mar 13, 2025, 2:23 PM

#

gusty condor <@1007072846960410685> So we do have some reason to run 0.4.8, since Paws-X and ...

It's probably slightly better this way with a newer version, but I don't think it matters too much if it's annoying or slow to do. As for GLUE, the new version doesn't give the non-weighted average as an output; you have to compute it manually anyway, which we can do just as easily using the existing results
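
Computing the non-weighted GLUE average by hand could look like the sketch below. The `acc,none` key matches the result dumps in this thread; the subtask names and the scores in the example are assumptions and placeholders, and subtasks that do not report the chosen metric are simply skipped.

```python
def unweighted_average(results, tasks, metric="acc,none"):
    """Plain mean of `metric` over the given tasks, ignoring sample counts."""
    scores = [
        results[task][metric]
        for task in tasks
        if task in results and metric in results[task]
    ]
    if not scores:
        raise ValueError("no task reports the requested metric")
    return sum(scores) / len(scores)


# Placeholder scores for illustration only (not real results).
example_results = {
    "rte": {"acc,none": 0.5},
    "sst2": {"acc,none": 0.75},
    "cola": {"mcc,none": 0.1},  # no accuracy reported, so it is skipped
}
print(unweighted_average(example_results, ["cola", "rte", "sst2"]))
```

The same function works on the `results` section of an existing lm_eval JSON output, so no rerun is needed for this number.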

keen tartan Mar 13, 2025, 2:28 PM

#

Relevant pull requests:

misty igloo Mar 13, 2025, 2:29 PM

#

I'm agnostic about BOS token usage... I think it's fine but as ZhangRC points out and I found last year, it helps some evals and hurts others so it kind of doesn't make a difference overall for RWKV

gusty condor Mar 13, 2025, 2:31 PM

#

Adding a [0] gives -0.4% overall

young sparrow Mar 13, 2025, 2:44 PM

#

I don't think that it's important to have the same thing for every model.

#

Think of it this way: when you have two chat models with different chat prompts, is it more fair to use the chat prompt model A expects for both models because it's the same input, or is it more fair to give each model the chat prompt it expects?

gusty condor Mar 13, 2025, 4:11 PM

#

misty igloo Its probably slightly better this way with a newer version, but I don't think it...

Not really. Should be done next morning

gusty condor Mar 14, 2025, 12:37 PM

#

Abstract deadline within one week!
Full paper deadline within 2 weeks!

misty igloo Mar 14, 2025, 2:40 PM

#

@everyone

The COLM abstract submission deadline is on March 20
We need authors to DM me their openreview ID or email address used for their openreview account.
If you don't have an openreview account, you need to open one and get it approved ASAP
If you are not currently listed as an author, and think you should be, now is the time to let us know. Authorship will be extended only to those who have contributed significantly to the paper by supplying experimental data that are included therein and/or doing significant writing. (but not for just having fixed some spelling or reworded a few things)

fresh mulch Mar 14, 2025, 3:25 PM

#

Table 1 feels cluttered with scalar annotations. Could we move them elsewhere or drop them without misleading the reader or losing nuance?

misty igloo Mar 14, 2025, 3:27 PM

#

they're important so we can't drop them, but maybe there's a better way to indicate this?

fresh mulch Mar 14, 2025, 3:28 PM

#

I would say create a separate column for which variables are scalar, but we're almost at width as it is

young sparrow Mar 14, 2025, 3:55 PM

#

The S and I variables are the only matrices right? Perhaps it would be cleaner to use bold to indicate "not a scalar" and note in the caption that S and I are matrices

fresh mulch Mar 14, 2025, 4:03 PM

#

we already mess with the notation for consistency with sec 4 so the latter would probably not be great. I think bold "not a scalar" works best considering boldface for vectors is convention in some fields

#

related consistency question: is it Delta Net or DeltaNet

misty igloo Mar 14, 2025, 4:07 PM

#

@obsidian quest I added multilang and eng acc vs inference active params charts to the paper... the english one is a bit messy

young sparrow Mar 14, 2025, 4:07 PM

#

@misty igloo I think it would be valuable to show the paper to someone who hasn't worked with RWKV much but follows this space and see how accessible the methodological explanation is to them. One of the things I consistently hear from people who work with Mamba and not RWKV is that finding the exposition inaccessible is a major reason why they use Mamba

misty igloo Mar 14, 2025, 4:07 PM

#

young sparrow <@1007072846960410685> I think it would be valuable to show the paper to someone...

I agree - @fresh mulch gave some initial feedback on that but more would be helpful

young sparrow Mar 14, 2025, 4:08 PM

#

Maybe just drop a draft in #research and ask for feedback on this point as a starting point

misty igloo Mar 14, 2025, 4:08 PM

#

@granite pike could provide that feedback if they're up for it

young sparrow Mar 14, 2025, 4:09 PM

#

The quality of the diagrams has substantially improved which y'all should be proud of 🙂

fresh mulch Mar 14, 2025, 4:09 PM

#

+1. Being (formerly, still kind of) that person, not having a clear picture of how RWKV works made me lean towards Mamba.

#

The question of why RWKV isn't as popular as Mamba came up some time ago. I still believe most of it is accessibility - particularly things like blogposts, etc. that spread the word to the "lay user", i.e. someone who won't read the paper but would use RWKV in their applications

fresh mulch Mar 14, 2025, 4:17 PM

#

fresh mulch The question of why RWKV isn't as popular as Mamba came up some time ago. I stil...

Actually, I think it would be helpful to prepare a blogpost or X thread or similar to release concurrently with the paper, like "here's the technical report and here's a simpler intuitive explanation". E.g. Songlin did this for GDN

young sparrow Mar 14, 2025, 4:21 PM

#

I've left some comments on the first half of the paper and will be back to do more later

keen tartan Mar 14, 2025, 4:44 PM

#

misty igloo <@870137517020688415> I added multilang and eng acc vs inference active params ...

Perhaps try to combine English and multilingual benchmarks in one chart as suggested previously to make it more robust.

misty igloo Mar 14, 2025, 4:46 PM

#

keen tartan Perhaps try to combine English and multilingual benchmarks in one chart as sugge...

I put improved versions in the paper

keen tartan Mar 14, 2025, 4:46 PM

#

I'll check.

fresh mulch Mar 14, 2025, 5:09 PM

#

i tried compressing and revising sections 1-3, will be back later for more. things that still stand out to me:

scalars in table 1, as before
table 1's caption is really unwieldy and long
we use a lot of terminology in ways that would be familiar to someone in the space that might not be immediately obvious to an outside reader, such as using DeltaNet-specific terms in the introduction and the general idea of key-value retrieval in Section 2 (though maybe the latter is more obvious to people)
section 2 flows well but still feels delineated at the "Concurrent work" paragraph - subsection break here?
section 3 "Architecture" feels like it should be a subsection of section 4, or I don't see why it deserves its own section. It describes architectural changes over other methods, so I feel like it belongs at the beginning of the part where we describe the architecture in technical detail
section 4.1.1, after the big table of parameter definitions, could use some better structuring
I really love the new figures, they're super simple!

keen tartan Mar 14, 2025, 5:27 PM

#

Should we already aim to compress the main part of the paper into 9 pages or at least plan ahead?

keen tartan Mar 14, 2025, 6:31 PM

#

fresh mulch The question of why RWKV isn't as popular as Mamba came up some time ago. I stil...

I suggest a project web page for the paper, as is very popular nowadays, perhaps even with some interactive visual elements.

#

Could be a simple Github page dedicated to the RWKV-7 Goose Model.

#

As well as hands-on tutorials on how to get started.

remote elbow Mar 14, 2025, 6:35 PM

#

keen tartan I am suggesting a project web page for the paper like it is nowadays very popula...

https://www.rwkv.com/

RWKV Language Model

The RWKV Language Model

keen tartan Mar 14, 2025, 6:36 PM

#

remote elbow https://www.rwkv.com/

Yeah, we have the official website, true!

#

That is a good place for tutorials.

#

Probably just fleshing out the official website is the best way.

#

Who is managing it right now?

#

We also have the wiki: https://wiki.rwkv.com

RWKV Language Model

#

It is just a bit outdated but a good starting point

misty igloo Mar 14, 2025, 7:26 PM

#

keen tartan Should we already aim to compress the main part of the paper into 9 pages or at ...

I originally did this but there seems to be no appetite for it for the arxiv version so let's wait.

fresh mulch Mar 14, 2025, 8:43 PM

#

Updating and fleshing out website and wiki is a good idea. Also making it easy for people to get started, ie fewest steps to a (preferably customizable) working model with a walkthrough. Does the fla-hub kernel work with HF Transformers?

misty igloo Mar 14, 2025, 10:57 PM

#

fresh mulch Updating and fleshing out website and wiki is a good idea. Also making it easy f...

fla-hub has its own HF implementations that use its kernels yes

#

they will probably become the official RWKV HF implementations, at least temporarily

#

Good news: Will Merrill had some time to go through and do an initial pass merging and polishing the proofs in Appendix D. I think he still wants to do another pass, but it's something that could wait for v2.

keen tartan Mar 14, 2025, 11:30 PM

#

@gusty condor The following are the results from experiments to check the impact of RWKV_PAD tokens.
RWKV7-0.1B 11 is with \n as PAD tokens ([11]) which is the default one recommended.
RWKV7-0.1B 0 is with the special <|endoftext|> token as PAD tokens ([0]).
RWKV7-0.1B None is with no PAD tokens at all ([]).

+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| Model           | Tokens | lmb.o | hella  | piqa | arcE | arcC | glue | WG   | sciq | mmlu | avg  |
+=================+========+=======+========+======+======+======+======+======+======+======+======+
| (Name)          | (T)    | acc↑  | acc_n↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ | acc↑ |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B 11   | 1.6    | 48.1  | 42.1   | 67.3 | 59.3 | 25.5 | 48.1 | 52.7 | 86.3 | 25.4 | 50.5 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B 0    | 1.6    | 49.0  | 42.2   | 67.1 | 56.6 | 23.6 | 46.3 | 52.6 | 86.2 | 25.8 | 49.9 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+
| RWKV7-0.1B None | 1.6    | 47.4  | 41.9   | 67.5 | 59.1 | 25.2 | 46.3 | 52.2 | 86.1 | 25.5 | 50.1 |
+-----------------+--------+-------+--------+------+------+------+------+------+------+------+------+

#

+-----------------+---------+-------+-------+-------+------+-------+------+------+
| Model           | lmb.m_p | lmb.m | pawsx | xcopa | xnli | xsClz | xwin | avg  |
+=================+=========+=======+=======+=======+======+=======+======+======+
| (Name)          | ppl↓    | acc↑  | acc↑  | acc↑  | acc↑ | acc↑  | acc↑ | acc↑ |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B 11   | 166     | 31.6  | 46.1  | 53.3  | 37.6 | 52.6  | 64.1 | 47.5 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B 0    | 167     | 31.6  | 46.5  | 53.0  | 37.4 | 52.5  | 64.0 | 47.5 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+
| RWKV7-0.1B None | 177     | 31.2  | 46.6  | 53.0  | 37.4 | 52.4  | 63.0 | 47.3 |
+-----------------+---------+-------+-------+-------+------+-------+------+------+

#

Using lm_eval version 0.4.8. GLUE subtasks were simply averaged without weighting.
It seems that the default \n PAD token is preferred on average across benchmarks.
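
As a quick check, the avg column of the English table can be recomputed as a plain mean of the nine task scores in each row. The scores below are copied from the table above.

```python
# Per-row scores: lmb.o, hella, piqa, arcE, arcC, glue, WG, sciq, mmlu.
ROWS = {
    "RWKV7-0.1B 11": [48.1, 42.1, 67.3, 59.3, 25.5, 48.1, 52.7, 86.3, 25.4],
    "RWKV7-0.1B 0": [49.0, 42.2, 67.1, 56.6, 23.6, 46.3, 52.6, 86.2, 25.8],
    "RWKV7-0.1B None": [47.4, 41.9, 67.5, 59.1, 25.2, 46.3, 52.2, 86.1, 25.5],
}


def row_average(scores):
    """Unweighted mean of the task scores in one table row."""
    return sum(scores) / len(scores)


for name, scores in ROWS.items():
    print(f"{name}: {row_average(scores):.1f}")
```

The recomputed means round to the avg values shown in the table (50.5, 49.9, 50.1).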

#

Hypothesis: RWKV uses \n somewhat like an element of its chat template, since it frequently occurs as a separator between utterances in its training data.

obsidian quest Mar 15, 2025, 1:33 AM

#

keen tartan <@803473343705514025> The following are the results from experiments to check th...

the pad effect should be less for larger rwkv7 (1.5b 2.9b). you can check that 🙂

moreover please check mamba too

please mention <|endoftext|> is always token 0 for all rwkv models

gusty condor Mar 15, 2025, 4:04 AM

#

young sparrow <@1007072846960410685> I think it would be valuable to show the paper to someone...

I found RWKV's exposition more accessible than mamba's. By solely reading the mamba paper I can't formulate mamba in my brain.

obsidian quest Mar 15, 2025, 9:05 AM

#

i am building 100K and 1M random items from world-v3 dataset for reference

#

@gusty condor

keen tartan Mar 15, 2025, 9:46 AM

#

obsidian quest the pad effect should be less for larger rwkv7 (1.5b 2.9b). you can check that ...

I'll check other models as well. Gonna also try \n\n for PAD.

obsidian quest Mar 15, 2025, 11:01 AM

#

keen tartan I check for other models as well. Gonna also try `\n\n` for PAD.

pls check mamba 1 & 2 too 🙂

misty igloo Mar 15, 2025, 1:55 PM

#

obsidian quest i am building 100K and 1M random items from world-v3 dataset for reference

is it possible to release the tools you use to put it together? that way other people can easily replicate the entire dataset

fresh mulch Mar 15, 2025, 2:25 PM

#

There are a few parts of the paper that look like intimidating walls of text during a cursory sweep of the paper. Would it be worth breaking these up by \subsection or \paragraph, or is this not an issue?

misty igloo Mar 15, 2025, 3:27 PM

#

fresh mulch There are a few parts of the paper that look like intimidating walls of text dur...

lol aren't you the one who removed the 'concurrent work' subheading in Background

fresh mulch Mar 15, 2025, 3:35 PM

#

misty igloo lol aren't you the one who removed the 'concurrent work' subheading in Backgroun...

it wasn't there originally! I put it in yesterday, then commented it out myself because I didn't know whether we wanted it or not

misty igloo Mar 15, 2025, 3:35 PM

#

gotcha, thats funny

#

there have been so many changes I can't remember which is which 🙂

young sparrow Mar 15, 2025, 3:48 PM

#

@obsidian quest what's the flops/mfu/whatever we get during training?

fresh mulch Mar 15, 2025, 9:10 PM

#

Appendix J, state transitions. What is meant by comparing "the order of O(1)" to "the order of thousands"?

#

also will we mention QRWKV at all in this paper @misty igloo as further proof it works at scale

misty igloo Mar 15, 2025, 9:27 PM

#

fresh mulch also will we mention QRWKV at all in this paper <@1007072846960410685> as furthe...

Heh it actually doesn't work at scale for me there

#

Gets unstable

#

Also I'm hoping to submit qrwkv paper to COLM separately

misty igloo Mar 15, 2025, 9:38 PM

#

fresh mulch Appendix J, state transitions. What is meant by comparing "the order of O(1)" to...

I assume it means average state length per element, maybe in an L2 kind of sense, maybe per column? Seems like it could be worded better

#

@dawn pewter or @gusty condor could you clarify

fresh mulch Mar 16, 2025, 1:42 AM

#

what are "such ideas" that can be traced back to fast weights and hebbian learning? @misty igloo

#

i want to modify this section a bit to motivate (our use of) the delta rule via deltanet, if that's fine with you

misty igloo Mar 16, 2025, 1:56 AM

#

fresh mulch what are "such ideas" that can be traced back to fast weights and hebbian learni...

That was a line @obsidian quest asked to put in earlier in this chat, but it doesn't have to be stated in exactly that way or in that location.

What kind of motivation are you thinking of adding?

#

Fast weights is basically the idea of test time training the state

fresh mulch Mar 16, 2025, 2:00 AM

#

misty igloo That was a line <@870137517020688415> asked to put in earlier in this chat, but ...

reformulating this bit to focus more on whatever deltanet does with the delta rule or something. it's just that the delta rule is not really emphasized anywhere in our background discussion, despite being foundational to the whole architecture

#

we talk about it a lot throughout but to me it reads like we assume the reader is familiar with the delta rule's role in the development of linear attention

misty igloo Mar 16, 2025, 2:02 AM

#

You seem to have it backwards maybe? Delta rule has nothing to do with the development of linear attention

#

Linear Attention is a form of fast weights tho

fresh mulch Mar 16, 2025, 2:03 AM

#

oops, yes, that is backwards

misty igloo Mar 16, 2025, 2:03 AM

#

Obviously this means my explanation in the text is maybe lacking

#

The way I was attempting to construct the narrative was:
Transformers, then Linear Attention and its issues, then delta rule fixes those issues, then we innovate on that

fresh mulch Mar 16, 2025, 2:05 AM

#

I guess my point is this: I like the flow of the discussion of the problem of numerically increasing state, but jumping into delta rule (DeltaNet was the first to...) after that is a shock, and it is not immediately clear to me what the transition is. Is it that the delta rule enables this fix, or it is this fix, or...?

misty igloo Mar 16, 2025, 2:05 AM

#

Yes delta rule is one variety of fix

#

Maybe I didn't make that clear

#

I did say exactly how it solves that issue in the second sentence tho...

#

Maybe I should basically swap the order of sentences 1 and 2

fresh mulch Mar 16, 2025, 2:08 AM

#

oh... does sentence 2 describe delta rule in general (the way it is phrased makes it seem DeltaNet-specific)

#

I'm probably too tired for this right now 🤣

misty igloo Mar 16, 2025, 2:08 AM

#

And I think it's a good point that fast weights applies to linear attention too... not quite sure how to shoehorn that in tho.

misty igloo Mar 16, 2025, 2:09 AM

#

fresh mulch oh... does sentence 2 describe delta rule in general (the way it is phrased make...

Well the delta rule is a general rule, applied to stuff... But deltanet applies it to the state (which in its case is the same kind of state as linear attention has)
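
To make the contrast concrete, here is a toy sketch of the two state updates: the additive linear attention write, S <- S + v k^T, and the textbook delta rule write, S <- S - beta (S k - v) k^T. This is the generic delta rule on a small matrix state, not the RWKV-7 update; the helper names and the toy vectors are made up for illustration.

```python
def matvec(S, k):
    """Read from the state: returns S k."""
    return [sum(s * kj for s, kj in zip(row, k)) for row in S]


def linear_attention_write(S, k, v):
    """Additive write: S <- S + v k^T (values accumulate)."""
    return [[s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(S, v)]


def delta_rule_write(S, k, v, beta=1.0):
    """Delta rule write: S <- S - beta * (S k - v) k^T (error correction)."""
    error = [p - vi for p, vi in zip(matvec(S, k), v)]
    return [[s - beta * e * kj for s, kj in zip(row, k)] for row, e in zip(S, error)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # unit-norm key
old_value, new_value = [1.0, 2.0], [3.0, 4.0]

# Write two different values under the same key with each rule.
additive = linear_attention_write(linear_attention_write(zeros, key, old_value), key, new_value)
corrected = delta_rule_write(delta_rule_write(zeros, key, old_value), key, new_value)

print(matvec(additive, key))   # old and new values summed together
print(matvec(corrected, key))  # old value replaced by the new one
```

With a unit-norm key and beta=1, the delta rule overwrites what was stored under that key, whereas the additive write keeps growing the state, which is the "numerically increasing state" issue discussed above.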

fresh mulch Mar 16, 2025, 2:11 AM

#

I see, so sentence 4 (basically what you just said) is describing the process sentence 2 describes?

misty igloo Mar 16, 2025, 2:12 AM

#

not sure I understand that 2 vs 4 comparison, but generally any messiness like that is bc I was trying to fit in the things Blink required us to say about it

fresh mulch Mar 16, 2025, 2:13 AM

#

ic okay. yeah the more I talk about this the more confused I get lmao

misty igloo Mar 16, 2025, 2:14 AM

#

that's not great 🙂 means I probably have some fixing to do

#

I'd like to make sure it makes sense to readers, even if they have no delta rule background

#

not sure if explaining linear attention is outside the scope of the paper tho

fresh mulch Mar 16, 2025, 2:23 AM

#

linear attention at large probably (definitely?) is

gusty condor Mar 16, 2025, 5:29 AM

#

misty igloo I assume it means average state length per element, maybe in an L2 kind of sense...

No, it means WKV state entries.

obsidian quest Mar 16, 2025, 5:35 AM

#

obsidian quest i am building 100K and 1M random items from world-v3 dataset for reference

https://huggingface.co/BlinkDL/temp-latest-training-models/tree/main/data_sample

data_sample is random subsample of world dataset. note: due to technical reasons (very complicated due to my horrible messy code), some distill instruct data are not included, and only subsamples of these instruct data are included: flan, Buzz-V12, WebInstructSub, SKGInstruct, PIPPA, COIG-PC-core

gusty condor Mar 16, 2025, 5:38 AM

#

I think jsonl could be better

obsidian quest Mar 16, 2025, 5:41 AM

#

gusty condor I think `jsonl` could be better

ok you can build them from my binidx

#

for depth 1 & long length, i think we need better optimizer or data design (can try curriculum learning) for rwkv7 to grok @crystal hull

gusty condor Mar 16, 2025, 7:11 AM

#

obsidian quest ok you can build them from my binidx

Built. Imagine RWKV learned with this quality of data can surpass llama3 and Qwen2.5🚀

#

And very little Chinese (less than 1%)

gusty condor Mar 16, 2025, 8:02 AM

#

@misty igloo @obsidian quest https://huggingface.co/datasets/rwkv-x-dev/rwkv-world-v3-subsample

rwkv-x-dev/rwkv-world-v3-subsample · Datasets at Hugging Face

obsidian quest Mar 16, 2025, 10:09 AM

#

obsidian quest for depth 1 & long length, i think we need better optimizer or data design (can ...

and how many heads are you using? we need at least 5 heads for single layer rwkv7 to solve S5. so can use multiple small heads 🙂

#

@crystal hull

obsidian quest Mar 16, 2025, 10:28 AM

#

pls add an alternative version diag(w)(I - ckak) where one is free to use c=2 (this version was used in othello and found to be useful)

keen tartan Mar 16, 2025, 11:58 AM

#

@gusty condor subsampled 100k as jsonl: https://huggingface.co/datasets/hevok/rwkv-world-v3-subsample-100k Maybe combine as different subsets in one dataset

keen tartan Mar 16, 2025, 1:03 PM

#

Got it working. Just need a yaml configuration defining the subsets in the README: https://huggingface.co/datasets/hevok/rwkv-world-v3-subsample

keen tartan Mar 16, 2025, 1:43 PM

#

Made index a subset as well: https://huggingface.co/datasets/hevok/Goose-World-v3

misty igloo Mar 16, 2025, 2:45 PM

#

gusty condor Built. Imagine RWKV learned with this quality of data can surpass llama3 and Qwe...

I don't think that we should be supplying this data with the paper if it does not properly represent the actual data used to train the models, due to the limitations Blink mentioned.

#

@young sparrow in your opinion is this better than nothing, or is including something that doesn't fully match worse than not providing it at all?

#

(some distill instruct data are not included, and only subsamples of these instruct data are included: flan, Buzz-V12, WebInstructSub, SKGInstruct, PIPPA, COIG-PC-core)

misty igloo Mar 16, 2025, 2:54 PM

#

gusty condor No, it means WKV state entries.

like each element? the specific meaning of the big O notation here is confusing to me

#

I'd like to understand it better since I think there is similar notation used here:
The $\tilde{k}_t$ in the formula can be regarded as a "normalized key", a design to ensure that the state of $\bm{wkv}$ contains columns of $O(1)$ size.
and I had changed that from 'entries' to 'columns'

#

the columns in the state represent values, basically - so I'm not sure that a per-element analysis is really the best metric around keeping things normalized. A vector kept in the usual form in pytorch has L2 Norm of sqrt(vector_dim)

obsidian quest Mar 16, 2025, 3:48 PM

#

misty igloo I don't think that we should be supplying this data with the paper if it does no...

wait for me to provide a patch

gusty condor Mar 16, 2025, 3:54 PM

#

misty igloo like each element? the specific meaning of the big O notation here is confusing ...

O(1): not growing over context length, and no outliers
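
A minimal numerical sketch of what O(1) means here, under stated assumptions: it compares the plain additive linear-attention state update with a plain delta-rule update using unit-norm keys, on random data. The function name `simulate`, the dimensions, and the random inputs are all illustrative, and this is not the RWKV-7 update itself (no decay, no in-context learning rate).

```python
import math
import random


def simulate(steps=2000, dim=8, seed=0):
    """Return (rms_additive, rms_delta): RMS entry size of each state after `steps` updates."""
    rng = random.Random(seed)
    additive = [[0.0] * dim for _ in range(dim)]  # S <- S + v k^T
    delta = [[0.0] * dim for _ in range(dim)]     # S <- S (I - k k^T) + v k^T

    for _ in range(steps):
        # Random key, normalized to unit length (the "normalized key" assumption).
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k_norm = math.sqrt(sum(x * x for x in k))
        k = [x / k_norm for x in k]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]

        # What the delta-rule state currently returns for this key: S k.
        s_k = [sum(delta[i][j] * k[j] for j in range(dim)) for i in range(dim)]

        for i in range(dim):
            for j in range(dim):
                additive[i][j] += v[i] * k[j]
                # Replace the old value stored along k with the new value v.
                delta[i][j] += (v[i] - s_k[i]) * k[j]

    def rms(state):
        return math.sqrt(sum(x * x for row in state for x in row) / (dim * dim))

    return rms(additive), rms(delta)


rms_additive, rms_delta = simulate()
print(f"additive state RMS entry:   {rms_additive:.2f}")
print(f"delta-rule state RMS entry: {rms_delta:.2f}")
```

With these settings the additive state's RMS entry size keeps growing, roughly like the square root of the number of steps, while the delta-rule state settles at a roughly constant level.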

misty igloo Mar 16, 2025, 4:45 PM

#

For COLM we are apparently required to designate one of our authors as a 'reciprocal reviewer', and I'm not qualified to be that person:

Reviewers must have research experience equivalent to a second-year graduate student in machine learning or a related field. They must have been a primary author* on at least two peer-reviewed conference or journal papers published in a related venue (e.g., ACL, NAACL, EMNLP, ICML, NeurIPS, ICLR, JMLR, TMLR, CVPR, ICCV – this is not an exhaustive list).

Please let us know if you're an author of the RWKV7 paper who meets the criteria above and would be willing to do this for us. This is a requirement - we need someone to do it in order to submit to COLM 2025.

Update: I think we have this covered now - thanks to everyone for reaching out!

gusty condor Mar 16, 2025, 5:26 PM

#

@paper dove

young sparrow Mar 16, 2025, 5:37 PM

#

obsidian quest wait for me to provide a patch

What is going to be the result of this patch? A genuinely representative sample of the data?

#

@misty igloo @gusty condor @obsidian quest If the data isn't going to get released you can't say that the "RWKV v3 World public corpus" is a contribution of the paper

misty igloo Mar 16, 2025, 5:41 PM

#

young sparrow @misty igloo @gusty condor @obsidian quest If the data i...

how about a detailed description of?

young sparrow Mar 16, 2025, 5:42 PM

#

That's not a contribution to the scientific literature

gusty condor Mar 16, 2025, 5:45 PM

#

young sparrow What is going to be the result of this patch? A genuinely representative sample ...

Yes, and I think at least 1% of the total amount is required

young sparrow Mar 16, 2025, 5:51 PM

#

You also can't refer to it as an "open source corpus"

keen tartan Mar 16, 2025, 5:53 PM

#

We could take inspiration from the Allen Institute for AI's OLMo (Open Language Model) project.

#

They tried to address open source as best as they could

young sparrow Mar 16, 2025, 5:54 PM

#

They released their dataset

keen tartan Mar 16, 2025, 5:54 PM

#

Dolma?

young sparrow Mar 16, 2025, 5:54 PM

#

Yes

#

And the way they talk about the licensing of their dataset has misled a lot of people into thinking it's openly licensed

keen tartan Mar 16, 2025, 5:54 PM

#

Yes, I am looking into how they did it and trying to follow their guide.

young sparrow Mar 16, 2025, 5:55 PM

#

I don't know what you mean by that

keen tartan Mar 16, 2025, 5:55 PM

#

young sparrow I don't know what you mean by that

I am thinking about it.

young sparrow Mar 16, 2025, 5:59 PM

#

What they did was release the data and wrap the entire thing, as a collection, in a database license. That database license is open source and the way they did messaging around it led people to think that the data was openly licensed.

keen tartan Mar 16, 2025, 6:00 PM

#

young sparrow What they did was release the data and wrap the entire thing, as a collection, i...

I now understand. We should avoid such pitfalls. Thus, also learn from their mistakes. I think it is a great project though.

young sparrow Mar 16, 2025, 6:04 PM

#

I will not let us fall into those pitfalls 🙂

young sparrow Mar 16, 2025, 6:07 PM

#

misty igloo For COLM we are apparently required to designate one of our authors as a 'recipr...

If I'm getting authorship I'm happy to do it. If you need me to do some writing to qualify I can add some of the things I've suggested in comments.

misty igloo Mar 16, 2025, 6:08 PM

#

young sparrow If I'm getting authorship I'm happy to do it. If you need me to do some writing ...

Thanks, apparently we can list Will Merrill - so I think we're in a good place now

misty igloo Mar 16, 2025, 6:09 PM

#

young sparrow That's not a contribution to the scientific literature

Describing the dataset in such a way that people can replicate it isn't a contribution to the scientific literature?

#

Maybe the phrasing needs to be a bit clearer around enabling replicability?

#

This is pretty much the exact same thing we did in the last paper, so I'm not sure why it's not valid this time

young sparrow Mar 16, 2025, 6:16 PM

#

One sec (edit: actually I gotta run, be back later)

jovial meteor Mar 16, 2025, 6:23 PM

#

obsidian quest for depth 1 & long length, i think we need better optimizer or data design (can ...

does rwkv7 show grokking without softmax?

misty igloo Mar 17, 2025, 2:21 AM

#

@obsidian quest please pull https://github.com/RWKV/RWKV-LM to sync with your upstream repo

obsidian quest Mar 17, 2025, 2:29 AM

#

misty igloo @obsidian quest please pull https://github.com/RWKV/RWKV-LM to sync with y...

done

gusty condor Mar 17, 2025, 2:47 AM

#

Shall we put RWKV7 code into RWKV-v7 folder?

misty igloo Mar 17, 2025, 3:57 AM

#

young sparrow One sec (edit: actually I gotta run, be back later)

No problem, let me know when you can. Also if you followed up on the 'missing' three (?) datasets please let me know where that ended up. We're trying to get the paper on arxiv as soon we can, and I want to make sure we have this dataset stuff ironed out to your satisfaction.

gusty condor Mar 17, 2025, 4:02 AM

#

young sparrow I will not let us fall into those pitfalls 🙂

I think we shouldn't let the dataset issue delay the release of the other valuable information in the paper. The dataset can be dealt with later, but many people outside this channel are waiting for the paper.

obsidian quest Mar 17, 2025, 4:20 AM

#

just call it dataset preview, for now. will fix it when i am less busy

gusty condor Mar 17, 2025, 5:51 AM

#

All 3 missing datasets were found.

1. Wikipedia: Loader not working anymore #1103039376184852622 message
2. Guanaco #1103039376184852622 message
3. Books3 #1103039376184852622 message
@misty igloo

pure pike Mar 17, 2025, 7:23 AM

#

gusty condor All 3 missing datasets were found. 1. Wikipedia: Loader not working anymore htt...

Guanaco is taken down because of Josepheus. Has a reputation of rugpulling on datasets

keen tartan Mar 17, 2025, 3:03 PM

#

Have been grinding through all the RWKV World v3 corpus components and made sure that it is possible to download and sample from each component.

#

Here is the updated annotated dataset: https://huggingface.co/datasets/hevok/Goose-World-v3

hevok/Goose-World-v3 · Datasets at Hugging Face

#

Good news is: There are no major obstacles for reconstruction.

misty igloo Mar 17, 2025, 3:05 PM

#

cool!

keen tartan Mar 17, 2025, 3:05 PM

#

Just a few tiny details are lacking that would be helpful to eliminate ambiguity.

misty igloo Mar 17, 2025, 3:05 PM

#

should I be copying this to the official RWKV HF

keen tartan Mar 17, 2025, 3:05 PM

#

misty igloo should I be copying this to the official RWKV HF

Yes, I think we should move it to the official RWKV HF.

#

I could just rename it if I am a member of RWKV HF.

#

So it keeps the statistics from the original one as it already had quite some traction.

misty igloo Mar 17, 2025, 3:07 PM

#

I can't give you that access, unfortunately

keen tartan Mar 17, 2025, 3:07 PM

#

How about I move it to another org repo and you move it from there from org to org.

#

I could just create an org and add you to it.

misty igloo Mar 17, 2025, 3:07 PM

#

sure if that's doable!

keen tartan Mar 17, 2025, 3:07 PM

#

So migrating it 2 times.

#

I will do it.

misty igloo Mar 17, 2025, 3:08 PM

#

but then you wont be able to edit it any more

keen tartan Mar 17, 2025, 3:10 PM

#

oh

#

Moved it to temporary Organization for now: https://huggingface.co/datasets/Goose-World/Goose-World-v3

Goose-World/Goose-World-v3 · Datasets at Hugging Face

misty igloo Mar 17, 2025, 3:12 PM

#

ok well it's fine, let's just wait and add it in the next version of the paper

#

that way you have time to edit it a bit more

keen tartan Mar 17, 2025, 3:13 PM

#

https://huggingface.co/organizations/Goose-World/share/GErbhZWlqMcLrZVYxaOeyyCVdmQrXhwGcx

Hugging Face – The AI community building the future.

misty igloo Mar 17, 2025, 3:13 PM

#

thanks!

#

the subsamples are still in your account

keen tartan Mar 17, 2025, 3:14 PM

#

I will move those too.

#

They are also in the main one as subsets.

#

It has 3 subsets: index, 100k, and 1m

#

Moved all and assigned you admin rights.

misty igloo Mar 17, 2025, 3:20 PM

#

I guess we just gotta update the links once we move orgs

#

@keen tartan do you want me to put it in RWKV now, or wait so you can keep editing

#

is there some way to add the up/down sampling frequency info from the Eagle/Finch paper as a column here for those that weren't just used as-is?

keen tartan Mar 17, 2025, 3:21 PM

#

misty igloo @keen tartan do you want me to put it in RWKV now, or wait so you can k...

Let me check.

#

I can provide the code to generate the tables.

#

There is a column for world version already.

#

Looking into how to add up/down sampling frequency column too.

misty igloo Mar 17, 2025, 3:28 PM

#

the amounts are listed in the attached wiki.txt for the Eagle/Finch paper ... not sure its possible to include this

keen tartan Mar 17, 2025, 3:28 PM

#

I check.

misty igloo Mar 17, 2025, 3:28 PM

#

and oscar.txt

#

yeah seems tough to do, only reasonable way would be to maybe pre-process those datasets to create the filtered versions

#

and provide those separately as components

#

but if we did that it'd make the whole thing quite reproducible I think!

#

since those are the only specially sampled items

keen tartan Mar 17, 2025, 3:34 PM

#

misty igloo since those are the only specially sampled items

Is it only for the Wikipedia and OSCAR23.01 datasets that certain languages were randomly subsampled, right?

misty igloo Mar 17, 2025, 3:35 PM

#

keen tartan Is it only for the Wikipedia and OSCAR23.01 datasets that certain languages were...

yeah

keen tartan Mar 17, 2025, 3:35 PM

#

Then it seems doable.

misty igloo Mar 17, 2025, 3:35 PM

#

I mean, review the Eagle/Finch paper to be sure, but that's my recollection

keen tartan Mar 17, 2025, 3:38 PM

#

I will do so.

#

HuggingFace Hub is based on Git. So contributors outside of the organization should be able to make pull requests (called Discussions).

#

https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions

Pull requests and Discussions

misty igloo Mar 17, 2025, 3:43 PM

#

cool

#

I just dont want to make it harder for you to edit while you're still doing it a bunch

keen tartan Mar 17, 2025, 3:44 PM

#

I am flexible. Whenever you think it is adequate to move it. I can work with Git. Just someone in the org needs to approve requests.

#

Wait.

#

"Discussions and Pull Requests are currently enabled for this dataset. Members of the community can propose changes to this repository."

#

Only members can make pull requests. I misinterpreted it.

misty igloo Mar 17, 2025, 3:48 PM

#

'members of the community'

#

not sure that means anyone or members of the org

keen tartan Mar 17, 2025, 3:49 PM

#

Yeah, it is ambiguous.

#

I just test.

#

^^

keen tartan Mar 17, 2025, 3:53 PM

#

misty igloo 'members of the community'

You are right. Any members of the HuggingFace community can open discussions / make pull requests

misty igloo Mar 17, 2025, 4:06 PM

#

@everyone no changes to the manuscript at this time, please - we are going to try to put it on arxiv

gusty condor Mar 17, 2025, 4:52 PM

#

We will update our eval results for arxiv v2.
+2 points on ARC-e and ARC-c each, and small gains in MMLU.

#

We may surpass Qwen2.5 this time with lm-eval 0.4.8

misty igloo Mar 17, 2025, 5:14 PM

#

gusty condor We will update our eval results for arxiv v2. +2 points on ARC-e and ARC-c each...

I don't think changing the evals in a future revision is a good look. Let's do it now or never.

gusty condor Mar 17, 2025, 5:15 PM

#

@obsidian quest How many tokens, and on which dataset, did you tune v7-world3-2b9-preview into v7-world3-2b9?

#

the former seems to be higher on certain evals like glue, gsm8k, and several others.

gusty condor Mar 17, 2025, 5:48 PM

#

I used the markdown package (which may require lualatex, which is incompatible with arxiv). I will change it

sonic horizon Mar 17, 2025, 6:06 PM

#

Hi everyone, Xingjian DU and I are still working on the audio modeling task. We were wondering if there might be any space available in the Evaluation section or appendix, either in this version or a future one, to include our work on this task? Of course, we fully respect your timeline and will align with your schedule.

gusty condor Mar 17, 2025, 6:08 PM

#

Sadly, @misty igloo missed the deadline. This means that we have another 24 hours to go.
So, we should focus on:

evaluations
Audio modeling tasks if applicable.

misty igloo Mar 17, 2025, 6:08 PM

#

sonic horizon Hi everyone, Xingjian DU and I are still working on the audio modeling task. We ...

Yes, please go ahead and insert your audio modeling subsection in the Multimodal Evaluations section

gusty condor Mar 17, 2025, 6:08 PM

#

You have 23 hours left.

misty igloo Mar 17, 2025, 6:09 PM

#

Haha try to do it much sooner than 23 hours though, please 🙂

gusty condor Mar 17, 2025, 6:15 PM

#

I have to sleep now. I will check the evaluation section.
By the way, @obsidian quest please tell the difference between v7-2b9-preview and v7-2b9-release

misty igloo Mar 17, 2025, 10:06 PM

#

@keen tartan if we're using these new results I'll need evals for Qwen2.5-7B as well for the FLOPs chart

#

also, I don't understand how your glue results have an average.. I thought those aren't given in later LM-eval versions

#

is this really using 0.4.8 for all of these?

#

I don't see the glue overall in the rwkv results for example

misty igloo Mar 17, 2025, 10:19 PM

#

keen tartan I think this is the missing books3 dataset: https://huggingface.co/datasets/Sayl...

Is this the actual missing dataset, or is the missing one the original books3 that included the gutenberg portions?

keen tartan Mar 17, 2025, 10:24 PM

#

misty igloo @keen tartan if we're using these new results I'll need evals for Qwen2...

Oh, ok. I'll try to get the evals together. Evaluation of a 7B takes a bit longer. But we'll try to complete it in time.

#

I use an averaging function based on @brisk bronze's extracted source when processing the results files. It is not a big deal to calculate it at all.

keen tartan Mar 17, 2025, 10:27 PM

#

misty igloo is this really using 0.4.8 for all of these?

Yes, I have more results already.

#

it is using 0.4.8

#

one moment

#

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval

#

Trying to share relevant evaluations for the paper I run there.

misty igloo Mar 17, 2025, 10:28 PM

#

hmm so 0.4.8 like gives glue averages sometimes but not others? are these the size-weighted ones or non-weighted?

#

bc either way we are going to have to manually calculate the averages for the ones it didn't print them for

keen tartan Mar 17, 2025, 10:29 PM

#

It only gives results for the individual subsets, but it is easy to just average them. The function allows toggling between size-weighted and unweighted averaging.

#

I share relevant code block.

misty igloo Mar 17, 2025, 10:30 PM

#

but I see glue averages in some of them 🙂

#

code block?

#

you're running these not via the cmdline?

#

that explains it

keen tartan Mar 17, 2025, 10:30 PM

#

I aggregate all results files to make tables and plots.

misty igloo Mar 17, 2025, 10:31 PM

#

I figured you ran the RWKV tests via lm-eval cmdline now, just like the rest of the evals

#

since in 0.4.8 it properly uses the flag for BOS token

keen tartan Mar 17, 2025, 10:31 PM

#

I can do either way.

#

One moment please.

#

def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # A helper function that is used to aggregate
    # subtask scores cross-task.

    if not weight_by_size:
        sizes = [1] * len(sizes)

    assert len(metrics) == len(sizes)

    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)

#

That is the function I am using to calculate the average in post-processing.

#

glue_val_split = {
  'cola': 1043,
  'mnli': 9815, # _matched
  'mnli_mismatch': 9832, # ed
  'mrpc': 408,
  'qnli': 5463,
  'qqp': 40430,
  'rte': 277,
  'sst2': 872,
  'stsb': 1500,
  'wnli': 71,
}

#

These are the individual subtask names and counts.
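
As a rough illustration of how the two pieces fit together, here is a self-contained sketch. The per-subtask scores are made-up placeholders (not results from any run), chosen only so that the size-weighted and unweighted averages visibly differ.

```python
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # Same helper as above: average subtask scores, optionally weighted by subtask size.
    if not weight_by_size:
        sizes = [1] * len(sizes)
    assert len(metrics) == len(sizes)
    return sum(metric * size for metric, size in zip(metrics, sizes)) / sum(sizes)


# Validation-split sizes as listed above.
glue_val_split = {
    "cola": 1043,
    "mnli": 9815,
    "mnli_mismatch": 9832,
    "mrpc": 408,
    "qnli": 5463,
    "qqp": 40430,
    "rte": 277,
    "sst2": 872,
    "stsb": 1500,
    "wnli": 71,
}

# HYPOTHETICAL scores, for illustration only: 0.5 everywhere except qqp.
placeholder_scores = {name: 0.5 for name in glue_val_split}
placeholder_scores["qqp"] = 0.9

names = list(glue_val_split)
metrics = [placeholder_scores[name] for name in names]
sizes = [glue_val_split[name] for name in names]

weighted = aggregate_subtask_metrics(metrics, sizes, weight_by_size=True)
unweighted = aggregate_subtask_metrics(metrics, sizes, weight_by_size=False)
print(f"size-weighted: {weighted:.4f}")
print(f"unweighted:    {unweighted:.4f}")
```

Because qqp accounts for more than half of the examples, the size-weighted average is pulled strongly toward its score, which is why it matters to state which variety a reported GLUE average uses.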

misty igloo Mar 17, 2025, 10:33 PM

#

which variety is shown in the glue outputs you currently have in this folder?

keen tartan Mar 17, 2025, 10:34 PM

#

It is not in the results files.

misty igloo Mar 17, 2025, 10:34 PM

#

it is actually

keen tartan Mar 17, 2025, 10:34 PM

#

I calculate it afterwards.

#

Oh

#

is it?

#

perhaps

misty igloo Mar 17, 2025, 10:34 PM

#

yes that's why this question arises 🙂

#

bc it generally doesnt show up there for 0.4.8 cmdline

keen tartan Mar 17, 2025, 10:35 PM

#

https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/RWKV-x070-World-2.9B-v3-20250211-ctx4096/pad_11/0.4.8_2025-03-15T11-47-35.065128_glue.json

#

I only see the statistics of the subtasks.

#

Anyway, calculating the average is not a big deal.

misty igloo Mar 17, 2025, 10:36 PM

#

https://huggingface.co/spaces/hevok/evals/blob/main/lm_eval/Qween__Qwen2.5-3B/Qwen2.5-3B_eng.json

#

this one has the avg under 'glue'

keen tartan Mar 17, 2025, 10:37 PM

#

Hold on.

misty igloo Mar 17, 2025, 10:37 PM

#

and as far as I know, that means it was not run under 0.4.8

keen tartan Mar 17, 2025, 10:37 PM

#

This file was generated from conversion of markdown table from @gusty condor

#

It was from a previous experiment using 0.4.3

misty igloo Mar 17, 2025, 10:38 PM

#

but.. you said that this folder had all 0.4.8 results

keen tartan Mar 17, 2025, 10:38 PM

#

Sorry for the confusion.

misty igloo Mar 17, 2025, 10:38 PM

#

this is why I am trying to make sure everything is done correctly

#

because it's clear to me that there is mismatching data

keen tartan Mar 17, 2025, 10:38 PM

#

Then there should be '0.4.8' in the file name.

misty igloo Mar 17, 2025, 10:39 PM

#

I don't see any files like that in this repo

keen tartan Mar 17, 2025, 10:39 PM

#

in sub folders.

misty igloo Mar 17, 2025, 10:39 PM

#

I mean for anything other than rwkv

keen tartan Mar 17, 2025, 10:39 PM

#

I have not run it yet for all models.

#

I have results for reference models that I have not yet pushed there.

misty igloo Mar 17, 2025, 10:40 PM

#

oh ok

keen tartan Mar 17, 2025, 10:40 PM

#

I was focusing on RWKV

#

For RWKV World models I have all 0.4.8 results already.

misty igloo Mar 17, 2025, 10:40 PM

#

sorry, that was the mixup - I didn't realize that not everything was done yet

keen tartan Mar 17, 2025, 10:40 PM

#

e.g. https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/RWKV-x070-World-2.9B-v3-20250211-ctx4096/pad_11

#

I have results for the reference models like SmolLM2, Llama, and Qwen as well.

#

Trying right now to organize them.

#

I did run those a week ago, but thought they would not be used in the paper.

misty igloo Mar 17, 2025, 10:42 PM

#

yeah I didn't expect us to change to this in the last 24 hrs before publishing

keen tartan Mar 17, 2025, 10:42 PM

#

I don't have Qwen 7B yet.

#

I'll try to share what I have before going to sleep.

#

@gusty condor and @brisk bronze and anyone else who likes can complement results.

misty igloo Mar 17, 2025, 10:45 PM

#

she's busy until tomorrow, unfortunately, so probably not enough time for her to contribute to those

keen tartan Mar 17, 2025, 10:45 PM

#

misty igloo she's busy until tomorrow, unfortunately, so probably not enough time for her to...

Good to know.

#

By the way, I figured out we can speed up evaluation with multiple GPUs.

misty igloo Mar 17, 2025, 10:46 PM

#

there are a few ways to do that using the cmdline

keen tartan Mar 17, 2025, 10:46 PM

#

Using accelerate, e.g.
```bash
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 'auto'
```

misty igloo Mar 17, 2025, 10:46 PM

#

yep

keen tartan Mar 17, 2025, 10:46 PM

#

Also set batch size to 'auto', then it tries to calculate ideal batch size itself.

misty igloo Mar 17, 2025, 10:47 PM

#

that 'auto' bsz tends to break (or it used to) for mmlu

#

but works on many normal evals

#

I also have a version of the RWKV eval harness thing that supported batched inference and is much faster

#

but I don't want to use it here

#

that's why I wanted to use the lm-eval cmdline version for RWKV, so we get multi-gpu acceleration and batching

#

anyway it doesnt matter, since you finished those

keen tartan Mar 17, 2025, 11:07 PM

#

@nova frost Isn't the lm_eval version specified in the output json file's metadata?

nova frost Mar 17, 2025, 11:24 PM

#

keen tartan @nova frost Isn't the lm_eval version specified in the output json fil...

no but we do log the git_hash

keen tartan Mar 17, 2025, 11:25 PM

#

nova frost no but we do log the `git_hash`

oh, that might be helpful to differentiate them.

nova frost Mar 17, 2025, 11:26 PM

#

yeah you can either checkout or browse https://github.com/EleutherAI/lm-evaluation-harness/tree/<git_hash>

keen tartan Mar 17, 2025, 11:27 PM

#

"git_hash": null, -.-*

nova frost Mar 17, 2025, 11:28 PM

#

damn. lol. I'll add the lm_eval version going forward

#

but that probably meant it wasn't run from a git dir, so installed from pypi
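
Since `git_hash` can be null for PyPI installs, one option is to record the installed package version at run time next to the results. A hedged sketch: the helper names and the `harness_version` key are hypothetical, and `lm_eval` as the distribution name is an assumption; only `importlib.metadata` is standard library.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def stamp_results(results, dist_name="lm_eval"):
    """Return a copy of a results dict with the currently installed version added.

    This only reflects the environment it is called in, so it has to run
    at evaluation time; it cannot recover the version behind an old file.
    """
    stamped = dict(results)
    stamped["harness_version"] = installed_version(dist_name)  # hypothetical key name
    return stamped


example = stamp_results({"git_hash": None})
print(example)
```

For result files that already exist without a hash or version, this does not help; re-running under a known version remains the only way to be certain.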

keen tartan Mar 17, 2025, 11:44 PM

#

I cannot tell for sure which version of lm_eval I used when I ran the reference model evaluations for SmolLM2, Llama, and Qwen some time ago.

#

We may need to recompute them to be certain.

obsidian quest Mar 18, 2025, 12:07 AM

#

gusty condor I have to sleep now. I will check the evaluation section. By the way, <@87013751...

LR decay 1e-5 to 1e-7, on randomly sampled 100G tokens. slightly improves loss & uncheatable eval

keen tartan Mar 18, 2025, 1:22 AM

#

Tried to organize most of the evaluations I ran: https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval
Not sure about the reference model's eval versions. Gonna try to get a few hours sleep.

gusty condor Mar 18, 2025, 2:32 AM

#

obsidian quest LR decay 1e-5 to 1e-7, on randomly sampled 100G tokens. slightly improves loss &...

@misty igloo Should mention in the paper

brisk bronze Mar 18, 2025, 3:58 AM

#

Running the reference models (qwen, llama, smollm) on 0.4.8, results will be here
https://github.com/jannalulu/lm-evaluation-harness/tree/main/results-0.4.8

gusty condor Mar 18, 2025, 7:10 AM

#

Note: perplexity of lambada_multilingual should be the geometric mean over 5 languages, not the arithmetic average. Strange that even lm_eval was mistaken on that.

young sparrow Mar 18, 2025, 7:27 AM

#

gusty condor Note: perplexity of `lambada_multilingual` should be the **geometric mean** over...

What is your source for this?

keen tartan Mar 18, 2025, 8:11 AM

#

I am awake. I focus now on SmolLM2 model series evaluations.

gusty condor Mar 18, 2025, 8:12 AM

#

young sparrow What is your source for this?

I think it's almost obvious, from the definition of perplexity.
The geometric mean of perplexity is equal to the exponential of average negative log likelihood loss.
On the other hand, the arithmetic average of perplexity has no clear semantic meaning.
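
A minimal sketch of that identity, assuming each language's perplexity is the exponential of its mean negative log-likelihood and that the five languages are weighted equally. The per-language loss values are hypothetical placeholders, not measured results.

```python
import math


def geometric_mean(values):
    """Geometric mean computed in log space for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# HYPOTHETICAL mean negative log-likelihoods per language (illustration only).
mean_nll = {"en": 1.2, "fr": 1.8, "de": 2.0, "it": 1.9, "es": 1.7}

# Perplexity per language is exp(mean NLL).
perplexity = {lang: math.exp(loss) for lang, loss in mean_nll.items()}

geo_mean_ppl = geometric_mean(perplexity.values())
exp_of_mean_nll = math.exp(sum(mean_nll.values()) / len(mean_nll))
arith_mean_ppl = sum(perplexity.values()) / len(perplexity)

print(f"geometric mean of perplexities:  {geo_mean_ppl:.4f}")
print(f"exp(mean of per-language NLL):   {exp_of_mean_nll:.4f}")
print(f"arithmetic mean of perplexities: {arith_mean_ppl:.4f}")
```

The first two numbers agree up to floating-point error, while the arithmetic mean is never smaller than the geometric mean and is dominated by the languages with the highest perplexity.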

keen tartan Mar 18, 2025, 8:27 AM

#

brisk bronze Running the reference models (qwen, llama, smollm) on 0.4.8, results will be her...

Awesome! The Qwen ones are almost done. Qwen 3B is missing 5-shot MMLU, and Qwen 0.5B also needs to be calculated.

#

Llama 1B and 3B are also needed.

#

SmolLM2 135M, 360M, and 1.7B also required.

gusty condor Mar 18, 2025, 8:29 AM

#

keen tartan SmolLM2 135M, 360M, and 1.7B also required.

You've done that.

keen tartan Mar 18, 2025, 8:29 AM

#

gusty condor You've done that.

I check.

gusty condor Mar 18, 2025, 8:30 AM

#

The empty entries in Tables 3 and 4 are currently missing.

#

We should recheck pile models too

keen tartan Mar 18, 2025, 10:19 AM

#

Qwen 2.5 7B sciq and 5-shot MMLU are also missing. Anyone running those? I could attempt it, but my runs always take so long to finish.

keen tartan Mar 18, 2025, 11:15 AM

#

I try Qwen 2.5 7B sciq

keen tartan Mar 18, 2025, 11:49 AM

#

Done, trying now 5-shot MMLU (but it seems to take over 5h). I am abandoning 7B and focusing on the smaller models first for now.

#

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/Qwen__Qwen2.5-7B/0.4.8

gusty condor Mar 18, 2025, 1:01 PM

#

keen tartan Done, trying now 5-shot MMLU (but it seems to take over 5h). I am abandoning 7B a...

Qwen2.5 7B mmlu 5-shot=74.2 confirmed by both https://arxiv.org/pdf/2412.15115 and https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/Qwen__Qwen2.5-7B/results_2025-03-06T22-02-41.880180.json

GitHub

lm-evaluation-harness/results-0.4.3/Qwen__Qwen2.5-7B/results_2025-0...

A framework for few-shot evaluation of language models. - jannalulu/lm-evaluation-harness

misty igloo Mar 18, 2025, 1:28 PM

#

sonic horizon Hi everyone, Xingjian DU and I are still working on the audio modeling task. We ...

is the tab:audiorwkv_results table ready for this? It's not currently in the document

#

We will need that in the next few hours in order for your experiments to be a part of the arxiv pre-print in this version

#

The COLM deadline is also soon, and I will need your OpenReview IDs and/or the emails you used to sign up with OpenReview

#

I also think you need more explanation of how your "approach enhances RWKV-7's capabilities to interpret and process complex, high-dimensional spectrogram features"

#

If you make claims in the paper they need to come with evidence.

#

The text currently does not describe anything at all about how or what AudioRWKV-7 does, except that it uses spectrograms.

#

Considering the timeline here I am going to comment this out of the paper for now. If you think you have what's needed before say 4pm UTC today, let us know here and we can consider if there is time to put it in.

#

This is a very late addition, and I'm not guaranteeing that it will be able to become a part of the paper. That will depend on both when you have the full writeup ready, and what the quality is like.

nova frost Mar 18, 2025, 1:38 PM

#

keen tartan Done, trying now 5-shot MMLU (but it seems to take over 5h). I am abandoning 7B a...

this seems on the high side. Are you batching correctly?

keen tartan Mar 18, 2025, 1:40 PM

#

nova frost this seems on the high side. Are you batching correctly?

I set it to --batch_size="auto".

#

Used an NVIDIA L40S with 48 GB memory. It occupied almost all the VRAM, so I assumed it was batching it correctly.

#

For another run with Qwen-2.5 3B I got OOM with the same lm_eval command args but on a different, smaller dual-T4 GPU setup. Gonna try setting batch size to 1 or using a single GPU in this case.

nova frost Mar 18, 2025, 1:50 PM

#

yeah auto can sometimes be unreliable

#

PRs welcome if anyone can improve on it: https://github.com/EleutherAI/lm-evaluation-harness/blob/fa1ce2c665aa4d079a822fbb6fae905d531aca1f/lm_eval/models/huggingface.py#L736

keen tartan Mar 18, 2025, 1:51 PM

#

nova frost PRs welcome if anyone can improve on it: https://github.com/EleutherAI/lm-evalua...

I am interested in contributing to lm-eval! Need to recover my GitHub account first...

nova frost Mar 18, 2025, 1:54 PM

#

can also do auto:N so that it recomputes the batch size N number of times. But this is mostly helpful if you're running multiple tasks (so more variation in seq lengths)

keen tartan Mar 18, 2025, 1:54 PM

#

nova frost can also do `auto:N` so that it recomputes the batch size N number of times. But...

Oh, gonna try that too. Thanks! Default is auto:1.

gusty condor Mar 18, 2025, 2:07 PM

#

Urgent: who is testing llama3.2?

misty igloo Mar 18, 2025, 2:16 PM

#

it's possible Janna is running those, (but she's in PT timezone, and it's still 7:16am there, and she just got back from a late night flight)

#

last night she said "probably would also do llama and smollm"
I had asked her to coordinate with @keen tartan tho so maybe he knows if they're in progress

keen tartan Mar 18, 2025, 2:18 PM

#

misty igloo it's possible Janna is running those, (but she's in PT timezone, and it's still ...

That is possible. I was very tired/exhausted last night.

keen tartan Mar 18, 2025, 2:18 PM

#

gusty condor Urgent: who is testing llama3.2?

I am testing Llama 3.2 1B and 3B.

#

I can push the results I got so far.

#

It is not all completed yet.

#

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/meta-llama__Llama-3.2-1B

#

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/meta-llama__Llama-3.2-3B

#

lm_eval version is in the file names. I will move them into folders later on.

gusty condor Mar 18, 2025, 2:25 PM

#

gusty condor Urgent: who is testing llama3.2?

I am testing too. ETA 50 min before deadline

misty igloo Mar 18, 2025, 2:34 PM

#

gusty condor I am testing too. ETA 50 min before deadline

Oh, you mean it will complete 50 min before the submission deadline? ugh

#

once we have all the data I have to see if christian can regenerate the flops plots he made

#

I have submitted our abstract to COLM.

#

https://openreview.net/forum?id=ayB1PACN5j

OpenReview

fresh mulch Mar 18, 2025, 2:55 PM

#

misty igloo once we have all the data I have to see if christian can regenerate the flops pl...

this should not be very difficult... if the numbers don't drastically change, the same formatting should work, and that's most of the complexity in making the graphs

#

hey, wait, llama3.2 is not in the FLOPS charts

gusty condor Mar 18, 2025, 2:56 PM

#

Great! We are almost done.

misty igloo Mar 18, 2025, 2:57 PM

#

fresh mulch this should not be very difficult... if the numbers don't drastically change, th...

yeah just a matter of your timing availability considering the tight timeline around the 16:00 UTC deadline

#

What are we still missing revised number for? Just SmolLM?

fresh mulch Mar 18, 2025, 2:58 PM

#

I have a meeting for the next hour but after that will be available to update charts, which should work out

gusty condor Mar 18, 2025, 2:58 PM

#

misty igloo What are we still missing revised number for? Just SmolLM?

SmolLM numbers are done, see table 3 and 4

#

only llama 3.2 3b left

misty igloo Mar 18, 2025, 2:59 PM

#

gusty condor SmolLM numbers are done, see table 3 and 4

okay I'll start plugging in the data into my google sheet, and Christian can copy from that later

gusty condor Mar 18, 2025, 2:59 PM

#

gusty condor only llama 3.2 3b left

and eta 1h

misty igloo Mar 18, 2025, 2:59 PM

#

we dont use llama on this sheet so no problem

gusty condor Mar 18, 2025, 3:32 PM

#

@misty igloo Please redact links in the abstract for COLM for anonymity!

misty igloo Mar 18, 2025, 3:32 PM

#

thanks, forgot that

#

updated.

misty igloo Mar 18, 2025, 3:51 PM

#

there is no new pawsx test in @brisk bronze 's output for Qwen

#

also, I need some numbers for lambada.m on Qwen 7B if we are calculating it some new way

gusty condor Mar 18, 2025, 3:53 PM

#

We present RWKV-7 "Goose", a new sequence modeling architecture featuring constant memory usage, constant inference time per token, state-of-the-art downstream performance on multilingual tasks, and near SoTA English language performance at the 3 billion parameter scale despite being trained on dramatically fewer tokens than top models in its class. To accomplish this, RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. We also present an extended open source 3.1 trillion token multilingual corpus. We trained a set of models from 0.19 billion to 2.9 billion parameters on this dataset and find they exhibit exceptional performance across a range of common benchmarks.

To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.

#

We present RWKV-7 "Goose", a new sequence modeling architecture. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. To test RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and trained four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. These models exhibit state-of-the-art downstream performance on multilingual tasks, and near SoTA English language performance at the 3 billion parameter scale despite being trained on dramatically fewer tokens than top models in its class. Still, RWKV-7 models remains at constant memory usage and constant inference time per token.

To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.

Which version is better?

sonic horizon Mar 18, 2025, 4:00 PM

#

misty igloo Considering the timeline here I am going to comment this out of the paper for no...

Thanks for the reminder, we are still working on that. That version is just a placeholder, and an updated version with more model details is being written. I will finish it ASAP

misty igloo Mar 18, 2025, 4:03 PM

#

sonic horizon Thanks for the reminder, we are still working on that. That version is just a p...

please wait to add it until after 1600 UTC - we will be submitting the current version of the paper without it to arxiv at that time

#

we can still potentially add it later for v2

gusty condor Mar 18, 2025, 4:06 PM

#

misty igloo there is no new pawsx test in @brisk bronze 's output for Qwen

Please refer to hevok's tests

misty igloo Mar 18, 2025, 4:10 PM

#

@gusty condor chart looks wrong for Qwen 3B multilingual

#

could you check that all the numbers in the manuscript are really correct for that one?

#

or maybe 1.5B numbers are too high

keen tartan Mar 18, 2025, 4:12 PM

#

misty igloo also, I need some numbers for lambada.m on Qwen 7B if we are calculating it some...

I attempt to test Qwen-2.5 7B lambada.m.

misty igloo Mar 18, 2025, 4:13 PM

#

keen tartan I attempt to test Qwen-2.5 7B lambada.m.

I think the numbers exist, I just need to know the final avg bc it seems like @gusty condor changed the formula

keen tartan Mar 18, 2025, 4:13 PM

#

misty igloo I think the numbers exist, I just need to know the final avg bc it seems like <@...

All right. Indeed, we have it already. Use geometric mean for averaging.

misty igloo Mar 18, 2025, 4:14 PM

#

they are here https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.8/qwen__qwen2.5-7B/results_2025-03-18T02-36-49.068653.json

keen tartan Mar 18, 2025, 4:16 PM

#

I try.

misty igloo Mar 18, 2025, 4:18 PM

#

Where are these new results for RWKV Pile coming from? I don't see them anywhere

brisk bronze Mar 18, 2025, 4:19 PM

#

misty igloo there is no new pawsx test in @brisk bronze 's output for Qwen

oh forgot to run that, will run it now

misty igloo Mar 18, 2025, 4:19 PM

#

brisk bronze oh forgot to run that, will run it now

it got run already, see Hevok's data above

brisk bronze Mar 18, 2025, 4:20 PM

#

misty igloo it got run already, see Hevok's data above

oh

keen tartan Mar 18, 2025, 4:20 PM

#

import numpy as np

#define custom function
def g_mean(x):
    a = np.log(x)
    return np.exp(a.mean())

#calculate geometric mean
g_mean([41.524835786735544, 3.70873656895629, 67.94895756237318, 23.454130938244965, 31.073140477732952])

Output: 23.793823761545397

gusty condor Mar 18, 2025, 4:20 PM

#

keen tartan All right. Indeed, we have it already. Use *geometric mean* for averaging.

That's only for perplexity
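
A minimal sketch of that distinction, assuming NumPy: perplexity-style scores are combined with the geometric mean (the same as exponentiating the mean log-perplexity), while accuracy-style scores take a plain arithmetic mean. The perplexity values are the ones from the snippet above; the accuracy values are made-up placeholders, not results from the paper, and the function names are illustrative.

```python
import numpy as np

def geometric_mean(values):
    """exp(mean(log(x))); used here for perplexity-like scores."""
    return float(np.exp(np.mean(np.log(values))))

def arithmetic_mean(values):
    """Plain average; used here for accuracy-like scores."""
    return float(np.mean(values))

# Perplexities copied from the snippet above.
perplexities = [41.524835786735544, 3.70873656895629, 67.94895756237318,
                23.454130938244965, 31.073140477732952]

# Hypothetical accuracies (percent), for illustration only.
accuracies = [62.1, 48.3, 71.0]

ppl_avg = geometric_mean(perplexities)
acc_avg = arithmetic_mean(accuracies)
print(f"perplexity (geometric mean): {ppl_avg:.4f}")
print(f"accuracy (arithmetic mean):  {acc_avg:.2f}")
```

The geometric mean keeps one very large perplexity from dominating the aggregate, which is why it suits perplexity but is unnecessary for bounded accuracy scores.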

misty igloo Mar 18, 2025, 4:21 PM

#

uh right

#

sorry I guess we dont need it for avg

#

I just gotta average them all

keen tartan Mar 18, 2025, 4:21 PM

#

lol

misty igloo Mar 18, 2025, 4:21 PM

#

sorry, doing a million edits right now - this is nuts trying to change this whole sheet and all its sub evals at the last minute

#

without making mistakes

#

@gusty condor Where are these new results for RWKV Pile coming from? I don't see the source data anywhere

#

is this just recalculating glue via normal avg?

keen tartan Mar 18, 2025, 4:27 PM

#

Seems like tables 3 and 4 are fully completed. Is anything still missing?

#

I'm gonna try to double check the values.

misty igloo Mar 18, 2025, 4:27 PM

#

I think I have everything done on the google sheet

#

but I'm still concerned that Qwen line looks bad

keen tartan Mar 18, 2025, 4:28 PM

#

I look at Qwen 3B now.

fresh mulch Mar 18, 2025, 4:28 PM

#

@misty igloo good for me to transfer numbers to mine?

misty igloo Mar 18, 2025, 4:28 PM

#

fresh mulch @misty igloo good for me to transfer numbers to mine?

at least provisionally... qwen multilingual seems like some weirdness but otherwise it should be correct

fresh mulch Mar 18, 2025, 4:29 PM

#

qwen has the same behavior with a relative dip at 3B multilingual in the previous data, but less pronounced ig

#

plus that's the easy one to change

misty igloo Mar 18, 2025, 4:30 PM

#

it may be correct, but it looks sus to me

gusty condor Mar 18, 2025, 4:31 PM

#

keen tartan I look at Qwen 3B now.

Seems correct. Checking pawsx again

gusty condor Mar 18, 2025, 4:33 PM

#

misty igloo is this just recalculating glue via normal avg?

Yes, and it turns out there is no difference between lm-eval 0.4.3 and 0.4.8 on these English benchmarks

misty igloo Mar 18, 2025, 4:36 PM

#

could also be that 1.5B is the one that's wrong, which makes 3B look like it dips

gusty condor Mar 18, 2025, 4:38 PM

#

misty igloo could also be that 1.5B is the one that's wrong, which makes 3B look like it dip...

or 7b

fresh mulch Mar 18, 2025, 4:38 PM

#

hey, who removed subfigure captions on 3 and 4? we need those for some of the crossreferences

misty igloo Mar 18, 2025, 4:38 PM

#

fresh mulch hey, who removed subfigure captions on 3 and 4? we need those for some of the cr...

I did - janna had asked if we needed it

#

I'll comment it back in, sorry

gusty condor Mar 18, 2025, 4:39 PM

#

fresh mulch hey, who removed subfigure captions on 3 and 4? we need those for some of the cr...

You haven't updated figure 4 yet

fresh mulch Mar 18, 2025, 4:39 PM

#

it is updated if you recompile

gusty condor Mar 18, 2025, 4:40 PM

#

Nope, looks like something wrong with your plot. Now RWKV7 should be higher than Qwen2.5 at 3B

fresh mulch Mar 18, 2025, 4:41 PM

#

smerky's average sheet says 71.0 average RWKV7 2.9B and 71.4 average qwen2.5 3B

gusty condor Mar 18, 2025, 4:41 PM

#

which sheet

fresh mulch Mar 18, 2025, 4:41 PM

#

@misty igloo

fresh mulch Mar 18, 2025, 4:42 PM

#

gusty condor Nope, looks like something wrong with your plot. Now RWKV7 should be higher than...

which data are you getting this idea from?

gusty condor Mar 18, 2025, 4:42 PM

#

fresh mulch which data are you getting this idea from?

@keen tartan suggested [11] for pad

keen tartan Mar 18, 2025, 4:42 PM

#

Why do we show Qwen2.5-7B for English in table 3 but not for multilingual in table 4? I think we should consider commenting it out of table 3, as we do not yet have an RWKV model of this class to compare with.

misty igloo Mar 18, 2025, 4:43 PM

#

keen tartan Why do we show Qwen2.5-7B for English in table 3 but not for multilingual in tabl...

there is no reason to show it in the tables, it's there in the charts to show how it changes with further scaling

gusty condor Mar 18, 2025, 4:44 PM

#

I put that in for reference. Should definitely comment out

keen tartan Mar 18, 2025, 4:44 PM

#

Yeah, I can see this. Was also toying around in thought with extrapolating how RWKV7 7B and 14B would perform

gusty condor Mar 18, 2025, 4:44 PM

#

Apologies for the confusion.

misty igloo Mar 18, 2025, 4:48 PM

#

gusty condor Nope, looks like something wrong with your plot. Now RWKV7 should be higher than...

Please be specific about what you're comparing - the plots in the document are outdated

fresh mulch Mar 18, 2025, 4:48 PM

#

misty igloo Please be specific about what you're comparing - the plots in the document are o...

the plots in the document as of latest recompile are updated to the latest data in your sheet

#

so the data we are working with does not appear to support that claim

misty igloo Mar 18, 2025, 4:49 PM

#

I agree it looks wrong

#

checking

#

@fresh mulch I updated the RWKV7 1.5 and 2.9B numbers

#

they were old

fresh mulch Mar 18, 2025, 4:54 PM

#

ah okay

#

english only or both

misty igloo Mar 18, 2025, 4:54 PM

#

eng, checking multi now

#

the multi look ok

fresh mulch Mar 18, 2025, 4:55 PM

#

fixed, uploaded

misty igloo Mar 18, 2025, 4:56 PM

#

@gusty condor do you think we can put SoTA now for both, instead of 'near SoTA' english

#

I'm a little leery of making the claim of SoTA on english

#

because we don't establish a new SoTA, except on a per tokens trained basis

#

which should matter but... I just don't want to overclaim

gusty condor Mar 18, 2025, 4:58 PM

#

SoTA-level?

#

we are somehow on par with sota

misty igloo Mar 18, 2025, 4:59 PM

#

gusty condor ``` We present RWKV-7 "Goose", a new sequence modeling architecture. RWKV-7 intr...

I don't mind this reordering, but it leaves the juiciest part until the very end of the paragraph

#

It would be nice to lead with our best foot forward: that we have the best 3B LLM for way less training, and demolish everything on multilingual

gusty condor Mar 18, 2025, 5:00 PM

#

Bo might have some different opinions: Architecture is the juiciest part, the model serves as a tool to demonstrate the architecture.

misty igloo Mar 18, 2025, 5:00 PM

#

It's good for the first sentence to include the best results

misty igloo Mar 18, 2025, 5:00 PM

#

gusty condor Bo might have some different opinions: Architecture is the juiciest part, the mo...

I agree, but the only way anyone can judge the architecture is via the results

gusty condor Mar 18, 2025, 5:01 PM

#

gusty condor we are somehow on par with sota

but not significant enough to claim the new sota

misty igloo Mar 18, 2025, 5:02 PM

#

How about something like this:

We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and trained four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset.

To foster openness, reproduction, and adoption, we release our models and dataset component listing on Hugging Face, and our training and inference code on GitHub; all under the Apache 2.0 License.

gusty condor Mar 18, 2025, 5:04 PM

#

To test -> to demonstrate

misty igloo Mar 18, 2025, 5:05 PM

#

'with LLMs' bothers me a bit.. not sure how to rephrase that

#

maybe 'with released LLMs'?

gusty condor Mar 18, 2025, 5:05 PM

#

along with language models

#

Nevertheless, RWKV-7 models use ...

misty igloo Mar 18, 2025, 5:11 PM

#

okay are we ready for publishing?

gusty condor Mar 18, 2025, 5:11 PM

#

Yes!

fresh mulch Mar 18, 2025, 5:12 PM

#

LGTM

misty igloo Mar 18, 2025, 5:14 PM

#

@gusty condor can I remove your footnotesize on the multilang table?

gusty condor Mar 18, 2025, 5:15 PM

#

OK, that is up to you

#

Is compilation successful?

misty igloo Mar 18, 2025, 5:18 PM

#

trying that now

gusty condor Mar 18, 2025, 5:18 PM

#

post error msg asap so that we can debug

misty igloo Mar 18, 2025, 5:21 PM

#

Submission processed OK

#

📎 rwkv7_arxiv_preview_250318_r1.pdf

#

take a look and let me know if anyone sees anything wrong

gusty condor Mar 18, 2025, 5:24 PM

#

🎉 🪿 🎉 🥳

crystal hull Mar 18, 2025, 5:24 PM

#

https://x.com/maxmbeck/status/1901908266382143959

Maximilian Beck (@maxmbeck) on X

📢🔔I am excited to share the details on our optimized xLSTM architecture for our xLSTM 7B model!🚨

We optimized the architecture with two goals in mind:

- Efficiency (in Training and Inference)
and
- Stability

🧵(1/7)

misty igloo Mar 18, 2025, 5:25 PM

#

anyone got an idea of what ACM class we are?

#

I.2.7 maybe

#

I.2.7 Natural Language Processing

#

https://www.acm.org/publications/computing-classification-system/1998/i.2.7

#

I guess that's what I'll put

keen tartan Mar 18, 2025, 5:26 PM

#

Sounds alright

gusty condor Mar 18, 2025, 5:28 PM

#

I think I.2.0 (this is where general architectures should live). But I.2.7 is still great.

keen tartan Mar 18, 2025, 5:28 PM

#

Yeah, as it can be applied to other modalities as well, not only NLP.

misty igloo Mar 18, 2025, 5:28 PM

#

I can put multiple

#

I'll do both

#

Categories cs.CL, cs.AI, and cs.LG okay?

#

Computation and Language (cs.CL)
Cross lists (optional):
Artificial Intelligence (cs.AI)
Machine Learning (cs.LG)

#

Article submitted

keen tartan Mar 18, 2025, 5:31 PM

#

Goose-tastic! (or even better: Honk-tastic!)

misty igloo Mar 18, 2025, 5:31 PM

#

🎉

fresh mulch Mar 18, 2025, 5:31 PM

#

🥳

#

time to compress for colm?

misty igloo Mar 18, 2025, 5:32 PM

#

yup, time for that annoying process

gusty condor Mar 18, 2025, 5:32 PM

#

We shall have a good rest

misty igloo Mar 18, 2025, 5:32 PM

#

don't edit the current doc for that, we will make a new one

#

Great job, everyone!!!!

gusty condor Mar 18, 2025, 5:33 PM

#

I go sleeping then 🛌 😴

misty igloo Mar 18, 2025, 5:33 PM

#

you deserve a good sleep!!! gnite! Great work!

obsidian quest Mar 18, 2025, 6:12 PM

#

great work 🙂
please test mamba for <|endoftext|> effect as i predict it will be strong too.

gusty condor Mar 19, 2025, 1:06 AM

#

Time's up! How is it going?

misty igloo Mar 19, 2025, 1:25 AM

#

https://arxiv.org/abs/2503.14456

arXiv.org

RWKV-7 "Goose" with Expressive Dynamic State Evolution

We present RWKV-7 "Goose", a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B mo...

#

It just went out two min ago!

#

how do we submit it to HF daily papers? maybe tweet at Akhaliq?

#

oh its here https://huggingface.co/papers/submit but I somehow don't have a paper listed so maybe someone else here can who does

Hugging Face – The AI community building the future.

gusty condor Mar 19, 2025, 1:59 AM

#

Done!

willow condor Mar 19, 2025, 2:42 AM

#

Have you posted to r/LocalLlama and/or r/MachineLearning?

misty igloo Mar 19, 2025, 2:47 AM

#

Nope! Plz do so if you can

willow condor Mar 19, 2025, 3:02 AM

#

Sure.

willow condor Mar 19, 2025, 3:11 AM

#

misty igloo Nope! Plz do so if you can

https://www.reddit.com/r/MachineLearning/comments/1jenmqz/r_rwkv7_goose_with_expressive_dynamic_state/

https://www.reddit.com/r/LocalLLaMA/comments/1jenouw/rwkv7_goose_with_expressive_dynamic_state/

From the MachineLearning community on Reddit: [R] RWKV-7 "Goose" wi...

Posted by Wooden-Deer-1276 - 1 vote and 0 comments

From the LocalLLaMA community on Reddit: RWKV-7 "Goose" with Expres...

Posted by Wooden-Deer-1276 - 1 vote and 0 comments

gusty condor Mar 19, 2025, 4:10 AM

#

We are competing with 2 papers, #2 of which looks so clickbaity. Yet such papers receive lots of upvotes. This is unfair😂

gusty condor Mar 19, 2025, 4:59 AM

#

Now we are #2 and Impossible Videos is #1.

gusty condor Mar 19, 2025, 5:21 AM

#

One more upvote and we are #1!

crystal hull Mar 19, 2025, 5:51 AM

#

Screenshot_2025-03-19_at_11.21.30_AM.png

keen tartan Mar 19, 2025, 10:40 AM

#

It got nominated #1 paper of the day on HF. Of course. RWKV7 is more fundamental than a "merely deep fake generator". A "bimodal" benchmark is also no real competitor.

#

We may want to consider adding some illustrative visuals to the abstract page for the next version.

spring epoch Mar 19, 2025, 10:56 AM

#

willow condor https://www.reddit.com/r/MachineLearning/comments/1jenmqz/r_rwkv7_goose_with_exp...

your account seems to have been suspended

#

also r/localllama is horrible with their moderation, I always have difficulty posting about RWKV

keen tartan Mar 19, 2025, 11:00 AM

#

spring epoch your account seems to have been suspended

@willow condor You can recover your account. Contact Reddit support and submit an appeal.

#

https://support.reddithelp.com/hc/en-us/requests/new?ticket_form_id=360000600232

willow condor Mar 19, 2025, 11:12 AM

#

keen tartan @willow condor You can recover your account. Contact Reddit support and s...

Thanks!

dusty skiff Mar 19, 2025, 12:28 PM

#

does rwkv work with attention as a hybrid?

gusty condor Mar 19, 2025, 12:29 PM

#

Yes, but that adds no benefit. You will barely see decreased loss or benchmark improvements.

misty igloo Mar 19, 2025, 1:01 PM

#

gusty condor Yes, but that adds no benefit. You will barely see decreased loss or benchmark i...

I think you will see a benefit. But I don't have experimental evidence for v7 yet

gusty condor Mar 19, 2025, 1:10 PM

#

@paper dove did some: Adding one layer of attention to L12/D768 RWKV-7 decreases loss by around 0.0008 (not significant).

misty igloo Mar 19, 2025, 1:38 PM

#

Interesting!

dusty skiff Mar 19, 2025, 2:25 PM

#

I'm not having the best results with rwkv, it's worse than attention in my experiments

keen tartan Mar 19, 2025, 2:27 PM

#

dusty skiff I'm not having the best results with rwkv, it's worse than attention in my exper...

What did you try?

dusty skiff Mar 19, 2025, 2:28 PM

#

wdym

keen tartan Mar 19, 2025, 2:28 PM

#

dusty skiff wdym

I mean what experiments did you conduct and what was the result compared to the expected results?

#

Which models did you use for instance?

dusty skiff Mar 19, 2025, 2:30 PM

#

I tried it on my custom dataset for language modeling, but it's totally different than typical LM datasets. I compared it to transformer with rope, value residual and muon optimizer

keen tartan Mar 19, 2025, 2:30 PM

#

dusty skiff I tried it on my custom dataset for language modeling, but it's totally differen...

So you trained model from scratch or fine-tuned?

dusty skiff Mar 19, 2025, 2:30 PM

#

the result is it's worse 0.3 in loss

#

from scratch

#

3.5m model

keen tartan Mar 19, 2025, 2:32 PM

#

There are lots of things to consider for training.

keen tartan Mar 19, 2025, 2:32 PM

#

dusty skiff 3.5m model

That is a very small model. Interesting!

#

What trainer have you used? The RWKV-LM repo?

dusty skiff Mar 19, 2025, 2:32 PM

#

no, mine code

keen tartan Mar 19, 2025, 2:33 PM

#

dusty skiff no, mine code

I see. If you'd like to share, we can provide feedback on getting better results.

gusty condor Mar 19, 2025, 2:33 PM

#

dusty skiff the result is it's worse 0.3 in loss

There must be something wrong with the code. 0.3 in loss is too significant.

dusty skiff Mar 19, 2025, 2:33 PM

#

I literally replaced the attention layer with RWKV7Attention from fla

#

RWKV7Attention(
    mode="chunk",
    hidden_size=hidden_size,
    head_dim=64,
    num_heads=None,
    decay_low_rank_dim=64,
    gate_low_rank_dim=128,
    a_low_rank_dim=64,
    v_low_rank_dim=32,
    # v_low_rank_dim=16,
    norm_eps=1e-5,
    fuse_norm=True,
    layer_idx=layer_idx
)

gusty condor Mar 19, 2025, 2:34 PM

#

The initialization of FLA-RWKV7 does not function properly.

Parameter Initializations Proper parameter initialization is crucial for ensuring training stability and achieving optimal performance for language models. RWKV-7 employs a carefully designed initialization strategy tailored to its architecture. The detailed initialization scheme is beyond the scope here but can be found in the official code repository. We emphasize that using the recommended initialization is essential for replicating the results in this paper. Deviations from the prescribed initialization may lead to performance degradation.

dusty skiff Mar 19, 2025, 2:34 PM

#

good to know lmao

#

so riddle me this

#

how was this trained? https://huggingface.co/fla-hub/rwkv7-191M-world

fla-hub/rwkv7-191M-world · Hugging Face

keen tartan Mar 19, 2025, 2:36 PM

#

It was converted from RWKV checkpoint.

dusty skiff Mar 19, 2025, 2:36 PM

#

ah ok

keen tartan Mar 19, 2025, 2:36 PM

#

It was trained with code from this repo: https://github.com/BlinkDL/RWKV-LM

GitHub

GitHub - BlinkDL/RWKV-LM: RWKV (pronounced RwaKuv) is an RNN with g...

RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". So it'...

#

Check the RWKV-v5 folder. There is the training code for RWKV7. I know it is a bit confusing.

dusty skiff Mar 19, 2025, 2:39 PM

#

where is initialization code?

keen tartan Mar 19, 2025, 2:40 PM

#

RWKV-v5/src/model.py

gusty condor Mar 19, 2025, 2:41 PM

#

https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L1351

GitHub

RWKV-LM/RWKV-v5/src/model.py at main · BlinkDL/RWKV-LM

RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". So it'...

#

This function is extremely obfuscated, but the main purpose is:

- initialize down projections with 0
- initialize embedding with very small numbers
- orthogonally initialize up projections, r, k, v and output head with relatively small gains
- initialize token shifting with some magic numbers
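
A rough sketch of the first three steps, written with NumPy so it runs standalone. This is not the RWKV-LM initialization code: the parameter names, shapes, embedding scale, and gains below are hypothetical placeholders, and the token-shift "magic numbers" are not reproduced here because they have to be taken from the official repository.

```python
import numpy as np

def orthogonal(shape, gain, rng):
    """Return a (rows, cols) matrix with orthonormal rows or columns, scaled by gain."""
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)           # q has orthonormal columns
    q = q * np.sign(np.diag(r))      # normalize signs so diag(r) is positive
    if rows < cols:
        q = q.T                      # orthonormal rows instead of columns
    return gain * q

def init_params(vocab, dim, low_rank, rng):
    # All names, scales, and gains are illustrative assumptions.
    return {
        # embedding: very small numbers
        "emb.weight": rng.uniform(-1e-4, 1e-4, size=(vocab, dim)),
        # low-rank down projection: zeros
        "att.w_down.weight": np.zeros((low_rank, dim)),
        # up projection, r/k/v, and output head: orthogonal with small gains
        "att.w_up.weight": orthogonal((dim, low_rank), gain=0.1, rng=rng),
        "att.receptance.weight": orthogonal((dim, dim), gain=0.5, rng=rng),
        "att.key.weight": orthogonal((dim, dim), gain=0.1, rng=rng),
        "att.value.weight": orthogonal((dim, dim), gain=0.5, rng=rng),
        "head.weight": orthogonal((vocab, dim), gain=0.5, rng=rng),
    }

params = init_params(vocab=64, dim=16, low_rank=4, rng=np.random.default_rng(0))
for name, value in params.items():
    print(f"{name:24s} shape={value.shape} max_abs={np.abs(value).max():.4f}")
```

Printing the maximum absolute value per tensor is a quick way to confirm that the zeroed and small-scale tensors really ended up that way.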

obsidian quest Mar 19, 2025, 2:44 PM

#

moreover use LayerNorm for rwkv7 (not RMSnorm)

dusty skiff Mar 19, 2025, 2:45 PM

#

yeah I use ln

obsidian quest Mar 19, 2025, 2:51 PM

#

moreover you can modify https://github.com/BlinkDL/modded-nanogpt-rwkv
(note this is a variation of rwkv7)

GitHub

GitHub - BlinkDL/modded-nanogpt-rwkv: RWKV-7: Surpassing GPT

RWKV-7: Surpassing GPT. Contribute to BlinkDL/modded-nanogpt-rwkv development by creating an account on GitHub.

#

you can verify this
https://x.com/BlinkDL_AI/status/1855245097094517181

BlinkDL (@BlinkDL_AI) on X

RWKV-7 can also reach 2.27xx in 3200 steps (originally 5100 steps)😀reproducible code & log: https://t.co/cuH0pItsPy 🚀 #RWKV #RNN

dusty skiff Mar 19, 2025, 2:52 PM

#

I mean it's kind of interesting because on some other dataset which was more prone to overfit on some implicit concepts, it behaved better

obsidian quest Mar 19, 2025, 2:53 PM

#

0.3 loss difference certainly means something is wrong 😂

obsidian quest Mar 19, 2025, 2:54 PM

#

dusty skiff I mean it's kind of interesting because on some other dataset which was more pro...

if it is not better than your transformer, the code is buggy

dusty skiff Mar 19, 2025, 2:54 PM

#

dusty skiff I mean it's kind of interesting because on some other dataset which was more pro...

yeah but in this case it was better by 0.3

obsidian quest Mar 19, 2025, 2:54 PM

#

are you comparing train loss, or val loss?

dusty skiff Mar 19, 2025, 2:55 PM

#

val

obsidian quest Mar 19, 2025, 2:55 PM

#

how about train loss

dusty skiff Mar 19, 2025, 2:55 PM

#

same story

#

byte-level tokenizer btw

obsidian quest Mar 19, 2025, 2:57 PM

#

got train loss curve comparison?

obsidian quest Mar 19, 2025, 2:57 PM

#

gusty condor This function is extremely obfuscated, but the main purpose is: - initialize dow...

@dusty skiff please do these first

dusty skiff Mar 19, 2025, 2:59 PM

#

yeah I have to figure out the code, or maybe you've got some idea how I can modify this

    def _initialize_weights(self, module: nn.Module):
        if getattr(module, "_is_hf_initialized", False):
            return
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight, gain=2 ** -2.5)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        if isinstance(module, nn.Parameter):
            nn.init.xavier_uniform_(module, gain=2 ** -2.5)
        module._is_hf_initialized = True

dusty skiff Mar 19, 2025, 3:00 PM

#

obsidian quest got train loss curve comparison?

sorry for the mess haha

dusty skiff Mar 19, 2025, 3:28 PM

#

should I use this?

            # !!! initialize if you are using RWKV_Tmix_x070 in your code !!!
            # self.receptance.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.key.weight.data.uniform_(-0.05/(C**0.5), 0.05/(C**0.5))
            # self.value.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.output.weight.data.zero_()

#

I'm lost lol

#

idk I have this and there's little difference in loss

fresh mulch Mar 19, 2025, 3:51 PM

#

dusty skiff should I use this? ```python # !!! initialize if you are using RWKV_...

yeah uncomment this block and there's another block you need to uncomment

#

ctrl+f for the first comment line

obsidian quest Mar 19, 2025, 3:55 PM

#

dusty skiff idk I have this and there's little difference in loss

dont use if name == 'xxx'
use if 'xxx' in name

#

and print all names, and print() sth inside if, to make sure these ifs are called
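
A small sketch of why the substring test matters, assuming parameter names shaped like what PyTorch's `model.named_parameters()` yields for nested modules. The names and the `zero_matching` helper are hypothetical; plain NumPy arrays stand in for the tensors so the example runs on its own.

```python
import numpy as np

# Hypothetical parameter names with the usual nested-module prefixes.
params = {
    "blocks.0.att.receptance.weight": np.ones((4, 4)),
    "blocks.0.att.output.weight": np.ones((4, 4)),
    "blocks.1.att.output.weight": np.ones((4, 4)),
}

# Print every name first, to see what the patterns must match.
for name in params:
    print("param:", name)

def zero_matching(params, pattern, exact):
    """Zero every parameter whose name matches; return the names that were hit."""
    hit = []
    for name, value in params.items():
        matched = (name == pattern) if exact else (pattern in name)
        if matched:
            value[...] = 0.0
            hit.append(name)
            print(f"init applied to {name}")  # confirms the branch actually ran
    return hit

# Exact comparison never fires because real names carry a "blocks.N." prefix.
exact_hits = zero_matching(params, "att.output.weight", exact=True)
# Substring comparison hits the output projection in every block.
substring_hits = zero_matching(params, "att.output.weight", exact=False)
print("exact:", exact_hits)
print("substring:", substring_hits)
```

If the exact-match list comes back empty, the custom initialization silently never ran, which would be consistent with seeing little difference in loss.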

misty igloo Mar 19, 2025, 3:57 PM

#

@dusty skiff let's move this out of the paper writing channel and into the rwkv discord or rwkv channel here

obsidian quest Mar 19, 2025, 3:57 PM

#

you should see dramatically better loss after these

#

#rwkv

misty igloo Mar 19, 2025, 3:57 PM

#

but generally speaking, wrt to the papers, we really need to provide an easy to use training code (FLA?) with proper inits

#

or else everyone will have this experience

#

the RWKV-Block repo could become this, but someone needs to devote time to making sure it's really perfect first

#

and that someone will not be me 🤣

#

I think improving the FLA code specifically is important, since that's probably what people will try first
@gusty condor I don't know if you have time to help fix that but it'd be great if you do

#

I currently copied the FLA models to the official RWKV HF, so it's the default implementation right now that people find

obsidian quest Mar 19, 2025, 4:02 PM

#

misty igloo I currently copied the FLA models to the official RWKV HF, so it's the default i...

does it have correct initialization

misty igloo Mar 19, 2025, 4:06 PM

#

obsidian quest does it have correct initialization

nope, but @random granite expressed interest in getting it to

#

I think the problem is their setup for all the FLA models isn't currently well suited towards special inits and needs some changes to support that (to be fair, our code for that is a horrible mess)

#

@sonic horizon When do you expect to have the full AudioRWKV experimental results and additional baselines ready? Please keep in mind that the final COLM submission date is about a week away.

#

If featured, just know that it will almost certainly end up in an appendix for that paper. We are at an extreme premium for space, as the entire paper must fit in 9 pages.

young sparrow Mar 19, 2025, 4:52 PM

#

Great work on the paper everyone 🙂

sonic horizon Mar 19, 2025, 5:11 PM

#

misty igloo <@1153096857355042837> When do you expect to have the full AudioRWKV experimenta...

We are training a larger model to match the parameter scale of the new baselines. Since the audio embedding training starts from scratch, it takes several days to complete. Of course we will keep the COLM deadline in mind.

obsidian quest Mar 19, 2025, 9:43 PM

#

https://x.com/leloykun/status/1902390779747955045

leloy! (@leloykun) on X

Correct me if I'm wrong, but shouldn't the product in Equation 18 be on the right side?

misty igloo Mar 19, 2025, 11:25 PM

#

obsidian quest https://x.com/leloykun/status/1902390779747955045

uhm looks like yes? (I hate the notation we used)

#

will update the paper

gusty condor Mar 20, 2025, 2:14 AM

#

Yes

gusty condor Mar 20, 2025, 3:09 AM

#

misty igloo I think improving the FLA code specifically is important, since that's probably ...

The main problem is that FLA's initialization conflicts with RWKV-LM's initialization. If some layers are initialized the RWKV-LM way and others are handled by FLA, the model can't train properly.

obsidian quest Mar 20, 2025, 11:59 AM

#

i added a table to https://github.com/BlinkDL/RWKV-LM

let's fill in all details for a version in paper

obsidian quest Mar 20, 2025, 12:18 PM

#

Todo:

#1103039376184852622 message
#1103039376184852622 message
#1103039376184852622 message
#1103039376184852622 message

keen tartan Mar 20, 2025, 12:38 PM

#

obsidian quest great work 🙂 please test mamba for <|endoftext|> effect as i predict it will be...

I ran a quick experiment to test Mamba2 with and without add_bos_token flag and found no difference in accuracy and no significant difference in perplexity.
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m
https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-370m

Perhaps the lm_eval's add_bos_token option is buggy for Mamba models as well and did not actually add it.

#

Gonna try again with installing from GitHub directly.

young sparrow Mar 20, 2025, 12:53 PM

#

obsidian quest i added a table to https://github.com/BlinkDL/RWKV-LM let's fill in all details...

I don't see what value adding this to the paper has

keen tartan Mar 20, 2025, 1:00 PM

#

young sparrow I don't see what value adding this to the paper has

I think it is meant for the RWKV-LM repo so that users will not run into issues with wrong initialization when attempting to train models from scratch as the table provides sane recommendations for implementations.

#

@obsidian quest You were right! 43.9% versus 43.5% in accuracy and 16.8 versus 17.1 in perplexity.

No bos:

|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4392|±  |0.0069|
|              |       |none  |     0|perplexity|↓  |16.8289|±  |0.5443|

With bos:

|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.4353|±  |0.0069|
|              |       |none  |     0|perplexity|↓  |17.1491|±  |0.5539|

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m/0.4.8

gusty condor Mar 20, 2025, 1:28 PM

#

keen tartan <@870137517020688415> You were right! 43.9% versus 43.5% in accuracy and 16.8 ve...

No significant difference. Why right?
RWKV-7 has better scores with BOS.

keen tartan Mar 20, 2025, 1:32 PM

#

gusty condor No significant difference. Why right? RWKV-7 has better scores with BOS.

The effect although small seems to be the reverse of the one for RWKV: no bos is preferred over bos for Mamba2.

gusty condor Mar 20, 2025, 1:35 PM

#

test RWKV-G1 too: that is very significant.

keen tartan Mar 20, 2025, 1:35 PM

#

gusty condor test RWKV-G1 too: that is very significant.

I did already.

#

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/rwkv7-g1-0.1b-20250307-ctx4096

#

Might need to organize it better.

young sparrow Mar 20, 2025, 1:36 PM

#

keen tartan The effect although small seems to be the reverse of the one for RWKV: no bos is...

The difference of scores there is within the margin of error. The correct conclusion is that there isn't a meaningful preference.
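A quick way to sanity-check the "within the margin of error" reading is a z-score on the difference of the two reported values. This is only a sketch: it treats the two runs as independent, which is an assumption (they share the same test set, so a paired test would be more appropriate), and the helper name `z_score` is made up for illustration.

```python
import math

def z_score(a, se_a, b, se_b):
    """Difference of two estimates divided by the standard error of that difference.

    Assumes the two estimates are independent.
    """
    return (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Values copied from the two Mamba2 lambada_openai tables above (no bos vs. with bos).
z_acc = z_score(0.4392, 0.0069, 0.4353, 0.0069)
z_ppl = z_score(17.1491, 0.5539, 16.8289, 0.5443)

print("accuracy   z = %.2f" % z_acc)
print("perplexity z = %.2f" % z_ppl)
# |z| well below about 1.96 means the gap is not significant at the usual 5% level.
```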

keen tartan Mar 20, 2025, 1:37 PM

#

young sparrow The difference of scores there is within the margin of error. The correct conclu...

Yeah.

keen tartan Mar 20, 2025, 1:53 PM

#

It is a wild-goose chase (pun intended). ^^

obsidian quest Mar 20, 2025, 2:00 PM

#

keen tartan <@870137517020688415> You were right! 43.9% versus 43.5% in accuracy and 16.8 ve...

please test those 142 special problems

keen tartan Mar 20, 2025, 2:19 PM

#

obsidian quest please test those 142 special problems

Which 142 special problems?

#

Perhaps the sample size is too small. Testing on more evaluation tasks might reveal a statistically significant difference.

keen tartan Mar 20, 2025, 2:26 PM

#

obsidian quest please test those 142 special problems

Ah I see what you mean.

#

You mean the 142 examples where the answer is the first token as identified by @gusty condor. Gonna check those.

gusty condor Mar 20, 2025, 3:15 PM

#

📎 lambada_sub142.jsonl

misty igloo Mar 20, 2025, 3:23 PM

#

obsidian quest pls add an alternative version diag(w)(I - ckak) where one is free to use c=2 (t...

where do people want to put this? in an appendix on alternate designs? we're still working on fitting the paper into 9 pages for COLM so adding more to the main paper is probably infeasible, but maybe we could add this into Table 1

misty igloo Mar 20, 2025, 3:44 PM

#

@obsidian quest do you have a name for this formula? RWKV7-alt? lol

#

(Also, did you try it for language modeling? I was always hoping we'd move the w outside of the evolution formula, if you recall!)

obsidian quest Mar 20, 2025, 3:55 PM

#

misty igloo where do people want to put this? in an appendix on alternate designs? we're sti...

ok lets call it RWKV-7a

obsidian quest Mar 20, 2025, 3:55 PM

#

misty igloo (Also, did you try it for language modeling? I was always hoping we'd move the w...

it causes more NaNs 😂

gusty condor Mar 20, 2025, 5:01 PM

#

Did you ever capture a replicable NaN?

keen tartan Mar 20, 2025, 5:37 PM

#

Created custom task for special 142 samples of lambada-openai and tested Mamba2 again. No significant effect it seems:

no bos:

|   Tasks    |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada-142 |      1|none  |     0|acc       |↑  | 0.4085|±  |0.0414|
|            |       |none  |     0|perplexity|↓  |14.5471|±  |2.6646|

add_bos:

|   Tasks    |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada-142 |      1|none  |     0|acc       |↑  | 0.4085|±  |0.0414|
|            |       |none  |     0|perplexity|↓  |14.1444|±  |2.6250|

I basically used the following code, but tested Mamba2 rather than SmolLM2: https://colab.research.google.com/drive/1nle-APaWJ12uA-WS9zgLEqAmY6cGWRHm?usp=sharing
I may have missed something perhaps.
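The wide error bar on the 142-example subset follows directly from the sample size. The sketch below reproduces both reported accuracy stderr values under two assumptions that are not stated in the thread: that the stderr is the sample standard deviation of 0/1 outcomes divided by sqrt(n), and that the full lambada_openai test set has 5153 examples. The count of 58 correct answers is inferred from 0.4085 × 142.

```python
import math

def acc_stderr(p, n):
    """Standard error of the mean of n 0/1 outcomes, using the sample (n - 1) variance."""
    return math.sqrt(p * (1.0 - p) / (n - 1))

subset_se = round(acc_stderr(58 / 142, 142), 4)   # 142-example subset
full_se = round(acc_stderr(0.4392, 5153), 4)      # full set; n = 5153 is an assumption

print("lambada-142 stderr:", subset_se)
print("lambada_openai stderr:", full_se)
```

With a stderr of about 0.04, the subset can only detect accuracy gaps of roughly 8 points or more, so identical accuracy with and without bos is not strong evidence either way.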

Google Colab

#

https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval/state-spaces__mamba2-130m/0.4.8

gusty condor Mar 20, 2025, 11:59 PM

#

That's perfectly normal for mamba 2.

brisk bronze Mar 22, 2025, 11:38 PM

#

we're now #1 on weekly papers on HF too

gusty condor Mar 23, 2025, 3:34 AM

#

https://huggingface.co/papers/month/2025-03 11 votes to be #2 of the month

Daily Papers - Hugging Face

sonic horizon Mar 23, 2025, 2:28 PM

#

@misty igloo Hi, sorry to keep you waiting. I have updated the final experiment results for audio modeling and finished writing the AudioRWKV part, in \paragraph{RWKV for Audio Modeling}, RWKV-7 (preliminary).

#

If there isn't enough room in the main text, we could include it in the appendix. However, I'm not sure which section would be the most appropriate — could you advise?

misty igloo Mar 23, 2025, 2:30 PM

#

sonic horizon <@1007072846960410685> Hi, sorry to keep you waiting. I have updated the final ...

Awesome! Those results look great!

misty igloo Mar 23, 2025, 2:31 PM

#

sonic horizon If there isn't enough room in the main text, we could include it in the appendix...

For COLM I will definitely have to put it in the appendix. But it might be okay to leave it in the main paper for the Arxiv version - I'll take a look and move it if needed.

#

Are DeepRes and HST-AT the transformer based architectures?

#

I'm unclear on the difference between HST-AT and HST-AT pretrained

#

Also, could you explain what is meant by

Note that we did not use the ensemble trick in this experiment, resulting in a slight drop in performance compared with results reported in \citet{rwkv6_colm}.
I'm not sure I understand what the ensemble trick is or what results you're saying were in the rwkv5/6 paper

sonic horizon Mar 23, 2025, 2:37 PM

#

misty igloo Are DeepRes and HST-AT the transformer based architectures?

DeepRes is based on a Deep Residual Network, and HST-AT is transformer based. HST-AT pretrained means its weights are initialized from pretrained vision models, which is a commonly used trick to improve performance.

#

For the ensemble trick, it provides a bigger ensemble result by using models with different patch settings. We used it in the audio modeling section in RWKV6.

obsidian quest Mar 23, 2025, 2:40 PM

#

how many heads are you using? we need at least 5 heads for single layer rwkv7 to solve S5. so can use multiple small heads @crystal hull

p.s. note i chose v^T k instead of k^T v because it fits the L2 loss

sonic horizon Mar 23, 2025, 2:40 PM

#

I can add these details in the writing.

misty igloo Mar 23, 2025, 2:41 PM

#

obsidian quest how many heads are you using? we need at least 5 heads for single layer rwkv7 to...

I already added the RWKV7a formula to Table 1, not sure where else we should mention it
The proofs do currently show a version with c=2, and then show how to remove it while maintaining proof correctness

obsidian quest Mar 23, 2025, 2:42 PM

#

misty igloo I already added the RWKV7a formula to Table 1, not sure where else we should men...

please mention RWKV-7a is found to be useful for othello @iron parrot

misty igloo Mar 23, 2025, 2:48 PM

#

sonic horizon For the ensemble trick, it provides a bigger ensemble result by using models wi...

I'm a little confused bc I don't remember an audio section in the Eagle Finch paper, and can't find it in there now either?
(If it's not in that paper, maybe you could cite your repo for those results instead of the colm rwkv6 paper?)

misty igloo Mar 23, 2025, 2:49 PM

#

sonic horizon I can add these details in the writing .

Great! Thanks

sonic horizon Mar 23, 2025, 2:56 PM

#

misty igloo I'm a little confused bc I don't remember an audio section in the Eagle Finch pa...

I just checked and found that the results are in Arxiv v4 of Eagle Finch paper. If this version hasn’t been widely shared, I can remove the part of 'comparing with RWKV6' so as not to confuse the reader.

misty igloo Mar 23, 2025, 3:01 PM

#

sonic horizon I just checked and found that the results are in Arxiv v4 of Eagle Finch paper. ...

oh you could just cite the arxiv paper

#

but this thing where the results are worse isn't good

misty igloo Mar 23, 2025, 3:03 PM

#

sonic horizon For the ensemble trick, it provides a bigger ensemble result by using models wi...

Could you train the RWKV7 version with the ensemble trick? It's important to show that it does better than the RWKV6 version, and will presumably improve your results vs the other models

sonic horizon Mar 23, 2025, 3:12 PM

#

misty igloo Could you train the RWKV7 version with the ensemble trick? It's important to sho...

Yes, we can do this. But it may take some more time.

misty igloo Mar 23, 2025, 3:15 PM

#

sonic horizon Yes , we can do this. But it may take some more time.

I guess the other option would be to train the v6 without the ensemble trick, to show that RWKV7 is an improvement
Not sure if that's faster, but of course the better result is preferable

#

I think it's quite important to show some apples to apples comparison with v6 though

#

Otherwise you haven't really demonstrated anything about v7, which is the goal of putting this into the paper

sonic horizon Mar 23, 2025, 3:17 PM

#

misty igloo I guess the other option would be to train the v6 without the ensemble trick, to...

That's a good idea. We can train RWKV6 without the trick.

gusty condor Mar 23, 2025, 3:25 PM

#

No tricks in training RWKV7 please, or use tricks for both.

gusty condor Mar 24, 2025, 4:06 PM

#

Time to work for COLM submission!

misty igloo Mar 24, 2025, 4:21 PM

#

gusty condor Time to work for COLM submission!

@tropic minnow and I already have it mostly done

misty igloo Mar 24, 2025, 4:40 PM

#

sorry, posted the wrong paper link a moment ago... will get one asap

fresh mulch Mar 24, 2025, 4:59 PM

#

fair point about variable definitions, though this notation is standard isn't it

#

i guess we still ought to define our terms before using them

fresh mulch Mar 24, 2025, 5:21 PM

#

also is one of these $\kappa_t$ supposed to be $\hat{\kappa}_t$

silent urchinBOT Mar 24, 2025, 5:21 PM

#

Christian Azinn

misty igloo Mar 24, 2025, 5:47 PM

#

fresh mulch i guess we still ought to define our terms before using them

I have updated the paper accordingly.

misty igloo Mar 24, 2025, 5:47 PM

#

fresh mulch also is one of these $\kappa_t$ supposed to be $\hat{\kappa}_t$

doesn't really change anything to list it one way or the other.. its just a normalized version

tropic minnow Mar 24, 2025, 7:07 PM

#

view colm paper https://www.overleaf.com/read/vhrvqrmmztgj#d06246

#

this plot (right side) might benefit from nonoverlapping text with grid, and making rwkv more orange, less yellow

young sparrow Mar 24, 2025, 7:14 PM

#

Omitting Mamba and RWKV-Pile on the left looks weird at a glance. I know it's because of the minimal multilingual content in the pile, but you should explicitly say that in the caption so someone who glances at the plot has that context. If there are numbers for those models, I would recommend including them even if they're bad tbh. Most plots should be optimized to be easily digestible at a glance / to people skimming

fresh mulch Mar 24, 2025, 7:15 PM

#

tropic minnow this plot (right side) might benefit to nonoverlapping text with grid, and maki...

"nonoverlapping text with grid" ie the rwkv7-pile caption?

fresh mulch Mar 24, 2025, 7:15 PM

#

young sparrow Omitting Mamba and RWKV-Pile on the left looks weird at a glance. I know it's be...

caption it is, unless we have numbers @misty igloo (i forget who did evals)

misty igloo Mar 24, 2025, 7:18 PM

#

fresh mulch caption it is, unless we have numbers <@1007072846960410685> (i forget who did e...

We don't have those numbers

#

If you make edits, please do so only on the arxiv version

#

I will port them to the COLM document after validating the final choices made

#

otherwise it becomes really hard to track what changed

fresh mulch Mar 24, 2025, 7:20 PM

#

makes sense. i also need to change axis titles and fix alignment. will do in an hour or so

misty igloo Mar 24, 2025, 7:40 PM

#

Woohoo! I finally got it all to fit in 9 pages with all the figures and tables we need.

obsidian quest Mar 25, 2025, 8:01 AM

#

"However, training for multi-query associative recall (MQAR) is highly unstable and strongly dependent on initialization and hyperparameter settings
some guy read this and says RWKV7 is bad at MQAR, so that's why we don't provide an MQAR chart 😂

#

so let's add chart for this

#

in this style (show 1024 & 2048 if possible)

gusty condor Mar 25, 2025, 3:36 PM

#

obsidian quest ```"However, training for multi-query associative recall (MQAR) is highly unstab...

This is proved by some paper #1103039376184852622 message

#

I just want to avoid suppressing the baselines for other models, as happened in the xLSTM paper. The default initialization of MQAR is clearly suboptimal for RWKV-7 and a few other models, but without knowing their correct initialization and implementation I decided not to put them in at all.

obsidian quest Mar 25, 2025, 3:44 PM

#

let's simply give all models better initialization

gusty condor Mar 25, 2025, 3:48 PM

#

and lr too (I used transferable lr https://arxiv.org/abs/2407.05872 for RWKV-7 based on observations so I didn't sweep on the whole LR interval)

arXiv.org

Scaling Exponents Across Parameterizations and Optimizers

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parame...

misty igloo Mar 25, 2025, 9:41 PM

#

@iron parrot did you use RWKV7a with c=2 in the Othello experiments? I added a couple of sentences there - please review and expand on whether or not this was what the code did.

iron parrot Mar 26, 2025, 6:17 AM

#

misty igloo <@701460149134688386> did you use RWKV7a with c=2 in the Othello experiments? I ...

I checked and it's accurate. I tried both c=1 and c=2, I can include the loss curve comparison in the paper if needed

dawn pewter Mar 26, 2025, 11:28 AM

#

From the results in Appendix C, c=1.545239211892605 ( 1+exp(-exp(-0.5)) ) is the maximum value of c that ensures stability.
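The quoted constant can be checked directly from the closed form given in the message. This is only an arithmetic check of 1 + exp(-exp(-0.5)); the variable name `c_max` is chosen here for illustration, and the stability argument itself lives in Appendix C of the paper.

```python
import math

# 1 + exp(-exp(-0.5)), the closed form quoted in the message above.
c_max = 1.0 + math.exp(-math.exp(-0.5))
print(c_max)
```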

misty igloo Mar 26, 2025, 4:00 PM

#

Everyone, please read through the COLM paper https://www.overleaf.com/read/vhrvqrmmztgj#d06246
and let us know if there is anything that's wrong

Overleaf, Online LaTeX Editor

An online LaTeX editor that’s easy to use. No installation, real-time collaboration, version control, hundreds of LaTeX templates, and more.

obsidian quest Mar 26, 2025, 4:02 PM

#

gusty condor This is proved by some paper https://discord.com/channels/729741769192767510/110...

However, training for multi-query associative recall (MQAR) is highly unstable and strongly dependent on initialization and hyperparameter settings. We observe significant variability in performance under identical configurations across different studies
this is not true for rwkv so we shouldnt mention it, otherwise people think it's rwkv issue

#

lets fix table 10

misty igloo Mar 26, 2025, 4:04 PM

#

obsidian quest lets fix table 10

this is not in the COLM paper yet

#

but I will fix it now

misty igloo Mar 26, 2025, 4:04 PM

#

obsidian quest ```However, training for multi-query associative recall (MQAR) is highly unstabl...

I have now removed this paragraph

misty igloo Mar 26, 2025, 4:41 PM

#

obsidian quest lets fix table 10

Fixed and added to COLM Appendix

obsidian quest Mar 26, 2025, 4:44 PM

#

misty igloo Fixed and added to COLM Appendix

got link for COLM version? 🙂

misty igloo Mar 26, 2025, 4:51 PM

#

obsidian quest got link for COLM version? 🙂

https://www.overleaf.com/read/vhrvqrmmztgj#d06246

willow condor Mar 27, 2025, 2:46 AM

#

Typo: a product a product of elementary transition matrices

Screenshot_2025-03-27_at_10.46.38_AM.png

misty igloo Mar 27, 2025, 3:20 AM

#

these proofs are undergoing revisions right now, and are probably the last thing that will change before I publish the final COLM version

misty igloo Mar 27, 2025, 8:11 PM

#

Proof revisions integrated and COLM and ArXiV versions submitted.

#

(Updated ArXiV version supposedly going out March 31)

willow condor Mar 29, 2025, 4:49 AM

#

RWKV 7 can be made Turing Complete using permutation matrices and state dependent (not just data dependent) transition matrices.

I think the next RWKV should include matrices that aren't just diagonal but rather subdiagonal etc., which would reduce parallelizability for maximal expressivity. End the war with "DeltaFunction"s.

willow condor Mar 29, 2025, 9:46 AM

#

To expand, I mean explicitly give RWKV a way to simulate cellular automata in a continuous, differentiable way. For example, the formula for calculating Rule 110 (Turing-complete) is state + (state @ right) - state * (state @ right) * (1 + (state @ left)) where left, right are the last dimension left and right shifted versions of state (equivalent to multiplying by a subdiagonal matrix or a superdiagonal matrix)

#

Rule 110, when the state and everything else is bound between (0, 1), displays interesting converging properties where in 3D it converges to 1/phi for all coordinates, while if instead of right shifting or left shifting and treating the edges as constants a or b, a acts like the learning rate and b as the point which the rule converges to. See this Desmos graph if interested:

https://www.desmos.com/3d/mhlntzgruo
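The polynomial in the messages above can be checked against the Rule 110 lookup table, and the 1/phi fixed point can be reproduced for the simplest case. A minimal sketch in plain Python: the function names are made up, wrap-around boundaries are an assumption, and the convergence check only covers a uniform 3-cell state started at 0.5, not the general behavior shown in the Desmos graph.

```python
import itertools
import math

def rule110_cell(left, center, right):
    """Polynomial form from the message: center + right - center * right * (1 + left)."""
    return center + right - center * right * (1.0 + left)

# Rule 110 lookup: bit i of the number 110 is the output for neighborhood
# (left, center, right) with i = 4 * left + 2 * center + right.
for p, q, r in itertools.product((0, 1), repeat=3):
    expected = (110 >> (4 * p + 2 * q + r)) & 1
    assert rule110_cell(p, q, r) == expected

def step(state):
    """One update of the whole row with periodic (wrap-around) boundaries."""
    n = len(state)
    return [rule110_cell(state[i - 1], state[i], state[(i + 1) % n]) for i in range(n)]

# Continuous relaxation: iterate a uniform 3-cell state with values in (0, 1).
state = [0.5, 0.5, 0.5]
for _ in range(200):
    state = step(state)

inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi, the positive root of x^2 + x - 1 = 0
print(state, inv_phi)
```

For a uniform state the update reduces to x -> 2x - x^2 - x^3, whose nonzero fixed point in (0, 1) satisfies x^2 + x - 1 = 0, which is where 1/phi comes from.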

obsidian quest Mar 29, 2025, 11:22 AM

#

nobody complains about transformer expressiveness 😂 we should improve rwkv's memory first

crystal hull Mar 29, 2025, 11:38 AM

#

https://x.com/julien_siems/status/1905628609714286687

Julien Siems (@julien_siems) on X

1/9 There is a fundamental tradeoff between parallelizability and expressivity of Large Language Models. We propose a new linear RNN architecture, DeltaProduct, that can effectively navigate this tradeoff. Here's how!

gusty condor Mar 29, 2025, 12:14 PM

#

I analyzed the download data (only counting non-quantized models), and the results are roughly as follows:

| Organization          |  Downloads |  Likes |
|-----------------------|-----------:|-------:|
| meta-llama            | 26,369,349 | 41,742 |
| Qwen                  | 21,092,745 | 25,817 |
| deepseek-ai           | 12,927,530 | 36,137 |
| HuggingfaceTB         |  2,439,032 |  3,107 |
| RWKV (including FLA)  |     70,705 |    537 |

Vision-Language Models (VLMs) are very popular. The top models for both Qwen and HuggingfaceTB are VLMs.
For Qwen, Llama, and RWKV, their most popular models are all 7B-sized.
Based on this data, RWKV should release a 7B model as soon as possible.

misty igloo Mar 29, 2025, 1:23 PM

#

This is why I've been doing the conversions. I have a 7B model distilled from Qwen 72B that we can release this week with the arxiv version of the RADLADS conversion paper.

#

If people want to look at the RADLADS (QRWKV) paper before I put it on arxiv, here's a link: https://www.overleaf.com/read/ytntsmbjwtdr#8bd0d4

Overleaf, Online LaTeX Editor

An online LaTeX editor that’s easy to use. No installation, real-time collaboration, version control, hundreds of LaTeX templates, and more.

willow condor Mar 29, 2025, 1:38 PM

#

obsidian quest nobody complains about transformer expressiveness 😂 we should improve rwkv's m...

any thoughts on making the weight matrix non-linear, by test time training an mlp instead?

#

maybe a large memory bank which is sparsely activated? im sure these ideas have come up before

remote elbow Mar 29, 2025, 1:53 PM

#

willow condor maybe a large memory bank which is sparsely activated? im sure these ideas have ...

sounds like pkm https://arxiv.org/abs/1907.05242

arXiv.org

Large Memory Layers with Product Keys

This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead. Its design and access pattern is based on product keys, which enable fast and exac...

willow condor Mar 29, 2025, 2:17 PM

#

remote elbow sounds like pkm https://arxiv.org/abs/1907.05242

similar to the idea they describe. the keys should be read/write accessible for RWKV 7 somehow to aid the state in storing intermediates etc., which is likely a direction for improvement?

remote elbow Mar 29, 2025, 3:15 PM

#

willow condor similar to the idea they describe. the keys should be read/write accessible for ...

What if you had two states and did product keys on that
pkm is

# mostly copied from https://github.com/facebookresearch/memory/blob/main/lingua/product_key/memory.py but I removed some stuff for simplicity
import torch
import torch.nn.functional as F

def pkm(q, keys1, keys2, topk, values):
    # assumed shapes: q (bs, dim); keys1, keys2 (nkeys, dim // 2); values (nkeys ** 2, vdim)
    bs = q.shape[0]
    nkeys = keys1.shape[0]
    q1, q2 = q.chunk(2, dim=-1)
    # score each half-query against its own key set and keep the topk candidates
    scores1, indices1 = torch.topk(q1 @ keys1.mT, topk, dim=-1)  # (bs, topk)
    scores2, indices2 = torch.topk(q2 @ keys2.mT, topk, dim=-1)  # (bs, topk)
    # cartesian product on best candidate keys
    all_scores = (
        scores1.view(bs, topk, 1).expand(bs, topk, topk)
        + scores2.view(bs, 1, topk).expand(bs, topk, topk)
    ).view(bs, -1)  # (bs, topk ** 2)
    all_indices = (
        indices1.view(bs, topk, 1).expand(bs, topk, topk) * nkeys
        + indices2.view(bs, 1, topk).expand(bs, topk, topk)
    ).view(bs, -1)  # (bs, topk ** 2)

    # select overall best scores and indices (dim=1, since both tensors are 2D)
    scores, best_indices = torch.topk(
        all_scores, k=topk, dim=1, largest=True, sorted=True
    )  # (bs, topk)
    indices = all_indices.gather(1, best_indices)  # (bs, topk)
    # F.embedding_bag takes the indices first, then the weight table;
    # per_sample_weights is only supported with mode="sum"
    return F.embedding_bag(indices, values, per_sample_weights=scores, mode="sum")

rwkv7 handles the state S like

# from https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v7/rwkv_v7_numpy.py
...
S = S * w.mT - S @ kk * (kk*a).mT + v * k.mT
y = S @ r
...

maybe you could do

S, ind = pkm(x, k1, k2, topk, bigS)
bigS[ind] = S * w.mT - S @ kk * (kk*a).mT + v * k.mT
y = S @ r

#

it's probably not correct as is but something similar maybe

obsidian quest Mar 30, 2025, 3:29 AM

#

obsidian quest Mar 30, 2025, 3:31 AM

#

remote elbow What if you had two states and did product keys on that pkm is ```py # mostly c...

@sly agate tried state MoE, and prefix+suffix state tuning

willow condor Mar 30, 2025, 3:46 AM

#

What about TTT mlps? Titans tried that I think and they had good results. Intuitively sparse activations would prevent catastrophic forgetting as the gradients simply wouldn't propagate to irrelevant information.

sly agate Mar 30, 2025, 4:29 AM

#

I tried pseudo State MoE. Fixed gating. Suffix tuning works well for multi-turn QA

misty igloo Mar 30, 2025, 1:35 PM

#

obsidian quest

fixed (in the manuscript)

#

I'll re-export for arxiv now

remote elbow Mar 30, 2025, 5:44 PM

#

sly agate I tried pseudo State MoE. Fixed gating. Suffix tuning works well for multi-turn ...

does pseudo state moe work?

sly agate Mar 31, 2025, 1:47 AM

#

remote elbow does pseudo state moe work?

My method is an attempt to use multiple trained states(Prefix + Suffix) simultaneously during inference.

So it is not MoE (that's why I call it pseudo-MoE).

It works for my purposes (characterization, knowledge, agent).

By adding routers, we can achieve state sparsity, which may bring us closer to State-MoE.

I previously experimented with the non-state MLPSparse MoE on LoRA.

v7 0.4B(World v2.9) + Router + 4MLPLoRA(r=256) = 0.6B
Due to the dynamic LoRA merge, there were problems with the inference speed, but as a benchmark (Japanese), it improved slightly.
The basic design of MoE is based on Flock of Finches, and the HashRouter has been removed.

obsidian quest Mar 31, 2025, 1:48 AM

#

pls add these to paper appendix

sly agate Mar 31, 2025, 2:07 AM

#

Thanks to FLA, RWKV v6 and v7 can perform 384 batch inferences on a single RTX4090. This means that there is almost no degradation in inference speed even when inferring multiple states simultaneously.

@obsidian quest about multiple-State-inference? or Prefix + Suffix Tuning?
Multiple state inference is experimental and cannot be guaranteed to be mathematically correct. (But the implementation is simple.)

obsidian quest Mar 31, 2025, 3:12 AM

#

add all as experiments 🙂

willow condor Mar 31, 2025, 4:51 AM

#

I have an idea. What if, we had an external memory that is separate from the state but which can only be read from in a way that automatically changes it? This is more similar to how human memory works where recalling a fact increases its strength, and would allow for better parallelization.

k = key generated from state
v = expected value generated from state

return dot(memory @ k, v) * v to state
memory += k^Tv

or something more generalized.
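One possible reading of the pseudocode above, as a sketch in plain Python. The memory layout `(dim_v, dim_k)`, the update `memory += outer(v, k)`, and all function names are assumptions made so that `memory @ k` type-checks; the original message does not pin these down.

```python
def matvec(M, x):
    """Matrix times vector: M @ x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def read_and_reinforce(memory, k, v):
    """Read with key k, gate by agreement with the expected value v,
    then strengthen the k -> v association in place."""
    recalled = matvec(memory, k)           # memory @ k
    strength = dot(recalled, v)            # dot(memory @ k, v)
    out = [strength * vi for vi in v]      # what gets returned to the state
    for i, vi in enumerate(v):             # memory += outer(v, k)
        for j, kj in enumerate(k):
            memory[i][j] += vi * kj
    return out

memory = [[0.0, 0.0], [0.0, 0.0]]          # shape (dim_v, dim_k), an assumed layout
k, v = [1.0, 0.0], [0.0, 2.0]
first = read_and_reinforce(memory, k, v)   # nothing stored yet, so the read is zero
second = read_and_reinforce(memory, k, v)  # the first read wrote k -> v, so this one recalls it
print(first, second)
```

Every read also writes, so repeated recall of the same pair makes the returned signal grow. A real design would presumably need some decay or normalization to keep this bounded.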

misty igloo Apr 1, 2025, 3:02 AM

#

Updated arxiv paper is live.

tough crane Apr 1, 2025, 8:53 AM

#

misty igloo If people want to look at the RADLADS (QRWKV) paper before I put it on arxiv, he...

Thank you for your excellent drafts. Is there a follow up plan to extend this method to multi-lingual and/or math reasoning to study further applications?

@misty igloo

misty igloo Apr 1, 2025, 11:18 AM

#

tough crane Thank you for your excellent drafts. Is there a follow up plan to extend this me...

We have a converted QwQ model, but I haven't tried it specifically with multi-lingual or math! You could see how the QwQ model works - it's available on our featherless.ai platform or at https://huggingface.co/featherless-ai/Qwerky-QwQ-32B

#

As mentioned in the paper, I generally found that post-training with a different dataset resulted in a 'confused' model. But maybe there are workarounds for this that could be discovered.

tough crane Apr 1, 2025, 4:06 PM

#

misty igloo We have a converted QwQ model, but I haven't tried it specifically with multi-li...

is this draft submitted to COLM?

misty igloo Apr 1, 2025, 4:08 PM

#

tough crane is this draft submitted to COLM?

Yeah. I'm just getting all the open source parts together so I can put it on arxiv

gusty condor Apr 1, 2025, 4:15 PM

#

btw please add a link to the RWKV paper for all models in https://huggingface.co/collections/RWKV/rwkv-v7-67d43835efa125006183fece

RWKV v7 - a RWKV Collection

pure pike Apr 2, 2025, 8:48 AM

#

misty igloo Yeah. I'm just getting all the open source parts together so I can put it on arx...

https://huggingface.co/featherless-ai/Qwerky-QwQ-32B/discussions/1 someone is already asking for the source code and data 🤣

featherless-ai/Qwerky-QwQ-32B · Is the source code for this conver...

obsidian quest Apr 2, 2025, 2:08 PM

#

Here is @sly agate's experiment log:
https://docs.google.com/document/d/1sgX-BpM6RYW0eym_ucPN--WTLs7bTODMhho_k7sTl0I/edit?tab=t.bng3px5w2lfb#heading=h.ne4yo6k6bcp1

Google Docs

OpenMOSE(MASAHIRO SHIMOES)'s Experiments

Pseudo-MoE Technique: Introduction to Multi Recurrent State Sampling (MRSS) Abstract This paper introduces Multi Recurrent State Sampling (MRSS), a novel pseudo-Mixture of Experts approach for enhancing inference diversity in recurrent neural network architectures. By strategically combining mu...

#

@misty igloo

misty igloo Apr 2, 2025, 2:45 PM

#

obsidian quest Here are <@1078605512138043403> 's experiment log https://docs.google.com/docume...

I think this is different than state offset tuning. Very interesting that it seems to work well.

# normal matrix-evolution recurrence
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ h
  
# offset tuning recurrence
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ (h + self.offset)

# offset tuning removed from the recurrence (kernel)
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ h
# post-kernel step
y = y + r @ self.offset
 
# OpenMOSE method
for t in range(timesteps):
  h = G @ h + k.mT @ v
  y = r @ h
y = y * (1 + self.time_offset_y)
y = groupnorm(y)
y = self.output(y * g)
# plus another change after the output
y = y * (1 + self.time_offset)
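
A minimal sketch of why the offset can be pulled out of the kernel in the third variant above: the offset never feeds back into `h`, so `r @ (h + offset)` and `r @ h + r @ offset` give the same `y` at every step. Everything here is an illustrative assumption rather than code from any of the repos discussed: the helper names, the tiny dimensions, the random inputs, and a fixed diagonal `G` with decay 0.9. Plain Python lists are used so it runs without any dependencies.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def outer(k, v):
    """Outer product k^T v of two length-d vectors, giving a d x d matrix."""
    return [[ki * vj for vj in v] for ki in k]

def run(rs, ks, vs, G, offset, offset_inside):
    """Run the recurrence, applying the offset inside the readout or as a post-kernel step."""
    d = len(G)
    h = [[0.0] * d for _ in range(d)]
    ys = []
    for r, k, v in zip(rs, ks, vs):
        h = add(matmul(G, h), outer(k, v))             # h = G @ h + k.mT @ v
        if offset_inside:
            y = matmul([r], add(h, offset))            # y = r @ (h + offset)
        else:
            y = matmul([r], h)                         # kernel output: y = r @ h
            y = add(y, matmul([r], offset))            # post-kernel: y + r @ offset
        ys.append(y[0])
    return ys

# Hypothetical toy setup: d channels, T timesteps, fixed diagonal decay G.
random.seed(0)
d, T = 4, 5
rand_vec = lambda: [random.uniform(-1.0, 1.0) for _ in range(d)]
rs, ks, vs = ([rand_vec() for _ in range(T)] for _ in range(3))
G = [[0.9 if i == j else 0.0 for j in range(d)] for i in range(d)]
offset = [rand_vec() for _ in range(d)]

ys_inside = run(rs, ks, vs, G, offset, offset_inside=True)
ys_post = run(rs, ks, vs, G, offset, offset_inside=False)

# The two variants should agree up to floating-point rounding.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for ya, yb in zip(ys_inside, ys_post)
    for a, b in zip(ya, yb)
)
print(same)
```

The OpenMOSE variant is not covered by this identity: it rescales `y` multiplicatively (before the groupnorm and again after the output projection) instead of adding a term that is linear in `r`.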

gusty condor Apr 2, 2025, 3:18 PM

#

misty igloo As mentioned in the paper, I generally found that post-training with a different...

This is why we should open at least a subsample!

misty igloo Apr 2, 2025, 3:19 PM

#

gusty condor This is why we should open at least a subsample!

sorry, this discussion was about RADLADS not RWKV7
So the dataset in question is the unknown Qwen dataset, not RWKV World v3

#

We do provide a world v3 subsample (though it's imperfect); see https://huggingface.co/datasets/RWKV/RWKV-World-Listing

gusty condor Apr 2, 2025, 3:21 PM

#

misty igloo sorry, this discussion was about RADLADS not RWKV7 So the dataset in question is...

I think it's at least 1/3 code + Fineweb + Fineweb-Edu-CN + OpenWebMath + ProofPile-2 + C4

#

(Simulating Qwen pretraining distribution)

#

However, I have no idea about the instruction data (likely proprietary; I heard from a Zhihu user that Qwen and DeepSeek own the same proprietary English instruction dataset)

willow condor Apr 2, 2025, 4:56 PM

#

@obsidian quest have you considered test-time training with something similar to this instead of the delta rule (30%@1 on ARC-AGI with minimal inductive biases): https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html#how-to-derive-our-solution-method

iliao2345

ARC-AGI Without Pretraining

#

i have tried using their arch after replacing all of the multitensor stuff with pure vectors and lobotomizing many softmax cummax layers to generalize it, but it is hard to get the symmetry and weight tying back. it seems like it would perfectly complement rwkv, so i was wondering if you or someone else already knew about this and had tried to incorporate it into in-context gradient descent models

misty igloo Apr 2, 2025, 7:00 PM

#

willow condor @obsidian quest have you considered test time training something similar t...

let's try to keep this channel for paper-related discussion, and use either the eleuther 'rwkv' channel or the rwkv discord for architecture ideas

#

this blogpost did get brought up in the RWKV discord previously

#

happy to discuss there more

willow condor Apr 3, 2025, 12:26 AM

#

misty igloo let's try to keep this channel for paper related discussion, and use either the ...

understood, thanks

dusty skiff Apr 3, 2025, 8:19 PM

#

obsidian quest nobody complains about transformer expressiveness 😂 we should improve rwkv's m...

I do, but even transformers can handle pure ICL tasks, rwkv cannot

#

you won't improve memory, it's just fundamentally impossible with such parallelization. Memory also requires expressiveness, but we won't achieve this without making the models sequential at least to some degree

#

that's why no one will get true length extrapolation with 1 forward pass over 13421512532k tokens bullshit

gusty condor Apr 6, 2025, 12:16 PM

#

RWKV-7 paper just got its first citation: https://arxiv.org/pdf/2503.21614

gusty condor Apr 7, 2025, 6:42 AM

#

Second citation: https://arxiv.org/pdf/2504.03289

obsidian quest Apr 7, 2025, 9:59 AM

#

cleaned rwkv7 training reference code
https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v7/train_temp

GitHub

RWKV-LM/RWKV-v7/train_temp at main · BlinkDL/RWKV-LM

RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". So it'...

gusty condor Apr 7, 2025, 4:17 PM

#

My code is more aggressive:

https://github.com/Triang-jyed-driung/rwkv7mini (completely restructured dataset loading)
https://github.com/Triang-jyed-driung/my-pretrain (pretraining code, applicable for HF-compatible models, including RWKV-7 FLA, and supports pytorch-lightning from 1.9.5 to 2.5.1)

GitHub

GitHub - Triang-jyed-driung/rwkv7mini: RWKV-7 mini
