RWKV-papers | EleutherAI | Page 9

quaint quiver Jan 30, 2025, 8:04 AM

#

There’s also https://openreview.net/forum?id=BGnm7Lo8oW

OpenReview

Towards Learning to Reason at Pre-Training Scale

Prompting a Large Language Model (LLM) to output Chain-of-Thought (CoT) reasoning improves performance on complex problem-solving tasks. Moreover, several popular approaches exist to "self-improve"...

#

Ya I mean if ur gonna post train it I assumed u would put this thinking between the question and answer

#

Also prompt and chat template I’m not sure is too relevant to the idea of latro

#

How do u plan on doing it to pretraining data

#

Where will the CoT be placed

sinful breach Jan 30, 2025, 8:22 AM

#

quaint quiver Where will the CoT be placed

if you're willing to compute 2 forward passes, perhaps where both the loss and the predicted-token entropy, taken over all tokens with probability at least p, are high

#

restrict it to settings where the next token isn't necessarily ambiguous: a small set of choices (at most 1/p), where the model is uncertain among those choices and could perhaps reason about which would be better. At the same time, this might still end up being very wasteful, as you'd be trying to reason about stuff like which of 2 synonyms is a better choice
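A rough sketch of that selection rule, assuming the first forward pass already gives a next-token distribution per position. The function name, the thresholds, and the renormalisation over the plausible set are all assumptions for illustration, not something specified in the chat. Since every kept token has probability at least p and probabilities sum to at most 1, at most 1/p tokens survive the filter.

```python
import math

def cot_candidate_positions(probs_per_position, target_ids, p=0.1,
                            loss_thresh=0.5, entropy_thresh=0.5):
    """Return positions where both the loss and the restricted entropy are high.

    probs_per_position: list of next-token distributions from a first forward pass.
    target_ids: the true next token at each position.
    Thresholds are hypothetical placeholders.
    """
    picked = []
    for pos, (probs, target) in enumerate(zip(probs_per_position, target_ids)):
        loss = -math.log(probs[target])
        # Keep only tokens with probability >= p (there can be at most 1/p of them).
        plausible = [q for q in probs if q >= p]
        if len(plausible) < 2:
            continue  # the model is confident, nothing to deliberate between
        z = sum(plausible)
        # Entropy of the distribution renormalised over the plausible set (assumption).
        entropy = -sum((q / z) * math.log(q / z) for q in plausible)
        if loss > loss_thresh and entropy > entropy_thresh:
            picked.append(pos)
    return picked

example_probs = [
    [0.90, 0.05, 0.05],  # confident and right
    [0.45, 0.45, 0.10],  # torn between two choices
    [0.05, 0.90, 0.05],  # confident but wrong: high loss, one plausible token
    [0.34, 0.33, 0.33],  # torn between three choices
]
example_targets = [0, 1, 0, 2]
selected = cot_candidate_positions(example_probs, example_targets, p=0.2)
```

The second forward pass would then only spend reasoning tokens at the selected positions. The synonym problem mentioned above is not addressed by this filter.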

gusty condor Jan 30, 2025, 9:51 AM

#

quaint quiver Where will the CoT be placed

Anywhere (the model is reinforced to learn where to put its thought)

quaint quiver Jan 30, 2025, 9:51 AM

#

How would u do that efficiently on pretraining data?

#

Just seems like between the question and answer is 99% of the time most optimal

#

Easiest thing to train and model to learn quickly

#

Better user experience

#

And u keep prefill efficiency

gusty condor Jan 30, 2025, 9:54 AM

#

quaint quiver How would u do that efficiently on pretraining data?

Just read my PDF

quaint quiver Jan 30, 2025, 9:54 AM

#

Oh ok didn’t realise lol

quaint quiver Jan 30, 2025, 10:55 AM

#

Hm tbh still don’t see how u would do it efficiently from the pdf

#

Although maybe the inefficiency is fine

gusty condor Jan 30, 2025, 1:49 PM

#

Not really efficient, but worth trying

obsidian quest Jan 30, 2025, 4:27 PM

#

this is correct, however for some strange reason, the training will nan after some time if we apply 1.6x

i noticed this before. dont have time to debug it yet lol

#

similar to this

alpine ferry Jan 31, 2025, 12:39 PM

#

are any of the Goose models (1.5B, 3B or 7B) on huggingface to experiment with? I was thinking of running some long context experiments

misty cedar Jan 31, 2025, 4:39 PM

#

1b5 can do niah at least to 32k, bigger v7 models coming soon

misty igloo Jan 31, 2025, 5:57 PM

#

misty cedar 1b5 can do niah at least to 32k, bigger v7 models coming soon

to clarify, it can only do this once finetuned

misty igloo Jan 31, 2025, 5:57 PM

#

alpine ferry are any of the Goose models (1.5B, 3B or 7B) on huggingface to experiment with? ...

https://huggingface.co/fla-hub/rwkv7-1.5B-world

fla-hub/rwkv7-1.5B-world · Hugging Face

#

the others arent done yet, but 3b will be done feb 10

gusty condor Feb 19, 2025, 6:00 AM

#

I decided to put this table into the introduction of RWKV-7's paper. However, I don't understand exactly how TTT-linear and Titans update their states. I think TTT involves a mini-batch gradient descent, but I have no idea how to write the state evolution formula in a suitable format.

misty cedar Feb 19, 2025, 6:50 AM

#

gusty condor I decided to put this table into the introduction of RWKV-7's paper. However, I d...

no idea, but I am pretty sure rwkv 5 and 6 dont have diag(w) in them, I think its just w

grim grotto Feb 19, 2025, 9:50 AM

#

see eq 7 of https://arxiv.org/pdf/2407.04620 for TTT (except exclude the $x_t$). TTT is quite different from all the other techniques (it essentially maintains a state for 16 steps), so maybe just pretend the mini-batch size is 1 and add a comment as a footnote?

Effectively it would be

$$S_t = S_{t-1} (I - 2\eta k_t k_t^T) + 2 \eta v_t k_t^T$$

or equivalently

$$S_t = S_{t-1} -2\eta(S_{t-1} k_t - v_t) k_t^T$$

Where $\eta$ is a scalar
(You may have to switch the transposes based on the convention of row vs column vectors? For instance $v_t^T k_t$ in RWKV-6 would result in a single scalar under column-vector conventions, but it should actually be a matrix, so I assume you are using row-vector conventions)
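One way to check that the two forms agree, assuming column-vector conventions and a squared-error loss $\ell(S) = \lVert S k_t - v_t \rVert^2$ (the loss is an assumption, chosen because it reproduces the factor of 2 above): a single gradient step with step size $\eta$ gives

$$\nabla_S \ell(S) = 2 (S k_t - v_t) k_t^T$$

$$S_t = S_{t-1} - \eta \nabla_S \ell(S_{t-1}) = S_{t-1} - 2\eta (S_{t-1} k_t - v_t) k_t^T = S_{t-1} (I - 2\eta k_t k_t^T) + 2\eta v_t k_t^T$$

The last equality just expands the product and regroups the $S_{t-1}$ terms.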

grim grotto Feb 19, 2025, 10:00 AM

#

Titans seems to be almost exactly the same (also using minibatch), except it has

$$S_t = S_{t-1} (w_t I - 2\eta_t k_t k_t^T) + 2 \eta_t v_t k_t^T$$

Where $w_t$ and $\eta_t$ are learnable scalars

grim grotto Feb 19, 2025, 10:06 AM

#

Actually $w_t$ and $\eta_t$ may be input-dependent vectors so you would have to wrap these with diag
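A small pure-Python sketch of the two updates as written above, with mini-batch size 1, column-vector conventions, and scalar $w_t$ and $\eta_t$ (the vector case with diag is not covered). The helper names are made up for illustration and do not come from the TTT or Titans code.

```python
def matvec(S, k):
    """Matrix-vector product S k for a list-of-rows matrix."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]

def ttt_step(S, k, v, eta):
    """S_t = S_{t-1} - 2*eta*(S_{t-1} k_t - v_t) k_t^T."""
    err = [pred - target for pred, target in zip(matvec(S, k), v)]
    return [[S[i][j] - 2 * eta * err[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]

def titans_step(S, k, v, eta, w):
    """S_t = S_{t-1} (w I - 2*eta*k k^T) + 2*eta*v k^T, with scalar w and eta."""
    Sk = matvec(S, k)
    return [[w * S[i][j] - 2 * eta * (Sk[i] - v[i]) * k[j] for j in range(len(k))]
            for i in range(len(v))]

S0 = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]   # unit-norm key
v = [3.0, 4.0]
S1 = ttt_step(S0, k, v, eta=0.5)
S1_titans = titans_step(S0, k, v, eta=0.5, w=1.0)
retrieved = matvec(S1, k)
```

With a unit-norm key and $\eta = 0.5$, a single step from a zero state stores the pair exactly, so reading back with the same key returns $v$. Setting $w = 1$ in the Titans-style step reduces it to the TTT-style step.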

dawn pewter Feb 20, 2025, 3:02 AM

#

#

What is the range of a_t?

gusty condor Feb 20, 2025, 4:39 AM

#

(0,1)

gusty condor Feb 20, 2025, 11:51 AM

#

I want to have RWKV-7 paper posted on arxiv before March 1st. Currently, @paper dove , @iron parrot , @dawn pewter and I are working on it. Does anyone have suggestions on the current paper?

gusty condor Feb 20, 2025, 4:10 PM

#

@steady ether Have you tested RWKV-7 MQAR?
It seems that the special initialization of RWKV models is not used, which may affect performance.

steady ether Feb 20, 2025, 5:12 PM

#

gusty condor @steady ether Have you tested RWKV-7 MQAR? It seems that the special ini...

I tested it a while back. What’s the special initialization you’re referring to? I might have missed that—can you clarify? Here’s the current code:

https://github.com/guangyusong/zoology_fork/blob/rwkv7/zoology/mixers/rwkv7.py

misty igloo Feb 20, 2025, 5:41 PM

#

gusty condor I want to have RWKV-7 paper posted on arxiv before March 1st. Currently, <@10720...

I think we should try to submit to COLM ~~, and if we submit a preprint to arxiv first that will be disallowed due to anonymity periods~~

young sparrow Feb 20, 2025, 5:54 PM

#

misty igloo I think we should try to submit to COLM ~~, and if we submit a preprint to arxiv...

I am quite confident that this is false. That would be a much more stringent policy than they had last year, flies in the face of mainstream attitudes in ML, and there's nothing I can find on their website indicating it.

misty igloo Feb 20, 2025, 5:59 PM

#

young sparrow I am quite confident that this is false. That would be a much more stringent pol...

oh good, I must have misremembered

#

COLM will use the following policy, adapted from NeurIPS: 'Non-anonymous preprints (on arXiv, social media, websites, etc.) are permitted. We recommend you indicate “preprint”, rather than the “final” option in the template. Reviewers will be instructed not to actively look for such preprints, but encountering them will not constitute a conflict of interest.'
Yep, looks fine!

misty igloo Feb 20, 2025, 6:07 PM

#

steady ether I tested it a while back. What’s the special initialization you’re referring to?...

this part

            # !!! initialize if you are using RWKV_Tmix_x070 in your code !!!
            # self.receptance.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.key.weight.data.uniform_(-0.05/(C**0.5), 0.05/(C**0.5))
            # self.value.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.output.weight.data.zero_()

also, the 'suggestion' values for the LORAs would be good to follow, as those are what are actually used for the models
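For reference, a tiny sketch of the uniform bounds those commented lines imply for a given width C. The function name is hypothetical; only the constants (0.5, 0.05, 0.5, zero) and the 1/sqrt(C) scaling come from the quoted comment. How the bounds get applied depends on the host codebase's own init path.

```python
def rwkv7_tmix_init_bounds(C):
    """Half-widths of the uniform init from the quoted comment, for width C.

    A weight initialised with bound b is drawn from uniform(-b, b).
    """
    root = C ** 0.5
    return {
        "receptance": 0.5 / root,
        "key": 0.05 / root,   # ten times narrower than receptance and value
        "value": 0.5 / root,
        "output": 0.0,        # output projection starts at exactly zero
    }

bounds = rwkv7_tmix_init_bounds(1024)
```

With C = 1024 the receptance and value bounds come out to 0.5 / 32, and the key bound is a tenth of that.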

steady ether Feb 20, 2025, 6:14 PM

#

Ah, I completely forgot. Thanks!

misty igloo Feb 20, 2025, 10:14 PM

#

Added dataset details to the paper.
@obsidian quest is this the correct URL for Buzz-V12? https://huggingface.co/datasets/H-D-T/Buzz-V1.2

H-D-T/Buzz-V1.2 · Datasets at Hugging Face

misty igloo Feb 20, 2025, 10:52 PM

#

also, @obsidian quest could you describe what hardware and batchsizes etc. were used for the training

misty igloo Feb 21, 2025, 12:28 AM

#

and one more question: when you continued the World v2.0 models on World v2.1, how exactly did that work? It was just an additional 0.3T tokens trained? I know for RWKV-7 World v3.0 you trained again on the whole 3.1T World 3.0 corpus...

gusty condor Feb 21, 2025, 2:02 AM

#

steady ether Ah, I completely forgot. Thanks!

Uncommenting this would just not work - Initialization is handled by def _init_weights( at line 73 of model.py.

#

I suggest you further add this at line 81:

    if 'rwkv' in block_type.lower():
        # initialize embedding and head
        ...
        return

#

@steady ether Add Channel mix too. Your code did not use Channel Mix, and it is very different from GLU

gusty condor Feb 21, 2025, 2:33 AM

#

You didn't handle the v_first term properly either.

obsidian quest Feb 21, 2025, 3:35 AM

#

misty igloo Added dataset details to the paper. @obsidian quest is this the correct UR...

yes

obsidian quest Feb 21, 2025, 3:36 AM

#

misty igloo and one more question: when you continued the World v2.0 models on World v2.1, h...

full world v2.1 again

misty igloo Feb 21, 2025, 3:52 AM

#

obsidian quest full world v2.1 again

oh, so it was 1.1T World 2.0, then 1.4T World 2.1, then 3.1T World v3? the models have seen a total of 5.6T tokens?

gusty condor Feb 21, 2025, 3:53 AM

#

Exactly!

misty igloo Feb 21, 2025, 3:53 AM

#

ok, ~~will~~ have updated the manuscript accordingly 😉

#

I added an intro and did a bunch of checking and edits, will try to add a background section tomorrow and edit more things that I know aren't correct yet

#

Also added trained models section

obsidian quest Feb 21, 2025, 4:01 AM

#

misty igloo oh, so it was 1.1T World 2.0, then 1.4T World 2.1, then 3.1T World v3? the model...

although i think this might be slightly weaker than 5.6T + full LR schedule 🙂 just poor man's compute saving method

misty igloo Feb 21, 2025, 4:02 AM

#

obsidian quest although i think this might be slightly weaker than 5.6T + full LR schedule 🙂 j...

sure, I'm just describing what was done accurately
though really I don't know that multiple epochs is bad

#

especially when it was only 1-3 epochs, spread out by trillions each time

#

@gusty condor are you going to train a RWKVMusic for v7?

obsidian quest Feb 21, 2025, 4:11 AM

#

misty igloo sure, I'm just describing what was done accurately though really I don't know th...

multiepoch is fine. just that architecture upgrade & LR restart has some cost

#

so i think it's more like 1.1/2 + 1.4/2 + 3.1 🙂
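The arithmetic behind the two figures, as a quick sketch. Token counts are the single-decimal values quoted in the chat, in trillions. The half-weight on the two earlier stages is the rough estimate given just above, not a measured quantity.

```python
# Token counts per training stage, in trillions, as quoted in the chat.
stages = {"World v2.0": 1.1, "World v2.1": 1.4, "World v3": 3.1}

# Raw total: every token seen counts fully.
raw_total = sum(stages.values())

# Rough discount from the chat: the earlier stages count at half weight
# because of the architecture upgrade and the learning-rate restart.
weights = {"World v2.0": 0.5, "World v2.1": 0.5, "World v3": 1.0}
effective_total = sum(stages[name] * weights[name] for name in stages)

print(f"raw: {raw_total:.2f}T, effective: {effective_total:.2f}T")
```

So the 5.6T raw total corresponds to roughly 4.35T under that estimate.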

gusty condor Feb 21, 2025, 4:14 AM

#

misty igloo ok, ~~will~~ have updated the manuscript accordingly 😉

Not really, some RWKV-7 models are trained from scratch.

misty igloo Feb 21, 2025, 4:19 AM

#

gusty condor Not really, some RWKV-7 models are trained from scratch.

well I can only know what Bo says... which models were trained from scratch that are not Pile?

gusty condor Feb 21, 2025, 4:20 AM

#

0.1B is trained from scratch (likely)

#

I don't know if 0.4B is converted from RWKV-5

misty igloo Feb 21, 2025, 4:22 AM

#

@obsidian quest which models were trained from scratch? and which were converted from v5 and v6 and which ones are from world v2 etc?
0.1B - from scratch? just world v3?
0.4B - are all the others from v6 world2.1 upgraded?
1.5B
2.9B
and were those v6 world 2.1 all from v6 world2? or from v5

gusty condor Feb 21, 2025, 4:23 AM

#

misty igloo @obsidian quest which models were trained from scratch? and which were con...

0.1B: 1.0T
0.4B: ? + 2.0T
1.5B, 3B: likely v6 world v2.1 + 3.1T

obsidian quest Feb 21, 2025, 4:32 AM

#

misty igloo @obsidian quest which models were trained from scratch? and which were con...

all updated from previous models

#

0.1 from v5 world, 0.4 from v5 world 2, 1.5 2.9 from v6 world 2.1

misty igloo Feb 21, 2025, 4:33 AM

#

so for 0.1B and 0.4B did you upgrade to v7, then train world v3 directly for those? so they are only 1.1T + 3.1T?

gusty condor Feb 21, 2025, 4:36 AM

#

0.1B is likely world v1

misty igloo Feb 21, 2025, 4:36 AM

#

gusty condor 0.1B is likely world v1

he just wrote above that it was from v5 world 2...

gusty condor Feb 21, 2025, 4:39 AM

#

There is no world v2 0.1B model

misty igloo Feb 21, 2025, 4:43 AM

#

gusty condor There is no world v2 0.1B model

That's a very compelling point

#

RWKV-4 paper does not show the number of tokens or contents of World dataset

#

does anyone know this info?

#

I guess we can live without the contents, but the token count would be good to show

#

#

@obsidian quest can you provide the values for this chart for v7 World 3 training

#

(I think I have the config for Pile)

gusty condor Feb 21, 2025, 6:24 AM

#

misty igloo RWKV-4 paper does not show the number of tokens or contents of World dataset

World v1 is 0.59T

gusty condor Feb 21, 2025, 6:25 AM

#

misty igloo @obsidian quest can you provide the values for this chart for v7 World 3 t...

We also need the data behind the loss plot

#

Something like this

misty igloo Feb 21, 2025, 6:54 AM

#

gusty condor World v1 is 0.59T

let me know if you think this is correct

gusty condor Feb 21, 2025, 8:27 AM

#

I think it's better to use the name RWKV-7 more than Goose

#

World v1: 0.59

#

world v2: 1.12

#

v2.1: 1.42

#

v3: 3.119

misty igloo Feb 21, 2025, 7:49 PM

#

good to use the same precision consistently, so I think it should be either one or two decimal places

#

I have updated it to use a single decimal place for now

#

@gusty condor let's discuss whether to use "state of the art" versus "state of the open" - what are the closed source models at these scales against which we are competing, and can we find equivalent benchmarks for them?

#

I'm not 100% certain on even state of the open, but I haven't seen any models that beat RWKV-7 at the 3B scale

#

maybe some hybrids might? we need to check this

fresh mulch Feb 22, 2025, 12:06 AM

#

gusty condor I want to have RWKV-7 paper posted on arxiv before March 1st. Currently, <@10720...

My 4 cents:
Do we plan a section on speed/memory benchmarks like sec. 9 of the Eagle/Finch paper? I see it is currently commented in the LaTeX source.

I would also suggest we reformat Sec. 4.1.1 for clarity, because we introduce a dozen or so RWKV-specific variables and it's easy to forget the first few times around. I find myself frequently referring back to it for variable meaning and faster lookup would be great.

Similarly I would like to see intuitive explanations for some design choices throughout Sec. 4, and connections of ways in which Goose design choices can be considered similar (or different!) to other linear attention architectures, like how Eagle/Finch Sec. 4 did it, to contextualize the work in the broader linear attention landscape. (Maybe this will be covered in the background Sec. 2)

Spitballing on this last one, but I also wonder whether we can come up with any simple explanations and visuals. Technical media makes good papers, but simple media makes good blogposts, which in turn makes good (maybe even viral) publicity. I really like Figure 2 in this regard and wonder if we can expand on that.
For example Fig 1 from Attention hits about as many Google Search results as the whole query "RWKV". If we have something that is easily accessible in that regard, I think it will do wonders for RWKV's publicity. (Figure 4 is too intimidating, IMO.)

#

on another note is anyone testing Goose for music or audio modeling atm? if not I'd like to contribute that

misty igloo Feb 22, 2025, 12:14 AM

#

fresh mulch My 4 cents: Do we plan a section on speed/memory benchmarks like sec. 9 of the E...

thanks for the feedback, I will add the intuitive explanations into an appendix when I write the background Sec 2 because they are hard to get past reviewers without experimental substantiation

#

Definitely contribute some audio modeling if you can! Maybe you'd like to get the linaspeech code and rework it for v7? Or if @gusty condor doesn't have time to do it he could share the music modeling code with you to attempt that one

#

Good point on 4.1.1 - I did some reworking here already but more is needed. As usual, the problem is balancing total pagecount with readability. For arxiv it doesn't matter, so I could just add in a reference sheet, but I think it'd be nicer for the full paper to match somewhat

#

I will also include a code version in the appendix, with legible naming and comments

misty igloo Feb 22, 2025, 2:08 AM

#

fresh mulch My 4 cents: Do we plan a section on speed/memory benchmarks like sec. 9 of the E...

took some work since I'm bad at LaTeX, but does this help for 4.1.1?

obsidian quest Feb 22, 2025, 2:20 AM

#

https://x.com/BlinkDL_AI/status/1893123273871036670

by #1129309171137916948 message

BlinkDL (@BlinkDL_AI) on X

RWKV internal world model from its RNN state🙂

fresh mulch Feb 22, 2025, 3:13 AM

#

misty igloo Definitely contribute some audio modeling if you can! Maybe you'd like to get th...

I'll take a look at linaspeech. Mostly I wanted to replicate sec.s 10.1 and 11 from the Eagle/Finch paper to capture the generational leap but I'll see whatever I can make work

fresh mulch Feb 22, 2025, 3:15 AM

#

misty igloo took some work since I'm bad at LaTeX, but does this help for 4.1.1?

I like this more, yeah. I have some more nitpicks (e.g. ~~using alpha in the definition of replacement boosted key before we define it two lines down~~ no, actually, this makes sense) but I imagine this is a difficult problem to balance as you mentioned, and of course it's easier to point out problems than to fix them

#

other ideas include semantically spacing the lines, e.g. putting a \vspace{0.5em} between g, d, v, a and r, k, v

misty igloo Feb 22, 2025, 3:17 AM

#

fresh mulch I like this more, yeah. I have some more nitpicks (e.g. ~~using alpha in the def...

Keep telling me problems and I'll try to find fixes!

fresh mulch Feb 22, 2025, 3:21 AM

#

hm well it is really weird to me that we use two different font 'v's (well, \nu and v). For instance it is very difficult to tell them apart on my phone screen - could we do \bar{v} or something clearer for value without residual?

misty igloo Feb 22, 2025, 3:29 AM

#

fresh mulch hm well it is really weird to me that we use two different font 'v's (well, \nu ...

Good call. Was too cute using a different Greek letter that looks like v

gusty condor Feb 22, 2025, 3:39 AM

#

fresh mulch hm well it is really weird to me that we use two different font 'v's (well, \nu ...

They are really very different ... In physics classes we were taught $E = h \nu$, not $E = h v$.

silent urchinBOT Feb 22, 2025, 3:39 AM

#

Zhang Ruichong

gusty condor Feb 22, 2025, 3:40 AM

#

$\nu$ has a sharper angle at the bottom.

silent urchinBOT Feb 22, 2025, 3:40 AM

#

Zhang Ruichong

fresh mulch Feb 22, 2025, 3:41 AM

#

silent urchin **Zhang Ruichong**

It also has the serif on the top left. But it is for the sake of clarity: for example, if I lean back in my chair, or open Discord on a phone, I cannot distinguish them in this image.

gusty condor Feb 22, 2025, 3:50 AM

#

misty igloo Definitely contribute some audio modeling if you can! Maybe you'd like to get th...

It's just that I want a different style from the RWKV-5/6 paper. I think RWKV-7 state visualization and probing would be interesting.

#

That is more valuable than music modeling ( @iron parrot can do that)

misty igloo Feb 22, 2025, 6:45 AM

#

it was my mistake, I shouldn't have tried to use a similar looking letter and will change it to have a tilde or hat or something instead

#

same issue with kappa versus k

#

I just wanted something that looked like a k, since its related... but maybe it's better as a completely different letter

#

the problem is these things go through a few steps, so for example we already needed kappa hat

#

same issue with alpha and a btw

#

I'm really not sure what I would replace them with though

#

blink calls kappa 'kk' in the code lol

gusty condor Feb 22, 2025, 7:40 AM

#

RWKV-7 MQAR

L=512  KV=64  D=64   98.43%
L=512  KV=64  D=128 >99%
L=512  KV=64  D=256 >99%
L=512  KV=64  D=512 >99%

L=1024 KV=128 D=64   95.01%
L=1024 KV=128 D=128 >99%
L=1024 KV=128 D=256 >99%
L=1024 KV=128 D=512 >99%

L=2048 KV=256 D=64   72.93%
L=2048 KV=256 D=128  94.97%
L=2048 KV=256 D=256  98.97%
L=2048 KV=256 D=512 >99%

https://wandb.ai/rwkv_tune/zoology-rwkv-7

W&B

rwkv_tune

Weights & Biases, developer tools for machine learning

gusty condor Feb 23, 2025, 4:11 AM

#

@misty igloo What is "relaxed replacement semantics" in the abstract?

misty igloo Feb 23, 2025, 4:13 AM

#

gusty condor @misty igloo What is "relaxed replacement semantics" in the abstract?

The amount of the key that gets replaced varies between the in-context learning rate and 1.0

#

We can rephrase if you like

steady ether Feb 23, 2025, 4:28 AM

#

gusty condor RWKV-7 MQAR ``` L=512 KV=64 D=64 98.43% L=512 KV=64 D=128 >99% L=512 KV=6...

The inits did make a noticeable difference!

gusty condor Feb 23, 2025, 4:53 AM

#

@steady ether How many learning rates are tested?
I use $$ LR = \frac{(1.0, 2.0, 4.0)}{\sqrt{\mathrm{d\_model}} \cdot \mathrm{sequence\_length}} $$

#

Is our batch size aligned?

silent urchinBOT Feb 23, 2025, 4:58 AM

#

Zhang Ruichong

misty cedar Feb 23, 2025, 5:01 AM

#

does your inner monologue just run in latex?

gusty condor Feb 23, 2025, 5:06 AM

#

silent urchin **Zhang Ruichong**

See https://arxiv.org/abs/2407.05872 (and some others) for why I use such a formula

arXiv.org

Scaling Exponents Across Parameterizations and Optimizers

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parame...

#

After testing d_model = 64 at sequence length 1024, I transferred the LR across all runs
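A minimal sketch of the learning-rate grid from the formula above, read literally: each multiplier divided by sqrt(d_model) times sequence_length. The function name is made up. Reading "transferred" as reusing the winning multiplier with the same formula at other sizes is an assumption.

```python
import math

def lr_grid(d_model, sequence_length, multipliers=(1.0, 2.0, 4.0)):
    """Candidate learning rates c / (sqrt(d_model) * sequence_length)."""
    scale = math.sqrt(d_model) * sequence_length
    return [c / scale for c in multipliers]

# The configuration that was swept: d_model = 64, sequence length 1024.
grid = lr_grid(d_model=64, sequence_length=1024)

# Under the assumption above, another run would reuse the best multiplier, e.g.
wider = lr_grid(d_model=256, sequence_length=1024, multipliers=(2.0,))
```

For d_model = 64 and sequence length 1024 the denominator is 8 * 1024 = 8192, so the three candidates are 1/8192, 2/8192 and 4/8192.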

steady ether Feb 23, 2025, 5:27 AM

#

gusty condor Is our batch size aligned?

The batch size looks correct. Everything else is just zoology defaults from page 67 (https://arxiv.org/pdf/2312.04927). Let's go with the run you posted.

gusty condor Feb 23, 2025, 6:21 AM

#

#

@misty igloo You modified my formula here

#

Groupnorm is swallowed

misty igloo Feb 23, 2025, 6:26 AM

#

gusty condor @misty igloo You modified my formula here

Yes, you modified my original formula and made it very complex 🙂
I moved the LayerNorm (GroupNorm) to the prior section, where it belongs because it is per head (see eq 7)
I think this is much simpler than the new formula you added

#

This way we can keep everything per head in 4.1.2 and full vectors of size D in 4.1.3

gusty condor Feb 23, 2025, 6:27 AM

#

4.1.3 is not really dimension D

#

This is summed per-head:
$$ \langle r_{t, j} \mathrm{diag}(\rho_j), k_{t, j} \rangle $$

silent urchinBOT Feb 23, 2025, 6:28 AM

#

Zhang Ruichong

misty igloo Feb 23, 2025, 6:29 AM

#

I don't think adding tons of subscripts is making the paper better

#

it's just harder to follow

#

when the reality is that its a bunch of hadamard products and an addition

#

oh sorry

#

yes I made a mistake here

#

and I got busy and forgot to correct it

#

give me a few minutes to look over it - I'll put it back if I don't find a better solution

#

but it would be nice if 4.1.3 was less complex looking

gusty condor Feb 23, 2025, 6:35 AM

#

More subscripts in this paper https://arxiv.org/pdf/2406.06484

misty igloo Feb 23, 2025, 6:38 AM

#

I love songlin's papers but they are very challenging for non-mathematicians to read through

#

my original formula that was in the paper before it got changed was more like this

#

putting the heads together doesn't really happen until the very last step, right before multiplying with W_o

gusty condor Feb 23, 2025, 6:41 AM

#

It's 2025 and people are getting used to these

misty igloo Feb 23, 2025, 6:42 AM

#

but the actual situation is that everything is per head until W_o... everything except tokenshift can be considered and written per head

#

so there is simply no need for head subscripts in any part of the paper

#

I'm trying to keep things simple here so that it can be quickly and easily understood

gusty condor Feb 23, 2025, 6:45 AM

#

#

This is a per-head sum

misty igloo Feb 23, 2025, 6:46 AM

#

gusty condor This is a per-head sum

per head:
r (u \odot k)^T
would be that same sum, right?

#

its an inner product

gusty condor Feb 23, 2025, 6:46 AM

#

Yes, inner product weighted by r_k
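To make the agreement concrete: a sketch of the per-head weighted inner product, where the full-width vectors are split into heads and each head contributes one scalar. It assumes the `u` in `r (u \odot k)^T` plays the same role as $\rho$ in the bracket notation. The function name and the example numbers are made up.

```python
def weighted_inner_product_per_head(r, k, rho, head_size):
    """For each head j, return sum_i r[i] * rho[i] * k[i] over that head's channels."""
    sums = []
    for start in range(0, len(r), head_size):
        head = slice(start, start + head_size)
        sums.append(sum(ri * pi * ki
                        for ri, pi, ki in zip(r[head], rho[head], k[head])))
    return sums

# Two heads of size 2.
r = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 1.0, 2.0, 2.0]
rho = [1.0, 0.5, 1.0, 0.5]
per_head = weighted_inner_product_per_head(r, k, rho, head_size=2)
```

Whether written with head subscripts or as a per-head Hadamard product followed by a dot product, the result is one scalar per head, not a vector of size D.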

#

Also this c

#

Why do you remove that

misty igloo Feb 23, 2025, 6:47 AM

#

Why did you add it? It's not part of RWKV-7

#

I know it's part of some proofs

#

so I left it in the proofs as an extension

gusty condor Feb 23, 2025, 6:48 AM

#

This is naturally extendible and exists in some of @iron parrot 's experiments.

#

Should be considered a hyperparameter

misty igloo Feb 23, 2025, 6:49 AM

#

I don't think it should be in the official formulas if it's not in any RWKV-7 code that has ever publicly existed

#

I agree that it should be listed as an extension in the paper though

#

when people read the paper and see the code they shouldn't be surprised that the code does not have parts that are in the paper

#

if you still think it should be in the main formulas, we can just ask Blink if he wants it to be a part of the official RWKV-7 definition or listed as an extension

gusty condor Feb 23, 2025, 6:52 AM

#

C being set to 1 is more like a compromise

#

Originally it was 2

#

That caused NaNs in rc2

misty igloo Feb 23, 2025, 6:53 AM

#

you don't think it will be a problem for readers that the code literally doesn't have this in it?

#

I think it's pretty bad when I read a paper and the code does not conform to it

#

extremely confusing to the reader when that happens

gusty condor Feb 23, 2025, 6:55 AM

#

misty igloo you don't think it will be a problem for readers that the code literally doesn't...

we can add this in the code, instead of removing this from the paper

misty igloo Feb 23, 2025, 6:56 AM

#

gusty condor we can add this in the code, instead of removing this from the paper

okay, let's just ask Blink if he wants to do that
I'm definitely fine with having it in the main formulas if it's in the code (in that case it would actually have to be in the formulas!)

#

@obsidian quest what do you think? should we put this additional c parameter (it would be 1.0 in all existing models) into the codebases?

iron parrot Feb 23, 2025, 7:00 AM

#

I ran some loss tests on PG19 with different models. Surprisingly, the loss doesn't seem to get better with longer context lengths, even with newer models

#

Looks like this is something specific to this dataset

misty igloo Feb 23, 2025, 7:03 AM

#

iron parrot I ran some loss tests on PG19 with different models. Surprisingly, the loss does...

interesting - did you try it on any other non-RWKV model? maybe it never gets better over ctxlen for other models, either

iron parrot Feb 23, 2025, 7:06 AM

#

misty igloo interesting - did you try it on any other non-RWKV model? maybe it never gets be...

That's what I suspect. I'm currently testing RWKV on the Proof Pile dataset, and the loss goes down as context length increases

misty igloo Feb 23, 2025, 7:08 AM

#

iron parrot That's what I suspect. I'm currently testing RWKV on the Proof Pile dataset, and...

yes, I suppose PG19 was a poor choice 😭 - I ran that experiment at the very last minute for the Eagle/Finch paper, I think at Blink's request but I forget if Blink asked that I use that dataset or I chose it. Good to know that it's probably not ideal!

misty igloo Feb 23, 2025, 7:31 AM

#

@gusty condor I made a correction to your bonus formula... maybe this is a mistake in the model, but the code uses \tilde{k} for it
There is also a mistake where it does not apply the gate before the output matrix. I have corrected that as well but it does not fit nicely with your head indexing

iron parrot Feb 23, 2025, 7:32 AM

#

I'm about to run NIAH tests. Which NIAH variant should we use, or should we go with RULER instead?

gusty condor Feb 23, 2025, 7:58 AM

#

just simple passkey in garbage - at least rwkv-7 solved

misty igloo Feb 23, 2025, 8:09 AM

#

@brisk bronze ran this NIAH style passkey in garbage test and has results - she has an updated version of @iron parrot 's mamba repo that uses exact token counts instead of an approximation based on the average tokenizer bytes per token, as well as some other updates
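For anyone unfamiliar with the setup, a generic sketch of a passkey-in-garbage prompt builder that hits a token budget by counting tokens exactly rather than estimating from bytes per token. This is not the code from either repo mentioned above. The filler text, prompt wording, function name, and the whitespace "tokenizer" are stand-ins, and a real run would pass the model's own tokenizer as `count_tokens`.

```python
def build_passkey_prompt(passkey, target_tokens, depth, count_tokens,
                         filler="The grass is green. The sky is blue. "):
    """Build a passkey-retrieval prompt of at most target_tokens tokens.

    depth in [0, 1] sets how far into the filler the passkey sentence sits.
    count_tokens is any callable mapping a string to its exact token count.
    """
    needle = f"The pass key is {passkey}. Remember it. "
    question = "What is the pass key?"
    n_fill = 0
    # Add filler repeats while the exact token count still fits the budget.
    while count_tokens(filler * (n_fill + 1) + needle + question) <= target_tokens:
        n_fill += 1
    before = int(n_fill * depth)
    return filler * before + needle + filler * (n_fill - before) + question

# Stand-in tokenizer: one token per whitespace-separated word.
def whitespace_count(text):
    return len(text.split())

prompt = build_passkey_prompt("12345", target_tokens=100, depth=0.5,
                              count_tokens=whitespace_count)
```

Sweeping `target_tokens` and `depth` and checking whether the model's answer contains the passkey gives the usual NIAH-style grid.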

#

I asked her to add these to the paper a couple days ago but I guess she hasn't gotten around to it yet

misty igloo Feb 23, 2025, 8:11 AM

#

gusty condor @misty igloo What is "relaxed replacement semantics" in the abstract?

updated to "a relaxed value replacement rule"

iron parrot Feb 23, 2025, 8:23 AM

#

misty igloo @brisk bronze ran this NIAH style passkey in garbage test and has result...

Sounds good, let's go with her version then

iron parrot Feb 23, 2025, 10:37 AM

#

long-context loss tests on the Proof Pile dataset

misty igloo Feb 23, 2025, 2:38 PM

#

iron parrot Sounds good, let's go with her version then

We also finetuned a version for longer context by training it on 128k data and it increased its NIAH scores. Let me try to get you that in case you'd like to try it on proof pile

iron parrot Feb 23, 2025, 2:43 PM

#

misty igloo We also finetuned a version for longer context by training it on 128k data and i...

Would love to test it, as it seems the current world models are a bit 'overfitted' to 4k context lengths.

misty igloo Feb 23, 2025, 2:52 PM

#

iron parrot Would love to test it, as it seems the current world models are a bit 'overfitte...

this is the 1.5B model that we extended - https://huggingface.co/m8than/rwkv7-1b5-128k

m8than/rwkv7-1b5-128k · Hugging Face

obsidian quest Feb 23, 2025, 2:56 PM

#

https://x.com/BlinkDL_AI/status/1893676178206072946

BlinkDL (@BlinkDL_AI) on X

I am training G1 0.1/0.4/1.5/2.9B ("Goose One" 🪿) simultaneously on world-3.5 (5.16T tokens), continuing from previous RWKV-7 "Goose" world-3 checkpts. Release soon🙂even L12-D768 can reason.

misty igloo Feb 23, 2025, 3:12 PM

#

obsidian quest https://x.com/BlinkDL_AI/status/1893676178206072946

Is the data listed in the rwkv news channel everything you used? We can include in the paper

obsidian quest Feb 23, 2025, 3:19 PM

#

will provide latest list soon

gusty condor Feb 23, 2025, 5:08 PM

#

I suggest we open a separate new github link for all our experiments:

RWKV-7 training code (should only include RWKV-7)
MQAR testing
lm-eval code
state visualization

young sparrow Feb 23, 2025, 5:22 PM

#

Does the new RWKV have a working HF implementation yet?

brisk bronze Feb 23, 2025, 6:03 PM

#

misty igloo I asked her to add these to the paper a couple days ago but I guess she hasn't g...

yeah will add this today, sorry for the delay!

misty igloo Feb 23, 2025, 6:05 PM

#

young sparrow Does the new RWKV have a working HF implementation yet?

Yes in fla-hub there are models

#

I have older ones too, and we have an upcoming simplified Rwkv-Blocks repo that implements it too, but I recommend the fla-hub versions at this point

#

https://huggingface.co/collections/fla-hub/rwkv7-6790fd37b4b6137b088a0d8a

RWKV7 - a fla-hub Collection

gusty condor Feb 24, 2025, 1:56 AM

#

young sparrow Does the new RWKV have a working HF implementation yet?

There is a slight performance degradation

misty igloo Feb 24, 2025, 2:14 AM

#

gusty condor There is a slight performance degradation

In inference or training?

gusty condor Feb 24, 2025, 2:31 AM

#

Inference mainly

#

https://github.com/fla-org/flash-linear-attention/issues/198

GitHub

[Bug] RWKV-7 conversion results in subtly degraded performance, pos...

Checklist I have checked FAQs and existing issues for similar problems Please report this bug in English to ensure wider understanding and support Describe the Bug RWKV-7 in FLA format: Tasks Versi...

misty igloo Feb 24, 2025, 6:00 AM

#

gusty condor Inference mainly

can we put in non-FLA pure pytorch inference code to fix the problem, like in my original repo:
https://huggingface.co/SmerkyG/RWKV7-Goose-0.4B-Pile-HF/blob/02778effb99287d220d5d9494af4acf2af686296/modeling_rwkv7.py#L358

gusty condor Feb 24, 2025, 6:44 AM

#

misty igloo I have older ones too, and we have an upcoming simplified Rwkv-Blocks repo that ...

I created it.

#

But I suspect the main problem lies in the logit head and outputs, which increases ppl a little

misty igloo Feb 24, 2025, 6:48 AM

#

gusty condor But I suspect the main problem lies in the logit head and outputs, which increas...

hmm what aspect of those do you suspect increases it? just curious so I don't make that mistake in the future

gusty condor Feb 24, 2025, 7:04 AM

#

misty igloo hmm what aspect of those do you suspect increases it? just curious so I don't ma...

Haven't inspected it yet

misty igloo Feb 24, 2025, 7:05 AM

#

gusty condor Haven't inspected it yet

oh, one thing that used to matter in v6 was doing the normalization in fp32

#

I spent about 24 hours tracking that one down at one point

#

like you have to call torch.nn.functional.layer_norm or group_norm with the weights upcasted to float potentially (even though they are stored as bf16) so that the calculation is done in float precision

#

that could easily be the issue

#

it's not currently being done in o = self.g_norm(rearrange(o, '... h d -> ... (h d)'))

#

see this line in ChatRWKV for proof
https://github.com/BlinkDL/ChatRWKV/blob/626367863cf5860268c2fda81a5d43d423a69ebf/rwkv_pip_package/src/rwkv/model.py#L657
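
A minimal sketch of the upcasting idea described above, assuming PyTorch. The helper name `group_norm_fp32` and the tensor shapes are made up for illustration; this is not the FLA or ChatRWKV code, just one way to force the normalization to run in float32 while the tensors stay stored as bf16.

```python
import torch
import torch.nn.functional as F

def group_norm_fp32(x, num_groups, weight, bias, eps=1e-5):
    """Run group norm in float32 even when x and the parameters are stored as bf16."""
    out = F.group_norm(x.float(), num_groups, weight.float(), bias.float(), eps)
    return out.to(x.dtype)  # cast back to the storage dtype only at the very end

# Hypothetical shapes: 2 tokens, 4 heads of size 8, flattened to 32 channels.
torch.manual_seed(0)
x = torch.randn(2, 32).to(torch.bfloat16)
weight = torch.ones(32, dtype=torch.bfloat16)
bias = torch.zeros(32, dtype=torch.bfloat16)

y = group_norm_fp32(x, num_groups=4, weight=weight, bias=bias)
```

The output keeps the bf16 dtype, so the rest of the model is unaffected; only the statistics (mean and variance) are computed at higher precision.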

gusty condor Feb 24, 2025, 12:34 PM

#

GLUE looks problematic

#

There are several math and CS subtasks of MMLU where RWKV-7 lags behind Qwen by over 20%

obsidian quest Feb 24, 2025, 12:41 PM

#

qwen2.5 has much higher MMLU compared with llama3.2 too
they have lots of synthetic data

gusty condor Feb 24, 2025, 12:46 PM

#

I think there is some slight data leakage. Qwen-2.5's MMLU score should probably be discounted by around 6%.

gusty condor Feb 24, 2025, 3:03 PM

#

gusty condor GLUE looks like problematic

what else should be included? I think RWKV-6 2.1, any other models?

misty igloo Feb 24, 2025, 3:05 PM

#

gusty condor GLUE looks like problematic

This was done using the FLA hf so maybe the fix will help a bit. Glue was having some issues so I need to double check the methodology there. @brisk bronze where did we end up on that, did you manually average the accuracy based entries?

brisk bronze Feb 24, 2025, 3:07 PM

#

misty igloo This was done using the FLA hf so maybe the fix will help a bit. Glue was having...

The GLUE result for 2.9B is 55.19 using the FLA implementation. The stats in the picture are not based on FLA

gusty condor Feb 24, 2025, 3:08 PM

#

misty igloo This was done using the FLA hf so maybe the fix will help a bit. Glue was having...

No. It was I who tested with RWKV-7 pip.

misty igloo Feb 24, 2025, 3:14 PM

#

gusty condor No. It was I who tested with RWKV-7 pip.

oh ok, janna was running these evals yesterday so I didn't know who put them in

gusty condor Feb 24, 2025, 3:17 PM

#

brisk bronze Glue results for 2.9B is 55.19 using FLA implementation. The stats in the pictur...

It might be related to pad tokens

#

{
  "model": "/home/zhangping/zrc/RWKV-x070-World-2.9B-v3-20250211-ctx4096",
  "tasks": [
    "glue"
  ],
  "num_fewshot": 0,
  "results": {
    "glue": {
      "f1,none": 0.6928835730390357,
      "f1_stderr,none": 0.0036436664119938256,
      "acc,none": 0.6684581943782754,
      "acc_stderr,none": 0.0016703858434417523,
      "mcc,none": 0.05185503773957725,
      "mcc_stderr,none": 0.032600944586408685,
      "alias": "glue"
    },
    "cola": {
      "mcc,none": 0.05185503773957725,
      "mcc_stderr,none": 0.032600944586408685,
      "alias": " - cola"
    },
    "mnli": {
      "acc,none": 0.39449821701477333,
      "acc_stderr,none": 0.004933523584717906,
      "alias": " - mnli"
    },
    "mnli_mismatch": {
      "acc,none": 0.4044955248169243,
      "acc_stderr,none": 0.004949946753591583,
      "alias": " - mnli_mismatch"
    },
    "mrpc": {
      "acc,none": 0.7794117647058824,
      "acc_stderr,none": 0.020553105287596057,
      "f1,none": 0.8534201954397395,
      "f1_stderr,none": 0.016157946331836814,
      "alias": " - mrpc"
    },
    "qnli": {
      "acc,none": 0.5678198791872597,
      "acc_stderr,none": 0.006702886134456929,
      "alias": " - qnli"
    },
    "qqp": {
      "acc,none": 0.8064803363838734,
      "acc_stderr,none": 0.0019647755361788884,
      "f1,none": 0.6912635151132507,
      "f1_stderr,none": 0.003676786809851292,
      "alias": " - qqp"
    },
    "rte": {
      "acc,none": 0.7472924187725631,
      "acc_stderr,none": 0.026157719758464693,
      "alias": " - rte"
    },
    "sst2": {
      "acc,none": 0.893348623853211,
      "acc_stderr,none": 0.010458867008246837,
      "alias": " - sst2"
    },
    "wnli": {
      "acc,none": 0.5352112676056338,
      "acc_stderr,none": 0.0596130578497224,
      "alias": " - wnli"
    }
  }
}

#

These are my tests, definitely better

misty igloo Feb 24, 2025, 3:22 PM

#

we are using FLA HF for NIAH etc. too so we really need that repo to work properly

brisk bronze Feb 24, 2025, 3:23 PM

#

gusty condor ``` { "model": "/home/zhangping/zrc/RWKV-x070-World-2.9B-v3-20250211-ctx4096",...

yeah yours does better compared to fla, looks like cola was the most different. Everything else is a percentage point or so
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results/fla-hub__rwkv7-2.9B-world/results_2025-02-24T02-53-42.327679.json

strange gazelle Feb 24, 2025, 3:23 PM

#

misty igloo we are using FLA HF for NIAH etc. too so we really need that repo to work proper...

What is NIAH?

misty igloo Feb 24, 2025, 3:23 PM

#

strange gazelle What is NIAH?

needle in a haystack

gusty condor Feb 24, 2025, 3:25 PM

#

brisk bronze yeah yours does better compared to fla, looks like cola was the most different. ...

and qnli, qqp too

#

Reference:
https://github.com/howard-hou/VisualRWKV/blob/main/VisualRWKV-v6/v6.0/eval/run_lm_eval.py
(change to RWKV_PAD = [0] to align with rwkv fla)

GitHub

VisualRWKV/VisualRWKV-v6/v6.0/eval/run_lm_eval.py at main · howard-...

VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks. - howard-hou/VisualRWKV

misty igloo Feb 24, 2025, 3:33 PM

#

This thing of using RWKV_PAD before the text is pretty weird, and has been discussed as problematic before.
If we are going to do that for evals we should simply change the model to have it be in the starting state.
Where in the FLA HF code does it put [0] in the starting state?

gusty condor Feb 24, 2025, 3:43 PM

#

misty igloo This thing of using RWKV_PAD before the text is pretty weird, and has been discu...

bos_token = eos_token = pad_token = 0 = '<|rwkv_tokenizer_end_of_text|>'

#

then automatically handled by lm_eval

misty igloo Feb 24, 2025, 3:43 PM

#

yeah that's fine, as long as we tell that to lm-eval

#

doesn't it have a 'add_bos_token' option

#

https://github.com/EleutherAI/lm-evaluation-harness/blob/a9a0e3caaeecf3fb479c7c224fffd0af30a6ed96/lm_eval/models/huggingface.py#L81C9-L81C22

#

so we would need to set that when running like pretrained=MODEL,add_bos_token=True

#

I don't think it will add it automatically without that commandline setting

#

let's get the bugs fixed in the FLA HF implementation so we can use it for evals, and so that others who use it will not get bad results

brisk bronze Feb 24, 2025, 8:22 PM

#

misty igloo so we would need to set that when running like `pretrained=MODEL,add_bos_token=T...

setting add_bos_token=TRUE didn't change the score for lmbda.o for fla rwkv7 1.5B ftr

misty igloo Feb 24, 2025, 8:35 PM

#

True is not capitalized in python, I dunno if that matters

#

I mean the first letter is but not the rest

#

"True"

#

Also lambada is typically less sensitive to this somehow

#

I find the bos token impacts different evals differently with the average ending up not really different

#

The biggest issue is most likely the group norm bugfix

brisk bronze Feb 24, 2025, 9:26 PM

#

misty igloo The biggest issue is most likely the group norm bugfix

yeah, I don't think the True mattered

obsidian quest Feb 25, 2025, 12:34 AM

#

gusty condor GLUE looks like problematic

add a column: trained tokens 🙂

and show number of activated parameters (then rwkv params will be less)

#

and add https://huggingface.co/spaces/Jellyfish042/UncheatableEval

UncheatableEval - a Hugging Face Space by Jellyfish042

misty igloo Feb 25, 2025, 4:26 AM

#

obsidian quest and add https://huggingface.co/spaces/Jellyfish042/UncheatableEval

I think jellyfish already added this in section 6.2?

misty igloo Feb 25, 2025, 4:48 AM

#

misty igloo <@870137517020688415> what do you think? should we put this additional c paramet...

@obsidian quest to follow up on this, do you want to show the 'c' constant in the main RWKV-7 formulas? or just in the proofs

obsidian quest Feb 25, 2025, 4:58 AM

#

got paper link?

obsidian quest Feb 25, 2025, 5:01 AM

#

misty igloo <@870137517020688415> to follow up on this, do you want to show the 'c' constant...

actually i dont know what is c

misty igloo Feb 25, 2025, 5:04 AM

#

obsidian quest actually i dont know what is c

yeah, it's not in the code at all for rc4a but it's a generalization that maybe appeared in some earlier versions
that's why I wanted to ask your opinion

obsidian quest Feb 25, 2025, 5:04 AM

#

misty igloo yeah, it's not in the code at all for rc4a but it's a generalization that maybe ...

remove "c" then (and change figure 2 too)

misty igloo Feb 25, 2025, 5:05 AM

#

obsidian quest remove "c" then (and change figure 2 too)

I thought so too - @gusty condor can make the argument for it to you if he still thinks it's important

obsidian quest Feb 25, 2025, 5:06 AM

#

we should mention that expanding eigenvalue is useful for https://github.com/Jellyfish042/RWKV_Othello

and that is a slightly different formula

GitHub

GitHub - Jellyfish042/RWKV_Othello: A specialized RWKV-7 model for ...

A specialized RWKV-7 model for Othello(a.k.a. Reversi) that predicts legal moves, evaluates positions, and performs in-context search. Its performance scales with the number of test-time tokens. - ...

misty igloo Feb 25, 2025, 5:07 AM

#

I think that's what the 'c' variable is for - to allow expanded eigenvalue range

obsidian quest Feb 25, 2025, 5:08 AM

#

extending eigenvalue:
```
# orig:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 )
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*a)

# new:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*(a.float()*torch.exp(-torch.exp(w.float()))).to(dtype=torch.bfloat16))

# or (try both)

a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk*torch.exp(-torch.exp(w.float())).to(dtype=torch.bfloat16), kk*a)
```

#

it's a bit different
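
To see why multiplying the sigmoid by 2.0 widens the eigenvalue range, here is a simplified sketch. It assumes a scalar `a` and a unit vector `k`, and it ignores the decay `w`, so it only illustrates the idea and is not the RWKV-7 kernel. The function names are made up.

```python
def apply_rank_one_update(k, a, x):
    """Apply (I - a * k k^T) to x, where k is a unit vector and a is a scalar."""
    dot = sum(ki * xi for ki, xi in zip(k, x))
    return [xi - a * dot * ki for ki, xi in zip(k, x)]

def eigenvalue_along_k(k, a):
    """k is an eigenvector of (I - a * k k^T); return its eigenvalue, which is 1 - a."""
    y = apply_rank_one_update(k, a, k)
    return sum(ki * yi for ki, yi in zip(k, y))

k = [0.6, 0.8]                             # unit vector
orig_eig = eigenvalue_along_k(k, a=0.9)    # a in (0, 1): eigenvalue stays in (0, 1)
ext_eig = eigenvalue_along_k(k, a=1.5)     # a in (0, 2): eigenvalue can go negative
```

With `a` limited to (0, 1) the eigenvalue `1 - a` along `k` stays positive; letting `a` reach 2 lets it range over (-1, 1).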

misty igloo Feb 25, 2025, 5:10 AM

#

yeah, it wasn't my addition 🙂 I just don't like it when the formulas dont match the code because it confuses readers

gusty condor Feb 25, 2025, 5:36 AM

#

I believe semantically the constant c matters. It's just a compromise for training stability that we changed it to 1. Adding that c can better explain the motivation of RWKV-7.

gusty condor Feb 25, 2025, 5:54 AM

#

I can't reproduce Qwen's results on arc-c, winogrande and hellaswag on https://arxiv.org/pdf/2412.15115 .

obsidian quest Feb 25, 2025, 6:07 AM

#

gusty condor I believe semantically the constant c matters. It's just a compromise for traini...

but i am not using it

gusty condor Feb 25, 2025, 6:11 AM

#

You are using c=1

#

It's just a generalization

obsidian quest Feb 25, 2025, 6:27 AM

#

ok please simply remove c

#

because i dont think it is needed

obsidian quest Feb 25, 2025, 6:27 AM

#

obsidian quest extending eigenvalue:```orig: a = torch.sigmoid( self.time_aaaaa + (xa @ self.ti...

however this is useful

gusty condor Feb 25, 2025, 6:33 AM

#

obsidian quest because i dont think it is needed

The proof of RWKV-7's NC1 needs c>1

obsidian quest Feb 25, 2025, 6:34 AM

#

obsidian quest extending eigenvalue:```orig: a = torch.sigmoid( self.time_aaaaa + (xa @ self.ti...

you can use this @gusty condor

gusty condor Feb 25, 2025, 6:35 AM

#

OK, will add a subsection in the appendix to discuss that

misty igloo Feb 25, 2025, 7:04 AM

#

gusty condor The proof of RWKV-7's NC1 needs c>1

we could still keep the c variable in the proof section, and introduce it as an extension that we keep at 1 in the main model - that was the compromise I struck in my earlier edit

misty igloo Feb 25, 2025, 7:23 AM

#

I added an initial draft of a background section just now.

gusty condor Feb 25, 2025, 5:09 PM

#

There is too much in "others." How much instruction data and how many Chinese novels are there?

#

@obsidian quest could you elaborate on "others"?

obsidian quest Feb 25, 2025, 5:21 PM

#

gusty condor There are too much in "others." How much instruction and Chinese novels are ther...

world-3.0

science+wiki 222.7
math 32.3
law&gov 19.0
fiction 192.6
poetry+lyric 1.7
chat+qa+instruction 110.0
code 258.4
web 1945.2
total 3119.2

young sparrow Feb 25, 2025, 5:27 PM

#

Those numbers don't match the table currently, PSA

obsidian quest Feb 25, 2025, 5:50 PM

#

ok could someone please combine v2 + v2.1 + v3 items and arrange them to approximately match this list and i will fix on top of it because there are so many components

keen tartan Feb 25, 2025, 6:05 PM

#

obsidian quest ok could someone please combine v2 + v2.1 + v3 items and arrange them to approxi...

I could attempt doing it.
https://colab.research.google.com/drive/1Ic9RT-VzqEbdff350xPlXtJufBZJjHOK?usp=sharing
https://docs.google.com/spreadsheets/d/1HnwASXkgL6N3mLJQ5-8nkqJbs-yhJhKNFYJw6gpHoSs/edit

Google Colab

Google Docs

rwkv_world_datasets

misty igloo Feb 25, 2025, 6:59 PM

#

keen tartan I could attempt doing it. https://colab.research.google.com/drive/1Ic9RT-VzqEbdf...

If you need it the refined and checked v2.1 and v3 info is in the current paper draft (I went and got all the urls, cleaned up the names, etc.)

keen tartan Feb 25, 2025, 7:00 PM

#

misty igloo If you need it the refined and checked v2.1 and v3 info is in the current paper ...

Where is the current paper draft?

misty igloo Feb 25, 2025, 7:01 PM

#

https://www.overleaf.com/5753862368yvnbymysbrsf#07fba2

#

if you're going to put this together maybe a google sheet would be best

bronze frost Feb 25, 2025, 7:35 PM

#

I was looking at the Transition Matrix Stability Proof in the paper. Normally a contraction matrix is defined as having norm less than 1, not eigenvalues in (-1,1). It's misleading to say it's a contraction, since being a contraction would imply that the state cannot blow up. However, the state can blow up.
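
A generic 2x2 illustration of the distinction (not the RWKV-7 transition matrix; the numbers are chosen only for the example): a matrix can have all eigenvalues inside (-1, 1) and still lengthen a vector, because its norm is larger than 1.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Upper triangular, so the eigenvalues are the diagonal entries: both are 0.5.
A = [[0.5, 10.0],
     [0.0, 0.5]]

v = [0.0, 1.0]
Av = matvec(A, v)
growth = norm(Av) / norm(v)   # ratio > 1 means this step is not a contraction
```

Here one multiplication makes the vector roughly ten times longer even though both eigenvalues are 0.5, which is why "eigenvalues in (-1, 1)" alone does not rule out growth.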

bronze frost Feb 25, 2025, 8:43 PM

#

Additionally, I wrote a ~100 line standalone implementation of RWKV-7 inference in numpy (to avoid hiding things in torch functions). It's verified numerically against the pip rwkv package.

📎 minimal_rwkv7.py

gusty condor Feb 26, 2025, 3:59 AM

#

bronze frost I was looking at the Transition Matrix Stability Proof in the paper. Normally a ...

Sorry, I was mistaken.
Actually, because it's similar to a symmetric matrix, if we fix a, it is indeed a contraction, since spectral norm is equal to the eigenvalue with largest absolute value for symmetric matrices.
It's just that a isn't really fixed; it is a_t. Since I removed the subscripts in the problem statement, I just forgot that dynamic dependence😭

misty igloo Feb 26, 2025, 7:43 AM

#

I'm not sure who is adding the chat examples, but we should discuss this before you do. Using the "base Gradio 7B model" (whatever that is) is not appropriate for a paper that does not include any 7B model.

#

I'm also not sure we want to show chat examples in this paper.

gusty condor Feb 26, 2025, 7:51 AM

#

I don't want that.

#

We should include more technical stuff.

misty igloo Feb 26, 2025, 7:51 AM

#

Agreed. I think we are playing with the big boys now and are way past needing to show that it can talk nicely.

#

I added a draft appendix section on design decisions, walking people through how it all works and why.

gusty condor Feb 26, 2025, 7:59 AM

#

misty igloo Agreed. I think we are playing with the big boys now and are way past needing to...

Now it can even rival Qwen-2.5 as a base model (not a chat model)

young sparrow Feb 26, 2025, 1:05 PM

#

@misty igloo @gusty condor I was going to read through the paper and do a suggestions / editing pass today. Is there anything in particular you'd like me to focus on?

obsidian quest Feb 26, 2025, 1:28 PM

#

@keen tartan

📎 rwkv_world_datasets.xlsx

obsidian quest Feb 26, 2025, 2:30 PM

#

pls add lora dimensions suggestions as in RWKV-LM

#

more suggestions: wd 0.1 // adam beta (0.9, 0.99) // adam eps 1e-18

obsidian quest Feb 26, 2025, 3:17 PM

#

add RWKV-4 1.5b to Compression rate% eval

obsidian quest Feb 26, 2025, 3:37 PM

#

v7 0.1/1.5/2.9/0.4 loss curves

#


0.1 - bsz 240 lr_init 6e-4

0.4 - bsz 240 lr_init 5e-4 => bsz 480 lr_init 6e-4

1.5 - bsz 480 lr_init 4e-4 => bsz 672 lr_init 4.5e-4 => bsz 1152 lr_init 6.1e-4

2.9 - bsz 640 lr_init 4e-4 => bsz 1008 lr_init 5e-4 => bsz 1120 lr_init 5.4e-4 => bsz 2016 lr_init 8e-4

all - wd 0.1 // adam beta (0.9, 0.99) // adam eps 1e-18 // lr_final 1e-5

iron parrot Feb 26, 2025, 3:53 PM

#

obsidian quest add RWKV-4 1.5b to Compression rate% eval

done

obsidian quest Feb 26, 2025, 4:08 PM

#

let's try v7 for sudoku too @iron parrot

obsidian quest Feb 26, 2025, 4:17 PM

#

obsidian quest pls add lora dimensions suggestions as in RWKV-LM

gusty condor Feb 26, 2025, 4:56 PM

#

obsidian quest v7 0.1/1.5/2.9/0.4 loss curves

Can you add final loss too? And what exactly is the learning rate curve?

#

TODO:
5. Add limitations and acknowledgements

obsidian quest Feb 26, 2025, 5:08 PM

#

gusty condor Can you add final loss too? And what exactly is the learning rate curve?

green = LR curve

gusty condor Feb 26, 2025, 5:10 PM

#

I mean, is there a formula for it?

obsidian quest Feb 26, 2025, 5:10 PM

#

obsidian quest ```0.1 - bsz 240 lr_init 6e-4 0.4 - bsz 240 lr_init 5e-4 => bsz 480 lr_init 6e-...

this. cosine decay. i change LR & BSZ when number of compute nodes changes.

neon tree Feb 26, 2025, 5:58 PM

#

how could I contribute to the manuscript? (from the fla group)

obsidian quest Feb 26, 2025, 6:02 PM

#

neon tree how could contribute to the manuscript (from fla groups <:berk:75011147648375...

https://www.overleaf.com/project/66f18b0e9f309a35970802df

young sparrow Feb 26, 2025, 6:02 PM

#

@obsidian quest It looks like I don't have permissions to view the overleaf

#

Can you add me?

obsidian quest Feb 26, 2025, 6:03 PM

#

i am not its owner 😂

young sparrow Feb 26, 2025, 6:05 PM

#

Who is?

misty igloo Feb 26, 2025, 6:07 PM

#

young sparrow Who is?

https://www.overleaf.com/5753862368yvnbymysbrsf#07fba2

Overleaf, Online LaTeX Editor

An online LaTeX editor that’s easy to use. No installation, real-time collaboration, version control, hundreds of LaTeX templates, and more.

#

@tropic minnow is the owner, but that URL will allow you access

keen tartan Feb 26, 2025, 6:18 PM

#

Does Overleaf have a dark/night mode? It is so bright! o.O

keen tartan Feb 26, 2025, 6:37 PM

#

Updated the world datasets itemized lists:
https://colab.research.google.com/drive/1Ic9RT-VzqEbdff350xPlXtJufBZJjHOK#scrollTo=udl8RkeeM-yE
https://docs.google.com/spreadsheets/d/1HnwASXkgL6N3mLJQ5-8nkqJbs-yhJhKNFYJw6gpHoSs/edit?gid=1049532087#gid=1049532087
Was able to figure out citations for most of them.

Google Colab

Google Docs

rwkv_world_datasets

#

I noticed that the DeepMind Mathematics (dm_math) dataset is part of The Pile 1 and was already included in world-v2 I assume, but it seems to be not mentioned in the Eagle & Finch paper. Where should it be placed?

obsidian quest Feb 26, 2025, 6:48 PM

#

keen tartan I noticed that the `DeepMind Mathematics` (dm_math) dataset is part of `The Pile...

mention we added it but missed mentioning

keen tartan Feb 26, 2025, 6:49 PM

#

obsidian quest mention we added it but missed mentioning

I see. I'll try to incorporate it somehow.

#

Should it perhaps be added as an Errata to the Eagle & Finch paper too?

obsidian quest Feb 26, 2025, 7:41 PM

#

v7 0.4b = v5 0.4b + subsampled 2T tokens from world-3

misty igloo Feb 26, 2025, 8:26 PM

#

young sparrow <@1007072846960410685> <@803473343705514025> I was going to read through the pap...

That would be great! I don't have a particular area in mind - I did a lot of the abstract/intro/background/description writing in just the past few days, so they are all essentially early drafts. I'm very open to any kind of perspective you can lend on general flow, narrative, and what should be emphasized.

#

Also, a lot of the evals are still preliminary or missing. We are working on some discrepancy issues we've found to ensure everything is really solid.

spiral minnow Feb 26, 2025, 9:26 PM

#

misty igloo That would be great! I don't have a particular area in mind - I did a lot of the...

Do you need any additional hands on paper writing/editing?

fringe egret Feb 27, 2025, 1:56 AM

#

Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is doing it, may I take on this part of the test?

unborn lintel Feb 27, 2025, 2:09 AM

#

Would it make sense to mention the D512 and D576 variants of the 0.1B model?

misty igloo Feb 27, 2025, 3:31 AM

#

fringe egret Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is d...

I'm not sure we need this specific benchmark for the paper, but you're welcome to provide it if you like. It will likely end up in the Appendix if so. If you do decide to add it, you will need updated benchmark results for modern models such as Qwen2.5 3B and Llama3.2 3B and 1.5B, as well as any other top tier models in those sizes.

#

It is mostly there simply because I copied over the results from benchmarks in the Eagle/Finch paper.
Also, if you decide to run it you might consider using our extended context finetunes of 1.5B and 2.9B.

misty igloo Feb 27, 2025, 3:34 AM

#

unborn lintel Would it make sense to mention the D512 and D576 variants of the 0.1B model?

I think this complicates the paper unnecessarily. Did Blink ever even release these?

#

I recall that they are narrower but deeper.

gusty condor Feb 27, 2025, 3:36 AM

#

fringe egret Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is d...

I think it's @brisk bronze who is doing these benchmarks

gusty condor Feb 27, 2025, 3:37 AM

#

misty igloo I think this complicates the paper unnecessarily. Did Blink ever even release th...

Yes, released. See https://huggingface.co/BlinkDL/temp-latest-training-models/tree/main

misty igloo Feb 27, 2025, 3:38 AM

#

gusty condor Yes, released. See https://huggingface.co/BlinkDL/temp-latest-training-models/tr...

which files? I don't see

#

What's your opinion on listing them? I think it may complicate the paper unnecessarily

#

We could add a section on depth versus width ablations, but I'm also not sure that this is really a RWKV specific result

gusty condor Feb 27, 2025, 3:43 AM

#

Oh here it is! https://huggingface.co/BlinkDL/rwkv-7-pile/tree/main

gusty condor Feb 27, 2025, 3:44 AM

#

misty igloo We could add a section on depth versus width ablations, but I'm also not sure th...

Is there previous research about this?

brisk bronze Feb 27, 2025, 3:44 AM

#

gusty condor I think it's <@533592838529744917> who is doing these benchmarks

I wasn't running bamboo.. not sure who is doing it

nova frost Feb 27, 2025, 3:45 AM

#

I think bamboo will be pretty low signal for base models

fringe egret Feb 27, 2025, 3:45 AM

#

misty igloo I'm not sure we need this specific benchmark for the paper, but you're welcome t...

Thank you. I'll first see how the existing methods work.

nova frost Feb 27, 2025, 3:46 AM

#

They mostly have instruct benchmarks in their paper and the tasks are structured in a way that base models will do poorly

misty igloo Feb 27, 2025, 3:46 AM

#

gusty condor Is there previous research about this?

the problem is that it's very niche - we don't have comparison models of other architectures with these changes
so we could show it in its own separate ablations section I suppose, but it wont be relative to other architectures

misty igloo Feb 27, 2025, 3:47 AM

#

fringe egret Thank you. I'll first see how the existing methods work.

sounds like Baber thinks this specific benchmark won't be valuable on base models, so let's skip it (he's in a good position to know, since he works on lm eval harness!)

gusty condor Feb 27, 2025, 3:47 AM

#

Skip it, ok

#

I tested gsm8k and found that it's very sensitive to response format, so I decided to skip it

nova frost Feb 27, 2025, 3:48 AM

#

fringe egret Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is d...

It’s quite apparent here

#

But maybe the instruct models have context extension idk

misty igloo Feb 27, 2025, 3:49 AM

#

Yeah we also have context-extended versions that we just trained, but we already will show NIAH for that

#

@brisk bronze is doing those, with her fork of what was originally jellyfish's revision of the mamba test 🤣

nova frost Feb 27, 2025, 3:52 AM

#

Yeah. I think niah single needle and maybe one other. Multi key/query depending on the framing

misty igloo Feb 27, 2025, 3:53 AM

#

single needle improves quite a lot with the extension

#

like 32k->48k @ 3B scale

obsidian quest Feb 27, 2025, 4:38 AM

#

misty igloo the problem is that it's very niche - we don't have comparison models of other a...

it's relevant for other models too because smollm is deep+narrow as well

misty igloo Feb 27, 2025, 4:40 AM

#

obsidian quest it's relevant for other models too because smollm is deep+narrow as well

smollm 1.7B is 2048 x 24

#

smollm 135M is 576 x 30

obsidian quest Feb 27, 2025, 4:41 AM

#

https://huggingface.co/HuggingFaceTB/SmolLM2-135M/blob/main/config.json it's L30 D576

misty igloo Feb 27, 2025, 4:42 AM

#

yeah sorry typo

#

there's no really good comparison point because we don't have a 'normal' depth smollm2

#

but we can show them all side by side

#

just cant really draw much of a conclusion

#

also SmolLM is not trained on pile

#

@obsidian quest what kind of comparison with SmolLM were you thinking we would show?

obsidian quest Feb 27, 2025, 4:46 AM

#

#

so this choice is from MobileLLM

misty igloo Feb 27, 2025, 4:47 AM

#

gotcha

#

that used cross-layer weight sharing to reduce device RAM usage iiuc

gusty condor Feb 27, 2025, 8:25 AM

#

State visualization for v6. Working on v5 and v7

gusty condor Feb 27, 2025, 9:10 AM

#

SR: stable rank, (Frobenius norm / spectral norm) ^ 2
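
A small pure-Python sketch of this definition, for readers who want to check it on toy matrices. It estimates the spectral norm with power iteration instead of an SVD; the function name and the toy inputs are made up, and this is not the code used for the figures.

```python
import math
import random

def stable_rank(A, iters=200, seed=0):
    """Stable rank = (Frobenius norm / spectral norm) ** 2, for A given as a list of rows."""
    n_rows, n_cols = len(A), len(A[0])
    frob_sq = sum(x * x for row in A for x in row)
    if frob_sq == 0.0:
        return 0.0  # convention chosen here for the all-zero matrix

    # Power iteration on A^T A; its top eigenvalue is the squared spectral norm.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    v_norm = math.sqrt(sum(x * x for x in v))
    v = [x / v_norm for x in v]
    spec_sq = 0.0
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm == 0.0:
            break
        spec_sq = w_norm  # v has unit length, so |A^T A v| estimates the top eigenvalue
        v = [x / w_norm for x in w]
    if spec_sq == 0.0:
        raise ValueError("power iteration started in the null space; try another seed")
    return frob_sq / spec_sq

identity4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
rank_one = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]   # outer product of [1, 2] and [1, 2, 3]

sr_identity = stable_rank(identity4)
sr_rank_one = stable_rank(rank_one)
```

A rank-1 state (the "checkerboard" pattern) has stable rank close to 1, while a matrix whose singular values are all equal has stable rank equal to its dimension.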

#

V7

unborn lintel Feb 27, 2025, 12:05 PM

#

Visualizations of S @ ones(64,1) for each head, arranged per layer, for some text generated with 0.1B, normalized the same way as the cryscan webgpu state visualization demo

unborn lintel Feb 27, 2025, 12:06 PM

#

gusty condor V7

Visualizations of S @ ones(64,1) for each head, arranged per layer, for some text generated with 0.1B, colored the same way as the state visuals above

gusty condor Feb 27, 2025, 12:27 PM

#

unborn lintel Visualizations of S @ ones(64,1) for each head, arranged per layer, for some tex...

No, this is too large for the paper

obsidian quest Feb 28, 2025, 7:41 AM

#

Interestingly, the stable rank of the WKV matrix in RWKV-7 has shown to be lower than that
of RWKV-5 and RWKV-6.

this is strange. if you check state visualization, rwkv7 states look much more "random" while rwkv6 states are more like checkerboards (rank 1)

gusty condor Feb 28, 2025, 8:53 AM

#

You can rerun those experiments. Actually in some layers of RWKV-7, the state is very concentrated.

lethal oyster Feb 28, 2025, 9:27 AM

#

obsidian quest >> Interestingly, the stable rank of the WKV matrix in RWKV-7 has shown to be lo...

Have you started using Muon optimizer or some other related new optimizer?
This may be relevant:
https://docs.modula.systems/examples/weight-erasure/
https://x.com/jxbz/status/1845146681274478856
https://x.com/ssnl_tz/status/1845179813755224406

obsidian quest Feb 28, 2025, 10:17 AM

#

gusty condor You can rerun those experiments. Actually in some layers of RWKV-7, the state is...

got code?

#

https://cryscan.github.io/web-rwkv-puzzles/#/replay

RWKV Web

An ultra-fast and efficient AI runs directly in your browser.

gusty condor Feb 28, 2025, 10:26 AM

#

lethal oyster Have you started using Muon optimizer or some other related new optimizer? This ...

We tested Muon, but Muon may not be efficient for RWKV's LoRA gates.

gusty condor Feb 28, 2025, 10:31 AM

#

obsidian quest got code?

DMed you

gusty condor Feb 28, 2025, 1:59 PM

#

@misty igloo I got the formula for parameters correct:
$$ \#(\mathrm{Params}) = 2DV + 4D + LD \left(12D + 2\left(d_w + d_a + d_v + d_g \right) + 19 \right) - (2Dd_v + D) $$
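
The formula can be evaluated directly. In this sketch the reading of the symbols is an assumption (D as model dimension, V as vocabulary size, L as number of layers, and d_w, d_a, d_v, d_g as the low-rank dimensions), and the tiny input values are chosen only so the arithmetic is easy to check by hand; they are not a real model configuration.

```python
def rwkv7_param_count(D, V, L, d_w, d_a, d_v, d_g):
    """Evaluate the parameter-count formula term by term."""
    embeddings_and_head = 2 * D * V
    extra_vectors = 4 * D
    per_layer = D * (12 * D + 2 * (d_w + d_a + d_v + d_g) + 19)
    correction = 2 * D * d_v + D   # subtracted once, as in the formula
    return embeddings_and_head + extra_vectors + L * per_layer - correction

# Toy values: 80 + 16 + 2 * 4 * (48 + 8 + 19) - (8 + 4)
toy_count = rwkv7_param_count(D=4, V=10, L=2, d_w=1, d_a=1, d_v=1, d_g=1)
```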

silent urchinBOT Feb 28, 2025, 1:59 PM

#

Zhang Ruichong

gusty condor Feb 28, 2025, 2:59 PM

#

Please double-check Appendix E. I'm finishing in a few hours!

obsidian quest Feb 28, 2025, 4:38 PM

#

obsidian quest >> Interestingly, the stable rank of the WKV matrix in RWKV-7 has shown to be lo...

please check this @uneven blade 🙂

misty igloo Feb 28, 2025, 4:53 PM

#

gusty condor <@1007072846960410685> I got the formula for parameters correct: $$ \#(\mathrm{P...

looks correct, I had missed v0 earlier - thanks for updating that

#

@obsidian quest did you really use adam_eps=1e-18 for the entire runs?

obsidian quest Feb 28, 2025, 4:55 PM

#

misty igloo <@870137517020688415> did you really use adam_eps=1e-18 for all of the the entir...

yes

#

ok probably sometimes it NaN in 1 step because of this 😂 maybe 1e-16 will avoid this

misty igloo Feb 28, 2025, 4:56 PM

#

obsidian quest ok probably sometimes it NaN in 1 step because of this 😂 maybe 1e-16 will avoid...

lol - did that happen? if so how did you fix it?

obsidian quest Feb 28, 2025, 4:57 PM

#

i just rewind a bit with cleared optimizer states

misty igloo Feb 28, 2025, 4:57 PM

#

obsidian quest i just rewind a bit with cleared optimizer states

also, what are those learning rates shown that deviate from the schedule? and the schedule doesn't look like cosine, what was it?

obsidian quest Feb 28, 2025, 4:57 PM

#

obsidian quest ```0.1 - bsz 240 lr_init 6e-4 0.4 - bsz 240 lr_init 5e-4 => bsz 480 lr_init 6e-...

this. i change bsz because of hardware constraint (number of nodes)

misty igloo Feb 28, 2025, 5:00 PM

#

obsidian quest this. i change bsz because of hardware constraint (number of nodes)

but.. it still doesn't look like a cosine 🙂

obsidian quest Feb 28, 2025, 5:00 PM

#

different cosines patched together

misty igloo Feb 28, 2025, 5:00 PM

#

it looks like a time stretched cosine

#

do you have a formula?

#

visually it looks something like cos(t**2)

obsidian quest Feb 28, 2025, 5:04 PM

#

=(1e-4)*(0.01+0.495*(1+COS(x*PI))) is this cosine decay

#

oh it's because i am using log axis for y
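
The spreadsheet formula above is an ordinary cosine decay from 1e-4 down to 1e-6. A short sketch, with `x` taken as training progress in [0, 1] (that reading of `x`, and the function names, are assumptions):

```python
import math

def cosine_lr(x, lr_init=1e-4, lr_final=1e-6):
    """Cosine decay from lr_init (x = 0) to lr_final (x = 1)."""
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * x))

def spreadsheet_lr(x):
    """The formula as written in the message, with COS and PI spelled in Python."""
    return 1e-4 * (0.01 + 0.495 * (1.0 + math.cos(x * math.pi)))

midpoint_lr = cosine_lr(0.5)
# Fraction of the log-scale distance from lr_init to lr_final covered at the midpoint.
log_fraction = (math.log10(1e-4) - math.log10(midpoint_lr)) / (math.log10(1e-4) - math.log10(1e-6))
```

At half of training the learning rate has covered half of the linear distance but only about 15% of the distance on a log axis, which is why the plotted curve looks like a stretched cosine.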

misty igloo Feb 28, 2025, 5:07 PM

#

AH.. ok that makes sense now

#

sorry didnt notice that

gusty condor Feb 28, 2025, 5:09 PM

#

misty igloo <@870137517020688415> did you really use adam_eps=1e-18 for all of the the entir...

I added a reference for that

misty igloo Feb 28, 2025, 5:13 PM

#

@gusty condor I think we should mention that we increase the number of compute nodes as training progresses

#

@obsidian quest how many nodes and what kind of GPU was used total?

gusty condor Feb 28, 2025, 5:34 PM

#

misty igloo <@803473343705514025> I think we should mention that we increase the number of c...

Yes, let's make it an advantage

This approach not only enhances training efficiency but also utilizes GPU resources economically. After smaller models complete their training, additional GPU resources become available for the later stages of training larger models. This cascading resource allocation ensures that computational power is dynamically reallocated, maximizing hardware utilization and reducing idle time.

misty igloo Feb 28, 2025, 5:35 PM

#

Great work! This section is really looking good.
Should we provide the FLOPs counts? I know it has been useful for people in the past, including Quentin

#

And it can be helpful if we want to put in a table comparing total trained FLOPs vs quality, like we had in the Eagle/Finch paper

#

I think that will most clearly show the pareto improvement of RWKV7 over these other heavily trained models

#

We can make it short and simple instead of the longwinded version that we had earlier.

gusty condor Feb 28, 2025, 5:49 PM

#

misty igloo Great work! This section is really looking good. Should we provide the FLOPs cou...

I think the simple formula 6 * model size * training tokens suffices
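A minimal sketch of that approximation, in case it helps when building the FLOPs table. The function name and the numbers in the example are illustrative only and are not taken from any model in the paper.

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute as 6 * model size * training tokens."""
    return 6 * n_params * n_tokens

# Hypothetical example: a 1B-parameter model trained on 1T tokens.
example = training_flops(1_000_000_000, 1_000_000_000_000)
```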

#

@misty igloo I think there is a paper named "regular languages in nc1" and you can cite that. (assuming that Wu Tianyi's proof is good)

misty igloo Feb 28, 2025, 5:54 PM

#

gusty condor <@1007072846960410685> I think there is a paper named "regular languages in nc1...

@bronze frost and I have been discussing the proof at length

#

we think it needs some revision, but there may be something we can claim that exceeds the abilities of transformers

#

it also may be able to be simplified quite a bit

gusty condor Feb 28, 2025, 5:57 PM

#

Yes I agree

misty igloo Feb 28, 2025, 5:58 PM

#

for example, @bronze frost has a very simple construction for showing that you can create true transpositions (row-pair permutation matrices) with RWKV-7

gusty condor Feb 28, 2025, 5:58 PM

#

I think NC1 can be achieved by just householder matrices (I'm not an expert in complexity theory)
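As a small sanity check of the Householder idea: the standard reflection I - 2uu^T with u = (e_i - e_j)/sqrt(2) is exactly the permutation matrix that swaps coordinates i and j. The sketch below only demonstrates that linear-algebra fact; how it maps onto the RWKV-7 transition matrix is not shown here, and the helper names are hypothetical.

```python
import math

def householder(u):
    """Return I - 2 u u^T for a unit vector u, as a list of rows."""
    n = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(n)]
            for i in range(n)]

def swap_vector(n, i, j):
    """Unit vector (e_i - e_j) / sqrt(2)."""
    u = [0.0] * n
    u[i] = 1.0 / math.sqrt(2.0)
    u[j] = -1.0 / math.sqrt(2.0)
    return u

# Reflection that should swap coordinates 1 and 3 of a length-4 vector.
H = householder(swap_vector(4, 1, 3))

def apply(matrix, x):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]

swapped = apply(H, [10.0, 20.0, 30.0, 40.0])
```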

misty igloo Feb 28, 2025, 5:59 PM

#

unfortunately, we think a true full permutation matrix requires multiple tokens

gusty condor Feb 28, 2025, 5:59 PM

#

I agree with that

misty igloo Feb 28, 2025, 6:00 PM

#

but within a single token, having a two-row permutation should exceed transformers' abilities

#

so for example, we should be able to solve swaps on S5 using only incoming (prefill) tokens, which afaict transformers cannot do

#

supposedly this makes it so that we can correctly claim being NC1 complete under reduction by AC0

#

I'm not even a novice in this stuff, let alone an expert tho 🙂

gusty condor Feb 28, 2025, 6:03 PM

#

Yes, @iron parrot tested on the parity experiment, RWKV-7 can grok while transformers can't

misty igloo Feb 28, 2025, 6:04 PM

#

@young sparrow do you have a complexity theorist you could recommend to help us with this aspect of the paper? I'm muddling through but we really need someone who can easily cut through it all and validate our claims and (not yet rewritten) proofs

misty igloo Feb 28, 2025, 6:08 PM

#

gusty condor Yes, <@701460149134688386> tested on the parity experiment, RWKV-7 can grok whil...

btw we found that with a slightly larger allowable range on decay, the two-row swap permutation matrix would be possible to achieve even with c=1

#

was jellyfish's parity experiment done with c=2?

gusty condor Feb 28, 2025, 6:08 PM

#

Yes, c=2

misty igloo Feb 28, 2025, 6:09 PM

#

(due to normalization, you can use decay instead of c to achieve similar things)

young sparrow Feb 28, 2025, 6:10 PM

#

@misty igloo Will Merrill is the expert on this topic. Let me ping him and set up an introduction.

misty igloo Feb 28, 2025, 6:11 PM

#

young sparrow <@1007072846960410685> Will Merrill is the expert on this topic. Let me ping him...

thanks, that would be great!

#

Yes I've been looking through his papers lately

#

I'm a little worried that I'm at too low a level of comprehension of this stuff to be the right one to discuss with him, but I do have a well informed broad view of what we're trying to achieve somewhat generally, and the mechanisms involved

gusty condor Feb 28, 2025, 6:24 PM

#

COLM 2025

OpenReview submission site opens: February 27, 2025
Abstract deadline: March 20, 2025
Full paper submission deadline: March 27, 2025
Rebuttal period: May 27 to June 10, 2025
Decision notifications: July 7, 2025
Conference dates: October 7-10, 2025

misty igloo Feb 28, 2025, 6:44 PM

#

nor, are you somewhat well versed in complexity enough to help us with the paper? it sounds like you might be

#

If so, let's chat

sinful breach Feb 28, 2025, 6:58 PM

#

misty igloo nor, are you somewhat well versed in complexity enough to help us with the paper...

Have you asked Riccardo by any chance? I think very few people might be versed in this

misty igloo Feb 28, 2025, 7:56 PM

#

sinful breach Have you asked Riccardo by any chance? I think very few people might be versed i...

I don't know him, but he's in the FLA discord right?

sinful breach Feb 28, 2025, 7:59 PM

#

yes

obsidian quest Mar 1, 2025, 12:03 AM

#

obsidian quest ```0.1 - bsz 240 lr_init 6e-4 0.4 - bsz 240 lr_init 5e-4 => bsz 480 lr_init 6e-...

final loss and # nodes

misty igloo Mar 1, 2025, 1:36 AM

#

obsidian quest final loss and # nodes

what GPU? H800?

steady ether Mar 1, 2025, 2:35 AM

#

I found this set of synthetic tasks which seems relevant (https://arxiv.org/pdf/2403.17844). I ran a few of them and v7 is performing quite well. Here's an early plot (only ~10% complete but looking promising)

Also, someone should try the scaling experiments too but that looks like it will cost $$$$

steady ether Mar 1, 2025, 2:57 AM

#

Not sure about settings if anyone can check:

https://github.com/guangyusong/mad-lab/commit/7daaf1f0b143ea21a07f7aa042d7736d114459b1

obsidian quest Mar 1, 2025, 3:02 AM

#

misty igloo what GPU? H800?

yes

misty cedar Mar 1, 2025, 3:36 AM

#

steady ether I found this set of synthetic tasks which seems relevant (https://arxiv.org/pdf/...

That compression accuracy lmao

#

Need to tag ffmpeg guy

steady ether Mar 1, 2025, 3:43 AM

#

It's only like 10% done so we haven't gotten to the hard stuff yet

misty cedar Mar 1, 2025, 3:48 AM

#

Compression here being a repeat after me?

steady ether Mar 1, 2025, 4:01 AM

#

It's encoding a sequence into a token and then decoding it

gusty condor Mar 1, 2025, 4:54 AM

#

steady ether It's encoding a sequence into a token and then decoding it

You can use transferable LR (like what I did) to save time.

misty cedar Mar 1, 2025, 5:58 AM

#

steady ether It's encoding a sequence into a token and then decoding it

Like state tune overfitting but for a single embedding?

gusty condor Mar 1, 2025, 7:05 AM

#

steady ether It's encoding a sequence into a token and then decoding it

Sentence Autoencoder?

misty igloo Mar 1, 2025, 7:39 AM

#

spiral minnow Do you need any additional hands on paper writing/editing?

Thanks! I think we're good for right now since things are probably going to move around a bunch still in order to cram everything into the usual 9 page limit. But once that happens we might need some help massaging it all together so that it flows well. I'll reach out if and when we do!

#

And feedback is always welcome!

#

The writing process has been fairly organized so far, which has been great. I'd like to keep it that way and have editing proceed in an organized manner, with people mostly just adding sections, or working together directly on a section. We may need a wider edit beyond that and what I can provide at some point in the near future, I just want to avoid the 'the whole paper gets rewritten every day' thing that happened towards the end last year.

steady ether Mar 1, 2025, 8:18 AM

#

misty cedar Like state tune overfitting but for a single embedding?

That's a good way to put it.

steady ether Mar 1, 2025, 8:20 AM

#

gusty condor Sentence Autoencoder?

Right, but with random tokens

gusty condor Mar 1, 2025, 8:32 AM

#

I think the arxiv v1 version can be uploaded in a few hours.

unborn lintel Mar 1, 2025, 12:00 PM

#

misty igloo And feedback is always welcome!

imo figures 3, 8, 9, 13, and 16 should be remade with the same theme and bigger font for clarity and cohesion at some point

gusty condor Mar 1, 2025, 12:05 PM

#

unborn lintel imo figures 3, 8, 9, 13, and 16 should be remade with the same theme and bigger ...

They are from 5 different people

unborn lintel Mar 1, 2025, 12:50 PM

#

If they could share the data, I or someone else can remake the graphs using the same theme

#

currently, it seems like there is a mix of excel and python-generated plots...

gusty condor Mar 1, 2025, 1:06 PM

#

unborn lintel currently, it seems like there is a mix of excel and python-generated plots...

Figure 16? all right...

steady ether Mar 1, 2025, 8:02 PM

#

steady ether I found this set of synthetic tasks which seems relevant (https://arxiv.org/pdf/...

MAD tasks for v7 finished, this looks more reasonable now. Still pretty good.

misty igloo Mar 1, 2025, 8:43 PM

#

gusty condor I think the arxiv v1 version can be uploaded in a few hours.

Why are you in such a rush to upload it today? I don't think it's ready.

#

And I'm not comfortable yet with the exact claims we can make for the complexity class, yet that should be something we claim in the abstract.

#

I'm also not sure we have properly shown the SOTA that we claim. I am working on a FLOPS chart, which will likely show that we have a new pareto frontier here, which would be a desirable claim.

#

Some other notes: I think you need to show Mamba-2 Pile in section K. It's not fair to compare it to the older model only.

bronze frost Mar 1, 2025, 9:29 PM

#

I also don't think the paper will realistically be ready for arxiv today, but we should get it ready as soon as possible

#

We need everyone to fill in the author contributions section, it's barely started.

misty igloo Mar 1, 2025, 9:45 PM

#

@here Yes, in the spirit of getting ready as soon as possible: If you made significant contributions to the paper and want to be listed as an author, please list your name and affiliations at the top in the authors section and begin putting in the details of your contributions into Appendix A: Author Contributions. I don't think we currently left anyone out of the list of authors, but definitely let me know if we did. Please also let me know your email address.

obsidian quest Mar 2, 2025, 2:54 AM

#

yeah need a few more days

obsidian quest Mar 2, 2025, 2:55 AM

#

steady ether MAD tasks for v7 finished, this looks more reasonable now. Still pretty good.

i think our avg will be the best 🙂

steady ether Mar 2, 2025, 2:58 AM

#

obsidian quest i think our avg will be the best 🙂

It is!

fresh mulch Mar 2, 2025, 3:23 AM

#

I have an AudioRWKV experiment in the oven but I doubt it'll be ready for v1, still trying to figure out Goose state tuning. In the meantime, it looks like sec 2 is still a draft, so I might work on that - do I need to ask to make changes?

misty igloo Mar 2, 2025, 3:40 AM

#

fresh mulch I have an AudioRWKV experiment in the oven but I doubt it'll be ready for v1, st...

I was about to go through the whole thing and edit and move stuff as needed, DM me and we can figure out how to collaborate on it!

obsidian quest Mar 2, 2025, 5:16 AM

#

let's call it a generalized FWP (fast weight programmer) RNN to respect Schmidhuber 😂

#

RWKV-7 is a generalized version because i am using deformed keys etc.

gusty condor Mar 2, 2025, 6:25 AM

#

@brisk bronze please use lm-eval 0.4.3 and fp32 to evaluate mamba2

obsidian quest Mar 2, 2025, 6:45 AM

#

steady ether It is!

lets sort rows by avg

quaint quiver Mar 2, 2025, 7:22 AM

#

steady ether It is!

Would recommend adding gated deltanet here to show the advantage of the vector lr and decay

steady ether Mar 2, 2025, 7:26 AM

#

obsidian quest lets sort rows by avg

Done. Will also experiment with inits, this one was just a 'naive' run so we have room for improvement.

steady ether Mar 2, 2025, 7:28 AM

#

quaint quiver Would recommend adding gated deltanet here to show the advantage of the vector l...

Good point. Let's see if they have results we can borrow. All depends on how much time we have

gusty condor Mar 2, 2025, 8:50 AM

#

@brisk bronze You didn't test https://huggingface.co/state-spaces/mamba2-370m

state-spaces/mamba2-370m · Hugging Face

obsidian quest Mar 2, 2025, 9:36 AM

#

steady ether Done. Will also experiment with inits, this one was just a 'naive' run so we hav...

yeah pls use my init (such as https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py line 847-850, 966-967 and you should see much faster convergence)
compress score looks wrong

#

or share your code so i can check

gusty condor Mar 2, 2025, 12:05 PM

#

I think it's more like wrong LR / Adam epsilon

gusty condor Mar 2, 2025, 12:30 PM

#

@obsidian quest Which ZeRO stage is RWKV-7 trained on?
Is RWKV trained without pipeline parallelism?

obsidian quest Mar 2, 2025, 12:58 PM

#

zero2

misty igloo Mar 2, 2025, 7:02 PM

#

quaint quiver Would recommend adding gated deltanet here to show the advantage of the vector l...

@fresh mulch and I are currently doing ablations against all changes from gated deltanet
they just aren't in the manuscript yet
these are the differences we're ablating - please let us know if there are others you think are important to show:

- making the gating (decay) `w` vector-valued instead of scalar
- making the removal key `kk` and the replacement key `k` different from one another
- making the in-context learning rate `a` a vector instead of a scalar
- adding the bonus (last part of the code)

misty igloo Mar 2, 2025, 7:03 PM

#

obsidian quest yeah pls use my init (such as https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-...

@fresh mulch I know we discussed this but plz make sure you're following blinks recommendation here

quaint quiver Mar 2, 2025, 7:13 PM

#

misty igloo <@331583891972423690> and I are currently doing ablations against all changes fr...

Oh nice

#

Would say the parametrization of the decay

#

Gated deltanet uses the mamba way iirc so can compare with the rwkv with bias

misty igloo Mar 2, 2025, 7:50 PM

#

quaint quiver Would say the parametrization of the decay

Do you mean something different than reducing the decay to scalar per head? That's what I meant we are ablating in the first bullet point

quaint quiver Mar 2, 2025, 7:51 PM

#

Ya like the calculation of the decay, mamba uses a specific init and multiplication style which gated deltanet uses too (songlin mentioned this was pretty important)

fresh mulch Mar 2, 2025, 9:36 PM

#

@quaint quiver are you referring to training a gated deltanet for table 7, or using the gated deltanet init for our rwkv7 ablations, or something else?

quaint quiver Mar 2, 2025, 10:32 PM

#

Mainly for table 7 as apparently it was important although could also be done as an ablation

brisk bronze Mar 2, 2025, 11:46 PM

#

gusty condor <@533592838529744917> You didn't test https://huggingface.co/state-spaces/mamba2...

added!

obsidian quest Mar 3, 2025, 4:43 AM

#

more models in https://huggingface.co/spaces/Jellyfish042/UncheatableEval

UncheatableEval - a Hugging Face Space by Jellyfish042

steady ether Mar 3, 2025, 4:45 AM

#

obsidian quest or share your code so i can check

Changing inits did improve to 46. Here's the code:

https://github.com/guangyusong/mad-lab/commit/e23bc62e55645b7f29c8504ac3c2353fb7ddbf1a#diff-ef694b011a1266c4276736e46666ecf46a34718f2da7f2e5713f9a09d2bc400e

obsidian quest Mar 3, 2025, 4:52 AM

#

steady ether Changing inits did improve to 46. Here's the code: https://github.com/guangyuso...

got loss curve comparison? 🙂

steady ether Mar 3, 2025, 4:54 AM

#

obsidian quest got loss curve comparison? 🙂

https://wandb.ai/gpt6/MAD-RWKV7-3/workspace

obsidian quest Mar 3, 2025, 4:58 AM

#

could you explain this 🙂

fresh mulch Mar 3, 2025, 5:08 AM

#

@fringe egret are you waiting on results for reportsumsort/showssort bamboo benchmarks or did every model legitimately just score a 0 on them?

obsidian quest Mar 3, 2025, 5:13 AM

#

pls mention contents in this https://www.rwkv.com/images/RWKV-7.png
and this https://x.com/BlinkDL_AI/status/1861796264620572859/photo/1

BlinkDL (@BlinkDL_AI) on X

More explanation of how RWKV-7 works🙂

#

feel free to change my text


Given two sequences of vectors $\{k_t\}$ and $\{v_t\}$, RWKV-7 will test-time-train an internal model $v \approx k S^\top$ via in-context gradient descent w.r.t. the L2 loss $\mathcal{L} = \frac{1}{2}\Vert\, v - k S^\top\Vert^2$.

The gradient is:
\[\frac{\partial \mathcal{L}}{\partial S} = S k^\top k - v^\top k\]

The gradient descent formula (with dynamic weight decay $w_t$ and learning rate $\eta_t$) is:
\[S_t = S_{t-1} \operatorname{diag}(w_t) - (S_{t-1} k_t^\top k_t - v_t^\top k_t)\operatorname{diag}(\eta_t)\]
which equals:
\[S_t = S_{t-1} \left(\operatorname{diag}(w_t) - k_t^\top k_t\operatorname{diag}(\eta_t)\right) + v_t^\top k_t\operatorname{diag}(\eta_t)\]

In RWKV-7 I use the generalized formula:
\[S_{t} = S_{t-1} (\operatorname{diag}(w_t) + \textbf{a}_t^\top \textbf{b}_t) + \textbf{v}_t^\top \textbf{k}_t\]
where a reasonable choice of initial values is $\textbf{a} = -k$, $\textbf{b} = k\cdot\eta$, $\textbf{v} = v$, $\textbf{k} = k \cdot \eta$.

(update: basically diagonal + rank1 because it's good for parallelization. we can do rankn by adding more terms but it will be slower)

\textbf{RWKV-7 uses $\{k_t, v_t\}$ to test-time-train an internal model and uses $\{r_t\}$ as input for this model.} It overcomes the $\mathsf{TC^0}$ limitation of QKV-softmax-attention transformers (and RWKV-6, Mamba, Mamba-2, xLSTM, GLA, ...), while still being efficiently trainable on GPUs.

Such ideas can be traced back to fast weights (1991) by Jürgen Schmidhuber, the delta rule (1959) by Bernard Widrow, and Hebbian learning (1949) by Donald Hebb. RWKV-7 is a generalized, scalable version with more tricks to make it actually great as an LLM. Details are in my open-source code.

#

Because the internal model is $v \approx k S^\top$, the output for input $r$ is $r S^\top$, and the pseudocode is:
\begin{lstlisting}
    for t in range(T):
        sab = torch.einsum("ik,k,j->ij", state, a[t], b[t])
        state = state * w[t] + sab + torch.einsum("j,i->ij", k[t], v[t])
        out[t] = torch.einsum("j,ij->i", r[t], state)
\end{lstlisting}
\vspace{-8pt}
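The claim that the generalized update reduces to the gradient-descent update when $\textbf{a} = -k$, $\textbf{b} = k\cdot\eta$, $\textbf{k} = k\cdot\eta$ can be checked numerically. The following is a dependency-free sketch with made-up numbers, not the actual training kernel; all function names are hypothetical, and the state is stored as rows indexed by the value dimension, matching the `einsum` pseudocode.

```python
def rwkv7_step(state, w, a, b, v, k):
    """Generalized update: S <- S (diag(w) + a^T b) + v^T k."""
    sa = [sum(s * a_m for s, a_m in zip(row, a)) for row in state]  # S a^T
    return [[row[j] * w[j] + sa[i] * b[j] + v[i] * k[j] for j in range(len(k))]
            for i, row in enumerate(state)]

def gd_step(state, w, eta, v, k):
    """Gradient-descent update: S <- S diag(w) - (S k^T k - v^T k) diag(eta)."""
    sk = [sum(s * k_m for s, k_m in zip(row, k)) for row in state]  # S k^T
    return [[row[j] * w[j] - (sk[i] * k[j] - v[i] * k[j]) * eta[j]
             for j in range(len(k))]
            for i, row in enumerate(state)]

def readout(state, r):
    """Output for input r: out = r S^T."""
    return [sum(r_j * s_ij for r_j, s_ij in zip(r, row)) for row in state]

# Arbitrary illustrative values: value dim 2, key dim 3.
state = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w = [0.9, 0.8, 0.95]      # per-channel decay
eta = [0.1, 0.2, 0.3]     # per-channel in-context learning rate
k = [1.0, -2.0, 0.5]
v = [0.3, -0.7]

a = [-x for x in k]                      # a = -k
b = [x * e for x, e in zip(k, eta)]      # b = k * eta
k_scaled = b                             # replacement key k * eta

general = rwkv7_step(state, w, a, b, v, k_scaled)
reference = gd_step(state, w, eta, v, k)
max_diff = max(abs(g - h) for rg, rh in zip(general, reference)
               for g, h in zip(rg, rh))
out = readout(general, [1.0, 0.0, 0.0])  # picks out column 0 of the state
```

Both updates agree up to floating-point rounding for these values, which matches the algebra in the derivation.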

fresh mulch Mar 3, 2025, 5:17 AM

#

this should go into section 3

obsidian quest Mar 3, 2025, 5:22 AM

#

use log scale Y-axis for (a) RMS of RWKV state entries @gusty condor

misty igloo Mar 3, 2025, 5:25 AM

#

fresh mulch <@1214039733773148213> are you waiting on results for reportsumsort/showssort ba...

@fringe egret we were very clear that we don't want bamboo featured in the paper - I'm sorry if that was somehow not communicated properly, but we had a whole public discussion of it here in this channel after you asked and before you added it, and I removed the old one from the paper already

obsidian quest Mar 3, 2025, 5:25 AM

#

Table 3 4 9 17 19, put stronger models on top

obsidian quest Mar 3, 2025, 5:26 AM

#

misty igloo <@1214039733773148213> we were very clear that we don't want bamboo featured in ...

oh why

misty igloo Mar 3, 2025, 5:26 AM

#

obsidian quest oh why

Baber thinks it's not well suited for base models

misty igloo Mar 3, 2025, 5:28 AM

#

nova frost They mostly have instruct benchmarks in their paper and the tasks are structured...

From this discussion

obsidian quest Mar 3, 2025, 5:31 AM

#

however our bamboo results are good so we can show them 🙂

nova frost Mar 3, 2025, 5:35 AM

#

fringe egret Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is d...

was mostly going by this

misty igloo Mar 3, 2025, 6:07 AM

#

nova frost was mostly going by this

yes that is what was in the eagle/finch paper

misty igloo Mar 3, 2025, 6:09 AM

#

obsidian quest however our bamboo results are good so we can show them 🙂

good compared to what? the current results don't even include relevant recent models like mamba 2 or Qwen2.5 or Llama 3.2

#

anyway it really doesn't matter to me whether or not we include bamboo, but if we show bamboo it has to include those models.

misty igloo Mar 3, 2025, 6:13 AM

#

misty igloo sounds like Baber thinks this specific benchmark won't be valuable on base model...

and it's weird that it got added again with no further interaction after the previous discussion about it ended with this comment

obsidian quest Mar 3, 2025, 6:41 AM

#

misty igloo good compared to what? the current results don't even include relevant recent mo...

lets test their base models

misty igloo Mar 3, 2025, 6:43 AM

#

obsidian quest lets test their base models

okay, @fringe egret please go ahead and test those base models if you'd like to include this benchmark in the paper

fringe egret Mar 3, 2025, 6:54 AM

#

Okay, I'll try testing it on other base models.

obsidian quest Mar 3, 2025, 6:58 AM

#

figure 3 15 16, please test pg19 (not proofpile) @gusty condor

#

proofpile likely has bad information density. and test rwkv WORLD models

gusty condor Mar 3, 2025, 7:11 AM

#

Yes, I tested World models.

gusty condor Mar 3, 2025, 7:21 AM

#

obsidian quest figure 3 15 16, please test pg19 (not proofpile) <@803473343705514025>

Figure 3 from @iron parrot

obsidian quest Mar 3, 2025, 7:24 AM

#

gusty condor Yes, I tested World models.

ok please test pg19 for figure 15 16

obsidian quest Mar 3, 2025, 7:31 AM

#

steady ether Changing inits did improve to 46. Here's the code: https://github.com/guangyuso...

cant find init code

steady ether Mar 3, 2025, 7:33 AM

#

obsidian quest could you explain this 🙂

I just made a fix

obsidian quest Mar 3, 2025, 7:36 AM

#

oh why 0.8+ but you mentioned 46%

misty igloo Mar 3, 2025, 7:44 AM

#

iron parrot I ran some loss tests on PG19 with different models. Surprisingly, the loss does...

@obsidian quest @gusty condor this was what happened when @iron parrot ran it on PG19

steady ether Mar 3, 2025, 7:45 AM

#

obsidian quest oh why 0.8+ but you mentioned 46%

We group results by config parameters, find the best accuracy run, and average the best-per-config accuracies.

misty igloo Mar 3, 2025, 7:46 AM

#

This was jellyfish's PG19 result - we could use that in figure 3 if you prefer it over proofpile...

gusty condor Mar 3, 2025, 7:46 AM

#

steady ether We group results by config parameters, find the best accuracy run, and average t...

Can you elaborate?

obsidian quest Mar 3, 2025, 7:50 AM

#

misty igloo <@870137517020688415> <@803473343705514025> this was what happened when <@701460...

it's very strange because loss 1024-2047 should definitely be lower than 0-1023
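A minimal sketch of the per-range comparison being discussed, assuming per-token losses for one document are already available as a list. The function name and the 1024-token bucket size are illustrative.

```python
def mean_loss_by_bucket(token_losses, bucket=1024):
    """Average per-token loss over consecutive position ranges.

    Index 0 covers tokens 0..bucket-1, index 1 covers bucket..2*bucket-1, etc.
    A trailing partial bucket is averaged over the tokens it contains.
    """
    return [sum(chunk) / len(chunk)
            for chunk in (token_losses[i:i + bucket]
                          for i in range(0, len(token_losses), bucket))]

# Synthetic losses in which later tokens are easier, as one would expect
# when the model can use more context.
losses = [2.0] * 1024 + [1.0] * 1024
per_bucket = mean_loss_by_bucket(losses)
```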

#

probably code bug

#

please test other models too

misty igloo Mar 3, 2025, 7:50 AM

#

obsidian quest it's very strange because loss 1024-2047 should definitely be lower than 0-1023

In the last paper I skipped the first 2048 tokens of PG19, I think because a lot of it was weird formatting stuff

obsidian quest Mar 3, 2025, 7:50 AM

#

ok then we should pick the "middle 16384 token" for each data item

#

if the length = X, pick token X*0.5-8192 to X*0.5+8192

misty igloo Mar 3, 2025, 7:51 AM

#

theyre books so the beginning is often some fairly standard preamble, contents etc

obsidian quest Mar 3, 2025, 7:51 AM

#

obsidian quest if the length = X, pick token ```X*0.5-8192 to X*0.5+8192```

yeah then we should do this. and only for data items with length > 32k
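A sketch of that selection rule, assuming each data item is already tokenized into a list. The function name is hypothetical, and "32k" is interpreted here as 32768 tokens, which is an assumption.

```python
def middle_window(tokens, window=16384, min_len=32768):
    """Return the centered `window` tokens of an item, or None if it is too short.

    For length X this keeps tokens X*0.5 - 8192 up to (not including) X*0.5 + 8192.
    Items with length <= min_len are skipped.
    """
    n = len(tokens)
    if n <= min_len:
        return None
    start = n // 2 - window // 2
    return tokens[start:start + window]

# Example with a dummy 40000-token item whose tokens are their own positions.
item = list(range(40000))
selected = middle_window(item)
```

Centering the window avoids the book preamble at the start and any trailing boilerplate at the end.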

steady ether Mar 3, 2025, 7:53 AM

#

gusty condor Can you elaborate?

How I understood it is:

Runs are grouped by config, ignoring learning rate and weight decay. Within each group we take the highest-scoring run, and then we take a simple average of those best scores.

There's also some info on page 7 and 11: https://arxiv.org/pdf/2403.17844
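Here is a rough sketch of that aggregation as described in this thread, not the MAD reference implementation. The run format, the key names `lr` and `weight_decay`, and the example numbers are all hypothetical.

```python
from collections import defaultdict

def mad_score(runs, swept=("lr", "weight_decay")):
    """Average of best-per-config accuracies.

    Runs are grouped by every config entry except the swept hyperparameters;
    each group contributes its best accuracy, and the groups are averaged.
    """
    best = defaultdict(float)  # accuracies are assumed to be non-negative
    for run in runs:
        key = tuple(sorted((name, value) for name, value in run["config"].items()
                           if name not in swept))
        best[key] = max(best[key], run["accuracy"])
    return sum(best.values()) / len(best)

# Made-up runs: two task configs, each swept over two learning rates.
runs = [
    {"config": {"task": "compression", "seq_len": 32, "lr": 1e-3, "weight_decay": 0.0}, "accuracy": 0.40},
    {"config": {"task": "compression", "seq_len": 32, "lr": 1e-4, "weight_decay": 0.1}, "accuracy": 0.50},
    {"config": {"task": "recall", "seq_len": 128, "lr": 1e-3, "weight_decay": 0.0}, "accuracy": 0.90},
    {"config": {"task": "recall", "seq_len": 128, "lr": 1e-4, "weight_decay": 0.1}, "accuracy": 0.85},
]
score = mad_score(runs)
```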

sonic relic Mar 3, 2025, 8:02 AM

#

fringe egret Okay, I'll try testing it on other base models.

Hey, I talked with Ruichong, probably you could use some of my help for long context benchmark?

fringe egret Mar 3, 2025, 8:41 AM

#

sonic relic Hey, I talked with Ruichong, probably you could use some of my help for long con...

Oh, thank you so much! I'd appreciate your help with the benchmark. We can discuss the details when you are free.

iron parrot Mar 3, 2025, 9:16 AM

#

obsidian quest if the length = X, pick token ```X*0.5-8192 to X*0.5+8192```

I'll try this first, then see how Mamba performs. I think there's something off about the data distribution in the PG19 dataset.

#

RWKV-7 world seems to be a bit 'overfitted' to 4k ctx: when the context exceeds 4k tokens, the ppl increases.

obsidian quest Mar 3, 2025, 9:21 AM

#

misty igloo In the last paper I skipped the first 2048 of PG19 I think because it was all we...

@iron parrot its due to this reason

iron parrot Mar 3, 2025, 11:22 AM

#

Here's how RWKV World performs on PG19 (16k ctx) after removing the first 2048 tokens

#

After 16k, RWKV-7's ppl increases

#

#

That's the 'overfitting to 4k' phenomenon I mentioned before

young sparrow Mar 3, 2025, 11:33 AM

#

iron parrot Here's how RWKV World performs on PG19 (16k ctx) after removing the first 2048 t...

Don't label these with the range, that makes it very hard to read.

If you're going to report average loss, you should not use a line plot. A histogram would be more appropriate

iron parrot Mar 3, 2025, 11:36 AM

#

young sparrow Don't labels these with the range, that makes it very hard to read. If you're g...

Okay, I will change it to a histogram

obsidian quest Mar 3, 2025, 1:02 PM

#

pls verify Mamba2-1.3B glue 46.1 in table 17. check its individual components vs rwkv7

obsidian quest Mar 3, 2025, 1:37 PM

#

iron parrot Here's how RWKV World performs on PG19 (16k ctx) after removing the first 2048 t...

looks reasonable

obsidian quest Mar 3, 2025, 1:37 PM

#

iron parrot

still much better than mamba & transformers

obsidian quest Mar 3, 2025, 1:37 PM

#

iron parrot Here's how RWKV World performs on PG19 (16k ctx) after removing the first 2048 t...

just show this range, up to 16k

gusty condor Mar 3, 2025, 1:59 PM

#

obsidian quest pls verify Mamba2-1.3B glue 46.1 in table 17. check its individual components vs...

@brisk bronze

misty igloo Mar 3, 2025, 3:44 PM

#

@iron parrot we have our context-length extended version of 2.9B available at https://huggingface.co/SmerkyG/RWKV7-2.9B-World3-128k-250225 - seems like you might want to try that one too for PG19

SmerkyG/RWKV7-2.9B-World3-128k-250225 · Hugging Face

misty igloo Mar 3, 2025, 4:11 PM

#

we found that it was quite a bit better than the base model for NIAH as context length grew, so hopefully it should be for PG19 loss as well

young sparrow Mar 3, 2025, 4:36 PM

#

obsidian quest just show this range, up to 16k

This is really bad. You can't deliberately crop the graph at the point performance starts to degrade. I don't see any reason to do this other than to mislead the reader.

obsidian quest Mar 4, 2025, 2:30 AM

#

young sparrow This is really bad. You can't deliberately crop the graph at the point performan...

ok 🙂 then we should compare with other models

obsidian quest Mar 4, 2025, 3:16 AM

#

highlight all RWKV model names in Table 5

pls search for my ID here to see all suggestions

iron parrot Mar 4, 2025, 5:37 AM

#

So far, my test results on the proof pile and PG19 show:
All pile models (v4 v5 v6 v7) show decreasing loss as sequence length increases.
For world models (trained on way more tokens), the behavior varies: v4's loss explodes with longer sequences, v5 and v6's loss decreases then stabilizes, while v7's loss slightly increases after about 16k tokens (long-context fine-tuning can fix this).
This is why I call it some kind of "overfitting", more training actually hurts generalization. The severity ranking is: v4 > v7 > v6 = v5.
I think we should show separate loss charts for pile and world models, discuss this issue, and include comparisons with other models.
@obsidian quest @young sparrow @misty igloo

obsidian quest Mar 4, 2025, 5:41 AM

#

iron parrot So far, my test results on the proof pile and PG19 show: All pile models (v4 v5 ...

ok we can show these. still much better than transformers

iron parrot Mar 4, 2025, 5:42 AM

#

misty igloo <@701460149134688386> we have our context-length extended version of 2.9B availa...

Here are the PG19 test results, the fine-tuned version performs much better after 16k tokens

gusty condor Mar 4, 2025, 5:46 AM

#

Maybe it's time for us to shrink the state size once V7 has better state utilization.

#

https://arxiv.org/pdf/2410.07145 Based on some of this paper's findings

obsidian quest Mar 4, 2025, 6:20 AM

#

when you do ctxlen extension, use LONG data, for much better results

#

such as https://github.com/Lyun0912-wu/LongAttn

GitHub

GitHub - Lyun0912-wu/LongAttn: LongAttn ：Selecting Long-context Tra...

LongAttn ：Selecting Long-context Training Data via Token-level Attention - Lyun0912-wu/LongAttn

keen tartan Mar 4, 2025, 10:06 AM

#

RWKV-7 World 0.1B and 0.4B LM Evaluation Harness Benchmarks

#

#

English focus

#

#

Multilang focus

gusty condor Mar 4, 2025, 10:24 AM

#

keen tartan

MMLU 0-shot or 5-shot?

keen tartan Mar 4, 2025, 10:25 AM

#

0-shot.

gusty condor Mar 4, 2025, 10:26 AM

#

I tested 5-shot (in order to match Qwen's performance in the technical report)

keen tartan Mar 4, 2025, 10:27 AM

#

gusty condor I tested 5-shot (in order to match Qwen's performance in the technical report)

I see. Yes, Qwen models are strong on MMLU.

#

I'll check.

gusty condor Mar 4, 2025, 10:28 AM

#

Usually 0-shot is 1-2% worse than 5-shot

misty igloo Mar 4, 2025, 2:25 PM

#

iron parrot Here are the PG19 test results, the fine-tuned version performs much better afte...

That's a great result! Glad the fine-tuned version helped!

misty igloo Mar 4, 2025, 2:34 PM

#

obsidian quest when you do ctxlen extension, use LONG data, for much better results

@hushed orchid just bringing this to your attention from Blink for reference on future ctxlen extension attempts

misty igloo Mar 4, 2025, 2:35 PM

#

iron parrot So far, my test results on the proof pile and PG19 show: All pile models (v4 v5 ...

I agree, let's show both and discuss the difference. Interesting that the World models become overfit on a specific length.

#

Transformers, too, (including Qwen) are typically post-trained to increase context length so while this isn't exactly a win for us it's interestingly comparable.

iron parrot Mar 4, 2025, 3:54 PM

#

a bit noisy since pg19 test set only has 100 samples

keen tartan Mar 4, 2025, 4:13 PM

#

MMLU has been recomputed with 5-shot.

gusty condor Mar 4, 2025, 4:52 PM

#

#

@obsidian quest This is on PG19

gusty condor Mar 4, 2025, 5:04 PM

#

keen tartan MMLU has been recomputed with 5-shot.

I think that very few people would be interested in eval results of such small models. But you can add them in the paper.

misty igloo Mar 4, 2025, 5:05 PM

#

gusty condor I think that very few people would be interested in eval results of such small m...

actually we may need these for @brisk bronze and my upcoming FLOPS vs acc plot

#

she was running them too but I think she had some tech issues

gusty condor Mar 4, 2025, 5:07 PM

#

Should figures 4, 5, 6, 7 be unified into one large figure?

#

Also please include v6 results too

misty igloo Mar 4, 2025, 5:10 PM

#

@gusty condor the pawsx number for 2.9B looks incorrect to me... could you check that it was entered correctly?

gusty condor Mar 4, 2025, 5:14 PM

#

By the way, I think these two pixels (as seen in other states) are used to pin the GroupNorm, preventing it from drifting. Now that v7 has O(1) state size, may we remove that GroupNorm?
Or, I think we can use GroupRMSNorm for that. Then we need only one value to pin an RMSNorm, instead of two values to pin a LayerNorm.
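
One way to read the "one value vs two values" argument in the message above is through the invariances of each norm. The sketch below is only an illustration under stated assumptions: it uses plain per-vector LayerNorm and RMSNorm without learned scale or bias, not the actual RWKV-7 GroupNorm code, and the helper names are made up. LayerNorm ignores both a shift and a rescaling of its input (two free quantities), while RMSNorm ignores only a rescaling (one free quantity).

```python
import math

def layer_norm(x, eps=0.0):
    # Subtract the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=0.0):
    # Divide by the root mean square only; the mean is not removed.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def close(a, b, tol=1e-9):
    return all(abs(p - q) <= tol for p, q in zip(a, b))

x = [0.5, -1.0, 2.0, 3.5]          # arbitrary example vector
shifted = [v + 10.0 for v in x]    # same vector with a constant offset
scaled = [v * 3.0 for v in x]      # same vector with a constant gain

ln_shift_invariant = close(layer_norm(x), layer_norm(shifted))
ln_scale_invariant = close(layer_norm(x), layer_norm(scaled))
rms_shift_invariant = close(rms_norm(x), rms_norm(shifted))
rms_scale_invariant = close(rms_norm(x), rms_norm(scaled))
```

Whether this is the mechanism behind the two pinned state entries is a hypothesis from the chat, not something the sketch verifies.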

gusty condor Mar 4, 2025, 5:16 PM

#

misty igloo @gusty condor the pawsx number for 2.9B looks incorrect to me... could y...

I checked pawsx, and it looked correct.

misty igloo Mar 4, 2025, 5:17 PM

#

gusty condor I checked pawsx, and it looked correct.

okay, weird that it got so low?

#

like the bigger model and more training made it a lot worse than it was previously

gusty condor Mar 4, 2025, 5:20 PM

#

There are some inverse scaling problems

young sparrow Mar 4, 2025, 5:34 PM

#

misty igloo actually we may need these for @brisk bronze and my upcoming FLOPS vs ac...

I was going to recommend this. I'm having trouble remembering which papers have plots I like, but something like this? People have also been shading regions & drawing Pareto frontier lines, which seems like a good idea
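
A minimal sketch of how the Pareto frontier for a FLOPs vs accuracy plot could be computed, assuming lower training FLOPs and higher average accuracy are both better. The function name and the model entries are hypothetical placeholders, not numbers from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, flops, acc) points.

    A point is dominated if some other point has no more FLOPs
    and strictly higher accuracy.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by FLOPs ascending; on ties, visit the higher accuracy first.
    for name, flops, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:
            frontier.append((name, flops, acc))
            best_acc = acc
    return frontier

# Made-up example values for illustration only.
models = [
    ("model-a", 1.0e20, 45.0),
    ("model-b", 2.0e20, 44.0),  # dominated by model-a
    ("model-c", 4.0e20, 52.0),
    ("model-d", 8.0e20, 51.0),  # dominated by model-c
    ("model-e", 9.0e20, 58.0),
]
frontier_names = [name for name, _, _ in pareto_frontier(models)]
```

The frontier points can then be joined with a line on the plot, with the region below it shaded.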

young sparrow Mar 4, 2025, 5:34 PM

#

gusty condor I think that very few people would be interested in eval results of such small m...

I disagree actually. Powerful small models are popular.

misty igloo Mar 4, 2025, 5:35 PM

#

young sparrow I was going to recommend this. I'm having trouble remembering which papers have ...

yes, we are working on it now (thanks! this was also your very helpful suggestion last year and I think it's a great plot to have)

gusty condor Mar 4, 2025, 5:35 PM

#

Most of these models can't get a nontrivial score on MMLU

misty igloo Mar 4, 2025, 5:48 PM

#

still working on it, but interesting initial fit lines

#

(somehow excel is being annoying but the unlabeled ones are the other goose world3 models)

young sparrow Mar 4, 2025, 5:52 PM

#

gusty condor Most of these models can't get a nontrivial score on MMLU

So let's plot something more interesting than MMLU score 🙂

obsidian quest Mar 4, 2025, 11:56 PM

#

keen tartan MMLU has been recomputed with 5-shot.

pls add 1.5b 2.9b too

obsidian quest Mar 4, 2025, 11:57 PM

#

misty igloo @gusty condor the pawsx number for 2.9B looks incorrect to me... could y...

pawsx is noisy because it's using a format unseen in usual training data, which can be seen from other models' numbers (llama 3b << llama 1b). i suggest removing it

misty igloo Mar 5, 2025, 12:21 AM

#

obsidian quest pawsx is noisy because it's using a format unseen in usual training data, which ...

maybe multi shot would help smooth out that issue across models

obsidian quest Mar 5, 2025, 12:23 AM

#

yeah can try

nova frost Mar 5, 2025, 12:42 AM

#

There have also been some formatting issues identified with paws-x:
https://github.com/EleutherAI/lm-evaluation-harness/issues/2442

#

I can add a PR with the fixes if anyone wants to try it

nova frost Mar 5, 2025, 1:06 AM

#

PR [here](https://github.com/EleutherAI/lm-evaluation-harness/pull/2759). mostly fixed some spacing and capitalization issues for the european tasks

brisk bronze Mar 5, 2025, 1:47 AM

#

@gusty condor I ran glue for rwkv7-1.47b-pile in table 17 and the average of the subscores is like 8 percentage points higher than the glue score it computes, and rwkv7-421M-pile is 12 percentage points off. the glue score it computes is same as in paper
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-1.47B-pile/results_2025-03-05T00-33-06.json
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-1.47B-pile/results_2025-03-05T01-44-07.json

#

lambada.o was exactly the same tho

#

mamba2 on the other hand, subscores of glue and computed glue score are only different by like 0.5 percentage points, probably due to rounding. they were run with lm-eval-0.4.3 and fp32 too

obsidian quest Mar 5, 2025, 2:13 AM

#

brisk bronze @gusty condor I ran glue for rwkv7-1.47b-pile in table 17 and the averag...

thats strange 😂 so the real GLUE of rwkv7 should be 8/12 percent higher?

brisk bronze Mar 5, 2025, 2:28 AM

#

obsidian quest thats strange 😂 so the real GLUE of rwkv7 should be 8/12 percent higher?

yeah basically, although 1.47b has a lower glue score than 421m pile 🤨 (48.0 for 1.47b and 50.3 for 421m)

gusty condor Mar 5, 2025, 4:11 AM

#

brisk bronze @gusty condor I ran glue for rwkv7-1.47b-pile in table 17 and the averag...

It's weighted average, I think

brisk bronze Mar 5, 2025, 4:11 AM

#

gusty condor It's weighted average, I think

do you happen to know the weights

#

fwiw rwkv7 1.47b had higher scores on subtasks except for 1 or 2 iirc on glue

gusty condor Mar 5, 2025, 4:12 AM

#

I think: like each single problem is given equal weight, rather than each task

tropic minnow Mar 5, 2025, 6:44 AM

#

i think the indices for equation 8 are wrong

#

u_{t,j} should be only u_t and be a scalar, not of head dimension, as it is an inner product

#

and if we follow dirac notation (bra-ket) for the inner product, there should be a , or a | between r and diag(rho)

#

but using dirac notation and einstein notation in the same equation is a bit confusing imo

#

so: dirac notation: "add , and remove j subindex" or einstein: "remove <> and diag() "

#

votes?

#

either way, the in R^{D/h} should be in R so im making that change already

#

thoughts for changing Fresh for novel / new / recent ?

tropic minnow Mar 5, 2025, 7:11 AM

#

i think we can fuse the gating in 4.1.4 into equation 11, similar to equation 10

#

in the Pseudocode For RWKV-7 (appendix G) i would separate the weight projections from the time recurrent operation for clarity

#

happy to take this task

gusty condor Mar 5, 2025, 7:20 AM

#

tropic minnow thoughts for changing `Fresh` for `novel / new / recent` ?

Recent ✅
Novel ❌
Novel has a different meaning

tropic minnow Mar 5, 2025, 7:20 AM

#

ok recent

tropic minnow Mar 5, 2025, 8:26 AM

#

github repo needs to be updated

#

to include v7 code from blinks repo

#

i think part of Appendix C (until theorem) can be moved to methods and the proof can be kept in the appendix

gusty condor Mar 5, 2025, 8:41 AM

#

tropic minnow i think part of Appendix C (until theorem) can be moved to `methods` and the proo...

I originally wrote them in the methods and Smerky moved them into the appendix.

tropic minnow Mar 5, 2025, 8:41 AM

#

@misty igloo @gusty condor if we have a bit more time (say 20 hours) i would like to include a more theoretical motivation and comparison of rwkv7 vs rwkv6 vs other linear RNNs. I think this can give the paper a more theoretical grounding rather than the empirical vibe "we mixed 30 things and the result is cool ~sota"

tropic minnow Mar 5, 2025, 8:42 AM

#

tropic minnow i think part of Appendix C (until theorem) can be moved to `methods` and the proo...

same for appendix D

gusty condor Mar 5, 2025, 8:43 AM

#

tropic minnow @misty igloo @gusty condor if we have a bit more time (say 20h...

No problem. Do you mean this?

tropic minnow Mar 5, 2025, 8:45 AM

#

gusty condor No problem. Do you mean this?

yes similar. i think in point 3 we can motivate the decisions well, including a table similar to that

gusty condor Mar 5, 2025, 8:46 AM

#

I tried deriving that and found the online objective being overly complex.

tropic minnow Mar 5, 2025, 8:47 AM

#

in fact, in that table longhorn's claimed squared associative objective is wrong, as their simplification for practical considerations effectively makes it an inner product objective

#

yes @gusty condor bc rwkv7 is explicit gradient descent, and those objectives are derived from implicit gradient descent algorithm

#

so a proper explanation is what i want to include

tropic minnow Mar 5, 2025, 9:07 AM

#

i added a citation for the OG scaling laws on lstm paper by baidu on 2017 in the introduction

obsidian quest Mar 5, 2025, 2:21 PM

#

seems nan is unrelated to adam eps #992362252269256815 message

misty igloo Mar 5, 2025, 3:27 PM

#

nova frost PR [here](https://github.com/EleutherAI/lm-evaluation-harness/pull/2759). mostly...

They're using lm eval version 0.4.3 - are there other relevant fixes that have occurred since then?

nova frost Mar 5, 2025, 3:32 PM

#

yeah. a major one was this
https://github.com/EleutherAI/lm-evaluation-harness/pull/2434

#

sounded reasonable to me. paws-x has a quite non-standard format

misty igloo Mar 5, 2025, 3:33 PM

#

tropic minnow i think we can fuse the gating in 4.1.4 into equation 11; similar as equation 10

Okay, I made that change. The real question is which style of readout we prefer. I think the 'alternate' one is a lot easier to understand without all the subscripts (and as you noted, some of the subscripts were incorrect and it was too hard to notice that)

misty igloo Mar 5, 2025, 3:35 PM

#

nova frost yeah. a major one was this https://github.com/EleutherAI/lm-evaluation-harness/p...

seems like pawsx is a mess - I think we should either rerun it with the latest updates or drop it as an eval

misty igloo Mar 5, 2025, 3:36 PM

#

obsidian quest seems nan is unrelated to adam eps https://discord.com/channels/9923596289795687...

Is this part of the paper correct? I had added it:

Despite the general stability of our loss curves, our use of such an extremely low AdamW $\epsilon$ value did sometimes cause NaN loss across a single training step. When this occurs, we rewind the training to the prior checkpoint, clear optimizer states, and continue from that point.

obsidian quest Mar 5, 2025, 3:37 PM

#

misty igloo Is this part of the paper correct? I had added it: > Despite the general stabili...

i think this is probably related to adam eps. so further investigation is required

misty igloo Mar 5, 2025, 3:39 PM

#

obsidian quest i think this is probably related to adam eps. so further investigation is requir...

okay, changed it to:

Despite the general stability of our loss curves, we did sometimes observe NaN loss across a single training step, which we theorize may be due to our use of such an extremely low AdamW $\epsilon$. When this occurs, we rewind the training to the prior checkpoint, clear optimizer states, and continue from that point.
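
The recovery procedure described in that paragraph (rewind to the prior checkpoint, clear optimizer states, continue) can be sketched as a toy loop. This is a minimal illustration with a hypothetical state layout and step function, not the actual RWKV training code. The chat does not say whether a real run also changes the data order after a rewind, so the toy simply replays.

```python
import copy
import math

def train_with_rewind(state, step_fn, num_steps, checkpoint_every):
    """Run step_fn until num_steps good steps are done.

    state: dict with "weights", "optimizer" and "step" keys (hypothetical layout).
    step_fn(state) -> loss; it updates state in place.
    On a NaN loss, restore the last checkpoint and clear the optimizer state.
    """
    checkpoint = copy.deepcopy(state)
    rewinds = 0
    while state["step"] < num_steps:
        loss = step_fn(state)
        if math.isnan(loss):
            state = copy.deepcopy(checkpoint)  # rewind weights and step counter
            state["optimizer"] = {}            # drop optimizer moments
            rewinds += 1
            continue
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)
    return state, rewinds

calls = {"n": 0}

def toy_step(state):
    # Toy update that produces a NaN loss on exactly one call.
    calls["n"] += 1
    if calls["n"] == 5:
        state["weights"] = float("nan")  # simulate corrupted weights
        return float("nan")
    state["weights"] -= 0.1
    state["optimizer"]["momentum"] = 0.1
    return state["weights"] ** 2

final_state, rewinds = train_with_rewind(
    {"weights": 1.0, "optimizer": {}, "step": 0},
    toy_step,
    num_steps=6,
    checkpoint_every=2,
)
```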

misty igloo Mar 5, 2025, 3:41 PM

#

gusty condor I originally wrote them in the methods and Smerky moved them into the appendix.

At the time, this and some other changes got the paper into the 9 page limit for COLM submission. Since then a lot has been added and we are way over that limit again. I have been waiting to see the full set of experiments before moving more things into the appendix.

gusty condor Mar 5, 2025, 3:43 PM

#

We have no limit for arxiv, so we should submit to arxiv asap.

misty igloo Mar 5, 2025, 3:44 PM

#

Yes, though I think we should be purposeful about which experiments are shown and in which order in the main section, so that it is most impactful for the reader.

gusty condor Mar 5, 2025, 3:45 PM

#

misty igloo They're using lm eval version 0.4.3 - are there other relevant fixes that have o...

The newest version of lm-eval won't show the total score of GLUE, which is bad

keen tartan Mar 5, 2025, 3:46 PM

#

gusty condor The newest version of lm-eval won't show the total score of GLUE, which is bad

We could just add the total score of all the subtasks ourselves from the results, right?

gusty condor Mar 5, 2025, 3:47 PM

#

Not really. We need the formula, otherwise it may be inconsistent

keen tartan Mar 5, 2025, 3:47 PM

#

gusty condor Not really. We need the formula, otherwise it may be inconsistent

Oh, I see. Perhaps we extract the formula from the previous version of lm_eval codebase where it was present (v 0.4.3)

misty igloo Mar 5, 2025, 3:50 PM

#

from the docs in v 0.4.3:
weight_by_size: bool = True whether to perform micro- averaging (True) or macro- (False) averaging of subtasks' accuracy scores when reporting the group's metric.

#

```
class AggMetricConfig(dict):
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    # list of filter names which should be incorporated into the aggregated metric.
    filter_list: Optional[Union[str, list]] = "none"
```

#

notice it defaults to False

#

@nova frost I only see group set for each task in this 0.4.3 version of GLUE so would it end up getting non-weightbysize?

gusty condor Mar 5, 2025, 3:53 PM

#

tropic minnow @misty igloo @gusty condor if we have a bit more time (say 20h...

btw, we should try our best to have this paper submitted to arxiv by this time tomorrow.
We have promised somewhere on RWKV.cn that the RWKV-7 paper will be available in "Early March" (in Chinese: 3月上旬, before March 10th). We should submit early in case the paper goes "on hold" for several days.

misty igloo Mar 5, 2025, 3:54 PM

#

gusty condor btw, we should try our best to have this paper submitted to arxiv by this time t...

Who made this promise?

#

And why are we only learning of it one day before you say there is a deadline?

#

This is really not okay.

nova frost Mar 5, 2025, 3:58 PM

#

misty igloo @nova frost I only see group set for each task in this 0.4.3 version o...

yeah. we added micro-averaging mostly to deal with MMLU. the default is simple mean of the subtask (same) metrics

fresh mulch Mar 5, 2025, 4:06 PM

#

@obsidian quest btw, i've been running ablations on some of the design choices (appendix K.2 at the moment) and find that using the same removal/replacement (k, kk) keys has competitive performance with current baseline Goose. For instance it gets higher acc on minipile validation. What kind of difference have you seen here in your experiments, or is there an intuitive reason why to do it?

gusty condor Mar 5, 2025, 4:06 PM

#

misty igloo Who made this promise?

https://mp.weixin.qq.com/s/nOKrvIsDQSKXKX5V2nILcw

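WeChat Official Accounts Platform

The RWKV-7 paper is coming soon, the G1 series of reasoning models is in training! Overseas community releases a 72B model

The latest and most complete RWKV-7 news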

#

Not me, but we should be as quick as possible

obsidian quest Mar 5, 2025, 4:07 PM

#

gusty condor https://mp.weixin.qq.com/s/nOKrvIsDQSKXKX5V2nILcw

no need to hurry. quality is important

obsidian quest Mar 5, 2025, 4:08 PM

#

fresh mulch @obsidian quest btw, i've been running ablations on some of the design cho...

i have run extended tests and i will provide more loss data. just too busy at the moment 😂

fresh mulch Mar 5, 2025, 4:09 PM

#

no problem, thanks! just curious, it's also the least intrusive ablation i tested

obsidian quest Mar 5, 2025, 4:10 PM

#

obsidian quest pls verify Mamba2-1.3B glue 46.1 in table 17. check its individual components vs...

pls check this

obsidian quest Mar 5, 2025, 4:10 PM

#

steady ether Changing inits did improve to 46. Here's the code: https://github.com/guangyuso...

pls update paper 🙂

gusty condor Mar 5, 2025, 4:13 PM

#

RWKV-7-1.47B-pile

#

📎 v7_1b4.txt

#

@brisk bronze

brisk bronze Mar 5, 2025, 4:21 PM

#

gusty condor

subscores are the same as what I got yeah

gusty condor Mar 5, 2025, 4:22 PM

#

Post your subscore please

brisk bronze Mar 5, 2025, 4:22 PM

#

https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-pile/results_2025-03-05T00-33-06.json

gusty condor Mar 5, 2025, 4:22 PM

#

And mamba-2's subscore

misty igloo Mar 5, 2025, 4:22 PM

#

fringe egret Okay, I'll try testing it on other base models.

hey just checking in, did you end up running bamboo on the other models we need to include it in the paper?
Also, I'm a little confused about the author contributions section - what is the "Compilation of the RWKV World 3.X Corpus?" I am the one who put together the listing, is there some dataset you put on huggingface or something?

brisk bronze Mar 5, 2025, 4:23 PM

#

gusty condor And mamba-2's subscore

here's mamba2
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/state-spaces__mamba2-1.3b/results_2025-03-02T06-56-13.319450.json

gusty condor Mar 5, 2025, 4:24 PM

#

misty igloo hey just checking in, did you end up running bamboo on the other models we need ...

I will explain. He sent BlinkDL some data and contributed to World-3.5 corpus this way.

misty igloo Mar 5, 2025, 4:25 PM

#

gusty condor I will explain. He sent BlinkDL some data and contributed to World-3.5 corpus th...

I see. The World-3.5 Corpus is not featured in this paper though.

#

And neither are the chat examples.

#

And we need more recent models like mamba 2 if we are going to include the bamboo results.

young sparrow Mar 5, 2025, 4:35 PM

#

tropic minnow @misty igloo @gusty condor if we have a bit more time (say 20h...

I think this is a very good idea and it's not clear to me why there is such a rush. This is an artificial deadline right? Having deadlines to motivate work is good, but releasing a worse paper than we could do a few days later due to them is bad.

#

After continuing to read the messages it seems like most people are on the same page as the above. Also, I don't think anyone's going to get too upset if we do hypothetically miss a deadline promised in a blog post by a few days. That said, this is why EleutherAI has a standing policy of not committing to release dates ahead of time.

gusty condor Mar 5, 2025, 4:57 PM

#

I apologize for pushing too hard on an artificial deadline earlier. I am aware that deadlines can be motivating but the urgency can harm cooperation. Thank you again for your patience.

#

Moving forward, I will put paper quality first, and avoid imposing too much burden on others.

misty igloo Mar 5, 2025, 5:01 PM

#

Thank you - just know that we all appreciate the immense amount of hard work you're putting into this paper!

brisk bronze Mar 5, 2025, 8:26 PM

#

looks like lm-eval 0.4.3 was using weighted average for glue by number of problems so the glue score it outputs is correct.
from api/metrics.py:

```
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # A helper function that is used to aggregate
    # subtask scores cross-task.
    # TODO: does not hold for non-mean aggregations
    if not weight_by_size:
        sizes = [1] * len(sizes)

    assert len(metrics) == len(sizes)

    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
```

validation split sizes:
```
cola: 1043
mnli_matched: 9815
mnli_mismatched: 9832
mrpc: 408
qnli: 5463
qqp: 40430
rte: 277
sst2: 872
stsb: 1500
wnli: 71
```

#

(also explains why mamba2-1.3b was higher on glue even though its subtasks scores don't look super different at first glance)
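
To make the difference between the two averaging modes concrete, here is a small sketch using the validation split sizes listed above. The `aggregate` helper is a simplified stand-in for the size-weighted mean, and the per-task scores are made up for illustration (they are not the RWKV-7 or Mamba2 results).

```python
SPLIT_SIZES = {
    "cola": 1043, "mnli_matched": 9815, "mnli_mismatched": 9832,
    "mrpc": 408, "qnli": 5463, "qqp": 40430, "rte": 277,
    "sst2": 872, "stsb": 1500, "wnli": 71,
}

def aggregate(scores, sizes, weight_by_size):
    # weight_by_size=True  -> micro average (every example counts equally)
    # weight_by_size=False -> macro average (every task counts equally)
    tasks = list(scores)
    weights = [sizes[t] if weight_by_size else 1 for t in tasks]
    return sum(scores[t] * w for t, w in zip(tasks, weights)) / sum(weights)

# Hypothetical scores: 0.60 on every task except a weak 0.40 on QQP.
scores = {task: 0.60 for task in SPLIT_SIZES}
scores["qqp"] = 0.40

micro = aggregate(scores, SPLIT_SIZES, weight_by_size=True)
macro = aggregate(scores, SPLIT_SIZES, weight_by_size=False)
qqp_share = SPLIT_SIZES["qqp"] / sum(SPLIT_SIZES.values())
```

With all ten splits included, QQP alone carries roughly 58% of the weight in the micro average, so a single weak QQP score pulls the aggregate well below the macro average. The exact share depends on which subtasks the group actually aggregates.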

tropic minnow Mar 5, 2025, 8:43 PM

#

misty igloo Okay, I made that change. The real question is which style of readout we prefer....

Agree, we dont use einsum anywhere else

misty igloo Mar 5, 2025, 10:41 PM

#

brisk bronze looks like lm-eval 0.4.3 was using weighted average for glue by number of proble...

after looking into what GLUE is made of, I think we should remove it from the paper
many of the sub-tasks contribute only a percent to the total so the weightings make no sense, causing the numbers reported to be more like 75% QQP than any kind of actual average
and QQP is a pretty weird benchmark, which should probably be run multi-shot to really work well
since we're not going to do that, let's just remove GLUE entirely

I also want to remove paws-x from the paper:
paws-x was broken in v0.4.3, see lm-eval https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
and baber has NEW fixes, that aren't even in the most recent lm-eval: https://github.com/EleutherAI/lm-evaluation-harness/pull/2759
Seems to me that it's too messed up and should be removed.

#

Later versions of lm-eval don't even spit out an aggregate score for GLUE, and our aggregate score doesn't even include all the subtasks. The other evals have been more stable across lm-eval versions, which will help future authors compare to our results. These two benchmarks are simply too wild and messy.

keen tartan Mar 5, 2025, 10:50 PM

#

misty igloo Later versions of lm-eval don't even spit out an aggregate score for GLUE, and o...

These are all fair points that you are raising.

#

QQP is the Quora Question Pairs paraphrase subtask.

#

The task is whether two sentences are semantically equivalent.

misty igloo Mar 5, 2025, 10:53 PM

#

keen tartan QQP is Quora Question Pair Paraphrase subtask.

Yeah, it basically asks 'Do these two questions have the same meaning' Yes/No

keen tartan Mar 5, 2025, 10:55 PM

#

For paws-x we can right now use the pawsxx branch https://github.com/EleutherAI/lm-evaluation-harness/tree/pawsxx

#

But we should really think well about which tasks to show.

#

We can compute them anyway and consider to use them or not. Any suggestions for substitute tasks?

misty igloo Mar 5, 2025, 10:57 PM

#

I don't think we need a substitute.

keen tartan Mar 5, 2025, 10:58 PM

#

Yeah, we could also just drop them. True. Less hassle in the end.

obsidian quest Mar 6, 2025, 12:42 AM

#

brisk bronze looks like lm-eval 0.4.3 was using weighted average for glue by number of proble...

this is a wrong choice... should use avg of different tasks

gusty condor Mar 6, 2025, 2:16 AM

#

keen tartan We can compute them anyway and consider to use them or not. Any suggestions for ...

Cherry picking harms academic integrity!

gusty condor Mar 6, 2025, 3:42 AM

#

I suggest blimp (but the scores will be really high)

misty igloo Mar 6, 2025, 4:09 AM

#

gusty condor Cherry picking harms academic integrity!

Yeah I wouldn't have wanted to drop either of them bc of that concern, but both evals just seem like a mess in general

#

and tbh dropping glue seems to harm us a bit on the flops vs acc chart, so at least its not really in our favor anyway

obsidian quest Mar 6, 2025, 4:25 AM

#

gusty condor I suggest blimp (but the scores will be really high)

blimp is too simple for LLMs

#

glue and superglue are all noisy

obsidian quest Mar 6, 2025, 4:26 AM

#

gusty condor Cherry picking harms academic integrity!

dropping badly designed datasets (shown to be bad for llama3 too) is reasonable

dawn pewter Mar 6, 2025, 5:54 AM

#

@gusty condor What happens if the value of c is large (e.g., equal to the wkv matrix dimension)? I found that if c gets this big, it might be possible to simulate (reverse) the Boolean transition matrix with a single step transition

gusty condor Mar 6, 2025, 7:07 AM

#

dawn pewter @gusty condor What happens if the value of c is large (e.g., equal to th...

The range of WKV will explode

keen tartan Mar 6, 2025, 5:24 PM

#

RWKV-7 World v3 corpus as itemized and annotated list on HF:
https://huggingface.co/datasets/hevok/Goose-World-v3

hevok/Goose-World-v3 · Datasets at Hugging Face

keen tartan Mar 6, 2025, 5:53 PM

#

Alternatively the corpus as a HF Collection: https://huggingface.co/collections/hevok/rwkv-world-v3-corpus-67be08105ff513c71632e9dd

RWKV World v3 Corpus - a hevok Collection

#

Additionally, here is a collection of RWKV-7 related resources: https://huggingface.co/collections/hevok/rwkv-7-goose-67c9dd2154d811c24a093f0c

RWKV-7 Goose - a hevok Collection

misty igloo Mar 6, 2025, 9:05 PM

#

@gusty condor @keen tartan is Table 13 correct? Not sure where this breakdown comes from...

keen tartan Mar 6, 2025, 9:14 PM

#

misty igloo @gusty condor @keen tartan is Table 13 correct? Not sure where ...

If I recall correctly @obsidian quest provided it.

misty igloo Mar 6, 2025, 9:15 PM

#

obsidian quest world-3.0 ```book 337.2 science+wiki 222.7 math 32.3 law&gov 19.0 fiction 192.6 ...

here, yeah

misty igloo Mar 6, 2025, 9:15 PM

#

obsidian quest ok could someone please combine v2 + v2.1 + v3 items and arrange them to approxi...

but @keen tartan weren't you working to accomplish this item above?

#

so it is in fact not correct (yet)

keen tartan Mar 6, 2025, 9:16 PM

#

I categorized all datasets with these classes.

#

I intend to automatically classify all individual datasets.

#

Like world languages, artificial or natural, and categories.

misty igloo Mar 6, 2025, 9:17 PM

#

okay, did you co-ordinate the results with Blink? if not, please do so we can get the updated table of categories into the paper

keen tartan Mar 6, 2025, 9:17 PM

#

I will do so.

misty igloo Mar 6, 2025, 9:18 PM

#

thanks!

#

(I'm just going through making sure we have everything right and aren't missing things that need updates)

keen tartan Mar 6, 2025, 9:18 PM

#

misty igloo (I'm just going through making sure we have everything right and aren't missing ...

Understandable.

misty igloo Mar 6, 2025, 11:12 PM

#

@obsidian quest do you happen to have a checkpoint for the Pile models at 300B tokens rather than 332B?

obsidian quest Mar 7, 2025, 3:15 AM

#

misty igloo @obsidian quest do you happen to have a checkpoint for the Pile models at ...

i always train full pile 332B, and other models are probably doing this too and say 300B for simplicity

misty igloo Mar 7, 2025, 3:18 AM

#

obsidian quest i always train full pile 332B, and other models are probably doing this too and ...

Unfortunately, it seems that Mamba and possibly others followed Eleuther's Pythia approach which was to limit training to 300B of the Pile. It appears that it is not just rounding.

#

I was surprised by this.

#

My understanding is it was so that people could compare directly to Pythia.

gusty condor Mar 7, 2025, 3:55 AM

#

misty igloo Unfortunately, it seems that Mamba and possibly others followed Eleuther's Pythi...

What, never heard of it

misty igloo Mar 7, 2025, 3:56 AM

#

gusty condor What, never heard of it

Take a close look at the Mamba 2 paper - and Stella confirmed that Pythia used only 300B instead of the full pile.

gusty condor Mar 7, 2025, 4:08 AM

#

No! Do we have to retrain these models?

misty igloo Mar 7, 2025, 4:14 AM

#

Doesn't sound feasible to me 😦

#

I'm not sure what to do, but it's fine in our FLOPs vs Acc plot since it's adjusted for training length

#

latest version of that, not final tho - I still have some work to do on deciding the exact flops counts

gusty condor Mar 7, 2025, 4:21 AM

#

misty igloo latest version of that, not final tho - I still have some work to do on deciding...

Mamba uses tied word embeddings but RWKV does not.

misty igloo Mar 7, 2025, 4:21 AM

#

gusty condor Mamba uses tied word embeddings but RWKV does not.

yeah, there are definitely differences in the models that make it hard to compare fairly

#

and counting flops is kind of only a vaguely correct metric in general - it doesn't directly dictate how fast GPUs run the model

gusty condor Mar 7, 2025, 4:23 AM

#

Your chart is not accurate, you should subtract the embeddings of RWKV

misty igloo Mar 7, 2025, 4:24 AM

#

gusty condor Your chart is not accurate, you should subtract the embeddings of RWKV

yes like I said above, I still have some work to do on deciding the exact flops counts

#

that may push them apart slightly

#

(we also did not include GLUE in this average because I think its a broken eval)

gusty condor Mar 7, 2025, 4:25 AM

#

I can help you with that

misty igloo Mar 7, 2025, 4:26 AM

#

gusty condor I can help you with that

thanks, that would be great!

#

it's not super clear exactly which FLOPs formulas we should use... the attention mechanisms add a small amount, esp because Mamba does the 2x expansion thing
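
As a starting point for the FLOPs discussion above, here is a rough sketch using the common rule of thumb of about 6 FLOPs per matmul parameter per training token. This is an assumption for illustration, not the formula settled on for the paper: it ignores the extra cost of the recurrent or attention mechanism, and the model sizes below are hypothetical. It treats an untied input embedding table as a pure lookup (excluded), while a tied matrix is kept because the output head still multiplies by it.

```python
def matmul_params(total_params, vocab_size, d_model, tied_embeddings):
    # Untied: the input embedding table is only a lookup, so exclude it.
    # Tied: the shared matrix is still used by the output head, so keep it.
    lookup_only = 0 if tied_embeddings else vocab_size * d_model
    return total_params - lookup_only

def approx_training_flops(params, tokens):
    # Rule of thumb: ~2 FLOPs forward + ~4 FLOPs backward per parameter per token.
    return 6 * params * tokens

# Hypothetical model size, for illustration only.
total, vocab, d_model = 1_500_000_000, 50_000, 2_048

untied = matmul_params(total, vocab, d_model, tied_embeddings=False)
tied = matmul_params(total, vocab, d_model, tied_embeddings=True)

flops_332b = approx_training_flops(untied, 332_000_000_000)
flops_300b = approx_training_flops(untied, 300_000_000_000)
ratio = flops_332b / flops_300b
```

Under this estimate, training on 332B instead of 300B tokens costs about 10.7% more FLOPs for the same model, independent of how the embeddings are counted.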

gusty condor Mar 7, 2025, 4:28 AM

#

Can you send me the source

obsidian quest Mar 7, 2025, 5:37 AM

#

misty igloo latest version of that, not final tho - I still have some work to do on deciding...

this will make rwkv look bad because rwkv7 avg eval @ 90% trained is almost same as 100%

#

models trained using different amt of data cant be compared like this

#

models trained with smaller amt of data (such as mamba2) will appear far better than those with more data (such as qwen llama), because if we want optimal loss vs flops we need to follow scaling laws, which no one follows in practice for obvious reasons

dawn pewter Mar 7, 2025, 10:04 AM

#

I think for readers unfamiliar with RWKV architecture, the "Blocks" label within the L Blocks notation in this diagram might cause confusion. Specifically, there's ambiguity about whether "Blocks" refers to the entire module or a specific component (like the Time Mix unit) within it. To enhance clarity, perhaps relocating the "L" designation outside the block representation would create a more intuitive visual hierarchy.

gusty condor Mar 7, 2025, 10:15 AM

#

I see

gusty condor Mar 7, 2025, 10:31 AM

#

@dawn pewter

whole ember Mar 7, 2025, 1:38 PM

#

WorldRWKV: https://github.com/JL-er/WorldRWKV/tree/main This demonstrates RWKV7's strong comprehension ability, capable of accepting any modality and performing excellently on benchmarks. Can this be included in the RWKV7 paper?

misty igloo Mar 7, 2025, 2:26 PM

#

whole ember WorldRWKV: https://github.com/JL-er/WorldRWKV/tree/main This demonstrates RWKV7...

We already feature VisualRWKV in the paper, which does Image QA and gets higher results, and I don't think we should feature two of these. But would you like to add Audio QA to the paper as a new multimodal experiments subsection?

#

If so, it's probably important to do experiments comparing the results with some other architecture. (could be RWKV-6, but even better if it's something else)
If you don't have time for this now, it could still be possible for us to add it in a future version of the paper if it's complete before we submit to COLM. COLM deadlines are March 20th for abstract, March 27th for paper.

gusty condor Mar 7, 2025, 2:33 PM

#

@iron parrot Please adjust figure 3 and 4 so that the colors of v7, v6, v5, v4, mamba, mamba-2, v7-128k are consistent across two images

young sparrow Mar 7, 2025, 2:34 PM

#

obsidian quest i always train full pile 332B, and other models are probably doing this too and ...

I told you this when we wrote the original RWKV paper and then again when we wrote the second.

misty igloo Mar 7, 2025, 2:36 PM

#

young sparrow I told you this when we wrote the original RWKV paper and then again when we wro...

To clarify, you mean that these other models train an actual 300B tokens, correct? Not that they just write 300B but train 332B.

#

(I know you said that pythia does actual 300B)

young sparrow Mar 7, 2025, 2:36 PM

#

From the Pythia paper

gusty condor Mar 7, 2025, 2:37 PM

#

I think yes, but I think very few people are aware of that. I bet no reviewer will raise questions on this specific point.

young sparrow Mar 7, 2025, 2:38 PM

#

misty igloo To clarify, you mean that these other models train an actual 300B tokens, correc...

Yes

misty igloo Mar 7, 2025, 2:39 PM

#

gusty condor I think yes, but I think very few people are aware of that. I bet no reviewer wi...

Well, the goal is to do everything correctly to the best of our ability. Not to fool reviewers.

#

And to show correct scientific results that do not contain known errors.

#

We can show anything we have, as long as we point out the distinctions.

gusty condor Mar 7, 2025, 2:41 PM

#

All RWKV models there are trained with 332B Pile, so comparisons are still valid

iron parrot Mar 7, 2025, 2:55 PM

#

gusty condor @iron parrot Please adjust figure 3 and 4 so that the colors of v7, v6,...

OK, done.

misty igloo Mar 7, 2025, 3:01 PM

#

gusty condor All RWKV models there are trained with 332B Pile, so comparisons are still valid

For now, I adjusted the Pile ablations table and discussion to remove Pythia, Mamba, and Mamba2

#

If someone has the resources and we want to compare to those others in table format we could train just RWKV7-1.47B on 300B tokens of Pile. But imho the comparison this way is kind of pointless because they all use somewhat different parameter counts. Probably mainly due to differences in weight tying, at least in Mamba's case.

young sparrow Mar 7, 2025, 3:17 PM

#

Depending on what the claims we want to make are, I don't see a huge issue in using models with slightly different parameter counts and slightly different training token counts

keen tartan Mar 7, 2025, 3:20 PM

#

misty igloo If someone has the resources and we want to compare to those others in table for...

I could attempt doing it if it is seen as useful. Have some spare compute.

#

Need to estimate the requirements.

#

Do we have The Pile binidx files somewhere already?

misty igloo Mar 7, 2025, 3:28 PM

#

young sparrow Depending on what the claims we want to make are, I don't see a huge issue in us...

I added a tokens column, in case we want to do that. Not quite sure what claims we could make though, if any.

#

The reality is that these RWKV-7 models have both more parameters and roughly 10% more training tokens than the ones they are being compared to.

gusty condor Mar 7, 2025, 3:39 PM

#

keen tartan Do we have The Pile binidx files somewhere already?

Yes, and ask Blink for his compute

obsidian quest Mar 7, 2025, 5:14 PM

#

https://huggingface.co/BlinkDL/rwkv7-g1/blob/main/rwkv7-g1-0.1b-20250307-ctx4096.pth

announcement a bit later

misty igloo Mar 7, 2025, 5:54 PM

#

misty igloo after looking into what GLUE is made of, I think we should remove it from the pa...

are there any objections - if not I will go ahead and remove these two broken and/or messed up benchmarks

misty igloo Mar 8, 2025, 12:57 AM

#

@keen tartan I'm seeing open-web-math, algebraic-stack (both of which point to proof-pile-2) and FLAN got added to the v3 dataset listing and citations - do you know why these were added? Afaict they were not in my original v3 list, based on what Blink originally sent me

misty igloo Mar 8, 2025, 1:35 AM

#

From looking at the document history, it appears you added them to the dataset listing on 26th February, 1:26 pm ET

gusty condor Mar 8, 2025, 4:06 AM

#

misty igloo are there any objections - if not I will go ahead and remove these two broken an...

Don't remove glue. Pawsx can be removed

gusty condor Mar 8, 2025, 4:29 AM

#

You can't only pick a benchmark when it's advantageous for you.

misty igloo Mar 8, 2025, 4:43 AM

#

I'm definitely not trying to do that - and I agree that's bad

#

But let's not use glue in any future paper, because it's not well constructed

#

I think we can add it to the flops chart (I have no idea if the result will benefit or harm rwkv there) by applying the weighting formula manually

#

@brisk bronze please take a look at how we can do that if you get a chance to

obsidian quest Mar 8, 2025, 5:09 AM

#

use avg for glue, not weighted by number of items in each subset @misty igloo @gusty condor

#

because weighted by items makes no sense here

whole ember Mar 8, 2025, 6:20 AM

#

misty igloo We already feature VisualRWKV in the paper, which does Image QA and gets higher ...

WorldRWKV has a stronger visual QA benchmark, but there are currently no machine experiments—updates will be made later. The audio QA benchmark has already reached SOTA and does not need to be compared with RWKV6. I believe WorldRWKV should appear as a whole rather than being split up, as this is meant to demonstrate RWKV7's ability to understand any modality.

#

I will provide a stronger benchmark, and you can update it in the subsequent RWKV7 paper.

keen tartan Mar 8, 2025, 9:43 AM

#

misty igloo <@371036620008194048> I'm seeing open-web-math, algebraic-stack (both of which p...

Also ccnews. I made an initial itemized list of the World v3 corpus in an Excel sheet, as was suggested, and Blink added the missing datasets. Based on this I added them to the paper as well. I understood that the original objective was to identify missing datasets. We also found that the DeepMind Mathematics dataset dm_math was part of World v2 as a constituent of The Pile, but it was not mentioned in the Eagle & Finch paper. I tried to mention it as footnote (b) to the table about World v2.1, but perhaps there is a better place for it.

acoustic knoll Mar 8, 2025, 10:28 AM

#

keen tartan Do we have The Pile binidx files somewhere already?

Just curious, how do you normally convert a dataset to binidx files? I have a Rust script https://github.com/cahya-wirawan/json2bin that converts the 825GB Pile dataset in about 40 minutes (on an M2 Mac mini) instead of 45 hours with the Python script

keen tartan Mar 8, 2025, 10:42 AM

#

acoustic knoll Just curious how do you normally convert a dataset to binidx files? I have a rus...

I use your json2bin implementation already. It is blazing fast. Thank you so much for making it!

#

For The Pile Comparison experiment we need to segment text with the GPTNeoX Tokenizer. I have been using the Rust implementation only with the World tokenizer. How would you go about specifying a different tokenizer?

#

I think we need to change the library to support other tokenizers, if I am not mistaken. I see `let tokenizer = rwkv_tokenizer::WorldTokenizer::new(None).unwrap();` is hardcoded right now. This matters in particular for supporting other modalities too.

#

By the way, I did identify a preprocessed version of The Pile with the GPTNeoX tokenizer on HuggingFace as a dataset: https://huggingface.co/datasets/RichardErkhov/RWKV-LM_pile_binidx_dataset

RichardErkhov/RWKV-LM_pile_binidx_dataset · Datasets at Hugging Face

#

From my tests it seems to be correct as we can decode the original text from it. It is however split across many binidx files rather than a single pair of files.

#

Added SmolLM2-1.7B

#

Removed PAWS-X and added SmolLM2-1.7B too.

#

@brisk bronze @gusty condor We need to share the lm_eval results files to calculate MMLU with either the weighted or non-weighted average ourselves.

acoustic knoll Mar 8, 2025, 11:23 AM

#

keen tartan I think we need change the library to support other tokenizers if I am not mista...

I implemented only the World tokenizer because I thought we would not use the old tokenizer anymore :) I will look into how to add other tokenizers too

gusty condor Mar 8, 2025, 12:15 PM

#

whole ember WorldRWKV has a stronger visual QA benchmark, but there are currently no machine...

WorldRWKV can be written in a separate paper.

keen tartan Mar 8, 2025, 2:13 PM

#

Overleaf in dark/night mode. Finally eye strain reduced!

misty igloo Mar 8, 2025, 2:38 PM

#

obsidian quest because weighted by items makes no sense here

I agree that item-weighted makes no sense, because that way 75% of it is one subset and many others are 1% or less. We are also excluding parts of it that do not have accuracy as a result.

#

This weighting is a result of a mistake in the old version of lm eval harness used.
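
The difference between the two averaging schemes discussed here can be sketched in a few lines. This is only an illustration: the subset names, accuracies, and item counts below are hypothetical placeholders (not real GLUE numbers), chosen so that one subset holds 75% of the items, as in the situation described above.

```python
def macro_average(scores):
    """Unweighted mean: every subset counts equally."""
    return sum(acc for acc, _ in scores.values()) / len(scores)


def item_weighted_average(scores):
    """Mean weighted by item count: large subsets dominate the result."""
    total_items = sum(n for _, n in scores.values())
    return sum(acc * n for acc, n in scores.values()) / total_items


# Hypothetical (accuracy, item count) pairs, NOT real GLUE results.
scores = {
    "big_subset": (0.50, 7500),     # 75% of all items
    "medium_subset": (0.80, 2400),
    "tiny_subset": (0.90, 100),     # 1% of all items
}

print(f"macro average:         {macro_average(scores):.3f}")
print(f"item-weighted average: {item_weighted_average(scores):.3f}")
```

With these placeholder values the item-weighted score sits close to the accuracy of the largest subset, while the unweighted mean treats all three subsets the same.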

misty igloo Mar 8, 2025, 2:40 PM

#

keen tartan Also [ccnews](https://huggingface.co/datasets/stanford-oval/ccnews). I made an i...

Okay, that's great! Thank you for doing that and verifying that Blink made the additions.

#

I updated the links to proof pile to point to the proper subdirectories

gusty condor Mar 8, 2025, 2:58 PM

#

keen tartan <@533592838529744917> <@803473343705514025> We need to share the `lm_eval resul...

https://github.com/Triang-jyed-driung/myevals

GitHub

GitHub - Triang-jyed-driung/myevals: my evaluations

my evaluations. Contribute to Triang-jyed-driung/myevals development by creating an account on GitHub.

gusty condor Mar 8, 2025, 2:59 PM

#

obsidian quest use avg for glue, not weighted by number of items in each subset <@1007072846960...

Why use the average for GLUE but not for MMLU?

misty igloo Mar 8, 2025, 2:59 PM

#

gusty condor https://github.com/Triang-jyed-driung/myevals

last time I checked mmlu was missing here - do you have those results as well?

gusty condor Mar 8, 2025, 3:00 PM

#

Probably overwritten?

#

Oh, I found them

#

https://github.com/Triang-jyed-driung/myevals/blob/main/RWKV-x070-World-2.9B-v3-20250211-ctx4096_mmlu.json

#

0.1B and 0.4B tested by @keen tartan

keen tartan Mar 8, 2025, 3:07 PM

#

gusty condor 0.1B and 0.4B tested by <@371036620008194048>

I put evals results on HF. Just need to organize them. On it right now.

quaint ingot Mar 8, 2025, 3:10 PM

#

keen tartan Overleaf in dark/night mode. Finally eye strain reduced!

I like this figure, It's a very intuitive way of showing the equation

gusty condor Mar 8, 2025, 3:13 PM

#

quaint ingot I like this figure, It's a very intuitive way of showing the equation

I painted it

keen tartan Mar 8, 2025, 3:14 PM

#

gusty condor I painted it

It is beautiful. I like its simplicity and color composition. It is a piece of art!

#

RWKV 0.1B and 0.4B models evals: https://huggingface.co/spaces/hevok/evals/tree/main/lm_eval

hevok/evals at main

#

I ran 0-shot and 5-shot for MMLU separately. Sometimes I ran evals separately per task to keep things better organized, which is why there are multiple files per model. I organize each eval set in a folder per model.

#

I add the other reference models evals too.

gusty condor Mar 8, 2025, 3:17 PM

#

gusty condor Why use the average for GLUE but not for MMLU?

I think we should keep the eval as-is unless we have a very good reason. Making slight modifications to those evals will arouse the attention of reviewers, putting us at risk of rejection.

gusty condor Mar 8, 2025, 3:50 PM

#

We should provide convenience for reviewers to verify our results. i.e., using the default averaging method of the evaluation framework without making any tweaks to the results. This simplifies the reproducibility process and avoids potential accusations of "tweaking" the results to favor our model.

misty igloo Mar 8, 2025, 3:51 PM

#

I never requested that we change the weighting (Blink did tho), I requested that we drop the eval entirely because it uses a bad weighting in that specific lm eval version

#

in newer lm eval versions it does not print an average at all for glue

#

glue also includes other components which do not contain accuracy at all, and these are not reflected in the accuracy score

#

This is probably originally my fault for including glue in the prior paper without checking it thoroughly beforehand

gusty condor Mar 8, 2025, 4:15 PM

#

We can still include it anyway, because we are already averaging over 9 benchmarks.

unborn lintel Mar 8, 2025, 7:01 PM

#

keen tartan Added SmolLM2-1.7B

updated bolding for smollm2-1.7b on arcC column to reflect better score

brisk bronze Mar 8, 2025, 8:52 PM

#

keen tartan <@533592838529744917> <@803473343705514025> We need to share the `lm_eval resul...

https://github.com/jannalulu/lm-evaluation-harness/tree/main/results-0.4.3

keen tartan Mar 8, 2025, 9:18 PM

#

Who did the evaluations for Llama3.2 1B/3B and Qwen2.5 1.5B/3B?

keen tartan Mar 8, 2025, 9:48 PM

#

@misty igloo I like Figure 3: FLOPs vs. Average Accuracy. I think the title is redundant with the figure caption.
Better label the axes with average accuracy and log scale compute in TFLOPs instead of a title at the top.

#

Perhaps express the accuracy in %

#

I also suggest trying to mitigate the overplotting of labels on each other for Mamba and RWKV7-Pile. I know plotting software often makes it hard to position them apart.

#

I think the point labels could be shorter just indicating the size of the model as the architecture/dataset is specified in the legend and encoded in color already, e.g. 0.1B, 0.4B, 1.5B, 2.9B, etc.

#

Should we add transformers to the PG19 long range context loss plots?

misty igloo Mar 8, 2025, 10:10 PM

#

keen tartan <@1007072846960410685> I like `Figure 3: FLOPs vs. Average Accuracy`. I think th...

updated, now including size-weighted GLUE results like the rest of the paper contains

#

actually, im going to increase the text size...

keen tartan Mar 8, 2025, 10:12 PM

#

Yes, please. There is plenty of space.

misty igloo Mar 8, 2025, 10:12 PM

#

keen tartan Yes, please. There is plenty of space.

hehe there isn't much sadly bc the labels will overlap

keen tartan Mar 8, 2025, 10:13 PM

#

Make the labels shorter.

#

Just the size of the model please.

#

Put the legend in bottom right or top left.

#

Make the axes text a bit bigger, and the axes labels too.

misty igloo Mar 8, 2025, 10:14 PM

#

hm I dislike it with just the size, but I understand your reasoning

keen tartan Mar 8, 2025, 10:15 PM

#

I see.

#

In my humble opinion this plot is very important. We should make it shine.

misty igloo Mar 8, 2025, 10:17 PM

#

working on it...

keen tartan Mar 8, 2025, 10:18 PM

#

Please put the % in parentheses after the label, i.e. "Average Accuracy (%)", and remove it from the y-axis numbers.

misty igloo Mar 8, 2025, 10:25 PM

#

updated

keen tartan Mar 8, 2025, 10:26 PM

#

Ohh!

#

Already so much better!

misty igloo Mar 8, 2025, 10:26 PM

#

argh llama

#

lol

keen tartan Mar 8, 2025, 10:26 PM

#

Thank you. Llama got lost

#

Can someone please look for some lost Llamas?

#

^^

misty igloo Mar 8, 2025, 10:27 PM

#

my formula was accidentally off before, and this revealed that mistake so its a happy accident that it got lost

keen tartan Mar 8, 2025, 10:28 PM

#

Also for x-axis use 10²... representation with superscript.
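
One way the superscript tick labels could be produced, independent of the plotting tool, is a small formatter that turns exact powers of ten into Unicode superscript strings. This is a sketch under assumptions: the chat does not say which plotting library is used, and `power_of_ten_label` is a hypothetical helper name. If the plot is made with Matplotlib, a function like this could be wrapped in `matplotlib.ticker.FuncFormatter`, which calls it with the tick value and position.

```python
import math

# Translation table from ASCII digits/minus to Unicode superscripts.
_SUPERSCRIPTS = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")


def power_of_ten_label(value, pos=None):
    """Format exact powers of ten as '10' plus a superscript exponent.

    `pos` is unused; it only mirrors the (value, position) signature that
    tick-formatter callbacks typically receive.
    """
    if value <= 0:
        return str(value)  # log10 is undefined here, fall back to plain text
    exponent = round(math.log10(value))
    if not math.isclose(value, 10.0 ** exponent, rel_tol=1e-9):
        return f"{value:g}"  # not a power of ten, keep the plain number
    return "10" + str(exponent).translate(_SUPERSCRIPTS)


for tick in (100, 1000, 10000):
    print(power_of_ten_label(tick))
```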

misty igloo Mar 8, 2025, 10:31 PM

#

not sure I can do that

#

this is yet another case where GLUE is messing up something

#

this time llama is gonna look horrible as a result

keen tartan Mar 8, 2025, 10:31 PM

#

Hmm

misty igloo Mar 8, 2025, 10:32 PM

#

I really hate this benchmark, at least the way we're using it (which is terrible imho)

#

I might remove llama entirely because I think it's a completely unfair representation of it

#

it literally scores worse than its own 1B, that was DISTILLED from the same model, on GLUE the way we calculate it

keen tartan Mar 8, 2025, 10:32 PM

#

We can iterate over it until it is correct.

#

Let me look into it.

#

How do you calculate the compute complexity? I mean estimate.

misty igloo Mar 8, 2025, 10:33 PM

#

it took a bunch of work

keen tartan Mar 8, 2025, 10:34 PM

#

I can imagine.

misty igloo Mar 8, 2025, 10:34 PM

#

the basic formula is 6 x params x tokens
but there are variations in the models that matter

#

like some use tied embeddings, and embedding doesnt really take flops (it's essentially a lookup table) but de-embedding for the lm_head does

#

and rwkv was upgraded from prior models which had to be calculated separately
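
The rule of thumb above can be written out as a rough sketch. The function name, parameters, and example sizes are hypothetical, and the handling of embeddings is only one reading of the message: the input embedding is treated as a lookup table with no matmul cost, while the lm_head projection does cost FLOPs. Attention (or its replacement) and the separately computed cost of the prior models that RWKV was upgraded from are not modeled here.

```python
def training_flops(total_params, tokens, vocab_size, d_model, tied_embeddings):
    """Approximate training FLOPs as 6 x (matmul params) x tokens.

    Assumption: with untied embeddings, total_params contains both an input
    embedding and an lm_head of size vocab_size * d_model, and only the
    lm_head costs FLOPs, so the input embedding is subtracted. With tied
    embeddings the shared matrix is counted once and is used by the lm_head,
    so nothing is subtracted.
    """
    embedding_params = vocab_size * d_model
    matmul_params = total_params if tied_embeddings else total_params - embedding_params
    return 6 * matmul_params * tokens


# Hypothetical model shape, used only to compare two token budgets.
same_model = dict(
    total_params=1_500_000_000,
    vocab_size=50_000,
    d_model=2_048,
    tied_embeddings=False,
)
flops_300b = training_flops(tokens=300_000_000_000, **same_model)
flops_332b = training_flops(tokens=332_000_000_000, **same_model)
print(f"332B vs 300B tokens: {flops_332b / flops_300b:.3f}x the compute")
```

For a fixed model the estimate scales linearly with tokens, so the 332B versus 300B token budgets discussed earlier differ by a factor of 332/300 in this approximation.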

keen tartan Mar 8, 2025, 10:35 PM

#

params x tokens is already a good rule of thumb. Yeah, the devil lies in the details.

#

Llama 3.2 1B/3B were distilled from Llama 3.2 8B, right?

misty igloo Mar 8, 2025, 10:36 PM

#

other minor differences include the cost of the attention calculation or replacement thereof

misty igloo Mar 8, 2025, 10:36 PM

#

keen tartan Llama 3.2 1B/3B were distilled from Llama 3.2 8B, right?

i had them in the chart originally but it makes no sense since there is no true FLOPS used to train them

misty igloo Mar 8, 2025, 10:37 PM

#

keen tartan Llama 3.2 1B/3B were distilled from Llama 3.2 8B, right?

they were distilled from 3.1 8B not 3.2

#

there is no 3.2 8B

keen tartan Mar 8, 2025, 10:37 PM

#

All right.

#

Let me think about it.

#

Distillation is kind of like cheating.

#

Include SmolLM2

#

It was trained from scratch via pretraining.

#

No distillation.

misty igloo Mar 8, 2025, 10:39 PM

#

They weren't just distilled - the starting point was actually 3.1 8B cut up into smaller parts!

keen tartan Mar 8, 2025, 10:40 PM

#

I will think about the issue and look around.

misty igloo Mar 8, 2025, 10:41 PM

#

I don't really want to add more models to this plot though

#

Mostly I'm just continually annoyed by glue messing up all the results

keen tartan Mar 8, 2025, 10:42 PM

#

Yeah, don't worry too much right now. You did already pretty well with all those obstacles.

#

I will get back with some concrete solution suggestions.

gusty condor Mar 9, 2025, 1:58 AM

#

keen tartan Who did the evaluations for Llama3.2 1B/3B and Qwen2.5 1.5B/3B?

I did.

obsidian quest Mar 9, 2025, 3:42 AM

#

https://x.com/BlinkDL_AI/status/1898579674575552558

BlinkDL (@BlinkDL_AI) on X

RWKV7-G1 "GooseOne" first release: reasoning @ 0.1b params, pure RNN (attention-free), fully multilingual. Demo & weights on https://t.co/fZ7rmVKsKj 🪿 Larger G1 training in progress.

gusty condor Mar 9, 2025, 8:19 AM

#

Who is doing these experiments?

keen tartan Mar 9, 2025, 8:19 AM

#

gusty condor I did.

Do you have the results files for those too? I could not find them.

gusty condor Mar 9, 2025, 8:19 AM

#

The x-axis of (c) is not consistent with others

gusty condor Mar 9, 2025, 8:21 AM

#

keen tartan Do you have the results files for those too? I could not find them.

I have them, saved in another txt. I will try to retrieve them

keen tartan Mar 9, 2025, 8:22 AM

#

gusty condor I have them, saved in another txt. I will try to retrieve them

Ok, very well. Yes, please.

gusty condor Mar 9, 2025, 8:25 AM

#

Uploaded

#

https://github.com/Triang-jyed-driung/myevals

keen tartan Mar 9, 2025, 8:28 AM

#

gusty condor https://github.com/Triang-jyed-driung/myevals

That is great. We can parse it. Thank you!

obsidian quest Mar 9, 2025, 9:12 AM

#

Suggestions:

1. #1103039376184852622 message
2. #1103039376184852622 message
3. #1103039376184852622 message

tropic minnow Mar 9, 2025, 10:55 AM

#

obsidian quest Suggestions: 1. https://discord.com/channels/729741769192767510/110303937618485...

yes im doing #3

misty igloo Mar 9, 2025, 11:01 AM

#

tropic minnow yes im doing #3

#3 is already in the document, but not grouped together all in one place in this way, and we should credit schmidhuber/widrow/hebb etc.

#

(it's instead presented in the order of the current narrative)

#

feel free to fix it up tho!

tropic minnow Mar 9, 2025, 11:02 AM

#

the linked repo: https://github.com/RWKV/RWKV-LM needs a fresh pull from Blink's

GitHub

GitHub - RWKV/RWKV-LM: RWKV is an RNN with transformer-level LLM pe...

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast in...

#

@misty igloo any opposition to moving Appendix D and E to 3. Architecture?

misty igloo Mar 9, 2025, 11:09 AM

#

gusty condor The x-axis of (c) is not consistent with others

These are @brisk bronze 's - we are having some trouble with nathan's ctx-extended 1.5B repo (probably because it was done using a very old version of FLA), but I'm working on fixing it so she can run the part up to 15k

tropic minnow Mar 9, 2025, 11:12 AM

#

tropic minnow <@1007072846960410685> oposition for moving appendix D and E to `3. Architecture...

I think Figure 2 (like it a lot!!!) could also be moved to 3. Architecture, where the recurrence formulation is first laid out

misty igloo Mar 9, 2025, 11:13 AM

#

tropic minnow <@1007072846960410685> oposition for moving appendix D and E to `3. Architecture...

wrt Appendix D, it could be okay for Theorem 2, but my current view is that Theorem 3 is not realistic under actual conditions for RWKV-7
there is ongoing work to find a proof that would work without extra tokens, but I am somewhat doubtful it will happen
and the current proof of Theorem 3 is still not explicit enough about the fact that actual RWKV models cannot execute it without injecting multiple tokens in between each input token

#

I also think that generally when someone reads the paper they want the overview not every detail inline

#

The main paper is already 18 pages long

tropic minnow Mar 9, 2025, 11:23 AM

#

@misty igloo are we releasing world-v3 dataset? i could find:

hevok/Goose-World-v3 · Datasets at Hugging Face

#

following the structure of [Intro][method][results][additional] i propose the following reordered abstract:

#

We present RWKV-7 "Goose", a new iteration of linear RNNs featuring a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show this architecture can solve problems outside of TC0 under standard complexity conjectures, exceeding the capabilities of transformers while retaining parallelizability of training.
We trained models up to 3B parameters on a new dataset that we name World-V3; which exhibit improved performance across a wide range of benchmarks and state of the art downstream tasks despite being trained on dramatically fewer tokens than other models in its class, including LLaMA 3.2 and QWen-2.5.
To foster openness, reproduction, and adoption, we release all our models on Huggingface, and our training and inference GitHub; all under Apache 2.0 License.

keen tartan Mar 9, 2025, 11:27 AM

#

The intention is to release it as a kind of "meta-dataset" (a dataset of datasets), as the majority of subsets are available on HuggingFace. Those that are missing we could add as separate dataset repos and link them all together.

misty igloo Mar 9, 2025, 11:27 AM

#

tropic minnow <@1007072846960410685> are we releasing world-v3 dataset? i could find:[ ](https...

I don't know why the paper currently claims we'll release a 1% slice of it - I'm up for it if Blink wants to, but typically in the past he has not wanted to.

#

I have a comment in the doc asking about this...

misty igloo Mar 9, 2025, 11:34 AM

#

tropic minnow ```Recurrent Neural Networks (RNNs) offer a compelling option for Large Language...

I think you should always lead with your best foot forward, and this has a two sentence lead-in about other models and their problems, instead of immediately describing why the Goose is great or giving a hook to the reader.

There was some recent discussion from @young sparrow in the #general channel about how to write a compelling abstract that might be useful

keen tartan Mar 9, 2025, 11:37 AM

#

misty igloo I think you should always lead with your best foot forward, and this has a two s...

I agree, the first sentence is very important and should captivate the reader immediately.

gusty condor Mar 9, 2025, 11:49 AM

#

misty igloo I don't know why the paper currently claims we'll release a 1% slice of it - I'm...

I am asking Blink for that.

gusty condor Mar 9, 2025, 11:53 AM

#

tropic minnow ```Recurrent Neural Networks (RNNs) offer a compelling option for Large Language...

> ... on a new dataset that we name World-V3; **which** exhibit improved performance ...

What is this **which** referring to?

tropic minnow Mar 9, 2025, 1:33 PM

#

gusty condor > ... on a new dataset that we name World-V3; **which** exhibit improved perform...

The models. But it can be rephrased to improve clarity; other readers might be puzzled by it as well

young sparrow Mar 9, 2025, 1:37 PM

#

tropic minnow ```Recurrent Neural Networks (RNNs) offer a compelling option for Large Language...

I don't think that the two opening sentences are too problematic, I'm more worried about the fact that the third sentence is about something most ML people don't care about

#

The circuit complexity stuff should be an aside, and probably the second to last sentence in the abstract. Right before the comment about releasing stuff

keen tartan Mar 9, 2025, 1:43 PM

#

young sparrow I don't think that the two opening sentences are too problematic, I'm more worri...

Abstracts are kind of like the first few seconds of a YouTube video. You should start with a hook, otherwise viewers (i.e. readers) will drop off before getting any further.

#

The current first sentence is very catchy.

young sparrow Mar 9, 2025, 1:46 PM

#

I think we agree?

keen tartan Mar 9, 2025, 1:47 PM

#

Yes, the complexity stuff should not be at the beginning perhaps.

#

But it is also important to highlight what is novel of the suggested architecture and how it was achieved.

young sparrow Mar 9, 2025, 2:04 PM

#

I agree.

misty igloo Mar 9, 2025, 2:11 PM

#

@keen tartan is it possible to put together a list of which sub-datasets within the entire World v3 corpus are no longer available online?

gusty condor Mar 9, 2025, 2:12 PM

#

Wait! What happened?

#

They do sum up to 3119.2

misty igloo Mar 9, 2025, 2:12 PM

#

gusty condor Wait! What happened?

Is this correct? I thought you guys said it had to get updated

#

Afaict it never did

gusty condor Mar 9, 2025, 2:13 PM

#

Yes, this is indeed correct

keen tartan Mar 9, 2025, 2:14 PM

#

misty igloo <@371036620008194048> is it possible to put together a list of which sub-dataset...

Yes, it is kinda already done.

#

All available datasets are linked. I will get the list down to the problematic ones.

misty igloo Mar 9, 2025, 2:15 PM

#

obsidian quest world-3.0 ```book 337.2 science+wiki 222.7 math 32.3 law&gov 19.0 fiction 192.6 ...

this was blinks original message and afterward blink asked Hevok to combine all the items so he could go over it and fix it up

#

@gusty condor yet the table has not changed since then

#

that's why I temporarily commented out the table, because it never got updated after that

#

if I misunderstood, let me know - it seemed like Blink wanted to update it to be accurate in some way

keen tartan Mar 9, 2025, 2:20 PM

#

I did combine all and Blink went over and added missed datasets.

#

Datasets that seem to be no longer available are, for example, https://huggingface.co/datasets/marianna13/random_quora

#

I am still searching for those.

#

It might be this one: https://huggingface.co/datasets/marianna13/random_dataset

#

I will pin down all.

misty igloo Mar 9, 2025, 2:24 PM

#

keen tartan I did combine all and Blink went over and added missed datasets.

Yeah I know you guys got that part done, which is great! But wasn't there also something about updating the categories summary as a result?

keen tartan Mar 9, 2025, 2:25 PM

#

It was my suggestion. As the categories seem rather arbitrary.

#

For instance something can be both code as well as web.

#

Like StackOverFlow data.

#

An ontology might be helpful in such a case.

misty igloo Mar 9, 2025, 2:28 PM

#

Well we don't have to put this summarized list of category breakdowns into the paper. Let's only put it in if we have one that we think is helpful and correct.

gusty condor Mar 9, 2025, 3:05 PM

#

I think it is very important. Our dataset is rich in novels and fiction, but falls short on math and code compared to the Qwen2.5 series. I think this is an important piece of information.

keen tartan Mar 9, 2025, 3:16 PM

#

All right. Updated: https://huggingface.co/datasets/hevok/Goose-World-v3

#

Almost all are readily available, with only a few exceptions:

1. Wikipedia: Loader not working anymore https://huggingface.co/datasets/olm/wikipedia
2. Guanaco: Was taken down because it included private data https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
3. Books3: Taken down because of copyright issues https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz https://huggingface.co/datasets/defunct-datasets/the_pile_books3

#

Can be easily fixed.

#

For 2 & 3 I am not certain yet how to resolve.

#

Does anyone have backup copies of Guanaco and/or Books3?

#

I may have perhaps on some old drive, not sure.
