#RWKV-papers

quaint quiver
#

Ya I mean if ur gonna post train it I assumed u would put this thinking between the question and answer

#

Also prompt and chat template I’m not sure is too relevant to the idea of latro

#

How do u plan on doing it to pretraining data

#

Where will the CoT be placed

sinful breach
#

restrict it to settings where the next token isn't necessarily ambiguous, a small set of choices (at most 1/p), and where the model is uncertain among those choices and could perhaps reason which would be better. At the same time, might still end up being very wasteful as you'd be trying to reason about stuff like which of 2 synonyms is a better choice

gusty condor
quaint quiver
#

How would u do that efficiently on pretraining data?

#

Just seems like between the question and answer is 99% of the time most optimal

#

Easiest thing to train and model to learn quickly

#

Better user experience

#

And u keep prefill efficiency

gusty condor
quaint quiver
#

Oh ok didn’t realise lol

quaint quiver
#

Hm tbh still don’t see how u would do it efficiently from the pdf

#

Although maybe the inefficiency is fine

gusty condor
#

Not really efficient, but worth trying

obsidian quest
#

this is correct, however for some strange reason, the training will nan after some time if we apply 1.6x

i noticed this before. dont have time to debug it yet lol

#

similar to this

alpine ferry
#

are any of the Goose models (1.5B, 3B or 7B) on huggingface to experiment with? I was thinking of running some long context experiments

misty cedar
#

1b5 can do niah at least to 32k, bigger v7 models coming soon

misty igloo
gusty condor
#

I decided to put this table into the introduction of RWKV-7's paper. However, I don't understand exactly how TTT-linear and Titans update their states. I think TTT involves a mini-batch gradient descent, but I have no idea how to write the state evolution formula in a suitable format.

misty cedar
grim grotto
#

see eq 7 of https://arxiv.org/pdf/2407.04620 for TTT (except exclude the $x_t$). TTT is quite different from all the other techniques (it essentially maintains a state for 16 steps), so maybe just pretend the mini-batch size is 1 and add a comment as a footnote?

Effectively it would be

$$S_t = S_{t-1} (I - 2\eta k_t k_t^T) + 2 \eta v_t k_t^T$$

or equivalently

$$S_t = S_{t-1} -2\eta(S_{t-1} k_t - v_t) k_t^T$$

Where $\eta$ is a scalar
(You may have to switch the transposes based on convention of row vs column vectors? For instance $v_t^T k_t$ in RWKV-6 would result in a single scalar using column vectors conventions, but it should actually be a matrix, so I assume you are using row vector conventions)
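
The equivalence of the two forms above can be checked numerically. This is a minimal sketch in plain Python, assuming the column-vector convention (so $k_t k_t^T$ is an outer product and $S$ maps keys to values) and mini-batch size 1; the function names and the random test sizes are illustrative and not taken from the TTT paper.

```python
import random

def ttt_step_matrix_form(S, k, v, eta):
    """S_t = S_{t-1} (I - 2*eta*k k^T) + 2*eta*v k^T."""
    n = len(k)
    # Transition matrix A = I - 2*eta*k k^T (n x n).
    A = [[(1.0 if i == j else 0.0) - 2.0 * eta * k[i] * k[j] for j in range(n)]
         for i in range(n)]
    out = []
    for row, v_i in zip(S, v):
        out.append([sum(row[m] * A[m][j] for m in range(n)) + 2.0 * eta * v_i * k[j]
                    for j in range(n)])
    return out

def ttt_step_gradient_form(S, k, v, eta):
    """S_t = S_{t-1} - 2*eta*(S_{t-1} k - v) k^T, one SGD step on ||S k - v||^2."""
    out = []
    for row, v_i in zip(S, v):
        err = sum(s * k_j for s, k_j in zip(row, k)) - v_i  # one entry of S k - v
        out.append([s - 2.0 * eta * err * k_j for s, k_j in zip(row, k)])
    return out

random.seed(0)
d_k, d_v, eta = 4, 3, 0.1  # hypothetical sizes and step size
S = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(d_v)]
k = [random.gauss(0, 1) for _ in range(d_k)]
v = [random.gauss(0, 1) for _ in range(d_v)]

a = ttt_step_matrix_form(S, k, v, eta)
b = ttt_step_gradient_form(S, k, v, eta)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff)  # difference is at floating-point rounding level
```

The second form makes the "gradient descent on the state" reading explicit, while the first is the shape that lines up with the other state-evolution formulas in a comparison table.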

grim grotto
#

Titans seems to be almost exactly the same (also using minibatch), except it has

$$S_t = S_{t-1} (w_t I - 2\eta_t k_t k_t^T) + 2 \eta_t v_t k_t^T$$

Where $w_t$ and $\eta_t$ are learnable scalars

grim grotto
#

Actually $w_t$ and $\eta_t$ may be input-dependent vectors so you would have to wrap these with diag

dawn pewter
#

What is the range of a_t?

gusty condor
#

(0,1)

gusty condor
#

I want to have RWKV-7 paper posted on arxiv before March 1st. Currently, @paper dove , @iron parrot , @dawn pewter and I are working on it. Does anyone have suggestions on the current paper?

gusty condor
#

@steady ether Have you tested RWKV-7 MQAR?
It seems that the special initialization of RWKV models is not used, which may affect performance.

steady ether
misty igloo
young sparrow
misty igloo
#

COLM will use the following policy, adapted from NeurIPS: 'Non-anonymous preprints (on arXiv, social media, websites, etc.) are permitted. We recommend you indicate “preprint”, rather than the “final” option in the template. Reviewers will be instructed not to actively look for such preprints, but encountering them will not constitute a conflict of interest.'
Yep, looks fine!

misty igloo
# steady ether I tested it a while back. What’s the special initialization you’re referring to?...

this part

            # !!! initialize if you are using RWKV_Tmix_x070 in your code !!!
            # self.receptance.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.key.weight.data.uniform_(-0.05/(C**0.5), 0.05/(C**0.5))
            # self.value.weight.data.uniform_(-0.5/(C**0.5), 0.5/(C**0.5))
            # self.output.weight.data.zero_()

also, the 'suggestion' values for the LORAs would be good to follow, as those are what are actually used for the models

steady ether
#

Ah, I completely forgot. Thanks!

misty igloo
misty igloo
#

also, @obsidian quest could you describe what hardware and batchsizes etc. were used for the training

misty igloo
#

and one more question: when you continued the World v2.0 models on World v2.1, how exactly did that work? It was just an additional 0.3T tokens trained? I know for RWKV-7 World v3.0 you trained again on the whole 3.1T World 3.0 corpus...

gusty condor
#

I suggest you further add this at line 81:

    if 'rwkv' in block_type.lower():
        # initialize embedding and head
        ...
        return
#

@steady ether Add Channel mix too. Your code did not use Channel Mix, and it is very different from GLU

gusty condor
#

You didn't handle the v_first term properly either.

misty igloo
gusty condor
#

Exactly!

misty igloo
#

ok, will have updated the manuscript accordingly 😉

#

I added an intro and did a bunch of checking and edits, will try to add a background section tomorrow and edit more things that I know aren't correct yet

#

Also added trained models section

obsidian quest
misty igloo
#

especially when it was only 1-3 epochs, spread out by trillions each time

#

@gusty condor are you going to train a RWKVMusic for v7?

obsidian quest
#

so i think it's more like 1.1/2 + 1.4/2 + 3.1 🙂

gusty condor
misty igloo
gusty condor
#

0.1B is trained from scratch (likely)

#

I don't know if 0.4B is converted from RWKV-5

misty igloo
#

@obsidian quest which models were trained from scratch? and which were converted from v5 and v6 and which ones are from world v2 etc?
0.1B - from scratch? just world v3?
0.4B - are all the others from v6 world2.1 upgraded?
1.5B
2.9B
and were those v6 world 2.1 all from v6 world2? or from v5

gusty condor
obsidian quest
#

0.1 from v5 world, 0.4 from v5 world 2, 1.5 2.9 from v6 world 2.1

misty igloo
#

so for 0.1B and 0.4B did you upgrade to v7, then train world v3 directly for those? so they are only 1.1T + 3.1T?

gusty condor
#

0.1B is likely world v1

misty igloo
gusty condor
#

There is no world v2 0.1B model

misty igloo
#

RWKV-4 paper does not show the number of tokens or contents of World dataset

#

does anyone know this info?

#

I guess we can live without the contents, but the token count would be good to show

#

@obsidian quest can you provide the values for this chart for v7 World 3 training

#

(I think I have the config for Pile)

gusty condor
#

Something like this

misty igloo
gusty condor
#

I think it's better to use "RWKV-7" rather than "Goose"

#

World v1: 0.59

#

world v2: 1.12

#

v2.1: 1.42

#

v3: 3.119

misty igloo
#

good to use the same precision consistently, so I think it should be either one or two decimal places

#

I have updated it to use a single decimal place for now
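
As a quick consistency check on the figures above, here is a small sketch that formats the four corpus sizes to one decimal place and computes the v2 to v2.1 increment. It assumes the numbers are in trillions of tokens (consistent with the "3.1T World 3.0 corpus" mentioned earlier); the dictionary and variable names are illustrative.

```python
# Corpus sizes as posted above, assumed to be in trillions of tokens.
WORLD_TOKENS_T = {
    "World v1": 0.59,
    "World v2": 1.12,
    "World v2.1": 1.42,
    "World v3": 3.119,
}

# Same precision everywhere: one decimal place.
one_decimal = {name: f"{tokens:.1f}" for name, tokens in WORLD_TOKENS_T.items()}

# Tokens added when going from World v2 to World v2.1.
added_v2_to_v21 = WORLD_TOKENS_T["World v2.1"] - WORLD_TOKENS_T["World v2"]

for name, text in one_decimal.items():
    print(f"{name}: {text}T")
print(f"v2 -> v2.1 adds about {added_v2_to_v21:.1f}T")
```

The increment comes out at roughly 0.3T, which matches the "additional 0.3T tokens" figure asked about earlier in the thread.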

#

@gusty condor let's discuss whether to use "state of the art" versus "state of the open" - what are the closed source models at these scales against which we are competing, and can we find equivalent benchmarks for them?

#

I'm not 100% certain on even state of the open, but I haven't seen any models that beat RWKV-7 at the 3B scale

#

maybe some hybrids might? we need to check this

fresh mulch
# gusty condor I want to have RWKV-7 paper posted on arxiv before March 1st. Currently, <@10720...

My 4 cents:
Do we plan a section on speed/memory benchmarks like sec. 9 of the Eagle/Finch paper? I see it is currently commented in the LaTeX source.

I would also suggest we reformat Sec. 4.1.1 for clarity, because we introduce a dozen or so RWKV-specific variables and it's easy to forget the first few times around. I find myself frequently referring back to it for variable meaning and faster lookup would be great.

Similarly I would like to see intuitive explanations for some design choices throughout Sec. 4, and connections of ways in which Goose design choices can be considered similar (or different!) to other linear attention architectures, like how Eagle/Finch Sec. 4 did it, to contextualize the work in the broader linear attention landscape. (Maybe this will be covered in the background Sec. 2)

Spitballing on this last one, but I also wonder whether we can come up with any simple explanations and visuals. Technical media makes good papers, but simple media makes good blogposts, which in turn makes good (maybe even viral) publicity. I really like Figure 2 in this regard and wonder if we can expand on that.
For example Fig 1 from Attention hits about as many Google Search results as the whole query "RWKV". If we have something that is easily accessible in that regard, I think it will do wonders for RWKV's publicity. (Figure 4 is too intimidating, IMO.)

#

on another note is anyone testing Goose for music or audio modeling atm? if not I'd like to contribute that

misty igloo
#

Definitely contribute some audio modeling if you can! Maybe you'd like to get the linaspeech code and rework it for v7? Or if @gusty condor doesn't have time to do it he could share the music modeling code with you to attempt that one

#

Good point on 4.1.1 - I did some reworking here already but more is needed. As usual, the problem is balancing total pagecount with readability. For arxiv it doesn't matter, so I could just add in a reference sheet, but I think it'd be nicer for the full paper to match somewhat

#

I will also include a code version in the appendix, with legible naming and comments

misty igloo
obsidian quest
fresh mulch
fresh mulch
#

other ideas include semantically spacing the lines, e.g. putting a \vspace{0.5em} between g, d, v, a and r, k, v

misty igloo
fresh mulch
#

hm well it is really weird to me that we use two different font 'v's (well, \nu and v). For instance it is very difficult to tell them apart on my phone screen - could we do \bar{v} or something clearer for value without residual?

misty igloo
gusty condor
silent urchinBOT
#

Zhang Ruichong

gusty condor
#

$\nu$ has a sharper angle at the bottom.

silent urchinBOT
#

Zhang Ruichong

fresh mulch
# silent urchin **Zhang Ruichong**

It also has the serif on the top left. But it is for the sake of clarity: for example, if I lean back in my chair, or open Discord on a phone, I cannot distinguish them in this image.

gusty condor
#

That is more valuable than music modeling ( @iron parrot can do that)

misty igloo
#

it was my mistake, I shouldn't have tried to use a similar looking letter and will change it to have a tilde or hat or something instead

#

same issue with kappa versus k

#

I just wanted something that looked like a k, since its related... but maybe it's better as a completely different letter

#

the problem is these things go through a few steps, so for example we already needed kappa hat

#

same issue with alpha and a btw

#

I'm really not sure what I would replace them with though

#

blink calls kappa 'kk' in the code lol

gusty condor
#

RWKV-7 MQAR

L=512  KV=64  D=64   98.43%
L=512  KV=64  D=128 >99%
L=512  KV=64  D=256 >99%
L=512  KV=64  D=512 >99%

L=1024 KV=128 D=64   95.01%
L=1024 KV=128 D=128 >99%
L=1024 KV=128 D=256 >99%
L=1024 KV=128 D=512 >99%

L=2048 KV=256 D=64   72.93%
L=2048 KV=256 D=128  94.97%
L=2048 KV=256 D=256  98.97%
L=2048 KV=256 D=512 >99%

https://wandb.ai/rwkv_tune/zoology-rwkv-7

gusty condor
#

@misty igloo What is "relaxed replacement semantics" in the abstract?

misty igloo
#

We can rephrase if you like

steady ether
gusty condor
#

@steady ether How many learning rates are tested?
I use $$ \mathrm{LR} = \frac{(1.0,\ 2.0,\ 4.0)}{\sqrt{d_{\mathrm{model}}} \cdot \mathrm{sequence\_length}} $$

#

Is our batch size aligned?

silent urchinBOT
#

Zhang Ruichong

misty cedar
#

does your inner monologue just run in latex?

gusty condor
# silent urchin **Zhang Ruichong**

See https://arxiv.org/abs/2407.05872 (and some others) why I use such a formula

#

After testing d_model = 64 at sequence length 1024, I transferred the LR across all runs
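
For concreteness, the learning-rate rule quoted above can be sketched as follows. The function name is illustrative; the multipliers (1.0, 2.0, 4.0) and the d_model = 64, sequence length 1024 setting are the ones mentioned in the messages.

```python
import math

def lr_candidates(d_model, sequence_length, multipliers=(1.0, 2.0, 4.0)):
    """Candidate learning rates: multiplier / (sqrt(d_model) * sequence_length)."""
    scale = math.sqrt(d_model) * sequence_length
    return [m / scale for m in multipliers]

# The setting that was tested first: d_model = 64 at sequence length 1024.
lrs = lr_candidates(64, 1024)
print(lrs)  # sqrt(64) * 1024 = 8192, so the candidates are 1/8192, 2/8192, 4/8192
```

Because the rule divides by both the square root of the width and the sequence length, wider or longer-context runs get smaller candidates without a separate sweep, which is the sense in which the LR is "transferred" across runs here.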

steady ether
gusty condor
#

@misty igloo You modified my formula here

#

Groupnorm is swallowed

misty igloo
#

This way we can keep everything per head in 4.1.2 and full vectors of size D in 4.1.3

gusty condor
#

4.1.3 is not really dimension D

#

This is summed per-head:
$$ \langle r_{t, j} \mathrm{diag}(\rho_j), k_{t, j} \rangle $$
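
To make the per-head sum concrete: with row vectors, $r \, \mathrm{diag}(\rho)$ is just the elementwise product of $r$ and $\rho$, so the expression is a weighted inner product computed separately for each head. A minimal sketch, assuming length-D vectors laid out head after head; the function names and the toy numbers are illustrative.

```python
def weighted_inner_product(r, rho, k):
    """<r diag(rho), k> = sum_i r_i * rho_i * k_i for a single head."""
    return sum(r_i * rho_i * k_i for r_i, rho_i, k_i in zip(r, rho, k))

def per_head_weighted_inner_products(r, rho, k, head_size):
    """Split full-width vectors into heads and return one scalar per head."""
    return [
        weighted_inner_product(r[s:s + head_size], rho[s:s + head_size], k[s:s + head_size])
        for s in range(0, len(r), head_size)
    ]

# Toy example: D = 4 split into 2 heads of size 2.
r = [1.0, 2.0, 3.0, 4.0]
rho = [1.0, 1.0, 0.5, 0.5]
k = [1.0, 1.0, 2.0, 2.0]
scalars = per_head_weighted_inner_products(r, rho, k, head_size=2)
print(scalars)
```

The result has one entry per head rather than D entries, which is the point being made about 4.1.3 not really being dimension D.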

silent urchinBOT
#

Zhang Ruichong

misty igloo
#

I don't think adding tons of subscripts is making the paper better

#

it's just harder to follow

#

when the reality is that its a bunch of hadamard products and an addition

#

oh sorry

#

yes I made a mistake here

#

and I got busy and forgot to correct it

#

give me a few minutes to look over it - I'll put it back if I don't find a better solution

#

but it would be nice if 4.1.3 was less complex looking

gusty condor
misty igloo
#

I love songlin's papers but they are very challenging for non-mathematicians to read through

#

my original formula that was in the paper before it got changed was more like this

#

putting the heads together doesn't really happen until the very last step, right before multiplying with W_o

gusty condor
#

It's 2025 and people are getting used to these

misty igloo
#

but the actual situation is that everything is per head until W_o... everything except tokenshift can be considered and written per head

#

so there is simply no need for head subscripts in any part of the paper

#

I'm trying to keep things simple here so that it can be quickly and easily understood

gusty condor
#

This is a per-head sum

misty igloo
#

its an inner product

gusty condor
#

Yes, inner product weighted by r_k

#

Also this c

#

Why do you remove that

misty igloo
#

Why did you add it? It's not part of RWKV-7

#

I know it's part of some proofs

#

so I left it in the proofs as an extension

gusty condor
#

This is naturally extendible and exists in some of @iron parrot 's experiments.

#

Should be considered a hyperparameter

misty igloo
#

I don't think it should be in the official formulas if it's not in any RWKV-7 code that has ever publicly existed

#

I agree that it should be listed as an extension in the paper though

#

when people read the paper and see the code they shouldn't be surprised that the code does not have parts that are in the paper

#

if you still think it should be in the main formulas, we can just ask Blink if he wants it to be a part of the official RWKV-7 definition or listed as an extension

gusty condor
#

C being set to 1 is more like a compromise

#

Originally it was 2

#

That caused NaNs in rc2

misty igloo
#

you don't think it will be a problem for readers that the code literally doesn't have this in it?

#

I think it's pretty bad when I read a paper and the code does not conform to it

#

extremely confusing to the reader when that happens

gusty condor
misty igloo
#

@obsidian quest what do you think? should we put this additional c parameter (it would be 1.0 in all existing models) into the codebases?

iron parrot
#

I ran some loss tests on PG19 with different models. Surprisingly, the loss doesn't seem to get better with longer context lengths, even with newer models

#

Looks like this is something specific to this dataset

misty igloo
iron parrot
misty igloo
misty igloo
#

@gusty condor I made a correction to your bonus formula... maybe this is a mistake in the model, but the code uses \tilde{k} for it
There is also a mistake where it does not apply the gate before the output matrix. I have corrected that as well but it does not fit nicely with your head indexing

iron parrot
#

I'm about to run NIAH tests. Which NIAH variant should we use, or should we go with RULER instead?

gusty condor
#

just simple passkey in garbage - at least rwkv-7 solved

misty igloo
#

@brisk bronze ran this NIAH style passkey in garbage test and has results - she has an updated version of @iron parrot 's mamba repo that uses exact token counts instead of an approximation based on the average tokenizer bytes per token, as well as some other updates

#

I asked her to add these to the paper a couple days ago but I guess she hasn't gotten around to it yet

misty igloo
iron parrot
iron parrot
#

long-context loss tests on the Proof Pile dataset

misty igloo
iron parrot
obsidian quest
misty igloo
obsidian quest
#

will provide latest list soon

gusty condor
#

I suggest we open a separate new github link for all our experiments:

  1. RWKV-7 training code (should only include RWKV-7)
  2. MQAR testing
  3. lm-eval code
  4. state visualization
young sparrow
#

Does the new RWKV have a working HF implementation yet?

brisk bronze
misty igloo
#

I have older ones too, and we have an upcoming simplified Rwkv-Blocks repo that implements it too, but I recommend the fla-hub versions at this point

gusty condor
misty igloo
gusty condor
#

Inference mainly

misty igloo
gusty condor
#

But I suspect the main problem lies in the logit head and outputs, which increases ppl a little

misty igloo
misty igloo
#

I spent about 24 hours tracking that one down at one point

#

like you have to call torch.nn.functional.layer_norm or group_norm with the weights upcasted to float potentially (even though they are stored as bf16) so that the calculation is done in float precision

#

that could easily be the issue

#

it's not currently being done in o = self.g_norm(rearrange(o, '... h d -> ... (h d)'))

gusty condor
#

GLUE looks like problematic

#

There are several subtasks in math and CS of MMLU that RWKV-7 lags behind Qwen by over 20%

obsidian quest
#

qwen2.5 has much higher MMLU compared with llama3.2 too
they have lots of synthetic data

gusty condor
#

I think there is some slight data leakage. Qwen-2.5's MMLU should probably be discounted by around 6%.

gusty condor
misty igloo
# gusty condor GLUE looks like problematic

This was done using the FLA hf so maybe the fix will help a bit. Glue was having some issues so I need to double check the methodology there. @brisk bronze where did we end up on that, did you manually average the accuracy based entries?

brisk bronze
gusty condor
misty igloo
gusty condor
#
{
  "model": "/home/zhangping/zrc/RWKV-x070-World-2.9B-v3-20250211-ctx4096",
  "tasks": [
    "glue"
  ],
  "num_fewshot": 0,
  "results": {
    "glue": {
      "f1,none": 0.6928835730390357,
      "f1_stderr,none": 0.0036436664119938256,
      "acc,none": 0.6684581943782754,
      "acc_stderr,none": 0.0016703858434417523,
      "mcc,none": 0.05185503773957725,
      "mcc_stderr,none": 0.032600944586408685,
      "alias": "glue"
    },
    "cola": {
      "mcc,none": 0.05185503773957725,
      "mcc_stderr,none": 0.032600944586408685,
      "alias": " - cola"
    },
    "mnli": {
      "acc,none": 0.39449821701477333,
      "acc_stderr,none": 0.004933523584717906,
      "alias": " - mnli"
    },
    "mnli_mismatch": {
      "acc,none": 0.4044955248169243,
      "acc_stderr,none": 0.004949946753591583,
      "alias": " - mnli_mismatch"
    },
    "mrpc": {
      "acc,none": 0.7794117647058824,
      "acc_stderr,none": 0.020553105287596057,
      "f1,none": 0.8534201954397395,
      "f1_stderr,none": 0.016157946331836814,
      "alias": " - mrpc"
    },
    "qnli": {
      "acc,none": 0.5678198791872597,
      "acc_stderr,none": 0.006702886134456929,
      "alias": " - qnli"
    },
    "qqp": {
      "acc,none": 0.8064803363838734,
      "acc_stderr,none": 0.0019647755361788884,
      "f1,none": 0.6912635151132507,
      "f1_stderr,none": 0.003676786809851292,
      "alias": " - qqp"
    },
    "rte": {
      "acc,none": 0.7472924187725631,
      "acc_stderr,none": 0.026157719758464693,
      "alias": " - rte"
    },
    "sst2": {
      "acc,none": 0.893348623853211,
      "acc_stderr,none": 0.010458867008246837,
      "alias": " - sst2"
    },
    "wnli": {
      "acc,none": 0.5352112676056338,
      "acc_stderr,none": 0.0596130578497224,
      "alias": " - wnli"
    }
  }
}
#

These are my tests, definitely better
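
On the earlier question about manually averaging the accuracy-based entries, here is a sketch of the unweighted (macro) average over the subtasks that report `acc` in the JSON above. The accuracy values are copied from that output; the variable names are illustrative.

```python
# "acc,none" values copied from the GLUE results above (CoLA reports only MCC).
glue_acc = {
    "mnli": 0.39449821701477333,
    "mnli_mismatch": 0.4044955248169243,
    "mrpc": 0.7794117647058824,
    "qnli": 0.5678198791872597,
    "qqp": 0.8064803363838734,
    "rte": 0.7472924187725631,
    "sst2": 0.893348623853211,
    "wnli": 0.5352112676056338,
}

# Unweighted mean: every subtask counts equally, regardless of its size.
macro_acc = sum(glue_acc.values()) / len(glue_acc)
print(f"macro-averaged acc over {len(glue_acc)} subtasks: {macro_acc:.4f}")
```

This macro average is lower than the aggregate `acc` of about 0.668 reported for the `glue` group in the same JSON. A likely explanation (an assumption, not verified here) is that the harness weights subtasks by their number of examples, so large subtasks such as QQP dominate. Whichever convention is used in the paper should be stated next to the number.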

misty igloo
#

we are using FLA HF for NIAH etc. too so we really need that repo to work properly

brisk bronze
misty igloo
gusty condor
misty igloo
#

This thing of using RWKV_PAD before the text is pretty weird, and has been discussed as problematic before.
If we are going to do that for evals we should simply change the model to have it be in the starting state.
Where in the FLA HF code does it put [0] in the starting state?

gusty condor
#

then automatically handled by lm_eval

misty igloo
#

yeah that's fine, as long as we tell that to lm-eval

#

doesn't it have a 'add_bos_token' option

#

so we would need to set that when running like pretrained=MODEL,add_bos_token=True

#

I don't think it will add it automatically without that commandline setting
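
A sketch of how the setting above could be passed along, written as a Python helper that only builds the argument list and does not run anything. It assumes the `lm_eval` command-line entry point of lm-evaluation-harness with its `--model hf`, `--model_args`, `--tasks` and `--num_fewshot` options; `MODEL` is the same placeholder as in the message, and the helper name is hypothetical.

```python
def build_lm_eval_command(model_path, tasks, num_fewshot=0, add_bos_token=True):
    """Build an argv list for lm_eval with add_bos_token set explicitly."""
    # Python formats the boolean as "True" (capital T only), as discussed in the thread.
    model_args = f"pretrained={model_path},add_bos_token={add_bos_token}"
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
    ]

cmd = build_lm_eval_command("MODEL", ["glue"])
print(" ".join(cmd))
```

Keeping the flag in one helper makes it harder to run some evals with the BOS token and some without, which matters if different tasks react to it differently.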

#

lets get the bugs fixed in the fla hf implementation so we can be using it for evals, and so that others who use it will not get bad results

brisk bronze
misty igloo
#

True is not capitalized in python, I dunno if that matters

#

I mean the first letter is but not the rest

#

"True"

#

Also lambada is typically less sensitive to this somehow

#

I find the bos token impacts different evals differently with the average ending up not really different

#

The biggest issue is most likely the group norm bugfix

brisk bronze
obsidian quest
misty igloo
misty igloo
obsidian quest
#

got paper link?

obsidian quest
misty igloo
obsidian quest
misty igloo
obsidian quest
misty igloo
#

I think that's what the 'c' variable is for - to allow expanded eigenvalue range

obsidian quest
#

extending eigenvalue:
```
orig:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 )
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*a)

new:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*(a.float()*torch.exp(-torch.exp(w.float()))).to(dtype=torch.bfloat16))

or (try both)

a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk*torch.exp(-torch.exp(w.float())).to(dtype=torch.bfloat16), kk*a)
```

#

it's a bit different

misty igloo
#

yeah, it wasn't my addition 🙂 I just don't like it when the formulas dont match the code because it confuses readers

gusty condor
#

I believe semantically the constant c matters. It's just a compromise for training stability that we changed it to 1. Adding that c can better explain the motivation of RWKV-7.

gusty condor
gusty condor
#

You are using c=1

#

It's just a generalization

obsidian quest
#

ok please simply remove c

#

because i dont think it is needed

gusty condor
obsidian quest
gusty condor
#

OK, will add a subsection in the appendix to discuss that

misty igloo
misty igloo
#

I added an initial draft of a background section just now.

gusty condor
#

There is too much in "others." How much instruction data and how many Chinese novels are there?

#

@obsidian quest could you elaborate for "others"?

obsidian quest
young sparrow
#

Those numbers don't match the table currently, PSA

obsidian quest
#

ok could someone please combine v2 + v2.1 + v3 items and arrange them to approximately match this list and i will fix on top of it because there are so many components

misty igloo
keen tartan
misty igloo
#

if you're going to put this together maybe a google sheet would be best

bronze frost
#

I was looking at the Transition Matrix Stability Proof in the paper. Normally a contraction matrix is defined as having norm less than 1, not eigenvalues in (-1,1). It's misleading to say it's a contraction, since being a contraction would imply that the state cannot blow up. However, the state can blow up.
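
A small numeric illustration of the point above, as a sketch. The two matrices are generic hand-picked examples, not actual RWKV-7 transition matrices: each is triangular with both eigenvalues equal to 0.5, yet alternating them makes the state norm grow.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x_i for m, x_i in zip(row, x)) for row in M]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x_i * x_i for x_i in x))

# Triangular matrices: the eigenvalues are the diagonal entries, 0.5 and 0.5.
# Their spectral norms are at least 2 because of the off-diagonal entry.
A = [[0.5, 2.0], [0.0, 0.5]]
B = [[0.5, 0.0], [2.0, 0.5]]

x = [1.0, 1.0]
norms = []
for step in range(10):
    x = matvec(A if step % 2 == 0 else B, x)  # alternate A, B, A, B, ...
    norms.append(norm(x))

print([round(n, 2) for n in norms])
```

The norm increases at every step even though every factor has all eigenvalues inside (-1, 1). A bound on the operator norm (at most 1) would rule this out, which is why the norm-based definition of a contraction is the one that guarantees the state cannot blow up.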

bronze frost
#

Additionally, I wrote a ~100 line standalone implementation of RWKV-7 inference in numpy (to avoid hiding things in torch functions). It's verified numerically against the pip rwkv package.

gusty condor
misty igloo
#

I'm not sure who is adding the chat examples, but we should discuss this before you do. Using the "base Gradio 7B model" (whatever that is) is not appropriate for a paper that does not include any 7B model.

#

I'm also not sure we want to show chat examples in this paper.

gusty condor
#

I don't want that.

#

We should include more technical stuff.

misty igloo
#

Agreed. I think we are playing with the big boys now and are way past needing to show that it can talk nicely.

#

I added a draft appendix section on design decisions, walking people through how it all works and why.

gusty condor
young sparrow
#

@misty igloo @gusty condor I was going to read through the paper and do a suggestions / editing pass today. Is there anything in particular you'd like me to focus on?

obsidian quest
obsidian quest
#

pls add lora dimensions suggestions as in RWKV-LM

#

more suggestions: wd 0.1 // adam beta (0.9, 0.99) // adam eps 1e-18

obsidian quest
#

add RWKV-4 1.5b to Compression rate% eval

obsidian quest
#

v7 0.1/1.5/2.9/0.4 loss curves

#

0.4 - bsz 240 lr_init 5e-4 => bsz 480 lr_init 6e-4

1.5 - bsz 480 lr_init 4e-4 => bsz 672 lr_init 4.5e-4 => bsz 1152 lr_init 6.1e-4

2.9 - bsz 640 lr_init 4e-4 => bsz 1008 lr_init 5e-4 => bsz 1120 lr_init 5.4e-4 => bsz 2016 lr_init 8e-4

all - wd 0.1 // adam beta (0.9, 0.99) // adam eps 1e-18 // lr_final 1e-5
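
The staged batch-size and learning-rate values above can be written down as a small config structure, for example when reproducing them in the paper's training table. This is only a sketch: the keys mirror the labels in the message, the field names are illustrative, and nothing here is taken from the actual training scripts.

```python
# Settings shared by all runs, as listed above.
COMMON = {
    "weight_decay": 0.1,
    "adam_betas": (0.9, 0.99),
    "adam_eps": 1e-18,
    "lr_final": 1e-5,
}

# (batch size, initial learning rate) per stage, in the order given above.
STAGES = {
    "0.4": [(240, 5e-4), (480, 6e-4)],
    "1.5": [(480, 4e-4), (672, 4.5e-4), (1152, 6.1e-4)],
    "2.9": [(640, 4e-4), (1008, 5e-4), (1120, 5.4e-4), (2016, 8e-4)],
}

def stage_configs(model_label):
    """Return one full config dict per training stage of the given model."""
    return [dict(COMMON, batch_size=bsz, lr_init=lr) for bsz, lr in STAGES[model_label]]

for label in STAGES:
    for i, cfg in enumerate(stage_configs(label), start=1):
        print(label, f"stage {i}", cfg["batch_size"], cfg["lr_init"])
```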
iron parrot
obsidian quest
#

let's try v7 for sudoku too @iron parrot

gusty condor
#

TODO:
5. Add limitations and acknowledgements

gusty condor
#

I mean, is there a formula for it?

obsidian quest
neon tree
#

how could contribute to the manuscript (from fla groups berk

young sparrow
#

@obsidian quest It looks like I don't have permissions to view the overleaf

#

Can you add me?

obsidian quest
#

i am not its owner 😂

young sparrow
#

Who is?

misty igloo
#

@tropic minnow is the owner, but that URL will allow you access

keen tartan
#

Does Overleaf have a dark/night mode? It is so bright! o.O

keen tartan
#

I noticed that the DeepMind Mathematics (dm_math) dataset is part of The Pile 1 and was already included in world-v2 I assume, but it seems to be not mentioned in the Eagle & Finch paper. Where should it be placed?

obsidian quest
keen tartan
#

Should it perhaps be added as an Errata to the Eagle & Finch paper too?

obsidian quest
#
v7 0.4b = v5 0.4b + subsampled 2T tokens from world-3
misty igloo
#

Also, a lot of the evals are still preliminary or missing. We are working on some discrepancy issues we've found to ensure everything is really solid.

spiral minnow
fringe egret
#

Excuse me, who is conducting the Bamboo Benchmark test right now? If no one is doing it, may I take on this part of the test?

unborn lintel
#

Would it make sense to mention the D512 and D576 variants of the 0.1B model?

misty igloo
#

It is mostly there simply because I copied over the results from benchmarks in the Eagle/Finch paper.
Also, if you decide to run it you might consider using our extended context finetunes of 1.5B and 2.9B.

misty igloo
#

I recall that they are narrower but deeper.

gusty condor
misty igloo
#

What's your opinion on listing them? I think it may complicate the paper unnecessarily

#

We could add a section on depth versus width ablations, but I'm also not sure that this is really a RWKV specific result

gusty condor
gusty condor
brisk bronze
nova frost
#

I think bamboo will be pretty low signal for base models

fringe egret
nova frost
#

They mostly have instruct benchmarks in their paper and the tasks are structured in a way that base models will do poorly

misty igloo
# gusty condor Is there previous research about this?

the problem is that it's very niche - we don't have comparison models of other architectures with these changes
so we could show it in its own separate ablations section I suppose, but it wont be relative to other architectures

misty igloo
gusty condor
#

Skip it, ok

#

I tested gsm8k and found that it's very sensitive to response format, so I decided to skip it

nova frost
#

But maybe the instruct models have context extension idk

misty igloo
#

Yeah we also have context-extended versions that we just trained, but we already will show NIAH for that

#

@brisk bronze is doing those, with her fork of what was originally jellyfish's revision of the mamba test 🤣

nova frost
#

Yeah. I think niah single needle and maybe one other. Multi key/query depending on the framing

misty igloo
#

single needle improves quite a lot with the extension

#

like 32k->48k @ 3B scale

obsidian quest
misty igloo
#

smollm 135M is 576 x 30

obsidian quest
misty igloo
#

yeah sorry typo

#

there's no really good comparison point because we don't have a 'normal' depth smollm2

#

but we can show them all side by side

#

just cant really draw much of a conclusion

#

also SmolLM is not trained on pile

#

@obsidian quest what kind of comparison with SmolLM were you thinking we would show?

obsidian quest
#

so this choice is from MobileLLM

misty igloo
#

gotcha

#

that used cross-layer weight sharing to reduce device RAM usage iiuc

gusty condor
#

State visualization for v6. Working on v5 and v7

gusty condor
#

SR: stable rank, (Frobenius norm / spectral norm) ^ 2
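
A minimal sketch of the stable rank definition above in plain Python, using power iteration on the Gram matrix to estimate the spectral norm. The function name, the iteration count and the start vector are illustrative choices; power iteration can give a wrong answer if the start vector happens to be orthogonal to the top singular direction, so a library SVD would be the safer choice in real experiments.

```python
import math

def stable_rank(A, iters=200):
    """(Frobenius norm / spectral norm)^2 for a matrix given as a list of rows."""
    n_cols = len(A[0])
    fro_sq = sum(a * a for row in A for a in row)
    # Gram matrix G = A^T A; its largest eigenvalue is the squared spectral norm.
    G = [[sum(row[i] * row[j] for row in A) for j in range(n_cols)] for i in range(n_cols)]
    x = [1.0 + 0.01 * i for i in range(n_cols)]  # deterministic start vector
    top_eig = 0.0
    for _ in range(iters):
        y = [sum(G[i][j] * x[j] for j in range(n_cols)) for i in range(n_cols)]
        norm_y = math.sqrt(sum(t * t for t in y))
        if norm_y == 0.0:
            return 0.0  # zero matrix: stable rank is undefined, report 0
        x = [t / norm_y for t in y]
        # Rayleigh quotient x^T G x with ||x|| = 1.
        top_eig = sum(x[i] * sum(G[i][j] * x[j] for j in range(n_cols)) for i in range(n_cols))
    return fro_sq / top_eig

rank_one = [[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]  # outer product of [1, -1] and [1, 2, 3]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
sr_rank_one = stable_rank(rank_one)
sr_identity = stable_rank(identity)
print(sr_rank_one, sr_identity)
```

A rank-1 matrix has stable rank 1 and an n x n identity has stable rank n, so the measure runs from "all energy in one direction" up to "energy spread evenly over all directions".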

unborn lintel
#

Visualizations of S @ ones(64,1) for each head, arranged per layer, for some text generated with 0.1B, normalized the same way as the cryscan webgpu state visualization demo

unborn lintel
# gusty condor V7

Visualizations of S @ ones(64,1) for each head, arranged per layer, for some text generated with 0.1B, colored the same way as the state visuals above

gusty condor
obsidian quest
#

Interestingly, the stable rank of the WKV matrix in RWKV-7 has shown to be lower than that
of RWKV-5 and RWKV-6.

this is strange. if you check state visualization, rwkv7 states look much more "random" while rwkv6 states are more like checkboards (rank 1)

gusty condor
#

You can rerun those experiments. Actually in some layers of RWKV-7, the state is very concentrated.

obsidian quest
gusty condor
gusty condor
gusty condor
#

@misty igloo I got the formula for parameters correct:
$$ \#(\mathrm{Params}) = 2DV + 4D + LD \left(12D + 2\left(d_w + d_a + d_v + d_g \right) + 19 \right) - (2Dd_v + D) $$
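
The parameter-count formula above is easy to turn into a helper for cross-checking against actual checkpoints. This is a sketch under assumed symbol meanings: D as model dimension, V as vocabulary size, L as number of layers, and d_w, d_a, d_v, d_g as the inner dimensions of the corresponding low-rank projections. The numbers in the example call are toy values, not a real model configuration.

```python
def rwkv7_param_count(D, V, L, d_w, d_a, d_v, d_g):
    """Evaluate 2DV + 4D + LD(12D + 2(d_w + d_a + d_v + d_g) + 19) - (2D*d_v + D)."""
    embeddings_and_head = 2 * D * V + 4 * D
    per_layer = D * (12 * D + 2 * (d_w + d_a + d_v + d_g) + 19)
    correction = 2 * D * d_v + D  # the term subtracted once in the formula
    return embeddings_and_head + L * per_layer - correction

# Toy values chosen so the arithmetic is easy to follow by hand:
# 2*4*10 + 4*4 = 96, per layer 4*(48 + 8 + 19) = 300, correction 2*4*1 + 4 = 12.
toy_count = rwkv7_param_count(D=4, V=10, L=2, d_w=1, d_a=1, d_v=1, d_g=1)
print(toy_count)
```

Comparing this against a count of the tensors in a released checkpoint would be a direct way to double-check the appendix.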

silent urchinBOT
#

Zhang Ruichong

gusty condor
#

Please double-check Appendix E. I'm finishing in a few hours!

obsidian quest
misty igloo
#

@obsidian quest did you really use adam_eps=1e-18 for the entirety of all the runs?

obsidian quest
#

ok probably sometimes it NaN in 1 step because of this 😂 maybe 1e-16 will avoid this

misty igloo
obsidian quest
#

i just rewind a bit with cleared optimizer states

misty igloo
obsidian quest
misty igloo
obsidian quest
#

different cosines patched together

misty igloo
#

it looks like a time stretched cosine

#

do you have a formula?

#

visually it looks something like cos(t**2)

obsidian quest
#

=(1e-4)*(0.01+0.495*(1+COS(x*PI))) is this cosine decay

#

oh it's because i am using log axis for y
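
The spreadsheet formula above translates directly to Python. This sketch assumes `x` is training progress in [0, 1]; the function name is illustrative.

```python
import math

def cosine_decay_lr(x, peak=1e-4):
    """peak * (0.01 + 0.495 * (1 + cos(pi * x))), with x in [0, 1]."""
    return peak * (0.01 + 0.495 * (1.0 + math.cos(x * math.pi)))

lr_start = cosine_decay_lr(0.0)  # 0.01 + 0.99 = 1.0 times the peak
lr_mid = cosine_decay_lr(0.5)    # 0.01 + 0.495 = 0.505 times the peak
lr_end = cosine_decay_lr(1.0)    # floor of 0.01 times the peak
print(lr_start, lr_mid, lr_end)
```

The curve spans two orders of magnitude, from the peak down to 1% of it. On a logarithmic y-axis the late part of the decay is visually stretched, which is why the plot can look like something other than a plain cosine.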

misty igloo
#

AH.. ok that makes sense now

#

sorry didnt notice that

misty igloo
#

@gusty condor I think we should mention that we increase the number of compute nodes as training progresses

#

@obsidian quest how many nodes and what kind of GPU was used total?

gusty condor
# misty igloo <@803473343705514025> I think we should mention that we increase the number of c...

Yes, let's make it an advantage

This approach not only enhances training efficiency but also utilizes GPU resources economically. After smaller models complete their training, additional GPU resources become available for the later stages of training larger models. This cascading resource allocation ensures that computational power is dynamically reallocated, maximizing hardware utilization and reducing idle time.
misty igloo
#

Great work! This section is really looking good.
Should we provide the FLOPs counts? I know it has been useful for people in the past, including Quentin

#

And it can be helpful if we want to put in a table comparing total trained FLOPs vs quality, like we had in the Eagle/Finch paper

#

I think that will most clearly show the pareto improvement of RWKV7 over these other heavily trained models

#

We can make it short and simple instead of the longwinded version that we had earlier.

gusty condor
#

@misty igloo I think there is a paper named "regular languages in nc1" and you can cite that. (assuming that Wu Tianyi's proof is good)

misty igloo
#

we think it needs some revision, but there may be something we can claim that exceeds the abilities of transformers

#

it also may be able to be simplified quite a bit

gusty condor
#

Yes I agree

misty igloo
#

for example, @bronze frost has a very simple construction for showing that you can create true transpositions (row-pair permutation matrices) with RWKV-7

gusty condor
#

I think NC1 can be achieved by just householder matrices (I'm not an expert in complexity theory)
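
To make the Householder remark concrete, here is a minimal sketch of the standard linear-algebra fact that I - 2 u^T u with u = (e_i - e_j)/sqrt(2) swaps coordinates i and j. This is a textbook construction and not necessarily the one @bronze frost has in mind; the helper names are hypothetical.

```python
import math


def rank_one_transition(u, scale=2.0):
    """Return the matrix I - scale * u^T u for a row vector u."""
    n = len(u)
    return [[(1.0 if i == j else 0.0) - scale * u[i] * u[j] for j in range(n)]
            for i in range(n)]


def swap_transition(n, i, j):
    """Householder reflection across e_i - e_j, which exchanges entries i and j."""
    u = [0.0] * n
    u[i] = 1.0 / math.sqrt(2.0)
    u[j] = -1.0 / math.sqrt(2.0)
    return rank_one_transition(u, scale=2.0)


def apply(matrix, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


swapped = apply(swap_transition(4, 1, 3), [10.0, 20.0, 30.0, 40.0])
print([round(value, 6) for value in swapped])
```

A single reflection of this kind only exchanges two coordinates, which fits the point that a general permutation would need several tokens.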

misty igloo
#

unfortunately, we think true full permutation matrix requires multiple tokens

gusty condor
#

I agree with that

misty igloo
#

but within a single token, having a two-row permutation should exceed transformers' abilities

#

so for example, we should be able to solve swaps on S5 using only incoming (prefill) tokens, which afaict transformers cannot do

#

supposedly this makes it so that we can correctly claim being NC1 complete under reduction by AC0

#

I'm not even a novice in this stuff, let alone an expert tho 🙂

gusty condor
#

Yes, @iron parrot tested on the parity experiment, RWKV-7 can grok while transformers can't

misty igloo
#

@young sparrow do you have a complexity theorist you could recommend to help us with this aspect of the paper? I'm muddling through but we really need someone who can easily cut through it all and validate our claims and (not yet rewritten) proofs

misty igloo
#

was jellyfish's parity experiment done with c=2?

gusty condor
#

Yes, c=2

misty igloo
#

(due to normalization, you can use decay instead of c to achieve similar things)

young sparrow
#

@misty igloo Will Merrill is the expert on this topic. Let me ping him and set up an introduction.

misty igloo
#

Yes I've been looking through his papers lately

#

I'm a little worried that I'm at too low a level of comprehension of this stuff to be the right one to discuss with him, but I do have a well informed broad view of what we're trying to achieve somewhat generally, and the mechanisms involved

gusty condor
#

COLM 2025

OpenReview submission site opens: February 27, 2025
Abstract deadline: March 20, 2025
Full paper submission deadline: March 27, 2025
Rebuttal period: May 27 to June 10, 2025
Decision notifications: July 7, 2025
Conference dates: October 7-10, 2025

misty igloo
#

nor, are you somewhat well versed in complexity enough to help us with the paper? it sounds like you might be

#

If so, let's chat

sinful breach
misty igloo
sinful breach
#

yes

misty igloo
steady ether
#

I found this set of synthetic tasks which seems relevant (https://arxiv.org/pdf/2403.17844). I ran a few of them and v7 is performing quite well. Here's an early plot (only ~10% complete but looking promising)

Also, someone should try the scaling experiments too but that looks like it will cost $$$$

steady ether
obsidian quest
misty cedar
#

Need to tag ffmpeg guy

steady ether
#

It's only like 10% done so we haven't gotten to the hard stuff yet

misty cedar
#

Compression here being a repeat after me?

steady ether
#

It's encoding a sequence into a token and then decoding it

gusty condor
misty cedar
gusty condor
misty igloo
#

And feedback is always welcome!

#

The writing process has been fairly organized so far, which has been great. I'd like to keep it that way and have editing proceed in an organized manner, with people mostly just adding sections, or working together directly on a section. We may need a wider edit beyond that and what I can provide at some point in the near future, I just want to avoid the 'the whole paper gets rewritten every day' thing that happened towards the end last year.

steady ether
steady ether
gusty condor
#

I think the arxiv v1 version can be uploaded in a few hours.

unborn lintel
gusty condor
unborn lintel
#

If they could share the data, I or someone else can remake the graphs using the same theme

#

currently, it seems like there is a mix of excel and python-generated plots...

steady ether
misty igloo
#

And I'm not comfortable yet with the exact claims we can make for the complexity class, yet that should be something we claim in the abstract.

#

I'm also not sure we have properly shown SOTA that we claim. I am working on a FLOPS chart, which will likely show that we have a new pareto frontier here, which would be a desirable claim.

#

Some other notes: I think you need to show Mamba-2 Pile in section K. It's not fair to compare it to the older model only.

bronze frost
#

I also don't think the paper will realistically be ready for arxiv today, but we should get it ready as soon as possible

#

We need everyone to fill in the author contributions section, it's barely started.

misty igloo
#

@here Yes, in the spirit of getting ready as soon as possible: If you made significant contributions to the paper and want to be listed as an author, please list your name and affiliations at the top in the authors section and begin putting in the details of your contributions into Appendix A: Author Contributions. I don't think we currently left anyone out of the list of authors, but definitely let me know if we did. Please also let me know your email address.

obsidian quest
#

yeah need a few more days

obsidian quest
steady ether
fresh mulch
#

I have an AudioRWKV experiment in the oven but I doubt it'll be ready for v1, still trying to figure out Goose state tuning. In the meantime, it looks like sec 2 is still a draft, so I might work on that - do I need to ask to make changes?

misty igloo
obsidian quest
#

let's call it a generalized FWP (fast weight programmer) RNN to respect Schmidhuber 😂

#

RWKV-7 is a generalized version because i am using deformed keys etc.

gusty condor
#

@brisk bronze please use lm-eval 0.4.3 and fp32 to evaluate mamba2

obsidian quest
quaint quiver
# steady ether It is!

Would recommend adding gated deltanet here to show the advantage of the vector lr and decay

steady ether
steady ether
gusty condor
obsidian quest
#

or share your code so i can check

gusty condor
#

I think it's more like wrong LR / Adam epsilon

gusty condor
#

@obsidian quest Which ZeRO stage is RWKV-7 trained on?
Is RWKV trained without pipeline parallelism?

obsidian quest
#

zero2

misty igloo
# quaint quiver Would recommend adding gated deltanet here to show the advantage of the vector l...

@fresh mulch and I are currently doing ablations against all changes from gated deltanet
they just arent in the manuscript yet
these are the differences we're ablating - please let us know if there are others you think are important to show:

  • making the gating (decay) w vector-valued instead of scalar
  • making the removal kk and replacement k keys different from one another
  • making the in-context learning rate a a vector instead of scalar
  • adding bonus (last part of code)
misty igloo
quaint quiver
#

Would say the parametrization of the decay

#

Gated deltanet uses the mamba way iirc so can compare with the rwkv with bias

misty igloo
quaint quiver
#

Ya like the calculation of the decay, mamba uses a specific init and multiplication style which gated deltanet use (songlin mentioned this was pretty important)

fresh mulch
#

@quaint quiver are you referring to training a gated deltanet for table 7, or using the gated deltanet init for our rwkv7 ablations, or something else?

quaint quiver
#

Mainly for table 7 as apparently it was important although could also be done as an ablation

obsidian quest
obsidian quest
#

could you explain this 🙂

fresh mulch
#

@fringe egret are you waiting on results for reportsumsort/showssort bamboo benchmarks or did every model legitimately just score a 0 on them?

obsidian quest
#

feel free to change my text


Given two sequences of vectors $\{k_t\}$ and $\{v_t\}$, RWKV-7 will test-time-train an internal model $v \approx k S^\top$ via in-context gradient descent w.r.t. the L2 loss $\mathcal{L} = \frac{1}{2}\Vert\, v - k S^\top\Vert^2$.

The gradient is:
\[\frac{\partial \mathcal{L}}{\partial S} = S k^\top k - v^\top k\]

The gradient descent formula (with dynamic weight decay $w_t$ and learning rate $\eta_t$) is:
\[S_t = S_{t-1} \operatorname{diag}(w_t) - (S_{t-1} k_t^\top k_t - v_t^\top k_t)\operatorname{diag}(\eta_t)\]
which equals:
\[S_t = S_{t-1} \left(\operatorname{diag}(w_t) - k_t^\top k_t\operatorname{diag}(\eta_t)\right) + v_t^\top k_t\operatorname{diag}(\eta_t)\]

In RWKV-7 I use the generalized formula:
\[S_{t} = S_{t-1} (\operatorname{diag}(w_t) + \textbf{a}_t^\top \textbf{b}_t) + \textbf{v}_t^\top \textbf{k}_t\]
where a reasonable choice of initial values is $\textbf{a} = -k$, $\textbf{b} = k\cdot\eta$, $\textbf{v} = v$, $\textbf{k} = k \cdot \eta$.

(update: basically diagonal + rank1 because it's good for parallelization. we can do rankn by adding more terms but it will be slower)

\textbf{RWKV-7 uses $\{k_t, v_t\}$ to test-time-train an internal model and uses $\{r_t\}$ as input for this model.} It overcomes the $\mathsf{TC^0}$ limitation of QKV-softmax-attention transformers (and RWKV-6, Mamba, Mamba-2, xLSTM, GLA, ...), while still being efficiently trainable on GPUs.

Such ideas can be traced back to fast weights (1991) by Jürgen Schmidhuber, the delta rule (1959) by Bernard Widrow, and Hebbian learning (1949) by Donald Hebb. RWKV-7 is a generalized scalable version with more tricks to make it actually great for LLMs. Details are in my open-source code.
#
Because the internal model is $v \approx k S^\top$, the output for input $r$ is $r S^\top$, and the pseudocode is:
\begin{lstlisting}
    for t in range(T):
        sab = torch.einsum("ik,k,j->ij", state, a[t], b[t])
        state = state * w[t] + sab + torch.einsum("j,i->ij", k[t], v[t])
        out[t] = torch.einsum("j,ij->i", r[t], state)
\end{lstlisting}
\vspace{-8pt}
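
As a quick numerical check of the derivation, the sketch below runs one gradient-descent step and one step of the generalized formula with $\textbf{a} = -k$, $\textbf{b} = k\cdot\eta$, $\textbf{k} = k\cdot\eta$, and confirms they agree. It uses plain Python lists instead of torch so it runs anywhere; the helper names and the tiny test values are assumptions for illustration only.

```python
def outer(x, y):
    """x^T y for row vectors x and y (a len(x) by len(y) matrix)."""
    return [[xi * yj for yj in y] for xi in x]


def diag(d):
    return [[d[i] if i == j else 0.0 for j in range(len(d))] for i in range(len(d))]


def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def add(A, B, sign=1.0):
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def gd_step(S, k, v, w, eta):
    """S diag(w) - (S k^T k - v^T k) diag(eta)."""
    grad = add(matmul(S, outer(k, k)), outer(v, k), sign=-1.0)
    return add(matmul(S, diag(w)), matmul(grad, diag(eta)), sign=-1.0)


def generalized_step(S, w, a, b, v_gen, k_gen):
    """S (diag(w) + a^T b) + v^T k."""
    return add(matmul(S, add(diag(w), outer(a, b))), outer(v_gen, k_gen))


def readout(S, r):
    """out = r S^T."""
    return [sum(rj * sij for rj, sij in zip(r, row)) for row in S]


# Tiny made-up example: value dim 2, key dim 3.
S = [[0.5, -1.0, 0.25], [2.0, 0.0, 1.5]]
k, v = [1.0, -0.5, 2.0], [0.3, -0.7]
w, eta = [0.9, 0.8, 0.95], [0.1, 0.2, 0.05]

k_eta = [ki * ei for ki, ei in zip(k, eta)]
S_gd = gd_step(S, k, v, w, eta)
S_gen = generalized_step(S, w, [-ki for ki in k], k_eta, v, k_eta)
max_diff = max(abs(x - y) for rx, ry in zip(S_gd, S_gen) for x, y in zip(rx, ry))
print("max abs difference:", max_diff)
print("readout:", readout(S_gen, [1.0, 0.0, -1.0]))
```

The two updates should match up to floating-point rounding, which is the sense in which the generalized formula contains the plain delta-rule step as a special case.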
fresh mulch
#

this should go into section 3

obsidian quest
#

use log scale Y-axis for (a) RMS of RWKV state entries @gusty condor

misty igloo
obsidian quest
#

Table 3 4 9 17 19, put stronger models on top

misty igloo
obsidian quest
#

however our bamboo results are good so we can show them 🙂

misty igloo
misty igloo
#

anyway it really doesn't matter to me whether or not we include bamboo, but if we show bamboo it has to include those models.

misty igloo
misty igloo
fringe egret
#

Okay, I'll try testing it on other base models.

obsidian quest
#

figure 3 15 16, please test pg19 (not proofpile) @gusty condor

#

proofpile likely has bad information density. and test rwkv WORLD models

gusty condor
#

Yes, I tested World models.

gusty condor
obsidian quest
steady ether
obsidian quest
#

oh why 0.8+ but you mentioned 46%

misty igloo
steady ether
misty igloo
#

This was jellyfish's PG19 result - we could use that in figure 3 if you prefer it over proofpile...

obsidian quest
#

probably code bug

#

please test other models too

misty igloo
obsidian quest
#

ok then we should pick the "middle 16384 token" for each data item

#

if the length = X, pick token X*0.5-8192 to X*0.5+8192
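
A minimal sketch of that slicing rule, assuming each data item is already a list of token ids. The function name and the choice to keep short documents whole are assumptions; they could just as well be skipped.

```python
def middle_window(tokens, window=16384):
    """Return the central `window` tokens: X*0.5 - window/2 up to X*0.5 + window/2."""
    n = len(tokens)
    if n <= window:
        # Assumption: documents shorter than the window are kept as-is.
        return list(tokens)
    start = n // 2 - window // 2
    return list(tokens[start:start + window])


doc = list(range(40000))          # stand-in for a tokenized PG19 book
chunk = middle_window(doc)
print(len(chunk), chunk[0], chunk[-1])
```

Taking the middle of each book avoids the preamble and table of contents at the start as well as any back matter at the end.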

misty igloo
#

theyre books so the beginning is often some fairly standard preamble, contents etc

obsidian quest
steady ether
sonic relic
fringe egret
iron parrot
#

RWKV-7 world seems to be a bit 'overfitted' to 4k ctx: when the context exceeds 4k tokens, the ppl increases.

obsidian quest
iron parrot
#

Here's how RWKV World performs on PG19 (16k ctx) after removing the first 2048 tokens

#

After 16k, RWKV-7's ppl increases

#

That's the 'overfitting to 4k' phenomenon I mentioned before

young sparrow
iron parrot
obsidian quest
#

pls verify Mamba2-1.3B glue 46.1 in table 17. check its individual components vs rwkv7

obsidian quest
obsidian quest
misty igloo
misty igloo
#

we found that it was quite a bit better than the base model for NIAH as context length grew, so hopefully it should be for PG19 loss as well

young sparrow
obsidian quest
obsidian quest
#

highlight all RWKV model names in Table 5

pls search for my ID here to see all suggestions

iron parrot
#

So far, my test results on the proof pile and PG19 show:
All pile models (v4 v5 v6 v7) show decreasing loss as sequence length increases.
For world models (trained on way more tokens), the behavior varies: v4's loss explodes with longer sequences, v5 and v6's loss decreases then stabilizes, while v7's loss slightly increases after about 16k tokens (long-context fine-tuning can fix this).
This is why I call it some kind of "overfitting": more training actually hurts generalization. The severity ranking is: v4 > v7 > v6 = v5.
I think we should show separate loss charts for pile and world models, discuss this issue, and include comparisons with other models.
@obsidian quest @young sparrow @misty igloo

obsidian quest
iron parrot
gusty condor
#

Maybe it's time for us to shrink the state size once V7 has better state utilization.

obsidian quest
#

when you do ctxlen extension, use LONG data, for much better results

keen tartan
#

RWKV-7 World 0.1B and 0.4B LM Evaluation Harness Benchmarks

#

English focus

#

Multilang focus

gusty condor
keen tartan
#

0-shot.

gusty condor
#

I tested 5-shot (in order to match Qwen's performance in the technical report)

keen tartan
#

I'll check.

gusty condor
#

Usually 0-shot is 1-2% worse than 5-shot

misty igloo
misty igloo
misty igloo
#

Transformers, too, (including Qwen) are typically post-trained to increase context length so while this isn't exactly a win for us it's interestingly comparable.

iron parrot
#

a bit noisy since pg19 test set only has 100 samples

keen tartan
#

MMLU has been recomputed with 5-shot.

gusty condor
#

@obsidian quest This is on PG19

gusty condor
misty igloo
#

she was running them too but I think she had some tech issues

gusty condor
#

Should figure 4,5,6,7 be unified into a large figure?

#

Also please include v6 results too

misty igloo
#

@gusty condor the pawsx number for 2.9B looks incorrect to me... could you check that it was entered correctly?

gusty condor
#

By the way, I think these two pixels (as seen in other states) are used to pin the GroupNorm, preventing it from drifting. Now that v7 has O(1) state size, may we remove that GroupNorm?
Or, I think we can use GroupRMSNorm for that. Therefore we need one value to pin an RMSNorm, instead of two values to pin a LayerNorm.

gusty condor
misty igloo
#

like the bigger model and more training made it a lot worse than it was previously

gusty condor
#

There are some inverse scaling problems

young sparrow
young sparrow
misty igloo
gusty condor
#

Most of these models can't get a nontrivial score on MMLU

misty igloo
#

still working on it, but interesting initial fit lines

#

(somehow excel is being annoying but the unlabeled ones are the other goose world3 models)

young sparrow
obsidian quest
obsidian quest
misty igloo
obsidian quest
#

yeah can try

nova frost
#

I can add a PR with the fixes if anyone wants to try it

nova frost
#

PR here. mostly fixed some spacing and capitalization issues for the european tasks

brisk bronze
#

@gusty condor I ran glue for rwkv7-1.47b-pile in table 17 and the average of the subscores is like 8 percentage points higher than the glue score it computes, and rwkv7-421M-pile is 12 percentage points off. the glue score it computes is same as in paper
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-1.47B-pile/results_2025-03-05T00-33-06.json
https://github.com/jannalulu/lm-evaluation-harness/blob/main/results-0.4.3/VisualRWKV__rwkv7-1.47B-pile/results_2025-03-05T01-44-07.json

#

lambada.o was exactly the same tho

#

mamba2 on the other hand, subscores of glue and computed glue score are only different by like 0.5 percentage points, probably due to rounding. they were run with lm-eval-0.4.3 and fp32 too

obsidian quest
brisk bronze
gusty condor
brisk bronze
#

fwiw rwkv7 1.47b had higher scores on subtasks except for 1 or 2 iirc on glue

gusty condor
#

I think: like each single problem is given equal weight, rather than each task

tropic minnow
#

i think the indices for equation 8 are wrong

#

u_{t,j} should be only u_t and be a scalar, not of head dimension, as it is an inner product

#

and if we follow dirac notation (bra-ket) for the inner product, there should be a , or a | between r and diag(rho)

#

but using dirac notation and einstein notation in the same equation is a bit confusing imo

#

so: dirac notation: "add , and remove j subindex" or einstein: "remove <> and diag() "

#

votes?

#

either way, the in R^{D/h} should be in R so im making that change already

#

thoughts for changing Fresh for novel / new / recent ?

tropic minnow
#

i think we can fuse the gating in 4.1.4 into equation 11, similar to equation 10

#

in the Pseudocode For RWKV-7 (appendix G) i would separate the weight projections from the time recurrent operation for clarity

#

happy to take this task

gusty condor
tropic minnow
#

ok recent

tropic minnow
#

github repo needs to be updated

#

to include v7 code from blinks repo

#

i think part of Appendix C (until theorem) can be moved to methods and the proof can be kept in the appendix

gusty condor
tropic minnow
#

@misty igloo @gusty condor if we have a bit more time (say 20 hours) i would like to include a more theoretical motivation and comparison of rwkv7 vs rwkv6 vs other linear RNNs. I think this can give the paper a more theoretical grounding rather than the empirical vibe "we mixed 30 things and the result is cool ~sota"

tropic minnow
gusty condor
#

I tried deriving that and found the online objective being overly complex.

tropic minnow
#

in fact, in that table longhorn's claimed squared associative objective is wrong, as their simplification for practical considerations effectively makes it an inner product objective

#

yes @gusty condor bc rwkv7 is explicit gradient descent, and those objectives are derived from implicit gradient descent algorithm

#

so a proper explanation is what i want to include

tropic minnow
#

i added a citation for the OG scaling laws on lstm paper by baidu on 2017 in the introduction

obsidian quest
#

seems nan is unrelated to adam eps

misty igloo
nova frost
#

sounded reasonable to me. paws-x has a quite non-standard format

misty igloo
misty igloo
misty igloo
obsidian quest
misty igloo
misty igloo
gusty condor
#

We have no limit for arxiv, so we should submit to arxiv asap.

misty igloo
#

Yes, though I think we should be purposeful about which experiments are shown and in which order in the main section, so that it is most impactful for the reader.

gusty condor
keen tartan
gusty condor
#

Not really. We need the formula, otherwise it may be inconsistent

keen tartan
misty igloo
#

from the docs in v 0.4.3:
weight_by_size: bool = True whether to perform micro- averaging (True) or macro- (False) averaging of subtasks' accuracy scores when reporting the group's metric.

#
```python
class AggMetricConfig(dict):
    metric: Optional[str] = None
    aggregation: Optional[str] = "mean"
    weight_by_size: Optional[str] = False
    # list of filter names which should be incorporated into the aggregated metric.
    filter_list: Optional[Union[str, list]] = "none"
```
#

notice it defaults to False

#

@nova frost I only see group set for each task in this 0.4.3 version of GLUE so would it end up getting non-weightbysize?

gusty condor
misty igloo
#

And why are we only learning of it one day before you say there is a deadline?

#

This is really not okay.

nova frost
fresh mulch
#

@obsidian quest btw, i've been running ablations on some of the design choices (appendix K.2 at the moment) and find that using the same removal/replacement (k, kk) keys has competitive performance with current baseline Goose. For instance it gets higher acc on minipile validation. What kind of difference have you seen here in your experiments, or is there an intuitive reason why to do it?

gusty condor
#

Not me, but we should be as quick as possible

obsidian quest
obsidian quest
fresh mulch
#

no problem, thanks! just curious, it's also the least intrusive ablation i tested

gusty condor
#

RWKV-7-1.47B-pile

#

@brisk bronze

brisk bronze
gusty condor
#

Post your subscore please

gusty condor
#

And mamba-2's subscore

misty igloo
# fringe egret Okay, I'll try testing it on other base models.

hey just checking in, did you end up running bamboo on the other models we need in order to include it in the paper?
Also, I'm a little confused about the author contributions section - what is the "Compilation of the RWKV World 3.X Corpus?" I am the one who put together the listing, is there some dataset you put on huggingface or something?

gusty condor
misty igloo
#

And neither are the chat examples.

#

And we need more recent models like mamba 2 if we are going to include the bamboo results.

young sparrow
#

After continuing to read the messages it seems like most people are on the same page as the above. Also, I don't think anyone's going to get too upset if we do hypothetically miss a deadline promised in a blog post by a few days. That said, this is why EleutherAI has a standing policy of not committing to release dates ahead of time.

gusty condor
#

I apologize for pushing too hard on an artificial deadline earlier. I am aware that deadlines can be motivating but the urgency can harm the cooperation. Thank you again for your patience.

#

Moving forward, I will put paper quality into the first place, and avoid imposing too much burden on others.

misty igloo
#

Thank you - just know that we all appreciate the immense amount of hard work you're putting into this paper!

brisk bronze
#

looks like lm-eval 0.4.3 was using weighted average for glue by number of problems so the glue score it outputs is correct.
from api/metrics.py:

```python
    # A helper function that is used to aggregate
    # subtask scores cross-task.
    # TODO: does not hold for non-mean aggregations
    if not weight_by_size:
        sizes = [1] * len(sizes)

    assert len(metrics) == len(sizes)

    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
```

`validation split sizes: 
cola: 1043
mnli_matched: 9815
mnli_mismatched: 9832
mrpc: 408
qnli: 5463
qqp: 40430
rte: 277
sst2: 872
stsb: 1500
wnli: 71`
#

(also explains why mamba2-1.3b was higher on glue even though its subtasks scores don't look super different at first glance)
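
To see how much the weighting matters, here is a small sketch that applies the same size-weighted mean to the split sizes listed above and compares it with a plain per-task mean. The per-task accuracies are made-up placeholders, not real results, and whether all ten splits actually enter lm-eval's GLUE group score is not checked here.

```python
def aggregate(metrics, sizes, weight_by_size=True):
    """Mean of subtask metrics, optionally weighted by number of examples."""
    if not weight_by_size:
        sizes = [1] * len(sizes)
    assert len(metrics) == len(sizes)
    return sum(m * s for m, s in zip(metrics, sizes)) / sum(sizes)


# Validation split sizes as posted above.
split_sizes = {
    "cola": 1043, "mnli_matched": 9815, "mnli_mismatched": 9832, "mrpc": 408,
    "qnli": 5463, "qqp": 40430, "rte": 277, "sst2": 872, "stsb": 1500, "wnli": 71,
}

# Placeholder accuracies (NOT real scores): every task 0.70 except qqp at 0.40.
scores = {task: 0.70 for task in split_sizes}
scores["qqp"] = 0.40

tasks = list(split_sizes)
metrics = [scores[t] for t in tasks]
sizes = [split_sizes[t] for t in tasks]

micro = aggregate(metrics, sizes, weight_by_size=True)
macro = aggregate(metrics, sizes, weight_by_size=False)
qqp_share = split_sizes["qqp"] / sum(sizes)

print(f"size-weighted mean: {micro:.3f}")
print(f"per-task mean:      {macro:.3f}")
print(f"qqp share of weight: {qqp_share:.2%}")
```

With one task holding most of the examples, the size-weighted number mostly tracks that task, which is how the group score and the mean of subscores can sit several points apart.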

tropic minnow
misty igloo
# brisk bronze looks like lm-eval 0.4.3 was using weighted average for glue by number of proble...

after looking into what GLUE is made of, I think we should remove it from the paper
many of the sub-tasks contribute only a percent to the total so the weightings make no sense, causing the numbers reported to be more like 75% QQP than any kind of actual average
and QQP is a pretty weird benchmark, which should probably be run multi-shot to really work well
since we're not going to do that, let's just remove GLUE entirely

I also want to remove paws-x from the paper:
paws-x was broken in v0.4.3, see lm-eval https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
and baber has NEW fixes, that aren't even in the most recent lm-eval: https://github.com/EleutherAI/lm-evaluation-harness/pull/2759
Seems to me that it's too messed up and should be removed.

#

Later versions of lm-eval don't even spit out an aggregate score for GLUE, and our aggregate score doesn't even include all the subtasks. The other evals have been more stable across lm-eval versions, which will help future authors compare to our results. These two benchmarks are simply too wild and messy.

keen tartan
#

QQP is Quora Question Pair Paraphrase subtask.

#

The task is whether two sentences are semantically equivalent.

misty igloo
keen tartan
#

But we should really think well about which tasks to show.

#

We can compute them anyway and consider whether to use them or not. Any suggestions for substitute tasks?

misty igloo
#

I don't think we need a substitute.

keen tartan
#

Yeah, we could also just drop them. True. Less hassle in the end.

obsidian quest
gusty condor
gusty condor
#

I suggest blimp (but the scores will be really high)

misty igloo
#

and tbh dropping glue seems to harm us a bit on the flops vs acc chart, so at least its not really in our favor anyway

obsidian quest
#

glue and superglue are all noisy

obsidian quest
dawn pewter
#

@gusty condor What happens if the value of c is large (e.g., equal to the wkv matrix dimension)? I found that if c gets this big, it might be possible to simulate (reverse) the Boolean transition matrix with a single step transition

keen tartan
misty igloo
#

@gusty condor @keen tartan is Table 13 correct? Not sure where this breakdown comes from...

keen tartan
misty igloo
#

so it is in fact not correct (yet)

keen tartan
#

I categorized all datasets with these classes.

#

I intend to automatically classify all individual datasets.

#

Like world languages, artificial or natural, and categories.

misty igloo
#

okay, did you co-ordinate the results with Blink? if not, please do so we can get the updated table of categories into the paper

keen tartan
#

I will do so.

misty igloo
#

thanks!

#

(I'm just going through making sure we have everything right and aren't missing things that need updates)

misty igloo
#

@obsidian quest do you happen to have a checkpoint for the Pile models at 300B tokens rather than 332B?

obsidian quest
misty igloo
#

I was surprised by this.

#

My understanding is it was so that people could compare directly to Pythia.

misty igloo
gusty condor
#

No! Do we have to retrain these models?

misty igloo
#

Doesn't sound feasible to me 😦

#

I'm not sure what to do, but it's fine in our FLOPs vs Acc plot since it's adjusted for training length

#

latest version of that, not final tho - I still have some work to do on deciding the exact flops counts

gusty condor
misty igloo
#

and counting flops is kind of only a vaguely correct metric in general - it doesn't directly dictate how fast GPUs run the model

gusty condor
#

Your chart is not accurate, you should subtract the embeddings of RWKV

misty igloo
#

that may push them apart slightly

#

(we also did not include GLUE in this average because I think its a broken eval)

gusty condor
#

I can help you with that

misty igloo
#

it's not super clear exactly which FLOPs formulas we should use... the attention mechanisms add a small amount, esp because Mamba does the 2x expansion thing
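
One rough option, sketched below, is the common rule of thumb of about 6 FLOPs per non-embedding parameter per training token. This is a generic approximation and not the formula used for the chart; it ignores the extra cost of the time-mixing / attention-style mechanisms mentioned above, and all parameter counts in the example are hypothetical.

```python
def approx_training_flops(total_params, embedding_params, tokens):
    """Rule-of-thumb estimate: 6 * (non-embedding params) * (training tokens)."""
    non_embedding = total_params - embedding_params
    return 6 * non_embedding * tokens


# Hypothetical model sizes; only the token counts (332B vs 300B) come from the chat.
run_332b = approx_training_flops(total_params=1.5e9, embedding_params=0.1e9, tokens=332e9)
run_300b = approx_training_flops(total_params=1.5e9, embedding_params=0.1e9, tokens=300e9)

print(f"332B tokens: {run_332b:.3e} FLOPs")
print(f"300B tokens: {run_300b:.3e} FLOPs")
print(f"ratio: {run_332b / run_300b:.3f}")
```

Under this estimate the extra training tokens show up as a proportional shift along the FLOPs axis, so runs of different lengths can be placed on the same chart.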

gusty condor
#

Can you send me the source

obsidian quest
#

models trained using different amt of data cant be compared like this

#

models trained with smaller amt of data (such as mamba2) will appear far better than those with more data (such as qwen llama), because if we want optimal loss vs flops we need to follow scaling laws, which no one follows in practice for obvious reasons

dawn pewter
#

I think for readers unfamiliar with RWKV architecture, the "Blocks" label within the L Blocks notation in this diagram might cause confusion. Specifically, there's ambiguity about whether "Blocks" refers to the entire module or a specific component (like the Time Mix unit) within it. To enhance clarity, perhaps relocating the "L" designation outside the block representation would create a more intuitive visual hierarchy.

gusty condor
#

I see

gusty condor
#

@dawn pewter

whole ember
misty igloo
#

If so, it's probably important to do experiments comparing the results with some other architecture. (could be RWKV-6, but even better if it's something else)
If you don't have time for this now, it could still be possible for us to add it in a future version of the paper if it's complete before we submit to COLM. COLM deadlines are March 20th for abstract, March 27th for paper.

gusty condor
#

@iron parrot Please adjust figure 3 and 4 so that the colors of v7, v6, v5, v4, mamba, mamba-2, v7-128k are consistent across two images

young sparrow
misty igloo
#

(I know you said that pythia does actual 300B)

young sparrow
#

From the Pythia paper

gusty condor
#

I think yes, but I think very few people are aware of that. I bet no reviewer will raise questions on this specific point.

misty igloo
#

And to show correct scientific results that do not contain known errors.

#

We can show anything we have, as long as we point out the distinctions.

gusty condor
#

All RWKV models there are trained with 332B Pile, so comparisons are still valid

misty igloo
#

If someone has the resources and we want to compare to those others in table format we could train just RWKV7-1.47B on 300B tokens of Pile. But imho the comparison this way is kind of pointless because they all use somewhat different parameter counts. Probably mainly due to differences in weight tying, at least in Mamba's case.

young sparrow
#

Depending on what the claims we want to make are, I don't see a huge issue in using models with slightly different parameter counts and slightly different training token counts

keen tartan
#

Need to estimate the requirements.

#

Do we have The Pile binidx files somewhere already?

misty igloo
#

The reality is that these RWKV-7 models are both more parameters and 10% more tokens trained than the ones being compared to.

gusty condor
obsidian quest
misty igloo
misty igloo
#

@keen tartan I'm seeing open-web-math, algebraic-stack (both of which point to proof-pile-2) and FLAN got added to the v3 dataset listing and citations - do you know why these were added? Afaict they were not in my original v3 list, based on what Blink originally sent me

misty igloo
#

From looking at the document history, it appears you added them to the dataset listing on 26th February, 1:26 pm ET

gusty condor
gusty condor
#

You can't only pick a benchmark when it's advantageous for you.

misty igloo
#

I'm definitely not trying to do that - and I agree that's bad

#

But let's not use glue in any future paper, because it's not well constructed

#

I think we can add it to the flops chart (I have no idea if the result will benefit or harm rwkv there) by applying the weighting formula manually

#

@brisk bronze please take a look at how we can do that if you get a chance to

obsidian quest
#

use avg for glue, not weighted by number of items in each subset @misty igloo @gusty condor

#

because weighted by items makes no sense here

whole ember
#

I will provide a stronger benchmark, and you can update it in the subsequent RWKV7 paper.

keen tartan
# misty igloo @keen tartan I'm seeing open-web-math, algebraic-stack (both of which p...

Also ccnews. I made an initial itemized list of the World v3 corpus in an excel sheet as it was suggested and Blink added the missing datasets. Based on this I added them to the paper as well. I understood that the original objective was to identify missing datasets. We also found that the DeepMind Mathematics dataset dm_math was part of the world v2 as a constituent of The Pile, but we forgot to mention it in the Eagle & Finch paper. I tried to mention it as a footnote (b) to the table about the World v2.1, but perhaps there is a better place for it.

acoustic knoll
keen tartan
#

For The Pile Comparison experiment we need to segment text with the GPTNeoX Tokenizer. I have been using the Rust implementation only with the World tokenizer. How would you go about specifying a different tokenizer?

#

I think we need to change the library to support other tokenizers if I am not mistaken. I see let tokenizer = rwkv_tokenizer::WorldTokenizer::new(None).unwrap(); is hardcoded right now. This matters in particular for supporting other modalities too.

#

From my tests it seems to be correct as we can decode the original text from it. It is however split across many binidx files rather than a single pair of files.

#

Added SmolLM2-1.7B

#

Removed PAWS-X and added SmolLM2-1.7B too.

#

@brisk bronze @gusty condor We need to share the lm_eval results files to calculate MMLU with either the weighted or non-weighted average ourselves.

acoustic knoll
gusty condor
keen tartan
#

Overleaf in dark/night mode. Finally eye strain reduced!

misty igloo
#

This weighting is a result of a mistake in the old version of lm eval harness used.

misty igloo
#

I updated the links to proof pile to point to the proper subdirectories

gusty condor
gusty condor
misty igloo
gusty condor
#

Probably overwritten?

#

Oh, I found them

#

0.1B and 0.4B tested by @keen tartan

keen tartan
quaint ingot
keen tartan
#

I ran 0-shot and 5 shot for MMLU separately. Sometimes I ran evals separately per task to better organize. This is why there are multiple files per model. I organize each eval set in a folder per model.

#

I add the other reference models evals too.

gusty condor
gusty condor
#

We should provide convenience for reviewers to verify our results. i.e., using the default averaging method of the evaluation framework without making any tweaks to the results. This simplifies the reproducibility process and avoids potential accusations of "tweaking" the results to favor our model.

misty igloo
#

I never requested that we change the weighting (Blink did tho), I requested that we drop the eval entirely because it uses a bad weighting in that specific lm eval version

#

in newer lm eval versions it does not print an average at all for glue

#

glue also includes other components which do not contain accuracy at all, and these are not reflected in the accuracy score

#

This is probably originally my fault for including glue in the prior paper without checking it thoroughly beforehand

gusty condor
#

We can still include it anyway, because we are already averaging over 9 benchmarks.

unborn lintel
keen tartan
#

Who did the evaluations for Llama3.2 1B/3B and Qwen2.5 1.5B/3B?

keen tartan
#

@misty igloo I like Figure 3: FLOPs vs. Average Accuracy. I think the title is redundant with the figure caption.
Better label the axes with average accuracy and log scale compute in TFLOPs instead of a title at the top.

#

Perhaps express the accuracy in %

#

I also suggest trying to mitigate the overplotting of labels on each other for Mamba and RWKV7-Pile. I know plotting software often makes it hard to position them apart.

#

I think the point labels could be shorter just indicating the size of the model as the architecture/dataset is specified in the legend and encoded in color already, e.g. 0.1B, 0.4B, 1.5B, 2.9B, etc.

#

Should we add transformers to the PG19 long range context loss plots?

misty igloo
#

actually, im going to increase the text size...

keen tartan
#

Yes, please. There is plenty of space.

misty igloo
keen tartan
#

Make the labels shorter.

#

Just the size of the model please.

#

Put the legend in bottom right or top left.

#

Make the axis tick text a bit bigger, and the axis labels too.

misty igloo
#

hm I dislike it with just the size, but I understand your reasoning

keen tartan
#

I see.

#

In my humble opinion this plot is very important. We should make it shine.

misty igloo
#

working on it...

keen tartan
#

Please put % in parentheses after the label, i.e. Average Accuracy (%), and remove it from the y-axis numbers.

misty igloo
#

updated

keen tartan
#

Ohh!

#

Already so much better!

misty igloo
#

argh llama

#

lol

keen tartan
#

Thank you. Llama got lost

#

Can someone please look for some lost Llamas?

#

^^

misty igloo
#

my formula was accidentally off before, and this revealed that mistake so its a happy accident that it got lost

keen tartan
#

Also for x-axis use 10²... representation with superscript.

misty igloo
#

not sure I can do that
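
One plotting-library-independent way to get such labels is to build the tick text by hand with Unicode superscripts. This is only a sketch: `power_of_ten_label` is a hypothetical helper, and the thread does not say which plotting software is used.

```python
# Map ASCII digits and the minus sign to their Unicode superscript forms.
_SUPERSCRIPTS = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")


def power_of_ten_label(exponent):
    """Return a tick label such as '10²' for an integer exponent."""
    return "10" + str(int(exponent)).translate(_SUPERSCRIPTS)


# Example: labels for ticks at 10^2 .. 10^5.
print([power_of_ten_label(e) for e in range(2, 6)])
```

If the figure happens to be drawn with matplotlib (an assumption), a function like this could be wrapped in `matplotlib.ticker.FuncFormatter` after converting the tick value to its base-10 exponent.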

#

this is yet another case where GLUE is messing up something

#

this time llama is gonna look horrible as a result

keen tartan
#

Hmm

misty igloo
#

I really hate this benchmark, at least the way we're using it (which is terrible imho)

#

I might remove llama entirely because I think it's a completely unfair representation of it

#

it literally scores worse than its own 1B, that was DISTILLED from the same model, on GLUE the way we calculate it

keen tartan
#

We can iterate over it until it is correct.

#

Let me look into it.

#

How do you calculate the compute complexity? I mean estimate.

misty igloo
#

it took a bunch of work

keen tartan
#

I can imagine.

misty igloo
#

the basic formula is 6 x params x tokens
but there are variations in the models that matter

#

like some use tied embeddings, and embedding doesnt really take flops (it's essentially a lookup table) but de-embedding for the lm_head does

#

and rwkv was upgraded from prior models which had to be calculated separately
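
A rough sketch of that estimate, under stated assumptions: the helper names and all numbers below are hypothetical, the input embedding table is treated as a free lookup, and a model trained in several stages is summed stage by stage. It is not the exact calculation used for the figure.

```python
def training_flops(total_params, input_embedding_params, tokens):
    """Approximate training compute as 6 * compute-bearing params * tokens.

    input_embedding_params is the size of the input embedding table to
    exclude, since a lookup costs almost no FLOPs. With tied embeddings the
    shared matrix still does work as the lm_head, so pass 0 in that case
    (an assumption of this sketch).
    """
    return 6 * (total_params - input_embedding_params) * tokens


def staged_training_flops(stages):
    """Sum the estimate over stages, e.g. a model upgraded from a prior one.

    Each stage is a (total_params, input_embedding_params, tokens) tuple.
    """
    return sum(training_flops(*stage) for stage in stages)


# Hypothetical model: 1.5B params, 0.1B of them in an untied input embedding,
# trained on 1T tokens and then upgraded with a further 0.5T tokens.
stages = [
    (1_500_000_000, 100_000_000, 1_000_000_000_000),
    (1_500_000_000, 100_000_000, 500_000_000_000),
]
print(f"{staged_training_flops(stages):.3e} FLOPs")
```

Differences in the cost of attention, or of whatever replaces it, are not modeled here and would need a per-architecture correction.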

keen tartan
#

params x tokens is already a good rule of thumb. Yeah, the devil lies in the details.

#

Llama 3.2 1B/3B were distilled from Llama 3.2 8B, right?

misty igloo
#

other minor differences include the cost of the attention calculation or replacement thereof

misty igloo
misty igloo
#

there is no 3.2 8B

keen tartan
#

All right.

#

Let me think about it.

#

Distillation is kind of like cheating.

#

Include SmolLM2

#

It was trained from scratch via pretraining.

#

No distillation.

misty igloo
#

They weren't just distilled - the starting point was actually 3.1 8B cut up into smaller parts!

keen tartan
#

I will think about the issue and look around.

misty igloo
#

I don't really want to add more models to this plot though

#

Mostly I'm just continually annoyed by glue messing up all the results

keen tartan
#

Yeah, don't worry too much right now. You did already pretty well with all those obstacles.

#

I will get back with some concrete solution suggestions.

obsidian quest
gusty condor
#

Who is doing these experiments?

keen tartan
gusty condor
#

The x-axis of (c) is not consistent with others

gusty condor
keen tartan
gusty condor
#

Uploaded

keen tartan
obsidian quest
#

Suggestions:

  1. #1103039376184852622 message

  2. #1103039376184852622 message

  3. #1103039376184852622 message

misty igloo
# tropic minnow yes im doing #3

#3 is already in the document, but not grouped together all in one place in this way, and we should credit schmidhuber/widrow/hebb etc.

#

(it's instead presented in the order of the current narrative)

#

feel free to fix it up tho!

tropic minnow
#

@misty igloo any opposition to moving appendix D and E to 3. Architecture?

misty igloo
tropic minnow
misty igloo
# tropic minnow <@1007072846960410685> oposition for moving appendix D and E to `3. Architecture...

wrt Appendix D, it could be okay for theorem 2, but my current view is that Theorem 3 is not realistic under actual conditions for RWKV-7
there is ongoing work to find a proof that would work without extra tokens, but I am somewhat doubtful it will happen
and the current proof of Theorem 3 is still not explicit enough about this fact that it is impossible for actual RWKV models to execute without injecting multiple tokens in between each input token

#

I also think that generally when someone reads the paper they want the overview not every detail inline

#

The main paper is already 18 pages long

tropic minnow
#

following the structure of [Intro][method][results][additional] i propose the following reordered abstract:

#
We present RWKV-7 "Goose", a new iteration of linear RNNs featuring a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show this architecture can solve problems outside of TC0 under standard complexity conjectures, exceeding the capabilities of transformers while retaining parallelizability of training.
We trained models up to 3B parameters on a new dataset that we name World-V3. These models exhibit improved performance across a wide range of benchmarks and state-of-the-art results on downstream tasks, despite being trained on dramatically fewer tokens than other models in their class, including LLaMA 3.2 and Qwen2.5.
To foster openness, reproduction, and adoption, we release all our models on Hugging Face and our training and inference code on GitHub, all under the Apache 2.0 License.
keen tartan
#

The intention is to release it as a kind of "meta-dataset" (dataset of datasets), as the majority of subsets are available on HuggingFace. Those that are missing we could add as separate dataset repos and link them all together.

misty igloo
#

I have a comment in the doc asking about this...

misty igloo
keen tartan
gusty condor
tropic minnow
young sparrow
#

The circuit complexity stuff should be an aside, and probably the second to last sentence in the abstract. Right before the comment about releasing stuff

keen tartan
#

The current first sentence is very catchy.

young sparrow
#

I think we agree?

keen tartan
#

Yes, the complexity stuff should not be at the beginning perhaps.

#

But it is also important to highlight what is novel of the suggested architecture and how it was achieved.

young sparrow
#

I agree.

misty igloo
#

@keen tartan is it possible to put together a list of which sub-datasets within the entire World v3 corpus are no longer available online?

gusty condor
#

Wait! What happened?

#

They do sum up to 3119.2

misty igloo
#

Afaict it never did

gusty condor
#

Yes, this is indeed correct

keen tartan
#

All available datasets are linked. I will get the list down to the problematic ones.

misty igloo
#

@gusty condor yet the table has not changed since then

#

that's why I temporarily commented out the table, because it never got updated after that

#

if I misunderstood, let me know - it seemed like Blink wanted to update it to be accurate in some way

keen tartan
#

I did combine all and Blink went over and added missed datasets.

#

I am still searching for those.

#

I will pin down all.

misty igloo
keen tartan
#

It was my suggestion. As the categories seem rather arbitrary.

#

For instance something can be both code as well as web.

#

Like Stack Overflow data.

#

An ontology might be helpful in such a case.

misty igloo
#

Well we don't have to put this summarized list of category breakdowns into the paper. Let's only put it in if we have one that we think is helpful and correct.

gusty condor
#

I think it is very important. Our dataset is rich in novels and fiction, but falls short on math and code compared to the Qwen2.5 series. I think this is an important piece of information.

keen tartan
#
  1. Can be easily fixed.
#

For 2 & 3 I am not certain yet how to resolve.

#

Does anyone have backup copies of Guanaco and/or Books3?

#

I may have one on some old drive, not sure.