#RWKV-papers


misty igloo
#

nah it's easy now that I found my spreadsheet contains the data

#

I'll make those changes

last mauve
#

Great! Thanks.

#

I'm doing a pass now

misty igloo
misty igloo
#

@last mauve also, once it's resized maybe you can port that to arxiv as well

void quartz
#

let me know if you need any particular data.... it's probably all in there, but yea the raw data is a giant pile to sort

misty igloo
gusty condor
#
  1. Done (Should I add speed experiments for Cahya's Rust implementation?)
misty igloo
gusty condor
misty igloo
#

@tropic minnow see above.. maybe you can remove some people's edit access or reset sharing on the arxiv doc so we can re-add editors from zero?

young sparrow
tropic minnow
#

Okay will do soon

last mauve
last mauve
misty igloo
#

still says this to me when I use your edit link:

This project has more than the maximum number of collaborators allowed on the project owner's Overleaf plan. This means you could lose edit access from August 26th.

To keep edit access, ask the project owner to upgrade their plan or reduce the number of people with edit access.

gusty condor
#

So how can I regain access?

tropic minnow
last mauve
gusty condor
#

It's almost camera-ready deadline!

#

We have only 4 days left

gusty condor
#

Acknowledgement will not count toward the page limit, but there's still a paragraph to compress.
I noticed this sentence:

Authors can add an optional ethics statement to the paper; it will not count toward the page limit, but should not be more than 1 page.

Can we move some information (like fostering multilinguality and cultural diversity) into the ethics statement to free up a bit more space in the main pages?

misty igloo
#

@gusty condor I copied your new tokenizer appendix section into the arxiv version, and will add @acoustic knoll as an author when I get his details

gusty condor
#

Thank you!

#

How do you plan to fit in the page limit?

misty igloo
gusty condor
#

He has been absent and deadline is coming

misty igloo
#

I'm happy to work on rewriting stuff- do you have suggestions?

gusty condor
misty igloo
#

and maybe just shrink Figure 4

#

ok it all fits now... I removed "The Eagle and Finch models fall short on certain aspects that can be mitigated and addressed in future work." because it doesn't add any information, and shrank VisualRWKV image to 80% but it's still a good size

gusty condor
#

Great, let's upload it as camera-ready. In case a revision is needed, we can still upload before the deadline.

last mauve
#

I'm handling it today. Been sick but I'm feeling a bit better

misty igloo
#

hmm the contents section of the Arxiv version ended up with a second page now, and looks messy
maybe someone with more latex knowledge than myself can help fix that up?

young sparrow
#

Yeah I can take a look in 20-ish min

misty igloo
last mauve
#

Ok, did my final pass of minor edits and submitted an updated camera-ready

young sparrow
misty igloo
last mauve
#

Let's target making an arxiv revision live by Wednesday.

gusty condor
#

Who will go as a presenter?

tropic minnow
acoustic knoll
young sparrow
young sparrow
tropic minnow
#

Camera ready deadline extended until 9th (friday)

last mauve
#

oops, looks like camera-ready isn't formatted right

rose mango
#

I was just about to say

last mauve
#
We have detected a critical formatting issues with submission #422. The issues are:
Wrong font; author list misformated; no author email
#

we used the colm conference format, so not sure about the font point

#

missing emails is fair. I can add them

misty igloo
#

re font: maybe something about the chinese support stuff?

last mauve
#

does anyone know what they mean by the wrong font and author list misformatted?

rose mango
#

but we need CJK font support

rose mango
last mauve
misty igloo
#

only thing really with 'font' in the document is \usepackage{pifont}

last mauve
#

@misty igloo -- Can you email them and get clarification on those while I gather author emails

misty igloo
young sparrow
#

It does look like our font is off, comparing the template and our paper side by side

#

Check out the title in particular

misty igloo
#

Hi,

We are the authors of submission #422, ("Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence") and received an email that our camera-ready version has the following errors: "Wrong font; author list misformated; no author email"

If you could supply a bit more detail it would help us resubmit a corrected version as quickly as possible. Specifically:

  • What font and point size is supposed to be shown?
  • We have a very large number of authors. If listed as Author/Affiliation/Address/Email on separate lines this will take many pages. Is that what we should submit or do you have an alternate example template for this many authors?
  • Are there other authorship formatting issues we should be aware of?

Thanks for your assistance,
Dan Goldstein

young sparrow
#

I would say "more than a full page"

misty igloo
#

ok, sent

young sparrow
#

What's your email?

misty igloo
young sparrow
#

I was going to add a "correspondence to X, Y, and Z" line with the emails of you, bo, and quentin

misty igloo
#

my other emails are like personal and unrelated company ones so I guess that one is best!

young sparrow
misty igloo
#

but so does the word "Abstract"

#

it's a totally different 't'

#

in that pic that's the easiest letter to see how different the two fonts are - lowercase t

#

seems like it's not just the headings, the actual normal font is also different in the same way

#

so a document-wide font change of some sort

young sparrow
#

Yeah I'm working on it

#

experiment.tex is a new file that contains a heavily stripped down header and seems to match the font

#

oh duh

#

I think I have it, lemme recompile

#

Yeah fixed it

#

The packages fourier and times both set the default font in the document. After removing them, the font looks right

misty igloo
#

maybe in the interest of time we should guess what they might want and just try resubmitting that way tonight

young sparrow
#

We reran the evals of models like Mistral and Falcon instead of copying them from their papers right

young sparrow
#

Does anyone know where I can find the raw eval harness outputs btw?

#

(Doing a final pass on some details)

young sparrow
#

Each lm-eval output jsonl logs the hash of the library commit used. It looks like there are at least four different commit hashes used to do the evals
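
A quick way to audit this is to tally the commit hash recorded in each output. The sketch below assumes each record is JSON with a top-level `git_hash` field; that key name is an assumption about the file layout, and `count_commit_hashes` is a hypothetical helper, so adjust both to whatever the harness version actually wrote.

```python
import json
from collections import Counter

def count_commit_hashes(records, key="git_hash"):
    """Tally the commit hash stored in each eval record.

    `records` is an iterable of JSON strings (one per results file or
    JSONL line). `key` is an assumed field name, not a confirmed one.
    Records that lack the key are counted under "<missing>".
    """
    counts = Counter()
    for raw in records:
        data = json.loads(raw)
        counts[data.get(key, "<missing>")] += 1
    return counts

# Toy records standing in for real eval outputs (the hashes are made up).
example = [
    '{"git_hash": "aaa111", "results": {}}',
    '{"git_hash": "bbb222", "results": {}}',
    '{"git_hash": "aaa111", "results": {}}',
    '{"results": {}}',
]
hash_counts = count_commit_hashes(example)
```

Grouping results by hash this way makes it easy to see which evals would need rerunning to get everything onto one commit.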

misty igloo
misty igloo
#

@young sparrow @last mauve with the new font the paper now appears to be 11 pages long

#

not including a larger authors section we may be required to include

#

tables 4,5 also now appear to have gotten too wide somehow

#

my only ideas on how to reduce the size easily and quickly to fit within the 10 page limit is to move Figures 3 and 4 back to the Appendix

#

working on that now

#

done.

young sparrow
#

I would cut one of the lambada metrics (we report both acc and ppl) to make it fit width-wise

misty igloo
void quartz
misty igloo
#

@young sparrow @last mauve I updated the author block with one that's in the original COLM style.. it puts us slightly over the page count but maybe we could fix that by eliminating text or the huge number of affiliations. Let me know what you think.

#

I commented out the following conclusions paragraph:

Because our training corpus contains synthetic data from GPT-3.5 and ChatGPT, our released models exhibit behaviors similar to ChatGPT and will mimic ChatGPT's conversation style and tone. For instance, the model might occasionally claim that it is trained by OpenAI. However, this is not a general property of RWKV architecture but rather a specific outcome of the training data.
and now we are at exactly 10 pages, even with the COLM style author block

misty igloo
#

We could probably resubmit in its current form. I mildly abbreviated a few of the affiliations. And we only list emails for the three first authors. (I don't know how we could fit emails for everyone)

gusty condor
#

Actually, I don't need an email listed, to avoid spam emails

last mauve
young sparrow
#

Gosh that looks terrible

#

(Not your fault obviously)

misty igloo
young sparrow
#

Did they respond to your email?

misty igloo
#

nope

#

The other thing they did in the example was group people by affiliation

#

but that seemed nearly impossible

young sparrow
#
  • requires changing author order
misty igloo
#

yeah exactly, or at least taking the first three authors and putting them under EleutherAI and then doing... something else with everyone else

#

couldn't find a reasonable way to do it

misty igloo
young sparrow
#

The deadline is Thursday. I think we should give them at least 24 hours more to respond

misty igloo
#

as of now we had to ask some fairly vague questions about what they want us to do, given the number of authors

young sparrow
# misty igloo maybe their response would be better informed if they can see what we have here?

Good point. Send a follow-up email saying:

Hello,

I wanted to follow up on my previous email with additional information. We've solved all of the typesetting and layout issues except for the author block. As mentioned previously, a strict interpretation of the layout guidelines would take more than a full page due to needing to put each author on their own line. I've attached two screenshots of alternative solutions, one using the authblk package and one not. Are either of these acceptable solutions?

misty igloo
#

btw the only thing really stopping the author list from looking like we used to have it was their low quality footnote/thanks mechanism
and that their examples basically said to put the affiliation below the author

#

I'll send that followup email

young sparrow
#

I debated saying we had a preference for the authblk version but idk

#

The authblk version has been saved as its own file named authblk.tex

misty igloo
#

looks like quentin submitted the revision too (with the colm style authors)

last mauve
#

yep just resubmitted

young sparrow
#

I have a preference that the authblk version go on arXiv but otherwise it's blobshrug

last mauve
#

send them the email and get their take on it

misty igloo
#

I'll change the email to say that we submitted this one, and show both options

young sparrow
#

I wouldn't because I think the only thing that might do is cause them to default to saying "keep whatever you submitted"

misty igloo
#

haah ok

#

sent

last mauve
#

wait do you want me to resubmit with a different version?

#

or is smerky just changing the email

misty igloo
#

I'm fine with having submitted one that more strictly conforms - at least this way we're less likely to get booted from the conference 🙂

#

I kept the email as written by Stella

last mauve
#

yep ok

gusty condor
#

It makes no sense that author list takes up 2/10 of the main pages, and some figures are moved back into the appendix.

misty igloo
misty igloo
misty igloo
steady ether
#

No worries, that's optional. Nice to have if we do end up resubmitting

last mauve
misty igloo
#

wow, GoldFinch got a citation in a paper that is appearing in COLM'24!
so... they must have added that for the camera ready version - since GoldFinch didn't exist at the time of COLM submission!
https://arxiv.org/abs/2407.18003

but.. they cited the original RWKV paper when discussing it (doh!) We have some kind of weird discoverability issue with the Eagle/Finch paper

gusty condor
rose mango
misty igloo
#

research paper internal SEO 🤷‍♂️

#

we could change the title to something more like "Eagle and Finch: RWKV-5 with Matrix-Valued States and RWKV-6 with Dynamic Recurrence"

#

or "Eagle (RWKV-5) and Finch (RWKV-6): RWKV with Matrix-Valued States and Dynamic Recurrence"
or "RWKV-5 'Eagle' and RWKV-6 'Finch': RWKV with Matrix-Valued States and Dynamic Recurrence"

#

any other suggestions?

steady ether
last mauve
#

I agree we have terrible SEO on the eagle/finch paper, and I'm of the opinion that "RWKV" should be the first word of any future title.

misty igloo
gusty condor
steady ether
#

I also took time to read through the citations. They only cited the foundational works, so I think it's appropriate that they cited RWKV 4 instead of 5/6, and they cited Mamba 1 instead of Mamba 2.

acoustic knoll
#

Hi, last week, I wrote about the Rust RWKV world tokenizer update on the RWKV Discord channel. In case some of you do not see it, here again. Three weeks ago, Huggingface tokenizer released a test comparison of the encoding speed of Tiktoken and Huggingface tokenizer on different sizes of text and different numbers of threads (for batch encoding). The result was that the Huggingface tokenizer is faster on small text sizes and more threads. Otherwise, Tiktoken is faster. https://github.com/huggingface/tokenizers

So, we updated the Rust RWKV world tokenizer to support multithreading for batch encoding. We ran the same comparison script from the HF tokenizer with the additional rwkv tokenizer. The result is that the rwkv world tokenizer is significantly faster than the Tiktoken and Huggingface tokenizers in all numbers of threads and document sizes (on average, its speed is ten times faster).


rose mango
void quartz
#

Do you have a fancy new windows laptop, with local copilot installed?

it might be running RWKV, im trying to fact check this, so if possible scan the OS for any files larger than 1GB

#

We already confirmed that RWKV.cpp codebase is part of the windows OS latest update

rose mango
#

the copilot pc thing?
i don't even know what PCs support it

#

but Microsoft is shipping RWKV.cpp?

#

If they are, you can probably be sure... wherever they put OSS licenses

#

i don't even know where to find the windows eula anymore, actually

rose mango
#

they also ship llama.cpp

void quartz
#

our stuff is all apache2, so it's definitely allowed

#

and its not like they remove the license entirely either, so its all above board

#

im trying to source for a single working laptop with the "offline copilot" beta, typically on a snapdragon CPU (if anyone here has it, please let me know)

#

so i can trace if it's actually using our model, or is our code just being dumped in there

#

if ur using windows 11 updated, you can just search system files for rwkv

rose mango
#

I found the files, didn't find any models though

#

I assume you don't necessarily need the copilot pc thing, since they have libraries for CPU & GPU

void quartz
#

there is no model, the copilot offline mode (which i assume will download the models), is in very limited beta - so im trying to find that 1 laptop that has it

void quartz
#

https://x.com/RWKV_AI/status/1830859408106192942

The completed finch model, trained and eval-ed can be found here - it's generally a step up from the previous Eagle models

The RWKV v6 Finch lines of models are here
Scaling from 1.6B all the way to 14B

Pushing the boundary for an Attention-free transformer, and Multi-lingual models.

Cleanly licensed, Apache 2, under
@linuxfoundation

Find out more from the writeup here: https://t.co/30VbPbbfCm

gusty condor
#

V6 7B MMLU should be higher than 41.7%. Which code did you use?

obsidian quest
gusty condor
#

User: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6

Assistant: The answer is

Remove the first line break, 46.7% -> 47.2%

obsidian quest
#

which line break

gusty condor
#

This one

#

That's Discord's formatting problems

#

A better prompt is:

User: You are a very talented expert in abstract algebra. Answer this question:
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6

Assistant: The answer is

Correct: 6696 / 14042 (47.69%)
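
The prompt layout above can be produced by a small formatter. This is a minimal sketch: `build_mmlu_prompt` and its parameters are hypothetical names, and the exact template used in the eval repo is not confirmed here. The last line just re-derives the reported percentage from the counts.

```python
def build_mmlu_prompt(question, choices, subject=None):
    """Format one multiple-choice question in the User/Assistant layout above.

    If `subject` is given, the expert preamble from the "better prompt"
    is prepended. This sketch emits exactly one blank line before
    "Assistant:" and no trailing whitespace after "The answer is".
    """
    letters = "ABCD"
    lines = []
    if subject is not None:
        lines.append(
            f"User: You are a very talented expert in {subject}. Answer this question:"
        )
        lines.append(question)
    else:
        lines.append(f"User: {question}")
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("")
    lines.append("Assistant: The answer is")
    return "\n".join(lines)

prompt = build_mmlu_prompt(
    "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    ["0", "4", "2", "6"],
    subject="abstract algebra",
)

# Accuracy reported above: 6696 correct out of 14042 questions.
accuracy_pct = round(100 * 6696 / 14042, 2)
```

Keeping the template in one function makes whitespace changes (like the removed line break) a one-line edit that is easy to compare across runs.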

obsidian quest
gusty condor
#

I don't own that repo, will ask @iron parrot to update

iron parrot
obsidian quest
gusty condor
#

Special tokens influence so much on the final result

void quartz
#

@hushed orchid - you might want to update the results?

hushed orchid
#

Is that ACC norm?

hushed orchid
obsidian quest
#

it's only detecting A/B/C/D
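
One reading of "only detecting A/B/C/D" is that the scorer compares the model's next-token scores for the four letter tokens and takes the largest, ignoring everything else. A sketch under that assumption; `pick_answer` and `accuracy` are hypothetical helpers and the scores are made up.

```python
def pick_answer(letter_scores):
    """Return the letter among A/B/C/D with the highest score.

    `letter_scores` maps candidate letters to next-token logits (or
    log-probabilities). Tokens outside A/B/C/D are never considered.
    """
    candidates = ("A", "B", "C", "D")
    return max(candidates, key=lambda letter: letter_scores[letter])

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Made-up scores for two questions.
scores = [
    {"A": -1.2, "B": 0.7, "C": 0.1, "D": -0.3},
    {"A": 0.0, "B": -2.0, "C": 1.5, "D": 1.4},
]
preds = [pick_answer(s) for s in scores]
acc = accuracy(preds, ["B", "D"])
```

Because only four token scores are compared, there is no length normalization involved in this kind of scoring.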

iron parrot
obsidian quest
quaint quiver
#

So is rwkv7 = rwkv6 + matrix valued decay and/or time boost? No delta rule?

#

Also what are the specifics on how to make that decay? Is it like before (vector data dependent + matrix data independent)?

#

@obsidian quest

#

And are there some speed benchmarks for the cuda kernel compared to v6?

obsidian quest
#

matrix-valued evolution already includes delta rule as a special case

misty igloo
#

and curious how we can do this efficiently in terms of the kernel

acoustic knoll
misty igloo
quaint quiver
#

Best to test in incognito I think

misty igloo
#

I think when citation count (16) approaches v4 (275) we might get higher placement on the newer paper

acoustic knoll
quaint quiver
#

Well most times it barely appears 😢

quaint quiver
last mauve
#

What's the plan on the RWKV COLM poster? I'm not attending so can't really decide things here.

misty igloo
#

same issue here

#

@tropic minnow are you going to be there and presenting?

young sparrow
#

Is there a poster template? I'm not currently seeing one on the website

gusty condor
#

I had one for RWKV-4, but I'm not planning for RWKV-5/6 poster.

obsidian quest
tropic minnow
gusty condor
#

Has the v7 architecture been finalized?

obsidian quest
#

not yet. but close

misty igloo
#

@obsidian quest if you could supply a list of all the world v3 datasets and mix, that'd help get a headstart on the RWKV-7 paper so we can turn it around faster this time

#

(if you plan to train it on world v3, or continued from Finch world v3)

gusty condor
#

Qwen 2.5 is out. It is said that they used 18 trillion tokens of data. RWKV world v3 is only one sixth of their size.

void quartz
#

Did some testing on the llama3 8B model... and transformers might be just RNN's with extra steps and more memory?

Not sure if this is significant / should be its own thing. Maybe its already a known thing (and i was ignorant)

You can set up a transformer, with a prompt, in sort of a needle-in-haystack situation. And delete the needle's KV embedding.... And it still works, from the recurrent embedding stored in subsequent tokens.

This is more prevalent in longer prompts / chain-of-thought, and would explain how such processes (or thinking tokens) help model performance improve.

The longer write up is here: https://docs.google.com/document/d/1ShztwKqQtqkG5ZsbbhKxw2toS0_s-OwxR_FbaLK2nIU/edit?usp=sharing
( note i might be just ignorant, and too used to thinking in RWKV recurrent terms - so im sharing here, to see if it makes sense to you all too )

misty igloo
# void quartz Did some testing on the llama3 8B model... and transformers might be just RNN's ...

I think (?) this is known (tho I haven't seen it written about), and is why SWA (Sliding Window Attention) can work for longer than its window
But it's very interesting!
Essentially, the KV state of higher layers can keep around deleted KV state from lower layers, subject to layer count limits
This might be something to show the folks in the interpretability channel, in case they haven't analyzed it before and want to do a deep dive on it!
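
To put a rough number on "subject to layer count limits": if every layer can relay information at most one window further back, the reach grows linearly with depth. A back-of-the-envelope sketch, where `swa_reach` is a hypothetical helper, the window is assumed to include the current token, and the numbers are illustrative rather than any particular model's config.

```python
def swa_reach(window, num_layers):
    """Upper bound on how far back information can propagate.

    Assumes each layer lets a token attend to itself and the previous
    `window - 1` tokens, so every layer can relay information at most
    `window - 1` positions further back.
    """
    return num_layers * (window - 1)

reach = swa_reach(window=4096, num_layers=32)
```

This is only an upper bound on where information could come from; how much of it survives the relay is an empirical question.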

#

Oh, I see you already posted there! nice!

void quartz
obsidian quest
misty igloo
tough crane
last mauve
last mauve
#

submitted the latest arxiv overleaf

obsidian quest
gusty condor
#

I suggest using torch.lerp to simplify RWKV-LM token shift. Might improve performance a bit 🙂
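
For reference, `torch.lerp(input, end, weight)` computes `input + weight * (end - input)`, which is the shape of a token-shift mix between each token and its predecessor. A minimal sketch, assuming a per-channel mix weight `mu` of shape `(C,)`; `token_shift_mix` is a hypothetical function, not the RWKV-LM code.

```python
import torch

def token_shift_mix(x, mu):
    """Mix each token with its predecessor using torch.lerp.

    x:  (B, T, C) activations.
    mu: (C,) per-channel mix weights (an assumed parameter shape).
    Computes x + mu * (x_prev - x), where x_prev is x shifted one step
    along the time axis with zeros at t = 0.
    """
    zeros = torch.zeros_like(x[:, :1])
    x_prev = torch.cat([zeros, x[:, :-1]], dim=1)
    return torch.lerp(x, x_prev, mu)

torch.manual_seed(0)
x = torch.randn(2, 5, 8)
mu = torch.rand(8)
mixed = token_shift_mix(x, mu)

# Reference: the same mix written as an explicit weighted sum.
x_prev = torch.cat([torch.zeros(2, 1, 8), x[:, :-1]], dim=1)
manual = x * (1 - mu) + x_prev * mu
```

Whether this is actually faster than the hand-written form would need to be measured.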

#

And modularize the lora-like MLP (being quite common in v7).

obsidian quest
#

@tropic minnow

tropic minnow
gusty condor
#

I agree, but let's call them Low-rank MLP.

misty igloo
dawn pewter
#

In the paper "The Illusion of State in State-Space Models", it is only proven that an SSM can be simulated by TC0 when the projection matrix (transition matrix) is both input-independent and diagonal. However, this does not necessarily imply that an SSM cannot be simulated by TC0 if the transition matrix is not diagonal, does it?

quaint quiver
dawn pewter
#

Maybe we need some mathematical proof

quaint quiver
#

They already have things setup to test I guess

dawn pewter
#

Can RWKV-7 solve the A5 word problem?

quaint quiver
tribal notch
gusty condor
quaint quiver
gusty condor
#

Not really interesting, finite group multiplication is regular. Test Chomsky Hierarchy if possible.

dawn pewter
young sparrow
gusty condor
#

Maybe ... we are talking about languages, right? For example, most programming languages are context-free or context-sensitive, and MQAR is context sensitive.

young sparrow
#

The Chomsky Hierarchy is one of multiple different ways to classify computational problems by difficulty. It is very poorly aligned with transformers and massively parallel computation techniques though, and does not meaningfully capture degrees of difficulty for such models.

#

And the leading theorists do not use it (see Angluin et al., 2023; Merrill & Sabharwal, 2023a; Liu et al., 2023; Chiang et al., 2023; Merrill & Sabharwal, 2023b; Hao et al., 2022, etc.)

misty igloo
#

@obsidian quest what do you think about renaming eta to beta in the paper? That way we could say RWKV-7 is an extension of the delta rule so that Beta becomes vector valued

delta rule:

misty igloo
#

restated rwkv-7 version

#

also this brings up the question of whether you tried keeping everything normalized like in delta rule while extending Beta to be vector valued

#

the current version comes close by using the k*=1-w trick, but it's not exact due to the delta rule portion not being included in that

young sparrow
#

Here's a visualization from Will Merrill showing how regular languages are incomparable with circuit classes

dawn pewter
#

Is the transition matrix of RWKV-7 Diagonalizable?

dawn pewter
#

I think the claim that "Transformers and RNNs with diagonal transition matrix could only represent functions in TC0" is potentially misleading. Given the nonlinear transformations in RNNs like LSTMs and GRUs, the notion of a transition matrix might be unclear. Using "linear RNNs" instead could make the statement more precise.

dawn pewter
quaint quiver
#

so it's not fully diagonalisable

tough crane
tough crane
young sparrow
young sparrow
tough crane
#

Here?

SFO (My vacation has just finished now )

obsidian quest
#
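```
import torch

# B, T, H, N, C (= H * N) and DEVICE are globals defined elsewhere in the script.
def ref_fwd(r, w, k, v, a, b):
    # Reshape to (batch, time, heads, head_size) and use float64 throughout.
    r = r.view(B, T, H, N).double()
    k = k.view(B, T, H, N).double()
    v = v.view(B, T, H, N).double()
    a = a.view(B, T, H, N).double()
    b = b.view(B, T, H, N).double()
    # Decay parameterization: w = exp(-exp(w_raw)) lies in (0, 1).
    w = torch.exp(-torch.exp(w.view(B, T, H, N).double()))
    out = torch.zeros((B, T, H, N), device=DEVICE).double()
    state = torch.zeros((B, H, N, N), device=DEVICE).double()

    for t in range(T):
        kk = k[:, t, :]
        rr = r[:, t, :]
        vv = v[:, t, :]
        aa = a[:, t, :]
        bb = b[:, t, :]
        # Low-rank term: sab[i, j] = (sum_k state[i, k] * a[k]) * b[j]
        sab = torch.einsum('bhik,bhk,bhj->bhij', state, aa, bb)
        # Per-column decay, plus the low-rank term, plus the outer product v k^T.
        state = state * w[:, t, :, None, :] + sab + torch.einsum('bhj,bhi->bhij', kk, vv)
        # Read out: out[i] = sum_j r[j] * state[i, j]
        out[:, t, :] = torch.einsum('bhj,bhij->bhi', rr, state)

    return out.view((B, T, C))
```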
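
For a single batch element and head, each loop step of `ref_fwd` can be read as a matrix update: state becomes state @ (diag(w) + a b^T) + v k^T, and the output is state @ r. The NumPy sketch below checks that reading against the einsum form on random inputs. `step_einsum` and `step_matrix` are hypothetical names, and this is an interpretation of the reference code, not a statement about the final RWKV-7 design.

```python
import numpy as np

def step_einsum(state, r, w, k, v, a, b):
    """One step of the reference recurrence for a single head (state is N x N)."""
    sab = np.einsum('ik,k,j->ij', state, a, b)
    state = state * w[None, :] + sab + np.einsum('j,i->ij', k, v)
    out = np.einsum('j,ij->i', r, state)
    return state, out

def step_matrix(state, r, w, k, v, a, b):
    """The same step written as state @ (diag(w) + a b^T) + v k^T."""
    transition = np.diag(w) + np.outer(a, b)
    state = state @ transition + np.outer(v, k)
    return state, state @ r

rng = np.random.default_rng(0)
N = 4
state0 = rng.standard_normal((N, N))
r, k, v, a, b = (rng.standard_normal(N) for _ in range(5))
w = np.exp(-np.exp(rng.standard_normal(N)))  # decay in (0, 1)

s1, o1 = step_einsum(state0, r, w, k, v, a, b)
s2, o2 = step_matrix(state0, r, w, k, v, a, b)
```

The einsum form never materializes the N x N transition matrix, which is the point of the diagonal-plus-low-rank structure.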
dawn pewter
#

I tried to prove that RWKV7 can simulate DFA based on the methods proposed in the paper "The Illusion of State in State-Space Models". Is it correct?

misty igloo
# last mauve submitted the latest arxiv overleaf

I don't see it on arxiv.org yet (maybe there was some error in the process?)
also, I was looking for some stats and realized that the separation of arc_easy and arc_challenge never made it into the arxiv manuscript - would you like me to port that table into it from the COLM version? If so, I seem to have lost edit access there due to subscription limits 😦

last mauve
obsidian quest
gusty condor
# obsidian quest ```def ref_fwd(r, w, k, v, a, b): r = r.view(B, T, H, N).double() k = k....
def try_bwd(self, r, w0, k, v, a, b, gout, gstate):
        gout = gout.view(B, T, H, N).double()
        gr = torch.zeros((B, T, H, N)).double()
        gw = torch.zeros((B, T, H, N)).double()
        gk = torch.zeros((B, T, H, N)).double()
        gv = torch.zeros((B, T, H, N)).double()
        ga = torch.zeros((B, T, H, N)).double()
        gb = torch.zeros((B, T, H, N)).double()
        w = torch.exp(-torch.exp(w0.view(B, T, H, N).double()))
        for t in range(T-1, -1, -1):
            rr = r[:, t, :]
            ww = w[:, t, :]
            kk = k[:, t, :]
            vv = v[:, t, :]
            aa = a[:, t, :]
            bb = b[:, t, :]
            gr[:, t, :] = torch.einsum('bhij,bhi->bhj', self.state_cache[:, t+1, :], gout[:, t, :])
            gstate      = torch.einsum('bhj,bhi->bhij', rr, gout[:, t, :]) + gstate 
            gk[:, t, :] = torch.einsum('bhi,bhij->bhj', vv, gstate) 
            gv[:, t, :] = torch.einsum('bhj,bhij->bhi', kk, gstate) 
            ga[:, t, :] = torch.einsum('bhik,bhj,bhij->bhk', self.state_cache[:, t, :], bb, gstate)
            gb[:, t, :] = torch.einsum('bhik,bhk,bhij->bhj', self.state_cache[:, t, :], aa, gstate)
            gw[:, t, :] = torch.einsum('bhij,bhij->bhj', self.state_cache[:, t, :], gstate)
            gstate      = torch.einsum('bhj,bhij->bhij', ww, gstate) + torch.einsum('bhk,bhj,bhij->bhik', aa, bb, gstate)
        gw = -torch.exp(w0-torch.exp(w0)) * gw
        return gr, gw, gk, gv, ga, gb, gstate

Note that

self.state_cache = torch.zeros((B, T+1, H, N, N)).double()
self.state_cache[:, 0, :] = state

The cache is designed to avoid another forward computation
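
On the last `gw` line: since w = exp(-exp(w0)), the chain rule gives dw/dw0 = -exp(w0) * exp(-exp(w0)) = -exp(w0 - exp(w0)), which is the factor applied there. A small numerical check of that derivative (plain Python; `decay`, `decay_grad` and `finite_difference` are hypothetical helper names):

```python
import math

def decay(w0):
    """w = exp(-exp(w0)), the decay parameterization used above."""
    return math.exp(-math.exp(w0))

def decay_grad(w0):
    """Analytic derivative dw/dw0 = -exp(w0 - exp(w0))."""
    return -math.exp(w0 - math.exp(w0))

def finite_difference(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Largest gap between the analytic and numerical derivative on a few points.
max_err = max(
    abs(decay_grad(w0) - finite_difference(decay, w0))
    for w0 in (-3.0, -1.0, 0.0, 0.5, 2.0)
)
```

The same finite-difference idea can be used to spot-check the other gradients against the reference forward pass.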

obsidian quest
tough crane
#

Hi, it might be a silly question...

Is the reason that Stella thinks log precision is extremely reasonable as follows? 1. The total number of binary sequences with log N precision is (num of params) * (log N), where N is an arbitrary input length, and then 2. The transformers are equivalent to boolean circuits with size N^{const * num of params}. Thus, if num of params does not depend on input lengths, transformers are in P/poly which is a tractable circuit complexity class.

hushed viper
#

thanks 🙏 for sharing the reference forward & backwards passes. I recall some mention of precision issues, is that the reason for .double() ? (sorry if question has already been answered)

misty igloo
obsidian quest
#

delta rule is ICL gradient descent (this is shown in TTT paper too, for example. it has been known for decades)

we can add some computation to show my factors are indeed ICL wd & lr

crystal hull
#

@obsidian quest any way to help in evaluations ?

quaint quiver
obsidian quest
#

w

quaint quiver
obsidian quest
# obsidian quest delta rule is ICL gradient descent (this is shown in TTT paper too, for example....

RWKV-7's update is pretty similar to the Longhorn model's update (https://t.co/Ll0GIayA8p), which is derived explicitly from solving online associative recall in closed form.

The Householder transform used in RWKV-7, (diag(w) - a \alpha^\top \beta), stems from optimizing a

quaint quiver
last mauve
misty igloo
misty igloo
last mauve
misty igloo
dawn pewter
#

The consistent use of subscript t in equation 15 and subscript j in equation 16 is somewhat confusing.

misty igloo
obsidian quest
#

Changes in rc3:
kk = F.normalize(kk.view(B,T,H,-1), dim=-1, p=2.0).view(B,T,C)
and
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 )
and some incremental stuff
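
In NumPy terms, the two rc3 changes amount to L2-normalizing the key within each head and squashing `a` through a sigmoid of a bias plus a low-rank projection. A sketch with assumed shapes: the bottleneck width `D` and all tensor sizes are made up, and the random arrays only stand in for `time_aaaaa`, `time_aaa_w1` and `time_aaa_w2`.

```python
import numpy as np

def normalize_per_head(kk, H):
    """L2-normalize kk of shape (B, T, C) within each of H heads."""
    B, T, C = kk.shape
    heads = kk.reshape(B, T, H, C // H)
    norms = np.linalg.norm(heads, axis=-1, keepdims=True)
    # F.normalize clamps the norm from below (default eps = 1e-12).
    return (heads / np.maximum(norms, 1e-12)).reshape(B, T, C)

def low_rank_gate(xa, bias, w1, w2):
    """sigmoid(bias + (xa @ w1) @ w2): a gate in (0, 1) from a rank-D bottleneck."""
    z = bias + (xa @ w1) @ w2
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
B, T, H, C, D = 2, 3, 4, 16, 2            # D is an assumed low-rank width
kk = normalize_per_head(rng.standard_normal((B, T, C)), H)
a = low_rank_gate(
    rng.standard_normal((B, T, C)),       # xa
    rng.standard_normal(C),               # stands in for time_aaaaa
    0.1 * rng.standard_normal((C, D)),    # stands in for time_aaa_w1
    0.1 * rng.standard_normal((D, C)),    # stands in for time_aaa_w2
)
head_norms = np.linalg.norm(kk.reshape(B, T, H, C // H), axis=-1)
```

After this, every head's key has unit norm and every entry of `a` sits strictly between 0 and 1.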

misty igloo
obsidian quest
#

then we need a to be within (0,1) range or it will nan

misty igloo
#

I removed the 2 multiplier from the paper just now

#

we'll change the eigenvalue proof etc. in a bit

#

unfortunately, because of w replacing I rather than being outside, the model can still flip the signs of existing state values

#

I might wait to change our description of how it works until we're a little more certain of the final version

obsidian quest
#

yeah but at least now it stays within abs < 1. rc2 will nan after 150G tokens probably because of this

misty igloo
obsidian quest
#

deformed k works better for LLM
in any sense, i think these are data-dependent. your "nicer" version might be better for some time series

misty igloo
#

I'm really most interested in adding in this alpha I show above

#

maybe via a second deformed k and no alpha or beta

obsidian quest
#

if you can get eigenvalues under control, i think deformed alpha + deformed beta will be the best

misty igloo
#

(deformed key -> alpha, beta)

obsidian quest
#

i found (1-w) is actually too much normalization

misty igloo
#

this one it would sum correctly to exactly one value

quaint quiver
#

maybe 1 - w^2 would be better bcs it more so emulates the diagonal of kk^T

#

idk just a guess

misty igloo
#

so I think reducing (1-w) is an approximation to the imbalance in the formula

quaint quiver
obsidian quest
#

just slight (but not noise) performance difference. maybe 0.001

quaint quiver
misty igloo
#

maybe you need his credentials or something for that @acoustic knoll

last mauve
#

You'll need to add them to the authors list along with a contribution section that justifies their inclusion

misty igloo
#

he was even in the prior version you published

last mauve
#

ah oops. You mean just in the arxiv console. Sure I can put them in.

misty igloo
#

yeah sorry, should have been clearer 🙂

last mauve
#

added them along with Jiaju

misty igloo
#

thanks!!!

dawn pewter
#

Where is k^bar used in Formula 12?

gusty condor
#

These formulas are placeholders and do not represent the exact RWKV-7 architecture.

dawn pewter
#

I have discovered that RWKV-7 can mimic the state transitions of any Deterministic Finite Automaton (DFA) by performing multiple calculations. This is because I've proven that RWKV-7's transition matrix can be configured to represent any permutation matrix. Since the state transitions of a DFA can be expressed as a sequence of permutation matrix multiplications, RWKV-7 can simulate any DFA through iterative computations!

young sparrow
misty cedar
#

simple v7 expression

dawn pewter
# young sparrow Can you explain this proof in more detail?

In essence, the state transitions within a Deterministic Finite Automaton (DFA) can be fully represented by a Boolean transition matrix. This matrix, in turn, can be constructed by multiplying a sequence of permutation matrices. I found that, under specific parameter conditions, the transition matrix of an RWKV7 model can assume the form of any arbitrary permutation matrix. Consequently, by multiplying RWKV7 transition matrices, we can generate any Boolean transition matrix that defines a DFA's state transitions. This implies that RWKV7 models possess the capability to simulate the behavior of any DFA.
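
The matrix view of a DFA used in this argument can be illustrated with the parity automaton, whose two transitions happen to be permutation matrices. This sketch only shows the DFA-as-matrix-product idea; it does not verify the claim about RWKV-7's transition matrices, and `run_dfa` is a hypothetical helper.

```python
import numpy as np

# Parity DFA over {0, 1}: state 0 = even number of 1s seen, state 1 = odd.
# Column j of each matrix is the one-hot next state when in state j.
TRANSITIONS = {
    "0": np.array([[1, 0], [0, 1]]),  # reading 0 keeps the state
    "1": np.array([[0, 1], [1, 0]]),  # reading 1 swaps even and odd
}

def run_dfa(word, start=0):
    """Run the DFA by multiplying transition matrices onto a one-hot state."""
    state = np.zeros(2, dtype=int)
    state[start] = 1
    for symbol in word:
        state = TRANSITIONS[symbol] @ state
    return int(np.argmax(state))

final_state = run_dfa("1101")
```

Here the final state is just the parity of the number of 1s in the input word.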

tropic minnow
misty cedar
#

Here's another formulation I like

misty igloo
#

I have that in there for now because it's the motivating equation for what's used in practice, which imho is adjusted because of other current discrepancies in the left vs right sides of the rc2 formula

#

it will eventually get replaced with whatever the final version uses in practice, with some text describing what it approximates

dawn pewter
#

What is the forward computation formula of RWKV-7rc2 now?

misty igloo
#

@dawn pewter I noticed your comment about restricting Beta in the manuscript
Bo is now restricting it to [0,1] in the latest versions
One problem is if it goes up to 2 it can cause flipping, where every timestep parts of the state are negated back and forth
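
The flipping can be seen in the plain delta-rule transition I - beta * k k^T with a unit-norm key: its eigenvalue along k is 1 - beta, so beta in [0, 1] keeps it in [0, 1], while beta = 2 gives -1 and negates that component on every step. A NumPy sketch of the scalar-beta case only (the vector-valued version in the paper differs); `delta_transition` is a hypothetical helper.

```python
import numpy as np

def delta_transition(k, beta):
    """Delta-rule transition I - beta * k k^T for a unit-norm key k."""
    k = k / np.linalg.norm(k)
    return np.eye(len(k)) - beta * np.outer(k, k)

k = np.array([3.0, 4.0])
unit_k = k / np.linalg.norm(k)

# Eigenvalues (ascending) for beta inside and at the edge of the [0, 2] range.
eig_half = np.linalg.eigvalsh(delta_transition(k, 0.5))
eig_two = np.linalg.eigvalsh(delta_transition(k, 2.0))

# With beta = 2 the component along k changes sign on every application.
once = delta_transition(k, 2.0) @ unit_k
twice = delta_transition(k, 2.0) @ once
```

Directions orthogonal to k are untouched in both cases; only the component along the key is shrunk or flipped.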

dawn pewter
rustic rivet
dawn pewter
rustic rivet
#

It could be quite interesting if RWKV-7 could do more things with one layer

rustic rivet
#

this paper suggested there are tasks that "efficient transformers" can't solve efficiently

#

could be a good time to revisit these conclusions, with rwkv-7 design

dawn pewter
#

I think RWKV7 will exhibit capabilities that set it apart from RNNs, transformers, and earlier Efficient Transformers.

quaint quiver
misty igloo
dawn pewter
dawn pewter
misty igloo
#

as stella pointed out back up there in this channel, there were a few different commit hashes of lm-eval used tho

#

I think because RWKV wasn't well supported at first and picocreator/hailey needed to change it a bit for that

void quartz
#

we probably should rerun everything anyway for v7

misty igloo
tropic minnow
tropic minnow
dawn pewter
#

I understand that, from the perspective of this paper, permutation matrices in state tracking tasks can have eigenvalues of -1, so a transition matrix with only positive eigenvalues cannot represent these permutation matrices. However, if the eigenvalues are allowed to reach -1, then these matrices can be represented.

tropic minnow
#

@obsidian quest i did the experiment hand in hand with: delta rule vs delta rule with scaled beta between [0, 2] and this last thing worked better

#

with headwise normalization

#

(and beta being a headwise scalar, not a vector)

obsidian quest
#

try 3 different random initializations

tropic minnow
hallow breach
#

Hey! Any of you wonderful RWKV people going to be at COLM this week? I'd love to meet up at some point!

young sparrow
#

@tropic minnow and I are 🙂

hallow breach
hallow breach
young sparrow
#

Oh and I forgot to tag @sonic rose

spiral minnow
#

I didnt see this til just now! @sonic rose and i went for lunch with a big group

hallow breach
#

No worries! I'd love to meet up for dinner, lunch tomorrow, really whenever if you guys are still up for it! @spiral minnow @young sparrow @tropic minnow @sonic rose

obsidian quest
#

remember to hype rwkv7 😄

#

current run finishing in 5 days

young sparrow
obsidian quest
#

more general dynamic state evolution, while still efficiently trainable on current GPUs
rwkv 5/6 : diagonal matrix diag(w)
rwkv 7: diagonal + low rank (such as diag(w) + a^t b)
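A rough sketch of why the diagonal + low rank form stays cheap, assuming column-vector conventions (G = diag(w) + a b^T acting on the key dimension of the state). The function names are invented for illustration, and this is plain Python rather than the actual kernel.

```python
def transition_dense(w, a, b):
    """Materialise G = diag(w) + a b^T as an N x N list of lists."""
    n = len(w)
    return [[(w[i] if i == j else 0.0) + a[i] * b[j] for j in range(n)] for i in range(n)]

def apply_dense(G, state):
    """Ordinary matrix product G @ state."""
    n, m = len(G), len(state[0])
    return [[sum(G[i][k] * state[k][j] for k in range(n)) for j in range(m)] for i in range(n)]

def apply_structured(w, a, b, state):
    """Compute (diag(w) + a b^T) @ state without ever forming the N x N matrix."""
    m = len(state[0])
    bt_state = [sum(b[k] * state[k][j] for k in range(len(b))) for j in range(m)]  # b^T @ state
    return [[w[i] * state[i][j] + a[i] * bt_state[j] for j in range(m)] for i in range(len(w))]

# Toy numbers (assumptions): both routes should give the same new state.
w, a, b = [0.9, 0.8], [-0.5, 0.25], [0.5, 1.0]
state = [[1.0, 2.0], [3.0, 4.0]]
dense_result = apply_dense(transition_dense(w, a, b), state)
structured_result = apply_structured(w, a, b, state)
```

The structured route only needs one vector-matrix product plus elementwise work per step, which is the sense in which the more general transition is still efficient.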

hallow breach
tribal notch
obsidian quest
tribal notch
obsidian quest
misty cedar
#

Some interesting stuff in here I am sure

tropic minnow
hallow breach
#

It is without a doubt one of the best looking posters here

gusty condor
#

It's based on my last year RWKV-4 poster

tropic minnow
tropic minnow
#

There was quite some interest in RWKV today! Also, some papers used it as a comparison among LinearAttention-like models and RNN baselines! Made sure to remind people that v7 is just around the corner :)

violet iris
#

Are there any live pictures of RWKV in COLM?

paper dove
#

Could someone explain the evolution process from RWKV-6 to RWKV-7?

quaint quiver
#

In high level terms it just uses the delta rule additionally

crystal hull
misty igloo
paper dove
misty igloo
paper dove
#

Goose extends this delta rule removal principle into vector-valued territory, allowing precise
channel-specific portions of values to be removed from the state in a data-dependent manner.

misty igloo
#

Sorry, I should probably update that sentence - this was a placeholder early on

#

I'm waiting to see what's in rc3 before revising the paper a bit more

paper dove
#

previous w is also a removal(decay) from the state, what is the key difference?

misty igloo
#

It's not exactly incorrect but I'd like to be more specific about what directions those channels face

#

In v7 Bo uses a 'deformed key' to remove from a key which is slightly different than the key which is added to

#

he first does the normal decay, and then removes a fraction of the value stored at that deformed key

#

but it's a bit messy in terms of the math so I'd rather wait until rc3 to clarify exactly what's going on there

#

but in general, the difference between decay and delta rule formulations is that in delta rule you remove a fraction of the projection of the state onto the removal key
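A toy sketch of that difference, assuming a state with key channels as rows and value channels as columns, and a unit-length removal key. `decay_step` and `delta_removal_step` are illustrative names only, and this is neither the rc2 nor the rc3 formula.

```python
def decay_step(state, w):
    """Per-channel decay: row (key channel) i of the state is scaled by w[i]."""
    return [[w[i] * x for x in row] for i, row in enumerate(state)]

def delta_removal_step(state, key, fraction):
    """Remove `fraction` of the projection of the state onto the unit vector `key`."""
    n_val = len(state[0])
    proj = [sum(key[i] * state[i][j] for i in range(len(key))) for j in range(n_val)]  # key^T @ state
    return [[state[i][j] - fraction * key[i] * proj[j] for j in range(n_val)]
            for i in range(len(key))]

def read(query, state):
    """Read out query^T @ state."""
    return [sum(query[i] * state[i][j] for i in range(len(query))) for j in range(len(state[0]))]

state = [[1.0, 2.0], [3.0, 4.0]]
key, orth = [0.6, 0.8], [0.8, -0.6]             # a unit removal key and a direction orthogonal to it

decayed = decay_step(state, [0.5, 0.5])          # shrinks what every query reads
removed = delta_removal_step(state, key, 1.0)    # only what `key` reads is removed
```

After the delta-rule step, `read(key, removed)` is (numerically) zero while `read(orth, removed)` matches `read(orth, state)`; the decay step halves both.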

paper dove
#

The reflection κ parameters, and β_t representing the "in-context learning rate": these are totally new concepts for me.

misty igloo
#

κ is that 'deformed key'

#

you can consider it like a modified version of the normal key

paper dove
#

so the reflection is confusing.

misty igloo
#

the other interesting perspective from which to view this, which we will eventually put into the paper, is as a form of SGD

paper dove
#

is it the reflection of a matrix like this?

misty igloo
#

@gusty condor wrote in the 'reflection' naming - I hadn't seen that until now, and I don't think I agree with the terminology

#

but I'll wait for him to explain it since I only just saw it now

#

Bo calls it 'deformed key', not reflection

paper dove
#

'deformed key' may be a better name

misty igloo
#

Sorry there isn't much explanation of the parameters meanings in the paper yet - I'm just waiting because the architecture is going to change slightly

paper dove
#

Many people complain that the RWKV paper is not readable, one reason being the insufficient explanation of the meaning of parameters.

misty igloo
#

there are also a few details that don't match the existing implementation, which I left in for clarity to myself/others, like formula 12 is wrong

#

imho it represents more of the underlying meaning as it's written, but it does not match what he actually does

#

it's just a placeholder

paper dove
misty igloo
misty igloo
# paper dove

yeah this specific formula is passed on from GoldFinch and Finch C2, but Bo found that a slightly different variation works better for v7 in its current formulation
imo this is because the formula for v7 is imbalanced, so the modification is a way of approximating what should really be k'=k*(1-w)

#

but due to the imbalance it's more effective to use a somewhat different formula for that

#

the rationale for k'=k*(1-w) in GoldFinch is that it keeps the state naturally normalized to containing exactly one value at all times in any given key channel

#

classic delta rule automatically preserves this kind of status without requiring such a formulation, but v7 has a weird delta rule with a different amount being removed than is added back

#

that's the 'imbalance' im referring to
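A scalar sketch of the k'=k*(1-w) idea, assuming a single key channel that already holds one value and a made-up helper `run_channel`. It is not the GoldFinch or v7 code, just the bookkeeping argument.

```python
def run_channel(values, decays, scale_key):
    """Scalar state for one key channel: s = w * s + k * v, with k = (1 - w) or k = 1."""
    s = values[0]  # assumption: the channel starts out holding exactly one value
    for v, w in zip(values[1:], decays[1:]):
        k = (1.0 - w) if scale_key else 1.0
        s = w * s + k * v
    return s

values = [2.0] * 6                               # keep writing the same value
decays = [0.9, 0.5, 0.7, 0.9, 0.3, 0.8]          # arbitrary per-step decays

balanced = run_channel(values, decays, scale_key=True)     # stays at one value's worth
unbalanced = run_channel(values, decays, scale_key=False)  # total weight grows above one
```

With the (1-w) scaling, what decays away is exactly replaced by what is added, so the channel keeps holding the equivalent of one value; without it the magnitude depends on the decay history.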

misty igloo
#

Unfortunately I have a feeling the best way might be a separate blog post etc.

#

RWKV, and especially v7, is quite complicated relative to many other architectures

#

Since we only have one architecture this time instead of two, hopefully we can fit more description of the different parts up front and where they come from

paper dove
#

let the network keep capacity for future use

misty igloo
#

there's definitely a tradeoff, and afaict so far Bo has found the imbalanced versions to perform a bit better

paper dove
misty igloo
#

there can also be issues in very long contexts potentially if the state can grow unbounded and then gets renormalized

#

so I personally prefer non-growing mechanisms

misty igloo
#

like I know roughly where it comes from and have a good guess as to why it works better, but it'd be very hard to justify or prove

#

but that makes explaining why we adjust k very tricky to do, even though the k=k*(1-w) viewpoint is very easy to explain

#

we could say it in a blog post much more easily, where there doesn't have to be a full defense of every statement or claim

gusty condor
misty cedar
#

This is my favorite formulation so far, for v7rc2
it really shows that the v7 is super simple,
its:

  1. create fast weights [ab] and [kv]
  2. add a diag of decay to [ab]
  3. for each timestep: [kv](t) += [ab](t) @ [kv](t-1)
    3.1) essentially, the fastweight [ab] is being used to do processing on the fastweight kv
  4. use the new fastweight kv as a linear module
misty igloo
#

didn't check it thoroughly for the correct transpositions since I'm just writing it in discord, but maybe the easiest way to show what it does recurrently would be with something like this?

outer_product = lambda x, y: x[..., :, None] @ y[..., None, :]  # batched outer product
out = torch.empty_like(v)
state = torch.zeros_like(outer_product(k[:, 0, ...], v[:, 0, ...]))  # start from an empty state
for t in range(T):
    r_t, w_t, k_t, v_t, a_t, b_t = map(lambda x: x[:, t, ...], [r, w, k, v, a, b])
    G_t = w_t.diag_embed() + outer_product(a_t, b_t) # the transition matrix
    state = G_t @ state + outer_product(k_t, v_t)
    out[:, t, ...] = (r_t[..., None, :] @ state).squeeze(-2)

or, restated in terms of deformed k:

outer_product = lambda x, y: x[..., :, None] @ y[..., None, :]  # batched outer product
out = torch.empty_like(v)
state = torch.zeros_like(outer_product(k[:, 0, ...], v[:, 0, ...]))  # start from an empty state
for t in range(T):
    r_t, w_t, k_t, v_t, d_t = map(lambda x: x[:, t, ...], [r, w, k, v, deformed_k])
    G_t = w_t.diag_embed() - outer_product(d_t, beta * d_t) # the transition matrix
    state = G_t @ state + outer_product(k_t, v_t)
    out[:, t, ...] = (r_t[..., None, :] @ state).squeeze(-2)
sullen horizon
obsidian quest
misty igloo
# obsidian quest https://x.com/BlinkDL_AI/status/1845070341779095676

so, is this the correct list of changes?

  • per head deformed key normalization, as previously discussed
  • per-channel dynamic adjustment of k towards the in context learning rate, so that it can add anywhere from 'the correct amount of in context learning' up through full k to the state at each step?
  • a new replacement for k*=1-w using some kind of approximation that it would be helpful if you could explain the idea for
obsidian quest
#

in vanilla delta rule, k should be scaled by iclr

#

and k should be scaled by 1-w if you like the idea

#

however it's better to let the model determine by itself the amount of these changes

misty igloo
#

actually I can just graph it 🙂

obsidian quest
misty igloo
#

thanks, I see - I think you showed this before

obsidian quest
#

mk = 1 ==> full scaling, similar to k = k * (1-exp(-exp(w)))
mk = 0 ==> no scaling, similar to k = k

misty igloo
#

I guess the only special note is that the mk parameter acts in a nonlinear way to scale 1-w, since the exp() is after it is applied

misty igloo
# gusty condor or "reflector" maybe?

kappa is the hyperplane normal onto which we project the state - and then subtract this projection off of the state
so kappa itself isn't really the reflection, nor does it do the reflecting, it just chooses the hyperplane

#

it's an expression of what part of the state we want to remove (the amount to remove is determined separately, by the in-context learning rate)

quaint quiver
misty igloo
#

hehe I don't know enough topology to know about topological retracts... but in common English usage it most commonly means withdrawing statements in a newspaper or journal, or as the name of a medical device
Bo wanted to call kappa the 'deformed key'

gusty condor
#

I noticed a pattern in RWKV's design: important information like key and value uses full matrices, while variables that are mainly for control (not important for information transmission) use low-rank MLPs.

gusty condor
young sparrow
paper dove
gusty condor
#

I think it's publicity and promotion that we are lagging behind.

obsidian quest
gusty condor
#

I think another reason is that RWKV is an over-designed architecture (since v5.2).

acoustic knoll
#

And I think the name is also one of the reasons. I have difficulty pronouncing it, and people can't remember it easily, which is not so good for promotion

obsidian quest
obsidian quest
#

however we definitely need more blogposts

gusty condor
#

We will soon reach a point where the architecture is too complex to explain in every detail.

obsidian quest
#

far from that. YOLO (v1 to v11) is a good example of real "over-design"

paper dove
#

I don't think RWKV is over-design

young sparrow
paper dove
#

From RWKV-4 to RWKV5/6, the incremental design is very clear.

#

But the reader's preliminary knowledge is not enough to understand

obsidian quest
#

mamba is more complicated than wkv, however they create an illusion by providing some "reasoning" to make the reader feel better

young sparrow
#

Communication with an under-informed reader is a, if not the, primary goal.

#

That's who reads methods sections

obsidian quest
#

agree. WKV itself is simple, however ddlerp etc. can be confusing for newcomers, and we need to separate these topics

#

in fact, ddlerp is beneficial for transformers too. has nothing to do with WKV.

young sparrow
gusty condor
young sparrow
obsidian quest
#

firstly, mamba users
secondly, attention users

young sparrow
gusty condor
#

For me (my background includes mainly algebra and mathematical analysis), RWKV paper is more informative than Mamba, and formulae are consistent.

quaint quiver
#

Also make sure to try out alpha * 2 again @obsidian quest to get some state tracking and showcase better math and code performance

obsidian quest
quaint quiver
# obsidian quest that's certainly a weaker design

Ya ik but it should be much faster and simpler, so it's a trade off for efficiency but also for readability which is what happened with mamba. I'm just saying we should try and make rwkv7 explained very well so we don't get mamba'd

quaint quiver
obsidian quest
quaint quiver
#

I don't think that's how u get state tracking the best way, should have the 2 * only on ab (like u were doing before)

#

But ya I agree

obsidian quest
quaint quiver
#

For beta being a vector I'm not sure it brings much overhead but I think it's supposed to preserve some properties and be more stable

#

Oh also it does w * (I - beta * kk^T)

quaint quiver
obsidian quest
#

I have an idea. We can build a RWKV CoT demo to do MCTS. For example, Reversi (Othello).

Rewrite the MCTS procedure as some very long text, and simply train a tiny RWKV model on plenty of such data.
This will be a proof-of-concept to show RWKV is good for very long CoT.
Discussion: https://discord.com/channels/992359628979568762/1296413705159966751
The RWKV model will simulate the full MCTS process. Not just a "value network" / "policy network".

FYI:
https://github.com/LeC-Z/RWKV-nonogram
https://x.com/BlinkDL_AI/status/1834300605973889111

gusty condor
#

But Othello is solved: https://arxiv.org/abs/2310.19387

obsidian quest
#

This will be a proof-of-concept to show RWKV is good for very long CoT

obsidian quest
wraith heron
# quaint quiver I'm worried rwkv7 gets mamba's by https://openreview.net/forum?id=r8H7xhYPwz

how does this compare to https://arxiv.org/abs/2407.14207

quaint quiver
#

then gated deltanet is even better than gated deltanet

#

esp for length extrapolation

wraith heron
#

so is gated deltanet the best alternative to rwkv7 atm? I'm just interested in architectures using delta-rule, because that would make them more expressive than the tc0 space that transformers operate in AFAIK.

quaint quiver
# wraith heron so is gated deltanet the best alternative to rwkv7 atm? I'm just interested in a...

yes gated deltanet is the best, also this paper (https://openreview.net/forum?id=UvTo3tVBk2) shows u need to modify the delta rule a bit to actually get state tracking

#

which u can do easily in rwkv7 and gated deltanet

wraith heron
#

will keep in mind

#

might train a model from scratch to play chess and want the best architecture

obsidian quest
#

https://x.com/BlinkDL_AI/status/1848343821467390156 with my very inefficient RWKV-7 kernel and @bronze frost 's fast kernel 🙂

RWKV-7: attention-free and surpassing modded-GPT. Training code & log: https://t.co/cuH0pItsPy Larger headsz can reach 3.26xx. My current implementation is slow 🤣 Might can reach 85% GPT speed @ ctx1k (or faster than GPT @ ctx4k) after optimization. Any helps are welcome 🙏 #RWKV

crystal hull
#

@obsidian quest What is the model that I should try on ?

obsidian quest
crystal hull
#

On it! Will post here

crystal hull
#

I have started experiments for transformers, got hang of it

#

@obsidian quest I am trying to run it for rwkv. I am looking at the code. what is the rescale layer in inference code? ( 281-293) in rwkv_demo.py

obsidian quest
crystal hull
#

Under rwkv7

obsidian quest
#

RESCALE_LAYER is only for preventing overflow when doing fp16 inference

#

so you don't need them for bf16 training

obsidian quest
obsidian quest
crystal hull
#

@obsidian quest Is it not possible to do f32 training on 'cuda'? I changed it to DTYPE = torch.float32 and there is an error saying expected half

obsidian quest
#

most stuffs are hardcoded bf16 now

#

change cuda .cu and .cpp too

#

typedef float bf16

crystal hull
#

@obsidian quest Are you saying 'typedef at::Float bf16' is not correct?

obsidian quest
#

typedef float bf16 is better

crystal hull
#

Oh I see, I changed it

crystal hull
#

@obsidian quest ' you must implement either the backward or vjp method for your custom autograd.function to use it with backward mode AD'

#

@obsidian quest forward pass worked fine, but backward is throwing above error.

#

@obsidian quest seems like the backward code is missing in the 'WKV_7'?

crystal hull
#

@obsidian quest Should I try with rwkv_cuda or rwkv_cuda_wind ? This is a 4 million Parameter ( single layer model for testing on word problem )

obsidian quest
#

wkv7g_v1 is the reference implementation (slightly better loss, very slow, but enough for your tiny model)

crystal hull
#

@obsidian quest another question, I was originally trying with code under RWKV-LM/RWKV7/rwkv_v7_demo.py

But the one in modded-nanogpt-rwkv/ doesn't have RWKV_Tmix_x070, RWKV_CMix_x060 but only single RWKV7. Does this subsume both mixs ?

obsidian quest
#

that one is tmix

#

you need very good understanding of rwkv to use current rwkv7 😂 can try rwkv6 first

crystal hull
#

@obsidian quest Yup, 😅. This is my first time. But can I just change the config in GPT (I mean vocab_size, n_embd) and be sure it works right?

crystal hull
#

@obsidian quest yup, after a lot of staring. Seems like Cmix is replaced by normal MLP?

obsidian quest
#

i keep train_gpt2 MLP for some fair comparison

crystal hull
#

@obsidian quest Should I also use this as well, because we are comparing against transformers as baseline?

#

Or do you want me to use cmix?

obsidian quest
#

likely similar results

crystal hull
#

kernels in modded_gpt_rwkv_7 are hard-coded for a chunk length of 16, but I am trying to train on small sequences of 5. So there were errors

#

So I'm trying to train rwkv6

#

But cuda kernels are not compiling

#

There is an import error in the code for rwkv_v6_demo.py

crystal hull
#

@obsidian quest after a lot of wrangling, figured out the cause

This error stems when importing import RWKV

setting is_python_module=False as load(name="wkv6"..., is_python_module=False,...) and using torch.ops.wkv6 fixed the issue.

crystal hull
#

I completed a run for the A5 group and k=5 and for n=2 ( seq_len ) and number_of_layers=2, RWKV6 got perfect validation accuracy within 2 epochs !

#

I started the training run for sequence length=15 for both 1,2 layers.
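For anyone wanting to reproduce the data side, here is a minimal sketch. It assumes the task is the usual group word problem (predict the running product of a sequence of A5 elements); the exact setup used in these runs is not spelled out here, and `is_even`, `compose`, `A5` and `make_example` are illustrative names.

```python
from itertools import permutations
import random

def is_even(p):
    """A permutation is even when it has an even number of inversions."""
    inversions = sum(1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j])
    return inversions % 2 == 0

def compose(p, q):
    """Return the permutation 'apply q first, then p'."""
    return tuple(p[q[i]] for i in range(len(q)))

A5 = [p for p in permutations(range(5)) if is_even(p)]   # the 60 even permutations of 5 items

def make_example(seq_len, rng):
    """Sample a sequence of A5 elements plus the running product after each element.

    Returns (tokens, labels): both are lists of indices into A5 (an assumed encoding).
    """
    seq = [rng.choice(A5) for _ in range(seq_len)]
    running, labels = tuple(range(5)), []
    for g in seq:
        running = compose(g, running)
        labels.append(A5.index(running))
    return [A5.index(g) for g in seq], labels

tokens, labels = make_example(15, random.Random(0))
```

The label at each position depends on every earlier token, which is what makes longer sequences a state-tracking test rather than a lookup.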

obsidian quest
obsidian quest
lofty marten
#

Do we have a plan for a model like rwkv-o (GPT-4o)?

obsidian quest
crystal hull
#

Training runs for longer sequences (k=15) will require approximately 15 million sequences in a single epoch, and it seems like they do not converge as quickly as shorter sequences. Previously, I was running experiments on Kaggle P100 GPUs, but each notebook can only run for 12 hours. Could you let me know if there is a cluster available where I can run the full experiment, or suggest how I should proceed next?

crystal hull
#

2 layer RWKV-6 only ran 6 epochs in 12 hours and best val sequence accuracy is 0.32

gusty condor
steady ether
void quartz
#

PS: i will be at neurips this year

(finally finally, closing up our fundraise round... and have time to focus more on RWKV again, but yea in general good news for RWKV soon)

obsidian quest
obsidian quest
obsidian quest
obsidian quest
misty igloo
obsidian quest
misty igloo
#

oh not in the v7 folder lol

obsidian quest
#

i am removing more loras

misty igloo
#

@obsidian quest you got rid of the sigmoid limit on ICLR/Key mix amount - was that intentional so it can go <0 and >1.0?

#

you also got rid of all adjustment of key by decay - this was also intentional, right?

#

like now you just do something like:
k = k + k * (iclr-1) * self.iclr_mix_amt

#

which i guess is supposed to mean:
k = k - k * (1-iclr) * self.iclr_mix_amt

#

so just making sure you want it to be able to exceed [0,1]

#

like is the idea k = torch.lerp(k, k * iclr, self.iclr_mix_amt)?
just confused since it isn't equivalent to that

obsidian quest
#
k = k + k * (a-1) * ma
k = k * (1-ma) + k*a * ma
misty igloo
#

gotcha yeah it is equivalent, somehow i got mixed up 🙂
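For the record, a quick plain-Python check of that equivalence on scalars. The `lerp` helper uses the same definition as torch.lerp (start + weight * (end - start)); the other function names are invented here.

```python
def mix_as_written(k, a, ma):
    """k = k + k * (a-1) * ma"""
    return k + k * (a - 1.0) * ma

def mix_expanded(k, a, ma):
    """k = k * (1-ma) + k*a * ma"""
    return k * (1.0 - ma) + k * a * ma

def lerp(start, end, weight):
    """Linear interpolation: start + weight * (end - start)."""
    return start + weight * (end - start)

# Includes mix amounts outside [0, 1], which the un-sigmoided parameter now allows.
cases = [(1.5, 0.2, 0.0), (1.5, 0.2, 1.0), (-2.0, 0.7, 0.3), (1.5, 0.2, 1.5), (3.0, 0.9, -0.5)]
results = [(mix_as_written(k, a, ma), mix_expanded(k, a, ma), lerp(k, k * a, ma))
           for k, a, ma in cases]
```

All three forms agree. Note that with a mix amount above 1 (e.g. a = 0.2, ma = 1.5) the result changes sign relative to k, which is the [0,1] concern raised above.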

#

[putting all of this into my RWKV_Explained repo]

misty cedar
void quartz
#

i guess now in hindsight, he is outside the transformer cult bubble?

gusty condor
#

In my opinion, the reason is that RWKV uses LayerNorm while all others use RMSNorm.

#

It might be LayerNorm that makes this projection invalid.

#

Try Pythia (if it uses LayerNorm)

#

Another possibility is that RWKV has token shift.

obsidian quest
rose mango
#

faster than RWKV-6, but better loss as well???

obsidian quest
#

yes much better loss

obsidian quest
void quartz
#

I will be helping cover the RWKV paper at neurips at the following in person paper club here: https://x.com/swyx/status/1861197521126859260

Since there is a strong (10% of vote) demand for transformer alt

Disclosure: I've been helping co-organize some of their paper clubs on a regular basis, and am friends with the organizer

Super interesting responses from all the NeurIPS LS live attendees so far:

- Everyone wants Agents, Vision, Open Models, Transformers Killers, Economic Landscape/CodeGen

- Nobody wants Voice (!?!), Diffusion, Finetuning, RAG content?!??!

Ok we need speakers/debaters on all

#

Asking: Anyone here want to join me, and cover statespace? If you're there in person. In particular I guess it would be @last mauve, or anyone from your team? (since you're not going to be there)

#

My ending stance would be: our architecture differences might not matter at this stage, we do not know until we scale - to avoid making it a "this is better than that" presentation

So its more of running through the 3 alts RWKV, statespace, XLSTM, the high level similarity and differences.

obsidian quest
#

pls talk about RWKV-7 🙂

obsidian quest
#

Here is how RWKV-7 really works. It is a meta-in-context learner, test-time-training its state on the context via in-context gradient descent at every token.
https://x.com/BlinkDL_AI/status/1861753903886561649

my simplest explanation:

you have some {k0, v0} {k1, v1} ... and q
ignoring details:
if q = ki, you'd like result to be close to vi
if q = (ka+kb)/2, you'd like result to be close to (va+vb)/2

RWKV-7:
simply test-time-train a model ki -> vi using in-context online GD
if q = ki, result is close to vi
if q = (ka+kb)/2, result is close to (va+vb)/2

RWKV-7 "Goose" 🪿 is a meta-in-context learner, test-time-training its state on the context via in-context gradient descent at every token. It is like a world model ever adapting to external environment: https://t.co/ecOwkzJCOo 🙂 #RWKV
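A stripped-down sketch of the "test-time-train a model ki -> vi" picture, assuming plain online gradient descent on the squared error with learning rate 1 and orthonormal keys so the toy is exact. It leaves out decay, the deformed key, and the learned in-context learning rate, and `predict` / `gd_step` are names invented for illustration.

```python
def predict(S, q):
    """Linear model: S @ q, where S maps key space to value space."""
    return [sum(S[i][j] * q[j] for j in range(len(q))) for i in range(len(S))]

def gd_step(S, k, v, lr=1.0):
    """One gradient step on 0.5 * ||S k - v||^2 with respect to S."""
    err = [p - t for p, t in zip(predict(S, k), v)]   # S k - v
    return [[S[i][j] - lr * err[i] * k[j] for j in range(len(k))] for i in range(len(S))]

keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # orthonormal keys (toy assumption)
values = [[2.0, -1.0], [4.0, 3.0]]

S = [[0.0] * 3 for _ in range(2)]            # the state starts empty
for k, v in zip(keys, values):               # one online step per token
    S = gd_step(S, k, v)

exact = predict(S, keys[0])                  # query equal to a stored key
blend = predict(S, [0.5, 0.5, 0.0])          # query (k0 + k1) / 2
```

Here `exact` recovers the first stored value and `blend` is the average of the two stored values, matching the two cases described above; with non-orthogonal keys or smaller learning rates the recovery is only approximate.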

obsidian quest
#

@gusty condor we can draw RWKV-7 graph now based on rc4a

#

@void quartz @last mauve please make sure you have good understanding of rwkv-7 🙂 feel free to ask questions

void quartz
#

Though the format I'm planning may not let any of us dive too deeply unless asked

quaint quiver
#

This might also help a bit as a different perspective #992359629419991142 message

#

Last section has rwkv7 but it might be a bit outdated

obsidian quest
quaint quiver
#

Ya true

#

I think tho the deltanet database explanation could be more intuitive for some ppl

obsidian quest
#

yeah your version is about some further details

obsidian quest
gusty condor
#

The current RWKV-7 implementation of W is not elegant. This induces one extra exponential, one negation and one logarithm, which might be the bottleneck for training. The formulae can be further simplified as follows.

#

torch.exp(-0.606*F.sigmoid(u)) is very elegant.

obsidian quest
gusty condor
#

I think it's necessary to fuse this

#

RWKV-7 is stable version

#

Make it as fast as possible

obsidian quest
#

because probably it's possible to remove this w clipping with clever bwd

sinful breach
#

so is there a shared doc or sth? How could I help with any parts?

misty igloo
# sinful breach so is there a shared doc or sth? How could I help with any parts?

There is, but it may depend on your existing level of familiarity with RWKV and RWKV-7 specifically.

Link: https://www.overleaf.com/5753862368yvnbymysbrsf#07fba2

Please don't edit yet without discussion (or add wholly new proposed sections/appendices if you like, but no guarantee these will stick) - I've been holding off on updating the formulas bc things have changed quite a bit recently, and we haven't really begun writing the discussion sections yet.

The main codebase or my https://github.com/SmerkyG/RWKV_Explained repo might be a good place to look if you need to first learn how RWKV-7 works.

Generally speaking, people propose and run experiments and we add those in, or they write proofs, etc. The idea is to get a lot of community involvement. Proposing and doing ablation studies could be a great way to help out.

I know you've expressed an interest in making things clearer and more appealing in the coming paper, which will be great!

misty igloo
#

One thing to keep in mind that has constrained us (or well, certainly me personally) in the past is that everything we say in the paper has to be well substantiated by empirical evidence or proof to ensure a smooth review process... though this seems somewhat relaxed when such statements or descriptions lie within the appendices.

This, and restrictive page count limits (usually 9), can make it more difficult to be descriptive or provide a clear intuitive basis for what are often somewhat complicated technical bits, especially when the paper is written in a communal open-source kind of manner with often 20+ authors.

#

That said, let's strive for clarity and accessibility!!!

sinful breach
#

i see, yea it will definitely be hard to do good ablations at scale and it's not clear how informative ablations at small scale (e.g. ~100M params) would be

#

what type of theoretical results are we looking for?

#

deriving motivation for why grad descent/meta-learning grad descent of the specific type of linear regression is useful?

#

Im not actually familiar with the existing literature around rnn alternatives, but, it seems to me that many have converged on this idea of updates based on online grad descent, yet, the exact formulation of why this makes sense to do (especially given that both k and v are functions of x) is missing no?

misty igloo
# sinful breach Im not actually familiar with the existing literature around rnn alternatives, b...

have you read the various papers on delta rule usage in models in general? like the delta net paper https://arxiv.org/abs/2102.11174 and others

#

I think in order to determine what's missing you should probably do some literature review

#

as for why delta rule/modified grad descent is useful for these kv memory states... that's something that we will definitely try to cover in the paper

#

but RWKV-7 is not just traditional delta rule

misty igloo
misty igloo
sinful breach
sinful breach
# misty igloo heh that's the thing about an open paper writing process like this - anyone (inc...

Yea, I was just wondering if people have concrete open problems they're interested or not. Imo I think the optimization perspective is still not entirely complete. I get the notion of approximating the operation of querying to retrieve linear combination of value vectors, approximating linear cross attention, but I do think there is something slightly broader of asking what exact role this serves for doing autoregressive prediction, why aren't we making the key matrix part of our online learned state (thereby turning the linear regression optimization into a linear autoencoder problem), what is the functional importance of sparsity or nonlinearities in how they can actually assist with the objective, etc.

#

Hell, can even ask why isn't rwkv doing multiple gradient descent steps and why meta learning the learning rate makes sense instead of choosing the optimal learning rate for linear regression. How can we think of label noise and robust algorithms, replacing the squared loss with Huber like losses, regularizations

#

If we can understand the broader objective better, we should be able to reason about how keys and values may need to evolve or behave as well

misty igloo
#

I don't exactly disagree, but I would point out that you're really describing new research directions, not RWKV-7

sinful breach
#

Yea that's what I'm worried about and Im not sure what people's feelings are in regards to the current RWKV-7 and goals regarding theoretical results

#

Ig it's a sort of reverse engineering theory to justify the current implementation

#

Like why normalizing the update vectors and making learning rate a learned parameter is important

#

Although perhaps some things could be done if proper ablations are conducted (e.g. ablating over learned vs fixed learning rate)

misty igloo
#

generally speaking, the process for RWKV is that Bo does a ton of experiments (with a mix of consultation with others) and decides the architecture based on what experimentally works best... based on his very strong intuitions about why certain directions will likely work well, of course

the description is then written up based on the kind of reverse engineering you're describing, by folks who have been paying close attention to the development process

it can definitely include analysis that didn't exist when the arch was defined

#

he or I can probably answer questions about why specific choices were made that don't necessarily seem obvious
I often feel like a real time archaeologist 🤣

sinful breach
#

I see, ok, I'll follow along closely in the meantime then and see how things play out and see if I can try any ablations myself. Mostly interested in getting some more practical experience here, better understanding real world concerns when it comes to large models, and the optimization perspectives

misty igloo
#

there definitely are lots of choices that have been informally ablated, which would benefit from a formal recorded ablation

#

even if not at large scale

#

e.g. 'deformed' key removal

sinful breach
#

what are the typical compute requirements for ablation experiments? Would 4 80GB A100s be enough?

misty igloo
#

I think that's more than enough for small scale ablations

#

the biggest issue is that longer/larger scale tests often reveal quite different behavior much later in training

#

afaik the main v7 release candidate versions Bo has tried each got a full Pile run at smaller size

#

we frequently see architecture 'improvements' that win at smaller scales and for 1-2gtok but fall behind later in training

#

there's simply no feasible way to truly fully ablate everything, given any reasonable compute budget

sinful breach
misty igloo
#

often though, additional parts become less necessary at scale

#

which is part of why you see the regression of tokenshift back towards v5 in v7 (this is not in the paper yet - but its in the rc4a code)

#

so there's at least some aspect of 'the bitter lesson' at play, which is I think what you were maybe getting at with that question

#

and yet, there are other complexities which seem useful and low parameter

sinful breach
#

Yea I am honestly kinda hoping that simpler and closer to just basic grad descent with optimal learning rate is what is optimal, and that additional modifications such as learned learning rate aren't actually improving performance much. I'm also curious if the normalization corresponds to certain types of regularization

sinful breach
misty igloo
#

unless you first develop some very advanced theory of what you're modelling and why

#

no one here is purposely making things complex

rose mango
#

The bitter lesson is about simplifying as much as possible given the constraints, not absolute maximum simplicity.

misty igloo
#

but it's valuable to get some experimental experience with these architectures, so either way trying it is valuable!

sinful breach
#

There's also things that I don't quite understand like why is the learned learning rate only applied when subtracting the key outer products but not applied when adding the new value key outer product update

misty igloo
#

you can view the state as a memory, or as SGD... and if you're viewing it as a memory the geometric interpretation is pretty clear cut

#

as for normalization, that's a topic I have a lot of thoughts about...

#

see the k*=1-w thing from Finch-C (originally in GoldFinch paper) for a general idea of what I think is wrong with RWKV normalization in other versions

#

we tried some stuff like that for RWKV-7, but the formulas are kind of lopsided and it didn't matter enough

#

from my perspective it's not that it didn't matter, it's that the formulas never got clean enough for it to matter

#

and normalization is an end-run around this problem

sinful breach
#

Ic ill take a look

misty igloo
#

in any case, a deeper understanding of the SGD perspective definitely needs to make its way into the RWKV-7 paper
Bo for sure wants this described in it

sinful breach
#

Thanks for the pointers, I really appreciate it!

misty igloo
#

no problem! sorry I can't go on at length too much about it all ... got a lot of other unrelated (but RWKV related) stuff i gotta get done 🙂

#

also, the RWKV discord server might be the best place to discuss architecture concerns

#

we generally use this channel mostly around paper writing organizational stuff

#

tho its flexible heh

iron parrot
#

Some questions about RWKV-7.
currently, the state calculation in RWKV-7 is:
state = state * w.view(H,1,N) + state @ ab.float() + vk.float()
where:
w = torch.exp(-0.606531 * torch.sigmoid(w)) # 0.606531 = exp(-0.5)
ab = (-kk).view(H,N,1) @ (kk*a).view(H,1,N)
kk = torch.nn.functional.normalize(kk.view(H,N), dim=-1, p=2.0).view(-1)
a = torch.sigmoid(a0 + (xa @ a1) @ a2)
Is the range of eigenvalues of the state-transition matrix [-0.455, 2]? Would this affect the model's performance on some tasks (like parity)?
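
A minimal sketch to sanity-check the lower end of that range, assuming the per-head transition matrix is `T = diag(w) - kk (kk*a)^T` (read off from the `state * w + state @ ab` line above) and simplifying to a single uniform decay `w0` and a single uniform `a0` across channels. The helper names are hypothetical, and it does not test the upper end of the range.

```python
import math

def transition_matrix(kk, w0, a0):
    """T = diag(w) - kk (kk*a)^T with uniform decay w0 and uniform a0 (a simplification)."""
    n = len(kk)
    return [[(w0 if i == j else 0.0) - kk[i] * kk[j] * a0 for j in range(n)]
            for i in range(n)]

def matvec(T, v):
    """Plain matrix-vector product T v."""
    return [sum(T[i][j] * v[j] for j in range(len(v))) for i in range(len(T))]

# Most negative case: decay at its lower bound, a at its upper bound.
w_min = math.exp(-0.606531)  # sigmoid(w) -> 1
a_max = 1.0                  # sigmoid(...) -> 1
kk = [0.6, 0.8, 0.0]         # unit L2 norm, like the normalized kk
orth = [0.8, -0.6, 0.0]      # any direction orthogonal to kk

T = transition_matrix(kk, w_min, a_max)
along = matvec(T, kk)        # equals (w_min - a_max) * kk
across = matvec(T, orth)     # equals w_min * orth

lambda_along = along[0] / kk[0]
lambda_across = across[0] / orth[0]
print(round(lambda_along, 3), round(lambda_across, 3))
```

Under these assumptions the eigenvalue along `kk` is `w0 - a0`, which bottoms out near `exp(-0.606531) - 1`, and every orthogonal direction just sees the decay `w0`.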

obsidian quest
#

how to make it [-1, 1] (however i notice this will nan after some time)

a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 )
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*a)
 
new:
a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0 
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk, kk*(a.float()*torch.exp(-torch.exp(w.float()))).to(dtype=torch.bfloat16))

or (try both)

a = torch.sigmoid( self.time_aaaaa + (xa @ self.time_aaa_w1) @ self.time_aaa_w2 ) * 2.0
x = RUN_CUDA_RWKV7g(r, w, k, v, -kk*torch.exp(-torch.exp(w.float())).to(dtype=torch.bfloat16), kk*a)
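
A rough sketch of why the extra decay factor matters, assuming `torch.exp(-torch.exp(w))` is the per-channel decay the kernel applies, and again simplifying to one uniform decay `w0` and one uniform `a0`. With a unit-norm `kk`, `T = w0*I - c * kk kk^T` has eigenvalue `w0 - c` along `kk`. The function name is hypothetical.

```python
def eig_along_key(w0, a0, scale_by_decay):
    """Eigenvalue of T = w0*I - c * kk kk^T along unit vector kk.

    T kk = w0*kk - c*kk*(kk . kk) = (w0 - c)*kk, where c = a0*w0 if the
    removal term is multiplied by the decay, else c = a0.
    """
    c = a0 * w0 if scale_by_decay else a0
    return w0 - c

w0 = 0.9
for a0 in (0.0, 1.0, 2.0):  # a now ranges over (0, 2) after the * 2.0
    print(a0, eig_along_key(w0, a0, True), eig_along_key(w0, a0, False))
```

In this simplified setting the decay-scaled version gives `w0 * (1 - a0)`, which stays inside `(-w0, w0)`, while doubling `a` without the decay factor could push the eigenvalue below -1.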
iron parrot
#

After expanding the eigenvalues, RWKV-7 (green) solved the parity task instantly compared to the original version (brown)
magic!

quaint quiver
obsidian quest
#

and i don't know why it will nan after xxG tokens (well can certainly locate the nan, but too busy now)

crystal hull
#

also, the problem I'm working on is a bit more general: non-commutative group multiplication on S_5, A_{4} \times Z_{5}

crystal hull
#

even for slightly more complicated cases like modulo 3

misty igloo
tribal notch
obsidian quest
tribal notch
#

Jeez, that's so smooth

obsidian quest
#

perfectly smooth for 0.4b 1.5b too

tribal notch
#

wait is this eval loss or train loss?

tribal notch
# obsidian quest perfectly smooth for 0.4b 1.5b too

im training my own architecture and get a way more noisy training loss (if this is training loss) but my eval loss is pretty smooth and aligns more so with what you have (I am only training a ~150 param model atm)

quaint quiver
#

but thats training loss

tribal notch
tribal notch
#

that's pretty dang nice

acoustic knoll
gusty condor
remote elbow
gusty condor
#

Yes

obsidian quest
# gusty condor

rmsnorm => l2norm and there is "Text" near bottom LayerNorm

last mauve
gusty condor
#

Fixed!

quaint ingot
#

Looks a tiny bit more complex than a transformer layer 😆

obsidian quest
# gusty condor

if you draw a llama-style transformer with rotary, that can show that the illustration makes everything look more complex 🙂

gusty condor
quaint quiver
#

It should be done for each block

#

Something like this

#

But more detailed I guess and specific to rwkv

#

Recurrent view could also be shown

obsidian quest
#

yeah our goal should be making it look as simple as possible

sinful breach
#

The biggest thing that has thrown me off from looking into RWKV in the past was the overly complicated architecture diagrams that made it look like some over-engineered mess, as opposed to simple, interpretable, well-motivated architectural decisions. Equations, such as those presented in BlinkDL's tweet about the connection to lin reg gradient descent, make a lot of this far more digestible imo.

#

Even if there are still many details and components, anything that starts more abstract and introduces these "later on" for those seeking specifics on exact implementation would definitely help clarity

quaint ingot
#

I was commenting that because of the impact of seeing such a complex diagram. I actually like having the entire architecture available in a single illustration

#

But it does make a strong initial impression

#

It's hard to infer the role of components in the time mix block, but it seems to me that showing it in an intuitive way is pretty hard.

misty igloo
#

forgive the ascii art, but something more like


       ↑
    [Linear]
       ↑
  [LayerNorm]
       ↑
------[+]
|      ↑
|   [CMix]
|      ↑
| [LayerNorm]
|      ↑
-------|
------[+]
|      ↑
|    [WKV]
|      ↑
| [LayerNorm]
|      ↑
-------|
       ↑
  [LayerNorm]
       |
     input

WKV Attention Block:

         ↑
     [Linear] (W_out)
         ↑
        [*] (gate)
         ↑
        [+] (bonus)
         ↑
    [GroupNorm]
         ↑
        [*] (receptance)
         ↑
[Modified Delta Rule]
         ↑
  [Linear / LoRA] (w,k,v,r,g,a)
         ↑
   [Token Shift]
         ↑
#

and we can add detail views for the modified delta rule and how some of the special aspects of k, kappa, v are calculated

misty igloo
sinful breach
#

yes def even as ascii art (although idk about the Linear / LoRA, such details can be left to appendices and so on). For any paper, the main goal in the main text is to provide as much clarity and accessibility to as broad an audience as possible. The hope is to bait them in with an interest to understand more and without a feeling that understanding will require massive investment

#

And include a block showing what the Cmix is

#

Modified delta rule might be more understandable in terms of equations than architecturally

misty igloo
misty igloo
sinful breach
#

yea exactly. And for purposes of clarity I think starting as abstract/high level as possible and gradually going into details of specific components is going to be ideal

#

The complete diagram may still be useful for people who are interested in exact implementation specifics perhaps, and it could help in possibly highlighting differences from previous methods

misty igloo
#

on its own its complicated as a diagram, but not totally insane

sinful breach
#

not totally insane i agree, but so much less clear than the equations imo

misty igloo
#

wouldn't help me personally reading a paper to see it instead of code or equations

young sparrow
#

I think the key question to ask is: what's the most accessible form of each piece of info. I think the high level diagram is very clear in part because it maps pretty cleanly to the standard transformer diagram, so it makes it clear how it relates. But this I feel is probably easier to digest as equations

sinful breach
#

yes exactly

misty igloo
#

the less annoying TL;DR of the above diagram is:
state = state times decay, minus 'a' amount of the old value at the deformed key, plus 'a' amount of the new value at the current key

#

but the details are just a tiny bit more complicated than that bc of the LERP
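
A stripped-down sketch of that TL;DR, under loud assumptions: one head, a scalar decay `w`, a scalar `a`, and the same unit-norm key used for both the removal and the write (so no 'deformed' key and none of the extra interpolation details). It is only meant to show that, in this simplified form, the update blends the old stored value with the new one. Function names are hypothetical.

```python
def read(S, k):
    """Read the value stored at key k: S k (rows = value dim, cols = key dim)."""
    return [sum(S[i][j] * k[j] for j in range(len(k))) for i in range(len(S))]

def delta_update(S, k, v, a, w=1.0):
    """S <- w*S - a*(S k) k^T + a*v k^T for a unit-norm key k (simplified delta rule)."""
    old = read(S, k)  # the value currently stored at k
    return [[w * S[i][j] + a * (v[i] - old[i]) * k[j] for j in range(len(k))]
            for i in range(len(S))]

k = [0.6, 0.8]                                # unit L2 norm
S = [[0.0, 0.0], [0.0, 0.0]]                  # empty 2x2 state
S = delta_update(S, k, v=[1.0, 2.0], a=1.0)   # store an initial value at k
S = delta_update(S, k, v=[3.0, 6.0], a=0.25)  # move 25% of the way to the new value
print(read(S, k))                             # approximately 0.75*old + 0.25*new
```

With `w = 1`, reading back at `k` gives `(1 - a) * old + a * new`, which is the memory-style interpretation of the update.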

last mauve
# young sparrow I think the key question to ask is: what's the most accessible form of each piec...

+1 on using equations and code as the primary way to convey rwkv.

For diagrams, I propose a 3-level approach:

  1. (max-detail) The maximally-detailed figure posted above. This is for rwkv practitioners to understand rwkv version differences. This should be in an appendix of the paper.
  2. (mid-range) A modified version of the figure above, where we remove the inner details and just have blocks like "channel mix". I.e. the rwkv analogue of #1103039376184852622 message. This is for RNN/SSM researchers to compare rwkv with competing blocks like mamba/lstmx/etc. This should be placed in the "Design" section of the paper.
  3. (min-detail) A transformer-like diagram like what's proposed by @misty igloo in #1103039376184852622 message. This is the headline figure for the general public. This should be either in the "intro" or "design" section of the paper based on how we handle the storyline.
obsidian quest
obsidian quest
gusty condor
obsidian quest
#

@gusty condor we can simply call various loras lora as in the v6 graph

gusty condor
#

They are essentially different

void quartz
#

Im gonna present RWKV & QRWKV in 30 mins at the latent space event
https://lu.ma/LSLIVE
https://www.youtube.com/watch?v=wT636THdZZo&ab_channel=LatentSpace

Both Dan Fu and I agreed we are not going to go into the math / details, so we can spend more time high level - and what we expect next in the future

So no V7 beyond a mention


obsidian quest
obsidian quest
acoustic knoll
obsidian quest
#

lets call it RWKV instead of WKV

iron parrot
#

In terms of long-context PPL, RWKV-7 completely outperformed Mamba and appears to be capable of extrapolating to infinite context

#

This is the original RWKV-7 trained on Pile ctx4k

quaint quiver
#

interesting that it doesnt seem to suffer from state collapse https://arxiv.org/abs/2410.07145

young sparrow
# iron parrot In terms of long-context PPL, RWKV-7 completely outperformed Mamba and appears t...

This is pretty cool and a clear demonstration would be a compelling pitch. @nova frost has been scoping out long context evals for the eval harness and is planning to implement some more, maybe he can be a helpful collaborator?

I know that there are some formal benchmarks for long context evals used in papers studying limitations of long context models, as well as naturalistic benchmarks for tasks that require long context. Do you have any specific benchmarks you're most interested in?

obsidian quest
misty igloo
#

ruler and longbench come to mind

obsidian quest
#

can modify them to support rwkv7

misty igloo
#

otherwise we have to modify things every time we have a new architecture

#

which I seem to have a lot of lately 🤣

#

easier to modify each new architecture once to work on HF than modify 100 tools for each new architecture

#

esp since we have to make new architectures work on HF anyway

obsidian quest
#

rwkv7 not supported in HF yet

misty igloo
#

I'll just go do that - I gotta do it for QRWKV7 anyway
I guess for now I'll use icecuber's Triton implementation

obsidian quest
quaint quiver
young sparrow
nova frost
#

Will add them!

misty igloo
last mauve
#

I aim to kick rwkv-7 paper writing into gear by the end of this year. I think we have much of what we need now.

@misty igloo @obsidian quest -- Can you summarize any remaining experiments we need to finalize (e.g. the long-context discussion above)?

misty igloo
#

I can do a Q-RWKV 7B 'any old day' in hours, but that's quite a different thing than even a v6 continuation trained model

last mauve
#

Makes sense. In that case we can focus RWKV-7 efforts on nailing down:

  1. Design section and associated architecture messaging
  2. Intro/background/related-work
  3. Calculations for FLOPs and params
last mauve
misty igloo
obsidian quest
void quartz
#

i wish there were a good 70B-class model without llama or qwen's weird licensing

last mauve
void quartz
last mauve
#

😅 honestly have not considered it, until you asked lol - will discuss with smerky separately.
Yep no pressure. Think on it.

We have also started to hire postdocs (intern or fulltime), to help scale up RWKV paper processes
Exciting! I'll lyk if names cross my path.

misty igloo
#

We're also looking for a Machine Learning Research Engineer to focus on the RWKV open source software ecosystem and tech for other RWKV projects - if you're someone in one of these two categories or in between, ideally have familiarity with RWKV, and are interested in either role, definitely reach out to us

tribal notch
tribal notch
misty igloo
tribal notch
#

What is that state size?

misty igloo
#

same as mamba 2, gla, etc.

#

all these models work the same way in terms of state

tribal notch
misty igloo
#

168M:
args.n_layer = 12
args.n_embd = 768
421M:
args.n_layer = 24
args.n_embd = 1024
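
For a rough number on the state size asked about above, here is a back-of-the-envelope sketch. It assumes each layer keeps one `N x N` matrix per head (matching the `(H, N, N)` shapes in the state update quoted earlier), that `H * N = n_embd`, and a head size of 64, which is not stated in this thread. It ignores the small token-shift state. The function name is hypothetical.

```python
def wkv_state_elements(n_layer, n_embd, head_size=64):
    """Count matrix-state elements: n_layer * n_head * head_size^2 (head_size is an assumption)."""
    n_head = n_embd // head_size
    return n_layer * n_head * head_size * head_size

for name, n_layer, n_embd in [("168M", 12, 768), ("421M", 24, 1024)]:
    print(name, wkv_state_elements(n_layer, n_embd))
```

Under these assumptions the count simplifies to `n_layer * n_embd * head_size`, so it is fixed by the config and does not grow with context length.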

tribal notch
misty igloo
obsidian quest
tribal notch
obsidian quest
tribal notch
tribal notch
gusty condor
# quaint quiver interesting that it doesnt seem to suffer from state collapse https://arxiv.org/...

They don't really understand the mechanisms, and state collapse was revised to state explosion. However, that doesn't really apply to RWKV-6, because entries in WKV can go up to 1e+4.
This paper received 3,3,3,3,6 after ICLR 2025 rebuttal.
My understanding is that Mamba has poorer state management than RWKV-6 and RWKV-7. The state evolution formula plays a key role in preventing state degradation.

iron parrot
#

More test results:
original 0.4B RWKV-7 ctx4k (Figure 1) completely outperforms the 2.8B Mamba (Figure 2) on the Haystack test, even though Mamba was specifically fine-tuned for long context

#

With longer context lengths, RWKV-7's PPL continues to decrease without any apparent limitations

#

RWKV-6 vs. RWKV-7: as context length increases, RWKV-7's advantage grows

obsidian quest
#

pls test non-tuned mamba too

lean elm
iron parrot
misty igloo
lean elm
obsidian quest
obsidian quest
obsidian quest
misty igloo
misty igloo
acoustic knoll
# obsidian quest https://x.com/BlinkDL_AI/status/1869433254425833487

You might be interested, rwkv is mentioned https://arxiv.org/abs/2406.10149

misty igloo
#

@nova frost any progress with long context benchmarks? I have the RWKV-7 HF models ready

nova frost
#

do you have a link? can run them

obsidian quest
nova frost
obsidian quest
#

can ask @iron parrot

iron parrot
nova frost
#
|Tasks |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------|------:|------|-----:|-----:|---|-----:|---|------|
|niah_2|      1|none  |     0| 16384|↑  |0.2440|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.7900|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.6080|±  |   N/A|
|niah_3|      1|none  |     0| 16384|↑  |0.0640|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.7860|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.5320|±  |   N/A|
|niah_4|      1|none  |     0| 16384|↑  |0.1860|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.1680|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.1720|±  |   N/A|
|niah_5|      1|none  |     0| 16384|↑  |0.0000|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.0280|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.0000|±  |   N/A|
|niah_6|      1|none  |     0| 16384|↑  |0.0000|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.0120|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.0000|±  |   N/A|
|niah_7|      1|none  |     0| 16384|↑  |0.1190|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.2230|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.2350|±  |   N/A|
|niah_8|      1|none  |     0| 16384|↑  |0.1310|±  |   N/A|
|      |       |none  |     0|  4096|↑  |0.5040|±  |   N/A|
|      |       |none  |     0|  8192|↑  |0.3555|±  |   N/A|

did some evals on some of the other NIAH variants from ruler on RWKV7-Goose-1.4B-Pile-HF

#

The metric here is context length

obsidian quest
nova frost
#

On this branch of the harness. The prompt template is here. To choose which context lengths to run comment them out, here, and here

#

lm_eval --model hf --model_args pretrained=...,max_length=<max length evaluating> --tasks niah_2,niah_3,..

obsidian quest
young sparrow
young sparrow
obsidian quest
nova frost
#

state-spaces/mamba-2.8b does really badly for some reason, scoring max in single digits even for ctx length of 4096

#

implementation seems to be correct as it scores perfectly on the passkey retrieval (niah_1 4096 - as most models do)

#

and it was trained with 8192 sequence length

#

Completely falls apart though. mostly generating ( ( ( ( ( ( ( (, and sometimes (2 (2 (2 (2 (2 (2 (2 (2

young sparrow
nova frost
#

oh I should check how it does on niah_1 with ctx_len 8096. The main difference with that and all the other tasks is that in the former the haystack is a short repetitive phrase, while the others use Paul Graham's essays

nova frost
#

completely degenerated. Scored 0

obsidian quest
#

from #992359629419991142 message

uneven blade
#

Left: V7; Right: V6

obsidian quest
quaint ingot
#

is the intuition here that a higher rank states contain more information?

sinful breach
#

each state update is at most a rank two update to the state, higher rank states would presumably imply that a larger diversity of distinct state updates are stored/can be retrieved

misty cedar
void quartz
void quartz
# young sparrow **We should not be tuning the prompts for evaluation tasks for the sake of makin...

On that note; if the impact of just read twice is significant enough for linear models (mamba or rwkv), are we open to benchmarking that separately?

https://arxiv.org/abs/2407.05483

obsidian quest
obsidian quest
sinful breach
# uneven blade

https://arxiv.org/abs/2407.02678 might be a related perspective

midnight venture
#

perhaps with everyone shifting their focus to inference compute and 'thinking' models like o1/o3, that might be an area where v7 can shine

quaint quiver
fresh mulch
#

complexity sure but what about actual times? mamba is also theoretically linear complexity but usually slower than transformers in practice

quaint quiver
#

But with an optimised implementation it should be faster

#

Maybe noticeable after 2k tokens

misty igloo
quaint quiver
misty igloo
#

Doesn't mean that's quite the same for v7, but roughly...

misty igloo
#

But of course everything comes down to ctxlen vs how optimized the implementation can be for current gpus

#

For batched inference the kv cache thing is a big deal

fresh mulch
#

thought so. hopefully later down the line in development for RWKV-7 optimization for modern gpus gets some focus

#

re: chain of thought, it could also be viable (given there exists a sufficient dataset) to tune a small rwkv model on chain of thought and benchmark it, which would also be a nice addition to the paper + the first linear cot model?

misty igloo
fresh mulch
#

ah, didn't know that, nice

obsidian quest
#

please test RWKV-7 MQAR 🙂 using RWKV-LM --my_testing "x070" and recommended lora dimensions

steady ether
obsidian quest
obsidian quest
obsidian quest
obsidian quest
#
obsidian quest
#

@steady ether important: use LayerNorm instead of RMSNorm for RWKV-7

we should mention this in paper. i think it's related to better initial state. I find LN as fast as RMSNorm in latest pytorch

because i am not using trainable initial state (=persistent memory). found it useless, at least when using LN

steady ether
obsidian quest
#

latest checkpts: gen7 RNN finally solves MMLU
https://huggingface.co/BlinkDL/temp-latest-training-models/tree/main

MMLU
rwkv6-v2.1-7b 47.9%
rwkv6-v3-7b 54.2%

rwkv6-v2.1-3b 32.38%
25%trained-rwkv7-v3-2b9 43.08%
35%trained-rwkv7-v3-2b9 45.24%
40%trained-rwkv7-v3-2b9 47.36%
49%trained-rwkv7-v3-2b9 49.34%

rwkv6-v2.1-1b6 26.34%
38%trained-rwkv7-v3-1b5 33.89%
51%trained-rwkv7-v3-1b5 40.44%
60%trained-rwkv7-v3-1b5 40.77%
72%trained-rwkv7-v3-1b5 41.36%
tropic minnow
obsidian quest
#

rnn part 🙂

misty cedar
gusty condor
obsidian quest
obsidian quest
steady ether
paper dove
#

Hello everyone, I have just updated the data from VisualRWKV-7 in the "RWKV for Image Understanding" section. I think we should speed up the progress of the RWKV-7 paper. Recently, Google has also introduced the concept of an in-context learner, which is very similar to RWKV-7.

gusty condor
quaint quiver
obsidian quest
obsidian quest
last mauve
obsidian quest
#

let's start. 0.1b/0.4b/1.5b @ 300b is far more than what they were doing
and 1.5b v7 world in 10 days

misty igloo
fresh mulch
#

I've been interested in contributing, is there anywhere help would be particularly appreciated rn or should I just explore potential opportunities?

misty igloo
#

or experiments that train a specific model from scratch (like we do for MQAR) and test them work great, too

rose mango
#

If there's anything I can help with, let me know. Illness last month derailed most of my planned experiments.

misty igloo
rose mango
#

I was going to do RWKV-DiT

#

then I spent weeks in bed 😆

misty igloo
#

that sucks, but also it should just be its own paper 🙂

#

because it will be really awesome and deserves its own spotlight!

rose mango
#

good point there

fresh mulch
misty igloo
misty igloo
#

@obsidian quest to get the paper ready we need to start gathering the details of World v2.1,3,4
here's what I have so far:

Added in World v2.1 for ~1.4T tokens total
• cosmopedia
• slimpajama c4 (missing in v2)
• dolma v1.6 reddit
• Magpie-Align_Llama-3-Magpie-Pro-1M
• Magpie-Align_Magpie-Pro-MT-300K
• Magpie-Align_Magpie-Air-MT-300K
• Magpie-Align_Magpie-Qwen2-Pro-1M
• Magpie-Align_Magpie-Phi3-Pro-300K-Filtered
• Magpie-Align_Magpie-Gemma2-Pro-200K-Filtered
• glaiveai_glaive-code-assistant-v3
• cognitivecomputations_SystemChat-2.0_SystemChat
• migtissera_Tess_tess-v1.5
• openbmb_UltraInteract_sft
• m-a-p~Code-Feedback~Code-Feedback

Added in World v3 for ~3.1T tokens total
• remove slimpajama cc and c4
• fineweb-edu
• DCLM (only global-shard_10_of_10)
• cosmopedia-v2
• Buzz-V12
• WebInstructSub
• SKGInstruct
• math-ai/TemplateGSM
• all of starcoder (instead of only >10 stars repo)
• python-edu (in HuggingFaceTB/smollm-corpus)

Still missing v4 datasets

obsidian quest
#

v3.1 first, with more code, and o1-style data, etc

misty igloo
obsidian quest
#

v3.1 not constructed yet 😂

misty igloo
gusty condor
#

Update stability of RWKV-7 DPLR rule

obsidian quest
#

(Linear) Attention Mechanisms as Test-Time Regression

v1.1

I've added @BlinkDL_AI's RWKV-7 and fixed the update rule for Vanilla DeltaNet

---

Note that the arrows in the part where we derive linear attention variants don't necessarily indicate generality nor a tech-tree. For

dawn pewter
misty igloo
#

stable as in not growing but can still flip back and forth, which may lead to undesirable behaviors even if more expressive

dawn pewter
misty igloo
#

certainly other people have reported that it can be beneficial

#

the reality is that the decay is almost always near 1.0, so even that will not really allow all the way [-1,1] in most cases

misty cedar
#

I know its hard right now with the state of things, but please do make an effort 👇

gusty condor
#

I think this idea may have been present somewhere. There might already be code repos for this.

quaint quiver
# gusty condor I think this idea may have been present somewhere. There might already be code r...
gusty condor
#

This one is different: we don't have prompts, we don't have chat templates, and CoT is conducted on pretraining data,