#RWKV-papers

outer vine
#

i have to admit this is funny, but this also gives an easy chance to raise the score

outer vine
#

and i think maybe we should upweight the SRU paper in the next version? I know Bo is deeply inspired by AFT. But the one core ingredient for scaling up RNNs is a lightweight time-dependent operation, and RWKV follows exactly the two principles in SRU: (1) lightweight recurrence (Hadamard product) with a customized CUDA kernel, (2) other modules computed in parallel
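
To make the two principles concrete, here is a minimal NumPy sketch of an SRU-style layer. It is a simplified illustration, not the SRU or RWKV implementation: the function name `sru_like_layer`, the shapes, and the omission of SRU's highway/reset gate are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_like_layer(x, W, W_f, b_f):
    """x: (T, d) inputs. Returns the cell states c with shape (T, d)."""
    # (2) Parallel part: projections for every time step come from two matmuls.
    x_tilde = x @ W                # (T, d) candidate values
    f = sigmoid(x @ W_f + b_f)     # (T, d) forget gates
    # (1) Lightweight recurrence: only elementwise (Hadamard) ops depend on c[t-1].
    c = np.zeros_like(x_tilde)
    prev = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        prev = f[t] * prev + (1.0 - f[t]) * x_tilde[t]
        c[t] = prev
    return c
```

The loop body contains no matrix multiplication, which is what makes a small custom kernel for the recurrence practical.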

tender karma
#

Can we presume that v5 will supplant v4? Based on the differences, v5 appears to address the "true parallelization" concerns, especially given the modifications in time mixing. Have I grasped this right?

void quartz
#

but yea, v5 will presumably supplant v4 (once it has finished training: there is no fully trained model yet)

steady ether
#

Any plans for rebuttals? Could clear up questions for the committee.

tropic minnow
#

i can deduplicate the contents from figs 2-3 that a reviewer complained about

last mauve
#

Here are the current TODOs. Grab an item or two:

1. ~~(HIGH-IMPORTANCE) Fill out rebuttal section for reviewer Zd3h~~
2. (HIGH-IMPORTANCE) Fill out rebuttal section for reviewer rSzx
3. (HIGH-IMPORTANCE) Fill out rebuttal section for reviewer 85wr
4. ~~(HIGH-IMPORTANCE) Fill out rebuttal section for reviewer HDNB~~
5. Update the text to have a sentence defending the following from reviewer rSzx: Two specific tasks (ReCoRD and Winogrande, as shown in Figure 5) see the model underperforming other models. This underperformance requires further investigation.
6. Fix the following typos found by reviewer rSzx L126:a computationally efficient alternatives. L136:Simultaneously with this work, (Poli et al., 2023): citep -> citet
7. Add a sentence to the text defending against reviewer 85wr's confusion on: My understanding is that RWKV is roughly equivalent to the AFT local model that was previously presented. Yet this is not mentioned in the paper and the table does not include this key property. Is this an oversight or am I missing something?
8. Update figure 1 to fix reviewer 85wr's comment: Figure 1 needs to have actual references to datasets and calculations. Having unlabeled graphs is not okay in a published paper. Languages need to be provided as well (BLOOM is multilingual, are these English tasks?)
9. Add tables in an appendix to address reviewer 85wr's suggestion: All the main information in the paper is shown in graphs in terms of scaling. While I understand why the authors want to show their model in this way, as a reader I want to see standard tables showing tokens / ppl (or bpc). Please include these tables in the paper so I can understand the data efficiency without trying to extrapolate from tables.

#
  10. Update the fonts to address reviewer 85wr's comment: Generally the graph labels are much too small to read, please increase these to be similar to the text itself.
  11. Add a sentence or two clarifying the inference experimental setup, addressing 85wr's comment: Can you provide more details on exactly the inference method / software hardware used for the text generation results? From the text it is unclear whether it is even cpu or gpu.
  12. Table 1 is overlapping the middle margin. Needs fixed.
#

@tropic minnow -- Do you have time today to write some initial rebuttals? Feel free to take whatever help you need. I should be able to work on this later tonight, but the rebuttal is due tomorrow August 28 AoE so any help is welcome

steady ether
#

6 is fixed

steady ether
#

Added AFT-local (conv) row to table 1 for (7). Think that's what 85wr wanted

snow zealot
#

Point 11 - Added

 In Figure 6 we can see the cumulative inference time of different models when generating a sentence of 1000 tokens on an NVIDIA A100 80GB GPU.
 For all our experiments we use float32 precision and generate the sentence using sampling decoding.

to the inference results.

last mauve
#

I'm writing it now.

last mauve
#

Can you start on rSzx in parallel?

outer vine
#

sure, i would make a draft first

last mauve
#

Finished Zd3h. Moving to 85wr

#
> My understanding is that RWKV is roughly equivalent to the AFT local model that was previously presented. Yet this is not mentioned in the paper and the table does not include this key property. Is this an oversight or am I missing something?

How do I respond to this?

#
> While the pen-and-paper FLOP calculations are interesting, would be curious to understand how the actual training time compares on real hardware. Some graphs in the main paper would help.

Can we do this? Maybe infer the training time using timestamps from our logs?

#

Finished a draft on 85wr. Moving to HDNB.

snow zealot
#

So like we said in this paragraph

#

AFT learns a parameter for each $(t, i)$ pair; RWKV learns one $W$ that is multiplied by the distance $t - i$ to produce a decay

last mauve
# snow zealot I don't know if we should write something in the paper comparing both models, bu...

We have a few statements in the paper explicitly comparing AFT and RWKV. I'm thinking we say something along the lines of:

AFT and RWKV are indeed similar overall, but differ in a few key ways. We compare the exact differences between the architectures in Section 4.1, but at a high level AFT learns a decay for each pair of locations, where in the local approach this decay is 0 if the distance between two locations exceeds the kernel size. RWKV uses an exponential decay that decreases with distance.
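
A small sketch of the difference described above, for a single channel. The sizes `T` and `s`, the value of `w`, and all variable names are illustrative assumptions; the masking rule follows the AFT-local description (learned value inside the window, 0 outside), and the RWKV bias is a single learned non-negative `w` scaled by the distance.

```python
import numpy as np

T, s = 6, 2   # sequence length and AFT-local window size (illustrative values)
rng = np.random.default_rng(0)

# AFT: one learned scalar per (t, i) pair, so T*T position parameters.
aft_full = rng.normal(size=(T, T))

# AFT-local: keep the learned value only when |t - i| < s, otherwise 0.
t_idx, i_idx = np.indices((T, T))
aft_local = np.where(np.abs(t_idx - i_idx) < s, aft_full, 0.0)

# RWKV (one channel): a single learned w, multiplied by the distance t - i.
w = 0.5
rwkv_bias = -(t_idx - i_idx) * w     # only meaningful for i <= t (causal)
rwkv_weight = np.exp(rwkv_bias)      # exp(-(t - i) * w): shrinks as distance grows

print("AFT position parameters:", aft_full.size)
print("RWKV position parameters per channel:", 1)
```

The point of the comparison is the parameter count and the shape of the bias: a free table for AFT versus one decay rate per channel for RWKV.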
#

@fickle hare and @outer vine -- Do you think this response to HDNB is reasonable:

> Recent large language models are using float16 or bfloat16 precision, it will be great to see RWKV also works in these precisions.

RWKV now supports bf16 training and inference, and evaluating under this precision type is left to future work.

#

RWKV also explicitly was tested under fp16/bf16 as of June right? We kind of have to say something along the lines of "yes it works, but doing it now is too costly for a revision"

last mauve
#

From reviewer rSzx:

> A significant potential benefit of an RNN-like formulation is its applicability to longer contexts, but Figure 6 appears to limit this method to up to 2^12=4096 tokens. Further exploration of context length scaling is desirable. Additionally, most of Figure 6 is unsurprising, as more context naturally results in lower perplexity within the context window size. The figure's x-axis should start with the context window size being trained on. Clarification on the context size being fine-tuned up to would also be beneficial.

I'm thinking our response should be along the lines of:

  • We'll explore longer context than 4k tokens in future work
  • (not sure how to respond to ...most of Figure 6 is unsurprising.... I kinda agree? Am I missing something?)
  • We need to add explicit details on the fine-tuning context length strategy and also respond on the rebuttal with it. @tropic minnow and @obsidian quest -- Who can tell me this?
#

rSzx says:

> The time mixing component, while parallelizable along other dimensions, is not parallelizable in the time dimension. This lack of parallelization could become a training bottleneck for very long context windows.

I don't think this is accurate since we have time-parallel mode in 4.2. Is it sufficient to just say "we solved this, look at 4.2" or am I missing something? I need someone to double-check me here.

obsidian quest
#

"more context naturally results in lower perplexity within the context window size"
previous LSTM LMs are unable to utilize ctxlen beyond ~100 tokens

obsidian quest
# last mauve rSzx says: ``` > The time mixing component, while parallelizable along other di...

Attention is not parallelizable in time dimension (I mean going beyond O(T)), unless we use FFT-style / prefix scan-style designs and reach O(log(T))

Recently the RetNet paper claims that it can achieve time-parallelizability; however, if we expand the formulas (by looking at the hardware implementation) we can see that's not true. One still has the loop over T.

So it can only claim usage of tensorcores. And then the difference is between [GEMM on tensorcore] vs [GEMV without tensorcore].

And the second case is faster, because GEMV has far fewer FLOPs than GEMM. It can reach the bandwidth limit without using tensor cores.

obsidian quest
#

Note there is "loop over T" in GPT attention formula.

obsidian quest
#

So if we consider GPT to be parallelizable, that means "loop over T" is totally fine.

obsidian quest
#

It trains xx% faster than GPT on my A100s

last mauve
# obsidian quest Attention is not parallelizable in time dimension (I mean going beyond O(T)), un...

Ah I think I'm grasping what you're recommending then. So is it accurate then to say:

Neither RWKV nor attention-based architectures such as GPT can improve in the time dimension beyond O(T), where T is the sequence length. Therefore, either both RWKV and GPT are parallelizable in the time dimension, or neither is. We note that RWKV has a notable decrease in time and space complexity as T increases compared to competing architectures (see Table 1), and this is a key strength of our approach.
outer vine
#

maybe longer context would give more advantages

outer vine
#

with comparable settings (bfloat16, model size, context length)

last mauve
#

@obsidian quest -- I want to be explicit since I don't think we're noting it anywhere in the paper: What is the pretraining ctxlen for all the models pretrained in Table 2?

obsidian quest
#

Pile models - ctx 1024 (and then finetuned to 8192)
World models - ctx 4096 (and the community finetuned it to 128k)

obsidian quest
#

i finetune them to 2k and then 4k and then 8k

last mauve
#

Hmm, that also needs updated in the paper then :/

last mauve
#

I will update this

last mauve
#

Because what we have now is not accurate if it's actually 8k

last mauve
#

And those were repeated for both 7B and 14B in Figure 6. Got it.

last mauve
#

Ok the rebuttal is in a good spot I think. I would appreciate if someone did a pass and left comments before tonight.

#

Also, there are still a lot of work items that need done before the final paper version can be published. See #1103039376184852622 message and #1103039376184852622 message. I would appreciate help with these over the next few days.

snow zealot
#

So the cuda was 11.8 and torch was 2

snow zealot
#

@last mauve Do you want me to write a phrase stating this?

obsidian quest
#

@outer vine @last mauve
L=32 D=2560 VocabSize=65536, params count = 3.1B
Here all models are using the same FFN (RWKV-style, with sigmoid gate)

DeepSpeed ZERO2 + gradCP on 4x8 A100 40G, bf16
ctxlen=4096, bsz 4x8x6x4096 = 0.78M

RWKV, speed = 229kt/s

GPT w/ rotary, 20 heads, speed = 103kt/s

GPT (FlashAttention2) w/ rotary, 20 heads, speed = 210kt/s
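
A quick check of the arithmetic in the numbers above. Reading `4x8x6x4096` as nodes x GPUs per node x per-GPU micro-batch x ctxlen is an assumption (only `4x8 A100` and `ctxlen=4096` are stated explicitly); the variable names are illustrative.

```python
nodes, gpus_per_node, micro_bsz, ctxlen = 4, 8, 6, 4096
tokens_per_step = nodes * gpus_per_node * micro_bsz * ctxlen
print(tokens_per_step)                      # 786432 tokens, i.e. the "0.78M" above

# Reported throughputs in thousands of tokens per second.
rwkv, gpt, gpt_flash2 = 229, 103, 210
print(round(rwkv / gpt, 2))                 # RWKV vs. GPT w/ rotary
print(round(rwkv / gpt_flash2, 2))          # RWKV vs. GPT w/ FlashAttention2
```

On these figures RWKV is roughly 2.2x the plain GPT baseline and about 9% ahead of the FlashAttention2 run at ctxlen 4096.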

last mauve
#

I'm going through and posting the rebuttals now

#

They'll be posted to reviewers today AoE, so if there are any glaring issues feel free to edit through openreview

fickle hare
# last mauve Ah I think I'm grasping what you're recommending then. So is it accurate then to...

All matmuls in RWKV TimeMix are parallelizable just as in Self-Attention; the only difference is the current non-parallel-scan-style WKV is not yet parallelizable through sequence dimension. But it doesn't hurt, because:

  1. in timemix the hotspot is the matmul rather than WKV, because WKV is already sufficiently parallelized across the channel dimension;
  2. if we hit the scalability issue in the future (like over 100k seqlen, distributed over multiple GPUs), just do parallel scan and it becomes parallelizable through time dimension.
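
For reference, a minimal sketch of what "just do parallel scan" means for a generic linear recurrence `h_t = a_t * h_{t-1} + b_t`. This is not the RWKV kernel: the scalar recurrence, the function names, and the Hillis-Steele scheme are assumptions chosen for illustration. Each round below is written as a Python list comprehension, but every position within a round is independent and could run in parallel.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b, h0=0.0):
    """Reference implementation: T dependent steps."""
    out, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_style_scan(a, b, h0=0.0):
    """Hillis-Steele inclusive scan: about log2(T) rounds of independent work."""
    elems = list(zip(a, b))
    shift, rounds = 1, 0
    while shift < len(elems):
        # The comprehension reads only the previous round's values.
        elems = [
            combine(elems[t - shift], elems[t]) if t >= shift else elems[t]
            for t in range(len(elems))
        ]
        shift *= 2
        rounds += 1
    return [a_t * h0 + b_t for a_t, b_t in elems], rounds
```

The trick is that composing two affine steps is associative, so prefixes can be merged in a tree rather than one after another.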
#

I don't really have the time to work on the rebuttal, hope the above comments help. Let me know if anything is still not clear.

obsidian quest
#

We should emphasize that the training speed (token/s) of RWKV is constant regardless of seqlen. So seqlen scalability is never an issue.

It's reasonable that a 100k seqlen sample trains 100 times slower than a 1k seqlen sample, because the token/s is still constant.

@fickle hare @last mauve

gusty condor
# last mauve Here are the current TODOs. Grab an item or two: 1.~~ (**HIGH-IMPORTANCE)** Fil...

Just like any other RNN, RWKV cannot directly look back at previous information, and has to answer questions solely based on its state (memory). The Winogrande task explicitly requires at least one lookback to the referent of the pronoun, while the ReCoRD reading comprehension task requires recalling information from the preceding passage. The underperformance of RNNs and the need for specially designed prompts are further studied in Section 10 and Appendix I.

void quartz
#

😅 if there is a follow up paper for v5 (and its much larger state), i have mountains of data on how the lookback is a huge jump and quantified - doubt thats usable in the current paper though

gusty condor
#

RWKV-5: Watch out the Revenge of RNNs😆

spiral minnow
#

Just putting this thought out there. Looks like an average score of 3 (soundness) at EMNLP, even after the rebuttal/response period. The soundness score isn't the only factor for acceptance, and the excitement score is quite high, but I think it's a very borderline assessment meaning it's definitely possible that it ends up being rejected. Based on reviewer responses it seems that the presentation is what needs to be improved most, and I think the work has been out there long enough that we have new ways to explain the architecture which are clearer, and additional experiments to address some of the issues that have been raised since the paper was first released.

So, my question is: Is it worth spending 2 weeks to improve/update the paper writing/plots to address reviewer concerns, and then submit to ICLR (abstract deadline sept. 21, paper deadline sept. 28) with a version of the paper that will be significantly improved?

Some considerations: We can't wait and see what the outcome of EMNLP is, we would have to pull from EMNLP before finding out the decision. But, if we end up getting rejected from EMNLP, then we won't be able to submit to anything until ACL/ICML in january/february. If we get into findings at EMNLP, it's unknown whether we'd get a spot for a poster presentation as they did at ACL, so we could just end up with no opportunity to present at all.

#

Thoughts? @young sparrow @last mauve @obsidian quest

obsidian quest
#

ok maybe let's go for ICLR?

young sparrow
#

I don't have any info one way or the other, but that seems to strongly determine your analysis and unless you have a reason to believe it is disqualifying I would shy away from that.

gusty condor
#

I suggest that we work toward arXiv version 2. Once the anonymity period is over (accepted or rejected), we can submit arXiv version 2 with a better presentation.

#

Just to stay prepared in case there are any changes

spiral minnow
# young sparrow I think it would be wrong to assume that a paper that recieves all 3s for soundn...

I'm not assuming it won't be accepted, but my opinion is that it's highly unlikely to be accepted to the main conference, and possible that it will be accepted to findings. Also, the soundness scores weren't all 3s: it got 3 3 4 2 for soundness, and 3 4 5 4 for excitement. So a lot of this judgement is up to the AC/SAC, who will determine if high excitement is enough for a paper to get accepted to the main conference.

I hear what you're saying though. I don't have any extra information one way or the other either. I'm just concerned that if it does get rejected, the next conference deadline after ICLR is ~4 months out, and we currently have the capability of significantly improving the paper quality.

Sounds like everybody else is pretty confident it will get in though 👍

gusty condor
#

This is just a rough estimation. Given RWKV's influence, I believe that RWKV has a much higher chance of being accepted.

last mauve
#

We'll have to do the work anyway for arxiv v2 + camera-ready if accepted, and resubmission if rejected

young sparrow
#

If you are a co-author of the RWKV paper (or any other EleutherAI research paper) and you live in a country not colored green or blue on this map please let me know.

void quartz
#

btw, while its not peer review citations - you can already see them happening on arxiv (for the RWKV paper)

hushed flare
#

RRWKV makes an architecture change but doesn't even benchmark to show it does anything useful over the original implementation.

tender karma
#

ahhh

#

no sorry

#

RRWKV

hushed flare
#

The paper ^ citing it

tender karma
#

yeah sorry for increasing the entropy 🙂

young sparrow
#

This will likely be a 100-citation paper by EOY

void quartz
#

guess we are on track to a small 9000 😉

celest barn
#

I just saw the video on Yannic's channel! Congrats guys this is super cool!

fossil halo
#

Is there a simple pytorch implementation of RWKV? The implementations in the github are naturally super optimized

fossil halo
#

There's a "raw wkv function" but I'm not sure whether it does the same thing, since it says "only for generation"

hushed flare
#

The raw function is just a conv1d.

fossil halo
#

Is the raw function not like the RNN for loop over the sequence length?

#

Could I use the raw function for training as well (just slower), or is it fundamentally different?

gusty condor
#

My intuition is that RWKV is much easier to comprehend than GPT if you already know LSTM 🤔

fossil halo
#

Ok, but the cuda kernel still contains some equivalent of the `for current_index in range(seq_length)` loop?
I'm asking because I'm trying to understand to what degree RWKV can be trained "in parallel" like a transformer or Retnet
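
For readers with the same question, here is a minimal pure-Python sketch of the sequential loop being discussed, for a single channel. It follows the WKV formula as written in the RWKV-4 paper; it is not the repository's code, the function names are hypothetical, and it omits the numerical stabilization (running max exponent) that real kernels need. `w` is the decay, `u` the bonus for the current token.

```python
import math

def wkv_direct(w, u, k, v):
    """O(T^2) evaluation of the formula, used as a reference."""
    out = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_recurrent(w, u, k, v):
    """O(T) RNN-style loop carrying a running numerator and denominator."""
    out, num, den = [], 0.0, 0.0
    decay = math.exp(-w)
    for k_t, v_t in zip(k, v):
        bonus = math.exp(u + k_t)
        out.append((num + bonus * v_t) / (den + bonus))
        # Decay the past, then fold in the current token for the next step.
        num = decay * num + math.exp(k_t) * v_t
        den = decay * den + math.exp(k_t)
    return out
```

Both functions return the same values, so the loop form is usable for training in principle; it is simply sequential along the time axis, while remaining independent across channels and batch entries.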

spiral minnow
# fossil halo Ok, but the cuda kernel still contains some equivalent of the `for current_index...

I think the answer is that RWKV is parallelizable, but the code is actually not currently written in a fully parallelized way. Instead, it's written in a sort of cascading parallelism, as demonstrated in the gif here (https://wiki.rwkv.com/advance/architecture.html#how-does-rwkv-differ-from-classic-rnn), which I believe in practice is very similar efficiency to if you wrote it in the "fully parallelized" method

I could be wrong, so somebody correct me if needed.

fossil halo
#

In Hannibal's gif there's a sequence of wkv computations that look like it's going to take time proportional to the sequence length. Is this not so, even in Cuda?

#

I can see how you can do O(layers+seq) parallel time, but not O(layers) like transformers. Is this not right? I'm not saying it's a problem. In practice the number of layers is probably not that different from the sequence length.

obsidian quest
#

"take time proportional to the sequence length" is expected. that's how you get constant token/s regardless of ctxlen.

fossil halo
#

Yes, but when people say transformers are "parallelizable", in this context, they mean that you only need a number of steps proportional to the number of layers. Every cell in the sequence dimension can be done in parallel/batched.
I'm not saying this means RWKV is bad, or that this is an important difference. I'm just trying to understand if RWKV is like transformers in this way, or like RNNs.

outer vine
#

for easy understanding, you could simply take RWKV as an RNN

void quartz
#

not sure if im allowed to post this in the general channel haha
( any mods, let me know where i can repost this )

young sparrow
#

Submitting to two conferences simultaneously is against the rules everywhere and grounds for rejection from both

spiral minnow
#

I'm fairly confident that it is within the rules of EMNLP

celest barn
#

It doesn’t really fit well with the NeurIPS workshops and as far as I can tell this is a conference track paper in caliber anyways.

#

Submitting to a nonarchival and archival one is also against the policies of a lot of workshops

gusty condor
#

Wait, are we still in anonymity period? Anonymity period lasts until the final results (accept/reject) are out, on Oct 6, 2023.

tender karma
#

I’ve the same understanding of the anonymity period

void quartz
#

Some friends of RWKV at the Frontier supercomputing clusters are asking "RWKV under the Linux Foundation" to apply for SummitPLUS: https://www.olcf.ornl.gov/summit-plus/

So that we could potentially use this to train larger foundation models for RWKV v5

It would help the application process if we have a PI / Co-PI, preferably someone from a university or research center in the US.

Would anyone be interested in doing a joint application with me and blink ?

jade lotus
#

if no hits here, you might try #general and #off-topic too, but maybe give the people in this channel preference - i'd put money on someone being available though

young sparrow
#

Yes! We were very excited to win the only INCITE grant for pure AI research last year with LAION and Mila 🙂

young sparrow
#

Tagging @last mauve for his awareness as he also has experience with OLCF applications

tough crane
#

@void quartz FYI: If you and blink are looking for discounted computing resources for v5, one option is to apply to a competition for access to the Japanese government's computing cluster, ABCI, where a single node costs 6.64 USD/hour (up to 60 nodes, each with 8 A100s (40GB) and 480GB of CPU memory). This is less than 1/4 of the 32.77 USD/hour of a p4d.24xlarge. The application needs someone at an academic institution or a corporation inside the country.

void quartz
#

Sorry for the delay, i drafted the following - after bouncing some ideas with the folks at oakland - they felt it was best to highlight RWKV energy efficiency

https://docs.google.com/document/d/17JBx_h-8k5S36Z5d1rggLL3wFL8iLXSGjvLUNm0F5AM/edit?usp=sharing

void quartz
#

(asking for the HPC application)

obsidian quest
#

current code can support lots of nodes. i only tried 12x8 A100 40g

void quartz
#

i think they want to project how long it would take on the HPC cluster

obsidian quest
#

RWKV-4 14B BF16 ctxlen4096 = 114K tokens/s on 8x8 A100 80G (ZERO2+GradCP)

#

RWKV-5 is a bit slower because of suboptimal CUDA kernel
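
A back-of-the-envelope way to turn a throughput figure into a projection for the application. The token budget below is a purely hypothetical placeholder, not a planned training run, and the throughput is the RWKV-4 14B number quoted above (RWKV-5 was reported as somewhat slower, so treat the result as optimistic). The function name is illustrative.

```python
def training_days(total_tokens, tokens_per_second):
    """Wall-clock days needed at a constant throughput."""
    seconds = total_tokens / tokens_per_second
    return seconds / 86_400

reported_tps = 114_000        # RWKV-4 14B, bf16, ctxlen 4096, 8x8 A100 80G
hypothetical_tokens = 1e12    # ASSUMPTION: placeholder token budget
n_gpus = 8 * 8

days = training_days(hypothetical_tokens, reported_tps)
gpu_hours = days * 24 * n_gpus
print(f"{days:.1f} days, {gpu_hours:,.0f} GPU-hours")
```

Because tokens/s is constant with respect to ctxlen, the same estimate applies regardless of how the budget is split into sequences; scaling to a different cluster would need a measured throughput on that hardware.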

gusty condor
#

Is the RWKV paper accepted?

void quartz
#

When that happens it means peer review process is completed?

obsidian quest
#

we can work on an RWKV-5 paper

misty cedar
#

We should definitely add all the memory experiment data to show how much it improved

misty cedar
#

also

#

after a small amount of testing

#

I have found that almost none of the information for rwkv5 is stored in the time_shifts

#

also

#

the state is huge

obsidian quest
#

state is 32x of rwkv4

misty cedar
#

for 1b5, does
32*64*64
=131072 values per layer
seem right?

obsidian quest
#

yes

#

D * Headsz (64)
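
The count above, spelled out. `d_model = 2048` for the 1b5 model is inferred from 32 heads of size 64 and is an assumption here, as is the way the "32x" is reproduced: it matches if one counts only the numerator and denominator vectors (2 x D values) of RWKV-4's WKV state per layer.

```python
def rwkv5_state_values_per_layer(d_model, head_size=64):
    """One head_size x head_size matrix per head, i.e. d_model * head_size values."""
    n_heads = d_model // head_size
    return n_heads * head_size * head_size

d_model = 2048                                  # ASSUMPTION: 32 heads * 64
v5_values = rwkv5_state_values_per_layer(d_model)
v4_values = 2 * d_model                         # ASSUMPTION: numerator + denominator only
print(v5_values, v5_values // v4_values)
```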

misty cedar
#

absolute insanity lol

misty igloo
# obsidian quest we can work on an RWKV-5 paper

I'd love to help with an RWKV5 paper. Been writing and experimenting with my own related models and modular system for comparison training of similar components since the retnet paper was released, so I'm very familiar with both it and related architectures.

young sparrow
#

@obsidian quest where can I find a detailed breakdown of the training data?

subtle oak
#

Seems we are rejected by EMNLP😅

young sparrow
#

How did you see that?

subtle oak
#

You can see revision of our manuscript

#

And you will find the submission venue ID has been changed to rejected..

young sparrow
#

😦

subtle oak
#

Github 18k+ stars project rejected by EMNLP😅

#

That’s a joke haha

#

Maybe we need to wait for the meta review to see what happens…

young sparrow
#

There's a trlX paper under review at EMNLP that shows this too

subtle oak
#

Oh yeah I find that

#

My reviewer console also shows that, all papers change to the Rejected🤣

young sparrow
#

Oh yeah same. I suppose it's a bug then

subtle oak
#

I reviewed 3 papers and the Meta reviews showed that these papers should be accepted to main conference, but now all in Rejected

#

Yeah I think it’s a bug haha

last mauve
#

EMNLP is killing me

#

Delayed results, no communication, then this bug that gives everyone a heart attack with no announcement, forcing us to compare notes

last mauve
# obsidian quest we can work on an RWKV-5 paper

Can you summarize what's different between RWKV-5 and the RWKV-4 arch we submitted to EMNLP?

We need to decide whether we want a bunch of small followup papers, or build them up into a big paper like our first EMNLP submission.

spiral minnow
#

Congratulations to everybody 🎉 !!

sharp sonnet
#

🥳

tropic minnow
#

Wohooo

#

🙌 Accepted🚀

jade lotus
#

Awesome!

tropic minnow
#

there will be a chance for a poster it seems!

sharp sonnet
#

Just to confirm, @obsidian quest are you okay with EMNLP Findings?
Or do you prefer a main conference?

Findings means we cannot present the work at the actual conference

fickle hare
#

#1083107245971226685 message

last mauve
# sharp sonnet Just to confirm, <@870137517020688415> are you okay with EMNLP *Findings*? Or d...

My opinion matters less than Bo's, but I think that findings are fine. I think we'll fall into the "highly-cited findings papers" (context: https://twitter.com/gneubig/status/1451317435278270466?lang=en), and the primary benefit of being accepted into EMNLP is the stamp of approval that the RWKV arch is technically sound and can withstand the scrutiny of peer-review.

Presenting at the main conference would be a nice-to-have, but we don't have the issue of people not knowing RWKV exists like many other papers do.

sharp sonnet
#

I agree with this too. I see the current publication as a credibility stamp and the number of citations this is accumulating would help us with any further academic-ish grants

misty igloo
# fickle hare afaik only wkv replaced with that new mechanism (named wkv5 in the code)

wkv is now w*transpose(k)*v so it's a matrix rather than a vector, and the numerator/denominator in rwkv1-4 no longer need to be tracked separately
The matrix version of wkv lets you store way more state data, so it has much larger memory abilities, and is more analogous to how you can adjust traditional attention's softmax(q*transpose(k))*v into linear attention style q*(transpose(k)*v) via associativity if you remove the softmax
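
A small NumPy sketch of the associativity point and of a matrix-valued state. This is generic decayed linear attention for one head, not the exact wkv5 operator: the scalar decay `w`, the absence of a bonus term for the current token, and all names and sizes are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

# Without the softmax, matrix products are associative:
quadratic = (q @ k.T) @ v      # builds a T x T matrix first
linear = q @ (k.T @ v)         # builds a d x d matrix first
assert np.allclose(quadratic, linear)

# Causal, decayed version as a recurrence over a d x d state matrix.
w = 0.9                        # illustrative scalar decay per step
state = np.zeros((d, d))
outputs = []
for t in range(T):
    state = w * state + np.outer(k[t], v[t])   # rank-1 update transpose(k_t) * v_t
    outputs.append(q[t] @ state)               # read out with the query/receptance
outputs = np.stack(outputs)
```

The state here holds d x d values per head instead of a pair of length-d vectors, which is where the larger memory capacity comes from.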

fickle hare
#

(yeah, but remember the exponentially decaying 'position embedding')

misty cedar
#

V4 was a legitimately terrible architecture, it's a miracle it did so well.
V5 is going to decimate other similar models

young sparrow
#

@obsidian quest Did you ever run the extended scaling laws grid we had discussed? I think there's a good chance that that could turn into a paper too.

young sparrow
#

IDC which we do it on 🙂 But I'm very interested in seeing if optimal data:param ratio is the same for transformers and RWKV. It looks like it could be, but we need more data.

#

Is there a reason to not use the same data we were using before? Seems like a waste to change the dataset

young sparrow
#

Yeah for some reason I thought that didn't work but it does

last mauve
#

Now that we're accepted, time to work on the following (in order):

  1. Prepare the camera-ready for EMNLP (by Oct 20)
  2. Update the arxiv version with the same changes. I suspect this will be the last time we touch this submission so that we can move forward.
  3. Announce on Twitter with a thread of major results?
  4. Create the poster for EMNLP
  5. Start brainstorming on the next RWKV paper's outline. Can contain completed (e.g. v5) and in-progress work items. I suspect this submission will start crystallizing around EOY
#

I'll send out the latest work items for #1 and #2 on Monday.

young sparrow
#

I was fiddling with the author block, alphabetizing authors and adjusting formatting a little. It needs a little more love but I'll be done by the end of the day

#

Looking over the reviews, I don't understand what this is asking about

> All the main information in the paper is shown in graphs in terms of scaling. While I understand why the authors want to show their model in this way, as a reader I want to see standard tables showing tokens / ppl (or bpc). Please include these tables in the paper so I can understand the data efficiency without trying to extrapolate from tables.

What is supposed to be measured in tokens / ppl?

misty cedar
#

( Context size training -> accuracy ) relationship?

young sparrow
#

Wow I forgot how much of a crab Reviewer 85wr was.

> Figure 1 needs to have actual references to datasets and calculations. Having unlabeled graphs is not okay in a published paper. Languages need to be provided as well (BLOOM is multilingual, are these English tasks?)

It's labeled "average across 12 tasks" and in the experiments we list... 12 tasks. Surely it's not beyond this person's reading comp to figure this out...

spiral minnow
# young sparrow What kind of feedback?

The big picture of the feedback (my words, not his), we need more science.

Specifically, he asked for ablations on individual portions of the architecture to try and tease out what role each part contributes to find perplexity / flop savings.

Also, he suggested (and I agree) that it could be written a little less like marketing material. Meaning that we should have more description of what we did, and why we did it. So things like, an explanation of which parts of the architecture were chosen for speed vs. accuracy. And more here is where it improves over the transformer, but here is where it lags behind, discussing the tradeoffs.

Some of these may have been improved in the version submitted to EMNLP, but he only had access to the ArXiv version

young sparrow
# spiral minnow The big picture of the feedback (my words, not his), we need more science. Spec...

> Specifically, he asked for ablations on individual portions of the architecture to try and tease out what role each part contributes to find perplexity / flop savings.

I don't really view this as viable, nor is it a very common thing to do. The level of rigor we hold ourselves to here is comparable to other LLM papers IMO (GPT-NeoX-20B, PaLM, LLaMA).

> Also, he suggested (and I agree) that it could be written a little less like marketing material. Meaning that we should have more description of what we did, and why we did it. So things like, an explanation of which parts of the architecture were chosen for speed vs. accuracy. And more here is where it improves over the transformer, but here is where it lags behind, discussing the tradeoffs.

I'm not sure what parts you think read like marketing material, but those should absolutely be cut. Can you point them out?

Maybe you mean the chat stuff? I had assumed we had run out of time with that. I agree that at present it doesn't add anything to the paper, but think that's a reason to improve it not delete it. Rather than compare to ChatGPT-4, we should probably be comparing to other OS models.

> Some of these may have been improved in the version submitted to EMNLP, but he only had access to the ArXiv version

No, they had access to the EMNLP version when reviewing a submission to EMNLP.

#

I noticed that there's a lot of experiments in the appendix that aren't even referenced in the main text, such as the wikitext perplexity and LRA evaluations. This was because we ran out of space, though I continue to think Sec 2 is unnecessary and can be removed and/or merged with Sec 3. These results may need to stay in the appendix, but they should absolutely be referenced in the main text when talking about long contexts.

tough crane
#

What kind of negative ratings could decrease the score level from main-conference accepting level to findings one?

  • Component wise detailed ablation study at pre-training phase ??
  • Significant margin of benchmark performance against other competitive models like RetNet ??
  • Or any other aspects to be improved ???
gusty condor
#
  1. Ablation studies: possible, but I don't believe that it's the key reason. It would be better if we added some ablation studies, since there are tons of new tricks, like the WKV CUDA kernel, token shift, small init embedding, etc. These new tricks might be of interest to someone, but it's still unclear how they really work. (For example, I once questioned the coefficients in the token shift because of numerical instability.)
#
  2. Significant margin of benchmark performance against RetNet: This is really unlikely, since RetNet is later work than RWKV, cites RWKV, and was posted after the EMNLP deadline.
#
  3. Other aspects: I suspect that it's the extremely competitive nature of top AI conferences. Of course, there are many articles better than RWKV with better soundness and presentation (i.e. storytelling).
obsidian quest
#

my previous experiment, data = SlimPajama

retnet official repo ("torchscale", gray) vs older and weaker rwkv5 ("r2r3", cyan)
it will nan in fp16 too (the small circle on x-axis around 0.6 G tokens)

my implementation of retnet wont nan, and performs better, but still no match for rwkv5

probably that's why they havent released any models

gusty condor
#

same amount of parameters?

#

L24 D2048 is around 1.5B

obsidian quest
#

same amt

#

i found their design does not scale well beyond 0.4b params

young sparrow
# tough crane What kind of negative ratings could decrease the score level from main-confere...

I think it's mostly bad luck with reviewers. We got shafted pretty hard, and many of their complaints are extremely unreasonable. I expect that this is going to be one of the most cited papers coming out of EMNLP this year.

The paper isn't the best written thing and could present our results in a better or more compelling light. But in my mind the most compelling version of this paper is award-worthy, not just main-track worthy.

#

IMO the things we should change for the camera-ready are:

  1. We need to do a better job with the experiments for long-context. We have LRA results in the appendix that are never mentioned, but we should eval on actual long-context benchmarks for text models and extend our analysis to much longer sequences than we did. If this is actually "infinite context," let's show evals with 100k+ sequence length. I'm also still unsure what the long context evals in the main body are supposed to show.
  2. We should add the S4 variant that's been scaled to > 1B params to our primary NLP evaluations
  3. We should eval on MMLU
  4. The stuff about the chat model in the appendix seems largely irrelevant to the paper. We should either cut it or work it into the narrative better. If we keep it, we should be comparing against similarly sized models not GPT-4. IIRC Raven was at the top of the open model on some chat benchmarks... we should show that off!
  5. General principle: everything in the appendix needs to be at least referenced in the main body.
tender karma
#

I largely agree with @spiral minnow and we can take a less improvised approach for the paper describing the v5 (which I assume is the v 5.2).

misty cedar
#

5.2 ( aka revision 4 )
is the finalized rwkv v5 algorithm

tough crane
fickle hare
#

yea, kinda like that

#

it's already the case in v4, where the softmax is taken on a decayed k, after the exp it becomes exponential

obsidian quest
gusty condor
# tough crane What kind of negative ratings could decrease the score level from main-confere...

Another reason is that the topic of RWKV is a little far from the main focuses and topics of EMNLP. EMNLP does not really suit RWKV.
Look at this (Mostly in Chinese, just see the titles): https://mp.weixin.qq.com/s?__biz=MzI1ODI2ODI1MA==&mid=2247484873&idx=1&sn=00fe41a7da8f0544d050c84a2ee0fbff&chksm=ea0b88fcdd7c01ea815c3a44620279f457d6821b39e7d9ec96260952f9234ae782fba9471061&mpshare=1&scene=23&srcid=1009TP3yfdFSLUtUYr0q66Pu&sharer_shareinfo=3d763bdae0c3c483c1a7643fafe6d90d&sharer_shareinfo_first=3d763bdae0c3c483c1a7643fafe6d90d#rd
There is not so much related to model architecture, just using models to solve problems like speech transcription, multilingual translation and some more. Therefore, RWKV seemed to be of little interest to EMNLP.

spiral minnow
# young sparrow > Specifically, he asked for ablations on individual portions of the architectur...

I don't really view this as viable, nor is it a very common thing to do.

That's a fair point, I'm not sure how expensive it is to run the main experiments with more variations on the architecture. But maybe we can do some smaller scale experiments? I don't have a lot of concrete ideas here, just passing it on from Sasha.

I'm not sure what parts you think read like marketing material, but those should absolutely be cut. Can you point them out?

I think his point on this wasn't that any specific section was written as marketing material, but more suggesting that not enough of the paper was dedicated to analysis.
Directly from him: "I think a lot of the experiments could be trimmed down to a less marketing version of how do RNN models work on real language that is honest and clear about what works and what doesn't".
I see both sides of this, I think a lot of the paper is spent on background and methods, which makes sense because there are a lot of details to the method which the reviewers/readers may not be familiar with. On the other hand, if I were reviewing this, I would agree that evaluations section really only touches on the high-level results and includes very minimal discussion. It feels like there are so many results and there could be some analysis of all of it to better understand when RWKV improves over transformers and when it does worse, and then trying to propose reasoning for why we think that happens.
Maybe this paper is a better fit for a journal because 8-10 pages isn't enough space to go into much depth.

spiral minnow
tough crane
#

Start brainstorming on the next RWKV paper's outline. Can contain completed (e.g. v5) and in-progress work items. I suspect this submission will start crystallizing around EOY

Could we split paper's ideas of v5 (or later) into narrower scopes, RQs and desirable supporting experiments including ones that should be conducted in the future? And could we consider the venue to be submitted for each portion of ideas??

Relatively smaller and more specific portions could be a better fit for conference length limits.

gusty condor
#

Any overleaf links for new papers? I have more spare time this semester to help with the article.🤔

tough crane
void quartz
#

Right now one of the common criticisms was how we lack more details and depth for each segment. And I’m like - at that point it’s a book

tough crane
#

yeah, a text-book is a structured and assembled collection of many papers.

obsidian quest
gusty condor
void quartz
#

Related to our Oakland HPC compute application.

We are trying to frame it as the world's most energy efficient model at the 40B param scale

So a possible paper path is comparing the energy consumption on inference between various models with different input and output context length

silver leaf
#

Would be nice to have something like https://arxiv.org/abs/2310.06839 side-by-side comparison with RWKV vs GPT

#

RWKV doesn't really provide much of compelling case (aside from memory saving) for just simple chatbots that can keep prior context mostly in cache

young sparrow
void quartz
#

Current benchmark for 7B models put us well ahead on a joules per token basis compared to other models

young sparrow
#

Huh

void quartz
#

This should still hold on higher param count, due to the lower gpu usage on inference (compared to models of same param count)

young sparrow
#

That's quite interesting

#

Though I'm a little suspicious about the amount of variability that's shown for 7B models... those are mostly basic decoder-only models and should be the same right?

void quartz
#

I suspect it’s the lower vram usage

young sparrow
#

Why is StableLM substantially lower cost than Alpaca? Aren't they literally the same architecture?

void quartz
#

Ahh that. We’ll have to investigate further I suppose into their methodology

#

Since none of us @ rwkv were involved in this benchmark

young sparrow
#

Yeah sorry. My skepticism isn't about RWKV, but all the transformers are nearly identical algorithms but show variance of ~ 20%

void quartz
#

TBH considering how we observed perf difference in inference libraries even within rwkv and llama

It might even just be that

young sparrow
#

I suspect it is, or minor implementation differences in the HF library leading to different efficiencies

#

If that's the case, it's "not real" in the sense that if you are running at scale with an optimized implementation the difference goes away

#

Heck, our advantage could just be from custom CUDA kernels

void quartz
#

Yup HF has its own optimisation. And our libraries differ between custom cuda optimised and non cuda optimised code

#

Hmm. I guess there is lots more to explore on this angle than I expected

young sparrow
#

RWKV and a transformer are the same number of FLOPs for a forward pass. So while it's certainly possible to be lower energy my prior is that it wouldn't be if you optimize them equally... unless there's something in the architecture that's a better fit for GPU computing

void quartz
#

Lower vram usage?

young sparrow
#

Does that equate to lower power draw? I don't know.

silver leaf
#

somewhat, but it's not really that strong case

#

it translates indirectly: if you have to clear the cache and recompute the prompt, then you burn a lot of co2

void quartz
#

As much as I understand gpu and shader code. I never looked at it from a per watt basis before 😂

#

Game development never really cared about that

young sparrow
silver leaf
#

there are some hard numbers for this for consumer GPUs if you look around, but it's been mostly an issue with older GDDR5/6, not the ultra efficient HBM2s

young sparrow
#

That said, if the goal is to get the ORNL grant there's a sense in which it doesn't matter. If the independent benchmark says you're way better you can cite that without feeling bad about it

void quartz
#
igor´sLAB

Well, meanwhile there are several leaks of "pre-release" models of the upcoming GeForce RTX 3080, but I don't really trust the roast published here, because I just assume design validation samples.

#

I also wonder how much of that 230 watt is to transferring data from vram to gpu and back

silver leaf
#

and it doesn't matter how much memory you're using when you're inferring, it will always dial the mem clk, and subsequently power usage, full throttle

void quartz
#

I also wonder if there is big difference between consumer and DC cards

silver leaf
#

yes, huge

void quartz
#

As the vram is tuned very very differently from what I understand

silver leaf
#

entirely different memory architecture, for starters lol

void quartz
#

😂 we keep getting more questions at every layer we peel off this onion

silver leaf
#

best data you can get is if you look around hardware forums with people troubleshooting idle power usage

#

turns out its just clk spiking due to desktop tasks and what not, and their giant radeon/nvidia with 16gb eating 30w doing nothing

void quartz
#

Yea cause I know a100 idle is huge. And 7B is definitely underusing the gpu

young sparrow
void quartz
#

Yea. Just using it as an approximate of how big of an impact vram can possibly be

silver leaf
#

A100 memory frequency is just locked to 1ghz. DC cards are just made with the presumption of running full throttle at all times (meaning you burn all your flops doing parallel inference tasks, too), a reasonable assumption.

void quartz
#

Then the numbers advantage makes less sense 😂

young sparrow
#

@silver leaf You seem to know your shit. Are you a CUDA or data center engineer by any chance?

young sparrow
silver leaf
# void quartz Then the numbers advantage makes less sense 😂

As I said earlier, I'd focus on the angle of using less memory -> you can cache more/run more inferences in parallel -> which can be useful for a lot of specialized tasks like QA retrieval and other sorts of prompt engineering, but translates poorly to just plain chatbots.

void quartz
#

I might be wrong on this. But AI models are somewhat constant energy usage on a per token basis (assuming same input token length) ?

silver leaf
#

There's also the issue of plain GPT models being ultimately memory bandwidth bound. No matter how you parallelize inference, you end up with all that K/V cache traffic on your hands.
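
A back-of-the-envelope sketch of the memory argument above. It assumes a plain multi-head-attention decoder that caches one key and one value vector of width D per layer per token, and it borrows the "five vectors of size D per layer" figure for the RWKV-4 state that is quoted later in this thread. The function names and the example dimensions are illustrative and do not come from any benchmark.

```python
def kv_cache_bytes(n_layers: int, d_model: int, n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes of K/V cache for one sequence (assumes full multi-head attention, no MQA/GQA)."""
    return 2 * n_layers * d_model * n_tokens * bytes_per_elem


def rwkv4_state_bytes(n_layers: int, d_model: int, bytes_per_elem: int = 2) -> int:
    """Bytes of recurrent state for one sequence, assuming five D-sized vectors per layer."""
    return 5 * n_layers * d_model * bytes_per_elem


# Hypothetical L24 D2048 configuration; the state size does not depend on context length.
L, D = 24, 2048
for T in (1024, 8192, 100_000):
    ratio = kv_cache_bytes(L, D, T) / rwkv4_state_bytes(L, D)
    print(f"ctx={T:>7}: KV cache is about {ratio:.0f}x the RWKV state")
```

Under these assumptions the ratio is simply 2T/5, so the gap grows linearly with context length, which is the "cache more / run more sequences in parallel" angle rather than a per-token energy claim.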

void quartz
#

I think we can validate this train of thought by simply initialising empty models at a specific param count.

And just measuring energy usage across X K token inference
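
A minimal sketch of the measurement proposed above, assuming power readings have already been collected as (time, watts) samples during a generation run. How the samples are obtained (for example by polling the GPU's reported power draw) is left out; `joules_per_token` and its arguments are hypothetical names for illustration.

```python
from typing import Sequence, Tuple


def joules_per_token(samples: Sequence[Tuple[float, float]], n_tokens: int) -> float:
    """Integrate (time_s, power_w) samples with the trapezoid rule and divide by tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    if len(samples) < 2:
        raise ValueError("need at least two power samples")
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)  # watts * seconds = joules
    return energy_j / n_tokens


# Made-up example: a steady 200 W draw for 2 s while generating 100 tokens.
samples = [(0.0, 200.0), (1.0, 200.0), (2.0, 200.0)]
print(joules_per_token(samples, 100))
```

Subtracting idle power before integrating would be one way to separate the model's cost from the card's baseline draw, which matters given the idle-power discussion above.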

silver leaf
#

which implementation should I be looking at to find parallel inference server for RWKV?

void quartz
#

So for better or worse it includes all of huggingface's optimizations for each model

silver leaf
void quartz
#

Is it possible to measure an architecture potential efficiency?

Cause down this path it can end up being who writes the best cuda/Vulkan code

silver leaf
#

ye, its sort of do you really want to be in this race, theres a lot of resources thrown to microoptimize gpt inference

void quartz
#

There will probably be different numbers for batched and unbatched modes lol

silver leaf
#

but then again, most of it can be reused, ie adding rwkv into vllm

void quartz
#

And we might just end up being more efficient because we can cram in more batches in same number of vram lol

silver leaf
#

ye i'm pretty certain rwkv could be huge win in large model / low vram situation

#

even 40g A100s probably

void quartz
#

Ok my plan tentatively is

  • proceed with the compute grant application
  • do some benchmarks to replicate in non batching mode (HF implementation), using empty init model for larger models if we dun have one
  • (stretch) benchmark batched mode
  • when the training completes rerun with trained model
#

I agree that the numbers do seem off for models which should be the same architecture. So replication seems to be the only route to figure this out further

#

Besides the grant if given is for next year. So there is time in between 😂

tough crane
void quartz
tough crane
# void quartz I need someone to confirm this for me. If I have 2 different prompt of same leng...

I agree to this statement.

I assume that the J depends only on the arithmetic operation type and data type (float16, float32, int8, int16 etc).

An example of worst case scenarios (very very very unlikely) is as follows:

1: If RWKV is quantized via 3-bit int, task accuracy inevitably decreases largely.
2: Someone invents a novel operation that, "only" for 3-bit arithmetic, is so much more energy efficient than the other operations (fp, or int8, int16) that it pays for a quadratic number of operations.
3: Quadratic attentions with 3-bit quantization can keep good task accuracy.

Then, energy drawbacks of quadratic attention are paid off...

void quartz
#

I think we can approach it without quantization first haha

#

Cause quantisation techniques in concept applies to all models

tough crane
void quartz
#

Yea we are like < 20 watt haha

obsidian quest
misty igloo
#

there's a reason they charge a lot more for inference on chatgpt4 long context edition 🙂

young sparrow
#

@everyone the camera ready deadline is in one week. The major to-do items are:

  1. Do a better job with the experiments for long-context. We have LRA results in the appendix, but we should really evaluate on an actual long-context benchmark and compare with other recent techniques for extending the context length of a transformer.
  2. Compare to S4, if possible. I've contacted the people who claim to have trained a 1.3B parameter S4 model as they didn't release anything larger than 125M.
  3. The stuff about the chat model in the appendix seems largely irrelevant to the paper. We should either cut it or work it into the narrative better. If we keep it, we should be comparing against similarly sized models not GPT-4. IIRC Raven was at the top of the open model on some chat benchmarks... we should show that off!

Maybe some other things? These seem like the main areas of concern to me, but maybe @obsidian quest @tropic minnow @last mauve disagree.

Who has bandwidth to volunteer to work on these items as soon as possible. We should have a target deadline of Wednesday for getting the results in.

obsidian quest
jade lotus
#

temperature of .8 seems to be a little better, with fewer 4th wall continuations

tropic minnow
tropic minnow
# young sparrow @everyone the camera ready deadline is in one week. The major to-do items are: 1...

imo chat stuff is highly subjective and hard to assess scientifically as it's very easy to 🍒 pick. the way i see it is more for showcasing applications and for a "shock/PR/marketing" for scientific community. An example that RNNs can also be assistants/chat interfaces; not just transformers. i think RWKV is the first to show this at sufficient quality. After all, RWKV community is alive bc people are interested for its "industry" applications given its efficiency, etc.

#

I agree we should try to integrate the narrative better and compare to similar sized transformers

tropic minnow
young sparrow
#

This is probably easiest to do quickly, from https://arxiv.org/abs/2309.00071

hushed flare
remote elbow
hushed flare
# remote elbow

Is there a link to code? Unclear how the matrix valued adjustment is being done.

misty igloo
misty igloo
#

I wouldn't characterize it that way. But it does work more like linear attention this way, with r replacing q in q@(k^T@v)
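
The `q@(k^T@v)` ordering mentioned above can be checked numerically. This is a sketch of plain, non-causal linear attention with no softmax, decay, or normalization, using NumPy and random data; it is not RWKV-5 itself, and `r` simply stands in for the query.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4  # sequence length and head size (arbitrary small values)
r = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))

# Attention ordering: build the T x T score matrix first.
quadratic = (r @ k.T) @ v
# Linear-attention ordering: build a d x d summary of keys and values first.
summary = k.T @ v
linear = r @ summary

print(np.allclose(quadratic, linear), summary.shape)
```

Without a softmax between the two products, matrix multiplication is associative, so the d x d summary can be carried forward as a fixed-size state instead of keeping every key and value.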

hushed flare
#

What is the difference? It's an element-wise comparison across the entire sequence which seems to use group-norm instead of softmax at the end?

misty igloo
#

softmax is only applied at the (q@k^T) part in traditional attention, and group norm doesn't perform a related function

#

softmax causes negative dot product (cosine similarity) results between the query and the keys to become nearly zero, while emphasizing ones that are aligned

#

and that resulting set of attention 'weights' is used to select from values
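
A small illustration of the point above, using made-up dot products for a single query against three keys. The scores and values are hypothetical; only the standard library is used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical query-key dot products: aligned, weakly aligned, opposed.
scores = [4.0, 0.5, -3.0]
weights = softmax(scores)

# The weights then select a blend of (here scalar) values.
values = [10.0, 20.0, 30.0]
output = sum(w * v for w, v in zip(weights, values))
print([round(w, 4) for w in weights], round(output, 3))
```

The opposed key ends up with a weight close to zero and the aligned key dominates, so the output sits near the first value.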

hushed flare
#

That's mostly semantics. Changing Softmax to ReLU or just using the raw linear dot product with a final gate multiplication still yields comparable lookups. #research message

misty igloo
#

it's not semantics at all - this math doesn't do anything like that

#

I agree that other functions that squash the negative dot products can work well (I've tried)

#

as for using the raw linear dot product with final gate, I don't agree that works the same

#

I've seen plenty of linear attention papers that use it raw, or apply nonlinearities to q and k before multiplying, but my experience is that it's way less effective

#

and not the same kind of thing, mathematically

hushed flare
#

I completely agree that it's not the same mathematically, but functionally the models seem to learn and perform very similarly.

misty igloo
#

not in my experience! (don't get me wrong, I love rwkv)

#

but everything I've ever tried, which is a lot, points towards linear attention learning much more slowly than traditional

#

my description of the difference in this attention part of the models would be:
traditional attention is a mushy hashtable, where similarity between q's and k's chooses a mush of v's to return
rwkv5 style attention is a mushy decaying memory storage device, where 'k' chooses what address lines to store 'v' values in for later consumption, and 'r' selects a mush of address lines to return
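
The "decaying memory" picture above can be sketched as a small recurrence. This is a deliberately simplified illustration and not the actual RWKV-5 update: it assumes a single head, a fixed per-channel decay in (0, 1], reads the state after writing to it, and leaves out the current-token bonus, gating, and group norm. The function name is hypothetical.

```python
import numpy as np


def decaying_memory(r, k, v, decay):
    """r, k: (T, d_k); v: (T, d_v); decay: (d_k,). Returns outputs of shape (T, d_v)."""
    T, d_k = k.shape
    state = np.zeros((d_k, v.shape[1]))  # one row of storage per "address line"
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        # k[t] chooses which address lines store v[t]; old contents fade by `decay`.
        state = decay[:, None] * state + np.outer(k[t], v[t])
        # r[t] selects a blend of address lines to read back.
        out[t] = r[t] @ state
    return out


rng = np.random.default_rng(0)
T, d = 5, 3
r, k, v = (rng.normal(size=(T, d)) for _ in range(3))
print(decaying_memory(r, k, v, decay=np.full(d, 0.9)).shape)
```

With the decay set to all ones the loop reduces to causal linear attention, which is one way to see how this differs from the softmax "hashtable": nothing is ever renormalized across positions, the memory just accumulates and fades.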

#

hopefully we can get the rwkv5 paper to give that intuitive understanding - I think it's really useful for understanding why the model works so well

obsidian quest
obsidian quest
jade lotus
#

That seems like it could have crazy potential

misty igloo
# obsidian quest rwkv learns fast. try it

I've tried it many many times, but my attention based models learn much faster per token IF they're given the same advantages like tokenshift, smallinit embed, etc.
I know you've also tried this comparison and I've seen your graphs - I'll do another run using mine vs the latest rwkv5 code at some point soon and report back

obsidian quest
misty igloo
#

this is always rwkv5 (past versions with per head decay instead of per channel decay and headsize 64)

hushed flare
remote elbow
hushed flare
#

I didn't like the dependency on all these custom kernels for numerical stability so I built something more accessible 🙂 I did also swap out the Pythia attention modules with my RNN version at one point and freeze the rest of the module and just tune those. Can be used as a drop-in replacement but still doesn't do super well at long-form QA in few-shot learning.

misty igloo
hushed flare
misty igloo
#

lol i accidentally happen to be working on a non-cuda kernel version of the latest rwkv5 right this second

#

due to trying to upgrade my whole codebase to support MQA

hushed flare
misty igloo
#

not sure I understand... maybe you weren't referring to blink's custom cuda kernels used in rwkv?

#

i dont use fft or conv1d for anything at all in this model

hushed flare
#

V4 could be implemented using both FFT and conv1d, haven't looked super closely if V5 can be.

misty igloo
#

you could implement tokenshift with conv1d...
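
A sketch of why token shift maps onto a 1-D convolution, assuming the common form where each channel mixes the current token with the previous one using a per-channel coefficient `mu` (zeros before the first token). The function names are hypothetical, and NumPy's `np.convolve` stands in for a depthwise causal conv1d with kernel size 2.

```python
import numpy as np


def token_shift(x, mu):
    """x: (T, D); mu: (D,). Mix each token with its predecessor."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return mu * x + (1.0 - mu) * prev


def token_shift_conv(x, mu):
    """The same mix written as a per-channel causal convolution with two taps."""
    T, D = x.shape
    out = np.zeros((T, D))
    for c in range(D):
        kernel = np.array([mu[c], 1.0 - mu[c]])  # taps for x_t and x_{t-1}
        out[:, c] = np.convolve(x[:, c], kernel)[:T]  # keep only the causal part
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(7, 4))
mu = rng.uniform(size=4)
print(np.allclose(token_shift(x, mu), token_shift_conv(x, mu)))
```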

#

but id love to know how u implement the rest with FFT! (for V4)

hushed flare
misty igloo
#

oh like the same trick hyena uses
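
One way to read the FFT idea, offered as an assumption since the message being replied to is not in this log: with a fixed decay, an exponentially decayed running sum is a causal convolution with the kernel exp(-w * n), and a causal convolution can be evaluated with zero-padded FFTs. The sketch below covers a single channel with a scalar decay `w > 0` and ignores the current-token bonus and the numerical-stability rescaling used in practice.

```python
import numpy as np


def decayed_sum_recurrent(x, w):
    """y_t = sum over i <= t of exp(-(t - i) * w) * x_i, computed step by step."""
    y = np.zeros(len(x))
    acc = 0.0
    for t, x_t in enumerate(x):
        acc = np.exp(-w) * acc + x_t
        y[t] = acc
    return y


def decayed_sum_fft(x, w):
    """The same sum as a causal convolution evaluated with FFTs."""
    T = len(x)
    kernel = np.exp(-w * np.arange(T))
    n = 2 * T  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(kernel, n), n)
    return y[:T]


rng = np.random.default_rng(0)
k, v = rng.normal(size=16), rng.normal(size=16)
x = np.exp(k) * v  # stand-in for the exp(k_i) * v_i terms of a WKV-style numerator
print(np.allclose(decayed_sum_recurrent(x, 0.3), decayed_sum_fft(x, 0.3)))
```

The same convolution applied to exp(k_i) alone would give a matching denominator under these assumptions; unlike the recurrent form, the FFT route has no running maximum to keep large exponents in check.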

#

gotta think about that some more

misty igloo
#

other problem w/ FFT in terms of speed is you can't use torch.compile with it bc it involves complex numbers

gusty condor
#

I have several concerns:

  1. The formulas in time-mixing and channel-mixing modules are presented in a mixed manner, rather than listed sequentially. It's therefore difficult to understand exactly how time-mixing and channel-mixing modules work separately, especially if several formulas only differ slightly by an apostrophe. Anyway, it is not as clear as the ArXiv version.
  2. Due to token shift, the channel mixing module is also an RNN module. Could the channel mix module be added to figure 8 of Appendix D too?
  3. (Small) Add more details about the structure of RWKV internal states, including the total size, wkv numerator, wkv denominator and last token embedding.
obsidian quest
hushed flare
tropic minnow
young sparrow
#

@tropic minnow Do you know how much sequence length finetuning has been done? Both in terms of # of tokens and in terms of total length. Doing an apples-to-apples comparison will likely require some care.

tropic minnow
young sparrow
#

I think that the explosion in perplexity is connected with the particular PE / PE Extension used in those papers, and wouldn't be seen with other PEs

#

You can test this by running evals on BLOOM, which uses alibi

obsidian quest
#

if rwkv is trained using the correct method (chunkwise BPTT), it will naturally have infinite ctxlen

young sparrow
#

@obsidian quest But you didn't train the models we evaluated in the paper using that method right

obsidian quest
young sparrow
#

Also, they don't seem to have pretrained models at this scale?

obsidian quest
#

we can finetune existing models

young sparrow
#

We can't introduce a new technique after the paper has been accepted for publication. If we were going to use this we should have trained the models with it originally

tropic minnow
# obsidian quest we can finetune existing models

certainly we can do so for rwkv-v5 or in future papers. for now, just evaluating V4 as they were trained is the right thing to do imo. It shows that "you dont need to worry about ctx len extension methods that much if you use RWKV architecture" and that rwkv can handle very long context lengths by default.

last mauve
#

Thanks for these @gusty condor and @young sparrow! Time to buckle down for the camera-ready and arxiv-v2. My understanding is that our outstanding tasks are the following:

  1. (HIGH IMPORTANCE) Long-context experiments (see #1103039376184852622 message) - (In-Progress by @tropic minnow and @snow zealot)
  2. (Stretch-Goal) Compare to S4 (see #1103039376184852622 message). This would be a nice-to-have for the camera-ready, but we can push it to later work if necessary imo.
  3. Massage the chat appendix M section. I think that we should both reference the appendix where appropriate in the paper, and add a short paragraph at the start of the appendix justifying its existence.
  4. Clear up our time-mixing and channel-mixing modules as reported by @gusty condor in #1103039376184852622 message. I agree these have become less clear.
  5. (Stretch-Goal) Add the channel mix module to figure 8 of appendix D as reported by @gusty condor in #1103039376184852622 message. I agree this would be nice to have, but it's not necessary for camera-ready
  6. (Stretch-Goal) Add more details about the structure of RWKV internal states as reported by @gusty condor in #1103039376184852622 message. Not sure about the specific shortcomings here, so whoever picks this up will need to check with @gusty condor (or you can pick this up yourself @gusty condor)
#

(To clarify, all items I labeled (Stretch-Goal) are important and should at least go in the arxiv-v2, but were not explicitly pointed out by reviewers and are not absolutely necessary for the camera-ready)

#

Here are the rest of the work items that we haven't addressed yet for camera-ready:

7. Update the text to have a sentence defending the following from reviewer rSzx: Two specific tasks (ReCoRD and Winogrande, as shown in Figure 5) see the model underperforming other models. This underperformance requires further investigation.
8. Update figure 1 to fix reviewer 85wr's comment: Figure 1 needs to have actual references to datasets and calculations. Having unlabeled graphs is not okay in a published paper. Languages need to be provided as well (BLOOM is multilingual, are these English tasks?)
9. (Stretch Goal) Add tables in an appendix to address reviewer 85wr's suggestion: All the main information in the paper is shown in graphs in terms of scaling. While I understand why the authors want to show their model in this way, as a reader I want to see standard tables showing tokens / ppl (or bpc). Please include these tables in the paper so I can understand the data efficiency without trying to extrapolate from tables.
10. Update the fonts to address reviewer 85wr's comment: Generally the graph labels are much too small to read, please increase these to be similar to the text itself.
11. Add a sentence or two clarifying the inference experimental setup, addressing 85wr's comment: Can you provide more details on exactly the inference method / software hardware used for the text generation results? From the text it is unclear whether it is even cpu or gpu.
12. Table 1 is overlapping the middle margin. Needs fixed.
13. Several missing references in the contributions section

young sparrow
#

rSzx: Two specific tasks (ReCoRD and Winogrande, as shown in Figure 5) see the model underperforming other models. This underperformance requires further investigation.
I think it's a stretch to say we underperform on Winogrande. In particular, RWKV and Pythia (which are trained on the same dataset) seem to trade off which is ahead.

We do underperform slightly on ReCoRD, but I don't particularly see what there is to explain. We're a little worse at ReCoRD, a little better at OpenBookQA, HeadQA, ARC (challenge), and nearly identical on the others. That's what happens though... all of the models have some tasks they're better at and some they're worse at. I think it would be irresponsible to posit an "explanation" based on such little data and don't think one is necessary at all.

last mauve
young sparrow
#

RE: "Can you provide more details on exactly the inference method / software hardware used for the text generation results? From the text it is unclear whether it is even cpu or gpu."

I think they just missed it. We write:

Specifically, we evaluate text generation speed and memory requirements on typical compute platforms including CPU (x86) and GPU (NVIDIA A100 80 GB). For all our experiments we use float32 precision. We include all model parameters in the parameter count, including both embedding and non-embedding layers. Performance under different quantization setups is left to further work. See Appendix H for more results.
It would be good to mention that this is the transformers library specifically though

young sparrow
last mauve
young sparrow
#

I made two notable changes to the EMNLP overleaf:

  1. I moved the related work to the appendix, in anticipation of needing the space for our extended experiments. We can move it back if that doesn't turn out to be necessary, but we're already halfway down the ninth page.
  2. I added a second way of formatting the related work that doesn't lead to nearly as much wasted space (namely grouping by activity instead of listing each author individually)
young sparrow
last mauve
#

To clarify, when I say "arxiv-v2" I mean "our arxiv paper + the emnlp edits applied + any fixes along the way we couldn't make due to anonymity"

young sparrow
#

The arxiv version is here though I recommend we put the camera ready version on arXiv as well

last mauve
young sparrow
#

I would do the EMNLP version, submit that, and then just move the sections to the main body

last mauve
# young sparrow The difference is just bumping a couple things to the appendix for page limits r...

I'm of the opinion that the EMNLP and arxiv versions are separate retellings of the RWKV storyline for different audiences:

  • Arxiv: Broader audience, where we make things longer and more detailed, and advertisements like the chat appendix are OK
  • EMNLP: Academic audience, where we keep things brief and purely technical

It's not as simple as bumping entire sections. Many of the sections themselves were reworded or shortened for EMNLP. The two versions have drifted a bit and I'm proposing we keep them that way.

young sparrow
#

I see

#

I fixed all the missing refs

#

This is the last warning but I can't find an actual instance of this

#

Got it

#

@last mauve I have handled 8, 11, and 12. I don't think we need to do anything about 7. I've concluded that the big S4 model is unreleased and have reached out to the authors. I would expect this to not come to anything, but it's probably worth explaining that that is why we don't compare to it.

#

I think for 9, they're looking for how quickly the model improves over the course of training? So, something like training loss over time vs Pythia's would make sense? Is that your read too?

obsidian quest
young sparrow
obsidian quest
#

or you can simply scale rwkv4 🙂 i predict the gap will be filled just like what happens to lambada. probably need 100b params for that lol

#

my intuition is rwkv will spend more efforts on easier tasks when its capacity is limited by state size, and that's why it's doing better than gpt in some other benchmarks

void quartz
young sparrow
young sparrow
obsidian quest
#

@hushed flare @misty igloo Try my RWKV-6 first step: dynamic TokenShiftMix (likely works for RWKV-4 too) #1083107245971226685 message

gusty condor
#

Original RWKV-6 is postponed to RWKV-7? Or will they be implemented together?

fickle hare
#

My opinion on the long context experiments is to leave them for later (so just remove the LRA experiments and say something in future work). The relationship between trained length and practically usable length at inference is still unknown; while there are some reports from the community about seemingly extending to much longer contexts once trained to ~100k, we have no formal result on that.

#

InfCtx is just a cheap method for tuning to >100k context on consumer cards, which backs up the above-mentioned community reports.

young sparrow
misty igloo
gusty condor
# last mauve Thanks for these <@803473343705514025> and <@193204646687408129>! Time to buckle...
  1. I have done it
The total size of the RWKV internal state can be computed as $4DL$ in mathematical theory or $5DL$ in practice, where $D$ is the model dimension and $L$ denotes the number of layers. The internal state in each layer consists of five vectors of size $D$. The five vectors are respectively listed as follows.
\begin{itemize}
    \item The current input of the Time-mix block $x_t$;
    \item The current input of the Channel-mix block $y_t$;
    \item The numerator of the $WKV$ value $a_t$ in \eqref{eq:statea}, or $a'_t$ in practice \eqref{eq:stateaa} for numerical stability;
    \item The denominator of the $WKV$ value $b_t$ \eqref{eq:stateb}, or $b'_t$ in practice \eqref{eq:statebb};
    \item A helper state $p_t$ in \eqref{eq:statepp}, which is implemented solely for numerical stability.
\end{itemize}
young sparrow
gusty condor
#
The RWKV model has an internal state that stores some previous information. In each layer, the internal state consists of five parts, each of which is a vector with $D$ numbers, where $D$ is the model dimension. The five parts are:
\begin{itemize}
    \item The current input of the Time-mix block $x_t$;
    \item The current input of the Channel-mix block $y_t$;
    \item The numerator of the $WKV$ value $a'_t$, as defined in equation \eqref{eq:stateaa};
    \item The denominator of the $WKV$ value $b'_t$, as defined in equation \eqref{eq:statebb};
    \item A helper state $p_t$ in \eqref{eq:statepp}, which is used for $WKV$ computation to maintain numerical precision.
\end{itemize}
This yields a total state size of $5DL$ numbers. It is worth noting that in an algebraic context with infinite precision, the helper state $p_t$ can be ignored, and the $WKV$ numerator and denominator can be computed directly using equations \eqref{eq:statea} and \eqref{eq:stateb}, reducing the size of the internal state to $4DL$.
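As a quick sanity check on the count above, the state size can be computed directly. This is only a minimal sketch: the function name and the example dimensions (D = 768, L = 12) are illustrative assumptions and are not taken from the paper.

```python
def rwkv_state_size(d_model: int, n_layers: int, with_helper: bool = True) -> int:
    """Number of scalars in the recurrent state.

    Each layer stores x_t, y_t, the WKV numerator and the WKV denominator
    (4 vectors of size D), plus the helper p_t when it is kept for
    numerical stability (5 vectors in total).
    """
    vectors_per_layer = 5 if with_helper else 4
    return vectors_per_layer * d_model * n_layers


# Hypothetical small configuration, chosen only for illustration.
print(rwkv_state_size(768, 12))                     # the 5DL case
print(rwkv_state_size(768, 12, with_helper=False))  # the 4DL case
```

Note that the result does not depend on the sequence length, which is the point of the paragraph.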
last mauve
last mauve
last mauve
last mauve
#

@everyone -- Does anyone know who "Jiaju Lin" is? They're listed as an EMNLP author but their contributions section is empty, they're not on the arxiv version, and I can't track down anything they've done -- Resolved!

snow zealot
#

the data I collected is the cross_entropy at each token for a sequence of 128k tokens

last mauve
young sparrow
last mauve
#

Another thing, figure fonts should be increased.

**All** -- If you contributed a figure (Figures 2, 3, 8, 9, and 11 are fine and don't need updating), please bump up the fonts a bit and reupload the updated figure to the EMNLP overleaf.

gusty condor
#

Yesterday someone proposed that CoLM https://colmweb.org/ is a good conference for RWKV. The deadline is March 2024, so we could prepare for RWKV-5 or even RWKV-6.

obsidian quest
spiral minnow
void quartz
spiral minnow
#

Wow, that's really interesting. RetNet seems to do well on "easy" tasks (not sure how the authors define easy vs hard), but does significantly worse on hard tasks

void quartz
#

didn't expect this one

tough crane
#

LLama2 is the weakest LM?? 🥹

remote elbow
#

Strongest, it's the same color as the weakest for some reason

tough crane
fickle hare
#

I'm curious how they used RWKV, only with the WKV recurrent unit or including all the tricks

last mauve
#

I'll be submitting a version tonight for camera ready

#

If ppl can update figure fonts if they haven't already, that'd be great

#

@snow zealot and @tropic minnow did those long context results get resolved or are they unable to make it for camera ready?

tropic minnow
#

this, coupled to RWKV not having pos_emb, [[which means that length dependence is entirely driven by training. thus training on longer sequences might make it "grok" on longer term memory and address this effectively for virtually any ctx (but this is more speculation); ]] imo makes the argument that RWKV handles longer ctxs better

tropic minnow
#

this would be the summary. wonder if it's best displayed as table or as plot

young sparrow
#

@tropic minnow Okay that's a positive signal, but there's a lot uncontrolled for. In particular, I would expect LLaMA 2 and the derived models to be much better than RWKV in general. If we can confirm this, that would be good evidence that we aren't just leveraging a more powerful model

#

Is the 16384 score for RWKV correct, or is there a missing decimal point

tropic minnow
young sparrow
#

What happened there

tropic minnow
young sparrow
tropic minnow
last mauve
tough crane
young sparrow
tropic minnow
void quartz
gusty condor
#

Is it unfair?

  1. This model was trained after the EMNLP submission deadline.
  2. This model is not a Pile model; its vocab size is V=65536 rather than 50277. If this model is listed, then the previous descriptions should be modified too.
  3. We shouldn't compare this model with other models pretrained at 2k or 4k context length, which would be extremely unfair.
tropic minnow
void quartz
#

I’m slightly worried that it ends up being quoted as proof of rwkv being unable to scale past 16k tbh 😅

#

But agree that the newer models are out of scope for the reasons listed above

#

I at least can confirm ur observation is consistent with what we know of the older models 🙂

tropic minnow
void quartz
#

Yea. Framing that this model was trained only up to 8k is fair

young sparrow
#

Mathematically it's actually not possible to maintain accuracy for arbitrary sequence lengths beyond the train set on sufficiently complicated test sets

#

What's relevant is a) the memory usage as you lengthen the sequence and b) how quickly performance falls apart

#

@tropic minnow Can you also quickly make a plot showing memory usage as sequence length increases for both Llongma and RWKV
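For a rough idea of what such a plot would show, here is a back-of-the-envelope sketch. It assumes a plain multi-head-attention transformer that caches one key and one value vector of size D per layer per token, and it uses the 5DL state size quoted earlier in this channel for RWKV. The function names and the dimensions (D = 4096, 32 layers) are illustrative assumptions, not measurements of LLongMA or of any RWKV checkpoint.

```python
def kv_cache_scalars(d_model: int, n_layers: int, seq_len: int) -> int:
    # keys + values: one D-vector each, per layer and per cached token
    return 2 * d_model * n_layers * seq_len


def rwkv_state_scalars(d_model: int, n_layers: int) -> int:
    # five D-vectors per layer, independent of sequence length
    return 5 * d_model * n_layers


d_model, n_layers = 4096, 32  # illustrative sizes only
for seq_len in (2048, 8192, 16384):
    print(seq_len,
          kv_cache_scalars(d_model, n_layers, seq_len),
          rwkv_state_scalars(d_model, n_layers))
```

Under these assumptions the cache grows linearly with sequence length while the recurrent state stays flat; activation memory and weights are ignored here.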

tropic minnow
last mauve
#

New camera-ready deadline Oct 22 AoE

gusty condor
tropic minnow
fickle hare
#

The result is essentially presenting the extrapolation IMO. Extrapolating from 8k to 14k without any changes is already impressive.

#

IMO it should really be compared fairly against the original llama instead of those long variants; I think the table can be split into two parts, one with llama & rwkv, the next with those long variants

#

and we can claim RWKV to be naturally extrapolating (nearly same quality to 10k, not “exploding” up to 14k)

tropic minnow
young sparrow
#

That's outlandish

#

How did it possibly take that long

tropic minnow
young sparrow
#

@proper raven is there something seriously wrong with the efficiency of this code?

snow zealot
#

This is for 10 sequences of 128k tokens each

#

You could try to batch this but it is a trade off between memory and speed

fickle hare
#

it compares sliding window results to full context ones? that does cost a lot then...

last mauve
#

Just submitted the camera-ready

#

We can submit v2 of the arxiv this week once the long-context results are in

#

Then we can begin brainstorming the followup paper

gusty condor
#

Stretch-goal 5: add Channel mix block as a figure too.

#

Which application did you use to produce those figures?

tropic minnow
gusty condor
#

Yes (I mean figure 8)

gusty condor
slow palm
#

Quick question: as the training of the smaller RWKV v5 models is getting close to the end, will the datasets used to train them be available somewhere?

last mauve
#

@snow zealot @young sparrow @tropic minnow -- what are we doing for long context? I don't see any actionable conclusions from your previous discussion.

young sparrow
#

I was under the impression we were going with what we had

#

It's not my first choice but it's pretty good and running more apples-to-apples models appears to be prohibitively expensive. It would be nice to augment with one of the long context evals I linked to earlier but I don't have bandwidth to do that and nobody seemed interested.

last mauve
#

Gotcha. Ok let's get the table into the arxiv overleaf then @tropic minnow

gusty condor
slow palm
gusty condor
#

Some of the data came from people who PMed it to Bo Peng, and it has not been released

tropic minnow
young sparrow
#

We restructured / reorganized Sections 4 through 6 between the arXiv preprint and the EMNLP version. I think that the structuring in the EMNLP version is better (though I'm open to disagreement!). We should make a decision about if we are going to back-port that to the arxiv version or not

proper raven
# young sparrow <@153017054545444864> is the something seriously wrong with the efficiency of th...

we used FA2 for inference which made it better, but yeah sliding window is extremely inefficient since you're recalculating perplexity for the context size (so like 8k, 10k token inferences) every 256 tokens. you end up calculating the entire document several dozens of times, but it corrects for the first tokens having outsized weight on the ppl since essentially all tokens (mod 256) get to be "first" at some point
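To put a number on "several dozens of times", here is a small sketch that counts forward passes for a sliding-window evaluation. It assumes 128k means 131072 tokens, an 8k window of 8192 tokens, and a stride of 256; the function name is made up for illustration and this is not the evaluation script itself.

```python
import math


def sliding_window_cost(doc_len: int, window: int, stride: int):
    """Return (number of forward passes, total tokens pushed through the model)."""
    if doc_len <= window:
        return 1, doc_len
    # one full window at the start, then one more every `stride` tokens
    n_windows = math.ceil((doc_len - window) / stride) + 1
    return n_windows, n_windows * window


doc_len, window, stride = 128 * 1024, 8 * 1024, 256
n_windows, tokens = sliding_window_cost(doc_len, window, stride)
print(n_windows, tokens, tokens / doc_len)
```

With these numbers each document is pushed through the model roughly 30 times over, before multiplying by the 10 sequences mentioned above.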

tough crane
gusty condor
#

Anyway, I think this is clearer
Legend:

  • Circles: operators
  • Arrows and rounded rectangles: vectors (dimension D unless bolded or explicitly stated)
  • Squares and rectangles: matrices (with respect to their shapes)
  • Purple: trainable parameters
  • Red: internal states
    (Note that this is solely a mathematical implementation)
#

Any suggestions on it?

tough crane
tropic minnow
# gusty condor

what does LN1, LN2 mean in the layernorms? weights and bias of the affine transform?

tropic minnow
# gusty condor Any suggestions on it?

i think it is quite correct, but found it quite hard to read at first glance😅 maybe using different line styles for the vertical (GPT) and horizontal (RNN) modes? maybe grouping the items under different sections (token-shift, etc) could help as well

steady ether
#

Should we add a sentence or two referencing RWKV-1-3? The paper started with 4, and moving to 5 might confuse some readers.

#

Or actually, just pointing the GitHub link to the v5 folder should be fine.

young sparrow
#

@steady ether I thought we removed all reference to "4" from the paper, but we can footnote it if not

gusty condor
gusty condor
gusty condor
#

Is this diagram better?

tropic minnow
gusty condor
#

Yes, I can

gusty condor
gusty condor
#

This is the original version of RWKV5, slightly better than RWKV4

misty cedar
gusty condor
gusty condor
steady ether
#

Due Nov. 12. Everything is optional, but it probably helps.

last mauve
steady ether
#

Happy to help with the slides and/or video. Can start on the slides this weekend.

gusty condor
tropic minnow
stray locust
misty cedar
steady ether
#

Here is a quick draft of the slides. Anyone with the link can edit them. Please feel free to make updates.

https://docs.google.com/presentation/d/1ABvKYRQos8Sihn5m3zZXCHcg0h7j6tMX/edit?usp=sharing&ouid=114859025232119518796&rtpof=true&sd=true

stray locust
misty cedar
stray locust
gusty condor
gusty condor
#

A0 is so large

misty cedar
stray locust
#

Howdy again. I've submitted a draft here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/pull/222 with @last mauve and I was hoping someone could help me implement a minimal training loop here: https://github.com/jahatef/annotated_deep_learning_paper_implementations/blob/master/labml_nn/RWKV/experiment.py#L136. The code there is nonfunctional. We've been looking at https://github.com/Hannibal046/nanoRWKV/blob/main/train.py, but this training script is fairly complex, and it would take us a long time to boil it down

misty cedar
gusty condor
obsidian quest
#

https://arxiv.org/abs/2311.01981 nice trick to boost rwkv4 performance

stray locust
#

To be more clear, can someone either:

  • help me implement the training loop here to complete the labml submission. or
  • commit to completing this loop, and I can add you to my fork so that you can work with us on this.
    Appreciate the help!
mossy cipher
stray locust
#

Great! Would you like to be added to the gh fork?

mossy cipher
#

Sure, that will be great

tough crane
steady ether
#

Would be cool if @obsidian quest can go wow everyone and answer people's questions 😍

paper dove
steady ether
#

It looks like they might email us

gusty condor
#

Should this poster be vertical or horizontal?

young sparrow
#

Horizontal

tropic minnow
#

so @gusty condor has done amazing work with the first draft of the poster and we'd like to ask for feedback / suggestions (mine are annotated in purple and i'll be adding them in the next hours)

steady ether
# tropic minnow so <@803473343705514025> has made an amazing work with the first draft of the po...

Wow, that looks amazing. Just a few nitpicks in chronological order:

  1. Shouldn't it be 'Attention-Free Transformer (AFT)' instead of 'AFT (Attention-Free Transformer)'?

  2. Not sure if 'tricks' is the best word here to describe our improvements over AFT: "Although RWKV is inspired by AFT, this is not the final form of the RWKV model, which includes many additional tricks explained below."

  3. Words in titles can be capitalized. E.g., 'RWKV Architecture: Summary.'

  4. In the diagrams, we used 'Time Mixing' and 'Channel-Mixing,' but here we use the hyphenated 'Time-Mix' and 'Channel-Mix.'

  5. We called it 'output gating' in the paper but 'self-gating' here.

  6. Maybe we can bold the Left/Right/Middle text in diagrams to make it more readable?

stark pilot
#

Hey, can someone share the code that was used to evaluate RWKV and the other models from the arXiv paper?

Also was the base model tested or the falcon variant, cause we're unable to reproduce the results, we are getting 35% on ARC-Easy instead of the 48% claimed for the smallest model.

steady ether
# stark pilot Hey, can someone share the code that was used to evalue RWKV and the other model...

Someone please correct me if I'm wrong but I think we used

Code: https://github.com/EleutherAI/lm-evaluation-harness

Pile models: https://huggingface.co/RWKV

I just ran it and got:

hf-causal (pretrained=RWKV/rwkv-4-169m-pile), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
|  Task  |Version| Metric |Value |   |Stderr|
|--------|------:|--------|-----:|---|-----:|
|arc_easy|      0|acc     |0.4752|±  |0.0102|
|        |       |acc_norm|0.4150|±  |0.0101|
young sparrow
gusty condor
#

Which documents is the loss tested on?

young sparrow
#

@obsidian quest you ran the loss calculations and sent me the numbers to use right? Are these train, validation, or test loss numbers?

obsidian quest
#

training loss

stark pilot
steady ether
stark pilot
#

I'm running it directly using the lm_evaluation.py file but I think I figured it out, thanks a lot!

gusty condor
tropic minnow
#

did anyone fill this? otherwise i'm going to do so

#

Anyone planning to be in Singapore for EMNLP? @void quartz @paper dove ?

gusty condor
tropic minnow
gusty condor
#

We can work on new articles such as arxiv-v2 or even RWKV-5

misty igloo
#

I'd love to work on a rwkv5 article

stray locust
tropic minnow
stray locust
misty cedar
#

Naive unfused wkv5 module

gusty condor
young sparrow
last mauve
#

Ok now that the poster is in, we have the next two broad targets:

  1. arxiv v2 that's in-sync with the EMNLP submission
    • Update the author list (Can anyone pick this up?)
    • Merge in changes from EMNLP draft. Varies by section and I've been putting it off, but hope to finish it by Monday
    • Push to arxiv
  2. Start setting up for RWKV-5
    • Create an overleaf (looks like @gusty condor already did this for everyone, but this link is not shareable. Also, @gusty condor -- are you using an overleaf premium account? If not, I can put this under my account so that we get more compile time)
    • Come up with a list of new contributions that RWKV-v5 introduces, and what results we want to include given those contributions. @obsidian quest and others, do you have a list of v4 --> v5 differences you can point me to? If it doesn't exist, let's add one to the new overleaf so that we can start planning design sections
    • Once the above two tasks are done, I'll start creating task lists like I did for v1 and we can start working on the writeup together.
misty igloo
# last mauve Ok now that the poster is in, we have the next two broad targets: 1. arxiv v2 th...

Here is a list of the main changes I'm aware of from rwkv4->rwkv5.2:

  • now multi-headed, with per-head [decaying] state
  • r@(wk)@v instead of rwkv, so the [decaying] state is now a K channel memory bank of values <- this is similar to retnet
  • per-channel learned decay and boost (w and u) <- retnet does not have this, but rwkv4 did
  • per-head grouped normalization <- various other models have this, including transnormer and I think retnet
  • added a silu gate in WKV <- other models use gating as well
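A naive, unfused sketch of the per-head recurrence described in the second and third bullets, as I read them: the state is a K x V matrix, `w` is the per-channel decay and `u` the per-channel boost applied to the current token. Group normalization and the SiLU gate are left out, and the function and argument names are illustrative, not the names used in the reference code.

```python
def wkv5_head(r, k, v, w, u):
    """Naive recurrence for a single head.

    r, k: lists of T vectors of size K; v: list of T vectors of size V.
    w, u: per-channel decay and boost, each of size K.
    Returns the T outputs (size V each) and the final K x V state.
    """
    K, V = len(w), len(v[0])
    state = [[0.0] * V for _ in range(K)]
    outputs = []
    for r_t, k_t, v_t in zip(r, k, v):
        out = [0.0] * V
        for i in range(K):
            for j in range(V):
                kv = k_t[i] * v_t[j]  # entry (i, j) of the outer product k^T v
                # read: boosted current token plus the decayed history
                out[j] += r_t[i] * (u[i] * kv + state[i][j])
                # write: decay the old state, then add the new outer product
                state[i][j] = w[i] * state[i][j] + kv
        outputs.append(out)
    return outputs, state


# Tiny one-channel example, two identical time steps.
outs, state = wkv5_head(
    r=[[2.0], [2.0]], k=[[3.0], [3.0]], v=[[5.0], [5.0]], w=[0.5], u=[1.0]
)
print(outs, state)
```

A full layer would run one such recurrence per head and then apply the per-head normalization and gate from the last two bullets.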
#

I'd be happy to integrate those whenever we have an accessible overleaf
The biggest question is what kind of claims to make about them, since they're excellent together but individually they are largely pieces that exist in other models that all fit very nicely into the rwkv puzzle and improve its performance dramatically

#

hard to say any of them were 'invented here' - the specific usage in concert with the underlying recurrent rwkv4 mechanisms is what's new

#

the only one I would think might have been 'invented here' independently is #2, the r@(wk)@v part which is like a recurrent decaying version of linear attention

#

and is probably at most concurrent work with retnet

last mauve
# misty igloo I'd be happy to integrate those whenever we have an accessible overleaf The bigg...

As long as we motivate why these elements are suited to RWKV, I think that's OK. Bringing existing pieces together in a unique way with solid motivation is still a new contribution and requires enough insight to justify a paper submission.

We'll only face paper review scrutiny if we make it look like we're randomly throwing things at RWKV. Since that's not what we're doing, we just need to make sure our writing reflects that.

gusty condor
gusty condor
young sparrow
#

@gusty condor I'm confused, this looks like it's the previous paper?

gusty condor
#

It's different

young sparrow
#

Ah it default compiled v4 for me

#

Possibly hot take: a history of RWKV would make a great blog post but doesn't make sense being crammed into the "background" section of a paper

gusty condor
#
  1. I think some of the improvements in RWKV-1 in August 2021 are still pioneering even compared to the current transformer architecture.
  2. There have been some debates questioning the originality of RWKV. We can post the entire history of RWKV to resolve the debate.
  3. If architecture evolves so fast, at some time in the future we have to review the history again.
obsidian quest
void quartz
obsidian quest
last mauve
# gusty condor 1. I think some of the improvements in RWKV-1 in August 2021 are still pioneerin...

I also don't think that framing these as a historic subsection would be appropriate. We can point to prior internal RWKV works in the "related work" section if we want to establish ourselves.

We can explain the previous RWKV in the background, but we should frame those as "getting the reader up to speed on what the architecture is" and not "a trip down RWKV memory lane". This reframing is as simple as taking "history" out of the name and replacing with "the RWKV architecture" or something, and the content should be purely on the architecture. No personal or organizational stories should be included

gusty condor
#

OK, I agree with that

misty igloo
gusty condor
# void quartz haha added to things i need to write on list 😄 for the RWKV blog (will crawl th...

Also Zhihu history, the original idea of RWKV is posted here https://zhuanlan.zhihu.com/p/397985790

misty igloo
#

Just to clarify in case there was any misunderstanding, I am not questioning that rwkv is original work 🙂 (Also, I think it's amazingly great!) My question about what can be considered to be new inventions for the purposes of a new paper was intended to be specifically regarding what makes version 5 different from 4. My apologies if that came off badly!

gusty condor
misty igloo
#

Do you know if the matrix valued multi-headed module was developed/discovered concurrently with or following retnet? My impression from seeing the rwkv discord at around that time was it was immediately following, but I'm not at all certain about the timeline. It's of course part of the whole rwkv5 model improvements either way - I'm just asking if it could be additionally claimed as an independent invention on its own.

pale nexus
gusty condor
#

Actually, RWKV and RetNet followed each other. RWKV-5 followed RetNet, and RetNet followed RWKV-4.

spiral minnow
spiral minnow
gusty condor
#

That was BlinkDL's experiments

misty igloo
#

matrix valued states from retnet
groupnorm from transnormer and maybe retnet
gating from various others

gusty condor
#

Yes! Token shift is not, however, so it might need an ablation study

misty igloo
#

I figured as long as I was putting in the formulas for 5 we might as well have 6 ready to go... also, who knows when this gets published so maybe by then we'll want to show 6 as well

misty igloo
#

@gusty condor notation style in your edits is a bit different from the rwkv4 paper, probably more precise but not sure if we want it to be standardized between the papers?
one other question, do you think it might be easier to read if we keep everything specified per-head throughout the main equations since that way there would be fewer subscripts?

gusty condor
# misty igloo <@803473343705514025> notation style in your edits is a bit different from the r...

Style:

  1. All matrices are bolded, vectors are not.
  2. \cdot (or written together) is matrix multiplication, \odot is element-wise multiplication. Two operands of \odot must have the same shape.
  3. All vectors are row vectors, unless explicitly stated, so matrices must operate at the vector's right side.
    These conventions make it easier to track the shapes of matrices and vectors, which helps sanity checking.
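A small worked example of the three conventions, written only to illustrate them (the symbols $x$, $y$, $z$ and $\mathbf{W}$ here are generic placeholders, not specific equations from the draft):

```latex
% x, y, z: row vectors (not bold); W: matrix (bold), applied on the right
\begin{align*}
  y &= x \mathbf{W},
    & x \in \mathbb{R}^{1 \times D},\
      \mathbf{W} \in \mathbb{R}^{D \times D},\
      y \in \mathbb{R}^{1 \times D}, \\
  z &= x \odot y,
    & z \in \mathbb{R}^{1 \times D}
      \quad \text{(both operands of $\odot$ have the same shape).}
\end{align*}
```

Reading shapes left to right, $(1 \times D)(D \times D) = (1 \times D)$, which is the kind of sanity check the conventions are meant to make easy.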
obsidian quest
#

retnet = linear transformer + exponential decay (i was doing it first) + xpos. nothing new 😉

obsidian quest
hushed flare
obsidian quest
#

data-dependent shift & data-dependent decay

hushed flare
#

That's going to be an interesting flow chart to draw for the architecture.

misty igloo
#

Yeah the formulas are a bit intense bc of lots of lora weightings. I guess I gotta make functions for all that

misty igloo
nova marsh
#

Guys if you need some help I can give my contribution

gusty condor
young sparrow
#

We should use single column regardless

tough crane
misty igloo
#

just describing the full architecture and clarifying what changed in v5 and v6 (and why) seems fine so far

misty igloo
#

@obsidian quest what factors do you use for the LoRA reduction in v6 right now? I know it might change, I just need something to put in the paper as a placeholder

obsidian quest
#

fixed size 5*32 for time_mix (32 for each of w/k/v/r/g), 64 for time_decay

gusty condor
#

So, not D/4, since D*(D/4) is a large number of parameters
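To make the size difference concrete, here is a short arithmetic sketch. It assumes each LoRA consists of a D x rank down-projection and a rank x D up-projection with no biases, and uses D = 4096 purely as an illustrative width; the function name is made up.

```python
def lora_params(d_model: int, rank: int) -> int:
    # down-projection (D x rank) plus up-projection (rank x D)
    return 2 * d_model * rank


d_model = 4096  # illustrative width only
print(lora_params(d_model, 32))            # fixed rank 32
print(lora_params(d_model, d_model // 4))  # rank D/4
```

At this width the D/4 variant is 32 times larger per LoRA, and the gap keeps widening as D grows because the fixed-rank version scales linearly in D while the D/4 version scales quadratically.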

misty igloo
#

okay I updated that in overleaf

obsidian quest
# obsidian quest

pls show a similar table so everyone can see v4 v5 v6 are natural evolutions

misty igloo
gusty condor
misty igloo
# gusty condor

cool! also, thanks for noticing and fixing my mistake w/ lambda vs W on DDlerp

#

it seems like the formula for rwkv6 w got changed and lora_\omega became missing but I somehow don't see the changelog on it. I tried to put it back to what I think it should be. not sure if we should change the d naming to something else since omega is now a little odd
see https://github.com/BlinkDL/ChatRWKV/blob/0f9fd50b7a8b4d317a87e4f1ad7e713a275df11e/rwkv_pip_package/src/rwkv/model.py#L846C5-L846C5 for reference

GitHub

ChatRWKV is like ChatGPT but powered by RWKV (100% RNN) language model, and open source. - BlinkDL/ChatRWKV

misty igloo
# gusty condor

per my above comment, this graph appears to be missing the second lora (the 64 sized one) on the results of wx

#

the initial lora that's shown is still size 32 in blink's code, it's just missing the second lora on the result of that

gusty condor
#

Not missing, I have taken into account

gusty condor
gusty condor
#

Let's see if this looks better

obsidian quest
#

the timemix lerp part is wrong 🙂

#

should be x & x_prev == [ lerp ] ==> xxx == [ lora ] ==> w/k/v/r/g lerp factors => xw/xk/xv/xr/xg
and then xw == [ lora ] ==> w
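One way to read the pipeline above, as a plain-Python sketch. Everything here is an assumption made for illustration: the helper names, the tanh inside the LoRA, and treating each lerp factor as "base value plus LoRA offset". It only mirrors the order of operations in the message and is not the reference implementation.

```python
import math


def matvec(x, m):
    """Row vector x (length n) times matrix m (n x p)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def lora(x, down, up):
    """Low-rank map tanh(x @ down) @ up (the tanh is an assumption)."""
    return matvec([math.tanh(h) for h in matvec(x, down)], up)


def lerp(x, x_prev, mu):
    """Per-channel interpolation: mu = 0 keeps x, mu = 1 takes x_prev."""
    return [a + (b - a) * m for a, b, m in zip(x, x_prev, mu)]


def ddlerp_time_mix(x, x_prev, mu_x, mu, loras, decay_base, decay_lora):
    xxx = lerp(x, x_prev, mu_x)                      # x & x_prev == [lerp] ==> xxx
    mixed = {}
    for name in "wkvrg":
        offset = lora(xxx, *loras[name])             # xxx == [lora] ==> lerp factors
        factor = [m + o for m, o in zip(mu[name], offset)]
        mixed[name] = lerp(x, x_prev, factor)        # => xw / xk / xv / xr / xg
    # xw == [lora] ==> w, added to a learned per-channel base
    w = [b + d for b, d in zip(decay_base, lora(mixed["w"], *decay_lora))]
    return mixed, w


# Tiny example with all-zero LoRAs, so the mix reduces to a plain lerp.
D, rank = 2, 1
zero = ([[0.0] for _ in range(D)], [[0.0] * D for _ in range(rank)])
mixed, w = ddlerp_time_mix(
    x=[1.0, 2.0], x_prev=[3.0, 6.0], mu_x=[0.5, 0.5],
    mu={n: [0.25, 0.25] for n in "wkvrg"},
    loras={n: zero for n in "wkvrg"},
    decay_base=[-1.0, -2.0], decay_lora=zero,
)
print(mixed["k"], w)
```

In the sizes quoted earlier in the channel, the five mixing LoRAs would have rank 32 each and the decay LoRA rank 64.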

misty igloo
gusty condor
#

extremely complicated 🙂

gusty condor
misty igloo
gusty condor
#

How to add these data? As tables or plots?

#

Also, I'm not entirely sure about the model parameter count. I counted 13D^2L + 598DL + 4D + 2DV, but the actual number might be different
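For anyone who wants to check that count against a real checkpoint, here is the quoted expression as a function. The formula is copied as written above and is explicitly unverified; the function name and the example sizes (D = 2048, L = 24) are illustrative assumptions, with V = 65536 taken from the World vocab size mentioned earlier in the channel.

```python
def rwkv6_param_estimate(d_model: int, n_layers: int, vocab: int) -> int:
    """Evaluate 13*D^2*L + 598*D*L + 4*D + 2*D*V (unverified estimate)."""
    D, L, V = d_model, n_layers, vocab
    return 13 * D * D * L + 598 * D * L + 4 * D + 2 * D * V


# Illustrative configuration only.
print(rwkv6_param_estimate(2048, 24, 65536))
```

For this configuration the expression comes out at about 1.6 billion, so comparing it with the parameter count reported by the training code would be a quick way to confirm or correct the coefficients.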

misty igloo
subtle oak
#

Maybe we can also plot some figures like those in the first RWKV paper and put them into the appendix? it makes the scaling more clear maybe...

gusty condor
#

Added a subsection to introduce the tokenizer

obsidian quest
#

not using lora in channelmix

misty igloo
# obsidian quest not using lora in channelmix

oh you changed it back so it's same as v4 and v5 now?
I see that here https://github.com/BlinkDL/ChatRWKV/blob/0f9fd50b7a8b4d317a87e4f1ad7e713a275df11e/rwkv_pip_package/src/rwkv/model.py#L579
so the only difference really is that k_maa is the amount of x_t-1 to use, while in v4-5 k_mix is the amount of x_t to use, correct? which is really just an implementation detail
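A two-line check of that claim, assuming the v6-style mix is written as x + (x_prev - x) * maa and the v4/v5-style mix as x * mix + x_prev * (1 - mix). The function names are made up for illustration; under these assumptions the two agree whenever mix = 1 - maa.

```python
def mix_v6(x: float, x_prev: float, maa: float) -> float:
    # maa is the share taken from the previous token
    return x + (x_prev - x) * maa


def mix_v4(x: float, x_prev: float, mix: float) -> float:
    # mix is the share taken from the current token
    return x * mix + x_prev * (1 - mix)


x, x_prev, maa = 2.0, 4.0, 0.25
print(mix_v6(x, x_prev, maa), mix_v4(x, x_prev, 1 - maa))
```

So the two parameterizations describe the same interpolation, just with the learned weight attached to the other endpoint.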

obsidian quest
misty igloo
misty igloo
# gusty condor

sorry, I was mistaken and blink apparently isn't using ddlerp in chanmix any more so your chart can revert to v4-5 chanmix
I updated overleaf accordingly

#

(was originally going off old comments in rwkv discord, and hadnt seen the actual new code for it)

gusty condor
obsidian quest
#

seems redundant after training for a while

gusty condor
#

RWKV-4-World, RWKV-5-World and RWKV-6-World

misty igloo
misty igloo
#

on a different note, Stella was saying we should move to single column layout, but I'm too new to latex to understand how to bridge the incompatibility between \onecolumn and \maketitle - maybe someone else here knows?

misty igloo
#

@obsidian quest one more question, I realized I made an assumption that w_maa, k_maa etc. in rwkv6 are learnable... are they parameters or fixed values? We have them listed as learned parameters currently

obsidian quest
#

learnable

gusty condor
#

The problem is due to our template (which is EMNLP2023). Feel free to switch to a different template.

#

Neurips_2023, but with line numbers? Trying to remove that.

obsidian quest
#

Matrix-valued states

gusty condor
gusty condor
#

@misty igloo I found this article interesting: https://arxiv.org/abs/2207.02098
Can we try some on RWKV-5 and 6?
My expectation is that RWKV will outperform both Transformer and LSTM on these tasks, but if you want titles like this:

% RWKV-5 and 6: Towards Neural Turing Machines as LLMs
% RWKV-5 and 6: Enhanced Neural Turing Machines as Recurrent Attention
% RWKV-5 and 6: Modified Neural Turing Machines are All You Need

Then the evaluations on the Chomsky Hierarchy are crucial (they show how powerful a neural Turing machine is).

misty igloo
#

also, not married to any particular title... was just suggesting ideas on what might make it more interesting (and still hopefully be accurate and descriptive)
but you're right that we need to validate it experimentally

misty igloo
#

my hunch is that the current state mechanism acts as a fixed-size random access memory for the purposes of the chomsky hierarchy

#

especially in v6 where we now have a real data-driven forget mechanism

#

from a theory perspective, what mechanism(s) is v6 missing that an NTM contains? for writing they use erase and add, where in rwkv6 we have decay and bonus
but I suppose while we do have content-based addressing we're missing the location-based addressing mechanism
@obsidian quest rotational location-based addressing might be interesting for v7

gusty condor
# misty igloo from a theory perspective, what mechanism(s) is v6 missing that an NTM contains?...

Adaptive computation time (https://arxiv.org/abs/1603.08983) and reuse of parameters (https://arxiv.org/abs/1807.03819) (a Turing machine is the same function iterated over and over again on a tape)

#

We can make a Universal RWKV or something, but that's another article.

obsidian quest
#

can try adding [pause] token first

misty igloo
misty igloo
# gusty condor Adaptive computation time (https://arxiv.org/abs/1603.08983) and reusage of para...

Thanks, that's helpful for my understanding of the remaining differences.
imho the problem with reusage of parameters is that a single function/layer isn't a lot of 'algorithm' for the machine to run... it's like having a very short program that can run on a long tape. We've all written programs and the code often needs to be longish even if you have lots of RAM available
[pause] is one way of keeping the program code longer while allowing multiple iterations but I'm sure there exist other alternatives

#

and traditional software of course allows loops for specific subregions of the code, not just the whole program

#

maybe each layer needs the equivalent of repetition until 'halt'

#

this is of course going way off track from discussion of the rwkv5/6 paper 😉

#

sorry hehe

gusty condor
misty igloo
#

my assumption was that it was going to be in the parameters, since the tape usually doesn't start out with anything extra on it that doesn't come from the input text [embeddings]

uneven blade
tough crane
uneven blade
#

@misty igloo Could you explain location based addressing in short and how does it help? Thanks!

gusty condor
#

If anyone wants to mention Turing machines, I think it's necessary to benchmark on the Chomsky Hierarchy since it quantitatively tests how powerful an automaton is.

young sparrow
tough crane
#

I am just saying as an analogy. hehe

young sparrow
#

I updated the paper to use the authblk library as I find that for papers with many authors it's the easiest and cleanest way to manage an author block

#

@void quartz is "the RWKV Foundation" an entity? My understanding is that the actual org is called the Generative AI Commons

gusty condor
#

No the actual entity is 深圳元始智能有限公司 (Shenzhen Yuanshi Intelligent Co., Ltd.)

young sparrow
#

I don't understand. I'm talking about the non-profit research foundation that RWKV joined.

misty igloo
obsidian quest
#

my understanding is, RWKV Foundation is now a virtual entity under LFAI

#

@void quartz let's find the best method to say this

void quartz
#

Been using “RWKV project under the Linux Foundation” in compute grant application. And I cleared that phrase with the LF team

young sparrow
#

I actually have a call with Matt White and Lucy Hyde tomorrow and can ask them

misty igloo
#

Didn't mean to open pandora's box with the NTM mentions. But I still think we need a better title, since a) the models do more than add larger internal states and b) retnet already uses similar matrix valued decay state.
The other ideas I wrote in as comments were:
RWKV-5 and 6: Enhanced Recurrent State Mechanisms for LLMs
Matrix-valued and LSTM-like States for LLMs
RWKV-5 and 6: 2D LSTM State for LLMs
I'm not necessarily recommending these in this form - they are just spitball ideas to get things rolling.

young sparrow
#

I know that there's a hierarchy of:
LF -> LF AI & Data -> GenAI Commons -> RWKV
I'm just not sure what level of that hierarchy makes sense to use to refer to an entity (this was prompted by seeing "RWKV Foundation" as an affiliation on the paper)

void quartz
fickle hare
#

Another question: WKV6 is very similar to GateLoop, though it started training way earlier than the GateLoop preprint. How to treat that work?

gusty condor
#

Never heard of GateLoop

fickle hare
#

Its title accurately describes RWKV6 as well

gusty condor
#

I see, but they didn't even cite RWKV

fickle hare
#

if they were to cite, it would have to be RWKV6, but there's nowhere to cite RWKV6 from up to now

#

unless you'd accept a reference pointed to a github commit

#

As to the title, I'd prefer one mentioning multi-head linear attention and data-dependent decay/gate

#

(over NTM)

void quartz
misty igloo
void quartz
#

Or are you intending to just focus on the state size increase and compare them side by side

gusty condor
void quartz
void quartz
#

PS: that’s technically their AMD MI100 cluster, not the cluster we applied for

fickle hare
#

that's only <40 TFLOPS per GPU...

void quartz
#

Yea, driver bottlenecks are a real problem

fickle hare
#

oh well, MI100 then it makes sense

#

MI100 peak ~90TFLOPS

gusty condor
fickle hare
#

~50% MFU is good enough

#

thought it was A100
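
The utilization estimate above is a single division. A minimal sketch using only the two figures quoted in this thread (just under 40 TFLOPS achieved per GPU, roughly 90 TFLOPS peak for the MI100); the helper name `mfu` is made up for illustration:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved throughput as a fraction of hardware peak."""
    return achieved_tflops / peak_tflops

# Figures quoted in the chat: <40 TFLOPS achieved, ~90 TFLOPS MI100 peak.
# Using 40 gives an upper bound on the utilization for these numbers.
print(f"{mfu(40, 90):.0%}")  # prints "44%"
```

That lands a little under the "~50%" quoted, which is consistent with both inputs being rough.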

void quartz
#

Anyway since there is still time till decision. Pushing to go past the 160 node barrier haha

#

All the numbers were from <1hr test runs

fickle hare
void quartz
#

On that note. To compare the models …. Do we need a v5 pile?

#

Not sure how we plan to compare v4 to v5 - different tokenizer and dataset

#

For specific evals like memory, it can be very clear that the difference comes from the architecture change. Beyond that, a criticism could be the change in dataset

gusty condor
#

Compare with World models

fickle hare
#

I thought v5 world is not trained on exactly the same dataset as v4 world?

tough crane
young sparrow
young sparrow
#

(Or v5 -> v5.1, v6 -> v5.2)

tough crane
nova marsh
#

Hi guys, I would like to contribute to this project. Is there something I can do?

tough crane
misty igloo
#

or I guess the RWKV5.3 idea works (there was already technically a 5.1 and 5.2)

#

I'm a little worried about this minor versioning idea though, since to end users it may not be at all obvious that the model weights are totally incompatible

#

likely to cause significant support problems

void quartz
#

haha, i think we need to have a discussion on versioning numbers, cause likewise i think folks are confused as well

since the genie is out of the bottle, maybe we can do something like the Stable / Unstable versioning that nodejs and many other projects use (added to the agenda for TSC later)

#

v5 can be stable, while v6 is still unstable, then v7 should be stable when it's out

tough crane
#

Indeed, the difference among GPT-1, 2, 3, 4 is just increasing the parameter size. 😆

void quartz
#

ours is still defined as having code changes, so it's not compatible (without conversion)

misty igloo
#

none of this addresses Stella's concern about paper numbering but I think the compatibility is more important to signal properly

void quartz
# tough crane Did we fail to get NVIDIA GPU accesses??

technically the AMD cluster is an upgrade
( the nvidia cluster is the much older v100s, the only benefit is they have scale )
uncertain what we will actually get at the end (if any) - they tested both, but it seems like the direction they are testing towards is the new AMD cluster

tough crane
young sparrow
tough crane
#

I personally agree to this concern.

ML/DL model "versioning" seems to be different from the stricter semantic versioning of usual software like python 3.11.x

Even just increasing the # of params gives GPT's "major" versions.

#

I personally think that the numerous modeling_foo.py files in huggingface's transformers show the difficulty of "strict semantic versioning" of DL models.

young sparrow
void quartz
misty igloo
#

On an unrelated note, I think it'd be useful to accept all changes on the overleaf so we can start seeing new differences easily, but I didn't want to do it without asking first

young sparrow
misty igloo
void quartz
#

i have already met folks in person who are confused about v6, when they thought we are launching v5 😅 and asking if they should wait and use v6

#

and this is for them to play with the model (not evals)

misty igloo
#

yeah but imagine the confusion if 5.3 (previously 6) is like totally incompatible with 5.2... we actually already have that problem with 5.1 which is much less different but still have to support in the same codebase everywhere since there's a small model in the '5' range that relies on it

void quartz
#

request. can we move this convo to the main discord

#

not sure if its paper related anymore

misty igloo
#

the fundamental related questions, in terms of the paper, are:
single paper for 5 and 6?
name it differently to avoid confusion?

obsidian quest
misty igloo
#

does anyone think we should NOT press 'accept' on all revisions on the paper at this point? I think it will help us track actual changes going forward

spiral minnow
#

Of course, it may be complicated to fit all the details into 8-10 pages, so we should be careful that we're not overloading it

gusty condor
misty igloo
#

okay I went through and accepted all the changes to date - should be a lot easier to see what changes from now on

void quartz
#

regarding versioning
Details to be finalized, but we will be splitting versioning on two tracks. A more experimental branch (rwkv-x-???), and a more stable branch (rwkv-vK)

So in this flow, the current v6 will be renamed to an -x variant until it is finalized, stable, and gets promoted to the stable branch. This allows BlinkDL and others to make as many changes as they like in the "experimental branch", and promote to stable when it's finalized.

This allows a clearer, more stable release, with clearer communication / coordination.
This would also reduce confusion like the V5, R1, R2, R3, and R4 variants

obsidian quest
#

current models will be like
rwkv-x060-3b-world-v2-14%trained-20231129-ctx4k.pth
rwkv-x060-1b6-world-v2-42%trained-20231130-ctx4k.pth
p.s. x061 is coming 🙂
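
To make the naming scheme concrete, here is a rough sketch of a parser for checkpoint names like the two above. The regular expression is inferred only from those two example filenames, so the field names (`arch`, `size`, `dataset`, `pct`, `date`, `ctx`) and the helper `parse_checkpoint_name` are assumptions, not an official spec.

```python
import re

# Pattern inferred from the two example names in this thread (an assumption).
PATTERN = re.compile(
    r"^rwkv-(?P<arch>[xv]\d+)-(?P<size>[0-9a-z]+)-(?P<dataset>[a-z]+-v\d+)"
    r"-(?P<pct>\d+)%trained-(?P<date>\d{8})-ctx(?P<ctx>\w+)\.pth$"
)

def parse_checkpoint_name(name: str) -> dict:
    """Split a checkpoint filename into its labelled parts."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    info = match.groupdict()
    info["pct"] = int(info["pct"])
    # An 'x' prefix marks the experimental track described above.
    info["experimental"] = info["arch"].startswith("x")
    return info

info = parse_checkpoint_name("rwkv-x060-3b-world-v2-14%trained-20231129-ctx4k.pth")
print(info["arch"], info["size"], info["pct"], info["experimental"])
```

Keeping the experimental marker in the architecture field means tooling can refuse to load an `x` checkpoint with stable-branch code instead of failing on a shape mismatch.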

misty igloo
obsidian quest
#

current mix is faster than pytorch lerp

misty igloo
#

it switched directions between v5 and v6 when you changed the code to be more optimized
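
A small sketch of what "switched directions" can mean for the token-shift mix. It assumes the older form is `x*mu + prev*(1 - mu)` and the newer form is `x + (prev - x)*mu`; that is a reading of the change, not something confirmed in this thread. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
def mix_v5_style(x, prev, mu):
    """Assumed older form: mu is the weight on the CURRENT token."""
    return [xi * m + pi * (1.0 - m) for xi, pi, m in zip(x, prev, mu)]

def mix_v6_style(x, prev, mu):
    """Assumed newer form: mu is the weight on the PREVIOUS token,
    which matches lerp(x, prev, mu) = x + mu * (prev - x)."""
    return [xi + (pi - xi) * m for xi, pi, m in zip(x, prev, mu)]

x, prev, mu = [1.0, 2.0], [3.0, 5.0], [0.25, 0.75]
old = mix_v5_style(x, prev, mu)
# The same interpolation in the other form needs the complementary weight.
new = mix_v6_style(x, prev, [1.0 - m for m in mu])
print(old, new)
```

Under these assumptions the two forms compute the same interpolation once `mu` is replaced by `1 - mu`, so trained mix parameters would need that flip when converting between them.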

#

see #1097928558309036042 message

#

(We can talk about this in rwkv discord if needed)

steady ether
#

I thought we had a 169M model.

jade lotus
gusty condor
# steady ether I thought we had a 169M model.

V040(the new versioning)-Pile has 169M
V040-World is 193M
V050-World is 193M
V052-World starts from 462M (this is the one the article currently describes)
X060 is under development (estimated at 197M and 473M)

obsidian quest
obsidian quest
regal basalt
#

woa

paper dove
steady ether
young sparrow
#

Also the performance grades look incredibly suspect

subtle oak
#

Yeah, RetNet and this paper both seem to be mostly from Microsoft and they just inflated their own product…

spiral minnow
#

Anyone seen the Mamba paper yet? https://arxiv.org/abs/2312.00752 They incorporated a gating mechanism similar (in purpose) to the updates in RWKV-v5/6

remote elbow
#

this was posted here back when it was in review
#1103039376184852622 message

obsidian quest
#

yeah mamba has great numbers but i still cant get it to run yet

remote elbow
#

why? some installation issue?

misty igloo
obsidian quest
remote elbow
obsidian quest
misty igloo
#

love to know how it compares w rwkv x6 on same dataset, even in early going

obsidian quest
#

testing benchmarks

#

cant train yet

tough crane
young sparrow
obsidian quest
#

rwkv has same kind of parallelism as mamba

obsidian quest
#

tipping works for v5 (but not for v4)

gusty condor
gusty condor
subtle oak
steady ether
#

To be fair, none of the RetNet authors are on this, so maybe they just cited other papers directly without checking.

subtle oak
#

Yeah, maybe they just skipped the detailed checking and used the RetNet results directly. Sorry, I'm just guessing

weak urchin
obsidian quest
#

from community berk

gusty condor
#

Ravens, Mambas and Transformers

gusty condor
#

By the way, let's hurry for the RWKV-5 article (Mamba is still citing RWKV4)

void quartz
#

Even if the retnet paper refuses to change, you can push the other papers to clarify what they mean, and push for an amendment

young sparrow
#

Oh boy I forgot how much dicks they were about this

#

The promised "next version of our paper" never happened

misty igloo
obsidian quest
#

3b finished

#

7b before christmas

#

rwkv6 1.6b in 13 days

jade lotus
# obsidian quest rwkv6 1.6b in 13 days

Hey, have you guys tried any softmax variations like sigsoftmax or multifaceted softmax? It seems like this would be a natural enhancement, with a potentially big impact.

young sparrow
jade lotus
#

Seems like it could give you flexibility in how high level concepts are prioritized over time and directly tied into decay and attention gating
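
For reference, sigsoftmax replaces the `exp(z_i)` weight in softmax with `exp(z_i) * sigmoid(z_i)` before normalizing. A minimal pure-Python sketch of the two side by side; where exactly it would slot into RWKV is not specified in this thread, so this only shows the function itself.

```python
import math

def softmax(z):
    """Standard softmax, shifted by max(z) for numerical stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def sigsoftmax(z):
    """exp(z_i) * sigmoid(z_i), normalized over i.
    The exp(-max) shift is a common factor, so it cancels in the ratio."""
    m = max(z)
    w = [math.exp(v - m) / (1.0 + math.exp(-v)) for v in z]
    total = sum(w)
    return [v / total for v in w]

logits = [0.0, 1.0, -1.0]
print(softmax(logits))
print(sigsoftmax(logits))
```

Because the sigmoid factor is increasing in `z_i`, sigsoftmax puts more mass on the largest logit than softmax does for the same input.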

last mauve
steady ether
steady ether
#

Got a very quick response. They will update arXiv in late December.

...
The term "parallelization" is meant to refer to parallelization within sequences or chunks. To avoid any future misunderstandings, we will omit the parallelization column in our revision.

As for the performance indicators, they are majorly sourced from Table 5 in RetNet (as the attached image), which reports perplexity numbers on both in-domain validation sets and various out-of-domain corpora. From Table 5, we can see H3 slightly outperforms RWKV and Hyena in general, thus we assign it with one more '+' sign.
...
#

The table ^

young sparrow
steady ether
#

Not too familiar with these benchmarks. I think lower is better for perplexity
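
Lower is indeed better: perplexity is the exponential of the average per-token negative log-likelihood, so a model that assigns higher probability to the reference text gets a smaller number. A minimal sketch (the helper name is made up, and natural-log losses are assumed):

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that gives every token probability 1/4 has perplexity 4.
print(perplexity([math.log(4)] * 3))
```

A perplexity of 4 can be read as the model being as uncertain as a uniform choice among 4 tokens at each step.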

young sparrow
#

Oh I didn't realize they were ppl values

#

(reading is OP)

misty igloo
#

even their limited claim is annoying - you absolutely can parallelize rwkv within chunks by using parallel scan, it's just not necessarily desirable to bother
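
To illustrate the parallel-scan point: a recurrence of the form `s_t = w_t * s_{t-1} + b_t` can be computed with an associative scan, because composing two such steps gives another step of the same form. The sketch below is a scalar toy version under that assumption. It is not the RWKV kernel, where the state is per-channel or matrix-valued, and the function names are illustrative.

```python
def combine(left, right):
    """Compose two steps s -> w*s + b; `left` is applied first."""
    w1, b1 = left
    w2, b2 = right
    return (w1 * w2, w2 * b1 + b2)

def sequential(ws, bs):
    """Reference: run the recurrence one step at a time from s = 0."""
    s, out = 0.0, []
    for w, b in zip(ws, bs):
        s = w * s + b
        out.append(s)
    return out

def scan(ws, bs):
    """Hillis-Steele inclusive scan: O(log T) rounds, and every
    element within a round could be updated in parallel."""
    elems = list(zip(ws, bs))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With initial state 0, the accumulated offset is the state itself.
    return [b for _, b in elems]

ws = [0.9, 0.5, 0.8, 0.7, 0.6]
bs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sequential(ws, bs))
print(scan(ws, bs))
```

The scan does more total work than the sequential loop, which fits the remark that it is possible but not necessarily worth the bother.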

spiral minnow
# steady ether The table ^

This evaluation is just weird in general. Each model is trained on different data, so their performance on each individual "out-of-domain" corpus is a function of the data just as much as the architecture. Unless I've misunderstood and they actually trained each model from scratch on the same data

young sparrow
# spiral minnow This evaluation is just weird in general. Each model is trained on different dat...

I was going to say this but stopped myself because I wanted to look at the paper again. If the evals are framed as being about the architectures you're correct that they're entirely invalid. If they're framed as being about which model artifact to use that's mostly fine. However in such a context it's still the case that comparing in-distribution loss (does that mean validation set from the training corpus?) is meaningless

obsidian quest
#

mamba paper showed more results on this

obsidian quest
#

at 2x10^20 flops in their test:

hyena < vanilla transformer < rwkv4 < retnet < h3+attention < mamba < modern transformer

however the slope of rwkv4 is the best among all models, so it may catch up and surpass more models, similar to how it surpasses vanilla transformer

#

all papers should mention they are comparing with RWKV-4

spiral minnow
young sparrow
tough crane
young sparrow
tough crane
tough crane
gusty condor
gusty condor
obsidian quest
#

@tough crane @gusty condor

spiral minnow
#

BTW, what's going on at EMNLP, is somebody presenting the paper? It would be great to see how it's going 😄

steady ether
# spiral minnow BTW, what's going on at EMNLP, is somebody presenting the paper? It would be gre...
gusty condor
#

It seems that the time has passed

steady ether
pale nexus
gusty condor