#RWKV-papers
yes
ok. Please lemme finish setting up before you edit
added points in contributions section for that and rephrased/shortened mine.
created a motivation section after intro with some bullet points. feel free to edit
Thanks @tropic minnow! I'll take a look at these
I just added an author list. Lemme know if I missed anyone.
It's formatted kinda ugly rn. If @zealous snow or anyone wants to take a crack at making it less ugly, feel free.
could be interesting to say that RWKV has linear attention without any approximation unlike linformer and co ?
see this (draft) sentence in 2.motivation: => Address quadratic cost of attention by a reformulation to get "scalar attention" with linear cost
maybe add "with no approximation involved". I think this is important because I believe when you scale your model, approximations start to take a lot of importance
while here, if there is no approximation, scaling shouldnt be a "problem" (at least you are not limited by your attention calculation)
thanks, let me try to make it more ugly, or hopefully chatgpt could help me make it less
good point. you're free to convert the bullet points into text as u find better; we can always revise/discuss later. i formulated the current one this way bc didnt want to give the impression we're computing the QK attention, but our own variant of "scalar attn". So imo its not an approximation but its not the transformer formulation either.
Some long author list papers from EMNLP https://arxiv.org/pdf/2109.04650.pdf https://arxiv.org/pdf/2104.08200.pdf
Shouldn't we also have author affiliations?
is there anyone working on in-context learning examples?
i think @rustic rivet does
I have a short gist demo showing how rwkv can read a paragraph and store the state variable, then you can ask a lot of questions by utilizing the state
But this might not be a formal in context learning example, more like a showing off with the statefulness of this LM
i am not quite sure what kind of ICL examples are needed in the paper
Microsoft published a paper on why ICL works, and they believe the attention mechanism shifts the attention to mimic a meta optimizer. By tweaking the attention further, they verified the conjecture somehow
But for us there is no attention, however we can still few shot, then save the state, as shown in the gist
is this something we want to put in the paper? I remember @obsidian quest has mentioned this, not sure if any follow-ups
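A minimal sketch of the "read once, save the state, ask many questions" idea from the messages above. This is not the gist and not the RWKV API: `ToyRecurrentLM`, `step`, and `read` are hypothetical stand-ins, and the state update is an arbitrary toy recurrence. The only point is that a fixed-size recurrent state can be cached after the paragraph and reused for every question.

```python
class ToyRecurrentLM:
    """Hypothetical stand-in for a recurrent LM with a fixed-size state."""

    def __init__(self, size=4):
        self.size = size

    def init_state(self):
        return [0.0] * self.size

    def step(self, state, token):
        # Arbitrary deterministic toy update; returns a NEW state list.
        h = sum(ord(c) for c in token)
        return [0.9 * s + ((h >> i) & 1) for i, s in enumerate(state)]

    def read(self, state, tokens):
        # Fold a token sequence into the state, one token at a time.
        for tok in tokens:
            state = self.step(state, tok)
        return state


model = ToyRecurrentLM()
context = "the raven is the rwkv mascot".split()
questions = ["what is the mascot ?".split(), "which bird ?".split()]

# Read the paragraph once and keep the resulting state as a cache.
context_state = model.read(model.init_state(), context)

# Each question starts from a copy of the cached state, so the paragraph
# is never re-read.
states_from_cache = [model.read(list(context_state), q) for q in questions]

# Reference: re-reading paragraph + question from scratch gives the same state.
states_from_scratch = [model.read(model.init_state(), context + q) for q in questions]
```

With a real model the cached object would be the per-layer state tensors rather than a short list, but the control flow would look the same.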
Large pretrained language models have shown surprising in-context learning
(ICL) ability. With a few demonstration input-label pairs, they can predict the
label for an unseen input without parameter updates. Despite the great success
in performance, its working mechanism still remains an open question. In this
paper, we explain language models a...
I am also trying to visualize these inside the state variable, to uncover how timemix and channelmix meta optimized the model, too. However this feels like a follow-up blog instead of the current overleaf manuscript for a major introduction
This is as far as I go now
this is interesting if you could link these two together, meta optimizer in RWKV
Considering the whole meta thingy came from next token prediction training it's really fascinating
yes, this paper has been accepted as ACL2023 findings
and this is also a concurrent work, https://arxiv.org/abs/2212.07677
Transformers have become the state-of-the-art neural network architecture
across numerous domains of machine learning. This is partly due to their
celebrated ability to transfer and to learn in-context based on few examples.
Nevertheless, the mechanisms by which Transformers become in-context learners
are not well understood and remain mostly an...
By the way, it seems to be a comment in compiled PDF as blue colored chars
Yes
I just wanted to get an author list up so people could add themselves and point out issues before we're deadline-constrained
Okay, maybe we can have folks add their affiliation to the contributions list then and we can add it so it looks nice later
Yes. Please do this, all
What is your opinion about ICL examples? @last mauve @spiral minnow
Is there anything I can help out with? Y'all need some SuperGLUE fine-tuning experiments? :p
Sorry if I'm a bit out of the loop, haven't been in the office for a few days. What exactly is the question here? Whether to include some case studies showing the ability of RWKV to do ICL?
Yes, I do think some examples of the model output would be very beneficial to the paper. It currently has a lot of quantitative analysis but is lacking qualitative analysis
Do we have some example outputs from LAMBADA? It looks like the paper is very nearly full at the moment, so maybe we can add a bunch of example outputs in the appendix, but just highlight 1-2 of them in the main paper.
It would be really good to have some example continuations that demonstrate the key qualities of this model: fluent and coherent text continuations that maintain quality over long contexts
in that case, maybe this could go beyond ICL examples. I would try if i could find something. I would first put it in the appendix.
BTW, do we have a RWKV icon now?
blinkdl profile picture lol
yeah we do, it's a raven, it's used in the huggingface integration
The first RNN in transformers! 🤯
Announcing the integration of RWKV models in transformers with @BlinkDL_AI and RWKV community!
RWKV is an attention free model that combines the best from RNNs and transformers.
Learn more about the model in this blogpost: https://t.co/0FQmsaRVZw
cool, do you have the original image? maybe we could put it in the showcase
@outer vine
Would you pls provide an example so we can follow a consistent format?
Affiliation: EleutherAI
For me it's University of California, Santa Barbara
I agree with @spiral minnow here
If you aren't part of an academic institution, you can use your company (if they agree to it). Or if you have no institution, maybe we can ask @young sparrow if it's okay to use the EleutherAI affiliation
- I think the request was more about the general contributions statement than how to word an affiliation
- Everyone is welcome to use an EleutherAI affiliation if they wish to
It looks like we're using multiple phrases to refer to the attention used in this work (scalar attention and linear attention). I think it would be a good idea to concentrate on only one of those terms to not confuse readers. I'm not sure why it's referred to as scalar attention though, as far as I can tell it's actually a vector?
There's nothing I can think of at the moment. We're more focusing on tightening up the storyline for now.
There'll definitely be some followup papers though, so check in after this goes to EMNLP.
anyone knows the author of the last 2 sentences in 5.7 Context? overleaf username: kinetical
"In the context of LLM applications, injecting the context into the model is equivalent to prompt engineering or p-Tuning (Liu et al., 2022). This feature enables one copy of RWKV to serve multiple domains or purposes with an implementation of state cache, minimizing computation overhead" (essentially these lines)
@tropic minnow I reviewed the 5.7 Context referencing to the Appendix for details and clarifying the concept in that sentence
I don't understand the current Table 2. Actually, there are two tables with the same "tab:model_flop_count" label. Is it just a placeholder for the inference results?
Same.
I made a pass of the article. It seems that the introduction and the motivation overlap quite a bit. Maybe we should merge them into a more concise section?
I just realized token shift is not exactly a residual connection, but more like the structure of causal convolution in WaveNet 🤯
A revised section 5.6 is available.
I think my part of work is done. I prefer to use an Eleuther AI affiliation. 😁
+1. It's unclear what the numbers in the first row represent.
besides, the caption of Figure 5 lacks information on what kind of test the figure is representing.
I like it; it is more robust for the paper. Still, I think we can maintain some soft statement like "% Intuitively, by assigning each token the dual tasks of (1) aggregating all previous information and (2) predicting the next token, shifted channels can focus on the former task, enhancing information propagation." or so
I guess this is for the old version of token shift that replaces half of the channels by the previous channel. 🤔
Fixed.
I think even in the old version, token shift cannot "aggregate all previous information" in a single layer. It relies on multiple layers to do so. Like WaveNet.
Fair enough. Agreed 👍🏼
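A toy sketch of the point above, assuming token shift is the per-channel mix `mu * x_t + (1 - mu) * x_{t-1}` with a zero vector before the first token. It isolates token shift only (no time-mixing or channel-mixing), and `mu` and `receptive_field` are names made up for this illustration. One layer only sees the previous token; the reach grows by one position per stacked layer, like a stack of width-2 causal convolutions.

```python
def token_shift(xs, mu):
    """Mix each position with its predecessor (single channel, x_{-1} = 0)."""
    prev = 0.0
    out = []
    for x in xs:
        out.append(mu * x + (1.0 - mu) * prev)
        prev = x
    return out


def receptive_field(num_layers, seq_len, mu=0.5):
    """Count how many positions an impulse at t=0 reaches after stacking layers."""
    xs = [1.0] + [0.0] * (seq_len - 1)
    for _ in range(num_layers):
        xs = token_shift(xs, mu)
    return sum(1 for x in xs if x != 0.0)


# One layer reaches 2 positions (t and t+1); each extra layer adds one more.
reach_by_depth = [receptive_field(n, seq_len=10) for n in range(1, 5)]
```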
I propose to cut off section 8. To make it effective I would insert comparison graphs for each experimented task but not bringing significant value at the end. I still like the concept of that section, however.
Also, excuse me, but I think that the description of LAMBADA is not accurate enough. AFAIK, there is no "set of candidate words" or anything like that. LAMBADA is an open cloze task where one needs to guess the last word of the target sentence from context, without being given any choices.
One of the tables there is indeed a placeholder for inference results. On the phone now so cant check number. Will be updating it today
Is Section 8 unfinished?
If for section 8 you mean Fundamental Experiments, yes it is unfinished as it would take much space to insert graphs comparing to LSTM and GRU without creating significant benefit. I commented it.
Please all, in the Author Contributions use labels and not explicit numbers 🙃
ah ok, the new 8 🙂
@uneven blade would you mind adding a causal trace for the same example using some transformer model, to provide a comparison against the transformer about the information propagation?
And is it LAMBADA or LAMBDA? It's renamed to LAMBDA throughout everywhere now, even including the file name acc_lambda.png
lambada
LAMBDA without A occurs in Section 6, Figure 4 caption, Appendix H, and several labels and file names
I substituted them with LAMBADA.
Why are section 2 Motivation and section 1 Introduction separated?
yes it is. it needs a plot and a reference to an appendix where a table will capture the numbers
I'll DM @uneven blade the plotting script I use.
The Scaling Laws figure (currently Figure 6) seems lossy. May someone plot svg/pdf for the three plots?
I think it is better to use the same color scheme as the referenced paper.
maybe gather the plotting script and redo all the plots
I am also in favor of combining the Introduction with Motivation. I can take care of this if you think it makes sense. We will save a lot of space.
ask @last mauve
it's like a tiny convolution
You can also call it temporal residual connection, I've searched this term and some video AI papers do use this concept.
all RWKV models are trained with ctx1024 by default, and then some of them are finetuned to longer ctxlens
Note longer ctxlen usually slightly hurts (!) these benchmark tasks because they only care abt short ctxlens
Note long ctx models have seen more tokens (1+ epoch)
| model | params | LAMBADA (ppl) | AVERAGE | LAMBADA (acc) | PIQA | StoryCloze16 | Hellaswag | WinoGrande | arc_challenge | arc_easy | headQA | openbookQA | sciq | triviaQA | ReCoRD | COPA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RWKV-4,ctx1k | 3 | 5.24 | 57.52% | 63.94% | 73.72% | 70.28% | 59.63% | 59.43% | 31.83% | 64.27% | 28.74% | 37.60% | 85.70% | 11.07% | 80.56% | 81.00% |
| RWKV-4,ctx4k | 3 | 5.25 | 57.93% | 63.96% | 74.16% | 70.71% | 59.89% | 59.59% | 33.11% | 65.19% | 28.45% | 37.00% | 86.50% | 11.68% | 80.87% | 82.00% |
| RWKV-4,ctx1k | 14.2 | 3.81 | 63.54% | 71.05% | 77.42% | 75.57% | 70.24% | 62.98% | 38.31% | 70.71% | 32.28% | 40.60% | 90.10% | 24.06% | 85.73% | 87.00% |
| RWKV-4,ctx4k | 14.2 | 3.88 | 63.46% | 70.10% | 77.64% | 75.52% | 70.66% | 64.17% | 38.82% | 70.29% | 32.35% | 40.40% | 89.90% | 24.42% | 85.67% | 85.00% |
| RWKV-4,ctx8k | 14.2 | 3.86 | 63.71% | 70.83% | 77.48% | 76.06% | 70.65% | 63.85% | 38.99% | 70.24% | 32.64% | 41.80% | 90.40% | 24.58% | 85.67% | 85.00% |
However the 14B ctx8k model seems quite better when interacting with users
This can not be shown in any current benchmark tasks unfortunately
hi @obsidian quest , do you have some personally preferred cases/examples to be shown in the paper?
@uneven blade has plenty of cool examples
For some reason RWKV is somehow very good with math, especially marking-down things @obsidian quest
cool, just put here and i will make it on the paper appendix
i believe examples with long ctx would be more illuminating
what is the model size?
14b
would be interesting to test RWKV performance on long range arena, but perhaps out of scope for this paper
I'm not sure what's meant by combine here. If the motivation and intro have lots of overlap, move the intro material from motivation and the motivation material from intro. Then remove duplicates.
This is how to make Introduction and Motivation section not overlap:
Introduction starts with "The rapid advancements in..." basically positive things; Motivation starts with a twist "Despite the significant progress..."
However, I do think the total space of Introduction and Motivation needs to be constrained
Based on our title "RWKV: Reinventing RNNs for the Transformer Era", the Introduction part should immediately address aspects like the first coming of RNN and the Transformer Era we are now in.
I can make a copy of the current intro/motivation somewhere at the end and propose the shorter variant of both without duplicated information
You can also add a contribution part, basically anything that is not introduction and motivation goes into contribution
Who is atsushi.saito.dec17? I currently work on the introduction/motivation, but I see a lot of changes going on. I am not sure that it is a good idea to remove the names of most recognizable LLMs (GPT-3, GPT-4, ChatGPT, LLaMA) if we want this paper to be easily found on Google Scholar
I have the code, I can replot
@rich raptor
sure
Wow this paper is coming along really well, y'all're doing great work.
I can go through and do an editing pass, leaving comments and suggestions, later today
atsushi.saito.dec17 is my account. I did not delete these cites of GPT-3, GPT-4, ChatGPT, LLaMA, but Overleaf is sometimes weird if we are editing at the same time.
I didn't mean citations, but model names, Google indexes more by the paper's content.
For the moment, I found that it would be difficult to separate Motivation from Introduction so directly. It prolongs the content because it is hard to avoid repeating the information from Introduction in Motivation. I added a paragraph before the contribution to include the most important part of Motivation. It can be extra separated, but let me know if you need it done.
Um, how many output instances will be presented?
A page or two is common, but it doesn’t really matter? Like, however much people want to do
Thanks!! I'm reading the first two paragraphs in section 1. and I am not editing the first two pages now...
for now, feel free to add cases to the appendix

Fixed as "to predict the most probable target token."
all, pls if you see some issue or conflict or have some suggestion, pls use the comment feature (select a text -> right click -> comment) to provide non-urgent feedback before changing if possible
Track changes is on, but basically the entire text is marked as changed. If we accept all the changes that’ll make tracking new changes easier as well
I'll fix the formatting in the cases later
duplicate
For Figure 4, could it be converted to .pdf format with larger font size? Now it is hard to read the texts in the picture. Same for Figure 5, 6 and 8. I don't mind fixing them if anyone can share the plotting script.
In Table 4, we should only compare RWKV 14B with "GPT-level" 14B (which is an interpolation of Pythia and NeoX numbers)
In Cases J, show the last 3 samples + a coding sample + a chat sample
Mention RWKV-4 tricks to solve exp(k) overflow
Figure 6 needs to be vectorized
should we add https://github.com/BlinkDL/RWKV-LM and https://github.com/BlinkDL/ChatRWKV somewhere
I believe the numerical trick is in the 150 line version of RWKV, right? from line 80 to line 90 that a variable qq is first subtracted from pp and ww.
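A hedged single-channel sketch of the exp(k) overflow trick discussed above. The state names `aa`, `bb`, `pp` and the temporaries `ww`, `qq` follow the names used in this thread; treating `w` as a non-negative decay applied as `-w` and `u` as the bonus for the current token is an assumption based on the paper's notation, and `wkv_naive` / `wkv_stable` are names made up here. The idea is to store the numerator and denominator divided by `exp(pp)`, where `pp` is the running maximum exponent, so every `exp` call receives an argument that is at most 0.

```python
import math


def wkv_naive(ks, vs, w, u):
    """Direct evaluation of the weighted average; overflows for large keys."""
    out = []
    for t in range(len(ks)):
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out


def wkv_stable(ks, vs, w, u):
    """Same quantity as a recurrence, with exponents shifted by a running max."""
    aa, bb, pp = 0.0, 0.0, -1e30  # scaled numerator, scaled denominator, max exponent
    out = []
    for k, v in zip(ks, vs):
        # Output for the current token: combine the state with exp(u + k).
        ww = u + k
        qq = max(pp, ww)
        e1 = math.exp(pp - qq)
        e2 = math.exp(ww - qq)
        out.append((e1 * aa + e2 * v) / (e1 * bb + e2))
        # State update: decay the old state by exp(-w), then add exp(k).
        ww = pp - w
        qq = max(ww, k)
        e1 = math.exp(ww - qq)
        e2 = math.exp(k - qq)
        aa = e1 * aa + e2 * v
        bb = e1 * bb + e2
        pp = qq
    return out
```

For moderate keys the two functions agree up to rounding; for keys around 1000 the naive version raises `OverflowError` inside `math.exp`, while the shifted version stays finite.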
OK let me try to do this in a draft
returning the favor for once explaining the difference of 100 line version and 150 line version kindly
How's the RWKV scaling law comparing with GPT
Ok all of these are complete except for improving high-level coherence and the inference results (@tropic minnow -- Daily check-in here. Can we get these by Friday or Saturday?).
Some new small work items:
1. Figures 4-6 are too late by a page. Can we bring these up closer to their content?
2. Most people haven't included their affiliations to their contributor appendix section (e.g. "Affiliation: EleutherAI"). If you don't have an organization, university, or company that you'd like to link to this work, you're welcome to put EleutherAI. PLEASE GET THIS DONE WITHIN THE NEXT TWO DAYS. Also, if you contributed to this work and haven't put your contribution section, do so within the next two days. If you forget, we won't be able to add your name to the arxiv release until after the EMNLP double-blind deadline.
3. The first-page author block needs affiliations added. If someone could take care of that it'd help
4. The contributions need numbered, and the section describing each contribution should be added to the list item.
5. New paragraphs are all indented. Someone needs to go through and add \noindent or something to remove these.
6. Minor nit, but I think Figure 8 is low resolution?
All of these items also need handled as well
Bo PENG: built RWKV and scaled it from 0.1B to 14B.
Affiliation: can I write my github link 😉
yes we can ( 😆 ). have them in CSVs. need to organize and clean
No, affiliation needs to be some entity. If you have your own entity, you're free to point to that, but if it's just a link to your personal github it would read as "Affiliation: Myself".
If you want to go that route, we can just leave affiliation off your name entirely. You're also free to make up a new entity for your RWKV work.
And you came up with the RWKV idea right? I'm thinking:
Bo PENG: Invented, built the model and training code, and trained RWKV model scaling suite.
Excellent
I'm not sure who made the Fig 4. But Fig 5 is based on @obsidian quest 's excel sheet
Affiliation: RWKV Foundation (non-existent as of now, will be a nonprofit in the spirit of Linux Foundation)
created RWKV, built the model and training code, optimized its performance, and trained RWKV models from 0.1B to 14B.
Yep this works. We'll add it.
Linux has penguin 🐧 and your one has 🐱
Does anyone know the person who made Fig4??
Probably BlinkDL paste the Fig5's spread sheet in this channel. Where??
We need to give more credit to Attention Free Transformer because it is an inspiration of RWKV.
#1103039376184852622 message
@obsidian quest I rewrote your recursion into formula as in equation 27 to 33, can you confirm I didn't miss anything?
q = max(k-w, k)
7. Can someone add this to Related Work?
Working on it. AFT is already in Related Works but it is worthwhile to talk more about it then.
Sure. Just a sentence or two max.
AFT: introduces the sigmoid gate (called receptance in RWKV) in linear attention
and the sum(exp(K) V) / sum(exp(K)) formulation
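A small single-channel sketch of the two ingredients named above: a sigmoid gate multiplied onto an exp(K)-weighted average of V. It is written causally (sum over positions up to t) and without any position bias, so it is closer to a causal AFT-simple than to AFT-full or to the RWKV WKV formula; `gated_exp_average` and the argument name `rs` are made up for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_exp_average(rs, ks, vs):
    """out_t = sigmoid(r_t) * sum_{i<=t} exp(k_i) * v_i / sum_{i<=t} exp(k_i)."""
    num = 0.0
    den = 0.0
    out = []
    for r, k, v in zip(rs, ks, vs):
        num += math.exp(k) * v   # running sum(exp(K) V)
        den += math.exp(k)       # running sum(exp(K))
        out.append(sigmoid(r) * num / den)
    return out


# Equal keys give a plain running mean of v, scaled by the gate (0.5 for r = 0).
example = gated_exp_average([0.0, 0.0], [1.0, 1.0], [2.0, 4.0])
```

Because the running sums are all that has to be carried forward, the cost is linear in sequence length, which is the property the thread refers to as linear attention.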
@rich raptor
we have it in Attention Free Models paragraph. i guess we can make more explicit mentions when we talk about components of RWKV throughout the text
@obsidian quest I re-read your shared png and realized that I got it wrong the first time, here is a correction:
the starting point for the recursion is:
yeah now it's correct
compared to your shared new RNN formula, the only difference is that in your PNG the sign for w is positive, and in here it's negative (I followed the notations earlier in the paper)
Cool
It's in the paper now, in Appendix B right after introducing the RNN cell.
In Cases J, show 2 typical responses for each question using https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio model
Can mention https://github.com/ridgerchu/SpikeGPT which shows RWKV is good for Spiking Neural Networks too
@obsidian quest -- Can you do a pass and make sure there are no technical errors in any figures/equations?
I just put the equations I just added into the huggingface link for a code implementation, damn
yeah this will be a cool example
I have a quick newbie question out of curiosity: can RWKV be seen as an instance of a GNN
hmm no. RWKV doesn't featurize "vertices" nor "edges" and it doesn't have a very strong locality inductive bias like typical GNNs do
Does anyone want to fix this? If not, I can proceed. "RWKV is a large language model (LLM) architecture that can be both trained in GPT mode \cite{Radford2019LanguageMA_gpt2} and formulated as an RNN for lighter and faster inference."
RWKV enables the development of LLM, it is not a LLM per se
In addition, this "GPT" mode is something we understand because it is in our own project vocabulary... do you think it is clear when said that way?
see my comment next to that
the idea of GPT mode was sorta confusing the first time I looked into this. We should define it. the gist of it iirc is that we have all the tokens available to us so we can train in parallel. there's a part of the training that requires the scan operation, I don't remember which off the top of my head though
gpt mode also allows for building the state from a set of tokens in one forward pass
This response is incomplete and contains little information.
Should it be removed?
Also, PIQA, which stands for "Physical Interaction: Question Answering", should be totally capitalized, "PIQA" not "PiQA".
Fixed as PIQA
I doubt whether we need concrete examples to demonstrate this part of the limitation. What do carefully designed prompts look like? How do responses vary by different prompts?
this is just a template holder for now, we would remove it
I think you are right. You could leave a comment on the overleaf.
But is this a verified conclusion? The linear attention makes RWKV more sensitive to prompt?
I didn't see any concrete demonstration that linear attention to RWKV makes it more sensitive to prompts.
But according to some experiments conducted in the RWKV chat group, RWKV is likely to be sensitive to prompts. We just need several concrete examples.
For example:
Prompt 1: Please summarize the following paragraph: <paragraph>
Prompt 2: <paragraph>
Summarize the paragraph above.
Comment added.
I would personally describe it as (mostly) parallel training along the temporal dimension. The Transformer supports such parallelism, while RNNs with non-linearity in the recursion certainly do not.
IMO such parallelism significantly improves the scalability of training, and thus of the model parameter count
I’ll add a comment shortly
Scaling Laws figure (currently Figure 6) updated !
I'm not completely grasping your comment. RWKV recursion seems to contain \sigma(R) term. Do you say that RWKV uses only arithmetic (add, dot-prod, sum, etc) along time?
Could you share a script or notebook to update scaling law plotting?
yes, there is matmul + activation along time (unlike GRU/LSTM)
thus the time-sequential part is negligible during temporally parallel training (and for WKV it can be even further parallelized, though unnecessary at this point)
@fickle hare @burnt gulch @tropic minnow please check my draft attempt. It is not finished but seeking for approval on the direction (it is in the main as well):
Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM). The current implementation exploits a feature that distinguishes RWKV from other RNNs, that is: the channel-mixing block does not require any information from previous states, and thus can be applied in parallel to all time steps, greatly reducing the total execution time when the sequence is known in advance (benefiting both the training phase and the processing of the sequence before autoregressive generation). The time-mixing block is in this sense also parallelizable for the computation of the \textit{key}, \textit{value}, and \textit{receptance} vectors, but it then requires a sequential scan to update the attention scores \textit{wkv}, \textit{aa}, \textit{bb}, \textit{pp}. We call this approach "GPT"-mode as the model's temporal context surpasses the inherently sequential nature of recurrent networks that in theory precludes parallelization.
RWKV equipped with a simple softmax linear projection layer on top allows building large language models (LLMs) that can be both trained in GPT mode \cite{Radford2019LanguageMA_gpt2} and formulated as an RNN for lighter and faster inference.
I think that part of what I am proposing here can be moved effectively in 4.2 Transformer-like Parallelization so to introduce there this "GPT"-mode
Maybe it's worth mentioning that the "sequential" scan is elementwise (thus embarrassingly parallel) in batch samples and channels, thus already exposes sufficient parallelism (though not in the time dimension)
Could we describe GPT mode as follows? "First construct the computational graph along the entire time range (i.e. whole sentences and/or documents) for the i-th layer, and then iterate the same construction for the (i+1)-th layer, from i=0 (the bottom) to d (before the LM head)"
If my description is correct, GPT mode could have an alias like "time first graph construction mode".
that may miss the point that most computation along the time-range iteration is in parallel
Ah, one might regard the word "time first" as "not parallel along time-axis"
yeah that's my worry
@here The paper looks great now
- should we add https://github.com/BlinkDL/RWKV-LM and https://github.com/BlinkDL/ChatRWKV somewhere
- Mention https://github.com/ridgerchu/SpikeGPT which shows RWKV is good for Spiking Neural Networks too
- Mention that the interpretable and fixed-size RWKV state is beneficial for AGI safety, and we are working on a series of RWKV interpretability & steerability papers
- Cases J: add a multi-round chat sample. @uneven blade
- Show more time-xxx curves. example: https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/
- Add loss curves to show the training of RWKV is spike-free (https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-loss.png)
- Mention RWKV inference only requires gemv (no need for gemm)
I would not touch the computational graph @tough crane . Indeed the most efficient implementation just executes operations without cg
Only for training, I think that CG is still built inside pytorch because of back-prop.
we have implementations in Rust, C++, Go and also the model_run.py executes operations with numpy without pytorch
I'm personally against computation graph on any algorithmic topic since they are just for autograd and performance optimization, unrelated to the model
besides, the construction order of computation graph is unrelated to the execution order, and we are talking about execution order in this context
I see that it's off-topic here.
it's like, we want to say our execution order of both forward and backward is defined by the loop order (layer, t), while loop t is mostly parallel
layer by layer, then time-parallel
Why do we call it "GPT"? I was a bit confused when I read this naming at first.
GPT-like mode
because decoder only transformer naturally behaves like this
Just cite the paper Resurrecting Recurrent Neural Networks for Long Sequences whose main point is precisely that non-linearity activation in the RNN recurrence equation can be removed to enable parallel training
and gpt is the representative brand among them
Ummm, branding...
that's another problem though...
uh, I mean, when you want to say "this new mode is like the decoder-only transformers", the first short name come into your mind will be gpt...
the GPT mode will predict seq[i] given seq[:i] for all i in one forward pass iirc
update cases template
oh I see it mentions a bit about parallelism...
maybe a bit smaller inner margin for code blocks?
i think the final format should be in line with the whole paper, so the research lead should give a final decision by directly changing the first template in Appendix J, and i will help change the rest.
sure
@neon night we need to coordinate a bit to guarantee consistency. we are using rnn mode, gpt mode, parallelization..
@fickle hare He majors in parallel computing. You can coordinate with him about these things.
Fantastic, thanks
The current terminology on these things throughout RWKV community does mess a lot...
GPT mode: During training and prompt preprocessing in inference, we do time-parallel execution for all matmul (stack along the time axis, thus embeddings (B * T, C) @ weight (C, C)), only leaving time-sequential WKV (yet fused in a custom CUDA kernel), making it more bandwidth-effective
RNN mode: During decoding in inference, we do the timesteps one by one, like in Transformer decoding with KV cache, thus not using the custom CUDA kernel for WKV as well
Is that clear enough? I don't know if we are to keep the names in the paper though, maybe it's up to @obsidian quest's decision
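A toy sketch of the two modes described above, under loud assumptions: `scan_step` is a simple elementwise decayed sum standing in for the WKV update (not the real kernel), `matvec` stands in for the per-token projection, and all names are hypothetical. In the first function the projection is applied to every timestep before the scan starts (in a real implementation this is one stacked `(B * T, C) @ (C, C)` matmul; plain Python only mimics the ordering), and only the scan walks over time. In the second, each token is projected and folded into the state one at a time, as in decoding.

```python
def matvec(weight, x):
    """Project one token embedding: weight (C x C) times x (C)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def scan_step(state, k, decay):
    """Elementwise recurrence standing in for the WKV state update."""
    return [decay * s + ki for s, ki in zip(state, k)]


def time_parallel(xs, weight, decay, state):
    # 1) Projection for all timesteps up front (the part that batches over time).
    ks = [matvec(weight, x) for x in xs]
    # 2) Only the elementwise scan is sequential in time.
    outs = []
    for k in ks:
        state = scan_step(state, k, decay)
        outs.append(state)
    return outs, state


def time_sequential(xs, weight, decay, state):
    # Project and update one token at a time, carrying the state forward.
    outs = []
    for x in xs:
        state = scan_step(state, matvec(weight, x), decay)
        outs.append(state)
    return outs, state


weight = [[1.0, 0.5], [-0.5, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
zero = [0.0, 0.0]

par_outs, par_state = time_parallel(tokens, weight, 0.5, zero)
seq_outs, seq_state = time_sequential(tokens, weight, 0.5, zero)

# Prompt in time-parallel mode, then continue decoding in time-sequential mode.
_, prompt_state = time_parallel(tokens[:2], weight, 0.5, zero)
cont_outs, cont_state = time_sequential(tokens[2:], weight, 0.5, prompt_state)
```

Both orderings produce the same outputs and final state, which is why the prompt can be processed in one mode and generation continued in the other.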
Once we decided that we will need to go through the paper to make it consistent
this saved my ocd
deep-learning community creates a lot of "fancy" jargon (e.g. hallucination 🤣 has been pointed out; according to G. Hinton it should be called "confabulation", as in psychology)
we can use parallel & sequential mode to avoid mentioning GPT
Then "time-parallel mode" for training and prompt processing, "time-sequential mode" for decoding?
Yes, I agree.
LGTM!
Please check if you'd like to keep part of my content here (I'm okay with you throwing it away):
Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM)\footnote{https://github.com/BlinkDL/RWKV-LM}. The current implementation exploits a feature that distinguishes RWKV from other RNNs, that is: the channel-mixing block does not require any information from previous states, and thus can be applied in parallel to all time steps, greatly reducing the total execution time when the sequence is known in advance (benefiting both the training phase and the processing of the sequence before autoregressive generation). The time-mixing block is in this sense also parallelizable for the computation of the \textit{key}, \textit{value}, and \textit{receptance} vectors, but it then requires a sequential scan to update the attention scores \textit{wkv}, \textit{aa}, \textit{bb}, \textit{pp}. We call this approach "time-parallel"-mode as the model's temporal context surpasses the inherently sequential nature of a recurrent network that in theory precludes parallelization.
I'd say it depends on the space budget
It makes things much clearer, but maybe not really necessary as @neon night has pointed out that the LRU paper already mentioned that
in my opinion (but I am biased) we can move part on that on the appropriate time-parallel mode section (replacing the Transformer-like..)
mention that, but we do that differently. Citing LRU is a must of course
seriously, I'm already glad that the point I raised about terminology was also taken up by @obsidian quest. As well that RWKV can be used for but it is not a LM per se. For writing, I know I am long-winded and don't want to force with little space 🙂
I don't really see a difference between the parallelization of LRU and RWKV yet; although RWKV started much earlier than LRU publication
The difference I see is in the underlying motivation that allow that, which is IMO the model's temporal context
(w)
btw no questioning that the practical point is that non-linearity activation in the RNN recurrence equation can be removed to enable parallel training.
This seems to lack context? I'll try update a version to see if it gets better
@obsidian quest Fixed table4,5 in appendix.
We are all biased 😆. Really if we can grab a random person and see if he/she can understand the paper, is much better.
I think the paper is basically finished but I'm also biased
there's also the other parallel mode: if you unchain the time-mixes and wkv function, and have them point to separate states, you can use the acceleration provided by (BLAS?) to run thousands of rnn threads at the same time, allowing for hyperscaled inference in production environments
that might be another paper though
4.2 and 4.3 updated a lot. Please check if it reads good, thx
My Grammarly is not working on Overleaf right now and my English is not so good, so alter the text on your will
The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self-attention with a computationally efficient alternative based on factorized attention coefficients that maintains global interactions between inputs and the context
Should we rather say here that AFT is in fact a multi-head attention where 1 feature dimension = 1 head?
AFT is channelwise linear "attention". However it's using the same W for all channels
No I didn't?
yeah but saying on top of that it looks like a MHA with 1 head = 1 feature dimension can be beneficial to visualize why this works. RWKV uses the same principle as well
yeah you can mention that
As it's already the 4th section, we are expected to talk more about the details I think?
Sorry, I was giving it a unfair prompt. Now it says it depends
Yes, I was modifying it to match the subsection name better. But basically, they are talking about the same "advantage"...
I think this limitation hurts. It's too strong
(Is this really the case?)
I think the section title is better to be "Transformer-like Parallelization" and "RNN-like Inference"
Maybe "Transformer-like Parallelization in Time" and "RNN-like Sequential Decoding"?
Figure 6 needs to be png or it is loading very slowly 😩
Maybe downsample the points?
Actually IMO LRU doesn't need to be cited. mainly because it is recent paper, we don't have time to compare with them in eval section. If we mentioned LRU, no reason not to compare with it also.
I changed my mind 🙂 normally I would have already added a sentence or two about LRU into overleaf but in this case I prefer doing nothing
@rich raptor Could you share the Fig 4 csv file and plotting script, if you have them? I want to re-plot with a slightly larger font per @mortal latch's comment
no problem but next time I will condescend less 😇
DMed
I'm on your side this time. 😇 I suggest you move these words into section 4.2 and change the title from Transformer-like Parallelization to maybe Efficient Parallelization. They are better in section 4.2 than in 4.4. Better to not cite LRU.
And I think in case the paper is more than the page limit, the section 6 "Scaling Laws" needs to be moved into appendix.
Moved to 4.2.. About title for those sections, I see pro and cons in any of our proposals 🙂
sure, use pdf just for better resolution. also can change to png with dpi=300, I think it is good enough
In this case I suggest keep the original Transformer-like title. Although it is not exactly like Transformer, but certainly not unlike Transformer 😁
Agree
Figure 6 has been updated to png
That's for the section titles; then I think we agreed to call the two modes "time-parallel"-mode and "time-sequential"-mode.
This makes a lot of sense to me and we are all happy: consistent and robust names and we also make the connection with transformers and GPT "style"
I must say that "is implemented as a simple offset in the temporal dimension at each block implemented in PyTorch \citep{paszke2019pytorch} library as \texttt{nn.ZeroPad2d((0,0,1,-1))}." Respectfully, with this PyTorch code reference, it seems to me a bit randomly thrown in there
@neon night @fickle hare did we remove intentionally the "Context" section?
(I'm not online when it got removed so..
(I don't know what happened to that section
Again intrinsic bias, but I liked it - reason for removal? if too weak it is a good fit for the RNN-style
I do think just say "It's a simple offset & add" would be better
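For readers who don't parse `nn.ZeroPad2d((0,0,1,-1))` at a glance, here is a rough pure-Python sketch of what "a simple offset & add" means on one sequence of shape (T, C). On a (B, T, C) tensor that padding adds one row of zeros at the start of the time axis and crops one row at the end, so each position sees the previous token's features. The helper names and the per-channel mixing weights `mu` are illustrative assumptions, not code from the repo.

```python
def token_shift(x):
    """x: list of T rows, each a list of C values. Row t of the result is row t-1 of x."""
    if not x:
        return []
    zeros = [0.0] * len(x[0])  # nothing precedes the first token
    return [zeros] + [list(row) for row in x[:-1]]

def mix_with_previous(x, mu):
    """Per-channel interpolation between the current token and the shifted (previous) one."""
    shifted = token_shift(x)
    return [
        [m * cur + (1 - m) * prev for m, cur, prev in zip(mu, row, prev_row)]
        for row, prev_row in zip(x, shifted)
    ]
```

For example, `token_shift([[1, 2], [3, 4], [5, 6]])` gives `[[0.0, 0.0], [1, 2], [3, 4]]`, and a channel with `mu = 1.0` simply ignores the previous token.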
I have to help shorten 4.2 and 4.3 because it's getting longer than I expected again 😅
ping @neon night
I don't remove it but people want to, because I don't have enough time to revise every section. I work way slower
alright soft pushing for reintegrating probably before talking about the RNN-style -but up to you
Ok everyone, we're reaching the finish line for the v1 arxiv. A few new temporary rules:
1. No major changes without explicit approval by me or @tropic minnow.
2. If you remove anything, it needs to be commented so that it remains in the latex. No more deleting from the latex outright.
3. No new authors will be accepted for the arxiv version
If any authors are looking for things to do, many of these have not been addressed
And section 7 Inference Experiments - speed & vram
Yeah @tropic minnow -- what's the status of inference? I'm targeting a monday morning arxiv submission so that it goes live before the EMNLP anonymity deadline
The context section is removed by people because it's already mentioned in gist by the last two paragraphs in sec 4.6.
Although I think it is not ripe to mention cross attention there. We don't have equivalent things for cross attention.
Heh, I woke up this morning and went “huh, I guess I don’t have any obligations today I could sit down and seriously contribute to the RWKV paper!”
I’ll still do an editing pass and leave my suggestions, and I want to stress that I’m not asking for special treatment. Congrats everyone on the hard work
fair enough! thanks
Although I think that part about cross attention is not justified also. 😩 @obsidian quest Does RWKV have capacity to do things similar to what cross attention can do?
my point is just that, working with the state, the state itself containing the information e.g. of the prompt eliminates the need for cross-attention
look at the BART NLI task for zero-shot classification; this is a case where RWKV skip cross attention intrinsically
I think it might be a good idea to make a list of such claims / intuitions, remove them from the arXiv version, and add it with real experimental evidence to the EMNLP version
Yes. I think cross attention is very powerful, can do multimodal things like text2image, text2audio. The phrase "eliminates the need for cross-attention" is too strong
A lot of papers like this overclaim, and the rigor of our analysis and the scale of the models trained is one of the biggest factors in our favor
We can easily train a small CLIP model with RWKV to see what happens though
(just not by monday)
You're absolutely right
I'll make the claim softer by now, until further investigation
makes sense. this is what i have so far. i think it makes the point that RWKV is more efficient for inference. will try to bring the number of tokens generated from 100 to 256 (probs more realistic of chat), and complement with RWKV @7B and 14B. Will do similar plots for memory. My idea is to add a plot like this to main text and tables showing details in appendix. sounds like a plan?
my idea: I think RWKV can support Encoder-Decoder via this: for each decoder token, use a learned mixture of [previous decoder hidden state] & [encoder final hidden state].
(sorry the plot needs to be improved, there are some cpu-cuda misplacements, its in the making)
My flaming hot take is that arXiv preprints can be 15 pages if that’s what’s necessary and you can just push things to appendices for the submission. I have no issue reading 15 pages papers if they’re good.
the RWKV 169M data point seems wrong
yupp. repeating soon. will post an updated version
(modulus rwkv-169, this is roughly the state @100 toks generation. will repeat with 256 for all)
is >= 1k possible? that might expose a huge difference
yea idk we can try
how to fix the huge space gap 
On the lately updated 4.2, there are still some issues:
a) 4.1 is still mentioning GPT mode, need to get fixed
b) 3.1 overlaps with the new 4.2, need to dedup at either side
c) 4.2 is mostly explaining the fig 1c, so add a ref would be better
since we would have more examples, i think there is no need for a perfect arrangement for now
alright
Also, the current fig 1c does not really demonstrate how channel-mix executes (just a long green box)...
I've tried use the vspace to control it, but does not work
honestly, i don't understand this figure
there is not even the explanation for green color
me neither (if without my pre-knowledge
tried helping, see the float=h added
yea i see. i dont see the point of talking so much about rnns (figure and even putting their equations from papers 20yrs ago) when even the formulation of RWKV as an rnn is in the appendix, and our own rnn-like equations are in the appendix. i would look at shortening that section and push some content into appendices. curious to see what others think. we could also do it for EMNLP and have it like this on arxiv
can't agree more
imo, a figure like this in AFT paper would help better illustration
yupp fixed. thx for reporting. could you add comments to latex for other things you might see?
ty
and i think the key point we should emphasis would be the wkv formulation and its relation with attention and recurrence. things like token shift, custom cuda kernel, specific implementation like nn.ZeroPad2d((0,0,1,-1)) are like tricks to improve the performance and efficiency. all my personal opinions. curious to see what would you think of this
pushing the zeroPad and cuda Kernel to 4.7 Additional Optimizations seems reasonable. will do soon. In parallel, what do you think about shortening a bit the QRNN section in 3. background, perhaps keeping it more high level (removing equations or pushing them to an appendix) . i think we could also expand a bit on 2. Background -> Attention Free Models for the Attention-Free transformer given its parallelism with RWKV time-mixing block
+1 on shortening QRNN
agree
personal view, i would expect a picture like this to better show RWKV (apologize for the poor quality of this drawing.)
(the red line is a equals sign
Where are you leaving these suggestions?
I haven't done so yet but was going to use the overleaf
@paper dove @rich raptor the main argument against using RNNs to my knowledge is this plot from Scaling Laws for Neural Language Models (plus convergence issues?). I think we should have the data to replicate it with Pythia + RWKV? Would that be a light lift to add to the Scaling Laws section?
That's easy. You shorten the background and move Appendix B to 4.3.
This could be totally wrong and I’ve never trained an RNN in my life
Including AFT in the background section was rejected when I suggested it. At first, the template had a section skeleton with the titles RNN (3.1) and Transformers (3.2).
The figure is only for comparison against old RNNs and new RNNs. Similar comparison figure appears in QRNNs paper in ICLR 2017 .
@tender karma Your points about state and cross attention can be added as future work. 😌 and AGI safety
I appreciate -do you want me to touch or at this point is just more efficient if you do?
Should we replace section 3.1 with AFT instead of RNNs ?
Now I'm going to sleep and I have to deal with 3.1 tomorrow.
but this figure doesn't even explain the green color ??
Alright let me take a look
Following the pinned note: shall I just write as a comment and then you see, or directly as text?
I rejected but now it seems background should be about AFT
plan to do it tomorrow
have a good night or day !!
I will add an explanation about green part. If we replace 3.1 with AFT, then we will just remove it and replace AFT related fig.
@obsidian quest said AGI safety can be added so text
Perfect, on it
I think the figure makes it point in the QRNN paper, but personally i don't think this similar one makes much sense in this paper by simply using different color to differentiate QRNN and RWKV
Don't add anywhere except 4.6, where I made a draft for you. Don't make new sections @tender karma
@obsidian quest Do you want to include AFT's formulation and figure into the background section?
Possible choices are: (1) replacing 3.1(RNN) with AFT, or (2) adding AFT section into background section 3, or (3) not including (current status).
Done👍
@tropic minnow do you have insight into this?
He wants. AFT inspired this work. Adding is best, we can delete RNN anytime.
Yes sir 😎
I think rwkv would behave like transformers in this plot. However i dont have the exact data to reproduce this plot. I think behavior could be inferred from nlp benchmarks and current scaling laws. If someone gets the raw data for this im happy to do the plot.
(I dont have test loss by sequence position in test data for rwkv)
Ah yes. If there isn’t time to compute this before arXiv it’s not a big deal
I've made it about halfway through the paper, but my editing has been derailed by needing to go find many citations that should be in the paper but aren't. This paper doesn't currently cite:
- Pythia
- the Pile
- the Eval Harness
- OPT
- BLOOM
to name a few. You cannot use or compare to other people's work in your paper without citing it. The entire paper needs to be reread with an explicit goal of identifying missing citations.
will be cited
I left a bunch of comments, I hope they’re helpful.
shown in Figure 5 - should use log(ctxlen) too
So the elephant in the room is the scaling laws section. This section is wrong as-is because it follows Kaplan et al’s flawed methodology rather than Hoffmann et al’s improved one, and my original plan was to frame this as an initial exploration with more to come. However the more I think about it the less I think these are really the right plots to show anyways.
- The exact parameters of the scaling laws are so context-specific that nobody cares what your numbers are in general.
- We know that the optimal trade off for tokens to parameters is likely to change (and specifically shift more in favor of tokens) compared to how it currently is but not by how much
- “Scaling laws for RNNs” is not a novel or interesting thing, and is in the original scaling laws papers.
Based on these three points, I think that the best thing to do for this paper is probably to do the same analysis again (how long did it take?) using Pythia models and plot them on the same axes. Hopefully this will show no gap, and therefore provide additional evidence of good scaling. If that can’t be done, we can still replicate this plot from the Cerebras GPT paper because we have the Pythia test set loss value
To be clear by “replicate this plot” what I mean is take this plot and add Pythia to it
But I do think that the explicit scaling laws calculations should be:
a) pushed to the appendix
b) clearly labeled as a Kaplan et al-style experiment that we plan on following up on in the next version of the paper
should x-axis in the comparison to pythia feature params instead of flops right?
also, have these models (rwkv, pythia) followed the same token count training? bc otherwise the comparison wouldnt be apples to apples right?
Within 10%… which probably doesn’t matter too much.
You’re right, and ideally it would be exactly 300B tokens but it’s not. If the current evaluation tables use a checkpoint at ~300B tokens (IDK if they do, but this was talked about by BlinkDL at one point) it’s slightly biased against RWKV.
I think FLOPs on the x-axis and params as point labels make more sense, but we can look at both plots and decide.
Acknowledging my last message, if the evals of RWKV are the fully trained model then FLOPs is more justified. If they show mostly trained models that are token matched then params is probably more justified but it’s close
@obsidian quest @neon night I enjoyed talking about AGI, but here I really let myself go (although reasoned). It is in 4.6. If you use it, well, if you throw it away, I am just fine 🙂
We speculate that exploration of RWKV state-centric designs can enhance AGI safety. The state (or \textit{context}), summarizing past inputs, it might offer not only predictability but also an enhancement in interpretability. Its manipulation can guide behavior and enforce safety. Recurrence supported by temporal "awareness" could lead to stable systems and state-initiated generation may boost computational efficiency\footnote{In language models, initiating generation from the final post-prompt state could obviate prompt reprocessing, thereby bolstering both efficiency and data security.}. Despite challenges in managing high-dimensional states, these promising leads merit further investigation.
We can show dim(state) = 4 x d_emb x n_layer (namely x, a, b, x for each layer)
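A tiny sketch of that formula, in case it helps when writing the sentence for the paper. The function name and the example configuration (d_emb = 1024, n_layer = 24) are made up for illustration; only the formula itself comes from the message above.

```python
def rwkv_state_size(d_emb, n_layer, vectors_per_layer=4):
    """Number of scalars in the recurrent state: 4 x d_emb x n_layer as stated above."""
    return vectors_per_layer * d_emb * n_layer

# Hypothetical configuration, only to show the arithmetic.
example = rwkv_state_size(d_emb=1024, n_layer=24)
```

The point worth stressing is that the result does not depend on sequence length, unlike a Transformer KV cache that grows with every token.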
I agree. We have discussed before, and comparing models under different parameters would be more convincing. But at that time, it was uncertain whether there were data available from the same test set. RWKV seems to have been evaluated on the pile test set, which makes it possible to directly compare it with Pythia.
Do you have the Pythia data(compute vs loss and parameters vs loss)?
I am a little bit confused, is this pythia 14B data or RWKV 14B data?
RWKV
Pile test set loss for Pythia models:
70M -> 2.504
160M -> 2.186
410M -> 1.971
1B -> 1.845
1.4B -> 1.793
2.8B -> 1.720
6.9B -> 1.626
12B -> 1.582
For non-embedding param counts vs model label
You’ll need to math FLOPs yourself but it’s easy and there’s a calculator pinned in #scaling-laws if you don’t know how
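If the pinned calculator isn't handy, a back-of-the-envelope version is below. It assumes the common C ≈ 6·N·D approximation from the scaling-laws literature (N = non-embedding parameters, D = training tokens), which may differ from what the calculator does. The function name is hypothetical, and the non-embedding parameter counts still have to be looked up, since the model labels above are not the exact N.

```python
def approx_training_flops(n_params, n_tokens):
    """Rough training compute: about 6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

# Illustrative only: a model with 1e9 non-embedding params trained on ~300B tokens.
example_flops = approx_training_flops(10**9, 300 * 10**9)
```

This ignores the attention (or wkv) term, which is small compared to the dense matmuls at these sizes, so it should only be read as an estimate for placing points on a FLOPs axis.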
nice, I will add pythia data and update figure 6 later
I think appendix B can be back to main text if scaling laws calculation is pushed to appendix
This is for arXiv, there is no page limit in the main. That said, I think B and F are the obvious candidates for promotion. B is useful info for many readers and F seems to be a key property of the architecture
Does the Author Contribution section appear on the final paper?
just out of curiosity, why Johan S. Wind is not the co first author? I learn a lot from his blog, and he wrote the cuda kernel for RWKV
Not in the version submitted for peer review, but yes it’ll be in the paper once it’s accepted
Waiting for @neon night to first check the current content and his opinion of if/where is the best section to talk about investigating more the different meaning of those state vectors, and also if it makes sense for me doing that
@mortal latch Increased font size in fig:4.
A new 3.2 highlighting AFT 😇 3.2 needs a new title
https://openreview.net/pdf?id=HyUNwulC- found a paper about parallel scan, I think WKV CUDA doesn't use this paper's technique yet? @fickle hare
parallel scan is simple but requires an additional sweep over VRAM
it's not always helpful
I thought about modifying the impl but end up finding that we already have sufficient channels to parallelize
if you want to mention that, maybe cite it and say "with longer sequences there is potential ..."
You know this paper right? LRU also uses this technique
I don't know this paper but parallel scan is so simple a technique...
I mean parallel scan over the time dimension
yeah I don't know this paper in specific before your message but I always knew the linear recurrent/wkv can be parallel over time dimension
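For anyone who hasn't seen it written out, here is a small sketch of what "parallel over the time dimension" means for a linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0. It is plain Python with hypothetical function names, and it is not how the current WKV CUDA kernel works (as said above, that one parallelizes over channels). The trick is that composing two steps is associative, so a doubling scan can replace the step-by-step loop.

```python
def combine(first, second):
    """Compose two affine steps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Plain RNN-style loop, one time step after another."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def doubling_scan(a, b):
    """Inclusive scan in about log2(T) passes; each pass is independent across positions."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        # On a GPU every position in this pass could be computed at the same time.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With h_0 = 0 the hidden state is just the offset part of the composed step.
    return [offset for _, offset in elems]
```

Each pass rewrites the whole sequence, which is the extra sweep over memory mentioned above, so whether it pays off depends on sequence length and on how much parallelism the channels already provide.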
https://arxiv.org/pdf/1709.02755.pdf in contrast, this is the paper that does parallelization over the channel dimension in a linear RNN. I cited this paper. I think we're using precisely this paper's method
(always curious why simply applying some well-known implementation technique will produce a paper in the AI research field, since the day of ring-allreduce introduced to distributed training)

@tender karma based on that we don't really have a page limit on arXiv, I don't have anything particular against this now. Maybe others can revise it from the perspective of interpretability of state.
because this paper knows it and has the word "simple" in its title?
accepted. adding reference is always good
Seems not really the same. The SRU cell has nonlinear in its recurrent path, thus it cannot be further parallelized through time
While WKV is possible to get parallelized through time but we simply didn't do that, due to the already sufficient parallelism
The current status is the same though.
I have a question, why is the mode called "time-parallel" while it's not really parallel over time 
Anyway I wrote this
bc you train with all timesteps at the same time
anyway i didnt write it
yessir
a100 80gb
do we have the final training loss for pythia models (20b tokenizer)
mostly
there is really little flops in wkv anyway
The lowest test loss on RWKV is 1.75, while on Pythia, the lowest loss is 1.582. Cerebras paper states, "Pile test loss is crossentropy in nats/token. We correct all crossentropy results for different vocabularies to be comparable to the GPT-2 vocabulary." Is it because of the difference in vocabulary size? If so, direct comparison may not be appropriate. May I ask if you have the uncorrected loss from Pythia?
someone pls test pile loss of the two models
Have we got any agreements on naming consistency? For example, using either time mix, Time Mix, Time Mixing, etc. throughout the paper.
im happy to run on an a100 - 80gbs if someone has the script
whos author of figure 5? could we see it in log scale for y?
this
I am not completely sure about the following comment:
"Edward Raff: This needs a call-forward that RWKV will have parallels/relation to QRNN's design, otherwise section 3.1 reads very weirdly."
No need to compare QRNNs and RWKV in the context of parallelizing RNNs ?
No sure but the original data is here
a bit meh but it shows some improvement even in the final part
i think it is already compared in the qrnn paragraph
I can do this this evening, but if anyone has a couple GPUs there’s a fun life hack here: if you launch training in GPT-NeoX with a checkpoint that is finished it will spit out the train and test set evals and stop.
Updated in the paper. The y-axis of the original plot was in log scale but it didn't make a large difference.
updated to better represent token-shift
Please update the "Tell me about ravens." result because I have never seen such bad responses on https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
a better example:
Ravens are large, black birds with a distinctive white head and neck. They are found in most parts of the world, including North America, Europe, Asia, and Australia. Ravens are known for their intelligence and problem-solving abilities. They are also considered to be symbols of death and transformation in many cultures. Ravens are often associated with the afterlife or death because they have been known to eat carrion or even other birds. In some cultures, ravens are also believed to be messengers of the gods or guardians of treasure.
kk done. examples look quite cool now
done
Possible title is : "Transformers and its Attention Free Variant"
3.2 Transformers and an Attention Free Variant any reason for equation 8 duplication?
The last 2 paragraphs in Section 4.6 (Harnessing Temporal Structure for Sequential Data Processing) seem like they belong much more in a future work section, or possibly in the conclusion, right?
Peng Zhou, Qihang Zhao, Rui-Jie Zhu, Jiaming Kong, Johan S. Wind, Samuel Arcadinho @bronze frost @snow zealot pls add affiliation to authors section
I agree but I've been instructed by @neon night to put the content there and not creating new sections. I would rather move the two last paragraphs in a new Future Work section
I think because this is just the arxiv deadline it's okay. And looks like Eric ( @tropic minnow ?) is already putting it into a dedicated section.
perfect and thanks
@mortal latch objections to moving figure 5 to appendix? its currently in scaling laws but i think it illustrates more the long-context side of rwkv rather than scaling?
pls tag others if u know their usernames. lets aim at having this done tmrw. want to send to arxiv on monday
Sure! No objections.
FYI you're probably aware but it needs to be submitted to Arxiv by 2pm EST on monday in order for Arxiv to post it on Monday night, then we can promote the work on Tuesday and still be ahead of the anonymity deadline
Should we add a footnote stating that the order of authors other than the cofirst authors is alphabetical by last name?
and can anyone help to add author affiliation?
I made a pass and realized that the original figure 5 is an answer to the RQ3 in Section 5 Evaluations. Maybe keeping it in main text?
The appendix about gradient is flawed 😅 let me fix it
Suggestion: use log2(context_length).
Also, should the x-axis label be 'Context length' instead of 'Token position'?
It was context length before. I can fix it. However, using log(context length) can make the difference between lines a bit hard to tell..
Also, the first sentence in the abstract:
Transformers have "revolutionalized" almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length
should be revolutionized, not revolutionalized
I don't think so. The difference is on y-axis, which is not related to the scaling of x-axis.
It's really clear.
"Context length" not "content length"
Also, in the y-axis, "x 10^0" is not necessary
still missing some authors affiliation
I have added the affiliation info in the main text. However, for authors without affiliations, they are affiliated with EleutherAI for now. This information will be updated later.
Thanks! Should I add my contribution?
\paragraph{Ruichong Zhang - Tsinghua University} Proofreading and typo corrections; Advice on \ref{fig:ctxlen_rwkv_loss}.
Hi all, I add my affiliation institute (20,21), but I found that the space between the author list and abstract is extremely tight and use the vspace command can not solve that, can anyone help to fix it?
Yes we should
I will, once all authors affiliations are there
ok i fix it
i fix it
Thanks a lot!
It will look best if you add this to the “equal contribution” footnote. Something like \footnote{Equal first authorship. All other authors are listed alphabetically}
Also, in cases where people have the same last name, the standard in English is to alphabetize by first name. So the end of the list should go Jian Zhu, Peng Zhu, Rui-Jie Zhu.
@obsidian quest I know people said you can put whatever affiliation you want, but listing an “organization” that doesn’t exist will cause confusion because people will try to look it up.
ok can we use RWKV.com which exists
done
It seems like that redirects to GitHub?
Is there a reason you don’t want to put either “independent researcher” or “EleutherAI”? I was expecting you to put one of those
there will be a landing page very soon
to promote the nonprofit independent RWKV foundation when it is formed
maybe try this? \setlength\titlebox{5.5cm} rather then \small ?
Well, anyway, if you think it's ugly, do your best to improve it
When do you expect to create it?
how to join this foundation
By the way, may I ask what is your timeline for scaling RWKV to 100B?
and the 1.7T data version
IMO, training for 100B params could be after 20B(GPT-NeoX), 30B(OPT, LLaMA), 60B(OPT, LLaMA), 70B(Chinchilla)
But BlinkDL might have a more aggressive plan.
We need correct scaling laws studies before making decisions about substantially larger models
it's like a RWKV version of Pythia.
I would love to see RWKV Pythia
@obsidian quest anyone you'd like to acknowledge for the compute? Should we add an acknowledgement to the community in the RWKV server?
Hi. I made a few experiments to compare RWKV , ChatGPT and GPT-4. Results are not stunning, but still all of them I have included at the end of the appendix with a comment. This is only a draft so if you would agree to attach this section to the final version of the article I will edit it.
author of Figure 9: Effect of small initialization embedding? can we try having it as EPS or PDF format? so quality is preserved under resize
what's your plan for scaling laws figures
sure, I am the author of Figure 9
yeah acknowledge EAI & Stability for compute & support
better compare with open source models too
@plucky crypt ok you can include [RWKV-4 w/ GPT prompt] & [RWKV-4 w/ optimized prompt] in Table 6
And note that P-tuning can be very effective for RWKV because we can directly tune the full state, and we will do this in follow-up papers
I will finish 0.1~14B RWKV-4 "World" (100 langs) and RWKV-5 first
I find even 0.1B RWKV-4 "World" can chat in 100 langs
wow, amazing finding. because pile dataset contain 100 langs ?
"World" is using some https://huggingface.co/datasets/oscar-corpus/OSCAR-2301
updated
already training PilePlus https://huggingface.co/BlinkDL/rwkv-4-pileplus
a ridiculous appendix is on the way.. a trailer
Scientific work published at EMNLP 2023 must comply with the \href{https://www.aclweb.org/portal/content/acl-code-ethics}{ACL Ethics Policy}. We encourage all authors to include an explicit ethics statement on the broader impact of the work, or other ethical considerations after the conclusion but before the references. The ethics statement will not count toward the page limit (8 pages for long, 4 pages for short papers). we can think about this for EMNLP. monday soft deadline is about arxiv
done👍
Excuse me, which specific CPU is used in the experiment of Appendix J?
@snow zealot ?
no ARM for now, just x86. will try arm and AMD gpu experiments for emnlp if possible
better RNN cell graph 🙂 pls update
There are plenty of x86 CPUs. Intel? AMD?
How many cores did it use?
@snow zealot
done
alright i'll make a pass in a few hours to standardize the remaining rough edges. pls make all planned remaining contributions asap.
What's the deadline again?
Lambda cloud instance with 30 CPU 200 GiB and a A100 with 40gb
we're aiming tmrw for arxiv. EMNLP deadline is mid june
It seems like Appendix F is really important, in that it’s part of what allows us to train RWKV models at large scale. If that’s the case, it should be in the main body
but it is of little novelty compared to attention free transformer: https://arxiv.org/abs/2105.14103
We introduce Attention Free Transformer (AFT), an efficient variant of
Transformers that eliminates the need for dot product self attention. In an AFT
layer, the key and value are first combined with a set of learned position
biases, the result of which is multiplied with the query in an element-wise
fashion. This new operation has a memory comp...
Are you saying “gradient stability isn’t novel because other models also have stable gradients”?
cool pls fix position of [r_t] and color of [sigmoid]. move [sigmoid] slightly rightward
Move [3] and (X) slightly upward
At its core, why does this work and AFT doesn’t?
1. different w for different channel 2. token-shift
What does the section on Gradient stability show improvement over? What similar models lack stable gradients?
the w in AFT can be ill-posed
while in RWKV it has to be a simple exponential decay
I think AFT is also stable (if w is chosen properly), we are comparing gradient stability against RNNs
a new Appendix F shows that AFT's KV operation is stable
both are much better than usual RNNs
Yes we're not so novel against AFT but the AFT paper doesn't prove stability like us
it's natural to arrive at AFT when we linearize QKV attention - the main contribution of AFT is they find sigmoid[Q] & exp[K] is a great combination
Can you explain this more? Why does it happen? What precisely does it mean?
I think it wont happen in reality when you train an AFT
AFT is stable. It just has less capacity, so the LM performance is not very good
Okay, so Eric’s comments about novelty compared to AFT are irrelevant
We can replace 4.5 by Appendix F. Appendix F is more rigorous than 4.5, just a bit scary
wdym irrelevant? in appendix F we show AFT-like operations are stable (see #1103039376184852622 message ). so my comment about novelty is about us showing more proof that something others did shows nice properties
so its basically this conclusion: #1103039376184852622 message
like it?
perfect if find a slightly brighter color for [sigmoid]
?😆
guys, this topic is very hard to argue. I thought about it whole day
i think your conclusion is quite fair
can we do this pls: #1103039376184852622 message. like that result around 74% you mention, put it in the table
added in 8. Future work paragraph about potential of model state.
for chatgpt, we should probably add a footnote with date accessed/retrieved as its a live changing product
@bronze frost TODO: proofreading this is a good moment. pls leave latex comments wherever you find something wrong/(that could be improved)/(that needs details)
All, we're approaching the soft deadline for monday. Paper is looking very good. Thanks everyone for your contributions. Now it's about improving those rough edges.
Will do a pass later for standardizing affiliations and author contributions to format specified at section start. Will comment the current ones so information is preserved. Pls make sure information is there.
because of limited time and resources I have now only results for 2 datasets with optimized prompts. How many hours do we have?
12-16 would be reasonable
add the current ones and put empty spaces or - for the rest pls
how many would you need
ok, I will try to find good prompts for the rest of the datasets and run the experiment; for now I will put -
I’m trying to run the Pile test set eval from scratch on Pythia but something seems to be very wrong with the runtime. Going to do some debugging and report back
Ah I was using a batch size of 1
This seems weirdly low? Pythia 70M
| Task |Version| Metric | Value | |Stderr|
|--------------------------|------:|---------------|-------:|---|------|
|json=train:text:test.jsonl| 0|word_perplexity|133.5446| | |
| | |byte_perplexity| 2.0859| | |
| | |bits_per_byte | 1.0607| | |
@paper dove @obsidian quest how did you compute Pile test loss for RWKV?
I was doing the same on 4xRTX A5000+NvLink and got issues with the runtime as well
batch size?
Pythia-410M
| Task |Version| Metric | Value | |Stderr|
|--------------------------|------:|---------------|------:|---|------|
|json=train:text:test.jsonl| 0|word_perplexity|39.8875| | |
| | |byte_perplexity| 1.7397| | |
| | |bits_per_byte | 0.7988| | |
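For anyone cross-checking the two tables above: the last two rows are tied together by bits_per_byte = log2(byte_perplexity). A minimal sketch of that conversion (the function name is ours for illustration, not part of the eval harness):

```python
import math

def bits_per_byte(byte_perplexity):
    # Perplexity is 2 ** (bits per unit), so the inverse is a base-2 log.
    return math.log2(byte_perplexity)

# Values copied from the Pythia 70M and Pythia-410M tables above.
for name, ppl in [("Pythia 70M", 2.0859), ("Pythia-410M", 1.7397)]:
    print(name, round(bits_per_byte(ppl), 4))
```

Both reported bits_per_byte values agree with this to about three decimals, so the two rows are at least internally consistent.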
Now that I am using BS > 1, it's RWKV that's extremely slow
I removed a ton of \vspace commands. Using \vspace is a very crude method for arranging figures. It is a) strongly discouraged in general and b) the absolute last thing you should do on a paper. The removal of over 50 \vspace commands appears to have made no visually obvious changes to the paper
HF rwkv is still buggy. avoid
@young sparrow do you have raw token loss for pythia models
@young sparrow here's the script i used to benchmark time and memory consumption, which downloads weights from HF and loads them using the rwkv pip package. maybe it's helpful
It is! Thank you @tropic minnow
credits to @snow zealot for the development hahah
Should we discuss what happened to the Scaling Laws section? I acknowledge there have been previous objections ( #1103039376184852622 message ) and they are commented now. Any reason?
I commented it out because as I reported previously it is
a) done incorrectly
b) doesn’t seem to provide evidence for any of the paper’s claims (even if it were correct)
c) gives equations and guidance that will mislead people (because they’re incorrect)
I’ve spent a lot of time trying to figure out a way to post-hoc correct them and I can’t find one.
I think that a) should be disqualifying in and of itself, but even if it’s not then b) and c) seem to refute any alleged usefulness.
Let’s do it right and put it in the EMNLP submission. But there’s a basic responsibility to not put incorrect and misleading information in the preprint.
okay. this is quite a major change. please lets report here for these kinds of modifications
I did. I discussed it with BlinkDL and Quentin ahead of time. I reported that I wanted to do it and had conversations about how it might be rescued with several people.
I’m sorry that I didn’t mention it explicitly again when I made the change.
so the conclusion here should change. Also I changed "draw parallelisms" to "draw parallels"
We do showcase the scaling
We scale the model to 14B params and compare performance with transformers
The fact that we don’t derive explicit scaling laws doesn’t mean we don’t showcase scaling
This plot doesn't match what I'm getting for Pile test loss? It makes RWKV look worse though
Corrected to 'behavior'
Table 6 is too small, barely identifiable
the same appendix J ends weirdly, almost like abruptly, and maybe the indentation should change
Been going down a RWKV deep dive recently while scouting for good base models to work with. Great coincidence that there happens to be so much discussion around it at this same time 🙂
I honestly think it's doing a big disservice by referring to itself as just an RNN. I feel like the fact that it's ultimately derived from Apple's Attention Free Transformer is one of the most interesting aspects but is seldom talked about
Maybe the AFT isn't the most flattering aspect, but I think that it's just very interesting and catches the eye enough to warrant a deeper dive, at least that's what happened for me
The current paper draft stresses that heavily
Have you read it?
nope, I was trying to scroll up to see if I can find the draft, but so many message lol
It’s pinned
Ok I shall read it now, and i'm happy to hear you guys are stressing that part heavily 🙂
Gotta go to bed soon so mainly doing slow skimming through the paper, but I'd just like to say that I have one of the most common types of color-blindness, and I approve the colors used for the charts 👍 Very easy for me to distinguish the lines 🙂
Overall it's a great looking paper and I love that last couple sentences at the conclusion 
and the fact that it significantly beats ChatGPT in MathQA is seriously impressive, and that's not even the RWKV model trained on 1.7 trillion tokens yet. (or is it?)
It's not 🙂
So much potential, I wish the paper great success and I'll do a deeper dive on it tomorrow, can't wait to fine-tune some insane models on RWKV-V12-14B once it's fully trained on almost 2T tokens 🔥
I think that there might be intrinsic "think-step-by-step" mechanisms in RNNs
yea 
Might sound a bit out there in terms of paper discussion, but I saw this mentioned somewhere amongst the HF X Raven announcement a few days ago and found it interesting;
RNNs, or at least the way RWKV does things, seem to more closely mimic certain aspects of the brain.
specifically in terms of the locality vs non-locality aspects (Transformers being more of the latter, while RWKV and the human brain tend to be more of the former)
and then that makes me also think though, I recall some research showing that the hippocampus and learning centers of the human brain actually have a strong thing in common with the transformer architecture, it would be interesting if RWKV turns out to conserve this aspect that has similarities with the hippocampus
https://www.quantamagazine.org/how-ai-transformers-mimic-parts-of-the-brain-20220912/
Do you think RWKV could outperform CoT task's performance against transformers because of its architecture?
we should test https://www.pnas.org/doi/10.1073/pnas.2105646118 one day
Not necessarily, RWKV does not look back at previous tokens
@obsidian quest I added this because I think time decay (fig. 9) is inductive bias?
yes but the model will learn similar patterns with simple initialization so the initialization is just to speed up convergence
...but also come from parameter initialization as it speeds up training.
While many alternatives to Transformers have been proposed with similar claims, ours is the first to back up those claims with models
What are the proposed alternatives ?
AFT transformer? linformer? qrnn? sparse attention? state space models? most papers mentioned in the 2. Related Work
added more citations in 4.5 about tackling gradient problem in RNN
Almost deadline?
yes
I'm doing a final pass and submitting to arxiv over the next hour
Hi, I left two comments yesterday, but they haven't been resolved
may i just split this into blocks? it seems not a consecutive dialog flow
No for reproducibility
@everyone -- There have been a few new authors added since this deadline:
Bartłomiej Koptyra, Bolun Wang, Ruichong Zhang, Stanisław Woz
If you're on this list, please DM me and prove that you contributed before the deadline and describe what you did. We want the RWKV community to be authors, but we need to guard against people jumping in before the deadline, adding a comma, and claiming authorship.
If I don't hear back you will be removed.
As a new contributor? We're full for the arxiv version. Followup papers will need more hands though.
are these cases from you? you just type these samples in one dialogue flow?
Does RWKV run with arbitrary context length out of the box, or is there some form of adaptation needed, or what? This experiment has almost no explanation in the paper as to how it was done currently.
from first principles yes the architecture should run with any length. in practice, this is the setup #1103039376184852622 message
for the model evaluated, i think its base rwkv-4 most likely but would be nice to know more @obsidian quest (data comes from: #1103039376184852622 message)
can run with arbitrary context length out of the box if trained with correct method
but now i am using a lazy method so limited by training ctxlen
The last conversation is from me, which is summarization
I intentionally set topp=0
This is good for reproducibility
Okay so some important details:
- How many tokens was it finetuned with?
- What finetuning settings were used?
- Is the evals shown in the comparison with Pythia, OPT, etc. done with ctx 1024 or 2048?
these zeroshot tasks only care abt short ctxlen
the rwkv ctx1k (pile 1-epoch model) has similar numbers
I’m not sure which of my questions this is supposed to answer
Yea sure I was thinking about making this a set of logic questions when I added this lol (but if it’s too cluttered then nah)
Typical method:
- ctx1k -> 2k [10B tokens] -> 4k [till almost-plateau] for 1B5 / 3B
- ctx1k -> 2k [10B tokens] -> 4k [10B tokens] -> 6k [10B tokens] -> 8k [till almost-plateau] for 7B / 14B
The zero-shot number are almost unchanged after these.
I computed Pythia numbers with full test samples, and I think all of them are less than 1k tokens.
@worn bloom and Stanislaw Wozniak along with us (Przemyslaw Kazienko and Jan Kocon) performed comparative experimental studies on ChatGPT (Appendix J).
There is no reference to Fig. 6 (\ref{fig:inference_time}). I would suggest Samuel Arcadinho adding it in Sec. 6.
Do you have a plot showing loss over the course of context length extension training
Yep they have both verified with me as well.
The paper has been submitted to arxiv.
We might be able to see it at 9AM Beijing time tomorrow morning (UTC+8)
can build from wandb records when i am less busy 😂
Bolun Wang is sick atm (covid) 🙃
Can you vouch for his authorship?
yes
Great! Then everyone is accounted for and will keep authorship.
@obsidian quest the current paper is set to be announced on arXiv in 8 hours. Do you have a plan regarding a Twitter thread / announcement?
Bolun Wang: RuoxinTech
yes EAI can tweet and I will retweet
Okay, do you want me to write the thread?
okay
@everyone if you are an author of the paper and are on Twitter, please DM me your Twitter username so I can tag you in the thread when it goes live in six-ish hours.
Also, does anyone know what the largest RNN ever trained previous to this is?
Including LSTMs?
May be ELMO in EMNLP2018??
Sure
Do you know how big that is?
Small....
All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011.
Ummm.... : 🤨
The 5.5B here refers to the training corpus size 😂
First draft (some image attachments are planned but it's work to appropriately interweave them in discord)
Everyone knows that transformers are synonymous with language modeling at scale… but what if they weren’t? Over the past two years @obsidian quest and team have been hard at work figuring out how to scale RNNs to unprecedented scales. Today we are officially announcing a preprint detailing RWKV: a reinvention of the RNN for the transformer era.
Note that this paper is a work in progress, and its release is forced upon us by anonymity deadlines. We are planning on continuing to improve and update the paper (including explicitly deriving scaling laws!) and you can come to the discord server for the latest https://discord.gg/z9SGyZE6EE
Claiming that you can match a transformer’s performance is nothing new, and plenty of other papers put forth that claim. What makes RWKV special is that we actually train models up to 14 billion parameters and show consistently competitive performance with token-matched transformers! As far as we know, the largest previous RNN is two orders of magnitude smaller.
RNNs struggle to scale because of how they parallelize, but by making the time decay of each channel data-independent, we are able to parallelize RWKV the same way transformers are during training! After training, it can be used like an RNN for inference.
Our design is largely inspired by the “Attention Free Transformer,” which we realized could be written as an RNN if we use circular matrices as "w" in its formula. AFT alone isn’t able to match GPT’s performance, but inspired by it we continued to make progress on “RNNifying” transformers.
RWKV isn’t without its flaws. While we do approximately match the performance of transformers, our anecdotal experience is that it’s more sensitive to prompts and struggles to incorporate very long range information more than traditional transformers do. We are continuing to work to quantify these phenomena.
Our models are available for download on the @huggingface hub (warning: inference appears to be bugged at time of writing) or you can use our library: https://github.com/BlinkDL/RWKV-LM
[a couple tweets of tags and acknowledgements go here]
Log scale... 🤯
is "flaws" the right word here? I'm thinking limitations or drawbacks
I view them as synonymous but if you think others won’t I can use limitations
I would agree with using limitations instead of flaws. For me, flaw has a connotation of "something is being done incorrectly", rather than just suggesting that this method can be improved
@everyone if you are an author of the paper and are on Twitter, please DM me your Twitter username so I can tag you in the thread. You have half an hour or whenever I get around to it, whichever happens second.
Can anyone find the preprint on arxiv? I thought it should've been out at 8 PM EDT today but I am unable to find it
DM on twitter or discord?
Discord
I don't see it, and it disappeared from my arxiv profile???
Is that normal? Maybe it's updating?
That happened with Pythia and then it appeared like 20 minutes later
Why's arxiv gotta play with my heart like that smh
arxiv takes a while to update each day
if you're impatient you can watch it slowly process in order of arxiv IDs
arxiv likes playing these games with our emotions 
it's all planned as part of their program to get certain emotional reactions out of authors to train their new emotional sentiment analysis model they've been working on /s
Current list of authors with names replaced with twitter tags if I have it
@BlinkDL_AI @eric_alcaide @QuentinAnthon15
@AlbalakAlon, @SSamDav, Huanqi Cao, Xin Cheng, Michael Chung, @GrellaMatteo, @kranthigv, Xuzheng He, Haowen Hou, Przemysław Kazienko, kocon_jan, Jiaming Kong, Bartłomiej Koptyra, @lazercuber, @SriIpsit, @FerdinandMom, Atsushi Saito, @XiangruTang, Bolun Wang, Johan S. Wind, Stanisław Wózniak, Ruichong Zhang, @ZhangZhenyuan3, Qihang Zhao, @zp_pengzhou, @lukeZhu20, @Rudd80856040
Transformers have revolutionized almost all natural language processing (NLP)
tasks but suffer from memory and computational complexity that scales
quadratically with sequence length. In contrast, recurrent neural networks
(RNNs) exhibit linear scaling in memory and computational requirements but
struggle to match the same performance as Transfo...
Everyone knows that transformers are synonymous with large language models… but what if they weren’t? Over the past two years @BlinkDL_AI and team have been hard at work scaling RNNs to unprecedented scales. Today we are releasing a preprint on our work
Is it still possible to add a tweet tag now? 
The twitter post mentioned the word "circulant matrix", and I realized suddenly https://www.bmvc2021-virtualconference.com/assets/papers/0296.pdf this paper is very similar to our methodology, see the images:
our circulant matrix looks like this
interesting, so all w here are learnable param. and RWKV only one w is learnable param.
what's more, this "spatial shift" technique from the same lab (v1) https://arxiv.org/pdf/2106.07477.pdf (v2) https://arxiv.org/pdf/2108.01072.pdf @obsidian quest you gotta see this, it's crazy. maybe RWKV for image is on the way
RWKV universe is coming
Table 5 AFT-simple should be 1.046 1.209
I am training L12-D512 rwkv to check test loss
Figure 4 x-axis wrong params scale
Maybe the unit is in billion parameters
missing Hadamard product here?
Hmm its an element wise between vectors
An arbitrary permutation can be expressed as the product of disjoint cycles.
Increasing the depth of RWKV networks might lead to the product of disjoint cycles to represent arbitrary permutations and graphs.
Thanks!! This comment suggests that MEGA could have two modes.
@subtle oak ^^^
If I correctly understand.
Thanks for correcting! So MEGA also has two modes like RWKV haha
I actually do not add the MEGA’s space and time complexity, I add the table with Transformer, Performer, Linear Transformer, Reformer and AFT-full🤣
One of my friends pointed out the missing following reference: https://aclanthology.org/2022.emnlp-main.24.pdf
I've just modified it to O(cd) but should we expand the complexity order comparison table into two columns: one for training mode and one for inference mode??
Thanks for your modifications!
Do you suppose that big-O is for inference?
Hmmm, I think if we separate the inference and training, maybe we need also claim that RWKV has different complexity for training and inference?
Yeah I guess it is okay
Although RWKV still uses the recurrent mode for training, if we need to parallelize it, the complexity will change
Ah, are you saying that we could write two rows 1:RWKV(GPT-mode) and 2:RWKV(RNN-mode)?
If RWKV(GPT-mode) inference pass runs in parallel for each layer, it holds O(d) time complexity. Is my understanding right??
RWKV(GPT-mode): O(d), O(Td)
RWKV(RNN-mode): current table
I think if we use the convolution mode instead of RNN mode, its time complexity will become O(Tlog(T)d) by FFT, and its space complexity will be O(Td)
Oh, I am wrong because of ignoring reducing/merging costs.
But now in GPT mode, it still uses the RNN backbone (if you check the CUDA code)
So the complexity will become O(Td) and O(Td) I guess...
The convolution mode is actually just a theoretical best approach for parallelization
So maybe if we mentioned the MEGA's two mode, we also need to claim the mode in RWKV
To minimize modification of this big-O table, I modified the caption of the table as follows: from "Complexity comparison" to "Inference complexity comparison". Is it okay for you?
convolution-mode (a.k.a. divide-and-conquer mode) could be mentioned in an appendix section as a future TODO.
RNN mode is just a subset of gpt mode where the inference batch size is 1
Great! It makes sense to me
Yeah this is the relation between them haha
yes element wise. so it is supposed to be a Hadamard product, just like AFT
EQ 14 is fixed into "\odot"
Hmm, speedy beam search using disconnected gpt mode
2 odot i believe?
Why not exp(k + w + log(v))
What's that equal?
e^(k+w+log(v) - k - w)?
Ehh, I am probably missing something,
Would be interesting if it turned out that
WKV = V
Is it correct to be 0, 1, 10 in Billion??
Assumed this log-Billion-scale
yeah now correct
i would suggest keep the notation in line with AFT
I added the words "in billions" into the caption of Fig4.
as a special case, we can try to adjust the permutation so that prompt ends with question can be answered as well as prompt starts with question (I'm not sure)
What are the sections to roll back, to improve or add for EMNLP? Limit 8 pages right plus Appendix.
By the way, I think that FFT can decrease the time complexity to O(d log T) if O(T) operations are executed in parallel for each layer of FFT's divide-and-conquer recursion. Does that seem valid to you??
the FFT optimization is mentioned in footnote 3 I guess, while its faster in theory, in practice O(T) is enough (or not) 
can FFT be useful if we calculate according to this matrix? I think this can mitigate some of RWKV's limitations.
namely, using a circular matrix without causal attention mask for processing prompts to achieve "ring topology" rather than caring about the ordering of the prompt.
just my two cents
it has been discussed long ago
and is preceded by parallel scan
FFT is O(T log T) BTW
(O(T) operations in parallel isn't real; you cannot really provide parallelism as large as B*T*C, given that would be millions to billions of elements to compute in parallel)
FFT only can accelerate the convolution I think, so if we use the RNN mode (include GPT mode), the complexity could not be decreased
would you be interested in implementing a CNN inference mode?
a FFT implementation by Jianlin Su 🤔
like S4/MEGA
I'm planning to implement a dumb O(T^2) for this
just to see if the result is good
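For readers following the mode discussion, here is a rough single-channel sketch of the "dumb O(T^2)" evaluation next to the O(T) recurrent one. It assumes the WKV form from the paper draft (decay w over past tokens, bonus u for the current token); the function names are ours, and there is no overflow protection, so it is only meant for small inputs.

```python
import math

def wkv_quadratic(w, u, k, v):
    """Naive O(T^2): re-sum every past token at each position."""
    out = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_recurrent(w, u, k, v):
    """O(T): carry the running numerator (a) and denominator (b) as state."""
    a = b = 0.0
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        cur = math.exp(u + kt)
        out.append((a + cur * vt) / (b + cur))
        # Decay the past by one step, then fold in the current token.
        a = decay * a + math.exp(kt) * vt
        b = decay * b + math.exp(kt)
    return out

k = [0.1, -0.3, 0.7, 0.2]
v = [1.0, 2.0, -1.0, 0.5]
print(wkv_quadratic(0.5, 0.3, k, v))
print(wkv_recurrent(0.5, 0.3, k, v))
```

The two agree up to floating-point error; the quadratic version is only useful as a reference to test other implementations against.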
@sullen horizon will add Long Range Arena numbers
I was talking with Hugging Face a couple months back about writing a HF blog post explainer for RWKV but have been on paternity leave - is anyone doing that? If not, happy to lead and collab on it!
heeyyy fantastic 😄
Table 5 @last mauve
AFT-simple should be: train 1.046 // test 1.209 according to AFT paper
L12-D512 RWKV: train 1.010 (w/dropout) // test 1.178
trained with AdamW wd 0.1, dropout 0.1, bsz 16, initial LR 6e-4
I've tested similar method in pytorch and it exposes significant precision issue
brute force exp(n*u) won't really work
Reminds me of the wkv power triangle implementation
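import torch

class wkv_power(torch.nn.Module):
    def __init__(self, dims, T):
        super(wkv_power, self).__init__()
        self.T = T
        # The parameter registration was cut off in the original paste. The names are the
        # ones forward() uses; the shapes and zero initialization are assumptions.
        self.register_parameter("time_decay", torch.nn.Parameter(torch.zeros(dims)))
        self.register_parameter("time_first", torch.nn.Parameter(torch.zeros(dims)))
        self.register_buffer("mask", torch.ones(T, T).tril().unsqueeze(-1).to(torch.bool), persistent=False)
        self.register_buffer("tri", ((torch.arange(T).expand(T, T)+1).t() -
                                     torch.arange(T)).tril().unsqueeze(-1), persistent=False)

    def forward(self, k, v, r):
        # k, v, r: (T, C). Row 0 holds exp(k)*v (numerator), row 1 holds exp(k) (denominator).
        vx_kx = k.exp().unsqueeze(0).expand(2, k.shape[0], k.shape[1]).clone()
        vx_kx[0] *= v
        # t[j, k, c] = exp(time_decay[c] * (j - k + 1)) for k <= j, and 0 otherwise
        t = (self.time_decay.expand(self.T, self.T, -1) * self.tri).exp() * self.mask
        # vx_kx[0][0] += state[2]
        # vx_kx[1][0] += state[3]
        rza = torch.einsum("rki,jki->rji", vx_kx, t)
        # Out-of-place from here on, so autograd can still use the einsum inputs in backward.
        vx_kx = vx_kx * self.time_first.exp() + rza
        wkv = r * vx_kx[0] / vx_kx[1]
        # state[2] = rza[0][-1]
        # state[3] = rza[1][-1]
        return wkv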
kk added👍 . do we have test bpc for RWKV L=6, D=512?
how fast is torch einsum compared to reorders and matmuls?
(i think it might be faster to use more elementary primitives)
the original L6 D512 model is seriously overfitting because it's not using wd/dropout
right. we can say this. still, if we had the number, i think it could be good to report it. do we have it?
@obsidian quest if I want to train a RWKV model of X parameters for Y tokens, do you know how I should set the rest of the h params? Is there an approximate formula?
usual GPT h params are okay - we can search for better h params & lr schedule
Approximately how many A100-hours does it take to train a model with X params and Y tokens?
If you don’t know the number but do know the amount of FLOP/second you get during training we can reverse engineer it
RWKV-4 14B BF16 ctxlen4096 = 114K tokens/s on 8x8 A100 80G (ZERO2+CP)
What about 2048 context?
same training speed regardless of ctxlen
What about a model half the size? Does speed increase linearly?
yes
Okay, so for every 1B params 1B tokens it takes 34 hours?
Does that sound right
(It doesn’t to me…)
No, that would mean a 1B model trained on the pile would take over a year
Did it take 30 days to train the 14B model?
What is “Gt/day”? Giga tokens per day?
here efficiency = (Gt/day) * (B params) / (#A100s)
So efficiency = B tokens x B params / A100 / Day
That’s exactly the number I was looking for ^_^
probably still have 20% room for optimization
So if we want to spend 15 days doing experiments, we have time for 30 (B params) (B tokens) / A100
Woah what are you running on 336 A100s right now o.O
I am tuning PilePlus and training World for 0.1~7B simultaneously
So if I assume we can get 64 A100s for scaling laws experiments, we get 2,000 (B tokens) (B params)
Okay, can we get all combinations of the following training runs launched @obsidian quest?
Tokens (B): 1, 2, 4, 8, 16, 32
Params (B): 0.025, 0.05, 0.1, 0.2, 0.4, 0.8
Should take only 100 A100-days total
(Param counts don’t have to be exact, if you give me a list of actual param counts I can adjust the exact token counts to compensate)
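As a sanity check on the budget arithmetic above, here is a small sketch that plugs the quoted 14B throughput into the efficiency formula and sums the proposed grid. It assumes efficiency stays roughly constant across model sizes (which the thread only loosely confirms) and ignores any overhead; the variable names are ours.

```python
SECONDS_PER_DAY = 86_400

def efficiency(tokens_per_second, params_b, n_gpus):
    """(Gt/day) * (B params) / (#A100s), as defined in the thread."""
    gt_per_day = tokens_per_second * SECONDS_PER_DAY / 1e9
    return gt_per_day * params_b / n_gpus

# RWKV-4 14B: 114K tokens/s on 8x8 = 64 A100s (figures quoted above).
eff = efficiency(114_000, 14, 64)

tokens_b = [1, 2, 4, 8, 16, 32]
params_b = [0.025, 0.05, 0.1, 0.2, 0.4, 0.8]

# Total (B tokens) x (B params) over all 36 combinations.
grid_cost = sum(t * p for t in tokens_b for p in params_b)
a100_days = grid_cost / eff

print(f"efficiency ~ {eff:.2f} (B tokens)(B params) / A100 / day")
print(f"grid cost  ~ {grid_cost:.1f} (B tokens)(B params)")
print(f"estimate   ~ {a100_days:.0f} A100-days")
```

Under these assumptions the grid comes in below the 100 A100-day figure quoted above, so that number leaves some slack for smaller models running less efficiently.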
we can train on minipile https://arxiv.org/abs/2304.08442 do we have a 20b-tokenized version
The ever-growing diversity of pre-training text corpora has equipped language
models with generalization capabilities across various downstream tasks.
However, such diverse datasets are often too large for academic budgets; hence,
most research on Transformer architectures, training procedures, optimizers,
etc. gets conducted on smaller, homogen...
can use the method to generate minipiles of different sizes
Xingjian Du, Leon Derczynski, Bolun Wang pls add your contributions to Appendix A: Author Contributions
No, we need to train on the same dataset for each of them. It’s okay that we don’t train in the whole pile, that doesn’t matter
https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2
RWKV featured in karpathy's talk (at 20:10)
Link seems to be down
It wasn't an RWKV-specific shout-out
It was this table: #1103039376184852622 message
Does it bother anybody else that the contributions section isn't in the same order as author list?
someone on Zhihu asked
"Why is time complexity of linear transformers said to be O(Td^2)? Do they assume linear transformers use some d^2 kernel functions?"
I don't understand linear transformers so repost here.
still a good link anyways!
okay will reorder
reminder
thanks. did this yesterday, guess they got clobbered, re-adding
@subtle oak
i think the zhihu guy might be right
Yes because they do?
Yeah actually I assume that the kernel complexity is d^2...
The formula of the linear transformer can be represented by this
And some papers just multiply K and V as the first, but do not use the kernel, like cosFormer
I guess they use the same QKV structure, e.g., multiply KV as first and then Q
I apologize for the simplification of the complexity analysis; if we need a precise estimate, the complexity needs to be replaced with O(Tk^2)
but some papers use k=d
like cosFormer and Spikformer
Maybe we should describe a more general complexity, so we need to use the k?
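To make the "multiply K and V first, then Q" point concrete, here is a toy sketch in plain Python. It leaves out the kernel feature map, normalization, and causal masking that real linear transformers use, so it only illustrates why the multiplication order changes the cost from O(T^2 d) to O(T d^2); all names are ours.

```python
def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def attention_quadratic(q, k, v):
    # (Q K^T) V: builds a T x T intermediate, cost O(T^2 d).
    return matmul(matmul(q, transpose(k)), v)

def attention_linear(q, k, v):
    # Q (K^T V): builds a d x d intermediate, cost O(T d^2).
    return matmul(q, matmul(transpose(k), v))

# T = 3 tokens, d = 2 channels, integer entries so both orders match exactly.
q = [[1, 2], [3, 4], [5, 6]]
k = [[1, 0], [0, 1], [1, 1]]
v = [[1, 2], [3, 4], [5, 6]]
print(attention_quadratic(q, k, v) == attention_linear(q, k, v))
```

With a feature map of dimension k in place of d, the intermediate becomes k x d, which is where a k-dependent bound would come from.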
Could we re-upload a hot-fixed version to Arxiv?
no, it's anon period now. We can only update it after emnlp review
we will as soon as we can
This kind of thing is completely meaningless
And that paper in particular is extra meaningless because the proof hinges on an assumption that’s not actually true of transformers
If you use their formal model but change arbitrary precision to finite precision it stops working
If people get all their fixes in within the next 4 hrs I can submit a revision this afternoon
what about anonymity #1103039376184852622 message
I think that a silent update without announcement to fix some errors should be fine, but if people feel differently we can hold off.
I think that RNNs with constant numbers of parameters and infinite precision are also universal TM...
I think that this risks desk rejection for little benefit. *CL can be quite anal about this
yes im of the same opinion. i think we should abide strictly by the rules... having it rejected would be quite annoying and we probably gain little
BTW, y'all're featured at the top of eleuther.ai 🙂
Yes, I agree that we should hold this off until the end of review period
Ok we'll wait
Ok so our next work item is the EMNLP deadline on June 23. We need to:
- Condense what we have to 8 pages
- Tighten up the storyline
- Resolve the scaling laws issues that @young sparrow reported
My current thought on a core team for this would be @last mauve, @tropic minnow, @spiral minnow, @zealous snow, @tender karma, @rich raptor, @broken moth since all have enough academic writing experience to lead this rewrite (to clarify, anyone can contribute, but these are the rewrite leads). If you want added to or removed from this list, DM me. Once the core team is finalized by the end of the week, I'm going to start assigning sections and working on the EMNLP version with a new overleaf project.
EMNLP Overleaf: https://www.overleaf.com/9624387813psbrpbqypjfc
Can someone point me to the part of the paper that references the modified wkv forward function to alleviate overflow errors?
Anyone trying to reproduce from scratch is going to run into that.
The unmodified wkv formula only works in float64
@misty cedar This part?
Key search terms are avoid overflow
Thanks:)
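For anyone reproducing from scratch, the overflow issue can be sketched on a single channel as follows. This is a generic running-max (log-sum-exp style) rewrite of the recurrence under the paper draft's WKV form, not a copy of the CUDA kernel; the function name and state layout are assumptions for illustration.

```python
import math

def wkv_stable(w, u, k, v):
    """Single-channel WKV with a running max exponent p.

    The true numerator/denominator are a * exp(p) and b * exp(p);
    only the rescaled a and b are ever stored.
    """
    a, b, p = 0.0, 0.0, -math.inf
    out = []
    for kt, vt in zip(k, v):
        # Output: combine the state with the current token (exponent u + kt).
        q = max(p, u + kt)
        e1 = math.exp(p - q)
        e2 = math.exp(u + kt - q)
        out.append((e1 * a + e2 * vt) / (e1 * b + e2))
        # State update: decay the past by w, add the current token (exponent kt).
        q = max(p - w, kt)
        e1 = math.exp(p - w - q)
        e2 = math.exp(kt - q)
        a = e1 * a + e2 * vt
        b = e1 * b + e2
        p = q
    return out

# exp(1000) overflows a float64, but the rescaled recurrence stays finite.
print(wkv_stable(0.5, 0.0, [1000.0, 1001.0], [1.0, 2.0]))
```

Every exponent passed to exp is at most zero, which is why this form does not need float64 to avoid overflow.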
One of my friends in SF wants to do a podcast episode on RWKV, specifically to highlight alternatives to transformers
https://www.latent.space/podcast
This is in part, due to the strong positive reception from the paper (and me pestering them on RWKV for weeks)
Anyone interested? They are hosted in SF and prefer to do podcast in person but can be remote.
It is expected to get very technical (time/channel mixing) into how things differ from transformers and the pros and cons (aka the paper)
Overall I do believe it is good exposure for RWKV
(I have asked blink prior to opening up the question here, I also know the host well, and he can prepare in advance the topics so you do not end up surprised or uncomfortable in the podcast)
The podcast by and for AI Engineers! We are the first place over 50k developers hear news and interviews about Software 3.0 - Foundation Models changing every domain in Code Generation, Computer Vision, Data Science, and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Guests from Databricks, Glean, ...
if no one else volunteers i could do this but unfortunately only remotely🙏
hey @obsidian quest i could help launching these experiments #1103039376184852622 message on the cluster if you're too busy but i would need the training settings you're using for the other RWKV models
ok pls list the experiments you'd like to test
In order to compute scaling laws we need to run these training runs
I think that scaling laws would be a big value add to the paper, but we don't currently have the necessary data to do it correctly
probably these:Tokens (B): 1, 2, 4, 8, 16, 32 Params (B): 0.025, 0.05, 0.1, 0.2, 0.4, 0.8 Should take only 100 A100-days total (Param counts don’t have to be exact, if you give me a list of actual param counts I can adjust the exact token counts to compensate) as referenced in #1103039376184852622 message by @young sparrow
how abt the LR schedule
hmmm thats why we need to know the settings you're using to train current RWKV hah
My method: const LR_init for 10~20G tokens, then exponential decay to LR_final
I think it's actually fine to use original training data for scaling law, because I am decaying LR faster than cosine-decay
nice. are config params here https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py correct for training models similar to current rwkv-4 models on HF?
actually @obsidian quest i think it would be way easier for everybody (you as well lol) and way more reliable (consistency etc) if you launched the training runs on eleuther cluster. i can patch it if you're too busy but likelihood of an experimental mistake increases a lot lol
Is this LR decay rate calibrated to the number of training tokens in any way?
i decay LR when the loss decrease rate is below a threshold
What is the threshold
I begin the decaying of LR when the loss decrease rate is less than "3e-4 per 40M tokens" - just a random threshold
This happens when the model is trained for 10~20G tokens (more so for larger models)
I'm having trouble following what that means. Can you state it explicitly, like it's an algorithm?
Is it something like this?
if |loss(step[current]) - loss(step[current - 40M tokens])| < 3e-4:
    lr is decreased by ???
if smoothed(|loss(step[current]) - loss(step[current - 40M tokens])|) < 3e-4:
    begin the exponential decay of LR
okay so that's for starting when the decay happens
And the decay rate aims to reach the target LR after how many tokens? The size of the remaining dataset?
yes size of the remaining dataset
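Putting the answers above together, the schedule can be sketched roughly like this. This is not code from the RWKV-LM repo; `should_start_decay` and `lr_at` are hypothetical names, and the sketch assumes the smoothed loss is sampled once every 40M tokens.

```python
import math


def should_start_decay(smoothed_losses, threshold=3e-4):
    """True once the smoothed loss changed by less than `threshold`
    between the last two samples (assumed to be 40M tokens apart)."""
    if len(smoothed_losses) < 2:
        return False
    return abs(smoothed_losses[-2] - smoothed_losses[-1]) < threshold


def lr_at(tokens_seen, decay_start, total_tokens, lr_init, lr_final):
    """Constant lr_init until decay_start, then exponential decay that
    reaches lr_final exactly when the remaining data is used up."""
    if decay_start is None or tokens_seen <= decay_start:
        return lr_init
    if total_tokens <= decay_start:
        return lr_final
    progress = min(1.0, (tokens_seen - decay_start) / (total_tokens - decay_start))
    # Interpolate linearly in log space between lr_init and lr_final.
    return lr_init * math.exp(math.log(lr_final / lr_init) * progress)
```

For example, with `lr_init=8e-4`, `lr_final=1e-5`, decay starting at 10G tokens and a 30G-token dataset, the LR at 20G tokens is the geometric mean of the two endpoints.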
This is done manually right? I see the following comments in the code currently:
# By default we are using exponential LR decay.
# Here are my suggestions for training.
# Let's say you are training a L6-D512 model.
# 1) Set lr_init = lr_final = 8e-4. Let it run for some mini-epochs, until you feel like reducing LR.
# 2) Check epoch_save_frequency and make sure the partially-trained model is saved. Ctrl+C to stop the run.
# 3) Set lr_init = 8e-4, lr_final = 1e-5, betas = (0.9, 0.999).
# 4) Set EPOCH_BEGIN & LOAD_MODEL to load the partially-trained model. Continue the training.
#
# For L12-D768, set lr_init = 6e-4. For L24-D1024, set lr_init = 4e-4. For L24-D2048, set lr_init = 3e-4.
yes manually
however i think this is mostly useful for small batchsz training. cosine decay is fine for large batchsz
bsz = batch size?
I see final_tokens=n_epoch*len(train_dataset)*ctx_len
If I want to train for a pre-specified number of tokens and then stop, how do I determine how to change this? So my dataset will have more tokens than I actually use
The best method will be to work out a formula that can provide good LR schedules for any [ParamSz - DataSz - BatchSz] combination
For example, I believe the best LR schedule for a tiny DataSz is [constant LR]
How big is "tiny"
several G tokens
Where is the LR decay type actually set? I see the initial and final LRs, but where do you set it to exponential decay
around 10~20G tokens for pile models
No, where in the code
manual
You support warm-up right? So if I wanted to make the switch from linear to exponential happen automatically, I can set the warm-up lr to your preferred constant?
do they provide any travel assistance?
I'm in seattle for a bit. @tropic minnow -- where are you located?
@obsidian quest I've added (extremely hacky) support for automatically switching from constant LR to exponential decay and custom dataset sizing in my fork. Can you see if it runs as anticipated?
I dont use warmup (or only 10 steps) - RWKV is very stable
Right, but I hacked warm-up to do the constant LR-then-decay strategy
huh europe. you're closer if anything
How about you'll do it remote unless they support my travel. Sound good?
Great! If it does it’ll make scaling experiments a lot easier
Sounds reasonable to me. Will ask
another person on Zhihu commented that the receptance gate is a gate for output instead of for forgetting
I agree with his opinion on this, the gate is not even on the time passing route
Agreed.
I agree with this. I think that the following paper's method could be regarded as \sigma(R_i) = 1.0 in RWKV. To consider an extreme case, if R_i is either 0 or 1, then RWKV chooses one of the two: "take" or "skip".
We present a very simple algorithm for attention that requires $O(1)$ memory
with respect to sequence length and an extension to self-attention that
requires $O(\log n)$ memory. This is in contrast with the frequently stated
belief that self-attention requires $O(n^2)$ memory. While the time complexity
is still $O(n^2)$, device memory rather tha...
Yes lets change. Indeed the receptance is taken on the residual track to add things instead of removing
I mean nothing stops wkv from being negative but yea the “correct” intuition would be “keeping the negative” then
I dont agree with this reference however. Rabe et al. (the paper u linked) compute attention, and they just do a chunking of the matrix and compute iteratively, but they end up with attention; we dont. They dont do any reduction with equal weights. Imo the most straightforward reference for receptance is MLPMixer vs gMLP
Great
Thanks!!
In the context of MLPMixer vs gMLP, does R act like a time-decaying parametrized version of "token mixer"?
not really, R is the gating ("g") in gMLP basically (see from: https://arxiv.org/abs/2105.08050v2) but there's nothing about time in MLPMixer-like models. they were designed for images (or sequences, but without specific time inductive biases)
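A tiny sketch of the "output gate" reading discussed above, for a single position. It assumes the gate is applied elementwise as sigma(r) * wkv and leaves out the output projection; `gated_output` is just an illustrative name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_output(r, wkv):
    """Elementwise output gate: sigma(r) decides how much of the
    time-mixing result is passed on. It does not touch the decayed
    state itself, so it is not a forget gate."""
    return [sigmoid(ri) * wi for ri, wi in zip(r, wkv)]


if __name__ == "__main__":
    # A very positive r "takes" the channel, a very negative r "skips" it.
    print(gated_output([50.0, -50.0], [2.0, 2.0]))
```

With saturated gates this reduces to the "take" or "skip" extreme mentioned earlier, and sigma(r) = 1 everywhere means no gating at all.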
@paper dove do you have the code/settings for the small init embedding test?
I’ve gotten feedback from a bunch of people that the current explication is too dense and it’s hard to understand why decisions are being made. The best way to make progress on this would be for someone who is very familiar with the architecture and its design to flesh out the prose, working in tandem with someone who is less familiar but more experienced with writing. I’m not sure who a good candidate for this would be though.
I also think that having Section 4 reorganized and rewritten by one person would be a big boon to accessibility.
@obsidian quest have you been able to run my adapted implementation? If it works we can start scaling laws experiments with much less manual work.
While doing the aforementioned modifications to the training code I learned several important details that are not described anywhere in the paper currently. I can add them, though I want to note that I’m approaching the level of contribution where I would like to be included as a coauthor (attn: @obsidian quest @tropic minnow @last mauve)
For someone "very familiar with the architecture", I believe @uneven blade @neon night and myself should be capable. I'm willing to help but cry and cycle should both write better than me.
BTW, what exactly wasn't mentioned in the paper?
I almost feel that a weaker writing ability for the architecture expert is a plus, insofar as it forces good communication between you and someone who is better at writing lol.
There’s no discussion of the learning rate in the paper, which is pretty problematic as it’s rather non-standard in implementation.
The actual trained models also lack the infinite context that the paper claims, per my convo with BlinkDL. If the models don’t have it we shouldn’t claim it even if a “less lazy” (his words, not mine) implementation would have it
Oh, I see. Speaking of training stuff, I think there is also some customized data loading order (my_pile_stage, etc.), but I don't think Blink has described that anywhere.
There’s also no mention of DeepSpeed or ZeRO in the paper currently
Instead there’s a vague “oh this parallelizes easily” assertion
As general optimizations on distributed data parallelism, I think just mentioning them while describing the implementation would be okay
Also the gradient checkpointing is implemented via DeepSpeed, but I don't know if Blink has been using it in his pretraining
im ok with that
I do too (though it would also be nice to explain specifically why they can’t be used with RNNs as well)
The point isn’t that it’s a lot of work, simply that these are important details currently missing
i would say the current manuscript focuses on RWKV as a component used to later build a language model and prove that it is effective for it. if i understand correctly, you want to: add more details/specs about RWKV-LM (learning rate, frameworks, training setup, etc) and unify/harmonize/simplify architecture explanation
this is not really true? they dont lack infinite context length. they are just not trained with that. Nothing prevents you from getting a RWKV trained model and start generating sequences of 30K tokens. The problem is that it was not trained with such long sequences, so it might not be very useful. But the good thing about RWKV is there isnt a time dependency in the number of parameters, so the same model can be used for very long or very short sequences, just as an RNN.
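To make the "just as an RNN" point concrete, here is a minimal single-channel sketch of the WKV recurrence in its naive form (no numerical-stability tricks, so not what the CUDA kernel does). `wkv_step` and `run` are illustrative names; `w` is the decay and `u` the bonus for the current token.

```python
import math


def wkv_step(state, k, v, w, u):
    """One recurrent step for one channel. `state` is the pair (a, b):
    the decayed running sums of exp(k_i) * v_i and exp(k_i)."""
    a, b = state
    bonus = math.exp(u + k)
    out = (a + bonus * v) / (b + bonus)
    decay = math.exp(-w)
    new_state = (decay * a + math.exp(k) * v, decay * b + math.exp(k))
    return out, new_state


def run(ks, vs, w=0.5, u=0.1):
    """Process a whole sequence while carrying only the fixed-size state."""
    state = (0.0, 0.0)
    outs = []
    for k, v in zip(ks, vs):
        out, state = wkv_step(state, k, v, w, u)
        outs.append(out)
    return outs, state
```

The state is two numbers per channel whether the sequence has 10 tokens or 30K, which is why nothing in the architecture caps the sequence length. Whether the trained weights make good use of very long contexts is a separate, empirical question.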
I don’t know what it means for this to not be a paper about RWKV-LM
It’s literally a paper about language modeling. That’s the only benchmark used anywhere in the paper and the primary draw
nothing actually. was just describing the current situation and the proposed changes
Blink said it did? IDK I’m deferring to him
yes "they are just not trained with that"
someone in RWKV discord trained with 100k ctxlen without issues
see https://wandb.ai/nathanwilce/raccoonlongctx #998539369919025212 message
That’s fine, but the point is that the paper doesn’t justify the claims about infinite sequence length. We can include these models, we can add mathematical arguments, we can add scaling tests. We need to add something though
You don’t get to appeal to evidence not introduced in the paper to justify claims made in the paper. The fact that evidence exists somewhere doesn’t make the argument correct.
we can train some very long ctxlen models, or improve the cuda to support infinite ctxlen
state chaining kernels + temporal gradient checkpoint would work well enough for any long sequence imo, yet we need to do that training
(if we want to claim the infinite sequence feature)
the simplest thing to do might be weaken "infinite" to "architectural change is not required for extending sequence length", and demonstrate the result from existing models with different supported seqlen
for the paper, i think it would be enough with a mathematical proof, which is "low cost" as it requires no training nor evaluation. and it is basically the level of knowledge we have. bc we havent trained these super long ctx models and rushing them for the paper doesnt seem ideal.
(then the next problem would be what is to be proved)
IMHO, even if the word "infinite context length" is deleted in this manuscript, linear order complexity in Table 1 is a selling point.
I will try cosine decay (len = data len) first
Can you let me know when it’s launched? Mostly asking because I’m anxious about deadlines 🙂
now testing it
Added you as an author
https://wandb.ai/blinkdl/RWKV-v4-Scaling L12-D768 1/2/4/8/16/32G tokens
Do you know why the first run crashed?
these are preempted from time to time
each crash = increase about 0.001 loss in early training
